publications | Jean Paul Barddal

2025

BRACIS
Training and Test Machine Learning Models on Encrypted Data: Initial Results and Challenges

Rodrigo Kruger, Jean Paul Barddal, and Vinicius Mourão Alves Souza

In Brazilian Conference on Intelligent Systems (BRACIS) 2025

Privacy is critical when using Machine Learning (ML) models over sensitive data, like healthcare, finance, and legal systems. Many of these models are trained or executed on cloud services, meaning sensitive data is transmitted over the network, or third-party services operate directly on unprotected data during training and inference, increasing exposure to potential leaks. Data encryption is a promising solution that guarantees high privacy levels. An adequate cryptography solution for ML is Homomorphic Encryption, a cryptographic method that allows mathematical operations on ciphertexts, i.e., encrypted data, producing encrypted models and outputs that only authorized parties can decrypt. However, the protection offered by Homomorphic Encryption comes at a significant computational overhead. Additionally, only specific mathematical operations (typically additions and multiplications) are allowed, and encrypted computations accumulate noise that reduces the result’s precision. This paper discusses the challenges of using encrypted data in the training and test steps of ML models. It experimentally analyzes the impact on error rates and processing times when traditional classifiers, such as Artificial Neural Networks and Logistic Regression, are adapted to process encrypted data. We adopt the CKKS scheme, a Homomorphic Encryption method that supports approximate computations over real numbers, and adapt the activation functions of the classifiers using three approximation methods in an experimental evaluation with five medical datasets.
@inproceedings{HOMOMORPHICALLY_BRACIS:2025, author = {Kruger, Rodrigo and Barddal, Jean Paul and de Souza, Vinicius Mourão Alves}, title = {Training and Test Machine Learning Models on Encrypted Data: Initial Results and Challenges}, booktitle = {Brazilian Conference on Intelligent Systems (BRACIS)}, year = {2025}, }
BRACIS
OnlineSIRUOS: An Inverse Random Under and Oversampling, Heterogeneous Ensemble, and Meta-Learning Approach for Imbalanced Data Stream Classification

Vinicios Cainã Santos Coelho, Alceu Souza Britto Jr., and Jean Paul Barddal

In Brazilian Conference on Intelligent Systems (BRACIS) 2025

Data streams are potentially unbounded data sequences that are made available rapidly and over time. Due to their pervasiveness, mining data streams has become a major scientific and practical issue. Scenarios involving data streams present multiple challenges, including the requirement for single-pass processing due to constraints on computational resources and the necessity to respond to concept drift over time. Another common trait of several streaming scenarios is class imbalance, that is, a class, often the one of interest, is heavily outnumbered by others, thus making the learning process harder. This paper introduces Online Stacking Inverse Random Under and Over Sampling (OnlineSIRUOS). This ensemble-based approach combines meta-learning, sampling, and heterogeneous components to address class-imbalanced data stream classification. We evaluated our proposal against existing work tailored for class imbalance in data streams using synthetic and real-world datasets. Experimental results show that our proposal achieves competitive F1 scores in different imbalance ratios and is less computationally intensive than its competitors in processing time and memory consumption. The results also show that our proposal is particularly well-suited for highly imbalanced data streams.
@inproceedings{ONLINESIRUOS_BRACIS:2025, author = {dos Santos Coelho, Vinicios Cainã and de Souza Britto Jr., Alceu and Barddal, Jean Paul}, title = {OnlineSIRUOS: An Inverse Random Under and Oversampling, Heterogeneous Ensemble, and Meta-Learning Approach for Imbalanced Data Stream Classification}, booktitle = {Brazilian Conference on Intelligent Systems (BRACIS)}, year = {2025}, }
BRACIS
A Generative Domain Adaptation Scheme for Swift Deployment of Parking Monitoring Systems

Antonio Michel Ferreira Santos, Paulo Ricardo Lisboa Almeida, Jean Paul Barddal, and André Gustavo Hochuli

In Brazilian Conference on Intelligent Systems (BRACIS) 2025

Deep learning models have demonstrated remarkable accuracy in distinguishing between empty and occupied parking spaces when large amounts of annotated training data are available from the target environment. However, in real-world deployments, the major bottleneck lies in the labor-intensive annotation process required whenever a new scenario arises, or retraining is needed due to changes in the camera setup, often driven by maintenance, repositioning, or environmental conditions. This paper addresses this challenge by proposing a generative domain adaptation scheme designed to reduce annotation requirements and accelerate deployment significantly. Instead of relying on extensive labeled datasets and computationally expensive model retraining, our method synthesizes new training samples based on a small subset of instances from the target domain. In particular, by combining generative augmentation with a lightweight convolutional network for inference, our approach achieves a favorable balance between annotation cost, computational efficiency, and accuracy. These results highlight the method’s potential as a cost-effective and rapidly deployable solution for real-world parking lot monitoring. Under a cross-dataset evaluation protocol, we highlight that our approach achieves competitive accuracy (close to 97%) using as few as 256 labeled samples, thus substantially reducing human annotation effort without sacrificing classification performance.
@inproceedings{GENERATIVE_PKLOT_BRACIS:2025, author = {dos Santos, Antonio Michel Ferreira and de Almeida, Paulo Ricardo Lisboa and Barddal, Jean Paul and Hochuli, André Gustavo}, title = {A Generative Domain Adaptation Scheme for Swift Deployment of Parking Monitoring Systems}, booktitle = {Brazilian Conference on Intelligent Systems (BRACIS)}, year = {2025}, }
ECML PKDD
Adaptive Options for Decision Trees in Evolving Data Stream Classification

Daniel Nowak Assis, Jean Paul Barddal, and Fabrício Enembreck

In ECML PKDD 2025

Decision trees are fundamental components of data stream mining frameworks and pipelines. However, their inherent instability - where small variations in training data can lead to significant structural changes - has motivated research into methods that either (i) mitigate this instability or (ii) exploit it for improved performance. Option trees provide an alternative approach to instability reduction by allowing non-leaf nodes to have multiple subtrees as child nodes. This enables instances to traverse multiple paths within a single decision tree structure, offering greater processing time and memory efficiency compared to ensemble methods - key advantages for streaming data mining, where data arrives continuously and potentially without bounds. This paper introduces LASTO, an algorithm with adaptive mechanisms for splitting and dynamically adding option nodes. Our primary contribution lies in the option node addition mechanism, where change detectors monitor branch performance and introduce option nodes when a decline in predictive quality is observed. An option node is only added if the split gain surpasses that of the previous split, ensuring its necessity and effectiveness. Experimental results demonstrate that LASTO achieves statistically significant differences in predictive performance while maintaining computational efficiency comparable to state-of-the-art decision trees for data stream classification.
@inproceedings{ECML_LASTO:2025, author = {Assis, Daniel Nowak and Barddal, Jean Paul and Enembreck, Fabr{\'{\i}}cio}, title = {Adaptive Options for Decision Trees in Evolving Data Stream Classification}, year = {2025}, }
ICAISC
Adaptive Interactive Process Drift Detection: Detecting and Visualizing Process Drifts

Denise Maria Vecino Sato, Sheila Cristiana Freitas, Jefferson Koji Sato, Jean Paul Barddal, and Edson Emilio Scalabrin

In International Conference on Artificial Intelligence and Soft Computing (ICAISC) 2025

Process mining extracts insights about business processes from information system data. However, traditional techniques often assume static processes, which is unrealistic. Detecting process drifts is crucial for accurate analysis, but existing methods lack consistent detection due to parameter sensitivity and the absence of a standard comparison protocol. This paper introduces the Adaptive Interactive Process Drift Detection (IPDD), which applies the ADWIN change detector to process model quality metrics over time. Adaptive IPDD continuously assesses fitness and precision metrics to detect drifts. Results show IPDD’s effectiveness on synthetic datasets, comparable to Apromore in drift detection and outperforming Apromore AWIN in mean delay. We also highlight that IPDD exhibits stable performance even with low window size values. Finally, we evaluate a real-life event log representing an Italian company’s ticketing management process using Adaptive IPDD. The results demonstrate that the drift analysis for real scenarios can be improved by exploring the user interface of IPDD.
@inproceedings{IPDD_ICAISC:2025, author = {Sato, Denise Maria Vecino and de Freitas, Sheila Cristiana and Sato, Jefferson Koji and Barddal, Jean Paul and Scalabrin, Edson Emilio}, title = {Adaptive Interactive Process Drift Detection: Detecting and Visualizing Process Drifts}, booktitle = {International Conference on Artificial Intelligence and Soft Computing (ICAISC)}, year = {2025}, }
KAIS
Behavioral insights of adaptive splitting decision trees in evolving data stream classification

Daniel Nowak Assis, Jean Paul Barddal, and Fabrício Enembreck

Knowl. Inf. Syst. 2025

Decision Trees are leading components of state-of-the-art architectures for classifying high-speed data streams. Hoeffding-based Trees have dominated the field and are more efficient than batch counterparts because they are incremental and distribute the cost of greedy best-split evaluations throughout the stream through periodic evaluation. However, this splitting mechanism is invariant to the state of the stream and the tree, as periodic evaluations cannot capture the state of data or performance of the tree in between assessments. In this work, we outline the main behaviors of a novel decision tree that can deal with high-speed data streams and adapt to performance decays considering the tree state, namely the Local Adaptive Streaming Tree (LAST). We also provide a comprehensive benchmark of online decision trees, analyzing how split moments affect performance, how the trees react to increases in the number of features and classes, and how the trees behave in concept drift scenarios. Results show that (i) LAST presents the best results, regardless of the change detector selected, (ii) LAST’s strategy of branching the tree based on performance decays is the reason it outperforms other decision trees, (iii) decision trees present similar CPU-Time, but trees with split reevaluation are more costly in memory, and (iv) LAST presents superior results in datasets with abrupt changes, while results in datasets with gradual changes depend on choosing detectors that are more suitable for gradual changes.
@article{DBLP:journals/kais/AssisBE25, author = {Assis, Daniel Nowak and Barddal, Jean Paul and Enembreck, Fabr{\'{\i}}cio}, title = {Behavioral insights of adaptive splitting decision trees in evolving data stream classification}, journal = {Knowl. Inf. Syst.}, volume = {67}, number = {7}, pages = {5751--5782}, year = {2025}, doi = {10.1007/S10115-025-02395-5}, }

2024

ACM TIST
Concept Drift Adaptation in Text Stream Mining Settings: A Systematic Review

Cristiano Mesquita Garcia, Ramon Abilio, Alessandro Lameiras Koerich, Alceu Souza Britto Jr., and Jean Paul Barddal

ACM Transactions on Intelligent Systems and Technology 2024

Society produces textual data online in several ways, e.g., via reviews and social media posts. Therefore, numerous researchers have been working on discovering patterns in textual data that can indicate people’s opinions, interests, etc. Most tasks regarding natural language processing are addressed using traditional machine learning methods and static datasets. This setting can lead to several problems, e.g., outdated datasets and models, which degrade in performance over time. This is particularly true regarding concept drift, in which the data distribution changes over time. Furthermore, text streaming scenarios also exhibit further challenges, such as the high speed at which data arrives over time. Models for stream scenarios must adhere to the aforementioned constraints while learning from the stream, thus storing texts for limited periods and consuming low memory. This study presents a systematic literature review regarding concept drift adaptation in text stream scenarios. Considering well-defined criteria, we selected 48 papers published between 2018 and August 2024 to unravel aspects such as text drift categories, detection types, model update mechanisms, stream mining tasks addressed, and text representation methods and their update mechanisms. Furthermore, we discussed drift visualization and simulation and listed real-world datasets used in the selected papers. Finally, we brought forward a discussion on existing works in the area, also highlighting open challenges and future research directions for the community.
@article{TIST_CD_TEXT_STREAMS:2024, year = {2024}, publisher = {ACM}, author = {Garcia, Cristiano Mesquita and Abilio, Ramon and Koerich, Alessandro Lameiras and de Souza Britto Jr., Alceu and Barddal, Jean Paul}, title = {Concept Drift Adaptation in Text Stream Mining Settings: A Systematic Review}, journal = {ACM Transactions on Intelligent Systems and Technology}, }
NEUCOM & APPS
Representation ensemble learning applied to facial expression recognition

Bruna Rossetto Delazeri, André Gustavo Hochuli, Jean Paul Barddal, Alessandro Lameiras Koerich, and Alceu Souza Britto Jr.

Neural Computing and Applications 2024

This work introduces the representation ensemble learning algorithm, a novel approach for generating diverse unsupervised representations rooted in the principles of self-taught learning. The ensemble comprises convolutional autoencoders (CAEs) learned in an unsupervised manner, fostering diversity via a loss function designed to penalize similar CAEs’ latent representations. We employ support vector machines, bagging, and random forest as primary classification methods for the final classification step. Additionally, we incorporate KnoraU, a well-established technique used to dynamically select competent classifiers based on a test sample. We evaluate various fusion strategies, including sum, product, and stacking, to comprehensively assess the ensemble’s performance. A robust experimental protocol considering the facial expression recognition problem shows that the proposed approach based on self-taught learning surpasses the accuracy of fine-tuned convolutional neural network (CNN) models. In terms of accuracy, the proposed method is up to 9.9 and 6.3 percentage points better than the CNN-based models fine-tuned for JAFFE and CK+ datasets, respectively.
@article{NEUCOM_AND_APPLICATIONS:2024, year = {2024}, publisher = {Springer}, author = {Delazeri, Bruna Rossetto and Hochuli, Andr\'{e} Gustavo and Barddal, Jean Paul and Koerich, Alessandro Lameiras and de Souza Britto Jr., Alceu}, title = {Representation ensemble learning applied to facial expression recognition}, journal = {Neural Computing and Applications}, }
IEEE BIG DATA
Is it Fine to Tune? Evaluating SentenceBERT Fine-tuning for Brazilian Portuguese Text Stream Classification

Bruno Yuiti Leão Imai, Cristiano Mesquita Garcia, Marcio Vinicius Rocha, Alessandro Lameiras Koerich, Alceu Souza Britto Jr., and Jean Paul Barddal

In IEEE International Conference on Big Data (IEEE Big Data) 2024

Pre-trained language models (LMs) have been used in several scenarios and data mining tasks due to their good-quality representations and their readiness for use. Although LMs constitute a significant gain in usability, they are frequently utilized statically over time, meaning that these models can suffer from concept drift and semantic shift, which correspond to changes in data distribution and word meanings. These phenomena are more noticeable when new texts become gradually available. This paper evaluates the impact of updating pre-trained SentenceBERT models over time on a Brazilian news post classification task in a text streaming fashion, a paradigm suitable for learning from data streams. While we update the SBERT model yearly with a reduced number of recent posts, we compare it with scenarios using static LMs. We used the adaptive random forest for classification and evaluated it regarding macro F1-score and elapsed time. The experimental results show that regularly leveraging sampled texts from the recent past for fine-tuning LMs can improve performance metrics over time, reaching better results than using static LMs in most years analyzed. We also evaluated the run times, which suggest that fine-tuning LMs over time provides a good trade-off between performance and run time.
@inproceedings{IEEE_BIG_DATA_IS_IT_FINE:2024, author = {Imai, Bruno Yuiti Le\~{a}o and Garcia, Cristiano Mesquita and Rocha, Marcio Vinicius and Koerich, Alessandro Lameiras and de Souza Britto Jr., Alceu and Barddal, Jean Paul}, title = {Is it Fine to Tune? Evaluating SentenceBERT Fine-tuning for Brazilian Portuguese Text Stream Classification}, booktitle = {IEEE International Conference on Big Data (IEEE Big Data)}, year = {2024}, }
IEEE BIG DATA
LongKey: Keyphrase Extraction for Long Documents

Jeovane Honorio Alves, Radu State, Cinthia Obladen Almendra Freitas, and Jean Paul Barddal

In IEEE International Conference on Big Data (IEEE Big Data) 2024

In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in processing long-context documents. In this paper, we introduce LongKey, a novel framework for extracting keyphrases from lengthy documents, which uses an encoder-based language model to capture extended text intricacies. LongKey uses a max-pooling embedder to enhance keyphrase candidate representation. Validated on the comprehensive LDKP datasets and six diverse, unseen datasets, LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods. Our findings demonstrate LongKey’s versatility and superior performance, marking an advancement in keyphrase extraction for varied text lengths and domains.
@inproceedings{IEEE_BIG_DATA_LONGKEY:2024, author = {Alves, Jeovane Honorio and State, Radu and de Almendra Freitas, Cinthia Obladen and Barddal, Jean Paul}, title = {LongKey: Keyphrase Extraction for Long Documents}, booktitle = {IEEE International Conference on Big Data (IEEE Big Data)}, year = {2024}, }
ICMLA
Fuels Demand Forecasting: Identifying Leading Feature Sets, Prediction Strategy, and Regressors

Jonas Krause, Alexandre C. A. Beiruth, Jean Paul Barddal, Alceu Souza Britto Jr., and Vinicius Mourão Alves Souza

In International Conference on Machine Learning and Applications (ICMLA) 2024

Fuels are crucial for any country’s development and economy, impacting various sectors such as transportation, industry, and electricity generation. Accurate prediction of monthly fuel demand can improve supply chain management, strategic decision-making, and financial planning for businesses while helping governments develop decarbonization policies and estimate pollutant emissions. This paper explores machine learning models to forecast fossil fuel and biofuel demand 12 months ahead, using univariate time series data representing the historical sales in the 27 states of Brazil, one of the world’s leading producers and consumers of fuels. We evaluate different time series feature sets, machine learning regression models, and prediction strategies to address the complexity of fuel sales influenced by factors such as economic conditions and geopolitical events. Our comprehensive evaluation aims to determine an effective setting for predictive models in the fuel domain. Our results show that popular feature extractors for time series, such as Catch22 and TsFresh, cannot improve the original data representation for most forecasting models. Although focused on Brazil, our findings apply to other countries, since the trained models do not rely on external variables, such as micro and macroeconomic indicators.
@inproceedings{ICMLA_FUEL_PRED:2024, author = {Krause, Jonas and Beiruth, Alexandre C. A. and Barddal, Jean Paul and de Souza Britto Jr, Alceu and Souza, Vinicius Mourão Alves}, title = {Fuels Demand Forecasting: Identifying Leading Feature Sets, Prediction Strategy, and Regressors}, booktitle = {International Conference on Machine Learning and Applications (ICMLA)}, year = {2024}, }
ICPR
Alleviating Catastrophic Forgetting in Facial Expression Recognition with Emotion-Centered Models

Israel A. Laurensi, Alceu Souza Britto Jr., Jean Paul Barddal, and Alessandro Lameiras Koerich

In International Conference on Pattern Recognition (ICPR) 2024

Facial expression recognition is pivotal in machine learning, facilitating various applications. However, convolutional neural networks (CNNs) are often plagued by catastrophic forgetting, impeding their adaptability. The proposed method, emotion-centered generative replay (ECgr), tackles this challenge by integrating synthetic images from generative adversarial networks. Moreover, ECgr incorporates a quality assurance algorithm to ensure the fidelity of generated images. This dual approach enables CNNs to retain past knowledge while learning new tasks, enhancing their performance in emotion recognition. The experimental results on four diverse facial expression datasets demonstrate that incorporating images generated by our pseudo-rehearsal method enhances training on the targeted dataset and the source dataset while making the CNN retain previously learned knowledge.
@inproceedings{ICPR_CATASTROPHIC:2024, author = {Laurensi, Israel A. and de Souza Britto Jr., Alceu and Barddal, Jean Paul and Koerich, Alessandro Lameiras}, title = {Alleviating Catastrophic Forgetting in {Facial} Expression Recognition with Emotion-Centered Models}, booktitle = {International Conference on Pattern Recognition (ICPR)}, year = {2024}, }
ICPR
Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams

Cristiano Mesquita Garcia, Alessandro Lameiras Koerich, Alceu Souza Britto Jr., and Jean Paul Barddal

In International Conference on Pattern Recognition (ICPR) 2024

The proliferation of textual data on the Internet presents a unique opportunity for institutions and companies to monitor public opinion about their services and products. Given the rapid generation of such data, the text stream mining setting, which handles sequentially arriving, potentially infinite text streams, is often more suitable than traditional batch learning. While pre-trained language models are commonly employed for their high-quality text vectorization capabilities in streaming contexts, they face challenges adapting to concept drift—the phenomenon where the data distribution changes over time, adversely affecting model performance. Addressing the issue of concept drift, this study explores the efficacy of seven text sampling methods designed to fine-tune language models, thereby mitigating performance degradation selectively. We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions. Our evaluation, focused on Macro F1-score and elapsed time, employs two text stream datasets and an incremental SVM classifier to benchmark performance. Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification, demonstrating that larger sample sizes correlate with improved macro F1 scores. Notably, our proposed WordPieceToken ratio sampling method significantly enhances performance with the identified loss functions, surpassing baseline results.
@inproceedings{ICPR_SAMPLING:2024, author = {Garcia, Cristiano Mesquita and Koerich, Alessandro Lameiras and de Souza Britto Jr., Alceu and Barddal, Jean Paul}, title = {Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams}, booktitle = {International Conference on Pattern Recognition (ICPR)}, year = {2024}, }
ESWA
Temporal analysis of drifting hashtags in textual data streams: A graph-based application

Cristiano Mesquita Garcia, Alceu Souza Britto Jr., and Jean Paul Barddal

Expert Systems with Applications 2024

Initially supported by Twitter, hashtags are now used on several social media platforms. Hashtags are helpful for tagging, tracking, and grouping posts on similar topics. In this paper, based on a hashtag stream regarding the hashtag #mybodymychoice, we analyze hashtag drifts over time using concepts from graph analysis and textual data streams using the Girvan–Newman method to uncover hashtag communities in annual snapshots between 2018 and 2022. In addition, we offer insights about some correlated hashtags found in the study. Our approach can be useful for monitoring changes over time in opinions and sentiment patterns about an entity on social media. Even though the hashtag #mybodymychoice was initially coupled with women’s rights, abortion, and bodily autonomy, we observe that it suffered drifts during the studied period across topics such as drug legalization, vaccination, political protests, war, and civil rights. The year 2021 was the most significant drifting year, in which the communities detected and their respective sizes suggest that #mybodymychoice had a significant drift to vaccination and Covid-19-related topics.
@article{ESWA_GARCIA:2024, year = {2024}, publisher = {Elsevier}, author = {Garcia, Cristiano Mesquita and de Souza Britto Jr., Alceu and Barddal, Jean Paul}, title = {Temporal analysis of drifting hashtags in textual data streams: A graph-based application}, journal = {Expert Systems with Applications}, }
APPL. SOFT. COMP.
Adaptive Learning on Hierarchical Data Streams using Window-weighted Gaussian Probabilities

Eduardo Tieppo, Julio Cesar Nievola, and Jean Paul Barddal

Applied Soft Computing 2024

The hierarchical data stream classification task addresses challenges in both hierarchical and data stream classification primary areas. In these scenarios, machine learning models must simultaneously deal with class hierarchies and adapt to respond to nonstationary data. Given such a challenging set of traits, existing techniques are deficient, as they perform incremental learning and are slow to adapt to newer data, thus not capturing their dynamics in a timely fashion. In this study, we propose two novel adaptive Gaussian Naive Bayes classifiers tailored to classify hierarchical data streams. The models use window-weighted Gaussian probabilities to consider current and historical data and improve the adaptability of the classifiers, especially for nonstationary data streams. As a result of our research, we introduce a unified protocol for evaluating and comparing hierarchical data stream classifiers and establish a benchmark for the hierarchical data stream classification task encompassing the proposed methods and state-of-the-art classifiers. The results demonstrate that our proposed algorithms achieve better prediction correctness than their state-of-the-art counterparts while responding more swiftly to changes in data distribution.
@article{ASOC_TIEPPO:2023, year = {2024}, publisher = {Elsevier}, author = {Tieppo, Eduardo and Nievola, Julio Cesar and Barddal, Jean Paul}, title = {Adaptive Learning on Hierarchical Data Streams using Window-weighted Gaussian Probabilities}, journal = {Applied Soft Computing}, }
SAC
Just Change on Change: Adaptive Splitting Time for Decision Trees in Data Stream Classification

Daniel Nowak Assis, Jean Paul Barddal, and Fabricio Enembreck

In Proceedings of the Annual ACM Symposium on Applied Computing, SAC 2024

Hoeffding Trees are well-established decision trees for classifying streaming data. The Hoeffding bound has been widely used in a static periodic manner, applying the bound to impurity measures to determine whether leaf nodes should split. However, this approach does not account for the state of the tree and its leaf nodes over time. We hypothesize that splitting when changes in data distribution and accuracy occur in leaf nodes enhances decision tree performance. This paper introduces the use of change detection algorithms that dictate the moment a split will happen. We consider two approaches: a local one, in which each leaf node has a change detector that monitors either its error rate or its purity, and a global one, in which a single detector monitors statistics from the leaf nodes where the instances arrive. Results show that our methods had competitive results while being more efficient regarding processing time than state-of-the-art Hoeffding-based Trees since the periodic and constant evaluation of splits is costly.
@inproceedings{SAC_NOWAK:2024, author = {Assis, Daniel Nowak and Barddal, Jean Paul and Enembreck, Fabricio}, title = {Just Change on Change: Adaptive Splitting Time for Decision Trees in Data Stream Classification}, booktitle = {Proceedings of the Annual {ACM} Symposium on Applied Computing, {SAC} 2024}, year = {2024}, }

2023

STATS. & COMP.
Random Forest Kernel for High-Dimension Low Sample Size Classification

Lucca Portes Cavalheiro, Simon Bernard, Jean Paul Barddal, and Laurent Heutte

Statistics and Computing 2023

High dimension, low sample size (HDLSS) problems are numerous among real-world applications of machine learning. From medical images to text processing, traditional machine learning algorithms are usually unsuccessful in learning the best possible concept from such data. In a previous work, we proposed a dissimilarity-based approach for multi-view classification, the Random Forest Dissimilarity (RFD), that achieves state-of-the-art results for such problems. In this work, we transpose the core principle of this approach to solving HDLSS classification problems, by using the RF similarity measure as a learned precomputed SVM kernel (RFSVM). We show that such a learned similarity measure is particularly suited and accurate for this classification context. Experiments conducted on 40 public HDLSS classification datasets, supported by rigorous statistical analyses, show that the RFSVM method outperforms existing methods for the majority of HDLSS problems and remains at the same time very competitive for low or non-HDLSS problems.
@article{STATISTICS_COMPUTING_HDLSS:2023, year = {2023}, publisher = {Springer}, author = {Cavalheiro, Lucca Portes and Bernard, Simon and Barddal, Jean Paul and Heutte, Laurent}, title = {Random Forest Kernel for High-Dimension Low Sample Size Classification}, journal = {Statistics and Computing}, }
ICMLA
Detecting Relevant Information in High-Volume Chat Logs: Keyphrase Extraction for Grooming and Drug Dealing Forensic Analysis

Jeovane Honório Alves, Horácio A. C. G. Pedroso, Rafael Honorio Venetikides, Joel E. M. Koster, Luiz Rodrigo Grochocki, Cinthia Obladen Almendra Freitas, and Jean Paul Barddal

In International Conference on Machine Learning with Applications (ICMLA) 2023

The growing use of digital communication platforms has given rise to various criminal activities, such as grooming and drug dealing, which pose significant challenges to law enforcement and forensic experts. This paper presents a supervised keyphrase extraction approach to detect relevant information in high-volume chat logs involving grooming and drug dealing for forensic analysis. The proposed method, JointKPE++, builds upon the JointKPE keyphrase extractor by employing improvements to handle longer texts effectively. We evaluate JointKPE++ using BERT-based pre-trained models on grooming and drug dealing datasets, including BERT, RoBERTa, SpanBERT, and BERTimbau. The results show significant improvements over traditional approaches and demonstrate the potential for JointKPE++ to aid forensic experts in efficiently detecting keyphrases related to criminal activities.
@inproceedings{ALVES_ICMLA:2023, author = {Alves, Jeovane Honório and Pedroso, Horácio A. C. G. and Venetikides, Rafael Honorio and Koster, Joel E. M. and Grochocki, Luiz Rodrigo and de Almendra Freitas, Cinthia Obladen and Barddal, Jean Paul}, title = {Detecting Relevant Information in High-Volume Chat Logs: Keyphrase Extraction for Grooming and Drug Dealing Forensic Analysis}, booktitle = {International Conference on Machine Learning with Applications (ICMLA)}, year = {2023}, }
ICMLA
Event-driven Sentiment Drift Analysis in Text Streams: An Application in a Soccer Match

Cristiano Mesquita Garcia, Alceu Souza Britto Jr., and Jean Paul Barddal

In International Conference on Machine Learning with Applications (ICMLA) 2023

Social media has been a data source for various applications, given its characteristic of working as a social sensor. Many applications in several areas, such as brand reputation and online opinion monitoring, use this valuable resource to understand the users of services and products. This paper describes an application in the soccer domain, considering data collected from a social media textual data stream. The goal is to detect possible sentiment drifts related to actual events in a soccer match. This task is challenging as we resort to short texts made available during a short time (match length). We evaluated four drift detectors using four metrics: false alarms, delay (considering the number of posts), delay, and missing drifts. Our results show that ADWIN had a stable performance in sentiment drift detection compared to other methods in timely detecting the flagged drifts, raising a small number of false alarms. Given the drifts detected, we used Incremental Word-Vectors to monitor words of interest and check their relatedness to actual events in the match. We empirically assert that the closest words trace back to the sentiment drift generator events.
@inproceedings{GARCIA_ICMLA:2023, author = {Garcia, Cristiano Mesquita and de Souza Britto Jr., Alceu and Barddal, Jean Paul}, title = {Event-driven Sentiment Drift Analysis in Text Streams: An Application in a Soccer Match}, booktitle = {International Conference on Machine Learning with Applications (ICMLA)}, year = {2023}, }
NEUCOM
Incremental Specialized and Specialized-Generalized Matrix Factorization Models based on Adaptive Learning Rate Optimizers

Antônio David Viniski, Jean Paul Barddal, and Alceu Souza Britto Jr.

Neurocomputing 2023

Recommender systems suggest items that are likely to be preferred by a particular user based on historical behavior, actions, and feedback. In real-world applications, data on users and items are continuously generated at a fast pace, such as in e-commerce, social media, digital marketing, and content consumption applications. Since interactions occur over time, these scenarios can be formulated as a data stream where users’ interests are potentially dynamic, i.e., they change over time. Given that changes are expected to occur, one of the current research challenges in streaming recommender systems is that models must adapt their parameters when changes occur to maintain performance. As such changes do not occur for all users and items in the stream at the same time, we consider adapting learning schemes to account for user or item identifiers and model individual parameters. Therefore, we used specialized parameters to adjust the step size for each dataset user or item. More specifically, this study proposes four specialized and specialized-generalized variants of four well-known adaptive learning rate optimizers and shows how they are combined with incremental matrix factorization methods. We tested our proposed optimization strategies on different datasets and showed that one of the proposed specialized variants, that is, InAMSGradUser, improves the RECALL and NDCG rates by up to 11.1 and 7.5 percentage points, respectively, compared to the traditional stochastic gradient descent (SGD) optimizer.
@article{NEUCOM_VINISKI:2023, year = {2023}, publisher = {Elsevier}, author = {Viniski, Antônio David and Barddal, Jean Paul and de Souza Britto Jr., Alceu}, title = {Incremental Specialized and Specialized-Generalized Matrix Factorization Models based on Adaptive Learning Rate Optimizers}, journal = {Neurocomputing}, }
ESWA
An Explainable Machine Learning Approach for Student Dropout Prediction

João Gabriel Corrêa Kruger, Alceu Souza Britto Jr., and Jean Paul Barddal

Expert Systems with Applications 2023

School dropout is a relevant socio-economic problem across the globe. Predictive models have been developed to determine the likelihood of students dropping out of their studies precociously in an attempt to overcome such a problem. Academic systems, which gather data from many students, are potential sources for datasets that feed dropout prediction algorithms, thus leading to general improvements in education quality. Despite successful past attempts to predict dropout, several works depict small datasets with features that are hard to reproduce. Furthermore, predicting whether a student will drop out is not enough to diagnose and prevent the problem as it is also necessary to provide potential justifications for the dropout. This paper proposes an approach for creating and enriching a dataset for dropout prediction, which has been applied for dropout prediction using data from 19 schools in Brazil. With this dataset and using classifiers and model explaining techniques, our experiments achieved Area Under the Precision-Recall Curve (AUC-PR) scores of up to 89.5% when predicting dropout at different year moments. This study also shows differences when predicting dropouts in different educational stages, such as preschool and secondary education, with the former being more complex than the latter. In addition to the high recognition rates, our proposal identifies potential reasons for student dropout, which are relevant for educational institutions to take preemptive actions.
@article{ESWA_ABEC_JOAO:2023, year = {2023}, publisher = {Elsevier}, author = {Kruger, João Gabriel Corrêa and de Souza Britto Jr., Alceu and Barddal, Jean Paul}, title = {An Explainable Machine Learning Approach for Student Dropout Prediction}, journal = {Expert Systems with Applications}, }
BRACIS
A Tool for Measuring Energy Consumption in Data Stream Mining

Eric Kenzo Taniguchi Onuki, Andreia Malucelli, and Jean Paul Barddal

In Brazilian Conference on Intelligent System (BRACIS) 2023

Energy consumption reduction is an increasing trend in machine learning given its socio-ecological relevance. Consequently, it is important to quantify how real-time learning algorithms tailored for data streams and edge computing behave in terms of accuracy, processing time, memory usage, and energy consumption. In this work, we bring forward a tool for measuring energy consumption in the Massive Online Analysis (MOA) framework. First, we analyze the energy consumption rates obtained by our tool against a gold-standard hardware solution, thus showing the robustness of our approach. Next, we experimentally analyze classification algorithms under different validation protocols and concept drift and highlight how such classifiers behave under such conditions. Results show that our tool enables the identification of different classifiers’ energy consumption. In particular, it allows a better understanding of how energy consumption rates vary in drifting and non-drifting scenarios. Finally, given the insights obtained during experimentation on existing classifiers, we make our tool publicly available to the scientific community so that energy consumption is also accounted for in developing and comparing data stream mining algorithms.
@inproceedings{ONUKI_BRACIS:2023, author = {Onuki, Eric Kenzo Taniguchi and Malucelli, Andreia and Barddal, Jean Paul}, title = {A Tool for Measuring Energy Consumption in Data Stream Mining}, booktitle = {Brazilian Conference on Intelligent System (BRACIS)}, year = {2023}, }
IJCNN
Benchmarking Feature Extraction Techniques for Textual Data Stream Classification

Bruno Siedekum Thuma, Pedro Silva Vargas, Cristiano Garcia, Alceu Souza Britto Jr., and Jean Paul Barddal

In 2023 International Joint Conference on Neural Networks, IJCNN 2023, Gold Coast, Australia, 2023

Feature extraction regards transforming unstructured or semi-structured data into structured data that can be used as input for classification and sentiment analysis algorithms, among other applications. This task becomes even more challenging and relevant when textual data becomes available over time as a continuous data stream since the lexicon and semantics can be ever-evolving. Data streams are, by definition, potentially infinite sequences of data that may have ephemeral characteristics, that is, the data behavior may change over time, leading to a phenomenon named concept drift. Textual data streams are specialized data streams, in which texts arrive over time from a continual data source, such as social media, raising challenges in which feature extractors are of great help. In this paper, we benchmark different feature extraction algorithms, i.e., Hashing Trick, Word2Vec, BERT, and Incremental Word-Vectors, in textual data stream classification, considering different stream lengths. The evaluation was performed over a binary and a multiclass classification task, considering two different datasets. Results show that pre-trained models, such as BERT, achieve interesting results, while Hashing Trick also performs competitively. We also observe that incremental methods such as Word2Vec and Incremental Word-Vectors are the most prepared for changing scenarios, yet they are much more computationally intensive compared to the former when applied to larger streams.
@inproceedings{IJCNN_THUMA:2023, author = {Thuma, Bruno Siedekum and de Vargas, Pedro Silva and Garcia, Cristiano and de Souza Britto Jr., Alceu and Barddal, Jean Paul}, title = {Benchmarking Feature Extraction Techniques for Textual Data Stream Classification}, booktitle = {2023 International Joint Conference on Neural Networks, {IJCNN} 2023, Gold Coast, Australia, 2023}, year = {2023}, }
IJCNN
Mass-Based Short Term Selection of Classifiers in Data Streams

Daniel Nowak Assis, Fabricio Enembreck, and Jean Paul Barddal

In 2023 International Joint Conference on Neural Networks, IJCNN 2023, Gold Coast, Australia, 2023

Dynamic classifier selection (DCS) regards well-known machine learning techniques in the batch setting that leverage ensemble performance. Most of the methods use similarity-based methods as a proxy, culminating in high computation costs and becoming unfeasible in many streaming scenarios. In this paper, we propose a DCS method able to cope with the high-speed streaming setting, which is based on the performance of base learners in the most recent instances. The impact of our method is evaluated with different ensembles for data streams. We also propose modifications to an Online Boosting method, which has its performance improved with DCS. Our method increases the accuracy and kappa statistic of state-of-the-art ensembles with low processing time and memory overhead.
@inproceedings{IJCNN_NOWAK:2023, author = {Assis, Daniel Nowak and Enembreck, Fabricio and Barddal, Jean Paul}, title = {Mass-Based Short Term Selection of Classifiers in Data Streams}, booktitle = {2023 International Joint Conference on Neural Networks, {IJCNN} 2023, Gold Coast, Australia, 2023}, year = {2023}, }
INFUS
Exploring diversity in data complexity and classifier decision spaces for pool generation

Marcos Monteiro, Alceu S. Britto, Jean Paul Barddal, Luiz S. Oliveira, and Robert Sabourin

Information Fusion 2023

This paper introduces a novel method for classifier pool generation in which a two-level strategy explores diversity in both data complexity and classifier decision spaces. The rationale is to induce pool members using data subsets representing subproblems with different difficulties while promoting diversity in classifiers’ decisions. Two possible variants of the proposed method with a focus on maximum dispersion and maximum accuracy are presented. These differ in the property used to define the best pool of classifiers provided by an optimization process. A robust experimental protocol encompassing 28 classification datasets shows that the proposed pool generation provided the best accuracy in 327 of 336 experiments (97.3%) when compared to well-known pool generation methods to provide multiple classifier systems with and without dynamic selection.
@article{INFUS_MARCOS_MONTEIRO:2022, title = {Exploring diversity in data complexity and classifier decision spaces for pool generation}, journal = {Information Fusion}, volume = {89}, pages = {567--587}, year = {2023}, issn = {1566-2535}, doi = {10.1016/j.inffus.2022.09.001}, url = {https://www.sciencedirect.com/science/article/pii/S1566253522001336}, author = {Monteiro, Marcos and Britto, Alceu S. and Barddal, Jean Paul and Oliveira, Luiz S. and Sabourin, Robert}, keywords = {Classifier pool generation, Diversity, Data complexity measures}, }

2022

ARXIV
Evaluating k-NN in the Classification of Data Streams with Concept Drift

Roberto Souto Maior Barros, Silas Garrido Teixeira de Carvalho Santos, and Jean Paul Barddal

2022

Data streams are often defined as large amounts of data flowing continuously at high speed. Moreover, these data are likely subject to changes in data distribution, known as concept drift. Given all the reasons mentioned above, learning from streams is often online and under restrictions of memory consumption and run-time. Although many classification algorithms exist, most of the works published in the area use Naive Bayes (NB) and Hoeffding Trees (HT) as base learners in their experiments. This article proposes an in-depth evaluation of k-Nearest Neighbors (k-NN) as a candidate for classifying data streams subjected to concept drift. It also analyses the complexity in time and the two main parameters of k-NN, i.e., the number of nearest neighbors used for predictions (k), and window size (w). We compare different parameter values for k-NN and contrast it to NB and HT both with and without a drift detector (RDDM) in many datasets. We formulated and answered 10 research questions which led to the conclusion that k-NN is a worthy candidate for data stream classification, especially when the run-time constraint is not too restrictive.
@misc{BARROS_ARXIV:2022, doi = {10.48550/ARXIV.2210.03119}, author = {de Barros, Roberto Souto Maior and Santos, Silas Garrido Teixeira de Carvalho and Barddal, Jean Paul}, keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), FOS: Computer and information sciences}, title = {Evaluating k-NN in the Classification of Data Streams with Concept Drift}, publisher = {arXiv}, year = {2022}, copyright = {arXiv.org perpetual, non-exclusive license}, }
SMC
Pattern Spotting and Image Retrieval in Historical Documents using Deep Hashing

Caio Silva Dias, Alceu Souza Britto Jr., Jean Paul Barddal, Laurent Heutte, and Alessandro Lameiras Koerich

In Proceedings of the IEEE Systems, Man, and Cybernetics Conference (IEEE SMC) 2022

This paper presents a deep learning approach for image retrieval and pattern spotting in digital collections of historical documents. First, a region proposal algorithm detects object candidates in the document page images. Next, deep learning models are used for feature extraction, considering two distinct variants, which provide either real-valued or binary code representations. Finally, candidate images are ranked by computing the feature similarity with a given input query. A robust experimental protocol evaluates the proposed approach considering each representation scheme (real-valued and binary code) on the DocExplore image database. The experimental results show that the proposed deep models compare favorably to the state-of-the-art image retrieval approaches for images of historical documents, outperforming other deep models by 2.56 percentage points using the same techniques for pattern spotting. Besides, the proposed approach also reduces the search time up to 200x, and the storage cost up to 6,000x when compared to related works based on real-valued representations.
@inproceedings{SMC_PATTERN_SPOTTING:2022, author = {da Silva Dias, Caio and de Souza Britto Jr., Alceu and Barddal, Jean Paul and Heutte, Laurent and Koerich, Alessandro Lameiras}, title = {Pattern Spotting and Image Retrieval in Historical Documents using Deep Hashing}, booktitle = {Proceedings of the IEEE Systems, Man, and Cybernetics Conference (IEEE SMC)}, year = {2022}, }
SMC
Improving Data Stream Classification using Incremental Yeo-Johnson Power Transformation

Eduardo Tieppo, Jean Paul Barddal, and Julio Cesar Nievola

In Proceedings of the IEEE Systems, Man, and Cybernetics Conference (IEEE SMC) 2022

Data transformation plays an essential role as a preprocessing step in learning models. Several classification techniques have premises about the underlying data distribution, such as the normal distribution assumed in Bayesian classifiers. However, applying data transformation in a streaming setting requires processing an infinite and continuous flow of data. In this paper, we propose the Incremental Yeo-Johnson Power Transformation, a variant of the well-known batch Yeo-Johnson transformation that is tailored for streaming settings, i.e., it supports streaming data via statistical sampling and hypothesis testing. Experimental results show that our proposal achieves the same data normality as its batch counterpart. In addition, it improves the prediction performance of a data stream classifier based on Bayesian statistical models. Overall, learning models obtained an improvement of 3 percentage points.
@inproceedings{SMC_YEOJOHNSON:2022, author = {Tieppo, Eduardo and Barddal, Jean Paul and Nievola, Julio Cesar}, title = {Improving Data Stream Classification using Incremental Yeo-Johnson Power Transformation}, booktitle = {Proceedings of the IEEE Systems, Man, and Cybernetics Conference (IEEE SMC)}, year = {2022}, }
ESANN
A Machine Learning Approach for School Dropout Prediction in Brazil

João Gabriel Corrêa Kruger, Jean Paul Barddal, and Alceu Souza Britto Jr.

In Proceedings of the 30th European Symposium on Artificial Neural Networks (ESANN) 2022

School dropout is a severe problem that impacts many socio-economic aspects, including inequality. Dropout prediction algorithms can help remediate this problem, although several past attempts in the literature did so using datasets with small numbers of students. This paper brings forward an experimental approach of machine learning for school dropout prediction in Brazilian schools. The data used for this study was first retrieved from the academic systems of a group of Brazilian private schools, which was later enriched with socio-economic data extracted from governmental sources. Using the dataset to train different types of classifiers, we obtained precision scores of up to 95.2% when predicting dropout at different year moments and educational stages, thus allowing schools to plan and apply retention strategies.
@inproceedings{ESANN_KRUGER:2022, author = {Kruger, João Gabriel Corrêa and Barddal, Jean Paul and de Souza Britto Jr., Alceu}, title = {A Machine Learning Approach for School Dropout Prediction in Brazil}, booktitle = {Proceedings of the 30th European Symposium on Artificial Neural Networks (ESANN)}, year = {2022}, }
IJCNN
Classifying Hierarchical Data Streams using Global Classifiers and Summarization Techniques

Eduardo Tieppo, Jean Paul Barddal, and Julio Cesar Nievola

In 2022 International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy, 2022

The hierarchical classification of data streams requires models capable of handling a class hierarchy and updating themselves whenever a new example arrives, within restrained processing time and memory consumption. Current state-of-the-art models store raw instances and handle the hierarchy locally, performing a high number of computations at every hierarchy level and with all, possibly redundant, data. This paper introduces Global k-Nearest Centroids (kNC) and Global Dribble, two novel methods for the hierarchical classification of data streams. Both methods use summarization techniques to represent data with constant computational resource usage and a global classification approach to process instances in less time when compared to local strategies. We compare both methods with a state-of-the-art local classifier, and the proposed methods achieved a higher number of correct predictions and processed instances nearly twice as fast.
@inproceedings{IJCNN_CHDS:2022, author = {Tieppo, Eduardo and Barddal, Jean Paul and Nievola, Julio Cesar}, title = {Classifying Hierarchical Data Streams using Global Classifiers and Summarization Techniques}, booktitle = {2022 International Joint Conference on Neural Networks, {IJCNN} 2022, Padua, Italy, 2022}, year = {2022}, }
IJCNN
Evaluation of Self-taught Learning-based Representations for Facial Emotion Recognition

Bruna Delazeri, Leonardo Leon Veras, Jean Paul Barddal, Alessandro L. Koerich, and Alceu Souza Britto Jr.

In 2022 International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy, 2022

This work describes different strategies to generate unsupervised representations obtained through the concept of self-taught learning for facial emotion recognition (FER). The idea is to create complementary representations promoting diversity by varying the autoencoders’ initialization, architecture, and training data. SVM, Bagging, Random Forest, and a dynamic ensemble selection method are evaluated as final classification methods. Experimental results on JAFFE and Cohn-Kanade datasets using a leave-one-subject-out protocol show that FER methods based on the proposed diverse representations compare favorably against state-of-the-art approaches that also explore unsupervised feature learning.
@inproceedings{IJCNN_FER:2022, author = {Delazeri, Bruna and Veras, Leonardo Leon and Barddal, Jean Paul and Koerich, Alessandro L. and de Souza Britto Jr., Alceu}, title = {Evaluation of Self-taught Learning-based Representations for Facial Emotion Recognition}, booktitle = {2022 International Joint Conference on Neural Networks, {IJCNN} 2022, Padua, Italy, 2022}, year = {2022}, }
IJCNN
Assessing Batch and Online Learning for Delivery in Full and On Time Predictions

Adriano Alves Lima, Márcio Venâncio Batista, Jean Paul Barddal, Danilo Sipoli Sanches, and Luiz Eduardo Soares Oliveira

In 2022 International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy, 2022

Improving results by optimizing process execution is one objective of major companies. For these corporations, the main point for achieving better results is the good maintenance of supply chain management. The most important supply chain metric is Delivery in Full and On Time (DIFOT). DIFOT measures how well a supply chain delivers value to the customer. In this work, we bring forward an analysis of DIFOT prediction from a large Brazilian food company. More specifically, we compare a batch and an online learning algorithm for DIFOT prediction and depict why the latter is suitable for this problem. Furthermore, we report a feature drift analysis to identify whether there are considerable shifts along the dataset timespan. As a byproduct of this research, we make the dataset used in this analysis publicly available for future research in DIFOT prediction.
@inproceedings{DIFOT_IJCNN:2022, author = {de Lima, Adriano Alves and Batista, Márcio Venâncio and Barddal, Jean Paul and Sanches, Danilo Sipoli and de Oliveira, Luiz Eduardo Soares}, title = {Assessing Batch and Online Learning for Delivery in Full and On Time Predictions}, booktitle = {2022 International Joint Conference on Neural Networks, {IJCNN} 2022, Padua, Italy, 2022}, year = {2022}, }
ESWA
A Systematic Review on Computer Vision-Based Parking Lot Management Applied on Public Datasets

Paulo Ricardo Lisboa Almeida, Jeovane Honório Alves, Rafael Stubs Parpinelli, and Jean Paul Barddal

Expert Systems with Applications 2022

Computer vision-based parking lot management methods have been extensively researched owing to their flexibility and cost-effectiveness. To evaluate such methods, authors often employ publicly available parking lot image datasets. In this study, we surveyed and compared robust publicly available image datasets specifically crafted to test computer vision-based methods for parking lot management approaches and consequently present a systematic and comprehensive review of existing works that employ such datasets. The literature review identified relevant gaps that require further research, such as the requirement of dataset-independent approaches and methods suitable for autonomous detection of the position of parking spaces. In addition, we have noticed that several important factors, such as the presence of the same cars across consecutive images, have been neglected in most studies, thereby rendering assessment protocols unrealistic. Furthermore, the analysis of the datasets also revealed that certain features that should be present when developing new benchmarks, such as the availability of video sequences and images taken in more diverse conditions, including nighttime and snow, have not been incorporated.
@article{ESWA_PKLOT_SURVEY:2022, year = {2022}, publisher = {Elsevier}, author = {de Almeida, Paulo Ricardo Lisboa and Alves, Jeovane Honório and Parpinelli, Rafael Stubs and Barddal, Jean Paul}, title = {A Systematic Review on Computer Vision-Based Parking Lot Management Applied on Public Datasets}, journal = {Expert Systems with Applications}, }
ICAART
Univariate Time Series Prediction Using Data Stream Mining Algorithms and Temporal Dependence

Marcos Alberto Mochinski, Jean Paul Barddal, and Fabricio Enembreck

In Proceedings of International Conference on Agents and Artificial Intelligence, ICAART 2022

In this paper, we present an exploratory study conducted to evaluate the impact of temporal dependence modeling on time series forecasting with Data Stream Mining (DSM) techniques. DSM algorithms have been used successfully in many domains that exhibit continuous generation of non-stationary data. However, the use of DSM in time series is rare since they usually are univariate and exhibit strong temporal dependence. This is the main motivation for this work: the study mitigates this gap by presenting a univariate time series prediction method based on AdaGrad (a DSM algorithm), Auto.Arima (a statistical method) and features extracted from adjusted autocorrelation function (ACF) coefficients. The proposed method uses adjusted ACF features to convert the original series observations into multivariate data, executes the fitting process using the DSM and the statistical algorithm, and combines AdaGrad’s and Auto.Arima’s forecasts to establish the final predictions for each time series. Experiments conducted with five datasets containing 141,558 time series resulted in up to 12.429% improvements in sMAPE (Symmetric Mean Absolute Percentage Error) error rates when compared to Auto.Arima. The results show that combining DSM with ACF features and statistical time series methods is a suitable approach for univariate forecasting.
@inproceedings{TIEPPO_KNC_DRIBBLE, author = {Mochinski, Marcos Alberto and Barddal, Jean Paul and Enembreck, Fabricio}, title = {Univariate Time Series Prediction Using Data Stream Mining Algorithms and Temporal Dependence}, booktitle = {Proceedings of International Conference on Agents and Artificial Intelligence, {ICAART} 2022}, year = {2022}, }
SAC
Automatic Disease Vector Mosquitoes Identification via Hierarchical Data Stream Classification

Eduardo Tieppo, Jean Paul Barddal, and Julio Cesar Nievola

In Proceedings of the Annual ACM Symposium on Applied Computing, SAC 2022

Vector-borne diseases (VBDs), such as Dengue or Malaria, are one of the main concerns of public health agencies and governments. These diseases are mainly spread by mosquitoes acting as vectors by transmitting infected blood between humans. Machine learning can be used to design and improve control strategies of VBDs by providing models able to recognize disease vector mosquitoes and automatically capture or kill harmful species. The automatic identification of disease vector mosquitoes has not yet been addressed concerning the hierarchical classification of data streams. Thus, reliable information, such as mosquitoes’ hierarchical taxonomy, has not been used to improve learning models. In this study, we propose a framework for the automatic identification of disease vector mosquitoes in the context of the hierarchical classification of data streams. To this end, we propose a hierarchical adaptation of a disease vector mosquitoes’ dataset to include their taxonomy and introduce kNC and Dribble, two novel classification methods fitted to hierarchical data streams representing the mosquitoes. Results show that our framework, using summarization techniques, achieves significantly better prediction and processing speed rates when compared to existing state-of-the-art models.
@inproceedings{TIEPPO_KNC_DRIBBLF, author = {Tieppo, Eduardo and Barddal, Jean Paul and Nievola, Julio Cesar}, title = {Automatic Disease Vector Mosquitoes Identification via Hierarchical Data Stream Classification}, booktitle = {Proceedings of the Annual {ACM} Symposium on Applied Computing, {SAC} 2022}, year = {2022}, }
ACM CSUR
A Survey on Concept Drift in Process Mining

Denise Maria Vecino Sato, Sheila Cristiana Freitas, Jean Paul Barddal, and Edson Emilio Scalabrin

ACM Computing Surveys 2022

Concept drift in process mining (PM) is a challenge as classical methods assume processes are in a steady state, i.e., events share the same process version. We conducted a systematic literature review on the intersection of these areas, and thus, we review concept drift in process mining and bring forward a taxonomy of existing techniques for drift detection and online process mining for evolving environments. Existing works show that (i) PM still primarily focuses on offline analysis, and (ii) the assessment of concept drift techniques in processes is cumbersome due to the lack of common evaluation protocols, datasets, and metrics.
@article{SURVEY_PROCESS_MINING_DRIFT:2022, year = {2022}, publisher = {ACM}, author = {Sato, Denise Maria Vecino and de Freitas, Sheila Cristiana and Barddal, Jean Paul and Scalabrin, Edson Emilio}, title = {A Survey on Concept Drift in Process Mining}, journal = {ACM Computing Surveys}, }

2021

ICPM
Interactive Process Drift Detection: a framework for visual analysis of process drifts

Denise Maria Vecino Sato, Rafaela Mantovani Fontana, Jean Paul Barddal, and Edson Emilio Scalabrin

In International Conference on Process Mining (ICPM) - Demo track 2021

Abs Bib PDF

Interactive Process Drift Detection (IPDD) is a framework for visual analysis of process drifts. A process drift indicates that a change in the process model occurred at some point in time. The IPDD approach first generates process models for subparts of the event log using a sliding window approach. Then, IPDD detects the drifts by evaluating similarity metrics calculated between adjacent process models; a difference in some of the metrics indicates a drift. The current implementation of IPDD generates the process models using the directly-follows graph and applies two similarity metrics: node and edge similarity. The user interface shows the drifts in the process models over time, allowing the user to visually understand the model changes. Also, the user can easily change the hyperparameters for the drift analysis and verify the results on the interface. The user interface of IPDD also allows the user to evaluate the detected drifts by calculating the F-score metric, which is useful when using artificial datasets. The underlying idea is to ease the choice of a "good" value for the hyperparameter configuration, which is critical for almost any drift detection mechanism.
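The window-and-compare procedure described in this abstract can be illustrated with a minimal Python sketch. It is not the IPDD implementation: the non-overlapping fixed-size windows, the Jaccard index used for node and edge similarity, and the names `dfg`, `jaccard`, and `detect_drifts` are all assumptions made for illustration.

```python
def dfg(traces):
    """Build a directly-follows graph: activities as nodes, consecutive pairs as edges."""
    nodes, edges = set(), set()
    for trace in traces:
        nodes.update(trace)
        edges.update(zip(trace, trace[1:]))
    return nodes, edges


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed metric; 1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def detect_drifts(traces, window_size):
    """Compare the DFGs of adjacent windows and report where they differ.

    Returns a list of (trace_index, node_similarity, edge_similarity) tuples,
    one per window boundary at which either similarity drops below 1.0.
    """
    windows = [traces[i:i + window_size] for i in range(0, len(traces), window_size)]
    drifts = []
    for k in range(1, len(windows)):
        prev_nodes, prev_edges = dfg(windows[k - 1])
        cur_nodes, cur_edges = dfg(windows[k])
        node_sim = jaccard(prev_nodes, cur_nodes)
        edge_sim = jaccard(prev_edges, cur_edges)
        if node_sim < 1.0 or edge_sim < 1.0:
            drifts.append((k * window_size, node_sim, edge_sim))
    return drifts


# Hypothetical event log: the order of activities b and c swaps after four traces.
log = [["a", "b", "c"]] * 4 + [["a", "c", "b"]] * 4
print(detect_drifts(log, window_size=2))
```

In this toy log the activities stay the same while the directly-follows relations change, so only the edge similarity drops. The window size plays the role of the hyperparameter the abstract says the user must tune.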
@inproceedings{SATO_ICPM:2021, author = {Sato, Denise Maria Vecino and Fontana, Rafaela Mantovani and Barddal, Jean Paul and Scalabrin, Edson Emilio}, title = {Interactive Process Drift Detection: a framework for visual analysis of process drifts}, booktitle = {International Conference on Process Mining (ICPM) - Demo track}, year = {2021}, }
AIRE
Hierarchical classification of data streams: a systematic literature review

Eduardo Tieppo, Roger Robson Santos, Jean Paul Barddal, and Júlio Cesar Nievola

Artificial Intelligence Review 2021

Abs Bib PDF

The classification task usually works with flat and batch learners, assuming problems as stationary and without relations between class labels. Nevertheless, several real-world problems do not assume these premises, i.e., data have labels organized hierarchically and are made available in streaming fashion, meaning that their behavior can drift over time. Existing studies on hierarchical classification do not consider data streams as input of their process, and thus, data is assumed as stationary and handled through batch learners. The same can be said about works on streaming data, as the hierarchical classification is overlooked. Studies concerning each area individually are promising, yet, do not tackle their intersection. This study analyzes the main characteristics of the state-of-the-art works on hierarchical classification for streaming data concerning five aspects: (i) problems tackled, (ii) datasets, (iii) algorithms, (iv) evaluation metrics, and (v) research gaps in the area. We performed a systematic literature review of primary studies and retrieved 3,722 papers, of which 42 were identified as relevant and used to answer the aforementioned research questions. We found that the problems handled by hierarchical classification of data streams include mainly classification of images, human activities, texts, and audio; the datasets are mostly created or synthetic data; the algorithms and evaluation metrics are well-known techniques or based on those; and research gaps are related to dynamic context, data complexity, and computational resources constraints. We also provide implications for future research and experiments to consider common characteristics shared amongst hierarchical classification and data stream classification.
@article{TIEPPO_SLR:2021, year = {2021}, publisher = {Springer}, author = {Tieppo, Eduardo and dos Santos, Roger Robson and Barddal, Jean Paul and Nievola, Júlio Cesar}, title = {Hierarchical classification of data streams: a systematic literature review}, journal = {Artificial Intelligence Review}, }
BRACIS
Classifying Potentially Unbounded Hierarchical Data Streams with Incremental Gaussian Naive Bayes

Eduardo Tieppo, Julio Cesar Nievola, and Jean Paul Barddal

In Brazilian Conference on Intelligent System (BRACIS) 2021

Abs Bib PDF

Hierarchical Classification of Data Streams inherits the properties and constraints of Hierarchical Classification and Data Stream Classification areas concomitantly. Therefore, it requires novel approaches that (i) can handle class hierarchies, (ii) can be updated over time, and (iii) are computationally light-weighted regarding processing time and memory usage. In this study, we propose the Gaussian Naive Bayes for Hierarchical Data Streams (GNB-hDS) method: an incremental Gaussian Naive Bayes for classifying potentially unbounded hierarchical data streams. The GNB-hDS method uses statistical summaries of the data stream instead of storing actual instances. These statistical summaries allow more efficient data storage, maintain constant computational time and memory, and calculate the probability of an instance belonging to a specific class via the Bayes’ Theorem. We compare our method against a technique that stores raw instances, and results show that our method obtains equivalent prediction rates while being statistically faster.
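The idea of replacing stored instances with statistical summaries can be sketched in Python as follows. This is a flat illustration under stated assumptions, not the GNB-hDS method: each class label is treated as an opaque string (the hierarchy-aware part is omitted), the summaries are kept with Welford's running mean and variance update, and the names `RunningGaussian` and `IncrementalGNB` are hypothetical.

```python
import math
from collections import defaultdict


class RunningGaussian:
    """Per-feature summary: count, mean, and sum of squared deviations (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def log_pdf(self, x, eps=1e-9):
        # Population variance plus a small constant to avoid division by zero.
        var = self.m2 / self.n + eps
        return -0.5 * math.log(2 * math.pi * var) - (x - self.mean) ** 2 / (2 * var)


class IncrementalGNB:
    """Gaussian Naive Bayes that keeps only summaries, never the raw instances."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.stats = {}  # label -> list of RunningGaussian, one per feature

    def learn_one(self, x, y):
        if y not in self.stats:
            self.stats[y] = [RunningGaussian() for _ in x]
        self.class_counts[y] += 1
        for summary, value in zip(self.stats[y], x):
            summary.update(value)

    def predict_one(self, x):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for y, summaries in self.stats.items():
            # Log prior plus the sum of per-feature Gaussian log-likelihoods.
            score = math.log(self.class_counts[y] / total)
            score += sum(s.log_pdf(v) for s, v in zip(summaries, x))
            if score > best_score:
                best, best_score = y, score
        return best


# Hypothetical one-feature stream with two label paths.
model = IncrementalGNB()
for value, label in [(0.0, "A/x"), (0.5, "A/x"), (-0.5, "A/x"),
                     (10.0, "B/y"), (9.5, "B/y"), (10.5, "B/y")]:
    model.learn_one([value], label)
print(model.predict_one([0.2]), model.predict_one([9.7]))
```

Memory use depends only on the number of classes and features, not on the stream length, which is the property the abstract attributes to the summaries.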
@inproceedings{TIEPPO_BRACIS:2021, author = {Tieppo, Eduardo and Nievola, Julio Cesar and Barddal, Jean Paul}, title = {Classifying Potentially Unbounded Hierarchical Data Streams with Incremental Gaussian Naive Bayes}, booktitle = {Brazilian Conference on Intelligent System (BRACIS)}, year = {2021}, }
SMC
Adaptive Global k-Nearest Neighbors for Hierarchical Classification of Data Streams

Eduardo Tieppo, Jean Paul Barddal, and Julio Cesar Nievola

In IEEE Conference on Systems, Man, and Cybernetics (SMC) 2021

Abs Bib PDF

Data stream classification differs from batch learning classification methods as data is made available sequentially and may drift over time. Therefore, data stream classification can be simultaneous to all other kinds of classification problems, and it has been revisiting many aspects related to classification in the last years. So far, hierarchical classification was weakly addressed in streaming scenarios despite being a well-established research topic. In this paper, we propose the adaptive global k-Nearest Neighbors for hierarchical classification of data streams (Global kNN-hDS). Our proposal is able to classify hierarchical data streams using a constrained memory buffer and following a global approach. We compare our method against a local kNN also tailored for streaming scenarios, and results show that our method obtains competitive prediction rates while being statistically faster.
@inproceedings{TIEPPO_SMC:2021, author = {Tieppo, Eduardo and Barddal, Jean Paul and Nievola, Julio Cesar}, title = {Adaptive Global k-Nearest Neighbors for Hierarchical Classification of Data Streams}, booktitle = {IEEE Conference on Systems, Man, and Cybernetics (SMC)}, year = {2021}, }
IJCNN
Dynamically Selected Ensemble for Data Stream Classification

Lucca Portes Cavalheiro, Jean Paul Barddal, Alceu Souza Britto Jr., and Laurent Heutte

In International Joint Conference on Neural Networks (IJCNN) 2021

Abs Bib PDF

Mining data streams is a hot topic in the machine learning (ML) community. In addition to learning and updating accurate models over time, these techniques must respect constraints that are not necessarily as strong in batch mode, such as processing time and memory consumption efficiency. A successful family of techniques in batch ML is dynamic classifier selection (DCS). However, these are largely overlooked in data stream mining. In this paper, we propose a novel dynamic classifier selection framework for data streams called Double Dynamic Classifier Selection (DDCS). We compare DDCS against state-of-the-art methods for mining data streams in both synthetic and real-world datasets. Results show that DDCS not only outperforms the state-of-the-art ensemble methods for data stream classification in terms of accuracy but is also significantly more efficient in terms of processing time and memory consumption.
@inproceedings{LUCCA1_IJCNN:2021, author = {Cavalheiro, Lucca Portes and Barddal, Jean Paul and de Souza Britto Jr., Alceu and Heutte, Laurent}, title = {Dynamically Selected Ensemble for Data Stream Classification}, booktitle = {International Joint Conference on Neural Networks (IJCNN)}, year = {2021}, }
IJCNN
Towards the Overcome of Performance Pitfalls in Data Stream Mining Tools

Lucca Portes Cavalheiro, Marco Antonio Alves Zanata, and Jean Paul Barddal

In International Joint Conference on Neural Networks (IJCNN) 2021

Abs Bib PDF

Data stream mining is an essential task in today’s scientific community. It allows machine learning models to be updated over time as new data becomes available. Three pillars should be accounted for when selecting an appropriate algorithm for data stream mining: accuracy, processing time, and memory consumption. To develop and assess machine learning models in streaming scenarios, different tools have been developed, where the Massive Online Analysis, written in Java, and scikit-multiflow, written in Python, are in the spotlight. Despite the ease of use of both tools, neither is focused on performance, which jeopardizes the usage of computational resources. In this paper, we show that with the right tools, Python libraries reach performance comparable to C/C++. More specifically, we show how optimized implementations in scikit-multiflow using low-level languages, i.e., C++, C++ with Intel Intrinsics, and Rust, with bindings to Python, vastly outperform existing tools in computational resource usage while keeping predictive performance intact.
@inproceedings{LUCCA2_IJCNN:2021, author = {Cavalheiro, Lucca Portes and Zanata, Marco Antonio Alves and Barddal, Jean Paul}, title = {Towards the Overcome of Performance Pitfalls in Data Stream Mining Tools}, booktitle = {International Joint Conference on Neural Networks (IJCNN)}, year = {2021}, }
ICAISC
Interactive Process Drift Detection Framework

Denise Maria Vecino Sato, Jean Paul Barddal, and Edson Emilio Scalabrin

In International Conference on Artificial Intelligence and Soft Computing (ICAISC) 2021

Abs Bib PDF

This paper presents a novel tool for detecting drifts in process models. The tool targets the challenge of defining the best parameter configuration for detecting drifts by providing an interactive user interface. Using this interface, the user can quickly change the parameters and verify how the process evolved. The process evolution is presented in a timeline of process models, simulating a “replay” of models over time. One instantiation of the framework was implemented using a fixed-size sliding window, discovering process maps using directly-follows graphs (DFGs), and calculating node and edge similarities. This instantiation was evaluated using a benchmarking dataset of simple and complex drift patterns. The tool correctly detected 17 of the 18 change patterns, thus confirming its potential when an adequate window size is set. The user interface shows that replaying the process models provides a visual understanding of the changing process. The concept drift is explained by the similarity metrics’ differences, thus allowing drift localization.
@inproceedings{DENISE:2021, author = {Sato, Denise Maria Vecino and Barddal, Jean Paul and Scalabrin, Edson Emilio}, title = {Interactive Process Drift Detection Framework}, booktitle = {International Conference on Artificial Intelligence and Soft Computing (ICAISC)}, year = {2021}, }
ESWA
A case study of batch and incremental recommender systems in supermarket data under concept drifts and cold start

Antônio David Viniski, Jean Paul Barddal, Alceu Souza Britto Jr., Fabricio Enembreck, and Humberto Vinicius Aparecido Campos

Expert Systems with Applications 2021

Abs Bib PDF

Recommender systems uncover relationships between users and items, thus allowing personalized recommendations. Nonetheless, users’ preferences may change over time, the so-called concept drifts; or new users and items may appear, making the recommender system unable to accurately map the relationship between users and items due to the cold start problem. Consequently, concept drift and cold start are challenges that downgrade the recommender system’s predictive performance. This paper assesses existing approaches for collaborative-filtering recommender systems over a real supermarket dataset that exhibits both of the issues mentioned above. For this purpose, our comparative analysis encompasses batch and streaming learning approaches. As a result, we can observe that streaming-based models achieve better recommendation rates since these are tailored to fit the concept drift. More specifically, the predictive performance of streaming-based recommendations increases by up to 21% over those provided by batch methods. The supermarket dataset used in experimentation is also made publicly available for future studies and recommender systems comparisons.
@article{ESWA_SUPERMARKET:2021, year = {2021}, publisher = {Elsevier}, author = {Viniski, Ant\^{o}nio David and Barddal, Jean Paul and de Souza Britto Jr., Alceu and Enembreck, Fabricio and de Campos, Humberto Vinicius Aparecido}, title = {A case study of batch and incremental recommender systems in supermarket data under concept drifts and cold start}, journal = {Expert Systems with Applications}, }
PAKDD
UKIRF: An Item Rejection Framework for Improving Negative Items Sampling in One-Class Collaborative Filtering

Antônio David Viniski, Jean Paul Barddal, and Alceu Souza Britto Jr.

In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2021

Abs Bib PDF

Collaborative Filtering (CF) is one of the most successful techniques in recommender systems. Most CF scenarios depict positive-only implicit feedback, which means that negative feedback is unavailable. Therefore, One-Class Collaborative Filtering (OCCF) techniques have been tailored to tackle these scenarios. Nonetheless, several OCCF models still require negative observations during training, and thus, a popular approach is to consider randomly selected unknown relationships as negative. In this work, we bring forward a novel approach for selecting negative items called Unknown Item Rejection Framework (UKIRF). More specifically, we instantiate UKIRF using similarity approaches, i.e., TF-IDF and Cosine, to reject items that are similar to those a user interacted with. We apply UKIRF to different OCCF models in different datasets and show that it improves the recall rates by up to 24% when compared to random sampling.
@inproceedings{BARDDAL_PAKDD:2021, author = {Viniski, Antônio David and Barddal, Jean Paul and de Souza Britto Jr., Alceu}, title = {UKIRF: An Item Rejection Framework for Improving Negative Items Sampling in One-Class Collaborative Filtering}, booktitle = {Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)}, year = {2021}, }

2020

ICPR
Classifier Pool Generation based on a Two-level Diversity Approach

Marcos Monteiro, Alceu Souza Britto Jr, Jean Paul Barddal, Luiz Soares Oliveira, and Robert Sabourin

In International Conference on Pattern Recognition (ICPR) 2020

Abs Bib PDF

This paper describes a classifier pool generation method guided by the diversity estimated on the data complexity and classifier decisions. First, the behavior of complexity measures is assessed by considering several subsamples of the dataset. The complexity measures with high variability across the subsamples are selected for posterior pool adaptation, where an evolutionary algorithm optimizes diversity in both complexity and decision spaces. A robust experimental protocol with 28 datasets and 20 replications is used to evaluate the proposed method. Results show significant accuracy improvements in 69.4% of the experiments when Dynamic Classifier Selection and Dynamic Ensemble Selection methods are applied.
@inproceedings{BARDDAL_ICPR:2020, author = {Monteiro, Marcos and de Souza Britto Jr, Alceu and Barddal, Jean Paul and Oliveira, Luiz Soares and Sabourin, Robert}, title = {Classifier Pool Generation based on a Two-level Diversity Approach}, booktitle = {International Conference on Pattern Recognition (ICPR)}, year = {2020}, }
SMC
Combining Slow and Fast Learning for Improved Credit Scoring

Jean Paul Barddal, Fabricio Enembreck, Lucas Loezer, and Riccardo Lanzuolo

In IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC) 2020

Abs Bib PDF

The financial credibility of a person is a relevant factor to determine whether a loan should be approved or not, and it is quantified by a credit score, which is computed using past performance on debt obligations, profiling, and other data available. Credit scoring becomes an even hotter topic in emerging countries, as interest rates and customer behavior swiftly vary, given the economic (in)stability of the country and as fintechs are chasing robust solutions for improved credit scoring. Batch machine learning models are often deployed for credit scoring, yet they are tailored for static scenarios, i.e., they are not prepared to swiftly detect and adapt to changes in customer behavior, thus leading to slow recovery in such scenarios. In this paper, we bring forward an analysis on how batch machine learning can be combined with data stream mining techniques, thus leading to better recognition rates in credit scoring scenarios. We analyze three different real-world datasets from Brazilian financial institutions, whilst keeping their secrecy preserved, and show how batch and stream learning can be combined towards improved credit scoring systems, as well as highlighting relevant gaps that still require attention.
@inproceedings{BARDDAL_SMC1_2020, author = {Barddal, Jean Paul and Enembreck, Fabricio and Loezer, Lucas and Lanzuolo, Riccardo}, title = {Combining Slow and Fast Learning for Improved Credit Scoring}, booktitle = {IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC)}, year = {2020}, }
SMC
Naïve Approaches to Deal with Concept Drifts

Paulo Ricardo Lisboa de Almeida, Luiz Oliveira, Alceu Souza Britto Jr., and Jean Paul Barddal

In IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC) 2020

Abs Bib PDF

A common problem in machine learning is to find representative real-world problems to put the methods to test. When developing approaches to deal with concept drifts, some datasets such as the Forest Covertype and Nebraska Weather are a common choice for testing. We argue that some well-known real-world concept drift datasets present a high serial dependence in the target class and may have only minor changes. With this in mind, we propose the use of naïve methods that should be used for comparison with methods that deal with concept drifts. The experimental results using six real-world well-known concept drift datasets show that the naïve approaches can be better than some methods to deal with possible concept drifts in datasets such as the Forest Covertype, Electricity, and Nebraska Weather. These results suggest that some widely used datasets may be trivial from the concept drift standpoint, and thus, should be avoided or at least the results should be compared with the proposed naïve methods.
@inproceedings{BARDDAL_SMC2_2020, author = {de Almeida, Paulo Ricardo Lisboa and Oliveira, Luiz and de Souza Britto Jr., Alceu and Barddal, Jean Paul}, title = {Naïve Approaches to Deal with Concept Drifts}, booktitle = {IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC)}, year = {2020}, }
SMC
Improving Multiple Time Series Forecasting with Data Stream Mining Algorithms

Marcos Alberto Mochinski, Jean Paul Barddal, and Fabricio Enembreck

In IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC) 2020

Abs Bib PDF

This paper proposes a hybrid ensemble learning approach that combines statistical and data stream mining algorithms to obtain better forecasting performance in multiple time series prediction problems. Although some multiple time series algorithms perform surprisingly well in a variety of domains, it is well-known that none is dominant in every domain. Therefore, we developed a meta-technique based on data stream mining and a static ensemble selection strategy and evaluated its forecasting goodness-of-fit in time series datasets from the M3 and M4 competitions. After training different regression models, we show how the combination of auto.arima and AdaGrad leads to improved forecasting rates, thus surpassing the results of state-of-the-art algorithms.
@inproceedings{BARDDAL_SMC3_2020, author = {Mochinski, Marcos Alberto and Barddal, Jean Paul and Enembreck, Fabricio}, title = {Improving Multiple Time Series Forecasting with Data Stream Mining Algorithms}, booktitle = {IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC)}, year = {2020}, }
SMC
ADADRIFT: An Adaptive Learning Technique for Long-History Stream-Based Recommender Systems

Eduardo Ferreira José, Fabricio Enembreck, and Jean Paul Barddal

In IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC) 2020

Abs Bib PDF

Adaptive recommender systems are increasingly showing their importance as profiling is a dynamic problem. Their goal is to update recommendation models as new interactions take place, thus swiftly adapting to drifts in the user’s behavior and desires, and item’s audience. However, existing recommendation algorithms usually do not perform well during drifts, as they take long to adapt to changes, or these updates are suboptimal since they account for all profiles’ preferences equally, which is often untrue as each individual and its changes are unique. In this paper, we propose the ADADRIFT algorithm to deal with user and item-based drifts in adaptive recommender systems using personalized learning rates based on profile statistics. The experiments using stream-based recommender systems (ISGD and BRISMF) across four different datasets show that ADADRIFT surpasses ADADELTA with significant improvements in recommendation rates. The best results appear when the data streams have a long history of the users’ or items’ interactions and drifts become noticeable. The experimentation in this work highlights the importance of handling drifts in recommender systems.
@inproceedings{BARDDAL_SMC4_2020, author = {José, Eduardo Ferreira and Enembreck, Fabricio and Barddal, Jean Paul}, title = {{ADADRIFT}: An Adaptive Learning Technique for Long-History Stream-Based Recommender Systems}, booktitle = {IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC)}, year = {2020}, }
ESWA
Lessons learned from data stream classification applied to credit scoring

Jean Paul Barddal, Lucas Loezer, Fabrício Enembreck, and Riccardo Lanzuolo

Expert Systems with Applications 2020

Abs Bib PDF

The financial credibility of a person is a factor used to determine whether a loan should be approved or not, and this is quantified by a ‘credit score,’ which is calculated using a variety of factors, including past performance on debt obligations and profiling, amongst others. Machine learning has been widely applied to automate the development of effective credit scoring models over the years. Yet, studies show that the development of robust credit scoring models may take longer than a year, and thus, if the behavior of customers changes over time, the model will be outdated even before its deployment. In this paper, we made 3 anonymized real-world credit scoring datasets available alongside the results obtained. In each of these datasets, we verify whether the credit scoring task should be thought of as an ephemeral scenario since many of the variables may drift over time, and thus, data stream mining techniques should be used since they were tailored for incremental learning and to detect and adapt to changes in the data distribution. Therefore, we compare traditional batch machine learning algorithms with data stream algorithms in different validation schemes using both Kolmogorov–Smirnov and Population Stability Index metrics. Furthermore, we also provide insights on the importance of features according to their Information Value, Mean Decrease Impurity, and Mean Positional Gain metrics, such that the last depicts changes in the importance of features over time. For 2 of the 3 tested datasets, the results obtained by data stream learners are comparable to predictive models currently in use, thus showing the efficiency of data stream classification for the credit scoring task.
@article{BARDDAL_LESSONS_LEARNED:2020, title = {Lessons learned from data stream classification applied to credit scoring}, journal = {Expert Systems with Applications}, pages = {113899}, year = {2020}, issn = {0957-4174}, doi = {10.1016/j.eswa.2020.113899}, url = {http://www.sciencedirect.com/science/article/pii/S0957417420306928}, author = {Barddal, Jean Paul and Loezer, Lucas and Enembreck, Fabrício and Lanzuolo, Riccardo}, keywords = {Credit scoring, Machine learning datasets, Data stream classification}, }
ANN. TELECOM.
Regularized and incremental decision trees for data streams

Jean Paul Barddal, and Fabricio Enembreck

Annals of Telecommunications 2020

Abs Bib PDF

Decision trees are a widely used family of methods for learning predictive models from both batch and streaming data. Despite depicting positive results in a multitude of applications, incremental decision trees continuously grow in terms of nodes as new data becomes available, i.e., they eventually split on all features available, and also multiple times using the same feature, thus leading to unnecessary complexity and overfitting. With this behavior, incremental trees lose the ability to generalize well, be human-understandable, and be computationally efficient. To tackle these issues, we proposed in a previous study a regularization scheme for Hoeffding decision trees that: (i) uses a penalty factor to control the gain obtained by creating a new split node using a feature that has not been used thus far; and (ii) uses information from previous splits in the current branch to determine whether the gain observed indeed justifies a new split. In this paper, we extend this analysis and apply the proposed regularization scheme to other types of incremental decision trees and report the results in both synthetic and real-world scenarios. The main interest is to verify whether and how the proposed regularization scheme affects the different types of incremental trees. Results show that in addition to the original Hoeffding Tree, the Adaptive Random Forest also benefits from regularization, yet McDiarmid Trees and Extremely Fast Decision Trees observe declines in accuracy.
@article{Barddal_Enembreck:2020, year = {2020}, publisher = {Springer}, author = {Barddal, Jean Paul and Enembreck, Fabricio}, title = {Regularized and incremental decision trees for data streams}, journal = {Annals of Telecommunications}, }
IJCNN
An End-to-End Approach for Recognition of Modern and Historical Handwritten Numeral Strings

André Gustavo Hochuli, Alceu Souza Britto Jr., Jean Paul Barddal, Luiz Eduardo Soares Oliveira, and Robert Sabourin

In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN) Glasgow, Scotland 2020

Abs Bib PDF

An end-to-end solution for handwritten numeral string recognition is proposed, in which the numeral string is considered as composed of objects automatically detected and recognized by a YOLO-based model. The main contribution of this paper is to avoid heuristic-based methods for string preprocessing and segmentation, the need for task-oriented classifiers, and also the use of specific constraints related to the string length. A robust experimental protocol based on several numeral string datasets, including one composed of historical documents, has shown that the proposed method is a feasible end-to-end solution for numeral string recognition. Besides, it reduces the complexity of the string recognition task considerably since it drops classical steps, in particular preprocessing, segmentation, and a set of classifiers devoted to strings with a specific length.
@inproceedings{Hochuli2020, author = {Hochuli, Andr\'{e} Gustavo and de Souza Britto Jr., Alceu and Barddal, Jean Paul and Oliveira, Luiz Eduardo Soares and Sabourin, Robert}, title = {An End-to-End Approach for Recognition of Modern and Historical Handwritten Numeral Strings}, booktitle = {Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN) Glasgow, Scotland}, year = {2020}, }
SAC
Cost-sensitive learning for imbalanced data streams

Lucas Loezer, Fabricio Enembreck, Jean Paul Barddal, and Alceu Souza Britto Jr.

In Proceedings of the 35th Annual ACM Symposium on Applied Computing, SAC 2020, Brno, Czech Republic, March 30 - April 3, 2020 2020

Abs Bib PDF

The data imbalance problem hampers the classification task. In streaming environments, this becomes even more cumbersome as the proportion of classes can vary over time. Approaches based on misclassification costs can be used to mitigate this problem. In this paper, we present the Cost-sensitive Adaptive Random Forest (CSARF) and compare it to the Adaptive Random Forest (ARF) and ARF with Resampling (ARF_RE) in six real-world and six synthetic data sets with different class ratios. The empirical study analyzes two misclassification cost strategies of the CSARF and shows that the CSARF obtained statistically superior results w.r.t. average recall and average F1 when compared to ARF.
@inproceedings{Loezer2020, author = {Loezer, Lucas and Enembreck, Fabricio and Barddal, Jean Paul and de Souza Britto Jr., Alceu}, title = {Cost-sensitive learning for imbalanced data streams}, booktitle = {Proceedings of the 35th Annual {ACM} Symposium on Applied Computing, {SAC} 2020, Brno, Czech Republic, March 30 - April 3, 2020}, year = {2020}, }
ANÁLISE PREDITIVA E DECISÕES JUDICIAIS: controvérsia ou realidade?

Cinthia Obladen Almendra Freitas, and Jean Paul Barddal

Revista Democracia Digital e Governo Eletrônico Jan 2020

Abs Bib PDF

In this paper, we provide an overview of how Data Analytics, Big Data, and Machine Learning may assist the judicial system by providing insightful information to citizens, police, lawyers, and judges, in a fast and accurate way. We conduct a bidirectional analysis between Law and Predictive Analytics applying the deductive method and bibliographic technique. We report concerns that Law should have with the application of computational techniques in different scenarios, mainly in the judicial system. Finally, we bring forward controversies between these areas, such as the new companies that target the use of personal and sensitive data in Law applications, and how these are potentially hurting fundamental rights and leading to biases in critical systems, such as predictive systems for crime recidivism.
@article{Obladen2019, url = {http://buscalegis.ufsc.br/revistas/index.php/observatoriodoegov/article/view/314}, year = {2020}, month = jan, volume = {1}, number = {18}, pages = {107--126}, author = {de Almendra Freitas, Cinthia Obladen and Barddal, Jean Paul}, title = {AN\'{A}LISE PREDITIVA E DECIS\~{O}ES JUDICIAIS: controvérsia ou realidade?}, journal = {Revista Democracia Digital e Governo Eletrônico}, }

2019

ACM SIGKDD
Machine learning for streaming data

Heitor Murilo Gomes, Jesse Read, Albert Bifet, Jean Paul Barddal, and João Gama

ACM SIGKDD Explorations Newsletter Nov 2019

Abs Bib PDF

Incremental learning, online learning, and data stream learning are terms commonly associated with learning algorithms that update their models given a continuous influx of data without performing multiple passes over data. Several works have been devoted to this area, either directly or indirectly as characteristics of big data processing, i.e., Velocity and Volume. Given the current industry needs, there are many challenges to be addressed before existing methods can be efficiently applied to real-world problems. In this work, we focus on elucidating the connections among the current state-of-the-art on related fields; and clarifying open challenges in both academia and industry. We treat with special care topics that were not thoroughly investigated in past position and survey papers. This work aims to evoke discussion and elucidate the current research opportunities, highlighting the relationship of different subareas and suggesting courses of action when possible.
@article{Gomes2019, doi = {10.1145/3373464.3373470}, url = {https://doi.org/10.1145/3373464.3373470}, year = {2019}, month = nov, publisher = {Association for Computing Machinery ({ACM})}, volume = {21}, number = {2}, pages = {6--22}, author = {Gomes, Heitor Murilo and Read, Jesse and Bifet, Albert and Barddal, Jean Paul and Gama, Jo{\~{a}}o}, title = {Machine learning for streaming data}, journal = {{ACM} {SIGKDD} Explorations Newsletter}, }
SIGAPP ACR
Addressing Feature Drift in Data Streams Using Iterative Subset Selection

Lanqin Yuan, Bernhard Pfahringer, and Jean Paul Barddal

SIGAPP Appl. Comput. Rev. Apr 2019

Abs Bib PDF

Data streams are prone to various forms of concept drift over time including, for instance, changes to the relevance of features. This specific kind of drift is known as feature drift and requires techniques tailored not only to determine which features are the most important but also to take advantage of them. Feature selection has been studied and shown to improve classifier performance in standard batch data mining, yet it is mostly unexplored in data stream mining. This paper presents a novel method of feature subset selection specialized for dealing with the occurrence of feature drifts called Iterative Subset Selection (ISS), which splits the feature selection process into two stages by first ranking the features using some scoring function, and then iteratively selecting feature subsets using this ranking. This work further extends upon our prior work by exploring feeding information from the subset selection stage back into the ranking process. Applying our method to the Naïve Bayes and k-Nearest Neighbour classifier, we obtain compelling accuracy improvements when compared to existing works.
@article{JOURNAL_FRANKIE, author = {Yuan, Lanqin and Pfahringer, Bernhard and Barddal, Jean Paul}, title = {Addressing Feature Drift in Data Streams Using Iterative Subset Selection}, journal = {SIGAPP Appl. Comput. Rev.}, issue_date = {March 2019}, volume = {19}, number = {1}, month = apr, year = {2019}, issn = {1559-6915}, pages = {20--33}, numpages = {14}, url = {http://doi.acm.org/10.1145/3325061.3325063}, doi = {10.1145/3325061.3325063}, acmid = {3325063}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {backward feature elimination, concept drift, data stream mining, embedded feature selection, feature selection, iterative subset selection}, }
IJCNN
Vertical and Horizontal Partitioning in Data Stream Regression Ensembles

Jean Paul Barddal

In 2019 International Joint Conference on Neural Networks, IJCNN 2019, Budapest, Hungary, July 14-19, 2019 Jul 2019

Abs Bib PDF

Data stream mining is an emerging topic in machine learning that targets the creation and update of predictive models over time as new data becomes available. Regarding existing works, classification is the most widely tackled task, which leaves regression nearly untouched. In this paper, the focus is on ensemble learning for data stream regression, more specifically on vertical and horizontal data partitioning techniques. The goal is to determine whether and under which conditions partitioning can lessen the error rates of different types of learners in the data stream regression task. The proposed method combines vertical and horizontal partitioning, and it is compared against different types of learners and existing ensembles.
@inproceedings{VHPRE, author = {Barddal, Jean Paul}, title = {Vertical and Horizontal Partitioning in Data Stream Regression Ensembles}, booktitle = {2019 International Joint Conference on Neural Networks, {IJCNN} 2019, Budapest, Hungary, July 14-19, 2019}, year = {2019}, }
SAC
Learning Regularized Hoeffding Trees from Data Streams

Jean Paul Barddal, and Fabricio Enembreck

In Proceedings of the 34th Annual ACM Symposium on Applied Computing, SAC 2019, Limassol, Cyprus, April 08-12, 2019 Apr 2019

Abs Bib PDF

Learning from data streams is a hot topic in machine learning that targets the learning and update of predictive models as data becomes available for both training and query. Due to their simplicity and convincing results in a multitude of applications, Hoeffding Trees are, by far, the most widely used family of methods for learning decision trees from streaming data. Despite the aforementioned positive characteristics, Hoeffding Trees tend to continuously grow in terms of nodes as new data becomes available, i.e., they eventually split on all features available, and multiple times on the same feature; thus leading to unnecessary complexity. With this behavior, Hoeffding Trees lose the ability to be human-understandable and computationally efficient. To tackle these issues, we propose a regularization scheme for Hoeffding Trees that (i) uses a penalty factor to control the gain obtained by creating a new split node using a feature that has not been used thus far; and (ii) uses information from previous splits in the current branch to determine whether the gain observed indeed justifies a new split. The proposed scheme is combined with both standard and adaptive variants of Hoeffding Trees. Experiments using real-world, stationary and drifting synthetic data show that the proposed method prevents both original and adaptive Hoeffding Trees from unnecessarily growing while maintaining impressive accuracy rates. As a byproduct of the regularization process, significant improvements in processing time, model complexity, and memory consumption have also been observed, thus showing the effectiveness of the proposed regularization scheme.
@inproceedings{REGULARIZED_HTS, author = {Barddal, Jean Paul and Enembreck, Fabricio}, title = {Learning Regularized Hoeffding Trees from Data Streams}, booktitle = {Proceedings of the 34th Annual {ACM} Symposium on Applied Computing, {SAC} 2019, Limassol, Cyprus, April 08-12, 2019}, year = {2019}, }
SAC
Decision tree-based Feature Ranking in Concept Drifting Data Streams

Jean Antonio Karax, Andreia Malucelli, and Jean Paul Barddal

In Proceedings of the 34th Annual ACM Symposium on Applied Computing, SAC 2019, Limassol, Cyprus, April 08-12, 2019 Apr 2019

Abs Bib PDF

Data stream mining targets the learning of predictive models that evolve over time according to changes in arriving data. Throughout the years, several approaches have been tailored to create and continuously update predictive models from these streams, and from these, Hoeffding Trees became a popular choice for learning decision trees from data streams. Quantifying and expressing the importance of features in dynamic scenarios is of the utmost importance, as it allows domain experts to back up, or invalidate, a predictive model. Therefore, we propose a positional gain method tailored for both individual Hoeffding Trees and ensembles of them, and assess how it behaves in both synthetic and real-world scenarios.
@inproceedings{KARAX_TREE_FEATURE_IMPORTANCE, author = {Karax, Jean Antonio and Malucelli, Andreia and Barddal, Jean Paul}, title = {Decision tree-based Feature Ranking in Concept Drifting Data Streams}, booktitle = {Proceedings of the 34th Annual {ACM} Symposium on Applied Computing, {SAC} 2019, Limassol, Cyprus, April 08-12, 2019}, year = {2019}, }
INFSYS
Boosting decision stumps for dynamic feature selection on data streams

Jean Paul Barddal, Fabrício Enembreck, Heitor Murilo Gomes, Albert Bifet, and Bernhard Pfahringer

Information Systems Apr 2019

Abs Bib PDF

Feature selection targets the identification of which features of a dataset are relevant to the learning task. It is also widely known and used to improve computation times, reduce computation requirements, decrease the impact of the curse of dimensionality, and enhance the generalization rates of classifiers. In data streams, classifiers shall benefit from all the items above, but more importantly, from the fact that the relevant subset of features may drift over time. In this paper, we propose a novel dynamic feature selection method for data streams called Adaptive Boosting for Feature Selection (ABFS). ABFS chains decision stumps and drift detectors, and as a result, identifies which features are relevant to the learning task as the stream progresses with reasonable success. In addition to our proposed algorithm, we bring feature selection-specific metrics from batch learning to streaming scenarios. Next, we evaluate ABFS according to these metrics in both synthetic and real-world scenarios. As a result, ABFS improves the classification rates of different types of learners and eventually enhances computational resources usage.
@article{BARDDAL201913, title = {Boosting decision stumps for dynamic feature selection on data streams}, journal = {Information Systems}, volume = {83}, pages = {13--29}, year = {2019}, issn = {0306-4379}, doi = {10.1016/j.is.2019.02.003}, url = {http://www.sciencedirect.com/science/article/pii/S0306437918303399}, author = {Barddal, Jean Paul and Enembreck, Fabrício and Gomes, Heitor Murilo and Bifet, Albert and Pfahringer, Bernhard}, keywords = {Data stream mining, Feature selection, Concept drift, Feature drift}, }
ESWA
Merit-guided dynamic feature selection filter for data streams

Jean Paul Barddal, Fabrício Enembreck, Heitor Murilo Gomes, Albert Bifet, and Bernhard Pfahringer

Expert Syst. Appl. Apr 2019

Abs Bib PDF

Learning from ephemeral data streams has garnered the interest of both researchers and practitioners towards adaptive learning techniques. Despite the convincing results obtained thus far, most of the current research still overlooks that the relevance of features may change throughout the learning process. Scenarios where features become, or cease to be, relevant to the learning task are called feature drifting data streams, and the identification of which features are relevant becomes even more challenging when the feature space is high-dimensional. To select relevant features during the progress of data streams, we propose a merit-guided and classifier-independent dynamic feature selection algorithm named DynamIc SymmetriCal Uncertainty Selection for Streams (DISCUSS). We evaluate our proposal on both synthetic and real-world datasets and show that DISCUSS can boost kNN and Naive Bayes classifiers’ accuracy rates on high-dimensional data streams at the expense of limited processing time and memory space. Finally, the drawbacks of the proposed method are assessed, and possible future works on the topic are also discussed.
@article{DBLP:journals/eswa/BarddalEGBP19, author = {Barddal, Jean Paul and Enembreck, Fabr{\'{\i}}cio and Gomes, Heitor Murilo and Bifet, Albert and Pfahringer, Bernhard}, title = {Merit-guided dynamic feature selection filter for data streams}, journal = {Expert Syst. Appl.}, volume = {116}, pages = {227--242}, year = {2019}, url = {https://doi.org/10.1016/j.eswa.2018.09.031}, }

2018

ESANN
Adaptive random forests for data stream regression

Heitor Murilo Gomes, Jean Paul Barddal, Luis Eduardo Boiko Ferreira, and Albert Bifet

In 26th European Symposium on Artificial Neural Networks, ESANN 2018, Bruges, Belgium, April 25-27, 2018 Apr 2018

Abs Bib PDF

Data stream mining is a hot topic in the machine learning community that tackles the problem of learning and updating predictive models as new data becomes available over time. Even though several new methods are proposed every year, most focus on the classification task and overlook the regression task. In this paper, we propose an adaptation to the Adaptive Random Forest so that it can handle regression tasks, namely ARF-Reg. ARF-Reg is empirically evaluated and compared to the state-of-the-art data stream regression algorithms, thus highlighting its applicability in different data stream scenarios.
@inproceedings{DBLP:conf/esann/GomesBFB18, author = {Gomes, Heitor Murilo and Barddal, Jean Paul and Ferreira, Luis Eduardo Boiko and Bifet, Albert}, title = {Adaptive random forests for data stream regression}, booktitle = {26th European Symposium on Artificial Neural Networks, {ESANN} 2018, Bruges, Belgium, April 25-27, 2018}, year = {2018}, crossref = {DBLP:conf/esann/2018}, timestamp = {Fri, 09 Nov 2018 12:29:56 +0100}, biburl = {https://dblp.org/rec/bib/conf/esann/GomesBFB18}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
IJCNN
An Experimental Perspective on Sampling Methods for Imbalanced Learning From Financial Databases

Luis Eduardo Boiko Ferreira, Jean Paul Barddal, Fabrício Enembreck, and Heitor Murilo Gomes

In 2018 International Joint Conference on Neural Networks, IJCNN 2018, Rio de Janeiro, Brazil, July 8-13, 2018 Jul 2018

Abs Bib PDF

The financial market is one of the major consumers of data mining techniques, and the main reason is their efficiency to analyze complex data. One important trait shared between most financial applications is class imbalance. Since traditional classification methods assume nearly balanced classes and equal misclassification costs, they usually fail to deal with imbalanced data. However, in financial contexts, problems are usually imbalanced, and instances from the minority class are known for deficits of millions of dollars every year, e.g., credit card frauds, money laundering transactions and so forth. Over the years, several techniques for dealing with class imbalance have been developed, such as sampling techniques and algorithm adaptations. In this study, we analyze how different sampling techniques impact the performance of different classification systems on financial applications. Results show that, for the given datasets, sampling techniques allow the improvement of prediction performance of the minority class while also improving overall classification rates. Nevertheless, their use often deteriorates the performance in predicting the majority class.
@inproceedings{DBLP:conf/ijcnn/FerreiraBEG18, author = {Ferreira, Luis Eduardo Boiko and Barddal, Jean Paul and Enembreck, Fabr{\'{\i}}cio and Gomes, Heitor Murilo}, title = {An Experimental Perspective on Sampling Methods for Imbalanced Learning From Financial Databases}, booktitle = {2018 International Joint Conference on Neural Networks, {IJCNN} 2018, Rio de Janeiro, Brazil, July 8-13, 2018}, pages = {1--6}, year = {2018}, crossref = {DBLP:conf/ijcnn/2018}, url = {https://doi.org/10.1109/IJCNN.2018.8489290}, doi = {10.1109/IJCNN.2018.8489290}, timestamp = {Mon, 22 Oct 2018 13:07:32 +0200}, biburl = {https://dblp.org/rec/bib/conf/ijcnn/FerreiraBEG18}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
INDIN
Are fintechs really a hype? A machine learning-based polarity analysis of Brazilian posts on social media

Marina Ponestke Seara, Andreia Malucelli, Altair Olivo Santin, and Jean Paul Barddal

In 16th IEEE International Conference on Industrial Informatics, INDIN 2018, Porto, Portugal, July 18-20, 2018 Jul 2018

Abs Bib PDF

Fintechs are technology companies that, in contrast to traditional banks, are engaged in digital solutions for payment, money transfers, and real-time notifications. Taking advantage of digital means of communication, most of the service interactions between fintechs and customers occur via chats or posts in social media. In this work, our goal is to use machine learning to analyze these posts and identify the terms used by customers to express positive, neutral and negative customer experiences. During this analysis, we assess the following questions using data from the 3 biggest fintechs in Brazil: (i) what are the most commented topics on social media regarding fintechs, (ii) what are the words more often used by customers to express positive, negative and neutral reactions to the customer service obtained; and (iii) what kind of machine learning model should a fintech use to automatically identify whether a post is positive, negative or neutral.
@inproceedings{DBLP:conf/indin/SearaMSB18, author = {Seara, Marina Ponestke and Malucelli, Andreia and Santin, Altair Olivo and Barddal, Jean Paul}, title = {Are fintechs really a hype? {A} machine learning-based polarity analysis of Brazilian posts on social media}, booktitle = {16th {IEEE} International Conference on Industrial Informatics, {INDIN} 2018, Porto, Portugal, July 18-20, 2018}, pages = {233--238}, year = {2018}, crossref = {DBLP:conf/indin/2018}, url = {https://doi.org/10.1109/INDIN.2018.8471986}, doi = {10.1109/INDIN.2018.8471986}, timestamp = {Fri, 12 Oct 2018 10:26:41 +0200}, biburl = {https://dblp.org/rec/bib/conf/indin/SearaMSB18}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
SAC
Iterative subset selection for feature drifting data streams

Lanqin Yuan, Bernhard Pfahringer, and Jean Paul Barddal

In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, SAC 2018, Pau, France, April 09-13, 2018 Apr 2018

Abs Bib PDF

Feature selection has been studied and shown to improve classifier performance in standard batch data mining but is mostly unexplored in data stream mining. Feature selection becomes even more important when the relevant subset of features changes over time, as the underlying concept of a data stream drifts. This specific kind of drift is known as feature drift and requires specific techniques not only to determine which features are the most important but also to take advantage of them. This paper presents a novel method of feature subset selection specialized for dealing with the occurrence of feature drifts called Iterative Subset Selection (ISS), which splits the feature selection process into two stages by first ranking the features, and then iteratively selecting features from the ranking. Applying our feature selection method together with Naive Bayes or k-Nearest Neighbour as a classifier, results in compelling accuracy improvements, compared to prior work.
@inproceedings{DBLP:conf/sac/YuanPB18, author = {Yuan, Lanqin and Pfahringer, Bernhard and Barddal, Jean Paul}, title = {Iterative subset selection for feature drifting data streams}, booktitle = {Proceedings of the 33rd Annual {ACM} Symposium on Applied Computing, {SAC} 2018, Pau, France, April 09-13, 2018}, pages = {510--517}, year = {2018}, crossref = {DBLP:conf/sac/2018}, url = {https://doi.org/10.1145/3167132.3167188}, doi = {10.1145/3167132.3167188}, timestamp = {Wed, 21 Nov 2018 12:43:56 +0100}, biburl = {https://dblp.org/rec/bib/conf/sac/YuanPB18}, bibsource = {dblp computer science bibliography, https://dblp.org}, }

2017

ACM CSUR
A Survey on Ensemble Learning for Data Stream Classification

Heitor Murilo Gomes, Jean Paul Barddal, Fabrício Enembreck, and Albert Bifet

ACM Comput. Surv. Apr 2017

Abs Bib PDF

Ensemble-based methods are among the most widely used techniques for data stream classification. Their popularity is attributable to their good performance in comparison to strong single learners while being relatively easy to deploy in real-world applications. Ensemble algorithms are especially useful for data stream learning as they can be integrated with drift detection algorithms and incorporate dynamic updates, such as selective removal or addition of classifiers. This work proposes a taxonomy for data stream ensemble learning as derived from reviewing over 60 algorithms. Important aspects such as combination, diversity, and dynamic updates, are thoroughly discussed. Additional contributions include a listing of popular open-source tools and a discussion about current data stream research challenges and how they relate to ensemble learning (big data streams, concept evolution, feature drifts, temporal dependencies, and others).
@article{DBLP:journals/csur/GomesBEB17, author = {Gomes, Heitor Murilo and Barddal, Jean Paul and Enembreck, Fabr{\'{\i}}cio and Bifet, Albert}, title = {A Survey on Ensemble Learning for Data Stream Classification}, journal = {{ACM} Comput. Surv.}, volume = {50}, number = {2}, pages = {23:1--23:36}, year = {2017}, url = {https://doi.org/10.1145/3054925}, doi = {10.1145/3054925}, timestamp = {Fri, 30 Nov 2018 12:48:46 +0100}, biburl = {https://dblp.org/rec/bib/journals/csur/GomesBEB17}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
JSS
A survey on feature drift adaptation: Definition, benchmark, challenges and future directions

Jean Paul Barddal, Heitor Murilo Gomes, Fabrício Enembreck, and Bernhard Pfahringer

Journal of Systems and Software Apr 2017

Abs Bib PDF

Data stream mining is a fast growing research topic due to the ubiquity of data in several real-world problems. Given their ephemeral nature, data stream sources are expected to undergo changes in data distribution, a phenomenon called concept drift. This paper focuses on one specific type of drift that has not yet been thoroughly studied, namely feature drift. Feature drift occurs whenever a subset of features becomes, or ceases to be, relevant to the learning task; thus, learners must detect and adapt to these changes accordingly. We survey existing work on feature drift adaptation with both explicit and implicit approaches. Additionally, we benchmark several algorithms and a naive feature drift detection approach using synthetic and real-world datasets. The results from our experiments indicate the need for future research in this area as even naive approaches produced gains in accuracy while reducing resources usage. Finally, we state current research topics, challenges and future directions for feature drift adaptation.
@article{DBLP:journals/jss/BarddalGEP17, author = {Barddal, Jean Paul and Gomes, Heitor Murilo and Enembreck, Fabr{\'{\i}}cio and Pfahringer, Bernhard}, title = {A survey on feature drift adaptation: Definition, benchmark, challenges and future directions}, journal = {Journal of Systems and Software}, volume = {127}, pages = {278--294}, year = {2017}, url = {https://doi.org/10.1016/j.jss.2016.07.005}, doi = {10.1016/j.jss.2016.07.005}, timestamp = {Tue, 06 Jun 2017 22:24:00 +0200}, biburl = {https://dblp.org/rec/bib/journals/jss/BarddalGEP17}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
ML
Adaptive random forests for evolving data stream classification

Heitor Murilo Gomes, Albert Bifet, Jesse Read, Jean Paul Barddal, Fabrício Enembreck, Bernhard Pfahringer, Geoff Holmes, and Talel Abdessalem

Machine Learning Apr 2017

Abs Bib PDF

Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts of replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF which has no degradation in terms of classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.
@article{DBLP:journals/ml/GomesBRBEPHA17, author = {Gomes, Heitor Murilo and Bifet, Albert and Read, Jesse and Barddal, Jean Paul and Enembreck, Fabr{\'{\i}}cio and Pfharinger, Bernhard and Holmes, Geoff and Abdessalem, Talel}, title = {Adaptive random forests for evolving data stream classification}, journal = {Machine Learning}, volume = {106}, number = {9-10}, pages = {1469--1495}, year = {2017}, url = {https://doi.org/10.1007/s10994-017-5642-8}, doi = {10.1007/s10994-017-5642-8}, timestamp = {Tue, 26 Jun 2018 14:09:25 +0200}, biburl = {https://dblp.org/rec/bib/journals/ml/GomesBRBEPHA17}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
ICTAI
Improving Credit Risk Prediction in Online Peer-to-Peer (P2P) Lending Using Imbalanced Learning Techniques

Luis Eduardo Boiko Ferreira, Jean Paul Barddal, Heitor Murilo Gomes, and Fabrício Enembreck

In 29th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2017, Boston, MA, USA, November 6-8, 2017 Nov 2017

Abs Bib PDF

Peer-to-peer (P2P) lending is a global trend of financial markets that allows individuals to obtain and concede loans without having financial institutions as a strong proxy. Like many real-world applications, P2P lending presents an imbalanced characteristic, where the number of creditworthy loan requests is much larger than the number of non-creditworthy ones. In this work, we wrangle a real-world P2P lending data set from Lending Club, containing a large amount of data gathered from 2007 up to 2016. We analyze how supervised classification models and techniques to handle class imbalance impact creditworthiness prediction rates. Ensembles, cost-sensitive and sampling methods are combined and evaluated alongside logistic regression, decision tree, and Bayesian learning schemes. Results show that, on average, sampling techniques outperform ensembles and cost-sensitive approaches.
@inproceedings{DBLP:conf/ictai/FerreiraBGE17, author = {Ferreira, Luis Eduardo Boiko and Barddal, Jean Paul and Gomes, Heitor Murilo and Enembreck, Fabr{\'{\i}}cio}, title = {Improving Credit Risk Prediction in Online Peer-to-Peer {(P2P)} Lending Using Imbalanced Learning Techniques}, booktitle = {29th {IEEE} International Conference on Tools with Artificial Intelligence, {ICTAI} 2017, Boston, MA, USA, November 6-8, 2017}, pages = {175--181}, year = {2017}, crossref = {DBLP:conf/ictai/2017}, url = {https://doi.org/10.1109/ICTAI.2017.00037}, doi = {10.1109/ICTAI.2017.00037}, timestamp = {Tue, 31 Jul 2018 12:20:29 +0200}, biburl = {https://dblp.org/rec/bib/conf/ictai/FerreiraBGE17}, bibsource = {dblp computer science bibliography, https://dblp.org}, }

2016

INFSYS
SNCStream+: Extending a high quality true anytime data stream clustering algorithm

Jean Paul Barddal, Heitor Murilo Gomes, Fabrício Enembreck, and Jean-Paul A. Barthès

Inf. Syst. Apr 2016

Abs Bib PDF

Data Stream Clustering is an active area of research which requires efficient algorithms capable of finding and updating clusters incrementally as data arrives. On top of that, due to the inherent evolving nature of data streams, it is expected that algorithms undergo both concept drifts and evolutions, which must be taken into account by the clustering algorithm, allowing incremental clustering updates. In this paper we present the Social Network Clusterer Stream+ (SNCStream+). SNCStream+ tackles the data stream clustering problem as a network formation and evolution problem, where instances and micro-clusters form clusters based on homophily. Our proposal has its parameters analyzed and it is evaluated in a broad set of problems against literature baselines. Results show that SNCStream+ achieves superior clustering quality (CMM), and feasible processing time and memory space usage when compared to the original SNCStream and other proposals of the literature.
@article{DBLP:journals/is/BarddalGEB16, author = {Barddal, Jean Paul and Gomes, Heitor Murilo and Enembreck, Fabr{\'{\i}}cio and Barth{\`{e}}s, Jean{-}Paul A.}, title = {SNCStream+: Extending a high quality true anytime data stream clustering algorithm}, journal = {Inf. Syst.}, volume = {62}, pages = {60--73}, year = {2016}, url = {https://doi.org/10.1016/j.is.2016.06.007}, doi = {10.1016/j.is.2016.06.007}, timestamp = {Tue, 06 Jun 2017 22:22:00 +0200}, biburl = {https://dblp.org/rec/bib/journals/is/BarddalGEB16}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
ICPR
A benchmark of classifiers on feature drifting data streams

Jean Paul Barddal, Heitor Murilo Gomes, Alceu Souza Britto Jr., and Fabrício Enembreck

In 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016 Dec 2016

Abs Bib PDF

The ever-increasing data generation confronts both practitioners and researchers with handling massive and sequentially generated amounts of information, the so-called data streams. In this context, a lot of effort has been put into the extraction of useful patterns from streaming scenarios. Learning from data streams embeds a variety of problems, and by far, the most challenging is concept drift, i.e., changes in data distribution. In this paper, we focus on a specific type of drift uncommonly assessed in the literature: feature drifts. Feature drifts occur whenever a subset of features becomes, or ceases to be, relevant to the concept to be learned. We propose and review several feature drifting data stream generators and use them to benchmark state-of-the-art data stream classification algorithms and their combination with drift detectors. Results show that, although drift detectors enable slightly quicker recovery from feature drifts, the best results are obtained by Hoeffding Adaptive Tree, the only learner that performs dynamic feature selection as streams progress.
@inproceedings{DBLP:conf/icpr/BarddalGBE16, author = {Barddal, Jean Paul and Gomes, Heitor Murilo and de Souza Britto Jr., Alceu and Enembreck, Fabr{\'{\i}}cio}, title = {A benchmark of classifiers on feature drifting data streams}, booktitle = {23rd International Conference on Pattern Recognition, {ICPR} 2016, Canc{\'{u}}n, Mexico, December 4-8, 2016}, pages = {2180--2185}, year = {2016}, crossref = {DBLP:conf/icpr/2016}, url = {https://doi.org/10.1109/ICPR.2016.7899959}, doi = {10.1109/ICPR.2016.7899959}, timestamp = {Wed, 24 May 2017 08:30:42 +0200}, biburl = {https://dblp.org/rec/bib/conf/icpr/BarddalGBE16}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
ICPR
Overcoming feature drifts via dynamic feature weighted k-nearest neighbor learning

Jean Paul Barddal, Heitor Murilo Gomes, Jones Granatyr, Alceu Souza Britto Jr., and Fabrício Enembreck

In 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016 Dec 2016

Abs Bib PDF

Extracting useful knowledge from data streams is problematic, mainly due to changes in their data distribution, a phenomenon named concept drift. Recently, studies have shown that most existing algorithms for learning from data streams do not encompass techniques for a specific kind of drift: feature drifts. Feature drifts occur when features become, or cease to be, relevant to the learning task. In this paper, we propose an extension to the k-nearest neighbor classifier in which features are weighted in the distance computations according to their current discriminative power. In our proposal, the discriminative power of features is given by entropy, which is swiftly computed over a sliding window. Empirical evidence shows that our approach is able to overcome several existing algorithms in accuracy and feature drift adaptation, at the expense of bounded processing time and memory space.
@inproceedings{DBLP:conf/icpr/BarddalGGBE16, author = {Barddal, Jean Paul and Gomes, Heitor Murilo and Granatyr, Jones and de Souza Britto Jr., Alceu and Enembreck, Fabr{\'{\i}}cio}, title = {Overcoming feature drifts via dynamic feature weighted k-nearest neighbor learning}, booktitle = {23rd International Conference on Pattern Recognition, {ICPR} 2016, Canc{\'{u}}n, Mexico, December 4-8, 2016}, pages = {2186--2191}, year = {2016}, crossref = {DBLP:conf/icpr/2016}, url = {https://doi.org/10.1109/ICPR.2016.7899960}, doi = {10.1109/ICPR.2016.7899960}, timestamp = {Wed, 24 May 2017 08:30:48 +0200}, biburl = {https://dblp.org/rec/bib/conf/icpr/BarddalGGBE16}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
IJCNN
Towards emotion-based reputation guessing learning agents

Jones Granatyr, Jean Paul Barddal, Adriano Weihmayer Almeida, Fabrício Enembreck, and Adaiane Pereira Santos Granatyr

In 2016 International Joint Conference on Neural Networks, IJCNN 2016, Vancouver, BC, Canada, July 24-29, 2016

Trust and reputation mechanisms are part of the logical protection of intelligent agents, preventing malicious agents from acting egotistically or with the intention to damage others. Several studies in Psychology, Neurology and Anthropology claim that emotions are part of the human decision-making process. However, there is a lack of understanding about how affective aspects, such as emotions, influence trust or reputation levels of intelligent agents when they are inserted into an information exchange environment, e.g., an evaluation system. In this paper we propose a reputation model that accounts for emotional bounds given by Ekman’s basic emotions and inductive machine learning. Our proposal is evaluated by extracting emotions from texts provided by two online human-fed evaluation systems. Empirical results show significant improvements in agent utility with p < .05 when compared to non-emotion-wise proposals, thus showing the need for future research in this area.
@inproceedings{DBLP:conf/ijcnn/GranatyrBAEG16, author = {Granatyr, Jones and Barddal, Jean Paul and Almeida, Adriano Weihmayer and Enembreck, Fabr{\'{\i}}cio and dos Santos Granatyr, Adaiane Pereira}, title = {Towards emotion-based reputation guessing learning agents}, booktitle = {2016 International Joint Conference on Neural Networks, {IJCNN} 2016, Vancouver, BC, Canada, July 24-29, 2016}, pages = {3801--3808}, year = {2016}, crossref = {DBLP:conf/ijcnn/2016}, url = {https://doi.org/10.1109/IJCNN.2016.7727690}, doi = {10.1109/IJCNN.2016.7727690}, timestamp = {Fri, 26 May 2017 00:50:11 +0200}, biburl = {https://dblp.org/rec/bib/conf/ijcnn/GranatyrBAEG16}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
ECML PKDD
On Dynamic Feature Weighting for Feature Drifting Data Streams

Jean Paul Barddal, Heitor Murilo Gomes, Fabrício Enembreck, Bernhard Pfahringer, and Albert Bifet

In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part II

The ubiquity of data streams has been encouraging the development of new incremental and adaptive learning algorithms. Data stream learners must be fast, memory-bounded, but mainly, tailored to adapt to possible changes in the data distribution, a phenomenon named concept drift. Recently, several works have shown the impact of a so far nearly neglected type of drift: feature drifts. Feature drifts occur whenever a subset of features becomes, or ceases to be, relevant to the learning task. In this paper we (i) provide insights into how the relevance of features can be tracked as a stream progresses according to information theoretical Symmetrical Uncertainty; and (ii) how it can be used to boost two learning schemes: Naive Bayesian and k-Nearest Neighbor. Furthermore, we investigate the usage of these two new dynamically weighted learners as prediction models in the leaves of the Hoeffding Adaptive Tree classifier. Results show improvements in accuracy (an average of 10.69% for k-Nearest Neighbor, 6.23% for Naive Bayes and 4.42% for Hoeffding Adaptive Trees) in both synthetic and real-world datasets at the expense of a bounded increase in both memory consumption and processing time.
@inproceedings{DBLP:conf/pkdd/BarddalGEPB16, author = {Barddal, Jean Paul and Gomes, Heitor Murilo and Enembreck, Fabr{\'{\i}}cio and Pfahringer, Bernhard and Bifet, Albert}, title = {On Dynamic Feature Weighting for Feature Drifting Data Streams}, booktitle = {Machine Learning and Knowledge Discovery in Databases - European Conference, {ECML} {PKDD} 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part {II}}, pages = {129--144}, year = {2016}, crossref = {DBLP:conf/pkdd/2016-2}, url = {https://doi.org/10.1007/978-3-319-46227-1\_9}, doi = {10.1007/978-3-319-46227-1\_9}, timestamp = {Thu, 15 Jun 2017 21:40:02 +0200}, biburl = {https://dblp.org/rec/bib/conf/pkdd/BarddalGEPB16}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
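The abstract above names Symmetrical Uncertainty as the relevance measure used to weight features. As a rough illustration only, the sketch below computes the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)) for discrete values and plugs the scores into a weighted Euclidean distance. The function names, the use of a plain list as the "window", and the way weights enter the distance are assumptions made for this example; the abstract does not specify those details, so this should not be read as the paper's implementation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    h_x, h_y = entropy(xs), entropy(ys)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant: nothing to share
    h_xy = entropy(list(zip(xs, ys)))  # joint entropy H(X, Y)
    mutual_information = h_x + h_y - h_xy
    return 2.0 * mutual_information / (h_x + h_y)


def weighted_distance(a, b, weights):
    """Euclidean distance with each squared difference scaled by a weight.

    Assumption for illustration: weights multiply the squared differences.
    """
    return sum(w * (p - q) ** 2 for w, p, q in zip(weights, a, b)) ** 0.5


# Hypothetical window of four labelled instances with two discrete features.
labels = ["a", "a", "b", "b"]
relevant = [0, 0, 1, 1]      # mirrors the label exactly
irrelevant = [0, 1, 0, 1]    # independent of the label

weights = [
    symmetrical_uncertainty(relevant, labels),
    symmetrical_uncertainty(irrelevant, labels),
]
```

In a streaming setting, the weights would be recomputed as the window slides, so a feature that stops being relevant gradually loses its influence on the distance.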

2015

IJNCR
Advances on Concept Drift Detection in Regression Tasks Using Social Networks Theory

Jean Paul Barddal, Heitor Murilo Gomes, and Fabrício Enembreck

IJNCR Apr 2015

Mining data streams is one of the main research topics in machine learning due to its applications in many knowledge areas. One of the major challenges in mining data streams is concept drift, which requires the learner to discard the current concept and adapt to a new one. Ensemble-based drift detection algorithms have been used successfully for the classification task but usually maintain a fixed-size ensemble of learners, running the risk of needlessly spending processing time and memory. In this paper the authors present improvements to the Scale-free Network Regressor (SFNR), a dynamic ensemble-based method for regression that employs social networks theory. In order to detect concept drifts, SFNR uses the Adaptive Window (ADWIN) algorithm. Results show improvements in accuracy, especially in concept drift situations, and better performance compared to other state-of-the-art algorithms in both real and synthetic data.
@article{DBLP:journals/ijncr/BarddalGE15, author = {Barddal, Jean Paul and Gomes, Heitor Murilo and Enembreck, Fabr{\'{\i}}cio}, title = {Advances on Concept Drift Detection in Regression Tasks Using Social Networks Theory}, journal = {{IJNCR}}, volume = {5}, number = {1}, pages = {26--41}, year = {2015}, url = {https://doi.org/10.4018/ijncr.2015010102}, doi = {10.4018/ijncr.2015010102}, timestamp = {Sun, 28 May 2017 13:18:05 +0200}, biburl = {https://dblp.org/rec/bib/journals/ijncr/BarddalGE15}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
ICEIS
Applying Ensemble-based Online Learning Techniques on Crime Forecasting

Anderson José Souza, André Pinz Borges, Heitor Murilo Gomes, Jean Paul Barddal, and Fabrício Enembreck

In ICEIS 2015 - Proceedings of the 17th International Conference on Enterprise Information Systems, Volume 1, Barcelona, Spain, 27-30 April, 2015

Traditional prediction algorithms assume that the underlying concept is stationary, i.e., no changes are expected to happen during the deployment of an algorithm that would render it obsolete. However, in many real-world scenarios, changes in the data distribution, namely concept drifts, are expected to occur due to variations in the hidden context, e.g., new government regulations, climatic changes, or adversary adaptation. In this paper, we analyze the problem of predicting the most susceptible types of victims of crimes that occurred in a large city in Brazil. It is expected that criminals change their victims’ types to counter police methods and vice-versa. Therefore, the challenge is to obtain a model capable of adapting rapidly to the current preferred criminal victims, such that police resources can be allocated accordingly. In this type of problem the most appropriate learning models are provided by data stream mining, since the learning algorithms from this domain assume that concept drifts may occur over time, and are ready to adapt to them. In this paper we apply ensemble-based data stream methods, since they provide good accuracy and the ability to adapt to concept drifts. Results show that these ensemble-based algorithms (Leveraging Bagging, SFNClassifier, ADWIN Bagging and Online Bagging) reach feasible accuracy for this task.
@inproceedings{DBLP:conf/iceis/SouzaBGBE15, author = {de Souza, Anderson Jos{\'{e}} and Borges, Andr{\'{e}} Pinz and Gomes, Heitor Murilo and Barddal, Jean Paul and Enembreck, Fabr{\'{\i}}cio}, title = {Applying Ensemble-based Online Learning Techniques on Crime Forecasting}, booktitle = {{ICEIS} 2015 - Proceedings of the 17th International Conference on Enterprise Information Systems, Volume 1, Barcelona, Spain, 27-30 April, 2015}, pages = {17--24}, year = {2015}, crossref = {DBLP:conf/iceis/2015-1}, url = {https://doi.org/10.5220/0005335700170024}, doi = {10.5220/0005335700170024}, timestamp = {Wed, 17 May 2017 10:54:49 +0200}, biburl = {https://dblp.org/rec/bib/conf/iceis/SouzaBGBE15}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
ICONIP
Analyzing the Impact of Feature Drifts in Streaming Learning

Jean Paul Barddal, Heitor Murilo Gomes, and Fabrício Enembreck

In Neural Information Processing - 22nd International Conference, ICONIP 2015, Istanbul, Turkey, November 9-12, 2015, Proceedings, Part I

Learning from data streams requires efficient algorithms capable of deriving a model according to the arrival of new instances. Data streams are by definition unbounded sequences of data that are possibly non-stationary, i.e., they may undergo changes in data distribution, a phenomenon named concept drift. Concept drifts force streaming learning algorithms to detect and adapt to such changes in order to present feasible accuracy throughout time. Nonetheless, most works presented in the literature do not account for a specific kind of drift: feature drifts. Feature drifts occur whenever the relevance of an arbitrary attribute changes through time, also impacting the concept to be learned. In this paper we (i) verify the occurrence of feature drift in a publicly available dataset, (ii) present a synthetic data stream generator capable of performing feature drifts and (iii) analyze the impact of this type of drift in stream learning algorithms, highlighting that there is room and the need for dynamic feature selection strategies for data streams.
@inproceedings{DBLP:conf/iconip/BarddalGE15, author = {Barddal, Jean Paul and Gomes, Heitor Murilo and Enembreck, Fabr{\'{\i}}cio}, title = {Analyzing the Impact of Feature Drifts in Streaming Learning}, booktitle = {Neural Information Processing - 22nd International Conference, {ICONIP} 2015, Istanbul, Turkey, November 9-12, 2015, Proceedings, Part {I}}, pages = {21--28}, year = {2015}, crossref = {DBLP:conf/iconip/2015-1}, url = {https://doi.org/10.1007/978-3-319-26532-2\_3}, doi = {10.1007/978-3-319-26532-2\_3}, timestamp = {Sun, 08 Jul 2018 23:29:36 +0200}, biburl = {https://dblp.org/rec/bib/conf/iconip/BarddalGE15}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
ICONIP
On the Discovery of Time Distance Constrained Temporal Association Rules

Heitor Murilo Gomes, Deborah Ribeiro Carvalho, Lourdes Zubieta, Jean Paul Barddal, and Andreia Malucelli

In Neural Information Processing - 22nd International Conference, ICONIP 2015, Istanbul, Turkey, November 9-12, 2015, Proceedings, Part II

The increased use of data mining algorithms reflects the need for automatic extraction of knowledge from large volumes of data. This work presents a temporal data mining algorithm that discovers frequent Association Rules from timestamped data. These rules are named Cause-Effect Rules, each represented by a multiset of unordered events (Cause) followed by a singleton event (Effect). Also, a Cause-Effect Rule is valid within a specific constraint that defines the minimum and maximum time distance between its Cause and Effect. Our algorithm was tested on a data set from two hospital emergency departments in Sherbrooke, QC, Canada.
@inproceedings{DBLP:conf/iconip/GomesCZBM15, author = {Gomes, Heitor Murilo and de Carvalho, Deborah Ribeiro and Zubieta, Lourdes and Barddal, Jean Paul and Malucelli, Andreia}, title = {On the Discovery of Time Distance Constrained Temporal Association Rules}, booktitle = {Neural Information Processing - 22nd International Conference, {ICONIP} 2015, Istanbul, Turkey, November 9-12, 2015, Proceedings, Part {II}}, pages = {510--519}, year = {2015}, crossref = {DBLP:conf/iconip/2015-2}, url = {https://doi.org/10.1007/978-3-319-26535-3\_58}, doi = {10.1007/978-3-319-26535-3\_58}, timestamp = {Sun, 08 Jul 2018 23:29:36 +0200}, biburl = {https://dblp.org/rec/bib/conf/iconip/GomesCZBM15}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
ICONIP
A Complex Network-Based Anytime Data Stream Clustering Algorithm

Jean Paul Barddal, Heitor Murilo Gomes, and Fabrício Enembreck

In Neural Information Processing - 22nd International Conference, ICONIP 2015, Istanbul, Turkey, November 9-12, 2015, Proceedings, Part I

Data stream mining is an active area of research that poses challenging research problems. In recent years, a variety of data stream clustering algorithms have been proposed to perform unsupervised learning using a two-step framework. Additionally, dealing with non-stationary, unbounded data streams requires the development of algorithms capable of performing fast and incremental clustering, addressing time and memory limitations without jeopardizing clustering quality. In this paper we present CNDenStream, a one-step data stream clustering algorithm capable of finding non-hyper-spherical clusters which, in contrast to other data stream clustering algorithms, is able to maintain updated clusters after the arrival of each instance by using a complex network construction and evolution model based on homophily. Empirical studies show that CNDenStream is able to surpass other algorithms in clustering quality and requires a feasible amount of resources when compared to other algorithms presented in the literature.
@inproceedings{DBLP:conf/iconip/BarddalGE15a, author = {Barddal, Jean Paul and Gomes, Heitor Murilo and Enembreck, Fabr{\'{\i}}cio}, title = {A Complex Network-Based Anytime Data Stream Clustering Algorithm}, booktitle = {Neural Information Processing - 22nd International Conference, {ICONIP} 2015, Istanbul, Turkey, November 9-12, 2015, Proceedings, Part {I}}, pages = {615--622}, year = {2015}, crossref = {DBLP:conf/iconip/2015-1}, url = {https://doi.org/10.1007/978-3-319-26532-2\_68}, doi = {10.1007/978-3-319-26532-2\_68}, timestamp = {Sun, 08 Jul 2018 23:29:36 +0200}, biburl = {https://dblp.org/rec/bib/conf/iconip/BarddalGE15a}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
ICTAI
A Survey on Feature Drift Adaptation

Jean Paul Barddal, Heitor Murilo Gomes, and Fabrício Enembreck

In 27th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2015, Vietri sul Mare, Italy, November 9-11, 2015

Data stream mining is a fast-growing research topic due to the ubiquity of data in several real-world problems. Given their ephemeral nature, data stream sources are expected to undergo changes in data distribution, a phenomenon called concept drift. This paper focuses on one specific type of drift that has not yet been thoroughly studied, namely feature drift. Feature drift occurs whenever a subset of features becomes, or ceases to be, relevant to the learning task; thus, learners must detect and adapt to these changes accordingly. We survey existing work on feature drift adaptation with both explicit and implicit approaches. Additionally, we benchmark several algorithms and a naive feature drift detection approach using synthetic and real-world datasets. The results from our experiments indicate the need for future research in this area, as even naive approaches produced gains in accuracy while reducing resource usage. Finally, we state current research topics, challenges and future directions for feature drift adaptation.
@inproceedings{DBLP:conf/ictai/BarddalGE15, author = {Barddal, Jean Paul and Gomes, Heitor Murilo and Enembreck, Fabr{\'{\i}}cio}, title = {A Survey on Feature Drift Adaptation}, booktitle = {27th {IEEE} International Conference on Tools with Artificial Intelligence, {ICTAI} 2015, Vietri sul Mare, Italy, November 9-11, 2015}, pages = {1053--1060}, year = {2015}, crossref = {DBLP:conf/ictai/2015}, url = {https://doi.org/10.1109/ICTAI.2015.150}, doi = {10.1109/ICTAI.2015.150}, timestamp = {Fri, 26 May 2017 00:51:05 +0200}, biburl = {https://dblp.org/rec/bib/conf/ictai/BarddalGE15}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
SAC
SNCStream: a social network-based data stream clustering algorithm

Jean Paul Barddal, Heitor Murilo Gomes, and Fabrício Enembreck

In Proceedings of the 30th Annual ACM Symposium on Applied Computing, Salamanca, Spain, April 13-17, 2015

Data Stream Clustering is an active area of research which requires efficient algorithms capable of finding and updating clusters incrementally. On top of that, due to the inherent evolving nature of data streams, it is expected that these algorithms manage to quickly adapt to both concept drifts and the appearance and disappearance of clusters. Nevertheless, many of the developed two-step algorithms are only capable of finding hyper-spherical clusters and are highly dependent on parametrization. In this paper we introduce SNCStream, a one-step online clustering algorithm based on Social Networks Theory, which uses homophily to find non-hyper-spherical clusters. Our empirical studies show that SNCStream is able to surpass density-based algorithms in cluster quality and requires a feasible amount of resources (time and memory) when compared to other algorithms.
@inproceedings{DBLP:conf/sac/BarddalGE15, author = {Barddal, Jean Paul and Gomes, Heitor Murilo and Enembreck, Fabr{\'{\i}}cio}, title = {SNCStream: a social network-based data stream clustering algorithm}, booktitle = {Proceedings of the 30th Annual {ACM} Symposium on Applied Computing, Salamanca, Spain, April 13-17, 2015}, pages = {935--940}, year = {2015}, crossref = {DBLP:conf/sac/2015}, url = {https://doi.org/10.1145/2695664.2695674}, doi = {10.1145/2695664.2695674}, timestamp = {Tue, 06 Nov 2018 11:06:47 +0100}, biburl = {https://dblp.org/rec/bib/conf/sac/BarddalGE15}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
SAC
Pairwise combination of classifiers for ensemble learning on data streams

Heitor Murilo Gomes, Jean Paul Barddal, and Fabrício Enembreck

In Proceedings of the 30th Annual ACM Symposium on Applied Computing, Salamanca, Spain, April 13-17, 2015

This work presents two different voting strategies for ensemble learning on data streams based on pairwise combination of component classifiers. Despite efforts to build a diverse ensemble, there is always some degree of overlap between the component classifiers’ models. Our voting strategies are aimed at using these overlaps to support ensemble prediction. We hypothesize that by combining pairs of classifiers it is possible to alleviate incorrect individual predictions that would otherwise negatively impact the overall ensemble decision. The first strategy, Pairwise Accuracy (PA), combines the shared accuracy estimation of all possible pairs in the ensemble, while the second strategy, Pairwise Patterns (PP), records patterns of pairwise decisions during training and uses these patterns during prediction. We present empirical results comparing ensemble classifiers with their original voting methods and our proposed methods in both real and synthetic datasets, with and without concept drifts. Our analysis indicates that pairwise voting is able to enhance overall performance for PP, especially on real datasets, and that PA is useful whenever there are noticeable differences in accuracy estimates among ensemble members, which is common during concept drifts.
@inproceedings{DBLP:conf/sac/GomesBE15, author = {Gomes, Heitor Murilo and Barddal, Jean Paul and Enembreck, Fabr{\'{\i}}cio}, title = {Pairwise combination of classifiers for ensemble learning on data streams}, booktitle = {Proceedings of the 30th Annual {ACM} Symposium on Applied Computing, Salamanca, Spain, April 13-17, 2015}, pages = {941--946}, year = {2015}, crossref = {DBLP:conf/sac/2015}, url = {https://doi.org/10.1145/2695664.2695754}, doi = {10.1145/2695664.2695754}, timestamp = {Tue, 06 Nov 2018 11:06:46 +0100}, biburl = {https://dblp.org/rec/bib/conf/sac/GomesBE15}, bibsource = {dblp computer science bibliography, https://dblp.org}, }