The hierarchical data stream classification task combines the challenges of two areas: hierarchical classification and data stream classification. In these scenarios, machine learning models must simultaneously deal with class hierarchies and adapt to nonstationary data. Given such a challenging set of traits, existing techniques are deficient, as they perform incremental learning and are slow to adapt to newer data, thus not capturing its dynamics in a timely fashion. In this study, we propose two novel adaptive Gaussian Naive Bayes classifiers tailored to classify hierarchical data streams. The models use window-weighted Gaussian probabilities to consider current and historical data and improve the adaptability of the classifiers, especially for nonstationary data streams. As a result of our research, we introduce a unified protocol for evaluating and comparing hierarchical data stream classifiers and establish a benchmark for the hierarchical data stream classification task encompassing the proposed methods and state-of-the-art classifiers. The results demonstrate that our proposed algorithms achieve better prediction correctness than their state-of-the-art counterparts while responding more swiftly to changes in data distribution.
SAC
Just Change on Change: Adaptive Splitting Time for Decision Trees in Data Stream Classification
Daniel Nowak Assis, Jean Paul Barddal, and Fabricio Enembreck
In Proceedings of the Annual ACM Symposium on Applied Computing, SAC 2024
Hoeffding Trees are well-established decision trees for classifying streaming data. The Hoeffding bound has been widely used in a static, periodic manner, applying the bound to impurity measures to determine whether leaf nodes should split. However, this approach does not account for the state of the tree and its leaf nodes over time. We hypothesize that splitting when changes in data distribution and accuracy occur in leaf nodes enhances decision tree performance. This paper introduces the use of change detection algorithms to dictate the moment a split will happen. We present two approaches: a local one, in which each leaf node has a change detector that monitors either the error rate or the purity of that leaf node, and a global one, in which a detector monitors statistics from the leaf nodes where the instances arrive. Results show that our methods achieve competitive results while being more efficient in processing time than state-of-the-art Hoeffding-based trees, since the periodic and constant evaluation of splits is costly.
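For context, the periodic split check that the abstract contrasts against can be sketched as follows. This is a minimal illustration of the classic Hoeffding-bound test, not the change-detector method proposed in the paper; the function names, the `tie_threshold` parameter, and its default value are assumptions made for this example.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Classic Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float, value_range: float,
                 delta: float, n: int, tie_threshold: float = 0.05) -> bool:
    """Hypothetical helper: split when the best attribute beats the runner-up
    by more than epsilon, or when epsilon is small enough to break a tie."""
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold


# With 200 instances in a leaf, the bound is still fairly loose.
eps_200 = hoeffding_bound(1.0, 1e-7, 200)
clear_winner = should_split(0.30, 0.05, 1.0, 1e-7, 200)
close_call = should_split(0.30, 0.25, 1.0, 1e-7, 200)
```

In a standard Hoeffding Tree this check is typically re-run after a fixed number of instances arrive at a leaf, which is the periodic cost the abstract refers to.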
2023
STATS. & COMP.
Random Forest Kernel for High-Dimension Low Sample Size Classification
Lucca Portes Cavalheiro, Simon Bernard, Jean Paul Barddal, and Laurent Heutte
High dimension, low sample size (HDLSS) problems are numerous among real-world applications of machine learning. From medical images to text processing, traditional machine learning algorithms are usually unsuccessful in learning the best possible concept from such data. In a previous work, we proposed a dissimilarity-based approach for multi-view classification, the Random Forest Dissimilarity (RFD), that achieves state-of-the-art results for such problems. In this work, we transpose the core principle of this approach to solving HDLSS classification problems, by using the RF similarity measure as a learned precomputed SVM kernel (RFSVM). We show that such a learned similarity measure is particularly suited and accurate for this classification context. Experiments conducted on 40 public HDLSS classification datasets, supported by rigorous statistical analyses, show that the RFSVM method outperforms existing methods for the majority of HDLSS problems and remains at the same time very competitive for low or non-HDLSS problems.
ICMLA
Detecting Relevant Information in High-Volume Chat Logs: Keyphrase Extraction for Grooming and Drug Dealing Forensic Analysis
Jeovane Honório Alves, Horácio A. C. G. Pedroso, Rafael Honorio Venetikides, Joel E. M. Koster, Luiz Rodrigo Grochocki, Cinthia Obladen Almendra Freitas, and Jean Paul Barddal
In International Conference on Machine Learning and Applications (ICMLA) 2023
The growing use of digital communication platforms has given rise to various criminal activities, such as grooming and drug dealing, which pose significant challenges to law enforcement and forensic experts. This paper presents a supervised keyphrase extraction approach to detect relevant information in high-volume chat logs involving grooming and drug dealing for forensic analysis. The proposed method, JointKPE++, builds upon the JointKPE keyphrase extractor by employing improvements to handle longer texts effectively. We evaluate JointKPE++ using BERT-based pre-trained models on grooming and drug dealing datasets, including BERT, RoBERTa, SpanBERT, and BERTimbau. The results show significant improvements over traditional approaches and demonstrate the potential for JointKPE++ to aid forensic experts in efficiently detecting keyphrases related to criminal activities.
ICMLA
Event-driven Sentiment Drift Analysis in Text Streams: An Application in a Soccer Match
Cristiano Mesquita Garcia, Alceu Souza Britto Jr., and Jean Paul Barddal
In International Conference on Machine Learning and Applications (ICMLA) 2023
Social media has been a data source for various applications, given its characteristic of working as a social sensor. Many applications in several areas, such as brand reputation and online opinion monitoring, use this valuable resource to understand the users of services and products. This paper describes an application in the soccer domain, considering data collected from a social media textual data stream. The goal is to detect possible sentiment drifts related to actual events in a soccer match. This task is challenging as we resort to short texts made available during a short time (match length). We evaluated four drift detectors using four metrics: false alarms, delay (considering the number of posts), delay, and missing drifts. Our results show that ADWIN had a stable performance in sentiment drift detection compared to other methods in timely detecting the flagged drifts, raising a small number of false alarms. Given the drifts detected, we used Incremental Word-Vectors to monitor words of interest and check their relatedness to actual events in the match. We empirically assert that the closest words trace back to the sentiment drift generator events.
NEUCOM
Incremental Specialized and Specialized-Generalized Matrix Factorization Models based on Adaptive Learning Rate Optimizers
Antônio David Viniski, Jean Paul Barddal, and Alceu Souza Britto Jr.
Recommender systems suggest items that are likely to be preferred by a particular user based on historical behavior, actions, and feedback. In real-world applications, data on users and items are continuously generated at a fast pace, such as in e-commerce, social media, digital marketing, and content consumption applications. Since interactions occur over time, these scenarios can be formulated as a data stream where users’ interests are potentially dynamic, i.e., they change over time. Given that changes are expected to occur, one of the current research challenges in streaming recommender systems is that models must adapt their parameters when changes occur to maintain performance. As such changes do not occur for all users and items in the stream at the same time, we consider adapting learning schemes to account for user or item identifiers and model individual parameters. Therefore, we used specialized parameters to adjust the step size for each dataset user or item. More specifically, this study proposes four specialized and specialized-generalized variants of four well-known adaptive learning rate optimizers and shows how they are combined with incremental matrix factorization methods. We tested our proposed optimization strategies on different datasets and showed that one of the proposed specialized variants, that is, InAMSGradUser, improves the RECALL and NDCG rates by up to 11.1 and 7.5 percentage points, respectively, compared to the traditional stochastic gradient descent (SGD) optimizer.
ESWA
An Explainable Machine Learning Approach for Student Dropout Prediction
João Gabriel Corrêa Kruger, Alceu Souza Britto Jr., and Jean Paul Barddal
School dropout is a relevant socio-economic problem across the globe. Predictive models have been developed to determine the likelihood of students dropping out of their studies precociously in an attempt to overcome such a problem. Academic systems, which gather data from many students, are potential sources for datasets that feed dropout prediction algorithms, thus leading to general improvements in education quality. Despite successful past attempts to predict dropout, several works depict small datasets with features that are hard to reproduce. Furthermore, predicting whether a student will drop out is not enough to diagnose and prevent the problem as it is also necessary to provide potential justifications for the dropout. This paper proposes an approach for creating and enriching a dataset for dropout prediction, which has been applied for dropout prediction using data from 19 schools in Brazil. With this dataset and using classifiers and model explaining techniques, our experiments achieved Area Under the Precision-Recall Curve (AUC-PR) scores of up to 89.5% when predicting dropout at different year moments. This study also shows differences when predicting dropouts in different educational stages, such as preschool and secondary education, with the former being more complex than the latter. In addition to the high recognition rates, our proposal identifies potential reasons for student dropout, which are relevant for educational institutions to take preemptive actions.
BRACIS
A Tool for Measuring Energy Consumption in Data Stream Mining
Eric Kenzo Taniguchi Onuki, Andreia Malucelli, and Jean Paul Barddal
In Brazilian Conference on Intelligent Systems (BRACIS) 2023
Energy consumption reduction is an increasing trend in machine learning given its socio-ecological relevance. Consequently, it is important to quantify how real-time learning algorithms tailored for data streams and edge computing behave in terms of accuracy, processing time, memory usage, and energy consumption. In this work, we bring forward a tool for measuring energy consumption in the Massive Online Analysis (MOA) framework. First, we analyze the energy consumption rates obtained by our tool against a gold-standard hardware solution, thus showing the robustness of our approach. Next, we experimentally analyze classification algorithms under different validation protocols and concept drift and highlight how such classifiers behave under such conditions. Results show that our tool enables the identification of different classifiers’ energy consumption. In particular, it allows a better understanding of how energy consumption rates vary in drifting and non-drifting scenarios. Finally, given the insights obtained during experimentation on existing classifiers, we make our tool publicly available to the scientific community so that energy consumption is also accounted for in developing and comparing data stream mining algorithms.
IJCNN
Benchmarking Feature Extraction Techniques for Textual Data Stream Classification
Bruno Siedekum Thuma, Pedro Silva Vargas, Cristiano Garcia, Alceu Souza Britto Jr., and Jean Paul Barddal
In 2023 International Joint Conference on Neural Networks, IJCNN 2023, Gold Coast, Australia
Feature extraction regards transforming unstructured or semi-structured data into structured data that can be used as input for classification and sentiment analysis algorithms, among other applications. This task becomes even more challenging and relevant when textual data becomes available over time as a continuous data stream since the lexicon and semantics can be ever-evolving. Data streams are, by definition, potentially infinite sequences of data that may have ephemeral characteristics; when the data behavior changes, the phenomenon is named concept drift. Textual data streams are specialized data streams in which texts arrive over time from a continual data source, such as social media, raising challenges for which feature extractors are of great help. In this paper, we benchmark different feature extraction algorithms, i.e., Hashing Trick, Word2Vec, BERT, and Incremental Word-Vectors, in textual data stream classification, considering different stream lengths. The evaluation was performed over a binary and a multiclass classification task, considering two different datasets. Results show that pre-trained models, such as BERT, achieve interesting results, while Hashing Trick also performs competitively. We also observe that incremental methods such as Word2Vec and Incremental Word-Vectors are the most prepared for changing scenarios, yet they are much more computationally intensive compared to the former when applied to larger streams.
IJCNN
Mass-Based Short Term Selection of Classifiers in Data Streams
Daniel Nowak Assis, Fabricio Enembreck, and Jean Paul Barddal
In 2023 International Joint Conference on Neural Networks, IJCNN 2023, Gold Coast, Australia
Dynamic classifier selection (DCS) regards well-known machine learning techniques in the batch setting that leverage ensemble performance. Most of the methods use similarity-based methods as a proxy, culminating in high computation costs and becoming unfeasible in many streaming scenarios. In this paper, we propose a DCS method able to cope with the high-speed streaming setting, which is based on the performance of base learners in the most recent instances. The impact of our method is evaluated with different ensembles for data streams. We also propose modifications to an Online Boosting method, which has its performance improved with DCS. Our method increases the accuracy and kappa statistic of state-of-the-art ensembles with low overhead of time processing and memory.
INFUS
Exploring diversity in data complexity and classifier decision spaces for pool generation
Marcos Monteiro, Alceu S. Britto, Jean Paul Barddal, Luiz S. Oliveira, and Robert Sabourin
This paper introduces a novel method for classifier pool generation in which a two-level strategy explores diversity in both data complexity and classifier decision spaces. The rationale is to induce pool members using data subsets representing subproblems with different difficulties while promoting diversity in classifiers’ decisions. Two possible variants of the proposed method, focusing on maximum dispersion and maximum accuracy, are presented. These differ in the property used to define the best pool of classifiers provided by an optimization process. A robust experimental protocol encompassing 28 classification datasets shows that the proposed pool generation provided the best accuracy in 327 out of 336 experiments (97.3%) when compared to well-known pool generation methods to provide multiple classifier systems with and without dynamic selection.
2022
ARXIV
Evaluating k-NN in the Classification of Data Streams with Concept Drift
Roberto Souto Maior Barros, Silas Garrido Teixeira de Carvalho Santos, and Jean Paul Barddal
Data streams are often defined as large amounts of data flowing continuously at high speed. Moreover, these data are likely subject to changes in data distribution, known as concept drift. Given all the reasons mentioned above, learning from streams is often online and under restrictions of memory consumption and run-time. Although many classification algorithms exist, most of the works published in the area use Naive Bayes (NB) and Hoeffding Trees (HT) as base learners in their experiments. This article proposes an in-depth evaluation of k-Nearest Neighbors (k-NN) as a candidate for classifying data streams subjected to concept drift. It also analyses the complexity in time and the two main parameters of k-NN, i.e., the number of nearest neighbors used for predictions (k), and window size (w). We compare different parameter values for k-NN and contrast it to NB and HT both with and without a drift detector (RDDM) in many datasets. We formulated and answered 10 research questions which led to the conclusion that k-NN is a worthy candidate for data stream classification, especially when the run-time constraint is not too restrictive.
SMC
Pattern Spotting and Image Retrieval in Historical Documents using Deep Hashing
Caio Silva Dias, Alceu Souza Britto Jr., Jean Paul Barddal, Laurent Heutte, and Alessandro Lameiras Koerich
In Proceedings of the IEEE Systems, Man, and Cybernetics Conference (IEEE SMC) 2022
This paper presents a deep learning approach for image retrieval and pattern spotting in digital collections of historical documents. First, a region proposal algorithm detects object candidates in the document page images. Next, deep learning models are used for feature extraction, considering two distinct variants, which provide either real-valued or binary code representations. Finally, candidate images are ranked by computing the feature similarity with a given input query. A robust experimental protocol evaluates the proposed approach considering each representation scheme (real-valued and binary code) on the DocExplore image database. The experimental results show that the proposed deep models compare favorably to the state-of-the-art image retrieval approaches for images of historical documents, outperforming other deep models by 2.56 percentage points using the same techniques for pattern spotting. Besides, the proposed approach also reduces the search time up to 200x, and the storage cost up to 6,000x when compared to related works based on real-valued representations.
SMC
Improving Data Stream Classification using Incremental Yeo-Johnson Power Transformation
Eduardo Tieppo, Jean Paul Barddal, and Julio Cesar Nievola
In Proceedings of the IEEE Systems, Man, and Cybernetics Conference (IEEE SMC) 2022
Data transformation plays an essential role as a preprocessing step in learning models. Several classification techniques have premises about the underlying data distribution, such as the normal distribution assumed in Bayesian classifiers. However, applying data transformation in a streaming setting requires processing an infinite and continuous flow of data. In this paper, we propose the Incremental Yeo-Johnson Power Transformation, a variant of the well-known batch Yeo-Johnson transformation that is tailored for streaming settings, i.e., it supports streaming data via statistical sampling and hypothesis testing. Experimental results show that our proposal achieves the same data normality as its batch counterpart. In addition, it improves the prediction performance of a data stream classifier based on Bayesian statistical models. Overall, learning models obtained an improvement of 3 percentage points.
ESANN
A Machine Learning Approach for School Dropout Prediction in Brazil
João Gabriel Corrêa Kruger, Jean Paul Barddal, and Alceu Souza Britto Jr.
In Proceedings of the 30th European Symposium on Artificial Neural Networks (ESANN) 2022
School dropout is a severe problem that impacts many socio-economic aspects, including inequality. Dropout prediction algorithms can help remediate this problem, although several past attempts in the literature did so using datasets with small numbers of students. This paper brings forward an experimental approach of machine learning for school dropout prediction in Brazilian schools. The data used for this study was first retrieved from the academic systems of a group of Brazilian private schools, which was later enriched with socio-economic data extracted from governmental sources. Using the dataset to train different types of classifiers, we obtained precision scores of up to 95.2% when predicting dropout at different year moments and educational stages, thus allowing schools to plan and apply retention strategies.
IJCNN
Classifying Hierarchical Data Streams using Global Classifiers and Summarization Techniques
Eduardo Tieppo, Jean Paul Barddal, and Julio Cesar Nievola
In 2022 International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy
The hierarchical classification of data streams requires models capable of handling a class hierarchy and updating themselves whenever a new example arrives, within restrained processing time and memory consumption. Current state-of-the-art models store raw instances and handle the hierarchy locally, performing a high number of computations at every hierarchy level and with all, possibly redundant, data. This paper introduces Global k-Nearest Centroids (kNC) and Global Dribble, two novel methods for the hierarchical classification of data streams. Both methods use summarization techniques to represent data with constant computational resource usage and a global classification approach to process instances in less time when compared to local strategies. We compare both methods with a state-of-the-art local classifier, and the proposed methods achieved a higher number of correct predictions and processed instances nearly twice as fast.
IJCNN
Evaluation of Self-taught Learning-based Representations for Facial Emotion Recognition
Bruna Delazeri, Leonardo Leon Veras, Jean Paul Barddal, Alessandro L. Koerich, and Alceu Souza Britto Jr.
In 2022 International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy
This work describes different strategies to generate unsupervised representations obtained through the concept of self-taught learning for facial emotion recognition (FER). The idea is to create complementary representations promoting diversity by varying the autoencoders’ initialization, architecture, and training data. SVM, Bagging, Random Forest, and a dynamic ensemble selection method are evaluated as final classification methods. Experimental results on JAFFE and Cohn-Kanade datasets using a leave-one-subject-out protocol show that FER methods based on the proposed diverse representations compare favorably against state-of-the-art approaches that also explore unsupervised feature learning.
IJCNN
Assessing Batch and Online Learning for Delivery in Full and On Time Predictions
Adriano Alves Lima, Márcio Venâncio Batista, Jean Paul Barddal, Danilo Sipoli Sanches, and Luiz Eduardo Soares Oliveira
In 2022 International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy
Improving results by optimizing process execution is one objective of major companies. For these corporations, the main point for achieving better results is the good maintenance of supply chain management. The most important supply chain metric is Delivery in Full and On Time (DIFOT). DIFOT measures how well a supply chain delivers value to the customer. In this work, we bring forward an analysis of DIFOT prediction from a large Brazilian food company. More specifically, we compare a batch and an online learning algorithm for DIFOT prediction and depict why the latter is suitable for this problem. Furthermore, we report a feature drift analysis to identify whether there are considerable shifts along the dataset timespan. As a byproduct of this research, we make the dataset used in this analysis publicly available for future research in DIFOT prediction.
ESWA
A Systematic Review on Computer Vision-Based Parking Lot Management Applied on Public Datasets
Paulo Ricardo Lisboa Almeida, Jeovane Honório Alves, Rafael Stubs Parpinelli, and Jean Paul Barddal
Computer vision-based parking lot management methods have been extensively researched owing to their flexibility and cost-effectiveness. To evaluate such methods, authors often employ publicly available parking lot image datasets. In this study, we surveyed and compared robust publicly available image datasets specifically crafted to test computer vision-based methods for parking lot management approaches and consequently present a systematic and comprehensive review of existing works that employ such datasets. The literature review identified relevant gaps that require further research, such as the requirement of dataset-independent approaches and methods suitable for autonomous detection of the position of parking spaces. In addition, we have noticed that several important factors, such as the presence of the same cars across consecutive images, have been neglected in most studies, thereby rendering assessment protocols unrealistic. Furthermore, the analysis of the datasets also revealed that certain features that should be present when developing new benchmarks, such as the availability of video sequences and images taken in more diverse conditions, including nighttime and snow, have not been incorporated.
ICAART
Univariate Time Series Prediction Using Data Stream Mining Algorithms and Temporal Dependence
Marcos Alberto Mochinski, Jean Paul Barddal, and Fabricio Enembreck
In Proceedings of International Conference on Agents and Artificial Intelligence, ICAART 2022
In this paper, we present an exploratory study conducted to evaluate the impact of temporal dependence modeling on time series forecasting with Data Stream Mining (DSM) techniques. DSM algorithms have been used successfully in many domains that exhibit continuous generation of non-stationary data. However, the use of DSM in time series is rare since time series are usually univariate and exhibit strong temporal dependence. This is the main motivation for this work: this study mitigates this gap by presenting a univariate time series prediction method based on AdaGrad (a DSM algorithm), Auto.Arima (a statistical method) and features extracted from adjusted autocorrelation function (ACF) coefficients. The proposed method uses adjusted ACF features to convert the original series observations into multivariate data, executes the fitting process using the DSM and the statistical algorithm, and combines AdaGrad’s and Auto.Arima’s forecasts to establish the final predictions for each time series. Experiments conducted with five datasets containing 141,558 time series resulted in up to 12.429% improvements in sMAPE (Symmetric Mean Absolute Percentage Error) error rates when compared to Auto.Arima. The results show that combining DSM with ACF features and statistical time series methods is a suitable approach for univariate forecasting.
SAC
Automatic Disease Vector Mosquitoes Identification via Hierarchical Data Stream Classification
Eduardo Tieppo, Jean Paul Barddal, and Julio Cesar Nievola
In Proceedings of the Annual ACM Symposium on Applied Computing, SAC 2022
Vector-borne diseases (VBDs), such as Dengue or Malaria, are one of the main concerns of public health agencies and governments. These diseases are mainly spread by mosquitoes acting as vectors by transmitting infected blood between humans. Machine learning can be used to design and improve control strategies of VBDs by providing models able to recognize disease vector mosquitoes and automatically capture or kill harmful species. The automatic identification of disease vector mosquitoes was not yet addressed concerning the hierarchical classification of data streams. Thus, reliable information has not been used to improve learning models, such as mosquitoes’ hierarchical taxonomy. In this study, we propose a framework for the automatic identification of disease vector mosquitoes in the context of the hierarchical classification of data streams area. To this end, we propose a hierarchical adaptation of a disease vector mosquitoes’ dataset to include their taxonomy and introduce kNC and Dribble, two novel classification methods fitted to hierarchical data streams representing the mosquitoes. Results depicted that our framework, using summarization techniques, achieves significantly better prediction and processing speed rates when compared to existing state-of-the-art models.
ACM CSUR
A Survey on Concept Drift in Process Mining
Denise Maria Vecino Sato, Sheila Cristiana Freitas, Jean Paul Barddal, and Edson Emilio Scalabrin
Concept drift in process mining (PM) is a challenge as classical methods assume processes are in a steady-state, i.e., events share the same process version. We conducted a systematic literature review on the intersection of these areas, and thus, we review concept drift in process mining and bring forward a taxonomy of existing techniques for drift detection and online process mining for evolving environments. Existing works depict that (i) PM still primarily focuses on offline analysis, and (ii) the assessment of concept drift techniques in processes is cumbersome due to the lack of common evaluation protocol, datasets, and metrics.
2021
ICPM
Interactive Process Drift Detection: a framework for visual analysis of process drifts
Denise Maria Vecino Sato, Rafaela Mantovani Fontana, Jean Paul Barddal, and Edson Emilio Scalabrin
In International Conference on Process Mining (ICPM) - Demo track 2021
Interactive Process Drift Detection (IPDD) is a framework for visual analysis of process drifts. A process drift indicates that a change in the process model occurred at some point in time. The IPDD approach first generates process models for subparts of the event log using a sliding window approach. Then, IPDD detects the drifts by evaluating similarity metrics calculated between adjacent process models; a difference in some of the metrics indicates a drift. The current implementation of IPDD generates the process models using the directly-follows graph and applies two similarity metrics: nodes and edges similarity. The user interface shows the drifts in the process models over time, allowing the user to visually understand the model changes. Also, the user can easily change the hyperparameters for the drift analysis and verify the results on the interface. The user interface of IPDD also allows the user to evaluate the detected drifts by calculating the F-score metric, which is useful when using artificial datasets. The underlying idea is to ease the choice of a "good" value for the hyperparameter configuration, which is critical for almost any drift detection mechanism.
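The windowing and comparison steps described in this abstract can be sketched roughly as below. This is an illustrative sketch only, not IPDD's code: the use of non-overlapping windows, the Jaccard index as the node and edge similarity, and all function names and the `threshold` parameter are assumptions made for the example.

```python
def directly_follows_graph(traces):
    """Build a directly-follows graph: nodes are activities, edges are
    pairs (a, b) where b immediately follows a in some trace."""
    nodes, edges = set(), set()
    for trace in traces:
        nodes.update(trace)
        edges.update(zip(trace, trace[1:]))
    return nodes, edges


def jaccard(a, b):
    """Assumed similarity metric: |a & b| / |a | b| (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def detect_drifts(traces, window_size, threshold=1.0):
    """Compare the models of adjacent windows; report the index of the first
    trace of any window whose node or edge similarity drops below threshold."""
    windows = [traces[i:i + window_size]
               for i in range(0, len(traces), window_size)]
    models = [directly_follows_graph(w) for w in windows]
    drifts = []
    for i in range(1, len(models)):
        (prev_nodes, prev_edges), (cur_nodes, cur_edges) = models[i - 1], models[i]
        if (jaccard(prev_nodes, cur_nodes) < threshold
                or jaccard(prev_edges, cur_edges) < threshold):
            drifts.append(i * window_size)
    return drifts


# Toy event log: the order of activities b and c swaps after four traces.
log = [["a", "b", "c"]] * 4 + [["a", "c", "b"]] * 4
drift_points = detect_drifts(log, window_size=2)
```

In this toy log the set of activities never changes, so only the edge similarity reveals the drift; the window size plays the role of the hyperparameter that the abstract says the user can tune interactively.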
AIRE
Hierarchical classification of data streams: a systematic literature review
Eduardo Tieppo, Roger Robson Santos, Jean Paul Barddal, and Júlio Cesar Nievola
The classification task usually works with flat and batch learners, assuming problems as stationary and without relations between class labels. Nevertheless, several real-world problems do not assume these premises, i.e., data have labels organized hierarchically and are made available in streaming fashion, meaning that their behavior can drift over time. Existing studies on hierarchical classification do not consider data streams as input of their process, and thus, data is assumed as stationary and handled through batch learners. The same can be said about works on streaming data, as the hierarchical classification is overlooked. Studies concerning each area individually are promising, yet, do not tackle their intersection. This study analyzes the main characteristics of the state-of-the-art works on hierarchical classification for streaming data concerning five aspects: (i) problems tackled, (ii) datasets, (iii) algorithms, (iv) evaluation metrics, and (v) research gaps in the area. We performed a systematic literature review of primary studies and retrieved 3,722 papers, of which 42 were identified as relevant and used to answer the aforementioned research questions. We found that the problems handled by hierarchical classification of data streams include mainly classification of images, human activities, texts, and audio; the datasets are mostly created or synthetic data; the algorithms and evaluation metrics are well-known techniques or based on those; and research gaps are related to dynamic context, data complexity, and computational resources constraints. We also provide implications for future research and experiments to consider common characteristics shared amongst hierarchical classification and data stream classification.
BRACIS
Classifying Potentially Unbounded Hierarchical Data Streams with Incremental Gaussian Naive Bayes
Eduardo Tieppo, Julio Cesar Nievola, and Jean Paul Barddal
In Brazilian Conference on Intelligent Systems (BRACIS) 2021
Hierarchical Classification of Data Streams inherits the properties and constraints of Hierarchical Classification and Data Stream Classification areas concomitantly. Therefore, it requires novel approaches that (i) can handle class hierarchies, (ii) can be updated over time, and (iii) are computationally lightweight regarding processing time and memory usage. In this study, we propose the Gaussian Naive Bayes for Hierarchical Data Streams (GNB-hDS) method: an incremental Gaussian Naive Bayes for classifying potentially unbounded hierarchical data streams. The GNB-hDS method uses statistical summaries of the data stream instead of storing actual instances. These statistical summaries allow more efficient data storage, maintain constant computational time and memory, and calculate the probability of an instance belonging to a specific class via Bayes’ Theorem. We compare our method against a technique that stores raw instances, and results show that our method obtains equivalent prediction rates while being statistically faster.
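To illustrate the idea of classifying from statistical summaries rather than stored instances, here is a minimal sketch of an incremental Gaussian Naive Bayes. It is not the GNB-hDS implementation: the class names, the use of Welford's running variance, the variance floor `eps`, and the encoding of a hierarchical label as a tuple path are assumptions for the example, and no hierarchy-specific logic is modeled.

```python
import math
from collections import defaultdict


class RunningGaussian:
    """Per-feature summary: count, mean, and sum of squared deviations (Welford)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def log_pdf(self, x, eps=1e-9):
        # eps keeps the variance positive when fewer than two values were seen.
        var = (self.m2 / (self.n - 1) if self.n > 1 else 0.0) + eps
        return -0.5 * (math.log(2 * math.pi * var) + (x - self.mean) ** 2 / var)


class IncrementalGaussianNB:
    """Memory use depends on the number of classes and features, not on stream length."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.stats = defaultdict(dict)  # label -> feature index -> RunningGaussian

    def learn_one(self, x, y):
        self.class_counts[y] += 1
        for i, value in enumerate(x):
            self.stats[y].setdefault(i, RunningGaussian()).update(value)

    def predict_one(self, x):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the sum of per-feature Gaussian log-likelihoods.
            score = math.log(count / total)
            score += sum(self.stats[label][i].log_pdf(v) for i, v in enumerate(x))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = IncrementalGaussianNB()
for x in ([0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]):
    model.learn_one(x, ("root", "A"))  # hypothetical label path in the hierarchy
for x in ([5.0, 5.1], [5.2, 4.9], [4.9, 5.0]):
    model.learn_one(x, ("root", "B"))
print(model.predict_one([0.1, 0.0]))
```

Each update touches only three numbers per feature and class, which is what keeps time and memory constant per instance.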
SMC
Adaptive Global k-Nearest Neighbors for Hierarchical Classification of Data Streams
Eduardo Tieppo, Jean Paul Barddal, and Julio Cesar Nievola
In IEEE Conference on Systems, Man, and Cybernetics (SMC) 2021
Data stream classification differs from batch learning classification methods as data is made available sequentially and may drift over time. Therefore, data stream classification can co-occur with all other kinds of classification problems, and it has revisited many aspects of classification in recent years. So far, hierarchical classification has been only weakly addressed in streaming scenarios despite being a well-established research topic. In this paper, we propose the adaptive global k-Nearest Neighbors for hierarchical classification of data streams (Global kNN-hDS). Our proposal is able to classify hierarchical data streams using a constrained memory buffer and following a global approach. We compare our method against a local kNN also tailored for streaming scenarios, and results show that our method obtains competitive prediction rates while being statistically faster.
IJCNN
Dynamically Selected Ensemble for Data Stream Classification
Lucca Portes Cavalheiro, Jean Paul Barddal, Alceu Souza Britto Jr., and Laurent Heutte
In International Joint Conference on Neural Networks (IJCNN) 2021
Mining data streams is a hot topic in the machine learning (ML) community. In addition to learning and updating accurate models over time, these techniques must respect constraints that are not necessarily as strong in batch mode, such as processing time and memory consumption efficiency. A successful family of techniques in batch ML is dynamic classifier selection (DCS). However, these are largely overlooked in data stream mining. In this paper, we propose a novel dynamic classifier selection framework for data streams called Double Dynamic Classifier Selection (DDCS). We compare DDCS against state-of-the-art methods for mining data streams in both synthetic and real-world datasets. Results show that DDCS not only outperforms the state-of-the-art ensemble methods for data stream classification in terms of accuracy but is also significantly more efficient in terms of processing time and memory consumption.
IJCNN
Towards the Overcome of Performance Pitfalls in Data Stream Mining Tools
Lucca Portes Cavalheiro, Marco Antonio Alves Zanata, and Jean Paul Barddal
In International Joint Conference on Neural Networks (IJCNN) 2021
Data stream mining is an essential task in today’s scientific community. It allows machine learning models to be updated over time as new data becomes available. Three pillars should be accounted for when selecting an appropriate algorithm for data stream mining: accuracy, processing time, and memory consumption. To develop and assess machine learning models in streaming scenarios, different tools have been developed, where the Massive Online Analysis, written in Java, and scikit-multiflow, written in Python, are in the spotlight. Despite the ease of use of both tools, neither is focused on performance, which jeopardizes the usage of computational resources. In this paper, we show that with the right tools, Python libraries reach performance comparable to C/C++. More specifically, we show how optimized implementations in scikit-multiflow using low-level languages, i.e., C++, C++ with Intel Intrinsics, and Rust, with bindings to Python, vastly outperform existing tools in computational resource usage while keeping predictive performance intact.
ICAISC
Interactive Process Drift Detection Framework
Denise Maria Vecino Sato, Jean Paul Barddal, and Edson Emilio Scalabrin
In International Conference on Artificial Intelligence and Soft Computing (ICAISC) 2021
This paper presents a novel tool for detecting drifts in process models. The tool targets the challenge of defining the best parameter configuration for detecting drifts by providing an interactive user interface. Using this interface, the user can quickly change the parameters and verify how the process evolved. The process evolution is presented in a timeline of process models, simulating a “replay” of models over time. One instantiation of the framework was implemented using a fixed-size sliding window, discovering process maps using directly-follows graphs (DFGs), and calculating nodes and edges similarities. This instantiation was evaluated using a benchmarking dataset of simple and complex drift patterns. The tool correctly detected 17 of the 18 change patterns, thus confirming its potential when an adequate window size is set. The user interface shows that replaying the process models provides a visual understanding of the changing process. The concept drift is explained by the similarity metrics’ differences, thus allowing drift localization.
ESWA
A case study of batch and incremental recommender systems in supermarket data under concept drifts and cold start
Antônio David Viniski, Jean Paul Barddal, Alceu Souza Britto Jr., Fabricio Enembreck, and Humberto Vinicius Aparecido Campos
Recommender systems uncover relationships between users and items, thus allowing personalized recommendations. Nonetheless, users’ preferences may change over time, the so-called concept drifts; or new users and items may appear, making the recommender system unable to accurately map the relationship between users and items due to the cold start problem. Consequently, concept drift and cold start are challenges that downgrade the recommender system’s predictive performance. This paper assesses existing approaches for collaborative-filtering recommender systems over a real supermarket dataset that exhibits both of the issues mentioned above. For this purpose, our comparative analysis encompasses batch and streaming learning approaches. As a result, we can observe that streaming-based models achieve better recommendation rates since these are tailored to fit the concept drift. More specifically, the predictive performance of streaming-based recommendations increases by up to 21% over those provided by batch methods. The supermarket dataset used in experimentation is also made publicly available for future studies and recommender systems comparisons.
PAKDD
UKIRF: An Item Rejection Framework for Improving Negative Items Sampling in One-Class Collaborative Filtering
Antônio David Viniski, Jean Paul Barddal, and Alceu Souza Britto Jr.
In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2021
Collaborative Filtering (CF) is one of the most successful techniques in recommender systems. Most CF scenarios depict positive-only implicit feedback, which means that negative feedback is unavailable. Therefore, One-Class Collaborative Filtering (OCCF) techniques have been tailored to tackling these scenarios. Nonetheless, several OCCF models still require negative observations during training, and thus, a popular approach is to consider randomly selected unknown relationships as negative. In this work, we bring forward a novel approach for selecting negative items called Unknown Item Rejection Framework (UKIRF). More specifically, we instantiate UKIRF using similarity approaches, i.e., TF-IDF and Cosine, to reject items that are similar to those a user interacted with. We apply UKIRF to different OCCF models in different datasets and show that it improves the recall rates up to 24% when compared to random sampling.
ICPR
Classifier Pool Generation based on a Two-level Diversity Approach
Marcos Monteiro, Alceu Souza Britto Jr, Jean Paul Barddal, Luiz Soares Oliveira, and Robert Sabourin
In International Conference on Pattern Recognition (ICPR) 2021
This paper describes a classifier pool generation method guided by the diversity estimated on the data complexity and classifier decisions. First, the behavior of complexity measures is assessed by considering several subsamples of the dataset. The complexity measures with high variability across the subsamples are selected for posterior pool adaptation, where an evolutionary algorithm optimizes diversity in both complexity and decision spaces. A robust experimental protocol with 28 datasets and 20 replications is used to evaluate the proposed method. Results show significant accuracy improvements in 69.4% of the experiments when Dynamic Classifier Selection and Dynamic Ensemble Selection methods are applied.
2020
SMC
Combining Slow and Fast Learning for Improved Credit Scoring
Lucas Loezer, Jean Paul Barddal, and Riccardo Lanzuolo
In IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC) 2020
The financial credibility of a person is a relevant factor to determine whether a loan should be approved or not, and it is quantified by a credit score, which is computed using past performance on debt obligations, profiling, and other data available. Credit scoring becomes an even hotter topic in emerging countries, as interest rates and customer behavior swiftly vary, given the economic (in)stability of the country, and as fintechs are chasing robust solutions for improved credit scoring. Batch machine learning models are often deployed for credit scoring, yet they are tailored for static scenarios, i.e., they are not prepared to swiftly detect and adapt to changes in customer behavior, thus leading to slow recovery in such scenarios. In this paper, we bring forward an analysis on how batch machine learning can be combined with data stream mining techniques, thus leading to better recognition rates in credit scoring scenarios. We analyze three different real-world datasets from Brazilian financial institutions, whilst keeping their secrecy preserved, and show how batch and stream learning can be combined towards improved credit scoring systems, as well as highlighting relevant gaps that still require attention.
SMC
Naïve Approaches to Deal with Concept Drifts
Alceu Souza Britto Jr Almeida, and Jean Paul Barddal
In IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC) 2020
A common problem in machine learning is to find representative real-world problems to put the methods to test. When developing approaches to deal with concept drifts, some datasets such as the Forest Covertype and Nebraska Weather are a common choice for testing. We argue that some well-known real-world concept drift datasets present a high serial dependence in the target class and may have only minor changes. With this in mind, we propose the use of naïve methods that should be used for comparison with methods that deal with concept drifts. The experimental results using six real-world well-known concept drift datasets show that the naïve approaches can be better than some methods to deal with possible concept drifts in datasets such as the Forest Covertype, Electricity, and Nebraska Weather. These results suggest that some widely used datasets may be trivial from the concept drift standpoint, and thus, should be avoided or at least the results should be compared with the proposed naïve methods.
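The abstract above does not spell out which naïve methods were used, so the following is only an assumed example of such a baseline: a "no-change" predictor that always repeats the most recently observed label, evaluated in a test-then-train fashion. The function name `no_change_accuracy` and the label sequence are hypothetical.

```python
def no_change_accuracy(labels):
    """Accuracy of predicting that each label equals the one observed just before it."""
    if len(labels) < 2:
        return 0.0
    hits = sum(prev == cur for prev, cur in zip(labels, labels[1:]))
    # The first instance has no predecessor, so it is not scored.
    return hits / (len(labels) - 1)


# Hypothetical stream with long runs of the same class (high serial dependence).
stream_labels = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
print(no_change_accuracy(stream_labels))
```

When labels arrive in long runs, this baseline is hard to beat without learning anything from the features, which is why comparing against it is informative.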
SMC
Improving Multiple Time Series Forecasting with Data Stream Mining Algorithms
Jean Paul Barddal, Marcos Alberto Mochinski, and Fabricio Enembreck
In IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC) 2020
This paper proposes a hybrid ensemble learning approach that combines statistical and data stream mining algorithms to obtain better forecasting performance in multiple time series prediction problems. Although some multiple time series algorithms perform surprisingly well in a variety of domains, it is well-known that none is dominant in every domain. Therefore, we developed a meta-technique based on data stream mining and static ensemble selection strategy and evaluated its forecasting goodness-of-fit in time series datasets from M3 and M4 competitions. After training different regression models, we show how the combination of auto.arima and AdaGrad leads to improved forecasting rates, thus surpassing the results of state-of-the-art algorithms.
SMC
ADADRIFT: An Adaptive Learning Technique for Long-History Stream-Based Recommender Systems
Fabricio Enembreck, Eduardo Ferreira José, and Jean Paul Barddal
In IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC) 2020
Adaptive recommender systems are increasingly showing their importance as profiling is a dynamic problem. Their goal is to update recommendation models as new interactions take place, thus swiftly adapting to drifts in the user’s behavior and desires, and item’s audience. However, existing recommendation algorithms usually do not perform well during drifts, as they take long to adapt to changes, or these updates are suboptimal since they account for all profiles’ preferences equally, which is often untrue as each individual and its changes are unique. In this paper, we propose the ADADRIFT algorithm to deal with user and item-based drifts in adaptive recommender systems using personalized learning rates based on profile statistics. The experiments using stream-based recommender systems (ISGD and BRISMF) across four different datasets show that ADADRIFT surpasses ADADELTA with significant improvements in recommendation rates. The best results appear when the data streams have a long history of the users’ or items’ interactions and drifts become noticeable. The experimentation in this work highlights the importance of handling drifts in recommender systems.
ESWA
Lessons learned from data stream classification applied to credit scoring
Jean Paul Barddal, Lucas Loezer, Fabrício Enembreck, and Riccardo Lanzuolo
The financial credibility of a person is a factor used to determine whether a loan should be approved or not, and this is quantified by a ‘credit score,’ which is calculated using a variety of factors, including past performance on debt obligations, profiling, amongst others. Machine learning has been widely applied to automate the development of effective credit scoring models over the years. Yet, studies show that the development of robust credit scoring models may take longer than a year, and thus, if the behavior of customers changes over time, the model will be outdated even before its deployment. In this paper, we make 3 anonymized real-world credit scoring datasets available alongside the results obtained. In each of these datasets, we verify whether the credit scoring task should be thought of as an ephemeral scenario since many of the variables may drift over time, and thus, data stream mining techniques should be used since they were tailored for incremental learning and to detect and adapt to changes in the data distribution. Therefore, we compare traditional batch machine learning algorithms with data stream algorithms in different validation schemes using both Kolmogorov–Smirnov and Population Stability Index metrics. Furthermore, we also provide insights on the importance of features according to their Information Value, Mean Decrease Impurity, and Mean Positional Gain metrics, such that the last depicts changes in the importance of features over time. For 2 of the 3 tested datasets, the results obtained by data stream learners are comparable to predictive models currently in use, thus showing the efficiency of data stream classification for the credit scoring task.
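As a reminder of what the Population Stability Index measures, the sketch below computes it between a reference sample and a newer sample of one variable, using the usual form: the sum over bins of (actual share minus expected share) times the natural log of their ratio. The equal-width binning derived from the reference sample, the floor `eps` for empty bins, and the function name are assumptions for the example, not the setup used in the paper.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a newer sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical score samples: the second one is shifted upwards.
reference = [i / 100 for i in range(100)]
shifted = [min(x + 0.3, 0.99) for x in reference]
print(population_stability_index(reference, reference))
print(population_stability_index(reference, shifted))
```

Every term of the sum is non-negative, so the index is zero only when the binned distributions coincide and grows as the newer sample drifts away from the reference.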
ANN. TELECOM.
Regularized and incremental decision trees for data streams
Decision trees are a widely used family of methods for learning predictive models from both batch and streaming data. Despite depicting positive results in a multitude of applications, incremental decision trees continuously grow in terms of nodes as new data becomes available, i.e., they eventually split on all features available, and also multiple times using the same feature; thus leading to unnecessary complexity and overfitting. With this behavior, incremental trees lose the ability to generalize well, be human-understandable and computationally efficient. To tackle these issues, we proposed in a previous study a regularization scheme for Hoeffding decision trees that: (i) uses a penalty factor to control the gain obtained by creating a new split node using a feature that has not been used thus far; and (ii) uses information from previous splits in the current branch to determine whether the gain observed indeed justifies a new split. In this paper, we extend this analysis and apply the proposed regularization scheme to other types of incremental decision trees and report the results in both synthetic and real-world scenarios. The main interest is to verify whether and how the proposed regularization scheme affects the different types of incremental trees. Results show that in addition to the original Hoeffding Tree, the Adaptive Random Forest also benefits from regularization, yet McDiarmid Trees and Extremely Fast Decision Trees observe declines in accuracy.
IJCNN
An End-to-End Approach for Recognition of Modern and Historical Handwritten Numeral Strings
André Gustavo Hochuli, Alceu Souza Britto Jr., Jean Paul Barddal, Luiz Eduardo Soares Oliveira, and Robert Sabourin
In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN) Glasgow, Scotland 2020
An end-to-end solution for handwritten numeral string recognition is proposed, in which the numeral string is considered as composed of objects automatically detected and recognized by a YOLO-based model. The main contribution of this paper is to avoid heuristic-based methods for string preprocessing and segmentation, the need for task-oriented classifiers, and also the use of specific constraints related to the string length. A robust experimental protocol based on several numeral string datasets, including one composed of historical documents, has shown that the proposed method is a feasible end-to-end solution for numeral string recognition. Besides, it reduces the complexity of the string recognition task considerably since it drops classical steps, in particular preprocessing, segmentation, and a set of classifiers devoted to strings with a specific length.
SAC
Cost-sensitive learning for imbalanced data streams
Lucas Loezer, Fabricio Enembreck, Jean Paul Barddal, and Alceu Souza Britto Jr.
In Proceedings of the 35th Annual ACM Symposium on Applied Computing, SAC 2020, Brno, Czech Republic, March 30 - April 3, 2020
The data imbalance problem hampers the classification task. In streaming environments, this becomes even more cumbersome as the proportion of classes can vary over time. Approaches based on misclassification costs can be used to mitigate this problem. In this paper, we present the Cost-sensitive Adaptive Random Forest (CSARF) and compare it to the Adaptive Random Forest (ARF) and ARF with Resampling (ARF_RE) in six real-world and six synthetic data sets with different class ratios. The empirical study analyzes two misclassification cost strategies of the CSARF and shows that the CSARF obtained statistically superior results w.r.t. the average recall and average F1 when compared to ARF.
ANÁLISE PREDITIVA E DECISÕES JUDICIAIS: controvérsia ou realidade?
Cinthia Obladen Almendra Freitas, and Jean Paul Barddal
Revista Democracia Digital e Governo Eletrônico Jan 2020
In this paper, we provide an overview of how Data Analytics, Big Data, and Machine Learning may assist the judicial system by providing insightful information to citizens, police, lawyers, and judges, in a fast and accurate way. We conduct a bidirectional analysis between Law and Predictive Analytics applying the deductive method and bibliographic technique. We report concerns that Law should have with the application of computational techniques in different scenarios, mainly in the judicial system. Finally, we bring forward controversies between these areas, such as the new companies that target the use of personal and sensitive data in Law applications, and how these are potentially hurting fundamental rights and leading to biases in critical systems, such as predictive systems for crime recidivism.
2019
ACM SIGKDD
Machine learning for streaming data
Heitor Murilo Gomes, Jesse Read, Albert Bifet, Jean Paul Barddal, and João Gama
Incremental learning, online learning, and data stream learning are terms commonly associated with learning algorithms that update their models given a continuous influx of data without performing multiple passes over data. Several works have been devoted to this area, either directly or indirectly as characteristics of big data processing, i.e., Velocity and Volume. Given the current industry needs, there are many challenges to be addressed before existing methods can be efficiently applied to real-world problems. In this work, we focus on elucidating the connections among the current state-of-the-art on related fields; and clarifying open challenges in both academia and industry. We treat with special care topics that were not thoroughly investigated in past position and survey papers. This work aims to evoke discussion and elucidate the current research opportunities, highlighting the relationship of different subareas and suggesting courses of action when possible.
SIGAPP ACR
Addressing Feature Drift in Data Streams Using Iterative Subset Selection
Lanqin Yuan, Bernhard Pfahringer, and Jean Paul Barddal
Data streams are prone to various forms of concept drift over time including, for instance, changes to the relevance of features. This specific kind of drift is known as feature drift and requires techniques tailored not only to determine which features are the most important but also to take advantage of them. Feature selection has been studied and shown to improve classifier performance in standard batch data mining, yet it is mostly unexplored in data stream mining. This paper presents a novel method of feature subset selection specialized for dealing with the occurrence of feature drifts called Iterative Subset Selection (ISS), which splits the feature selection process into two stages by first ranking the features using some scoring function, and then iteratively selecting feature subsets using this ranking. This work further extends our prior work by exploring feeding information from the subset selection stage back into the ranking process. Applying our method to the Naïve Bayes and k-Nearest Neighbour classifiers, we obtain compelling accuracy improvements when compared to existing works.
IJCNN
Vertical and Horizontal Partitioning in Data Stream Regression Ensembles
Jean Paul Barddal
In 2019 International Joint Conference on Neural Networks, IJCNN 2019, Budapest, Hungary, July 14-19, 2019
Data stream mining is an emerging topic in machine learning that targets the creation and update of predictive models over time as new data becomes available. Regarding existing works, classification is the most widely tackled task, which leaves regression nearly untouched. In this paper, the focus relies on ensemble learning for data stream regression, more specifically on vertical and horizontal data partitioning techniques. The goal is to determine whether and under which conditions partitioning can lessen the error rates of different types of learners in the data stream regression task. The proposed method combines vertical and horizontal partitioning, and it is compared with and against different types of learners and existing ensembles.
SAC
Learning Regularized Hoeffding Trees from Data Streams
Jean Paul Barddal, and Fabricio Enembreck
In Proceedings of the 34th Annual ACM Symposium on Applied Computing, SAC 2019, Limassol, Cyprus, April 08-12, 2019
Learning from data streams is a hot topic in machine learning that targets the learning and update of predictive models as data becomes available for both training and query. Due to their simplicity and convincing results in a multitude of applications, Hoeffding Trees are, by far, the most widely used family of methods for learning decision trees from streaming data. Despite the aforementioned positive characteristics, Hoeffding Trees tend to continuously grow in terms of nodes as new data becomes available, i.e., they eventually split on all features available, and multiple times on the same feature; thus leading to unnecessary complexity. With this behavior, Hoeffding Trees lose the ability to be human-understandable and computationally efficient. To tackle these issues, we propose a regularization scheme for Hoeffding Trees that (i) uses a penalty factor to control the gain obtained by creating a new split node using a feature that has not been used thus far; and (ii) uses information from previous splits in the current branch to determine whether the gain observed indeed justifies a new split. The proposed scheme is combined with both standard and adaptive variants of Hoeffding Trees. Experiments using real-world, stationary and drifting synthetic data show that the proposed method prevents both original and adaptive Hoeffding Trees from unnecessarily growing while maintaining impressive accuracy rates. As a byproduct of the regularization process, significant improvements in processing time, model complexity, and memory consumption have also been observed, thus showing the effectiveness of the proposed regularization scheme.
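The first ingredient of the regularization scheme, penalizing the gain of features not yet used, can be sketched on top of the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)). The sketch below is not the formulation from the paper: the multiplicative `penalty`, the function names, and the default parameters are assumptions for illustration, and the second ingredient (using information from earlier splits in the branch) is not modeled.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound for n observations of a variable with the given range."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def choose_split(gains, used_features, n, penalty=0.5, delta=1e-7, value_range=1.0):
    """Return the feature to split on, or None if the evidence is still insufficient.

    gains: dict mapping feature name -> observed gain at the leaf.
    used_features: features already used by split nodes (assumed bookkeeping).
    penalty: hypothetical factor in (0, 1] applied to the gain of unused features.
    """
    penalized = {
        f: (g if f in used_features else penalty * g) for f, g in gains.items()
    }
    ranked = sorted(penalized.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return None
    (best_feature, best_gain), (_, second_gain) = ranked[0], ranked[1]
    # Split only if the best feature beats the runner-up by more than the bound.
    if best_gain - second_gain > hoeffding_bound(value_range, delta, n):
        return best_feature
    return None


gains = {"a": 0.40, "b": 0.50}
print(choose_split(gains, used_features={"a"}, n=1000))               # penalized
print(choose_split(gains, used_features={"a"}, n=1000, penalty=1.0))  # unpenalized
```

With the penalty active, the already-used feature wins the comparison, so the tree reuses it rather than adding a new feature to the model.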
SAC
Decision tree-based Feature Ranking in Concept Drifting Data Streams
Andreia Malucelli, Jean Antonio Karax, and Jean Paul Barddal
In Proceedings of the 34th Annual ACM Symposium on Applied Computing, SAC 2019, Limassol, Cyprus, April 08-12, 2019
Data stream mining targets the learning of predictive models that evolve over time according to changes in arriving data. Throughout the years, several approaches have been tailored to create and continuously update predictive models from these streams, and from these, Hoeffding Trees became a popular choice for learning decision trees from data streams. Quantifying and expressing the importance of features in dynamic scenarios is of the utmost importance, as it allows domain experts to back up, or invalidate, a predictive model. Therefore, we propose a positional gain method tailored for both individual Hoeffding Trees and ensembles of them, and assess how it behaves in both synthetic and real-world scenarios.
INFSYS
Boosting decision stumps for dynamic feature selection on data streams
Jean Paul Barddal, Fabrício Enembreck, Heitor Murilo Gomes, Albert Bifet, and Bernhard Pfahringer
Feature selection targets the identification of which features of a dataset are relevant to the learning task. It is also widely known and used to improve computation times, reduce computation requirements, decrease the impact of the curse of dimensionality, and enhance the generalization rates of classifiers. In data streams, classifiers shall benefit from all the items above, but more importantly, from the fact that the relevant subset of features may drift over time. In this paper, we propose a novel dynamic feature selection method for data streams called Adaptive Boosting for Feature Selection (ABFS). ABFS chains decision stumps and drift detectors, and as a result, identifies which features are relevant to the learning task as the stream progresses with reasonable success. In addition to our proposed algorithm, we bring feature selection-specific metrics from batch learning to streaming scenarios. Next, we evaluate ABFS according to these metrics in both synthetic and real-world scenarios. As a result, ABFS improves the classification rates of different types of learners and eventually enhances computational resources usage.
ESWA
Merit-guided dynamic feature selection filter for data streams
Jean Paul Barddal, Fabrício Enembreck, Heitor Murilo Gomes, Albert Bifet, and Bernhard Pfahringer
Learning from ephemeral data streams has garnered the interest of both researchers and practitioners towards adaptive learning techniques. Despite the convincing results obtained thus far, most of the current research still overlooks that the relevance of features may change throughout the learning process. Scenarios where features become - or cease to be - relevant to the learning task are called feature drifting data streams, and the identification of which features are relevant becomes even more challenging when the feature space is high-dimensional. To select relevant features during the progress of data streams, we propose a merit-guided and classifier-independent dynamic feature selection algorithm named DynamIc SymmetriCal Uncertainty Selection for Streams (DISCUSS). We evaluate our proposal on both synthetic and real-world datasets and show that DISCUSS can boost kNN and Naive Bayes classifiers’ accuracy rates on high-dimensional data streams, while at the expense of limited processing time and memory space. Finally, the drawbacks of the proposed method are assessed, and possible future works on the topic are also discussed.
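The merit that gives DISCUSS its name, symmetrical uncertainty, is commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). The sketch below computes it for two discrete sequences in batch. How DISCUSS maintains this quantity over a stream is not described in the abstract, so the batch computation and the function names are assumptions for illustration only.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU in [0, 1]: 0 for independent variables, 1 when each determines the other."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant, so there is nothing to share
    joint = entropy(list(zip(x, y)))
    mutual_information = hx + hy - joint
    return 2.0 * mutual_information / (hx + hy)


# Hypothetical feature columns compared against a class label.
label = [0, 0, 1, 1]
relevant = [0, 0, 1, 1]    # identical to the label
irrelevant = [0, 1, 0, 1]  # independent of the label
print(symmetrical_uncertainty(relevant, label))
print(symmetrical_uncertainty(irrelevant, label))
```

Because the score is normalized by the entropies, it does not favor features merely for having many distinct values, which makes it usable as a classifier-independent ranking criterion.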
2018
ESANN
Adaptive random forests for data stream regression
Heitor Murilo Gomes, Jean Paul Barddal, Luis Eduardo Boiko Ferreira, and Albert Bifet
In 26th European Symposium on Artificial Neural Networks, ESANN 2018, Bruges, Belgium, April 25-27, 2018
Data stream mining is a hot topic in the machine learning community that tackles the problem of learning and updating predictive models as new data becomes available over time. Even though several new methods are proposed every year, most focus on the classification task and overlook the regression task. In this paper, we propose an adaptation to the Adaptive Random Forest so that it can handle regression tasks, namely ARF-Reg. ARF-Reg is empirically evaluated and compared to the state-of-the-art data stream regression algorithms, thus highlighting its applicability in different data stream scenarios.
IJCNN
An Experimental Perspective on Sampling Methods for Imbalanced Learning From Financial Databases
Luis Eduardo Boiko Ferreira, Jean Paul Barddal, Fabrício Enembreck, and Heitor Murilo Gomes
In 2018 International Joint Conference on Neural Networks, IJCNN 2018, Rio de Janeiro, Brazil, July 8-13, 2018
The financial market is one of the major consumers of data mining techniques, and the main reason is their efficiency to analyze complex data. One important trait shared between most financial applications is class imbalance. Since traditional classification methods assume nearly balanced classes and equal misclassification costs, they usually fail to deal with imbalanced data. However, in financial contexts, problems are usually imbalanced, and instances from the minority class are known for deficits of millions of dollars every year, e.g., credit card frauds, money laundering transactions and so forth. Over the years, several techniques for dealing with class imbalance have been developed, such as sampling techniques and algorithm adaptations. In this study, we analyze how different sampling techniques impact the performance of different classification systems on financial applications. Results show that, for the given datasets, sampling techniques allow the improvement of prediction performance of the minority class while also improving overall classification rates. Nevertheless, their use often deteriorates the performance in predicting the majority class.
INDIN
Are fintechs really a hype? A machine learning-based polarity analysis of Brazilian posts on social media
Marina Ponestke Seara, Andreia Malucelli, Altair Olivo Santin, and Jean Paul Barddal
In 16th IEEE International Conference on Industrial Informatics, INDIN 2018, Porto, Portugal, July 18-20, 2018
Fintechs are technology companies that, in contrast to traditional banks, are engaged in digital solutions for payment, money transfers, and real-time notifications. Taking advantage of digital means of communication, most of the service interactions between fintechs and customers occur via chats or posts on social media. In this work, our goal is to use machine learning to analyze these posts and identify the terms customers use to express positive, neutral, and negative customer experiences. During this analysis, we assess the following questions using data from the 3 biggest fintechs in Brazil: (i) what are the most commented topics on social media regarding fintechs; (ii) what are the words most often used by customers to express positive, negative, and neutral reactions to the customer service obtained; and (iii) what kind of machine learning model should a fintech use to automatically identify whether a post is positive, negative, or neutral.
SAC
Iterative subset selection for feature drifting data streams
Lanqin Yuan, Bernhard Pfahringer, and Jean Paul Barddal
In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, SAC 2018, Pau, France, April 09-13, 2018
Feature selection has been studied and shown to improve classifier performance in standard batch data mining but is mostly unexplored in data stream mining. Feature selection becomes even more important when the relevant subset of features changes over time, as the underlying concept of a data stream drifts. This specific kind of drift is known as feature drift and requires specific techniques not only to determine which features are the most important but also to take advantage of them. This paper presents a novel feature subset selection method specialized for feature drifts, called Iterative Subset Selection (ISS), which splits the feature selection process into two stages: first ranking the features, and then iteratively selecting features from the ranking. Applying our feature selection method together with Naive Bayes or k-Nearest Neighbour as a classifier results in compelling accuracy improvements compared to prior work.
2017
ACM CSUR
A Survey on Ensemble Learning for Data Stream Classification
Heitor Murilo Gomes, Jean Paul Barddal, Fabrício Enembreck, and Albert Bifet
Ensemble-based methods are among the most widely used techniques for data stream classification. Their popularity is attributable to their good performance in comparison to strong single learners while being relatively easy to deploy in real-world applications. Ensemble algorithms are especially useful for data stream learning as they can be integrated with drift detection algorithms and incorporate dynamic updates, such as selective removal or addition of classifiers. This work proposes a taxonomy for data stream ensemble learning as derived from reviewing over 60 algorithms. Important aspects such as combination, diversity, and dynamic updates, are thoroughly discussed. Additional contributions include a listing of popular open-source tools and a discussion about current data stream research challenges and how they relate to ensemble learning (big data streams, concept evolution, feature drifts, temporal dependencies, and others).
JSS
A survey on feature drift adaptation: Definition, benchmark, challenges and future directions
Jean Paul Barddal, Heitor Murilo Gomes, Fabrício Enembreck, and Bernhard Pfahringer
Data stream mining is a fast growing research topic due to the ubiquity of data in several real-world problems. Given their ephemeral nature, data stream sources are expected to undergo changes in data distribution, a phenomenon called concept drift. This paper focuses on one specific type of drift that has not yet been thoroughly studied, namely feature drift. Feature drift occurs whenever a subset of features becomes, or ceases to be, relevant to the learning task; thus, learners must detect and adapt to these changes accordingly. We survey existing work on feature drift adaptation with both explicit and implicit approaches. Additionally, we benchmark several algorithms and a naive feature drift detection approach using synthetic and real-world datasets. The results from our experiments indicate the need for future research in this area as even naive approaches produced gains in accuracy while reducing resources usage. Finally, we state current research topics, challenges and future directions for feature drift adaptation.
ML
Adaptive random forests for evolving data stream classification
Heitor Murilo Gomes, Albert Bifet, Jesse Read, Jean Paul Barddal, Fabrício Enembreck, Bernhard Pfharinger, Geoff Holmes, and Talel Abdessalem
Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts of replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF which has no degradation in terms of classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.
ICTAI
Improving Credit Risk Prediction in Online Peer-to-Peer (P2P) Lending Using Imbalanced Learning Techniques
Luis Eduardo Boiko Ferreira, Jean Paul Barddal, Heitor Murilo Gomes, and Fabrício Enembreck
In 29th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2017, Boston, MA, USA, November 6-8, 2017
Peer-to-peer (P2P) lending is a global trend in financial markets that allows individuals to obtain and concede loans without having financial institutions as a strong proxy. Like many real-world applications, P2P lending presents an imbalanced characteristic, where the number of creditworthy loan requests is much larger than the number of non-creditworthy ones. In this work, we wrangle a real-world P2P lending data set from Lending Club, containing a large amount of data gathered from 2007 up to 2016. We analyze how supervised classification models and techniques to handle class imbalance impact creditworthiness prediction rates. Ensembles, cost-sensitive, and sampling methods are combined and evaluated alongside logistic regression, decision tree, and Bayesian learning schemes. Results show that, on average, sampling techniques outperform ensembles and cost-sensitive approaches.
2016
INFSYS
SNCStream+: Extending a high quality true anytime data stream clustering algorithm
Jean Paul Barddal, Heitor Murilo Gomes, Fabrício Enembreck, and Jean-Paul A. Barthès
Data Stream Clustering is an active area of research which requires efficient algorithms capable of finding and updating clusters incrementally as data arrives. On top of that, due to the inherent evolving nature of data streams, it is expected that algorithms undergo both concept drifts and evolutions, which must be taken into account by the clustering algorithm, allowing incremental clustering updates. In this paper we present the Social Network Clusterer Stream+ (SNCStream+). SNCStream+ tackles the data stream clustering problem as a network formation and evolution problem, where instances and micro-clusters form clusters based on homophily. Our proposal has its parameters analyzed and it is evaluated in a broad set of problems against literature baselines. Results show that SNCStream+ achieves superior clustering quality (CMM), and feasible processing time and memory space usage when compared to the original SNCStream and other proposals of the literature.
ICPR
A benchmark of classifiers on feature drifting data streams
Jean Paul Barddal, Heitor Murilo Gomes, Alceu Souza Britto Jr., and Fabrício Enembreck
In 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016
The ever-increasing generation of data confronts both practitioners and researchers with the handling of massive and sequentially generated amounts of information, the so-called data streams. In this context, a lot of effort has been put into the extraction of useful patterns from streaming scenarios. Learning from data streams embeds a variety of problems, and by far the most challenging is concept drift, i.e., changes in data distribution. In this paper, we focus on a specific type of drift uncommonly assessed in the literature: feature drifts. Feature drifts occur whenever a subset of features becomes, or ceases to be, relevant to the concept to be learned. We propose and review several feature drifting data stream generators and use them to benchmark state-of-the-art data stream classification algorithms and their combination with drift detectors. Results show that, although drift detectors enable slightly quicker recovery from feature drifts, the best results are obtained by Hoeffding Adaptive Tree, the only learner that performs dynamic feature selection as streams progress.
ICPR
Overcoming feature drifts via dynamic feature weighted k-nearest neighbor learning
Jean Paul Barddal, Heitor Murilo Gomes, Jones Granatyr, Alceu Souza Britto Jr., and Fabrício Enembreck
In 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016
Extracting useful knowledge from data streams is problematic, mainly due to changes in their data distribution, a phenomenon named concept drift. Recently, studies have shown that most existing algorithms for learning from data streams do not encompass techniques for a specific kind of drift: feature drifts. Feature drifts occur when features become, or cease to be, relevant to the learning task. In this paper, we propose an extension to the k-nearest neighbor classifier in which distance computations weight features according to their current discriminative power. In our proposal, the discriminative power of features is given by entropy, which is swiftly computed over a sliding window. Empirical evidence shows that our approach is able to overcome several existing algorithms in accuracy and feature drift adaptation, at the expense of bounded processing time and memory space.
IJCNN
Towards emotion-based reputation guessing learning agents
Jones Granatyr, Jean Paul Barddal, Adriano Weihmayer Almeida, Fabrício Enembreck, and Adaiane Pereira Santos Granatyr
In 2016 International Joint Conference on Neural Networks, IJCNN 2016, Vancouver, BC, Canada, July 24-29, 2016
Trust and reputation mechanisms are part of the logical protection of intelligent agents, preventing malicious agents from acting egotistically or with the intention to damage others. Several studies in Psychology, Neurology, and Anthropology claim that emotions are part of the human decision-making process. However, there is a lack of understanding about how affective aspects, such as emotions, influence trust or reputation levels of intelligent agents when they are inserted into an information exchange environment, e.g., an evaluation system. In this paper we propose a reputation model that accounts for emotional bounds given by Ekman’s basic emotions and inductive machine learning. Our proposal is evaluated by extracting emotions from texts provided by two online human-fed evaluation systems. Empirical results show significant improvements in agent utility, with p < .05, when compared to non-emotion-wise proposals, thus showing the need for future research in this area.
ECML PKDD
On Dynamic Feature Weighting for Feature Drifting Data Streams
Jean Paul Barddal, Heitor Murilo Gomes, Fabrício Enembreck, Bernhard Pfahringer, and Albert Bifet
In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part II
The ubiquity of data streams has been encouraging the development of new incremental and adaptive learning algorithms. Data stream learners must be fast, memory-bounded, but mainly, tailored to adapt to possible changes in the data distribution, a phenomenon named concept drift. Recently, several works have shown the impact of a so far nearly neglected type of drift: feature drifts. Feature drifts occur whenever a subset of features becomes, or ceases to be, relevant to the learning task. In this paper we (i) provide insights into how the relevance of features can be tracked as a stream progresses according to the information-theoretic Symmetrical Uncertainty; and (ii) show how it can be used to boost two learning schemes: Naive Bayes and k-Nearest Neighbor. Furthermore, we investigate the usage of these two new dynamically weighted learners as prediction models in the leaves of the Hoeffding Adaptive Tree classifier. Results show improvements in accuracy (an average of 10.69% for k-Nearest Neighbor, 6.23% for Naive Bayes, and 4.42% for Hoeffding Adaptive Trees) in both synthetic and real-world datasets at the expense of a bounded increase in both memory consumption and processing time.
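For readers unfamiliar with the measure named in this abstract, Symmetrical Uncertainty between a discrete feature X and the class Y is commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), where H is Shannon entropy and I is mutual information. The sketch below is a minimal illustration of that textbook definition only; it is not the authors' implementation, and the function names `entropy` and `symmetrical_uncertainty` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1].

    Illustrative helper (an assumption, not taken from the paper).
    """
    h_x, h_y = entropy(xs), entropy(ys)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant, so nothing can be shared
    h_xy = entropy(list(zip(xs, ys)))  # joint entropy H(X, Y)
    mutual_information = h_x + h_y - h_xy
    return 2.0 * mutual_information / (h_x + h_y)
```

In a streaming setting, one plausible use (again an assumption) is to recompute this score per feature over the most recent instances, so that a feature whose score falls toward 0 receives less weight.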
2015
IJNCR
Advances on Concept Drift Detection in Regression Tasks Using Social Networks Theory
Jean Paul Barddal, Heitor Murilo Gomes, and Fabrício Enembreck
Mining data streams is one of the main research topics in the machine learning area due to its application in many knowledge areas. One of the major challenges in mining data streams is concept drift, which requires the learner to discard the current concept and adapt to a new one. Ensemble-based drift detection algorithms have been applied successfully to the classification task but usually maintain a fixed-size ensemble of learners, running the risk of needlessly spending processing time and memory. In this paper the authors present improvements to the Scale-free Network Regressor (SFNR), a dynamic ensemble-based method for regression that employs social networks theory. In order to detect concept drifts, SFNR uses the Adaptive Window (ADWIN) algorithm. Results show improvements in accuracy, especially in concept drift situations, and better performance compared to other state-of-the-art algorithms in both real and synthetic data.
ICEIS
Applying Ensemble-based Online Learning Techniques on Crime Forecasting
Anderson José Souza, André Pinz Borges, Heitor Murilo Gomes, Jean Paul Barddal, and Fabrício Enembreck
In ICEIS 2015 - Proceedings of the 17th International Conference on Enterprise Information Systems, Volume 1, Barcelona, Spain, 27-30 April, 2015
Traditional prediction algorithms assume that the underlying concept is stationary, i.e., no changes are expected to happen during the deployment of an algorithm that would render it obsolete. However, in many real-world scenarios, changes in the data distribution, namely concept drifts, are expected to occur due to variations in the hidden context, e.g., new government regulations, climatic changes, or adversary adaptation. In this paper, we analyze the problem of predicting the most susceptible types of victims of crimes that occurred in a large Brazilian city. It is expected that criminals change their victims’ types to counter police methods and vice versa. Therefore, the challenge is to obtain a model capable of adapting rapidly to the current preferred criminal victims, such that police resources can be allocated accordingly. In this type of problem, the most appropriate learning models are provided by data stream mining, since the learning algorithms from this domain assume that concept drifts may occur over time and are ready to adapt to them. In this paper we apply ensemble-based data stream methods, since they provide good accuracy and the ability to adapt to concept drifts. Results show that these ensemble-based algorithms (Leveraging Bagging, SFNClassifier, ADWIN Bagging, and Online Bagging) reach feasible accuracy for this task.
ICONIP
Analyzing the Impact of Feature Drifts in Streaming Learning
Jean Paul Barddal, Heitor Murilo Gomes, and Fabrício Enembreck
In Neural Information Processing - 22nd International Conference, ICONIP 2015, Istanbul, Turkey, November 9-12, 2015, Proceedings, Part I
Learning from data streams requires efficient algorithms capable of deriving a model according to the arrival of new instances. Data streams are by definition unbounded sequences of data that are possibly non-stationary, i.e., they may undergo changes in data distribution, a phenomenon named concept drift. Concept drifts force streaming learning algorithms to detect and adapt to such changes in order to present feasible accuracy throughout time. Nonetheless, most works presented in the literature do not account for a specific kind of drift: feature drifts. Feature drifts occur whenever the relevance of an arbitrary attribute changes through time, also impacting the concept to be learned. In this paper we (i) verify the occurrence of feature drift in a publicly available dataset, (ii) present a synthetic data stream generator capable of performing feature drifts, and (iii) analyze the impact of this type of drift on stream learning algorithms, showing that there is room and need for dynamic feature selection strategies for data streams.
ICONIP
On the Discovery of Time Distance Constrained Temporal Association Rules
Heitor Murilo Gomes, Deborah Ribeiro Carvalho, Lourdes Zubieta, Jean Paul Barddal, and Andreia Malucelli
In Neural Information Processing - 22nd International Conference, ICONIP 2015, Istanbul, Turkey, November 9-12, 2015, Proceedings, Part II
The increased use of data mining algorithms reflects the need for automatic extraction of knowledge from large volumes of data. This work presents a temporal data mining algorithm that discovers frequent Association Rules from timestamped data. These rules are named Cause-Effect Rules, each represented by a multiset of unordered events (Cause) followed by a singleton event (Effect). Also, a Cause-Effect Rule is valid within a specific constraint that defines the minimum and maximum time distance between its Cause and Effect. Our algorithm was tested on a data set from two hospital emergency departments in Sherbrooke, QC, Canada.
ICONIP
A Complex Network-Based Anytime Data Stream Clustering Algorithm
Jean Paul Barddal, Heitor Murilo Gomes, and Fabrício Enembreck
In Neural Information Processing - 22nd International Conference, ICONIP 2015, Istanbul, Turkey, November 9-12, 2015, Proceedings, Part I
Data stream mining is an active area of research that poses challenging research problems. In recent years, a variety of data stream clustering algorithms have been proposed to perform unsupervised learning using a two-step framework. Additionally, dealing with non-stationary, unbounded data streams requires the development of algorithms capable of performing fast and incremental clustering, addressing time and memory limitations without jeopardizing clustering quality. In this paper we present CNDenStream, a one-step data stream clustering algorithm capable of finding non-hyper-spherical clusters which, in contrast to other data stream clustering algorithms, is able to maintain updated clusters after the arrival of each instance by using a complex network construction and evolution model based on homophily. Empirical studies show that CNDenStream is able to surpass other algorithms in clustering quality and requires a feasible amount of resources when compared to other algorithms presented in the literature.
ICTAI
A Survey on Feature Drift Adaptation
Jean Paul Barddal, Heitor Murilo Gomes, and Fabrício Enembreck
In 27th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2015, Vietri sul Mare, Italy, November 9-11, 2015
Data stream mining is a fast growing research topic due to the ubiquity of data in several real-world problems. Given their ephemeral nature, data stream sources are expected to undergo changes in data distribution, a phenomenon called concept drift. This paper focuses on one specific type of drift that has not yet been thoroughly studied, namely feature drift. Feature drift occurs whenever a subset of features becomes, or ceases to be, relevant to the learning task; thus, learners must detect and adapt to these changes accordingly. We survey existing work on feature drift adaptation with both explicit and implicit approaches. Additionally, we benchmark several algorithms and a naive feature drift detection approach using synthetic and real-world datasets. The results from our experiments indicate the need for future research in this area as even naive approaches produced gains in accuracy while reducing resources usage. Finally, we state current research topics, challenges and future directions for feature drift adaptation.
SAC
SNCStream: a social network-based data stream clustering algorithm
Jean Paul Barddal, Heitor Murilo Gomes, and Fabrício Enembreck
In Proceedings of the 30th Annual ACM Symposium on Applied Computing, Salamanca, Spain, April 13-17, 2015
Data Stream Clustering is an active area of research which requires efficient algorithms capable of finding and updating clusters incrementally. On top of that, due to the inherent evolving nature of data streams, it is expected that these algorithms manage to quickly adapt to both concept drifts and the appearance and disappearance of clusters. Nevertheless, many of the developed two-step algorithms are only capable of finding hyper-spherical clusters and are highly dependent on parametrization. In this paper we introduce SNCStream, a one-step online clustering algorithm based on Social Networks Theory, which uses homophily to find non-hyper-spherical clusters. Our empirical studies show that SNCStream is able to surpass density-based algorithms in cluster quality and requires a feasible amount of resources (time and memory) when compared to other algorithms.
SAC
Pairwise combination of classifiers for ensemble learning on data streams
Heitor Murilo Gomes, Jean Paul Barddal, and Fabrício Enembreck
In Proceedings of the 30th Annual ACM Symposium on Applied Computing, Salamanca, Spain, April 13-17, 2015
This work presents two different voting strategies for ensemble learning on data streams based on pairwise combination of component classifiers. Despite efforts to build a diverse ensemble, there is always some degree of overlap between the component classifiers' models. Our voting strategies are aimed at using these overlaps to support ensemble prediction. We hypothesize that by combining pairs of classifiers it is possible to alleviate incorrect individual predictions that would otherwise negatively impact the overall ensemble decision. The first strategy, Pairwise Accuracy (PA), combines the shared accuracy estimation of all possible pairs in the ensemble, while the second strategy, Pairwise Patterns (PP), records patterns of pairwise decisions during training and uses these patterns during prediction. We present empirical results comparing ensemble classifiers with their original voting methods and our proposed methods in both real and synthetic datasets, with and without concept drifts. Our analysis indicates that pairwise voting is able to enhance overall performance for PP, especially on real datasets, and that PA is useful whenever there are noticeable differences in accuracy estimates among ensemble members, which is common during concept drifts.