Marek Sikora - Academia.edu

Papers by Marek Sikora

Classification, Regression, and Survival Rule Induction with Complex and M-of-N Elementary Conditions

Machine Learning and Knowledge Extraction, Mar 5, 2024

This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY

A Sensor Data-Driven Decision Support System for Liquefied Petroleum Gas Suppliers

Applied Sciences

Currently, efficiency in the supply domain and the ability to make quick and accurate decisions and to assess risk properly play a crucial role. The role of a decision support system (DSS) is to support the decision-making process in the enterprise, and for this, it is not enough to have up-to-date data; reliable predictions are necessary. Each application area has its own specificity, and so far, no dedicated DSS for liquefied petroleum gas (LPG) supply has been presented. This study presents a decision support system dedicated to supporting the LPG supply process from the perspective of gas demand analysis. This perspective includes short- and medium-term gas demand prediction, as well as the definition and monitoring of key performance indicators (KPIs). The analysis performed within the system is based exclusively on the collected sensor data; no data from any external enterprise resource planning (ERP) systems are used. Examples of forecasts and KPIs presented in the study show wh...

Gradient Boosting Application in Forecasting of Performance Indicators Values for Measuring the Efficiency of Promotions in FMCG Retail

Proceedings of the 2020 Federated Conference on Computer Science and Information Systems

The paper addresses the problem of forecasting promotion efficiency. The authors propose a new approach that uses the gradient boosting method for this task. Six performance indicators are introduced to capture the promotion effect. For each of them, a model was trained within predefined groups of products. A description of using these models for forecasting and optimising promotion efficiency is provided. The data preparation and hyperparameter tuning processes are also described. The experiments were performed for three groups of products from a large grocery company.

Outlier Detection in Network Traffic Monitoring

Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods, 2021

Network traffic monitoring becomes, year by year, an increasingly important branch of network infrastructure maintenance. There exist many dedicated tools for on-line network traffic monitoring that can defend against the typical (and known) types of attacks by blocking some parts of the traffic immediately. However, there may occur some yet unknown risks in network traffic whose statistical description is reflected as characteristics that change slowly in time. Such slowly changing variable values are probably not detectable by on-line tools. Still, it is possible to detect these changes with data mining methods. In the paper, popular anomaly detection methods applied with a moving window procedure are presented as one of the approaches for anomaly (outlier) detection in network traffic monitoring. The paper presents results obtained on real outer traffic data collected in the Institute. Research context: RegSOC is a specialized Security Operations Centre (SOC), mainly for public institutions. Each SOC is based on three pillars: people, processes and technology. Highly qualified cybersecurity specialists of...
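
The moving window procedure mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the method from the paper: the abstract does not name the detectors used, so a plain z-score against the preceding window is assumed here, and the function name, window size, and threshold are hypothetical.

```python
from statistics import mean, stdev


def moving_window_outliers(values, window=5, threshold=3.0):
    """Return indices of points that deviate strongly from the preceding window.

    Assumption: a z-score rule stands in for the (unspecified) anomaly detectors.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]  # the `window` points just before index i
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Constant history: any different value counts as an outlier.
            if values[i] != mu:
                flagged.append(i)
        elif abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    traffic = [10, 11, 10, 12, 11, 10, 50, 11, 10]  # made-up traffic statistic
    print(moving_window_outliers(traffic))
```

Because each point is compared only with its own recent history, the reference level follows slow drifts in the statistic instead of relying on one global mean.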

Predicting Dangerous Seismic Events: AAIA'16 Data Mining Challenge

Annals of Computer Science and Information Systems, 2016

This paper summarizes AAIA'16 Data Mining Challenge: Predicting Dangerous Seismic Events in Active Coal Mines, which was held between October 5, 2015 and March 4, 2016 at the Knowledge Pit platform. It describes the scope and background of this competition and explains our research objectives which motivated the specific design of the competition rules. The paper also briefly overviews the results of this challenge, showing the way in which those results can help in solving practical problems related to the safety of miners working underground. In particular, our analysis focuses on applications of prediction models in order to facilitate the assessment of seismic hazards, in a situation when the exploration of a given working site has just started and there is very little historical data available.

Application of a hybrid method of machine learning for description and on-line estimation of methane hazard in mine workings

Journal of Mining Science, 2011

The paper presents an application of a hybrid method of methane hazard prediction in exploited mine workings in coal mines. For prediction, the authors used so-called local linear models, the number of which is defined in an adaptive way, and the ARIMA model of time series prediction. The prediction task consists in generating the maximum predicted methane concentration value in a certain time horizon. This forecast is then used to define a methane hazard level by means of a fuzzy system of the Mamdani type. Another important issue covered by the paper is the processing of raw measurement data into an acceptable form using an analytical method, and the adaptation of the model to changing environmental conditions. The experimental part of the paper presents results of data analysis completed for two longwalls.

HuntMi: an efficient and taxon-specific approach in pre-miRNA identification

BMC Bioinformatics, 2013

Background: Machine learning techniques are known to be a powerful way of distinguishing microRNA hairpins from pseudo hairpins and have been applied in a number of recognised miRNA search tools. However, many current methods based on machine learning suffer from some drawbacks, including not addressing the class imbalance problem properly. This may lead to overlearning the majority class and/or incorrect assessment of classification performance. Moreover, those tools are effective for a narrow range of species, usually the model ones. This study aims at improving the performance of the miRNA classification procedure, extending its usability and reducing computational time. Results: We present HuntMi, a stand-alone machine learning miRNA classification tool. We developed a novel method of dealing with the class imbalance problem called ROC-select, which is based on thresholding the score function produced by traditional classifiers. We also introduced new features to the data representation. Several classification algorithms in combination with ROC-select were tested and random forest was selected for the best balance between sensitivity and specificity. Reliable assessment of classification performance is guaranteed by using large, strongly imbalanced, and taxon-specific datasets in a 10-fold cross-validation procedure. As a result, HuntMi achieves a considerably better performance than any other miRNA classification tool and can be applied in miRNA search experiments in a wide range of species. Conclusions: Our results indicate that HuntMi represents an effective and flexible tool for identification of new microRNAs in animals, plants and viruses. The ROC-select strategy proves to be superior to other methods of dealing with the class imbalance problem and can possibly be used in other machine learning classification tasks.
The HuntMi software as well as datasets used in the research are freely available at http://lemur.amu.edu.pl/share/HuntMi/.

Rule-based approximation of black-box classifiers for tabular data to generate global and local explanations

Annals of Computer Science and Information Systems

The need to understand the decision bases of artificial intelligence methods is becoming widespread. One method to obtain explanations of machine learning models and their decisions is the approximation of a complex model treated as a black box by an interpretable rule-based model. Such an approach allows detailed and understandable explanations to be generated from the elementary conditions contained in the rule premises. However, there is a lack of research on the evaluation of such an approximation and the influence of the parameters of the rule-based approximator. In this work, a rule-based approximation of a complex classifier for tabular data is evaluated. Moreover, it was investigated how selected measures of rule quality affect the approximation. The obtained results show what quality of approximation can be expected and indicate which measure of rule quality is worth using in such an application.

Anomaly Detection Module for Network Traffic Monitoring in Public Institutions

Sensors

It seems to be a truism to say that we should pay more and more attention to network traffic safety. Such a goal may be achieved with many different approaches. In this paper, we focus on increasing network traffic safety through continuous monitoring of network traffic statistics and detection of possible anomalies in the network traffic description. The developed solution, called the anomaly detection module, is mostly dedicated to public institutions as an additional component of the network security services. Although the module uses well-known anomaly detection methods, its novelty lies in providing an exhaustive strategy for selecting the best combination of models, as well as in tuning the models in a much faster offline mode. It is worth emphasizing that the combined models were able to achieve a 100% balanced accuracy level in the detection of specific attacks.

Classification supporting COVID-19 diagnostics based on patient survey data

arXiv (Cornell University), Nov 24, 2020

Distinguishing COVID-19 from other flu-like illnesses can be difficult due to ambiguous symptoms and the still limited experience of doctors. At the same time, it is crucial to filter out those sick patients who do not need to be tested for SARS-CoV-2 infection, especially in the event of an overwhelming increase in disease. As a part of the presented research, logistic regression and XGBoost classifiers that allow for effective screening of patients for COVID-19 were generated. Each of the methods was tuned to achieve an assumed acceptable threshold of negative predictive values during classification. Additionally, an explanation of the obtained classification models was presented. The explanation enables the users to understand the basis of the decision made by the model. The obtained classification models provided the basis for the DECODE service (decode.polsl.pl), which can serve as support in screening patients with COVID-19. * Those authors contributed equally to this paper, and should be regarded as co-first authors.
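
Tuning a classifier to an assumed acceptable negative predictive value (NPV) can be sketched in Python as a search over decision thresholds. This is only an illustration under stated assumptions, not the procedure from the paper: the function names, the convention that a score below the threshold means "predicted negative", and the toy data are all hypothetical.

```python
def npv(y_true, scores, threshold):
    """NPV = TN / (TN + FN) over examples predicted negative (score < threshold)."""
    tn = fn = 0
    for label, score in zip(y_true, scores):
        if score < threshold:
            if label == 0:
                tn += 1
            else:
                fn += 1
    if tn + fn == 0:
        return None  # nobody was predicted negative, so NPV is undefined
    return tn / (tn + fn)


def highest_threshold_meeting_npv(y_true, scores, target_npv):
    """Largest candidate threshold whose NPV reaches the target.

    A larger threshold rules out more patients, so the largest admissible
    one is kept. NPV is not monotone in the threshold, hence the full scan.
    """
    best = None
    for candidate in sorted(set(scores)):
        value = npv(y_true, scores, candidate)
        if value is not None and value >= target_npv:
            best = candidate
    return best


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 0, 1, 1]                 # 1 = infected (toy data)
    scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 0.9]   # hypothetical model scores
    print(highest_threshold_meeting_npv(labels, scores, target_npv=0.8))
```

In practice the threshold would be chosen on validation data rather than on the data used for the final evaluation.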

Machine Learning Based Analysis of Relations between Antigen Expression and Genetic Aberrations in Childhood B-Cell Precursor Acute Lymphoblastic Leukaemia

Journal of Clinical Medicine

Flow cytometry (FC) is a standard tool for diagnostics of B-cell precursor acute lymphoblastic leukemia (BCP-ALL), assessing the immunophenotype of blast cells. BCP-ALL is often associated with underlying genetic aberrations that have proven prognostic significance and can impact the disease outcome. Since the determination of patient prognosis is already important at the initial phase of BCP-ALL diagnostics, we aimed to reveal specific genetic aberrations by finding specific multiple antigen expression patterns with FC immunophenotyping. The FC immunophenotype data were analysed using machine learning methods (gradient boosting, decision trees, classification rules). The obtained results were verified with the use of repeated cross-validation. The t(12;21)/ETV6-RUNX1 aberration occurs more often when blasts present high expression of CD10, CD38, low CD34, CD45 and specific low expression of CD81. The t(v;11q23)/KMT2A is associated with positive NG2 expressio...

Separate and conquer heuristic allows robust mining of contrast sets in classification, regression, and survival data

arXiv (Cornell University), Apr 1, 2022

Identifying differences between groups is one of the most important knowledge discovery problems. The procedure, also known as contrast set mining, is applied in a wide range of areas like medicine, industry, or economics. In the paper we present RuleKit-CS, an algorithm for contrast set mining based on separate and conquer, a well-established heuristic for decision rule induction. Multiple passes accompanied by an attribute penalization scheme provide contrast sets describing the same examples with different attributes, distinguishing the presented approach from the standard separate and conquer. The algorithm was also generalized for regression and survival data, allowing identification of contrast sets whose label attribute/survival prognosis is consistent with the label/prognosis for the predefined contrast groups. This feature, not provided by the existing approaches, further extends the usability of RuleKit-CS. Experiments on over 130 data sets from various areas and detailed analysis of selected cases confirmed RuleKit-CS to be a useful tool for discovering differences between defined groups. The algorithm was implemented as a part of the RuleKit suite available at GitHub under GNU AGPL 3 licence (https://github.com/adaa-polsl/RuleKit).

Energy Consumption Forecasting for the Digital-Twin Model of the Building

Energies

The aim of the paper is to propose a new approach to forecast the energy consumption for the next day using the unique data obtained from a digital twin model of a building. In the research, we tested which of the chosen forecasting methods and which set of input data gave the best results. We tested naive methods, linear regression, LSTM and the Prophet method. We found that the Prophet model using information about the total energy consumption and real data about the energy consumption of the top 10 energy-consuming devices gave the best forecast of energy consumption for the following day. In this paper, we also presented a methodology of using decision trees and a unique set of conditional attributes to understand the errors made by the forecast model. This methodology was also proposed to reduce the number of monitored devices. The research that is described in this article was carried out in the context of a project that deals with the development of a digital twin model of a ...

SCARI: Separate and conquer algorithm for action rules and recommendations induction

Information Sciences

This article describes an action rule induction algorithm based on a sequential covering approach. Two variants of the algorithm are presented. The algorithm allows the action rule induction from a source and a target decision class point of view. The application of rule quality measures enables the induction of action rules that meet various quality criteria. The article also presents a method for recommendation induction. The recommendations indicate the actions to be taken to move a given test example, representing the source class, to the target one. The recommendation method is based on a set of induced action rules. The experimental part of the article presents the results of the algorithm operation on sixteen data sets. As a result of the conducted research the Ac-Rules package was made available.

Sensor-Based Predictive Maintenance with Reduction of False Alarms—A Case Study in Heavy Industry

Sensors, 2021

In this paper, the problem of the identification of undesirable events is discussed. Such events can be poorly represented in the historical data, and it is often impossible to learn from past examples. The issue is considered in the context of two use cases in which vibration and temperature measurements collected by wireless sensors are analysed. These use cases include crushers at a coal-fired power plant and gantries in a steelworks converter. The awareness, resulting from the cooperation with industry, of the need for a system that works in cold start conditions and does not flood the machine operator with alarms was the motivation for proposing a new predictive maintenance method. The proposed solution is based on the methods of outlier identification. These methods are applied to the collected data that were transformed into a multidimensional feature vector. The novelty of the proposed solution stems from the creation of a methodology for the red...

MAINE: a web tool for multi-omics feature selection and rule-based data exploration

Bioinformatics, 2021

Summary: Patient multi-omics datasets are often characterized by a high dimensionality; however, usually only a small fraction of the features is informative, that is, a change in their value is directly related to the disease outcome or patient survival. In medical sciences, in addition to a robust feature selection procedure, the ability to discover human-readable patterns in the analyzed data is also desirable. To address this need, we created MAINE (Multi-omics Analysis and Exploration). The unique functionality of MAINE is the ability to discover multidimensional dependencies between the selected multi-omics features and event outcome prediction as well as patient survival probability. Learned patterns are visualized in the form of interpretable decision/survival trees and rules. Availability and implementation: MAINE is freely available at maine.ibemag.pl as an online web application. Supplementary information: Supplementary data are available at Bioinformatics online.

An R package for induction and evaluation of classification rules

The primary goal of this paper is to present an R package for induction and evaluation of classification rules. The implemented rule induction algorithm employs a so-called covering strategy. A unique feature of the algorithm is the possibility of using different rule quality measures during growing and pruning of rules. The presented implementation is one of the first available for the R environment.

Application of rule-based models for seismic hazard prediction in coal mines

The paper presents results of the application of a machine learning method, namely the induction of classification and regression rules, for seismic hazard prediction in coal mines. The main aim of this research was to verify whether machine learning methods would be able to predict seismic hazard more accurately than methods routinely used in Polish coal mines on the basis of data gathered by monitoring systems. In this paper three classification and two regression tasks of prediction of seismic hazards in a longwall were defined. The first part of the paper describes the principles according to which the assessment of seismic hazard in Polish mines is made. These methods are called routine and allow seismic hazard to be assessed for a particular longwall. The next part of the paper discusses the algorithms of classification and regression rule induction and describes their use for seismic hazard assessment. The input data, which are the basis for rule induction, are: measurement data coming fro...

Screening Support System Based on Patient Survey Data—Case Study on Classification of Initial, Locally Collected COVID-19 Data

Applied Sciences, 2021

New diseases constantly endanger the lives of populations, and, nowadays, they can spread easily and constitute a global threat. The COVID-19 pandemic has shown that the fight against a new disease may be difficult, especially at the initial stage of the epidemic, when medical knowledge is not complete and the symptoms are ambiguous. The use of machine learning tools can help to filter out those sick patients who do not need to be tested for spreading the pathogen, especially in the event of an overwhelming increase in disease transmission. This work presents a screening support system that can precisely identify patients who do not carry the disease. The decision of the system is made on the basis of patient survey data that are easy to collect. A case study on a data set of symptomatic COVID-19 patients shows that the system can be effective in the initial phase of the epidemic. The case study presents an analysis of two classifiers that were tuned to achieve an assumed acceptable...

Impact of time series clustering on fuel sales prediction results

Position and Communication Papers of the 16th Conference on Computer Science and Intelligence Systems, 2021

The purpose of the paper is to check the impact of data clustering on the process of predicting demand. We checked different ways of adding information about similar datasets to the forecasting process and we grouped the measurements in multiple ways. The experiments were executed on 50 time series describing fuel sales (gasoline and diesel sales) at 25 petrol stations of an international company. We described the data preparation process and the feature extraction process. In the 9 presented experiments, we used the XGBoost algorithm and some typical time series forecasting methods (ARIMA, moving average). We showed a case study for two datasets and we discussed the practical usage of the tested solutions. The results showed that the solution which used an XGBoost model utilising data gathered from all available petrol stations, in general, worked the best and it outperformed more advanced approaches as well as typical time series methods.
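
The moving average named in this abstract as a typical time series method can be sketched in Python together with a walk-forward error estimate. This is a hedged illustration only: the function names, the window length, and the sales figures are hypothetical, and the paper's actual experimental setup is not reproduced.

```python
def moving_average_forecast(history, window):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def walk_forward_mae(series, window):
    """Mean absolute error of one-step-ahead forecasts, using only past data."""
    errors = []
    for t in range(window, len(series)):
        prediction = moving_average_forecast(series[:t], window)
        errors.append(abs(series[t] - prediction))
    return sum(errors) / len(errors)


if __name__ == "__main__":
    daily_sales = [100, 110, 120, 130, 140]  # made-up fuel sales for one station
    print(moving_average_forecast(daily_sales, window=2))
    print(walk_forward_mae(daily_sales, window=2))
```

A baseline of this kind gives a reference error against which a learned model, such as the XGBoost models described in the abstract, can be compared on the same series.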

Research paper thumbnail of Classification, Regression, and Survival Rule Induction with Complex and M-of-N Elementary Conditions

Machine learning and knowledge extraction, Mar 5, 2024

This article is an open access article distributed under the terms and conditions of the Creative... more This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY

Research paper thumbnail of A Sensor Data-Driven Decision Support System for Liquefied Petroleum Gas Suppliers

Applied Sciences

Currently, efficiency in the supply domain and the ability to make quick and accurate decisions a... more Currently, efficiency in the supply domain and the ability to make quick and accurate decisions and to assess risk properly play a crucial role. The role of a decision support system (DSS) is to support the decision-making process in the enterprise, and for this, it is yet not enough to have up-to-date data; reliable predictions are necessary. Each application area has its own specificity, and so far, no dedicated DSS for liquefied petroleum gas (LPG) supply has been presented. This study presents a decision support system dedicated to support the LPG supply process from the perspective of gas demand analysis. This perspective includes a short- and medium-term gas demand prediction, as well as the definition and monitoring of key performance indicators. The analysis performed within the system is based exclusively on the collected sensory data; no data from any external enterprise resource planning (ERP) systems are used. Examples of forecasts and KPIs presented in the study show wh...

Research paper thumbnail of Gradient Boosting Application in Forecasting of Performance Indicators Values for Measuring the Efficiency of Promotions in FMCG Retail

Proceedings of the 2020 Federated Conference on Computer Science and Information Systems

In the paper, a problem of forecasting promotion efficiency is raised. The authors propose a new ... more In the paper, a problem of forecasting promotion efficiency is raised. The authors propose a new approach, using the gradient boosting method for this task. Six performance indicators are introduced to capture the promotion effect. For each of them, within predefined groups of products, a model was trained. A description of using these models for forecasting and optimising promotion efficiency is provided. Data preparation and hyperparameters tuning processes are also described. The experiments were performed for three groups of products from a large grocery company.

Research paper thumbnail of Outlier Detection in Network Traffic Monitoring

Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods, 2021

Network traffic monitoring becomes, year by year, an increasingly more important branch of networ... more Network traffic monitoring becomes, year by year, an increasingly more important branch of network infrastructure maintenance. There exist many dedicated tools for on-line network traffic monitoring that can defend the typical (and known) types of attacks by blocking some parts of the traffic immediately. However, there may occur some yet unknown risks in network traffic whose statistical description should be reflected as slow-intime changing characteristics. Such non-rapidly changing variable values probably should not be detectable by on-line tools. Still, it is possible to detect these changes with the data mining method. In the paper the popular anomaly detection methods with the application of the moving window procedure are presented as one of the approaches for anomaly (outlier) detection in network traffic monitoring. The paper presents results obtained on the real outer traffic data, collected in the Institute. 2 RESEARCH CONTEXT RegSOC is a specialized Security Operations Centre (SOC), mainly for public institutions. Each SOC is based on three pillars: people, processes and technology. Highly qualified cybersecurity specialists of

Research paper thumbnail of Predicting Dangerous Seismic Events: AAIA'16 Data Mining Challenge

Annals of Computer Science and Information Systems, 2016

This paper summarizes AAIA'16 Data Mining Challenge: Predicting Dangerous Seismic Events in Activ... more This paper summarizes AAIA'16 Data Mining Challenge: Predicting Dangerous Seismic Events in Active Coal Mines which was held between October 5, 2015 and March 4, 2016 at the Knowledge Pit platform. It describes the scope and background of this competition and explains our research objectives which motivated the specific design of the competition rules. The paper also briefly overviews the results of this challenge, showing the way in which those results can help in solving practical problems related to the safety of miners working underground. In particular, our analysis focuses on applications of prediction models in order to facilitate the assessment of seismic hazards, in a situation when the exploration of a given working site has just started and there is very little historical data available.

Research paper thumbnail of Application of a hybrid method of machine learning for description and on-line estimation of methane hazard in mine workings

Journal of Mining Science, 2011

The paper presents application of a hybrid method of methane hazard prediction in exploited mine ... more The paper presents application of a hybrid method of methane hazard prediction in exploited mine workings in coal mines. For prediction, the authors used so-called local linear models, the number of which is defined in an adaptive way, and the model of time series prediction ARIMA. The prediction task consists in generating the maximum predicted methane concentration value in a certain time horizon. This forecast is then used to define a methane hazard level by means of a fuzzy system of the Mamdami type. Another important issue covered by the paper is processing of row measurement data to an acceptable form using analytical method and adaptation of the model to changing environmental conditions. The experimental part of the paper presents results of data analysis completed for two longwalls.

Research paper thumbnail of HuntMi: an efficient and taxon-specific approach in pre-miRNA identification

BMC Bioinformatics, 2013

Background: Machine learning techniques are known to be a powerful way of distinguishing microRNA... more Background: Machine learning techniques are known to be a powerful way of distinguishing microRNA hairpins from pseudo hairpins and have been applied in a number of recognised miRNA search tools. However, many current methods based on machine learning suffer from some drawbacks, including not addressing the class imbalance problem properly. It may lead to overlearning the majority class and/or incorrect assessment of classification performance. Moreover, those tools are effective for a narrow range of species, usually the model ones. This study aims at improving performance of miRNA classification procedure, extending its usability and reducing computational time. Results: We present HuntMi, a stand-alone machine learning miRNA classification tool. We developed a novel method of dealing with the class imbalance problem called ROC-select, which is based on thresholding score function produced by traditional classifiers. We also introduced new features to the data representation. Several classification algorithms in combination with ROC-select were tested and random forest was selected for the best balance between sensitivity and specificity. Reliable assessment of classification performance is guaranteed by using large, strongly imbalanced, and taxon-specific datasets in 10-fold cross-validation procedure. As a result, HuntMi achieves a considerably better performance than any other miRNA classification tool and can be applied in miRNA search experiments in a wide range of species. Conclusions: Our results indicate that HuntMi represents an effective and flexible tool for identification of new microRNAs in animals, plants and viruses. ROC-select strategy proves to be superior to other methods of dealing with class imbalance problem and can possibly be used in other machine learning classification tasks. 
The HuntMi software as well as datasets used in the research are freely available at http://lemur.amu.edu.pl/share/HuntMi/.
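
The abstract describes ROC-select only as thresholding the score function produced by a classifier. The following is a minimal sketch of that general idea, not the HuntMi implementation: it scans candidate thresholds and keeps the one that maximises the geometric mean of sensitivity and specificity. The function name `select_threshold`, the geometric-mean criterion, and the toy scores are all assumptions made for illustration.

```python
from typing import List, Tuple


def select_threshold(scores: List[float], labels: List[int]) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) with the best geometric mean.

    An example is predicted positive when its score >= threshold.
    The geometric-mean criterion is an assumption of this sketch.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    best = (0.0, 0.0, 0.0)
    best_gmean = -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        sensitivity = tp / positives
        specificity = tn / negatives
        gmean = (sensitivity * specificity) ** 0.5
        if gmean > best_gmean:
            best_gmean = gmean
            best = (threshold, sensitivity, specificity)
    return best


# Toy, imbalanced data: two positives among eight examples (made up).
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]
labels = [0, 0, 0, 0, 0, 1, 0, 1]
threshold, sensitivity, specificity = select_threshold(scores, labels)
print(threshold, sensitivity, round(specificity, 3))
```

With a default cut-off of 0.5 the toy classifier would miss one of the two positives; moving the threshold along the ROC curve recovers it at the cost of one false positive, which is the kind of trade-off a thresholding approach to class imbalance is meant to expose.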

Rule-based approximation of black-box classifiers for tabular data to generate global and local explanations

Annals of Computer Science and Information Systems

The need to understand the decision bases of artificial intelligence methods is becoming widespread. One method to obtain explanations of machine learning models and their decisions is the approximation of a complex model treated as a black box by an interpretable rule-based model. Such an approach allows detailed and understandable explanations to be generated from the elementary conditions contained in the rule premises. However, there is a lack of research on the evaluation of such an approximation and the influence of the parameters of the rule-based approximator. In this work, a rule-based approximation of a complex classifier for tabular data is evaluated. Moreover, it was investigated how selected measures of rule quality affect the approximation. The obtained results show what quality of approximation can be expected and indicate which measure of rule quality is worth using in such an application.

Anomaly Detection Module for Network Traffic Monitoring in Public Institutions

Sensors

It seems to be a truism to say that we should pay more and more attention to network traffic safety. Such a goal may be achieved with many different approaches. In this paper, we focus on increasing network traffic safety through continuous monitoring of network traffic statistics and detection of possible anomalies in the network traffic description. The developed solution, called the anomaly detection module, is mostly dedicated to public institutions as an additional component of the network security services. Although well-known anomaly detection methods are used, the novelty of the module lies in providing an exhaustive strategy for selecting the best combination of models as well as tuning the models in a much faster offline mode. It is worth emphasizing that the combined models were able to achieve a 100% balanced accuracy in detecting specific attacks.
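
As a rough illustration of what an exhaustive search over model combinations scored by balanced accuracy could look like, the sketch below enumerates every non-empty subset of detectors and combines each subset by majority vote. The abstract does not say how the module combines models, so the voting rule, the detector names (`iforest`, `lof`, `zscore`) and the toy alarm vectors are assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def balanced_accuracy(predicted: List[int], actual: List[int]) -> float:
    """Mean of sensitivity and specificity for binary labels (1 = attack)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    positives = sum(actual)
    negatives = len(actual) - positives
    return 0.5 * (tp / positives + tn / negatives)


def best_combination(flags: Dict[str, List[int]],
                     actual: List[int]) -> Tuple[Tuple[str, ...], float]:
    """Try every non-empty subset of detectors, combined by majority vote."""
    best_names: Tuple[str, ...] = ()
    best_score = -1.0
    names = sorted(flags)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            votes = [sum(flags[name][i] for name in subset)
                     for i in range(len(actual))]
            # Raise an alarm when strictly more than half of the subset agrees.
            predicted = [1 if 2 * v > len(subset) else 0 for v in votes]
            score = balanced_accuracy(predicted, actual)
            if score > best_score:
                best_names, best_score = subset, score
    return best_names, best_score


# Hypothetical per-window alarms from three detectors (made-up data).
actual = [0, 0, 0, 0, 1, 1, 0, 0]
flags = {
    "iforest": [0, 1, 0, 0, 1, 1, 0, 0],
    "lof":     [0, 0, 0, 0, 1, 0, 0, 1],
    "zscore":  [0, 0, 1, 0, 0, 1, 0, 0],
}
names, score = best_combination(flags, actual)
print(names, score)
```

In this toy case no single detector is perfect, but the three-way vote cancels the individual false alarms. The search is exponential in the number of detectors, which is one reason such tuning would be run offline.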

Classification supporting COVID-19 diagnostics based on patient survey data

arXiv (Cornell University), Nov 24, 2020

Distinguishing COVID-19 from other flu-like illnesses can be difficult because the symptoms are ambiguous and doctors still have limited experience with the disease. At the same time, it is crucial to filter out those sick patients who do not need to be tested for SARS-CoV-2 infection, especially in the event of an overwhelming increase in disease incidence. As part of the presented research, logistic regression and XGBoost classifiers that allow effective screening of patients for COVID-19 were built. Each method was tuned to achieve an assumed acceptable threshold of negative predictive value during classification. Additionally, an explanation of the obtained classification models was presented. The explanation enables users to understand the basis of the decision made by the model. The obtained classification models provided the basis for the DECODE service (decode.polsl.pl), which can serve as support in screening patients with COVID-19.
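
Tuning a classifier to meet an assumed negative predictive value (NPV) can be sketched as a search over decision thresholds. The example below is an illustration under stated assumptions, not the procedure used for DECODE: patients with a score below the threshold are screened out, and the largest threshold whose NPV still meets the target is kept. The helper names `npv_at` and `tune_threshold`, the target value, and the toy data are hypothetical.

```python
from typing import List, Optional


def npv_at(threshold: float, scores: List[float], labels: List[int]) -> Optional[float]:
    """NPV = TN / (TN + FN) when patients with score < threshold are screened out."""
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tn + fn == 0:
        return None  # nobody is screened out at this threshold
    return tn / (tn + fn)


def tune_threshold(scores: List[float], labels: List[int],
                   target_npv: float) -> Optional[float]:
    """Largest candidate threshold whose NPV meets the target, or None."""
    best = None
    # NPV is not monotone in the threshold, so every candidate is checked.
    for threshold in sorted(set(scores)):
        npv = npv_at(threshold, scores, labels)
        if npv is not None and npv >= target_npv:
            best = threshold  # candidates ascend, so the last hit is the largest
    return best


# Made-up predicted probabilities of infection and true labels (1 = infected).
scores = [0.02, 0.05, 0.10, 0.15, 0.30, 0.45, 0.70, 0.90]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
threshold = tune_threshold(scores, labels, target_npv=0.95)
print(threshold, npv_at(threshold, scores, labels))
```

A larger threshold screens out more patients, so picking the largest one that satisfies the NPV constraint maximises the number of avoided tests. In practice the threshold would be chosen on validation data rather than on the data used for the final evaluation.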

Machine Learning Based Analysis of Relations between Antigen Expression and Genetic Aberrations in Childhood B-Cell Precursor Acute Lymphoblastic Leukaemia

Journal of Clinical Medicine

Flow cytometry (FC) is a standard tool in the diagnostics of B-cell precursor acute lymphoblastic leukemia (BCP-ALL), used to assess the immunophenotype of blast cells. BCP-ALL is often associated with underlying genetic aberrations that have proven prognostic significance and can impact the disease outcome. Since the determination of patient prognosis is already important at the initial phase of BCP-ALL diagnostics, we aimed to reveal specific genetic aberrations by finding specific multiple antigen expression patterns with FC immunophenotyping. The FC immunophenotype data were analysed using machine learning methods (gradient boosting, decision trees, classification rules). The obtained results were verified with the use of repeated cross-validation. The t(12;21)/ETV6-RUNX1 aberration occurs more often when blasts present high expression of CD10, CD38, low CD34, CD45 and specific low expression of CD81. The t(v;11q23)/KMT2A is associated with positive NG2 expressio...

Separate and conquer heuristic allows robust mining of contrast sets in classification, regression, and survival data

arXiv (Cornell University), Apr 1, 2022

Identifying differences between groups is one of the most important knowledge discovery problems. The procedure, also known as contrast set mining, is applied in a wide range of areas like medicine, industry, or economics. In the paper we present RuleKit-CS, an algorithm for contrast set mining based on separate and conquer, a well-established heuristic for decision rule induction. Multiple passes accompanied by an attribute penalization scheme provide contrast sets describing the same examples with different attributes, distinguishing the presented approach from standard separate and conquer. The algorithm was also generalized to regression and survival data, allowing identification of contrast sets whose label attribute/survival prognosis is consistent with the label/prognosis for the predefined contrast groups. This feature, not provided by the existing approaches, further extends the usability of RuleKit-CS. Experiments on over 130 data sets from various areas and detailed analysis of selected cases confirmed RuleKit-CS to be a useful tool for discovering differences between defined groups. The algorithm was implemented as a part of the RuleKit suite available on GitHub under the GNU AGPL 3 licence (https://github.com/adaa-polsl/RuleKit).

Energy Consumption Forecasting for the Digital-Twin Model of the Building

Energies

The aim of the paper is to propose a new approach to forecast the energy consumption for the next day using the unique data obtained from a digital twin model of a building. In the research, we tested which of the chosen forecasting methods and which set of input data gave the best results. We tested naive methods, linear regression, LSTM and the Prophet method. We found that the Prophet model using information about the total energy consumption and real data about the energy consumption of the top 10 energy-consuming devices gave the best forecast of energy consumption for the following day. In this paper, we also presented a methodology of using decision trees and a unique set of conditional attributes to understand the errors made by the forecast model. This methodology was also proposed to reduce the number of monitored devices. The research that is described in this article was carried out in the context of a project that deals with the development of a digital twin model of a ...

SCARI: Separate and conquer algorithm for action rules and recommendations induction

Information Sciences

This article describes an action rule induction algorithm based on a sequential covering approach. Two variants of the algorithm are presented. The algorithm allows action rules to be induced from the point of view of a source and a target decision class. The application of rule quality measures enables the induction of action rules that meet various quality criteria. The article also presents a method for recommendation induction. The recommendations indicate the actions to be taken to move a given test example, representing the source class, to the target one. The recommendation method is based on a set of induced action rules. The experimental part of the article presents the results of running the algorithm on sixteen data sets. As a result of the conducted research, the Ac-Rules package was made available.

Sensor-Based Predictive Maintenance with Reduction of False Alarms—A Case Study in Heavy Industry

Sensors, 2021

In this paper, the problem of the identification of undesirable events is discussed. Such events can be poorly represented in the historical data, and in most cases it is impossible to learn from past examples. The issue is considered in the context of two use cases in which vibration and temperature measurements collected by wireless sensors are analysed. These use cases include crushers at a coal-fired power plant and gantries in a steelworks converter. The awareness, resulting from the cooperation with industry, of the need for a system that works in cold start conditions and does not flood the machine operator with alarms was the motivation for proposing a new predictive maintenance method. The proposed solution is based on the methods of outlier identification. These methods are applied to the collected data that was transformed into a multidimensional feature vector. The novelty of the proposed solution stems from the creation of a methodology for the red...

MAINE: a web tool for multi-omics feature selection and rule-based data exploration

Bioinformatics, 2021

Summary: Patient multi-omics datasets are often characterized by a high dimensionality; however, usually only a small fraction of the features is informative, that is, a change in their value is directly related to the disease outcome or patient survival. In medical sciences, in addition to a robust feature selection procedure, the ability to discover human-readable patterns in the analyzed data is also desirable. To address this need, we created MAINE (Multi-omics Analysis and Exploration). The unique functionality of MAINE is the ability to discover multidimensional dependencies between the selected multi-omics features and event outcome prediction as well as patient survival probability. Learned patterns are visualized in the form of interpretable decision/survival trees and rules. Availability and implementation: MAINE is freely available at maine.ibemag.pl as an online web application. Supplementary information: Supplementary data are available at Bioinformatics online.

An R package for induction and evaluation of classification rules

The primary goal of this paper is to present an R package for induction and evaluation of classification rules. The implemented rule induction algorithm employs a so-called covering strategy. A unique feature of the algorithm is the possibility of using different rule quality measures during growing and pruning of rules. The presented implementation is one of the first available for the R environment.
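
The covering strategy mentioned above can be illustrated with a generic sequential covering loop. The sketch below is written in Python rather than R and is not the package's algorithm: it grows each rule greedily using precision as the only quality measure, omits pruning entirely, and works on nominal attributes only. The names `covers`, `quality`, `grow_rule`, `induce_rules` and the toy data set are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Example = Tuple[Dict[str, str], str]  # (attribute values, class label)
Rule = Dict[str, str]                 # conjunction of attribute == value tests


def covers(rule: Rule, attributes: Dict[str, str]) -> bool:
    return all(attributes.get(a) == v for a, v in rule.items())


def quality(rule: Rule, examples: List[Example], target: str) -> Tuple[float, int]:
    """(precision, covered positives); precision is the assumed quality measure."""
    covered = [label for attrs, label in examples if covers(rule, attrs)]
    positives = sum(1 for label in covered if label == target)
    return (positives / len(covered) if covered else 0.0, positives)


def grow_rule(examples: List[Example], target: str) -> Rule:
    """Greedily add the condition that most improves rule quality."""
    rule: Rule = {}
    best = quality(rule, examples, target)
    while best[0] < 1.0:
        # Candidate conditions come from positive examples the rule still covers.
        candidates = {(a, v) for attrs, label in examples
                      if label == target and covers(rule, attrs)
                      for a, v in attrs.items() if a not in rule}
        scored = [(quality({**rule, a: v}, examples, target), a, v)
                  for a, v in sorted(candidates)]
        if not scored:
            break
        top, a, v = max(scored)
        if top <= best:
            break  # no condition improves the rule any further
        rule[a] = v
        best = top
    return rule


def induce_rules(examples: List[Example], target: str) -> List[Rule]:
    remaining = list(examples)
    rules: List[Rule] = []
    while any(label == target for _, label in remaining):
        rule = grow_rule(remaining, target)   # conquer
        rules.append(rule)
        # Separate: drop the positive examples the new rule covers.
        remaining = [(attrs, label) for attrs, label in remaining
                     if not (label == target and covers(rule, attrs))]
    return rules


# Made-up toy data set.
examples: List[Example] = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "overcast", "windy": "no"}, "play"),
    ({"outlook": "overcast", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"}, "stay"),
]
rules = induce_rules(examples, "play")
print(rules)
```

Swapping `quality` for a different measure changes which conditions are preferred during growing; an implementation following the abstract would allow a second, separately chosen measure for the pruning phase that this sketch leaves out.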

Application of rule-based models for seismic hazard prediction in coal mines

The paper presents the results of applying a machine learning method, namely the induction of classification and regression rules, to seismic hazard prediction in coal mines. The main aim of this research was to verify whether machine learning methods would be able to predict seismic hazard more accurately than methods routinely used in Polish coal mines on the basis of data gathered by monitoring systems. In this paper, three classification and two regression tasks of prediction of seismic hazards in a longwall were defined. The first part of the paper describes the principles according to which the assessment of seismic hazard in Polish mines is made. These methods are called routine and allow seismic hazard to be assessed for a particular longwall. The next part of the paper discusses the algorithms of classification and regression rule induction and describes their use for seismic hazard assessment. The input data, which are the basis for rule induction, are: measurement data coming fro...

Screening Support System Based on Patient Survey Data—Case Study on Classification of Initial, Locally Collected COVID-19 Data

Applied Sciences, 2021

New diseases constantly endanger the lives of populations, and, nowadays, they can spread easily and constitute a global threat. The COVID-19 pandemic has shown that the fight against a new disease may be difficult, especially at the initial stage of the epidemic, when medical knowledge is not complete and the symptoms are ambiguous. The use of machine learning tools can help to filter out those sick patients who do not need to be tested for spreading the pathogen, especially in the event of an overwhelming increase in disease transmission. This work presents a screening support system that can precisely identify patients who do not carry the disease. The decision of the system is made on the basis of patient survey data that are easy to collect. A case study on a data set of symptomatic COVID-19 patients shows that the system can be effective in the initial phase of the epidemic. The case study presents an analysis of two classifiers that were tuned to achieve an assumed acceptable...

Impact of time series clustering on fuel sales prediction results

Position and Communication Papers of the 16th Conference on Computer Science and Intelligence Systems, 2021

The purpose of the paper is to check the impact of data clustering in the process of predicting demand. We checked different ways of adding information about similar datasets to the forecasting process and we grouped the measurements in multiple ways. The experiments were executed on 50 time series describing fuel sales (gasoline and diesel sales) at 25 petrol stations from an international company. We described the data preparation process and the feature extraction process. In the 9 presented experiments, we used the XGBoost algorithm and some typical time series forecasting methods (ARIMA, moving average). We showed a case study for two datasets and we discussed the practical usage of the tested solutions. The results showed that the solution which used an XGBoost model utilising data gathered from all available petrol stations, in general, worked the best and outperformed more advanced approaches as well as typical time series methods.