Gianpiero Bianchi - Academia.edu

Papers by Gianpiero Bianchi

Optimization techniques for data mining and information reconstruction

Presentation to SIS 2016 48th Scientific Meeting of the paper "Machine learning and statistical inference: the case of Istat survey on ICT"

A study of VTL: Usability and re-usability of VTL

This chapter discusses the usability aspects of VTL.

A robust procedure based on forward search to detect outliers

It is now widely recognized that the presence of outliers or errors in the data collection process can affect the results of any statistical analysis. The effect is likely to be even more severe for complex surveys such as a Census. In the context of the VI Italian agriculture census, ISTAT used a robust procedure based on the Forward Search to detect cases in which the information collected by the census was not in agreement with that coming from the General Agency for Agricultural Subsidies (AGEA). The checks concerned total agricultural area (SAT), utilized agricultural area (SAU), and land for vineyards and olive groves. The outliers were then passed to subject-matter experts of the regions for further investigation. This process made it possible to significantly improve the quality of the data both in the Agriculture census and in the AGEA archives. This paper summarizes how ISTAT tackled the problems of data control and correction, and discusses the methodological problems found d...
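
The Forward Search procedure itself is only named above; a minimal sketch of the idea, in Python under strong simplifying assumptions (a single linear relation between the AGEA value and the census value, with all variable names and thresholds invented here), might look as follows.

```python
import numpy as np

def forward_search_outliers(agea, census, initial_frac=0.5, threshold=3.0):
    """Sketch of a Forward Search: start from a clean subset, grow it one unit
    at a time, and flag the remaining units once the next entrant's residual is
    too large relative to the robust scale of the current subset."""
    x, y = np.asarray(agea, float), np.asarray(census, float)
    n = len(x)
    coef = np.polyfit(x, y, 1)                      # initial least-squares fit
    resid = np.abs(y - np.polyval(coef, x))
    m = max(2, int(initial_frac * n))
    subset = np.argsort(resid)[:m]                  # cleanest initial subset
    flags = np.zeros(n, dtype=bool)
    while m < n:
        coef = np.polyfit(x[subset], y[subset], 1)  # refit on the clean subset
        resid = np.abs(y - np.polyval(coef, x))
        scale = 1.4826 * np.median(resid[subset]) + 1e-12
        order = np.argsort(resid)
        if resid[order[m]] / scale > threshold:     # next entrant looks anomalous
            flags[order[m:]] = True                 # flag it and all later entrants
            break
        subset, m = order[:m + 1], m + 1
    return np.where(flags)[0]
```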

Robustness Analysis of a Website Categorization Procedure based on Machine Learning

Website categorization has recently emerged as a very important task in several contexts. A huge amount of information is freely available through websites, and it could be used to accomplish statistical surveys, saving the cost of the surveys, or to validate already surveyed data. However, the information of interest for the specific categorization has to be mined from that huge amount, and this turns out to be a difficult task in practice. This work describes techniques that can be used to convert website categorization into a supervised classification problem. To do so, each data record should summarize the content of an entire website. We generate records of this kind by using web scraping and optical character recognition, followed by a number of automated feature engineering steps. Once such records have been produced, we apply state-of-the-art classification techniques to categorize the websites according to the aspect of interest. We use Support Vector Machines, Random Forests...
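
As a rough illustration of the final classification step only (not the paper's exact pipeline; texts, labels and parameters below are invented), a TF-IDF representation of each scraped website fed to a linear Support Vector Machine in scikit-learn could be set up like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# one string per website (all scraped pages concatenated) and its category
texts = ["add to cart secure checkout free shipping",
         "our research group publishes open access articles"]
labels = ["e-commerce", "no e-commerce"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),  # words and bigrams
    LinearSVC(C=1.0),
)
model.fit(texts, labels)
print(model.predict(["shopping cart and online payment available"]))
```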

Text mining and machine learning techniques for text classification, with application to the automatic categorization of websites

Website categorization has recently emerged as a very important task in several contexts. A huge amount of information is freely available through websites, and it could for instance be used to accomplish statistical surveys. However, the information of interest for the specific task under consideration has to be mined from that huge amount, and this turns out to be a difficult operation in practice. This work describes techniques that can be used to convert website categorization into a supervised classification problem. Each data record should summarize the content of an entire website. Records are obtained by using web scraping procedures, followed by a number of feature extraction and selection steps. Once such records are completed, we apply state-of-the-art classification techniques to categorize the websites according to the aspect of interest. Since in many practical cases the labels available for the training set may be noisy, we also analyze the robustness of our procedure...
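
A simple way to probe robustness to noisy training labels (a hypothetical experimental protocol, not necessarily the one adopted in the paper) is to flip a growing fraction of labels and watch how cross-validated accuracy degrades:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def accuracy_under_label_noise(model, X, y, noise_levels=(0.0, 0.05, 0.1, 0.2), seed=0):
    """Flip a fraction p of the labels at random and report the mean
    5-fold cross-validated accuracy for each p (sketch)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes = np.unique(y)
    scores = {}
    for p in noise_levels:
        y_noisy = y.copy()
        for i in np.where(rng.random(len(y)) < p)[0]:
            y_noisy[i] = rng.choice(classes[classes != y[i]])  # assign a wrong label
        scores[p] = cross_val_score(model, X, y_noisy, cv=5).mean()
    return scores
```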

Criteri e metodi per la determinazione ex-ante del campo di osservazione del Censimento dell’Agricoltura 2010

The problem of the optimal selection of the field of observation of the 2010 Italian Census of Agriculture was formulated as a knapsack problem and solved through the generation of cuts derived from minimal covers. The results obtained are very satisfactory.
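
A minimal cover cut for a knapsack constraint sum_j a_j x_j <= b is the inequality sum_{j in C} x_j <= |C| - 1, valid for any cover C with sum_{j in C} a_j > b. A greedy separation routine of the kind used inside a cutting-plane loop (a hedged sketch; the paper's actual generation scheme may differ) could be:

```python
def minimal_cover_cut(a, b, x_frac):
    """Try to find a minimal cover C for the knapsack constraint
    sum_j a[j] * x[j] <= b whose cut  sum_{j in C} x[j] <= |C| - 1
    is violated by the fractional solution x_frac. Returns C or None."""
    order = sorted(range(len(a)), key=lambda j: 1.0 - x_frac[j])  # prefer x_j near 1
    cover, weight = [], 0.0
    for j in order:
        cover.append(j)
        weight += a[j]
        if weight > b:                       # the set is now a cover
            break
    else:
        return None                          # no cover exists at all
    for j in sorted(cover, key=lambda j: a[j]):  # make the cover minimal
        if weight - a[j] > b:
            cover.remove(j)
            weight -= a[j]
    if sum(x_frac[j] for j in cover) > len(cover) - 1 + 1e-6:  # violated?
        return cover
    return None

# toy example: weights 4, 3, 3, capacity 6, fractional solution (1.0, 0.9, 0.4)
print(minimal_cover_cut([4, 3, 3], 6, [1.0, 0.9, 0.4]))   # -> [0, 1], i.e. x0 + x1 <= 1
```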

Machine learning and statistical inference: the case of Istat survey on ICT

Istat is experimenting with web scraping, text mining and machine learning techniques in order to obtain a subset of the estimates currently produced by the sampling survey “ICT usage and e-Commerce in Enterprises”, carried out yearly by Istat and by the other EU member states. Target estimates of this survey include the characteristics of the websites used by enterprises to present their business (for instance, whether the website offers e-commerce facilities or job vacancies). The aim of the experiment is to evaluate the possibility of using the sample of surveyed data as a training set in order to fit models that will then be applied to the whole population of websites. The usefulness of such an approach is twofold: (i) to enrich the information available in the Business Register, and (ii) to increase the quality of the estimates produced by the survey. These different objectives can be reached by combining web scraping procedures with text mining and machine learning techniques,...

The corporate identity of Italian Universities on the Web: a webometrics approach

In parallel with the increasing marketisation and globalisation of higher education, Universities’ corporate websites have become institutional virtual storefronts, contributing largely to reinforcing the organisations’ brand, disseminating information on their main achievements and communicating with both enrolled students and potential “customers” worldwide. Thus, the effectiveness of Universities’ websites in delivering value, in terms of information on the organisations’ activities and of interaction with actual and potential students as well as with partner institutions in education and research projects, is to be regarded as a key objective of all Universities. The level of accomplishment of this task, measured so far mostly on a case-study basis, can be more extensively surveyed by adopting a webometric approach that combines web analytics, as indicators of efficiency, with selected indicators of content collected through web scraping techniques. This approach has been tested on the...

Exploring the Potentialities of Automatic Extraction of University Webometric Information

Journal of Data and Information Science, 2020

Purpose: The main objective of this work is to show the potential of recently developed approaches for automatic knowledge extraction directly from universities’ websites. The information extracted automatically can potentially be updated more often than once per year, and is safe from manipulations or misinterpretations. Moreover, this approach gives us flexibility in collecting indicators about the efficiency of universities’ websites and their effectiveness in disseminating key contents. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions that allow new insights for “profiling” the analyzed universities.
Design/methodology/approach: Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach...
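
For the content side of such indicators, a toy version of the extraction step (the URL, the term list and the single-page scope are assumptions made purely for illustration) might be:

```python
import requests
from bs4 import BeautifulSoup

INDICATOR_TERMS = ["admission", "research", "phd", "erasmus", "library"]  # hypothetical

def content_indicators(url, timeout=10):
    """Download one page and count occurrences of each indicator term in its
    visible text (sketch; a real webometric crawler would visit many pages)."""
    html = requests.get(url, timeout=timeout).text
    text = BeautifulSoup(html, "html.parser").get_text(" ").lower()
    return {term: text.count(term) for term in INDICATOR_TERMS}

# content_indicators("https://www.example-university.edu")  # hypothetical URL
```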

Identifying e-Commerce in Enterprises by means of Text Mining and Classification Algorithms

Mathematical Problems in Engineering, 2018

Monitoring specific features of enterprises, for example the adoption of e-commerce, is an important and basic task for several economic activities. This type of information is usually obtained by means of surveys, which are costly due to the amount of personnel involved in the task. Automatic detection of this information would allow considerable savings. This can actually be performed by relying on computer engineering, since in general this information is publicly available online through the corporate websites. This work describes how to convert the detection of e-commerce into a supervised classification problem, where each record is obtained from the automatic analysis of one corporate website, and the class is the presence or absence of e-commerce facilities. The automatic generation of such data records requires several Text Mining phases; in particular, we compare six strategies based on the selection of best words and best n-grams. After this, we classify...
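
None of the six strategies is reproduced here, but a generic "best n-grams" selection (chi-squared scoring over counts of word 1-2-grams; every parameter below is a placeholder) can be wired up as:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# keep the k word 1-2-grams most associated with the e-commerce label, then
# classify; train with pipeline.fit(website_texts, has_ecommerce_labels)
pipeline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=2),
    SelectKBest(chi2, k=1000),
    LogisticRegression(max_iter=1000),
)
```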

Website categorization: A formal approach and robustness analysis in the case of e-commerce detection

Expert Systems with Applications, 2019

Website categorization has recently emerged as a very important task in several contexts. A huge amount of information is freely available through websites, and it could be used, for example, to accomplish statistical surveys at lower cost. However, the information of interest for the specific categorization has to be mined from that huge amount, and this turns out to be a difficult task in practice. In this work we propose a practically viable procedure to perform website categorization...

Logical Analysis of Data as a tool for the analysis of Probabilistic Discrete Choice Behavior

Computers & Operations Research, 2018

Probabilistic Discrete Choice Models (PDCM) have been extensively used to interpret the behavior of heterogeneous decision makers who face discrete alternatives. The classification approach of Logical Analysis of Data (LAD) uses discrete optimization to generate patterns, which are logic formulas characterizing the different classes. Patterns can be seen as rules explaining the phenomenon under analysis. In this work we discuss how LAD can be used as the first phase of the specification of PDCM. Since in this task the number of patterns generated may be extremely large, and many of them may be nearly equivalent, additional processing is necessary to obtain practically meaningful information. Hence, we propose computationally viable techniques to obtain small sets of patterns that constitute meaningful representations of the phenomenon and allow the discovery of significant associations between subsets of explanatory variables and the output. We consider the complex socioeconomic problem of the analysis of the utilization of the Internet in Italy, using real data gathered by the Italian National Institute of Statistics.
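
A LAD pattern is essentially a conjunction of binary conditions; the sketch below (with invented binarised features about Internet use) shows how the coverage of a pattern over positive and negative records could be evaluated:

```python
def pattern_coverage(pattern, records, labels, positive_class):
    """A pattern is a list of (feature, required_value) literals. Return how
    many positive and negative records it covers (sketch)."""
    def covers(r):
        return all(r[f] == v for f, v in pattern)
    pos = sum(1 for r, y in zip(records, labels) if y == positive_class and covers(r))
    neg = sum(1 for r, y in zip(records, labels) if y != positive_class and covers(r))
    return pos, neg

# hypothetical binarised records on Internet utilization
records = [{"daily_use": 1, "broadband": 1, "under_40": 1},
           {"daily_use": 0, "broadband": 0, "under_40": 0}]
labels = ["user", "non-user"]
print(pattern_coverage([("daily_use", 1), ("broadband", 1)], records, labels, "user"))
```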

A min-cut approach to functional regionalization, with a case study of the Italian local labour market areas

Optimization Letters, 2015

In several economic, statistical and geographical applications, a territory must be subdivided into functional regions. Such regions are not fixed and politically delimited, but should be identified by analyzing the interactions among the territory's constituent localities. This is a very delicate and important task that often turns out to be computationally difficult. In this work we propose an innovative approach to this problem based on the solution of minimum cut problems over an undirected graph, called here the transitions graph. The proposed procedure guarantees that the obtained regions satisfy all the statistical conditions required when considering this type of problem. Results on real-world instances show the effectiveness of the proposed approach.
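
The paper's full procedure is not reproduced here, but its core operation, a global minimum cut of an undirected weighted transitions graph, can be sketched with networkx (localities and commuting flows below are invented):

```python
import networkx as nx

# transitions graph: nodes are localities, edge weights are commuting flows
G = nx.Graph()
G.add_weighted_edges_from([("A", "B", 120), ("B", "C", 95),
                           ("C", "D", 15), ("D", "E", 110)])

# global minimum cut: the weakest separation of the territory into two parts
cut_value, (part1, part2) = nx.stoer_wagner(G)
print(cut_value, sorted(part1), sorted(part2))
# a recursive bipartition would then continue until the required statistical
# conditions (e.g. minimum size, self-containment) are satisfied
```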

A combinatorial optimization approach to the selection of statistical units

Journal of Industrial and Management Optimization, 2015

In the case of some large statistical surveys, the set of units that will constitute the scope of the survey must be selected. We focus on the real case of a Census of Agriculture, where the units are farms. Surveying each unit has a cost and brings a different portion of the whole information. In this case, one wants to determine a subset of units with minimum total surveying cost that represents at least a certain portion of the total information. Uncertainty also arises, because the portion of information corresponding to each unit is not perfectly known before surveying it. The proposed approach is based on combinatorial optimization, and the arising decision problems are modeled as multidimensional binary knapsack problems. Experimental results show the effectiveness of the proposed approach.
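
A toy version of the resulting model (minimum survey cost subject to covering a required share of the information along each dimension; all figures below are made up) can be written with PuLP as:

```python
import pulp

costs = [4, 7, 3, 6, 5]                    # cost of surveying each farm
info = [[0.30, 0.10, 0.25, 0.20, 0.15],    # share of information, dimension 1
        [0.10, 0.35, 0.15, 0.25, 0.15]]    # share of information, dimension 2
coverage = 0.80                            # required portion per dimension

prob = pulp.LpProblem("unit_selection", pulp.LpMinimize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(costs))]
prob += pulp.lpSum(c * xi for c, xi in zip(costs, x))              # total cost
for dim in info:                                                   # coverage constraints
    prob += pulp.lpSum(a * xi for a, xi in zip(dim, x)) >= coverage
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([int(xi.value()) for xi in x], pulp.value(prob.objective))
```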

Effective Classification Using a Small Training Set Based on Discretization and Statistical Analysis

IEEE Transactions on Knowledge and Data Engineering, 2015

This work deals with the problem of producing a fast and accurate data classification, learning it from a possibly small set of records that are already classified. The proposed approach is based on the framework of the so-called Logical Analysis of Data (LAD), enriched with information obtained from statistical considerations on the data. A number of discrete optimization problems are solved in the different steps of the procedure, but their computational demand can be controlled. The accuracy of the proposed approach is compared to that of the standard LAD algorithm, of Support Vector Machines and of the Label Propagation algorithm on publicly available datasets from the UCI repository. Encouraging results are obtained and discussed.
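
The discretization step that turns numeric features into the Boolean variables used by LAD-style methods can be illustrated as follows (values and bin counts are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# toy numeric records; each original feature becomes a few 0/1 columns
X = np.array([[1.2, 30.0], [2.4, 42.5], [0.7, 29.1], [3.1, 55.0]])
disc = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="quantile")
X_bin = disc.fit_transform(X)   # one binary column per (feature, bin) pair
print(X_bin)
```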

Open Source Integer Linear Programming Solvers for Error Localization in Numerical Data

Advances in Theoretical and Applied Statistics, 2013

Error localization problems can be converted into Integer Linear Programming problems. This approach provides several advantages and guarantees finding a set of erroneous fields with minimum total cost. In this approach, each erroneous record produces an Integer Linear Programming model that must be solved, which requires specific solution software called Integer Linear Programming solvers. Some of these solvers are available as open source software. A study on the performance of internationally recognized open source Integer Linear Programming solvers, compared to a reference commercial solver on real-world data containing only numerical fields, is reported. The aim was to produce a stress-test environment for selecting the most appropriate open source solver for performing error localization in numerical data.
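
For a single numerical record and one linear edit rule, the Integer Linear Programming model can be sketched as below (field names, values, costs and the big-M bound are placeholders, not the data used in the study):

```python
import pulp

record = {"a": 120.0, "b": 80.0, "total": 230.0}   # violates the edit a + b == total
cost = {"a": 1.0, "b": 1.0, "total": 1.0}
M = 1e6                                            # bound on any correction

prob = pulp.LpProblem("error_localization", pulp.LpMinimize)
x = {f: pulp.LpVariable(f"x_{f}", lowBound=0) for f in record}    # corrected values
y = {f: pulp.LpVariable(f"y_{f}", cat="Binary") for f in record}  # field is changed
prob += pulp.lpSum(cost[f] * y[f] for f in record)                # minimum total cost
prob += x["a"] + x["b"] == x["total"]                             # edit rule must hold
for f in record:                                                  # link changes to flags
    prob += x[f] - record[f] <= M * y[f]
    prob += record[f] - x[f] <= M * y[f]
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({f: int(y[f].value()) for f in record})      # minimum-cost set of erroneous fields
```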

Data Clustering for Improving the Selection of Donor for Data Imputation

This work concerns the automatic imputation of data performed by means of error-free records, called donors. For each erroneous record, a number of donors with particular characteristics must be selected. When this selection has to be carried out within very large pools of potential donors, as in the case of a population census, computing times can become excessive. In order to reduce the number of potential donors to be examined, the innovative use of a clustering procedure is proposed here. The set of potential donors is divided into many subsets, so that elements of the same subset have similar characteristics. In particular, an algorithm for the clustering of demographic data has been developed. The results are very satisfactory, both in terms of data quality and computationally.
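
A minimal version of this idea with k-means (the feature encoding, the number of clusters and all names are assumptions) is:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_donor_index(donors, n_clusters=50, seed=0):
    """Partition the pool of error-free records (donors) so that each donor
    search scans only one cluster instead of the whole pool (sketch)."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(donors)

def candidate_donors(km, donors, recipient):
    """Return the donors lying in the same cluster as the recipient record."""
    cluster = km.predict(np.asarray(recipient, float).reshape(1, -1))[0]
    return donors[km.labels_ == cluster]
```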

Information reconstruction via discrete optimization for agricultural census data

In the case of large-scale surveys, such as a Census, data may contain errors or missing values. An automatic error detection and correction procedure is therefore needed. We propose here an approach to this problem based on Discrete Optimization. The treatment of each data record is converted into a mixed integer linear programming model and solved by means of state-of-the-art branch-and-cut procedures. Results on real-world Agricultural Census data show the effectiveness of the proposed procedure.

Balancing of agricultural census data by using discrete optimization

Optimization Letters, 2013

In the case of large-scale surveys, such as a Census, data may contain errors or missing values. An automatic error correction procedure is therefore needed. We focus on the problem of restoring the consistency of agricultural data concerning cultivation areas and numbers of livestock, and we propose here an approach to this balancing problem based on Optimization. Possible alternative models, either linear, quadratic or mixed integer, are presented. The mixed integer linear model was preferred and used for the treatment of possibly unbalanced data records. Results on real-world Agricultural Census data show the effectiveness of the proposed approach.
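
Of the alternative models mentioned, the purely linear one can be sketched as a minimum total adjustment problem with absolute-deviation variables (field names and figures below are invented):

```python
import pulp

# unbalanced record: the sub-areas should sum to the total area
record = {"vineyard": 10.0, "olive": 6.0, "other": 3.0, "total": 25.0}

prob = pulp.LpProblem("balancing", pulp.LpMinimize)
x = {f: pulp.LpVariable(f"x_{f}", lowBound=0) for f in record}   # adjusted values
d = {f: pulp.LpVariable(f"d_{f}", lowBound=0) for f in record}   # |x_f - record[f]|
prob += pulp.lpSum(d.values())                                   # minimal total adjustment
prob += x["vineyard"] + x["olive"] + x["other"] == x["total"]    # balance edit
for f in record:
    prob += x[f] - record[f] <= d[f]
    prob += record[f] - x[f] <= d[f]
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({f: round(x[f].value(), 2) for f in record})
```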
