M. Diligenti - Academia.edu
Papers by M. Diligenti
2011 10th International Conference on Machine Learning and Applications and Workshops, 2011
This paper presents a novel framework to integrate prior knowledge, represented as a collection of First Order Logic (FOL) clauses, into regularization over discrete domains. In particular, we consider tasks in which a set of items are connected to each other by given relationships yielding a graph, whose nodes correspond to the available objects, and it is required to estimate a set of functions defined on each node of the graph, given a small set of labeled nodes for each function. The available prior knowledge imposes a set of constraints among the function values. In particular, we consider background knowledge expressed as FOL clauses, whose predicates correspond to the functions and whose variables range over the nodes of the graph. These clauses can be converted into a set of constraints that can be embedded into a graph regularization schema. The experimental results evaluate the proposed technique on an image tagging task, showing how the proposed approach provides a significantly higher tagging accuracy than simple graph regularization.
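To make the constraint-embedding idea concrete, here is a minimal sketch, not the paper's implementation: two toy tag functions on a small graph are fit to a few labels, smoothed with a Laplacian regularizer, and pushed to respect the clause dog(x) → animal(x). The product t-norm relaxation, the toy graph, the labels, and all weights are illustrative assumptions.

```python
import numpy as np

# Toy graph: 6 nodes, two tag functions ("dog", "animal").
# Assumption: the clause  forall x: dog(x) -> animal(x)  is relaxed with the
# product t-norm penalty  f_dog(x) * (1 - f_animal(x)).
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
L = np.diag(W.sum(1)) - W                     # graph Laplacian

funcs = {"dog": 0, "animal": 1}
labels = {("dog", 0): 1.0, ("animal", 5): 0.0}  # sparse supervision
f = np.full((6, 2), 0.5)                       # node-wise function values in [0, 1]

lam_smooth, lam_logic, lr = 0.1, 1.0, 0.05
for _ in range(500):
    grad = np.zeros_like(f)
    # supervised loss: squared error on the labeled nodes only
    for (name, node), y in labels.items():
        j = funcs[name]
        grad[node, j] += 2 * (f[node, j] - y)
    # graph regularization: sum_ij W_ij (f_i - f_j)^2  ->  gradient 2 L f
    grad += lam_smooth * 2 * L @ f
    # logic penalty: sum_x dog(x) * (1 - animal(x))
    d, a = funcs["dog"], funcs["animal"]
    grad[:, d] += lam_logic * (1 - f[:, a])
    grad[:, a] += lam_logic * (-f[:, d])
    f = np.clip(f - lr * grad, 0.0, 1.0)

print(np.round(f, 2))  # labels propagate over the graph and respect dog -> animal
```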
International Conference on Advances in Pattern Recognition, 1999
... algorithm is likely to be more robust with respect to noise than the truly mathematical one. ... logos can be rotated up to 360 degrees and are corrupted by salt and pepper noise and ... a joint project between the University of Wollongong (Australia), the University of Pisa (Italy) and ...
Lecture Notes in Computer Science, 2002
Extracting and processing information from web pages is an important task in many areas such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a "bag of words" and then to perform additional processing on such a flat representation. In this paper we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Using this spatial information, one can define heuristics for the recognition of common page areas such as the header, left and right menus, footer, and center of a page. Initial experiments show that the objects defined by our heuristics are recognized correctly in 73% of cases. Finally, we show that a Naive Bayes classifier that takes the proposed representation into account clearly outperforms a Naive Bayes classifier using only information about the content of documents.
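The exact layout heuristics are not given in the abstract; the sketch below only illustrates the kind of rule the approach relies on, classifying an HTML object from its rendered bounding box. The thresholds and function name are assumptions.

```python
# Illustrative sketch only: the thresholds below are assumed for demonstration,
# not taken from the paper.

def classify_area(x, y, width, height, page_width, page_height):
    """Assign an HTML object to a coarse page area from its rendered box."""
    if y + height <= 0.15 * page_height:
        return "header"
    if y >= 0.85 * page_height:
        return "footer"
    if x + width <= 0.20 * page_width:
        return "left_menu"
    if x >= 0.80 * page_width:
        return "right_menu"
    return "center"

# Example: a block rendered at (10, 400) of size 150x600 on a 1024x1000 page
print(classify_area(10, 400, 150, 600, 1024, 1000))  # -> "left_menu"
```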
International Journal of Electronic Business, 2003
Extracting and processing information from web pages is an important task in many areas like constructing search engines, information retrieval, and data mining from the web. A common approach in the extraction process is to represent a page as a 'bag of words' and then to perform additional processing on such a flat representation. In this paper we propose a ...
International Journal on Document Analysis and Recognition, 2001
... be clustered with respect to their subject or use, according to the users' view: for example, journals, tax forms, business ... At each step a new node is added to the path. ... decreasing function: if the level drop between two nodes is large, then the function g assumes a small value. ...
2011 10th International Conference on Machine Learning and Applications and Workshops, 2011
Learning to rank from examples is an important task in modern Information Retrieval systems such as Web search engines, where the large number of available features makes it hard to manually devise high-performing ranking functions. This paper presents a novel approach to learning-to-rank, which can natively integrate any target metric with no modifications. The target metric is optimized via maximum-likelihood estimation of a probability distribution over the ranks, which are assumed to follow a Boltzmann distribution. Unlike other approaches in the literature such as BoltzRank, this approach does not rely on maximizing the expected value of the target score as a proxy for the optimization of the target metric. This has both theoretical and performance advantages, since the expected value cannot be computed both accurately and efficiently. Furthermore, our model employs the pseudo-likelihood as an accurate surrogate of the likelihood, to avoid explicitly computing the normalization factor of the Boltzmann distribution, which is intractable in this context. The experimental results show that the approach provides state-of-the-art results on various benchmarks and on a dataset built from the logs of a commercial search engine.
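A rough sketch of the pseudo-likelihood idea in this setting: instead of normalizing over all n! rankings, the probability of a ranking is approximated by local conditionals over a small neighborhood. The DCG-style energy and the adjacent-swap neighborhood used below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def energy(scores, perm):
    """Lower energy = better ranking: DCG-style discounted score sum (assumed)."""
    discounts = 1.0 / np.log2(np.arange(2, len(perm) + 2))
    return -np.sum(scores[perm] * discounts)

def pseudo_log_likelihood(scores, perm):
    """Approximate log P(perm) under a Boltzmann distribution by a product of
    local conditionals over adjacent swaps, avoiding the intractable
    normalization over all n! permutations."""
    e0, pll = energy(scores, perm), 0.0
    for k in range(len(perm) - 1):
        swapped = perm.copy()
        swapped[k], swapped[k + 1] = swapped[k + 1], swapped[k]
        e1 = energy(scores, swapped)
        pll += -e0 - np.logaddexp(-e0, -e1)   # log P(keep order at position k)
    return pll

scores = np.array([2.0, 0.5, 1.2, 0.1])       # model scores for 4 documents
ideal = np.argsort(-scores)                   # ranking by decreasing score
print(pseudo_log_likelihood(scores, ideal))   # close to 0, i.e. high probability
```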
Proceedings of Sixth International Conference on Document Analysis and Recognition, 2001
Content-based search and organization of Web documents poses new issues in information retrieval. We propose a novel approach for the classification of HTML documents based on a structured representation of their contents, which are split into logical contexts (paragraphs, sections, anchors, etc.). The classification is performed using Hidden Tree-Markov Models (HTMMs), an extension of Hidden Markov Models for processing structured objects. We report some promising experimental results showing that the use of the ...
Theoretical Computer Science, 2004
This paper emphasizes some intriguing links between neural computation on graphical domains and social networks, like those used in today's search engines to score page authority. It is pointed out that the introduction of web domains creates a unified mathematical framework for these computational schemes. It is shown that one of the major limitations of currently used connectionist models, namely their scarce ability to capture the topological features of patterns, can be effectively faced by computing the node rank according to social-based computation, like Google's PageRank. The main contribution of the paper is the introduction of a novel graph spectral notion, which can be naturally used for the graph isomorphism problem. In particular, a class of graphs is introduced for which the problem is proven to be polynomial. It is also pointed out that the derived spectral representations can be nicely combined with learning, thus opening the doors to many applications typically faced within the framework of neural computation.
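Since the paper builds on social-based node scoring of the PageRank kind, a minimal power-iteration sketch is included for reference; the damping factor, tolerance, and toy adjacency matrix are the usual illustrative defaults, not values taken from the paper.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-9, max_iter=200):
    """Standard PageRank by power iteration on a row-oriented adjacency matrix."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # column-stochastic transition matrix; dangling nodes spread uniformly
    M = np.where(out_deg[:, None] > 0,
                 adj / np.maximum(out_deg[:, None], 1),
                 1.0 / n).T
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = damping * M @ r + (1 - damping) / n
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]], dtype=float)
print(np.round(pagerank(adj), 3))   # node ranks sum to 1
```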
Pattern Recognition Letters, 2003
Machine Learning, 2012
We propose a general framework to incorporate first-order logic (FOL) clauses, which are thought of as an abstract and partial representation of the environment, into kernel machines that learn within a semi-supervised scheme. We rely on a multi-task learning scheme where each task is associated with a unary predicate defined on the feature space, while higher-level abstract representations consist of FOL clauses made of those predicates. We re-use the kernel machine mathematical apparatus to solve the problem as primal optimization of a function composed of the loss on the supervised examples, the regularization term, and a penalty term derived from forcing the real-valued constraints that stem from the predicates. Unlike classic kernel machines, however, depending on the logic clauses, the overall function to be optimized is no longer convex. An important contribution is to show that, while tackling the optimization by classic numerical schemes is likely to be hopeless, a stage-based learning scheme, in which we first learn the supervised examples until convergence is reached and then continue by forcing the logic clauses, is a viable direction to attack the problem. Some promising experimental results are given on artificial learning tasks and on the automatic tagging of BibTeX entries to emphasize the comparison with plain kernel machines.
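A minimal sketch of the two-stage scheme follows: first fit the supervised loss only, then keep training with a penalty derived from the clause a(x) → b(x). Linear predicates stand in for the kernel expansions to keep the sketch short; the product t-norm relaxation, the synthetic data, and all hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_sup = rng.normal(size=(20, 2))
y_a = (X_sup[:, 0] > 0).astype(float)
y_b = np.maximum(y_a, (X_sup[:, 1] > 0).astype(float))  # b holds whenever a does
X_unsup = rng.normal(size=(200, 2))                      # unlabeled points

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.zeros((2, 3))  # one (w1, w2, bias) row per predicate: a, b

def predict(w_row, X):
    return sigmoid(X @ w_row[:2] + w_row[2])

def step(X, targets=None, logic_X=None, lr=0.1, lam=0.5):
    global w
    grad = np.zeros_like(w)
    if targets is not None:                       # supervised squared loss
        for i, y in enumerate(targets):
            p = predict(w[i], X)
            g = 2 * (p - y) * p * (1 - p)
            grad[i, :2] += X.T @ g
            grad[i, 2] += g.sum()
    if logic_X is not None:                       # clause penalty: a * (1 - b)
        pa, pb = predict(w[0], logic_X), predict(w[1], logic_X)
        ga = lam * (1 - pb) * pa * (1 - pa)
        gb = lam * (-pa) * pb * (1 - pb)
        grad[0, :2] += logic_X.T @ ga; grad[0, 2] += ga.sum()
        grad[1, :2] += logic_X.T @ gb; grad[1, 2] += gb.sum()
    w -= lr * grad

for _ in range(300):                 # stage 1: supervised examples only
    step(X_sup, targets=[y_a, y_b])
for _ in range(300):                 # stage 2: also force the logic clause
    step(X_sup, targets=[y_a, y_b], logic_X=X_unsup)
```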
International Journal on Artificial Intelligence Tools, 2012
Crossword puzzles are used every day by millions of people for entertainment, but they also have applications in educational and rehabilitation contexts. Unfortunately, the generation of ad-hoc puzzles, especially on specific subjects, typically requires a great deal of human expert work. This paper presents the architecture of WebCrow-generation, a system that is able to generate crosswords with no human intervention, including clue generation and crossword compilation. In particular, the proposed system crawls information sources on the Web, extracts definitions from the downloaded pages using state-of-the-art natural language processing techniques and, finally, compiles the crossword schema with the extracted definitions by constraint satisfaction programming. The system has been tested on the creation of Italian crosswords, but the extensive use of machine learning makes the system easily portable to other languages.
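To illustrate the constraint-satisfaction step only (not WebCrow-generation itself), here is a toy backtracking fill of fixed word slots so that crossing cells agree. The grid, slots, and word list are stand-ins for the definitions the system extracts.

```python
# Toy crossword compilation as constraint satisfaction: assign a word to each
# slot so that every shared cell carries the same letter.

SLOTS = [                       # (name, list of (row, col) cells)
    ("1across", [(0, 0), (0, 1), (0, 2)]),
    ("1down",   [(0, 0), (1, 0), (2, 0)]),
    ("3across", [(2, 0), (2, 1), (2, 2)]),
]
WORDS = ["cat", "car", "cow", "tar", "ten", "wet", "tow"]

def consistent(assignment, slot_cells, word):
    """Check the candidate word against letters fixed by assigned slots."""
    cells = {}
    for name, sc in SLOTS:
        if name in assignment:
            for pos, ch in zip(sc, assignment[name]):
                cells[pos] = ch
    return all(cells.get(pos, ch) == ch for pos, ch in zip(slot_cells, word))

def solve(assignment=None):
    """Backtracking search over slots in order."""
    assignment = assignment or {}
    if len(assignment) == len(SLOTS):
        return assignment
    name, cells = next((n, c) for n, c in SLOTS if n not in assignment)
    for word in WORDS:
        if len(word) == len(cells) and consistent(assignment, cells, word):
            result = solve({**assignment, name: word})
            if result:
                return result
    return None

print(solve())   # prints one consistent assignment of words to slots
```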
IEEE Transactions on Knowledge and Data Engineering, 2004
The definition of efficient page ranking algorithms is becoming an important issue in the design of the query interface of Web search engines. Information flooding is a common experience, especially when broad-topic queries are issued. Queries containing only one or two keywords usually match a huge number of documents, while users can only afford to visit the first positions ...
Automatic processing of Web documents is an important issue in the design of search engines, Web mining tools, and applications for Web information extraction. Simple text-based approaches are typically used, in which most of the information provided by the page's visual layout is discarded. Only some visual features, such as the font face and size, are effectively used to weigh the importance of the words in the page. In this paper, we propose to use a hierarchical representation that includes the visual screen coordinates of every HTML object in the page. The use of the visual layout allows us to identify common page components such as the header, the navigation bars, the left and right menus, the footer, and the informative parts of the page. The recognition of the functional role of each object is performed by a set of heuristic rules. The experimental results show that page areas are correctly classified in 73% of the cases. The identification of different functional areas on the page allows the definition of a more accurate method for representing the page text contents, which splits the text features into different subsets according to the area they belong to. We show that this approach can improve the classification accuracy for page topic categorization by more than 10% with respect to the use of a flat "bag-of-words" representation.
Extracting and processing information from web pages is an important task in many areas such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a "bag of words" and then to perform additional processing on such a flat representation. In this paper we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Such spatial information allows the definition of heuristics for the recognition of common page areas such as the header, left and right menus, footer, and center of a page. We show a preliminary experiment where our heuristics are able to correctly recognize objects in 73% of cases. Finally, we show that a Naive Bayes classifier that takes the proposed representation into account clearly outperforms the same classifier using only information about the content of documents.
This paper presents a general framework to integrate prior knowledge, in the form of logic constraints among a set of task functions, into kernel machines. The logic propositions provide a partial representation of the environment in which the learner operates, which is exploited by the learning algorithm together with the information available in the supervised examples. In particular, we consider a multi-task learning scheme, where multiple unary predicates on the feature space are to be learned by kernel machines, and a higher-level abstract representation consists of logic clauses on these predicates, known to hold for any input. A general approach is presented to convert the logic clauses into a continuous implementation that processes the outputs computed by the kernel-based predicates. The learning task is formulated as a primal optimization problem of a loss function that combines a term measuring the fitting of the supervised examples, a regularization term, and a penalty term that enforces the constraints on both supervised and unsupervised examples. The proposed semi-supervised learning framework is particularly suited for learning in high-dimensional feature spaces, where the supervised training examples tend to be sparse and generalization is difficult. Unlike standard kernel machines, the cost function to optimize is not generally guaranteed to be convex. However, the experimental results show that it is still possible to find good solutions using a two-stage learning schema, in which first the supervised examples are learned until convergence and then the logic constraints are forced. Some promising experimental results on artificial multi-task learning tasks are reported, showing how the classification accuracy can be effectively improved by exploiting the a priori rules and the unsupervised examples.