Róbert Ormándi | University of Szeged

Papers by Róbert Ormándi

Fully distributed data mining algorithms build global models over large amounts of data distributed over a large number of peers in a network, without moving the data itself. In the area of peer-to-peer (P2P) networks, such algorithms have various applications in P2P social networking, and also in trackerless BitTorrent communities. The difficulty of the problem involves realizing good quality models with an affordable communication complexity, while assuming as little as possible about the communication model. Here we describe a conceptually simple, yet powerful generic approach for designing efficient, fully distributed, asynchronous, local algorithms for learning models of fully distributed data. The key idea is that many models perform a random walk over the network while being gradually adjusted to fit the data they encounter, using a stochastic gradient descent search. We demonstrate our approach by implementing the support vector machine (SVM) method and by experimentally evaluating its performance in various failure scenarios over different benchmark datasets. Our algorithm scheme can implement a wide range of machine learning methods in an extremely robust manner.
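The random-walk idea in the abstract above can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the Pegasos-style update rule, the uniform choice of the next peer, the regularisation constant `lam`, and all function names are assumptions made for illustration. Each "node" holds one labelled record, and a single linear model visits nodes one after another, taking one stochastic gradient step per visit.

```python
import random


def pegasos_step(w, t, x, y, lam=0.01):
    """One Pegasos-style SGD step for a linear SVM on one example.

    x is a feature list, y is a label in {-1, +1}; returns the updated
    weights and the incremented step counter.
    """
    t += 1
    eta = 1.0 / (lam * t)  # decaying learning rate
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    # Shrink the weights (gradient of the L2 regulariser).
    w = [(1.0 - eta * lam) * wi for wi in w]
    # Add the hinge-loss gradient only if the margin was violated.
    if margin < 1.0:
        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w, t


def random_walk_train(records, steps, dim, seed=0):
    """Simulate one model walking over nodes that each hold one record."""
    rng = random.Random(seed)
    w, t = [0.0] * dim, 0
    node = rng.randrange(len(records))
    for _ in range(steps):
        x, y = records[node]                # only the local record is used
        w, t = pegasos_step(w, t, x, y)
        node = rng.randrange(len(records))  # assumption: uniform peer sampling
    return w


def predict(w, x):
    """Classify x with the linear model w."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
```

In a real deployment the model would be sent over the network between peers and many models would walk in parallel; the sketch only shows that the raw records never leave their nodes while the model accumulates what it learns.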

Machine learning over fully distributed data poses an important problem in peer-to-peer (P2P) applications. In this model we have one data record at each network node, but without the possibility to move raw data due to privacy considerations. For example, user profiles, ratings, history, or sensor readings can represent this case. This problem is difficult, because there is no possibility to learn local models, the system model offers almost no guarantees for reliability, yet the communication cost needs to be kept low. Here we propose gossip learning, a generic approach that is based on multiple models taking random walks over the network in parallel, while applying an online learning algorithm to improve themselves, and getting combined via ensemble learning methods. We present an instantiation of this approach for the case of classification with linear models. Our main contribution is an ensemble learning method which, through the continuous combination of the models in the network, implements a virtual weighted voting mechanism over an exponential number of models at practically no extra cost as compared to independent random walks. We prove the convergence of the method theoretically, and perform extensive experiments on benchmark datasets. Our experimental analysis demonstrates the performance and robustness of the proposed approach.
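One way to see why combining linear models can act like voting is sketched below. The abstract does not spell out the combination operator, so plain averaging of weight vectors is an assumption here, and the names `score` and `merge` are hypothetical. For linear models, the averaged model's raw score on any input equals the average of the two original scores, so a single merged model stands in for a vote over both parents.

```python
def score(w, x):
    """Raw (unthresholded) output of a linear model."""
    return sum(wi * xi for wi, xi in zip(w, x))


def merge(w_a, w_b):
    """Combine two linear models by averaging their weights (assumed operator)."""
    return [(a + b) / 2.0 for a, b in zip(w_a, w_b)]


# Two models that disagree on the sign for the same input.
w_a = [1.0, -2.0]
w_b = [3.0, 0.5]
x = [2.0, 4.0]

merged = merge(w_a, w_b)
# Linearity: score(merged, x) == (score(w_a, x) + score(w_b, x)) / 2
```

Repeating such merges as models meet during their walks means each surviving model implicitly aggregates many ancestors, which is the intuition behind the "virtual" voting described in the abstract.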

Here we propose a novel approach for the task of domain adaptation for Natural Language Processing. Our approach captures relations between the source and target domains by applying a model transformation mechanism which can be learnt by using labeled data of limited size taken from the target domain. Experimental results on several Opinion Mining datasets show that our approach significantly outperforms baselines and published systems when the amount of labeled data is extremely small.

In online communities like Wikipedia, where content editing is available to every visitor, users who deliberately make incorrect, vandal comments are sure to turn up. In this paper we propose a strong feature set and a method that can handle this problem and automatically decide whether an edit is a vandal contribution or not. We present a new feature

Peer-to-peer file-sharing has become increasingly popular in the last decade. In most cases file-sharing communities provide only minimal functionality, such as search and download. Extra features such as recommendation are difficult to implement because users are typically unwilling to provide sufficient rating information for the items they download. For this reason, it would be desirable to utilize user behavior to infer implicit ratings. For example, if a user deletes a file after downloading it, we could infer that the rating is low, or if the user is seeding the file for a long time, the rating is high. In this paper we demonstrate that it is indeed possible to infer implicit ratings from user behavior. We work with a large trace of Filelist.org, a BitTorrent-based private community, and demonstrate that we can identify a binary like/dislike distinction over the set of files users are downloading, using dynamic features of swarm membership. The resulting database containing the inferred ratings will be published online, and it can be used as a benchmark for P2P recommender systems.

To create the first Hungarian WSD corpus, 39 suitable word form samples were selected for the purpose of word sense disambiguation. Among others, selection criteria required the given word form to be frequent in Hungarian language usage (frequency rates available in the Hungarian National Corpus (HNC) were used for measurement), and to have more than one sense considered frequent in usage. HNC and its Heti Világgazdaság (HVG) subcorpus provided the basis for corpus text selection. This way, each sample has a relevant context (the whole HVG article), and information on the lemma, POS-tagging and automatic tokenization is also available.

Here we propose a novel machine learning method for time series forecasting which is based on the widely-used Least Squares Support Vector Machine (LS-SVM) approach. The objective function of our method contains a weighted variance minimization part as well. This modification makes the method more efficient in time series forecasting, as this paper will show. The proposed method is a generalization of the well-known LS-SVM algorithm. It shares the advantages of LS-SVM, such as the applicability of the kernel trick, a unique solution obtained from a linear system, and a short computational time, but can perform better in certain scenarios. The main purpose of this paper is to introduce the novel Variance Minimization Least Squares Support Vector Machine (VMLS-SVM) method and to show its superiority through experimental results using standard benchmark time series prediction datasets.
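For context, the "linear and unique solution" property refers to the standard LS-SVM formulation, which can be written as follows. This is the textbook LS-SVM regression setup, not the VMLS-SVM variant, whose additional variance term is not specified in the abstract; the symbols $\gamma$, $\Omega$, $\alpha$ and $b$ follow the usual convention and are assumptions relative to this page.

```latex
% Standard LS-SVM regression (primal), training set {(x_i, y_i)}_{i=1}^{n}
\min_{w,\,b,\,e}\ \frac{1}{2} w^{\top} w + \frac{\gamma}{2} \sum_{i=1}^{n} e_i^{2}
\quad \text{s.t.} \quad y_i = w^{\top} \varphi(x_i) + b + e_i,\qquad i = 1,\dots,n

% The optimality conditions reduce to one linear system in (b, \alpha),
% with kernel matrix \Omega_{ij} = K(x_i, x_j):
\begin{bmatrix} 0 & \mathbf{1}^{\top} \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ y \end{bmatrix}

% Prediction for a new input x:
f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) + b
```

Because the equality constraints replace the inequality constraints of a classical SVM, training amounts to solving this one linear system rather than a quadratic program.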

Identifying the lemma of a Named Entity is important for many Natural Language Processing applications like Information Retrieval. Here we introduce a novel approach for Named Entity lemmatisation which utilises the occurrence frequencies of each possible lemma. We constructed four corpora in English and Hungarian and trained machine learning methods using them to obtain simple decision rules based on the web frequencies of the lemmas. In experiments our web-based heuristic achieved an average accuracy of nearly 91%.

Offering personalized recommendation as a service in fully distributed applications such as file-sharing, distributed search, social networking, P2P television, etc., is an increasingly important problem. In such networked environments recommender algorithms should meet the same performance and reliability requirements as in centralized services. To achieve this is a challenge because a large amount of distributed data needs to be managed, and at the same time additional constraints need to be taken into account such as balancing resource usage over the network. In this paper we focus on a common component of many fully distributed recommender systems, namely the overlay network. We point out that the overlay topologies that are typically defined by node similarity have highly unbalanced degree distributions in a wide range of available benchmark datasets: a fact that has important (but so far largely overlooked) consequences on the load balancing of overlay protocols. We propose algorithms with a favorable convergence speed and prediction accuracy that also take load balancing into account. We perform extensive simulation experiments with the proposed algorithms, and compare them with known algorithms from related work on well-known benchmark datasets.

Here we present a manually annotated corpus of web pages and an annotation tool for Web Content Mining. The corpus is extensively annotated, has a hierarchical label structure and is freely available for research purposes. The annotation tool is a Firefox extension which allows the annotator to work with the pages in their original appearance. This tool handles the annotation hierarchy independently of the DOM tree of the web pages, and it allows overlapped annotation between the HTML tags.

The development of highly accurate Named Entity Recognition (NER) systems can be beneficial to a wide range of Human Language Technology applications. In this paper we introduce three heuristics that exploit a variety of knowledge sources (the World Wide Web, Wikipedia and WordNet) and are capable of further improving a state-of-the-art multilingual and domain-independent NER system. Moreover, we describe our investigations on entity recognition in simulated speech-to-text output. Our web-based heuristics attained a slight improvement over the best results published on a standard NER task, and proved to be particularly effective in the speech-to-text scenario.

Journal of The American Medical Informatics Association, 2009

In this study the authors describe the system submitted by the University of Szeged team to the second i2b2 Challenge in Natural Language Processing for Clinical Data. The challenge focused on the development of automatic systems that analyzed clinical discharge summary texts and addressed the following question: "Who's obese and what co-morbidities do they (definitely/most likely) have?". Target diseases included obesity and its 15 most frequent comorbidities exhibited by patients, while the target labels corresponded to expert judgments based on textual evidence and intuition (separately).

Journal of The American Medical Informatics Association, 2009