István T. Nagy | University of Szeged

Papers by István T. Nagy

HunOr: A Hungarian–Russian Parallel Corpus

In this paper, we present HunOr, the first multi-domain Hungarian–Russian parallel corpus. Some of the corpus texts have been manually aligned and split into sentences, and named entities have also been annotated in them, while the other parts are automatically aligned at the sentence level and POS-tagged as well. The corpus contains texts from the domains of literature, official language use and science; we would also like to add texts from the news domain. In the future, we are planning to carry out a ...

Learning to Detect English and Hungarian Light Verb Constructions

Light verb constructions consist of a verbal and a nominal component, where the noun preserves its original meaning while the verb has lost it (to some degree). They are syntactically flexible and their meaning can only be partially computed on the basis of the meaning of their parts; thus, they require special treatment in natural language processing. For this purpose, the first step is to identify light verb constructions.

English Nominal Compound Detection with Wikipedia-Based Methods

Nominal compounds (NCs) are lexical units that consist of two or more elements that exist on their own, function as a noun and have a special added meaning. Here, we present the results of our experiments on how the growth of Wikipedia improved the performance of our dictionary labeling methods for detecting NCs. We also investigated how the size of an automatically generated silver standard corpus can affect the performance of our machine learning-based method. The results we obtained demonstrate that the larger the dataset, the better the performance.

Person attribute extraction from the textual parts of web pages

Third Web People Search Evaluation Forum ( …, Jan 1, 2010

We present the RGAI systems which participated in the third Web People Search Task challenge. The chief characteristics of our approach are that we focus on the raw textual parts of the Web pages instead of the structured parts, we group similar attribute classes together and we explicitly handle their interdependencies. The RGAI systems achieved top results on the attribute extraction subtask, and average results on the clustering subtask.

Web-based lemmatisation of Named Entities

Text, Speech and …, Jan 1, 2008

Identifying the lemma of a Named Entity is important for many Natural Language Processing applications like Information Retrieval. Here we introduce a novel approach for Named Entity lemmatisation which utilises the occurrence frequencies of each possible lemma. We constructed four corpora in English and Hungarian and trained machine learning methods using them to obtain simple decision rules based on the web frequencies of the lemmas. In experiments, our web-based heuristic achieved an average accuracy of nearly 91%.

Researcher affiliation extraction from homepages

Proceedings of the 2009 Workshop on …, Jan 1, 2009

On positive and unlabeled learning for text classification

Text, Speech and Dialogue, Jan 1, 2011

In this paper we present a slightly modified machine learning approach for text classification working exclusively from positive and unlabeled samples. Our method can ensure that the positive class is not underrepresented during the iterative training process, and it can achieve a 30% better F-value when the number of positive examples is low.

Detecting noun compounds and light verb constructions: a contrastive study

ACL HLT 2011, Jan 1, 2011

Text, Speech and Dialogue: 14th International Conference, TSD 2011, Pilsen, Czech Republic, September 1-5, 2011, Proceedings

... Entering the grounds of the historical center, you walk through streets that still respect the original Gothic urban layout, i.e., the unique developed chess ground plan. ... Agnieszka Mykowiecka and Malgorzata Marciniak Automatic Switchboard Operator..... ...

Domain-Dependent Identification of Multiword Expressions

aclweb.org

The identification of different kinds of multiword expressions requires different solutions; on the other hand, there might be domain-related differences in their frequency and typology.

Identifying verbal collocations in Wikipedia articles

Text, Speech and Dialogue, Jan 1, 2011

In this paper, we focus on various methods for detecting verbal collocations, i.e. verb-particle constructions and light verb constructions, in Wikipedia articles. Our results suggest that for verb-particle constructions, POS-tagging and restrictions on the particle seem to yield the best result, whereas the combination of POS-tagging, syntactic information and restrictions on the nominal and verbal components has the most beneficial effect on identifying light verb constructions. The identification of multiword semantic units can be successfully exploited in several applications in the fields of machine translation or information extraction.

Multiword expressions and named entities in Wikipedia articles

Proceedings of RANLP, Jan 1, 2011

Noun Compound and Named Entity Recognition and their Usability in Keyphrase Extraction

aclweb.org

We investigate how the automatic identification of noun compounds and named entities can contribute to keyphrase extraction. We also show how previously identified noun compounds affect named entity recognition and, vice versa, how noun compound detection is supported by identified named entities. Our experiments demonstrate that already known noun compounds yield better performance in named entity recognition and that already known named entities enhance noun compound detection.
