István T. Nagy | University of Szeged
Papers by István T. Nagy
In this paper, we present HunOr, the first multi-domain Hungarian–Russian parallel corpus. Some of the corpus texts have been manually aligned and split into sentences, and their named entities have also been annotated, while the other parts are automatically aligned at the sentence level and POS-tagged. The corpus contains texts from the domains of literature, official language use and science; we would also like to add texts from the news domain. In the future, we are planning to carry out a ...
Light verb constructions consist of a verbal and a nominal component, where the noun preserves its original meaning while the verb has lost it (to some degree). They are syntactically flexible and their meaning can only be partially computed from the meanings of their parts, so they require special treatment in natural language processing. The first step towards this is to identify light verb constructions.
Nominal compounds (NCs) are lexical units that consist of two or more elements that exist on their own, function as a noun and have a special added meaning. Here, we present the results of our experiments on how the growth of Wikipedia contributed to the performance of our dictionary labeling methods for detecting NCs. We also investigated how the size of an automatically generated silver standard corpus affects the performance of our machine learning-based method. The results we obtained demonstrate that the bigger the dataset, the better the performance.
Third Web People Search Evaluation Forum ( …, Jan 1, 2010
We present the RGAI systems which participated in the third Web People Search Task challenge. The chief characteristics of our approach are that we focus on the raw textual parts of the Web pages instead of the structured parts, we group similar attribute classes together and we explicitly handle their interdependencies. The RGAI systems achieved top results on the attribute extraction subtask, and average results on the clustering subtask.
Text, Speech and …, Jan 1, 2008
Identifying the lemma of a Named Entity is important for many Natural Language Processing applications like Information Retrieval. Here we introduce a novel approach for Named Entity lemmatisation which utilises the occurrence frequencies of each possible lemma. We constructed four corpora in English and Hungarian and trained machine learning methods using them to obtain simple decision rules based on the web frequencies of the lemmas. In experiments, our web-based heuristic achieved an average accuracy of nearly 91%.
Proceedings of the 2009 Workshop on …, Jan 1, 2009
Text, Speech and Dialogue, Jan 1, 2011
In this paper we present a slightly modified machine learning approach for text classification working exclusively from positive and unlabeled samples. Our method ensures that the positive class is not underrepresented during the iterative training process, and it can achieve a 30% better F-value when the number of positive examples is low.
ACL HLT 2011, Jan 1, 2011
... Entering the grounds of the historical center, you walk through streets that still respect the original Gothic urban layout, ie, the unique developed chess ground plan. ... Agnieszka Mykowiecka and Malgorzata Marciniak Automatic Switchboard Operator..... ...
aclweb.org
The identification of different kinds of multiword expressions requires different solutions; on the other hand, there might be domain-related differences in their frequency and typology.
Text, Speech and Dialogue, Jan 1, 2011
In this paper, we focus on various methods for detecting verbal collocations, i.e. verb-particle constructions and light verb constructions, in Wikipedia articles. Our results suggest that for verb-particle constructions, POS-tagging and restrictions on the particle seem to yield the best result, whereas the combination of POS-tagging, syntactic information and restrictions on the nominal and verbal components has the most beneficial effect on identifying light verb constructions. The identification of multiword semantic units can be successfully exploited in several applications in the fields of machine translation or information extraction.
ACL HLT 2011, Jan 1, 2011
Proceedings of RANLP, Jan 1, 2011
aclweb.org
We investigate how the automatic identification of noun compounds and named entities can contribute to keyphrase extraction. We also show how previously identified noun compounds affect named entity recognition and, vice versa, how noun compound detection is supported by identified named entities. Our experiments demonstrate that already known noun compounds yield better performance in named entity recognition and that already known named entities enhance noun compound detection.