Using Annotated Suffix Tree Similarity Measure for Text Summarisation

Abstract

The paper describes an attempt to improve TextRank, an algorithm for unsupervised text summarisation. TextRank has two main stages. The first stage represents a text as a weighted directed graph in which nodes stand for single sentences, and edges connect consequent sentences and are weighted by sentence similarity. The second stage applies the PageRank algorithm, unmodified, to this graph. The nodes that receive the highest ranks form the summary of the text. We focus on the first stage, in particular on measuring sentence similarity. Mihalcea and Tarau suggest employing the common scheme: use the vector space model (VSM), so that every text is a vector in the space of words or stems, and compute the cosine similarity between these vectors. Our idea is to replace this scheme with the annotated suffix tree (AST) model for sentence representation. The AST model overcomes several limitations of the VSM, such as its dependence on the size of the vocabulary and the length of sentences, and its need for stemming or lemmatisation. This is achieved by taking all fuzzy matches between sentences into account and computing probabilities of matched concurrencies. To test the method on Russian texts, we built our own collection of newspaper articles in which some sentences are highlighted as being more important. Using the AST similarity measure on this collection yields a slight improvement over the cosine similarity measure.
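
The two stages of the baseline described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each sentence is linked to every sentence that follows it, that words are extracted with a simple regular expression without stemming, and that the damping factor is 0.85; the function names `cosine_similarity` and `textrank_summary` are hypothetical. The `similarity` parameter marks the place where an AST-based measure would replace the cosine measure.

```python
import math
import re
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words vectors of two sentences."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textrank_summary(sentences, k=2, similarity=cosine_similarity,
                     damping=0.85, iterations=100):
    """Return the k highest-ranked sentences in their original order."""
    n = len(sentences)

    # Stage 1: weighted directed graph. weight[i][j] is the weight of the
    # edge i -> j; by assumption an edge exists only when i precedes j.
    weight = [
        [similarity(sentences[i], sentences[j]) if i < j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weight]

    # Stage 2: weighted PageRank by power iteration. Each node passes its
    # rank along outgoing edges in proportion to the edge weights.
    rank = [1.0] * n
    for _ in range(iterations):
        rank = [
            (1 - damping) + damping * sum(
                weight[i][j] / out_sum[i] * rank[i]
                for i in range(n) if out_sum[i] > 0
            )
            for j in range(n)
        ]

    top = sorted(range(n), key=lambda j: rank[j], reverse=True)[:k]
    return [sentences[j] for j in sorted(top)]
```

Calling `textrank_summary(sentences, k=2)` on a list of sentence strings returns two of them. Passing a different function as `similarity` changes only the first stage, which is the part of TextRank the paper modifies.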

References

Bougouin, A., Boudin, F., & Daille, B. (2013). TopicRank: Graph-based topic ranking for keyphrase extraction. In Proceedings of International Joint Conference on Natural Language Processing (pp. 543–551).
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh International Conference on World Wide Web 7 (pp. 107–117).
Cruz, F., Troyano, J. A., & Enríquez, F. (2006). Supervised TextRank. In Advances in natural language processing (pp. 632–639). Berlin/Heidelberg: Springer.
Document Understanding Conference. Retrieved October 20, 2014, http://www-nlpir.nist.gov/ (Web source)
Enhanced Annotated Suffix Tree. Retrieved January 15, 2015, https://pypi.python.org/pypi/EAST/0.2.2/ (Web source)
Erkan, G., & Radev, D. R. (2004). LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22(1), 457–479.
Garg, N., Favre, B., Riedhammer, K., & Hakkani-Tür, D. (2009). ClusterRank: a graph based method for meeting summarization. In Interspeech, ISCA (pp. 1499–1502).
Gusfield, D. (1997). Algorithms on strings, trees and sequences: Computer science and computational biology. Cambridge: Cambridge University Press.
Hahn, U., & Mani, I. (2000). The challenges of automatic summarization. Computer, 33(11), 29–36.
Mihalcea, R., & Tarau, P. (2004). TextRank: bringing order into text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 404–411).
Pampapathi, R., Mirkin, B., & Levene, M. (2008). A suffix tree approach to anti-spam email filtering. Machine Learning, 65(1), 309–338.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.

Acknowledgements

The article was prepared within the framework of the Academic Fund Program at the National Research University Higher School of Economics (HSE) in 2014–2015 (grant No 15-05-0041) and supported within the framework of a subsidy granted to the HSE by the Government of the Russian Federation for the implementation of the Global Competitiveness Program.

Author information

Authors and Affiliations

National Research University Higher School of Economics, Moscow, Russia
Maxim Yakovlev & Ekaterina Chernyak

Authors

Maxim Yakovlev
Ekaterina Chernyak

Corresponding author

Correspondence to Maxim Yakovlev.

Editor information

Editors and Affiliations

Jacobs University Bremen, Bremen, Germany
Adalbert F.X. Wilhelm
Institute of Medical Systems Biology, Universität Ulm, Ulm, Baden-Württemberg, Germany
Hans A. Kestler

About this paper

Cite this paper

Yakovlev, M., Chernyak, E. (2016). Using Annotated Suffix Tree Similarity Measure for Text Summarisation. In: Wilhelm, A., Kestler, H. (eds) Analysis of Large and Complex Data. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-25226-1_9

DOI: https://doi.org/10.1007/978-3-319-25226-1_9
Published: 04 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25224-7
Online ISBN: 978-3-319-25226-1
eBook Packages: Mathematics and Statistics; Mathematics and Statistics (R0); Springer Nature Proceedings excluding Computer Science