Using Annotated Suffix Tree Similarity Measure for Text Summarisation

Abstract

This paper describes an attempt to improve TextRank, an algorithm for unsupervised text summarisation. TextRank has two main stages. The first stage represents a text as a weighted directed graph in which nodes stand for single sentences and edges connect consecutive sentences and are weighted with sentence similarity. The second stage applies the PageRank algorithm, as is, to the graph. The nodes that receive the highest ranks form the summary of the text. We focus on the first stage, in particular on measuring sentence similarity. Mihalcea and Tarau suggest employing the common scheme: use the vector space model (VSM), so that every text is a vector in the space of words or stems, and compute the cosine similarity between these vectors. Our idea is to replace this scheme with the annotated suffix tree (AST) model for sentence representation. The AST model overcomes several limitations of the VSM, such as its dependence on the size of the vocabulary and the length of sentences, and its need for stemming or lemmatisation. This is achieved by taking all fuzzy matches between sentences into account and computing probabilities of matched concurrencies. To test the method on Russian texts, we built our own collection of newspaper articles in which some sentences are highlighted as being more important. On this collection, the AST similarity measure achieves a slight improvement over the cosine similarity measure.
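The two-stage baseline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace-free regex tokeniser, the damping factor of 0.85, the fixed iteration count, and the choice to link each sentence to every later sentence are all assumptions made for illustration. It uses the VSM with cosine similarity, that is, the scheme the paper proposes to replace.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens (no stemming)."""
    return re.findall(r"\w+", sentence.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_graph(sentences):
    """Stage 1: weights[i][j] is the similarity on the edge i -> j.

    Assumption: edges run only from a sentence to later sentences.
    """
    bags = [Counter(tokenize(s)) for s in sentences]
    n = len(bags)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = cosine_similarity(bags[i], bags[j])
    return weights


def pagerank(weights, damping=0.85, iterations=50):
    """Stage 2: weighted PageRank by power iteration."""
    n = len(weights)
    if n == 0:
        return []
    ranks = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        # Rank held by nodes without outgoing weight is spread uniformly.
        dangling = sum(ranks[i] for i in range(n) if out_sums[i] == 0.0)
        new_ranks = []
        for j in range(n):
            incoming = sum(
                ranks[i] * weights[i][j] / out_sums[i]
                for i in range(n)
                if out_sums[i] > 0.0
            )
            new_ranks.append((1 - damping) / n + damping * (incoming + dangling / n))
        ranks = new_ranks
    return ranks


def summarise(sentences, k=2):
    """Return the k highest-ranked sentences in their original order."""
    ranks = pagerank(build_graph(sentences))
    top = sorted(range(len(sentences)), key=lambda i: ranks[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

In this sketch, `cosine_similarity` is the only part that depends on the VSM, so an AST-based similarity function with the same two-argument shape could in principle be substituted inside `build_graph`. The abstract does not specify how AST scores are computed, so that part is left out here.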

Acknowledgements

The article was prepared within the framework of the Academic Fund Program at the National Research University Higher School of Economics (HSE) in 2014–2015 (grant No 15-05-0041) and supported within the framework of a subsidy granted to the HSE by the Government of the Russian Federation for the implementation of the Global Competitiveness Program.

Author information

Authors and Affiliations

  1. National Research University Higher School of Economics, Moscow, Russia
    Maxim Yakovlev & Ekaterina Chernyak

Corresponding author

Correspondence to Maxim Yakovlev.

Editor information

Editors and Affiliations

  1. Jacobs University Bremen, Bremen, Germany
    Adalbert F.X. Wilhelm
  2. Institute of Medical Systems Biology, Universität Ulm, Ulm, Baden-Württemberg, Germany
    Hans A. Kestler

Rights and permissions

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Yakovlev, M., Chernyak, E. (2016). Using Annotated Suffix Tree Similarity Measure for Text Summarisation. In: Wilhelm, A., Kestler, H. (eds) Analysis of Large and Complex Data. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-25226-1_9
