Python | Text Summarizer

Today, many organizations, whether online retailers, government and private sector bodies, catering and tourism businesses, or other institutions that offer customer services, care about their customers and ask for feedback every time their services are used. These companies may receive enormous amounts of user feedback every day, and it would be tedious for management to read and analyze each piece by hand.

Machine learning can automate much of this work. With Natural Language Processing (NLP), machines can process human language, and text analytics is an active area of research. One application of text analytics and NLP is a feedback summarizer, which shortens the text of user feedback. It uses an algorithm to reduce a body of text while keeping its original meaning, or at least giving a good insight into the original text.

If you're interested in data analytics, you will find learning about Natural Language Processing very useful. Python has extensive library support for NLP. We will be using NLTK, the Natural Language Toolkit, which serves our purpose well. Install the NLTK module on your system using:

pip install nltk

The tokenizers and the stop word list used below also rely on NLTK data packages, which can be downloaded once from Python with nltk.download('punkt') and nltk.download('stopwords'); recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'.

Let's understand the steps.

Step 1: Importing required libraries. Two NLTK modules are needed to build the feedback summarizer.

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
```

Terms used:

Corpus: A corpus is a collection of text. It could be a data set of any kind of text, such as poems by a certain poet or the body of work of a certain author. In this case, we are going to use a data set of predetermined stop words.
Tokenizer: A tokenizer divides a text into a series of tokens. The three main tokenizers are the word, sentence, and regex tokenizers. We will use only the word and sentence tokenizers.
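To make the two kinds of tokenizer concrete, here is a minimal sketch that imitates them with the standard library's re module, so it runs without any NLTK data. It is only a rough stand-in for NLTK's word_tokenize and sent_tokenize, which handle abbreviations and contractions more carefully, and the helper names simple_word_tokenize and simple_sent_tokenize are hypothetical.

```python
import re


def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def simple_sent_tokenize(text):
    """Split text after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = "The delivery was fast. The packaging was damaged!"
word_tokens = simple_word_tokenize(sample)
sentence_tokens = simple_sent_tokenize(sample)
print(word_tokens)
print(sentence_tokens)
```

The word tokenizer keeps punctuation marks as separate tokens, while the sentence tokenizer keeps each sentence intact, which is why the summarizer needs both.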

Step 2: Removing stop words and storing the remaining words in a separate array. Stop word: Any word (such as is, a, an, the, for) that adds little to the meaning of a sentence. For example, take the sentence:

GeeksForGeeks is one of the most useful websites for competitive programming.

After removing stop words, we can narrow the number of words and preserve the meaning as follows:

['GeeksForGeeks', 'one', 'useful', 'websites', 'competitive', 'programming', '.']
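The filtering step can be sketched in a few lines. This example assumes the sentence has already been tokenized and uses a tiny hand-picked stop word set (STOP_WORDS, a hypothetical name) purely for illustration; NLTK's stopwords.words("english") list is much larger.

```python
# A tiny hand-picked stop word set for illustration only;
# NLTK's English stop word list contains many more entries.
STOP_WORDS = {"is", "a", "an", "the", "for", "of", "most"}

# Tokens of the example sentence, written out by hand
tokens = ["GeeksForGeeks", "is", "one", "of", "the", "most", "useful",
          "websites", "for", "competitive", "programming", "."]

# Compare in lower case so that "The" and "the" are treated alike
filtered = [token for token in tokens if token.lower() not in STOP_WORDS]
print(filtered)
```

Note that punctuation such as the final '.' is not a stop word, so it survives the filter unless it is removed separately.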

Step 3: Create a frequency table of words. Create a Python dictionary that records how many times each word appears in the feedback after the stop words are removed. We can then apply this dictionary to every sentence to find which sentences carry the most relevant content in the overall text.

```python
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)
freqTable = dict()
```
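Filling the frequency table could look like the sketch below. It assumes a short, already tokenized piece of feedback and a small hypothetical stop word set, so that it runs without NLTK. It also shows collections.Counter from the standard library as an equivalent, shorter alternative to the manual loop.

```python
from collections import Counter

# Hypothetical stop word subset and hand-written tokens, for illustration
STOP_WORDS = {"the", "was", "and", "but"}
words = ["The", "food", "was", "great", "and", "the", "service", "was",
         "great", "but", "the", "food", "was", "cold"]

# Manual loop: count every non-stop word in lower case
freq_table = {}
for word in words:
    word = word.lower()
    if word in STOP_WORDS:
        continue
    freq_table[word] = freq_table.get(word, 0) + 1

# Counter builds the same table in a single expression
counter_table = Counter(
    w.lower() for w in words if w.lower() not in STOP_WORDS
)

print(freq_table)
```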

Step 4: Assign a score to each sentence based on the words it contains and the frequency table. We can use the sent_tokenize() method to create the array of sentences. We also need a dictionary to keep the score of each sentence; we will later go through this dictionary to generate the summary.

```python
sentences = sent_tokenize(text)
sentenceValue = dict()
```
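One way to fill the sentence score dictionary is sketched below. It assumes a frequency table like the one from Step 3 and three hand-written sentences, and the helper score_sentence is a hypothetical name. Each sentence's score is the sum of the frequencies of the distinct table words it contains, and words are compared as whole tokens so that a short word is not matched inside a longer one.

```python
import re

# Assumed inputs: a small frequency table and pre-split sentences
freq_table = {"food": 2, "great": 2, "service": 1, "cold": 1}
sentences = [
    "The food was great.",
    "The service was great.",
    "But the food was cold.",
]


def score_sentence(sentence, freq_table):
    """Sum the frequencies of the table words present in the sentence."""
    # Whole-word comparison: split the sentence into lower-case words
    words_in_sentence = set(re.findall(r"\w+", sentence.lower()))
    return sum(freq for word, freq in freq_table.items()
               if word in words_in_sentence)


sentence_value = {s: score_sentence(s, freq_table) for s in sentences}
print(sentence_value)
```

In this sketch a sentence with no table words still gets an entry, with a score of 0.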

Step 5: Compute a threshold score to compare the sentences within the feedback. A simple approach is to find the average score of a sentence; the average itself can serve as a good threshold.

```python
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]
average = int(sumValues / len(sentenceValue))
```
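As a worked example of the threshold, the sketch below uses three hypothetical sentence scores. The helper select_sentences and its multiplier parameter are illustrative names; the default of 1.2 matches the factor used in the complete implementation in this article, and the guard for an empty dictionary is an added assumption to avoid dividing by zero.

```python
# Hypothetical sentence scores, listed in the original sentence order
sentence_value = {
    "The food was great.": 4,
    "The service was great.": 3,
    "But the food was cold.": 3,
}


def select_sentences(sentence_value, multiplier=1.2):
    """Return the sentences scoring above multiplier * average."""
    if not sentence_value:
        return []
    average = int(sum(sentence_value.values()) / len(sentence_value))
    threshold = multiplier * average
    # Dictionaries keep insertion order, so sentence order is preserved
    return [s for s, score in sentence_value.items() if score > threshold]


selected = select_sentences(sentence_value)
summary = " ".join(selected)
print(summary)
```

Here the average is int(10 / 3) = 3, so the threshold is about 3.6 and only the sentence scoring 4 is kept. Lowering the multiplier lets more sentences into the summary.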

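Finally, apply the threshold and append the qualifying sentences, in their original order, to the summary. The implementation below keeps only sentences that score more than 1.2 times the average.

Code: Complete implementation of the text summarizer in Python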

```python
# Importing libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time setup: uncomment to download the NLTK data used below
# nltk.download("punkt")
# nltk.download("stopwords")

# Input text - to summarize
text = """ """

# Tokenizing the text
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)

# Creating a frequency table to keep the
# score of each word
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

# Creating a dictionary to keep the score
# of each sentence
sentences = sent_tokenize(text)
sentenceValue = dict()

for sentence in sentences:
    # Compare whole tokens, so that a short word is not
    # matched inside a longer one (e.g. "art" in "start")
    sentenceWords = set(word_tokenize(sentence.lower()))
    for word, freq in freqTable.items():
        if word in sentenceWords:
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Average value of a sentence from the original text
# (guarded so that an empty input does not divide by zero)
average = int(sumValues / len(sentenceValue)) if sentenceValue else 0

# Storing sentences into our summary.
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence
print(summary)
```