Frequency Word Lists
I will start working on the 2018 OpenSubtitles dataset soon. Watch this space.
Download
Frequency word lists for the 2016 OpenSubtitles datasets, and the code used to generate them, are now publicly available.
Click here to go to the GitHub repository
Previous post and links to old data files
I originally created the word lists while I was trying to improve the dictionaries I used for my Windows Phone app called Slydr.
Of course there were commercial options; however, I was quoted about £500 per language for a nice, cleaned word list. Being a cheap git, I of course decided to create my own.
If you decide to use it, please let me know what you are using it for. It's yours to use.
Note: I used public / free subtitles to generate these and, like most things, they will have errors.
I would like to thank opensubtitles.org, as their subtitles form the basis of the word lists. I would also like to thank Tehran University for the Persian language corpus, which allowed me to build the Persian / Farsi word list (2011 version).
While the subtitles are free, donations do motivate further work. If you would like to donate, please click the Donate button to donate using PayPal.
If you would like to create your own word lists, here's something to get you started. Download FrequencyWordsHelper. When you run the app, it asks for a directory to scan and then for an output filename. Once you provide both, it scans the directory for all txt files and creates a word list from them. The app requires .NET Framework 4.5.
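If you would rather not install the app, the same idea is easy to sketch in Python. This is not the FrequencyWordsHelper source, just a minimal illustration of scanning a directory for txt files and counting words. The function names, the tokenizing rule (runs of letters, lowercased), the UTF-8 encoding and the non-recursive scan are all my assumptions here.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a "word" is a run of letters in any script, optionally with
# apostrophes inside (e.g. "don't"). The real tool may tokenize differently.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def count_words(directory):
    """Count word occurrences across all .txt files directly in `directory`."""
    counts = Counter()
    # Assumption: only the top level is scanned, not subdirectories.
    for path in sorted(Path(directory).glob("*.txt")):
        # Assumption: files are UTF-8; undecodable bytes are dropped.
        text = path.read_text(encoding="utf-8", errors="ignore")
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts


def write_word_list(counts, output_file):
    """Write one 'word count' pair per line, most frequent word first."""
    with open(output_file, "w", encoding="utf-8") as out:
        for word, number in counts.most_common():
            out.write(f"{word} {number}\n")
```

To use it, call count_words with the directory holding your txt files and pass the result to write_word_list along with the output filename. Give the output file an extension other than txt, or keep it outside the scanned directory, so a second run does not count the word list itself.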
Format of the frequency lists:
word1 number1 (number1 is the number of occurrences of word1 across all files)
word2 number2 (number2 is the number of occurrences of word2 across all files)
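Reading a list back in is just as simple. Here is a rough Python sketch that turns those lines into a dictionary. It assumes the word and the number are separated by a single space and that the file is UTF-8. The function names are made up for illustration, and lines that do not fit the format are skipped rather than treated as errors.

```python
def parse_frequency_list(lines):
    """Turn 'word number' lines into a {word: count} dictionary."""
    frequencies = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the last space, so the count is always the final field.
        word, _, number = line.rpartition(" ")
        if not word:
            continue  # no space found, so not a 'word number' pair
        try:
            frequencies[word] = int(number)
        except ValueError:
            continue  # the last field was not a whole number
    return frequencies


def load_frequency_list(path):
    """Read a frequency list file (assumed UTF-8) into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return parse_frequency_list(handle)


def top_words(frequencies, n):
    """Return the n most frequent words, highest count first."""
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]
```

Call load_frequency_list with the path to a downloaded list, then top_words with the result and a number, to get the most common words for building a dictionary.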
