JESC (original) (raw)

5/12/2019 -- new version -- de-duplicated and slightly cleaner

JESC aims to support the research and development of machine translation systems, information extraction, and other language processing techniques.

JESC is the product of a collaboration between Stanford University, Google Brain, and Rakuten Institute of Technology. It was created by crawling the internet for movie and TV subtitles and aligning their captions. It is one of the largest freely available EN-JA corpora and covers the poorly represented domain of colloquial language.

You can download the scripts, tools, and crawlers used to create this dataset on GitHub.

You can read the paper here.

These data are released under a Creative Commons (CC) license.