Access to Information and Optical Character Recognition (OCR): A Step-by-Step Guide to Tesseract. Part one of the CAIJ Computer Literacy Series

2020, Access to Information and Optical Character Recognition (OCR): A Step-by-Step Guide to Tesseract. Part one of the CAIJ Computer Literacy Series

It is a perennial problem in Canada that municipal, provincial, and federal government agencies disclose records under Access to Information (ATI) / Freedom of Information (FOI) law in non-machine-readable (image) format by default. The same problem regularly emerges in historical and archival research. The inability to machine-read these texts limits the analytic techniques that can be applied to them, and it is itself a barrier to access. Fortunately, a number of free and open-source solutions to this problem exist. In computer science, transforming scanned images into machine-readable text is considered a “solved” problem. One state-of-the-art solution is the Tesseract Optical Character Recognition (OCR) engine, widely regarded as one of the best OCR engines available. This report teaches you how to use Tesseract OCR, which is made easily accessible with some simple Python code. Our larger goal is to improve access to open-source tools that can eliminate barriers to accessing information. The ability to convert a document into a format that can be searched for keywords and phrases, and possibly studied using natural language processing (NLP) or corpus linguistic methods alongside more traditional qualitative ones, promises to revolutionize social science research. We hope this tool will help ATI/FOI system users, as well as historians and archivists, render their files more accessible. The discoverability of texts is a crucial element of access to information.
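
To make the idea of "some simple Python code" concrete, here is a minimal sketch of the workflow described above: run OCR on a scanned page, then search the resulting text for a keyword. It is not the report's own code. It assumes the third-party `pytesseract` wrapper and the Pillow imaging library are installed alongside the Tesseract engine itself, and the function names `ocr_image` and `find_keyword` are hypothetical names chosen for illustration.

```python
import re


def ocr_image(image_path, lang="eng"):
    """Return the text Tesseract recognizes in a scanned image.

    Assumption: the Tesseract engine, the pytesseract wrapper, and Pillow
    are installed. They are imported here so that the search helper below
    can be used on its own without them.
    """
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def find_keyword(text, keyword):
    """Return (line number, line) pairs for lines containing the keyword.

    The match is case-insensitive and treats the keyword literally.
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]
```

With a scanned page saved under a hypothetical name such as `page_001.png`, calling `find_keyword(ocr_image("page_001.png"), "disclosure")` would list every recognized line that mentions the term. Recognition quality depends on the scan, so results from image-only disclosures should be spot-checked against the original pages.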