Add initial support for traineddata files in standard archive formats by stweil · Pull Request #2290 · tesseract-ocr/tesseract

This requires libarchive (package libarchive-dev on Debian and Ubuntu).

Tesseract can now load traineddata files in any of the archive formats
which are supported by libarchive. Example of a zipped BagIt archive:

$ unzip -l /usr/local/share/tessdata/zip.traineddata
Archive:  /usr/local/share/tessdata/zip.traineddata
  Length      Date    Time    Name
---------  ---------- -----   ----
       55  2019-03-05 15:27   bagit.txt
        0  2019-03-05 15:25   data/
     1557  2019-03-05 15:28   manifest-sha256.txt
  1082890  2019-03-05 15:25   data/eng.word-dawg
  1487588  2019-03-05 15:25   data/eng.lstm
     7477  2019-03-05 15:25   data/eng.unicharset
    63346  2019-03-05 15:25   data/eng.shapetable
   976552  2019-03-05 15:25   data/eng.inttemp
    13408  2019-03-05 15:25   data/eng.normproto
     4322  2019-03-05 15:25   data/eng.punc-dawg
     4738  2019-03-05 15:25   data/eng.lstm-number-dawg
     1410  2019-03-05 15:25   data/eng.freq-dawg
      844  2019-03-05 15:25   data/eng.pffmtable
     6360  2019-03-05 15:25   data/eng.lstm-unicharset
     1012  2019-03-05 15:25   data/eng.lstm-recoder
     1047  2019-03-05 15:25   data/eng.unicharambigs
     4322  2019-03-05 15:25   data/eng.lstm-punc-dawg
 16109842  2019-03-05 15:25   data/eng.bigram-dawg
       80  2019-03-05 15:25   data/eng.version
     6426  2019-03-05 15:25   data/eng.number-dawg
  3694794  2019-03-05 15:25   data/eng.lstm-word-dawg
---------                     -------
 23468070                     21 files
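
Since the example archive follows the BagIt layout, its payload can be checked against manifest-sha256.txt with ordinary tooling. The following is a minimal Python sketch of such a check, assuming a zip container and manifest lines of the form "<sha256 hex digest> <path>". The function name verify_bag and the tiny in-memory bag with placeholder bytes are illustrative only. Nothing in this change states that Tesseract itself reads or verifies the manifest.

```python
import hashlib
import io
import zipfile


def verify_bag(archive):
    """Return the payload paths whose SHA-256 differs from manifest-sha256.txt."""
    mismatches = []
    with zipfile.ZipFile(archive) as bag:
        manifest = bag.read("manifest-sha256.txt").decode("utf-8")
        for line in manifest.splitlines():
            if not line.strip():
                continue
            # Each manifest line is "<hex digest> <path relative to the bag root>".
            expected, path = line.split(maxsplit=1)
            actual = hashlib.sha256(bag.read(path)).hexdigest()
            if actual != expected.lower():
                mismatches.append(path)
    return mismatches


# Tiny in-memory bag with placeholder payload bytes (not real model data).
payload = {"data/eng.version": b"placeholder version\n"}
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as bag:
    bag.writestr("bagit.txt", "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    for path, content in payload.items():
        bag.writestr(path, content)
    bag.writestr(
        "manifest-sha256.txt",
        "".join(f"{hashlib.sha256(c).hexdigest()}  {p}\n" for p, c in payload.items()),
    )

print(verify_bag(buffer))  # an empty list means every checksum matched
```

For an archive on disk, the same function can be called with its path, for example the zip.traineddata file listed above.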

combine_tessdata -d and combine_tessdata -u also work with such archives.

The traineddata files in the new format can be generated with
standard tools like zip or tar.
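
As an illustration of how such an archive might be assembled, here is a minimal sketch that uses Python's standard zipfile module in place of the zip command-line tool. It mirrors the layout of the listing above (components under data/ plus bagit.txt and manifest-sha256.txt). The helper name pack_traineddata, the file name demo.traineddata and the placeholder component bytes are assumptions for the example. Whether Tesseract needs the BagIt tag files, or only the component files, is not stated here.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def pack_traineddata(component_dir, output_path):
    """Zip every file in component_dir under data/ and add the BagIt tag files."""
    manifest_lines = []
    with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as bag:
        bag.writestr("bagit.txt", "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
        for component in sorted(Path(component_dir).iterdir()):
            if not component.is_file():
                continue
            content = component.read_bytes()
            member = f"data/{component.name}"
            bag.writestr(member, content)
            manifest_lines.append(f"{hashlib.sha256(content).hexdigest()}  {member}\n")
        bag.writestr("manifest-sha256.txt", "".join(manifest_lines))
    return output_path


with tempfile.TemporaryDirectory() as tmp:
    components = Path(tmp) / "components"
    components.mkdir()
    # Placeholder bytes only; real components would come from training or
    # from unpacking an existing model with combine_tessdata -u.
    for name in ("eng.unicharset", "eng.lstm", "eng.version"):
        (components / name).write_bytes(b"placeholder for " + name.encode())

    archive = pack_traineddata(components, Path(tmp) / "demo.traineddata")
    with zipfile.ZipFile(archive) as bag:
        members = bag.namelist()
    print(members)
```

The resulting file would still have to be placed in the tessdata directory under a name ending in .traineddata, as with zip.traineddata in the example above.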

More work is needed for the other training tools and for big-endian support.

Signed-off-by: Stefan Weil <sw@weilnetz.de>