RFC: Add initial support for traineddata files in compressed archive formats (don't merge) by stweil · Pull Request #911 · tesseract-ocr/tesseract
This requires libminizip-dev, so expect failures from CI.
Up to now, little-endian Tesseract works with the new zip format.
More work is needed for the training tools, for big-endian support, and to maintain
compatibility with the current proprietary format.
Signed-off-by: Stefan Weil sw@weilnetz.de
Open questions:
- Do we want a new format? As I'm not the first one who had the idea, I think the answer is yes.
- Do we need support for both the old (current) and the new format? I'd drop support for the old format and remove `combine_tessdata`.
- Should the `traineddata` files in the new format add `.zip` to the file names? I'd omit `.zip`.
- Should the code for minizip be added to the Tesseract sources, or should we add an external dependency on libminizip-dev?
- Which one is better, zip or compressed tar?
Yes, there have been requests for more compact/compressed traineddata files.
Another question:
- Should the new format be limited to tesseract 4.0 or also applied to 3.05?
`libminizip-dev` was added in Ubuntu Xenial (16.04), so the current Travis build environment, which is based on Ubuntu Trusty, does not provide it.
Why not use a different compression library that is available on other operating systems as well as on older Ubuntu versions?
On my Debian system I find these libraries: minizip (supported since 16.04), libzip (supported since 12.04), zziplib (supported since 12.04), libarchive (supported since 12.04). As far as I know, all use licenses which are compatible with Tesseract's. I assume any of them could be used, and I expect that none of them is available as a binary for Windows (maybe also not for macOS), but I did not check.
The zip format reduces `eng.traineddata` from about 31 MiB to 16 MiB (48 % compression) by default. `zip -9` improves the compression to 49 %. Other compressed formats achieve even better compression ratios:
31887360 eng.traineddata.tar
31873501 eng.traineddata
18121906 eng.traineddata.lz4
16461487 eng.traineddata.zip (default)
16372645 eng.traineddata.zip (maximum compression)
15193532 eng.traineddata.tar.bz2
13274164 eng.traineddata.tar.xz
13273173 eng.traineddata.7z
75100160 mya.traineddata.tar
75085274 mya.traineddata
42274775 mya.traineddata.lz4
39468033 mya.traineddata.tar.bz2
36296750 mya.traineddata.tar.gz
36075469 mya.traineddata.zip
28097639 mya.traineddata.7z
27937332 mya.traineddata.tar.xz
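For reference, the ranking of the general-purpose codecs in the lists above can be reproduced with Python's standard library. This is only an illustrative sketch on synthetic, repetitive data, not on actual traineddata contents, so the absolute ratios on real model files will differ:

```python
import bz2
import lzma
import zlib

# Synthetic stand-in payload; real traineddata files are binary model data.
data = b"Tesseract traineddata component payload " * 4096

sizes = {
    "zlib (zip's deflate, level 9)": len(zlib.compress(data, 9)),
    "bz2": len(bz2.compress(data)),
    "lzma/xz": len(lzma.compress(data)),
}

# Print smallest first, like the sorted size lists above.
for name, size in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(f"{size:8d}  {name}")
```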
Please move discussion to the tesseract-dev forum. This is a significant change.
See this discussion in the forum. I added a link to GitHub there.
libarchive handles all formats.
It is also supported by current Linux distributions and would be interesting if compressed tar is preferred over zip. I added it to my previous post.
What about lz4?
It does not compress very well (see the result added to the list above).
Actually I wanted to say lzma, which uses the .xz/.7z extensions.
Sorry. :)
Rebased and added support for `libzip`.
What libraries are currently in use in your PR?
libarchive? minizip? libzip?
I see libarchive in build scripts, but not in code.
Maybe it is worth using only one implementation (library)? I don't like multiple implementations of the same thing.
This is experimental code, as there is still no decision whether compressed archives should be supported at all, and if yes, with which format and which library.
The current code uses `libzip`; if that is not found, `minizip`. If neither of those is found, it uses the normal code. I prepared the code for more experiments, for example to support compressed tar archives with `libarchive`.
As you can see here, the implementations for the two currently supported libraries are very similar.
The latest code also supports `libarchive` (highest priority). With that library, all kinds of compressed archives should work (up to now I tested with zip only).
stweil changed the title from "RFC: Add initial support for traineddata files in zip format (don't merge)" to "RFC: Add initial support for traineddata files in zip and other compressed archive formats (don't merge)"
As `libarchive` indeed supports all formats, I could compare the time needed for each format. Tesseract was run 5 times on each format with English on a simple hello world text. Below are the results, sorted by time in seconds for each test. Interpretation:
- The original Tesseract format, uncompressed tar and lz4 tar are similar and fastest.
- zip needs about 150 ms more time than the original Tesseract format.
- 7z and xz tar need about 850 ms more time than the original Tesseract format.
- bz2 tar is slowest and needs about 1450 ms more time than the original Tesseract format.
The file i/o from disk did not play a role in this test because of the Linux file cache and the SSD of my computer.
0.13 eng.traineddata.tar
0.14 eng.traineddata
0.14 eng.traineddata.tar
0.14 eng.traineddata.tar
0.14 eng.traineddata.tar
0.15 eng.traineddata
0.15 eng.traineddata.tar
0.15 eng.traineddata.lz4
0.16 eng.traineddata
0.16 eng.traineddata
0.17 eng.traineddata.lz4
0.17 eng.traineddata.lz4
0.18 eng.traineddata
0.18 eng.traineddata.lz4
0.22 eng.traineddata.lz4
0.29 eng.traineddata.zip
0.29 eng.traineddata.zip
0.29 eng.traineddata.zip
0.30 eng.traineddata.zip
0.30 eng.traineddata.zip
0.97 eng.traineddata.7z
0.98 eng.traineddata.7z
0.98 eng.traineddata.7z
0.99 eng.traineddata.7z
0.99 eng.traineddata.tar.xz
0.99 eng.traineddata.tar.xz
1.00 eng.traineddata.tar.xz
1.00 eng.traineddata.tar.xz
1.00 eng.traineddata.tar.xz
1.04 eng.traineddata.7z
1.55 eng.traineddata.tar.bz2
1.56 eng.traineddata.tar.bz2
1.61 eng.traineddata.tar.bz2
1.62 eng.traineddata.tar.bz2
1.66 eng.traineddata.tar.bz2
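The measurement procedure above (5 runs per format, sorted by wall-clock time) can be sketched with a small harness. The workload below is a stand-in, since the original test invoked tesseract itself on a hello world image:

```python
import time

def time_runs(fn, repeats=5):
    """Run fn() `repeats` times and return the individual wall-clock
    times sorted ascending, mirroring the per-format lists above."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return sorted(times)

# Stand-in workload; the real test ran tesseract once per traineddata variant.
timings = time_runs(lambda: sum(range(100_000)))
print([f"{t:.3f}" for t in timings])
```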
Please also try the test with a different language, maybe the one with the largest traineddata size, to see if the file size has any impact on the relative speeds. Thanks.
Test results with `libarchive` for `mya.traineddata` (the largest of all `traineddata` files). I did not test lz4, but added a test with the gz format.
0.48 mya.traineddata.tar
0.49 mya.traineddata
0.49 mya.traineddata.tar
0.49 mya.traineddata.tar
0.49 mya.traineddata.tar
0.50 mya.traineddata
0.52 mya.traineddata
0.52 mya.traineddata.tar
0.54 mya.traineddata
0.54 mya.traineddata
0.79 mya.traineddata.tar.gz
0.80 mya.traineddata.tar.gz
0.80 mya.traineddata.tar.gz
0.82 mya.traineddata.zip
0.84 mya.traineddata.zip
0.85 mya.traineddata.zip
0.86 mya.traineddata.tar.gz
0.86 mya.traineddata.zip
0.88 mya.traineddata.tar.gz
0.90 mya.traineddata.zip
2.38 mya.traineddata.7z
2.38 mya.traineddata.7z
2.38 mya.traineddata.tar.xz
2.40 mya.traineddata.7z
2.41 mya.traineddata.tar.xz
2.45 mya.traineddata.7z
2.45 mya.traineddata.tar.xz
2.46 mya.traineddata.7z
2.46 mya.traineddata.tar.xz
2.49 mya.traineddata.tar.xz
3.69 mya.traineddata.tar.bz2
3.74 mya.traineddata.tar.bz2
3.75 mya.traineddata.tar.bz2
3.79 mya.traineddata.tar.bz2
3.84 mya.traineddata.tar.bz2
`libzip` gives similar results, but only supports the zip format:
0.83 mya.traineddata.zip
0.84 mya.traineddata.zip
0.87 mya.traineddata.zip
0.88 mya.traineddata.zip
0.93 mya.traineddata.zip
`libminizip`:
0.84 mya.traineddata.zip
0.84 mya.traineddata.zip
0.85 mya.traineddata.zip
0.87 mya.traineddata.zip
0.92 mya.traineddata.zip
`libzzip`:
0.75 mya.traineddata.zip
0.78 mya.traineddata.zip
0.78 mya.traineddata.zip
0.79 mya.traineddata.zip
0.84 mya.traineddata.zip
lzma compresses slower but better? Or is it also slower to decompress?
lzma created the xz files. 7zip and lzma gave the best compression ratios, but both also need some time for decompression (which is relevant for Tesseract): they need about 1.9 s more time (but are still faster than bz2).
Please note that the current code for all formats reads all parts of the `tessdata` file, no matter whether they are used or not, so the decompression overhead could be reduced.
@theraysmith wrote on 4/18/14
I have no objection to switching to zip (with no tar) for the tessdata files. That should be usable by everybody more easily.
and on 4/20/14
I spent some time looking at zlib. It doesn't seem to make it easy to randomly access named entities in a gzip file, unless I am missing something. The memory compress/uncompress functions are quite nice though.
For the next version it would be nice to:
Update tessdatamanager to cope with compressed components.
Eliminate fread/fscanf from file input code and allow everything to read from a memory buffer.
These can probably both be achieved with the TFile class that I added for 3.03.

This is a change in direction from my previous work with new classifier experiments, where I have been writing everything to use Serialize/DeSerialize and FILE streams, but this doesn't seem to be as portable as I had hoped, due to its reliance on fmemopen. It seems it would be better to make everything use memory buffers and push the file I/O responsibility out to TessDataManager/TFile, which could then just as easily deal with compressed files or in-memory data.
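The "read everything from a memory buffer" idea can be sketched as follows: load the file (or archive) into memory once, then let all parsing helpers operate only on the buffer, never on a FILE handle, so the same code path works for on-disk files, compressed archives, or in-memory data. Names here are illustrative, not Tesseract's actual TFile API:

```python
import io
import struct

def load_into_memory(path_or_bytes):
    """Load a file (by path) or raw bytes into a seekable memory buffer."""
    if isinstance(path_or_bytes, bytes):
        return io.BytesIO(path_or_bytes)
    with open(path_or_bytes, "rb") as f:
        return io.BytesIO(f.read())

def read_u32(buf):
    """Parsing helper that works only on the buffer, not on a FILE stream."""
    return struct.unpack("<I", buf.read(4))[0]

buf = load_into_memory(struct.pack("<II", 42, 7))
print(read_u32(buf), read_u32(buf))  # prints: 42 7
```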
@stweil Do all the methods you tested support randomly accessing named entities?
@theraysmith Is there a particular reason for zip (with no tar)?
The current Tesseract code reads the whole `tessdata` file into memory and gets all data from memory. My implementation for compressed archive files does that, too. Therefore random access is trivial: all component files are in a vector of byte arrays.
@stweil @amitdo @egorpugin have you tested zstd compression? I have tested it, and it's very fast. Also, if you add a dictionary to it, the compression ratio gets even better. I think it's a game changer.
https://github.com/facebook/zstd
ZSTD compressing with a dictionary:
- Create the dictionary: `zstd --train FullPathToTrainingSet/* -o dictionaryName`
- Compress with the dictionary: `zstd -D dictionaryName FILE`
- Decompress with the dictionary: `zstd -D dictionaryName --decompress FILE.zst`
- Increase the dictionary size: `zstd --train dirSamples/* -o dictionaryName --maxdict=1024KB`
I have not tested it yet, but it looks like we get Zstandard support with libarchive. Pull request libarchive/libarchive#905 added Zstandard there.
AFAIR there was an intention to use already-used libraries, i.e. not to increase the number of dependencies.
Building Tesseract with VS without cppan on Windows is already a pain...
With the next libarchive release I'll add the zstd dependency into it in cppan, so tesseract will get it automatically.
(libarchive is used inside cppan extensively.)
Tesseract only needs to add a dependency on `libarchive` to get support for compressed archives.
So if I understand it right, if we compress the data files with Zstandard, users on all platforms will need to compile libarchive + Zstandard...
That's correct. Therefore I would still distribute the data files in zip format, which hopefully has good support on all platforms. But users who need maximum performance could then repack the data files they need with a different compression standard.
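The "repack for maximum performance" idea needs no Tesseract-specific tooling: a distributed zip archive can be converted to another container and compression. A sketch using only the Python standard library (here zip to tar.xz; member names are illustrative):

```python
import io
import tarfile
import time
import zipfile

def zip_to_tar_xz(zip_bytes):
    """Repack every member of a zip archive into an xz-compressed tar."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf, \
         tarfile.open(fileobj=out, mode="w:xz") as tf:
        for name in zf.namelist():
            data = zf.read(name)
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = int(time.time())
            tf.addfile(info, io.BytesIO(data))
    return out.getvalue()
```

On the command line the same repacking could of course be done with `unzip` plus `tar -cJf`; the point is only that the container conversion is cheap and lossless.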
The milestone is set to 4.1.0. Is it time to merge it? There have not been a lot of changes here in the last months...
This requires libarchive-dev, libzip-dev or libminizip-dev.
Up to now, little-endian Tesseract works with the new format. More work is needed for the training tools and big-endian support.
Signed-off-by: Stefan Weil sw@weilnetz.de
Pull request #2290 now includes the implementation with `libarchive`, so this proof of concept is now obsolete and can be closed.