RFC: Add initial support for traineddata files in compressed archive formats (don't merge) by stweil · Pull Request #911 · tesseract-ocr/tesseract


@stweil

This requires libminizip-dev, so expect failures from CI.

Up to now, little-endian Tesseract works with the new zip format.

More work is needed for the training tools, for big-endian support, and to maintain compatibility with the current proprietary format.

Signed-off-by: Stefan Weil sw@weilnetz.de

@stweil

Open questions:

@Shreeshrii

Yes, there have been requests for more compact/compressed traineddata files.

Another question.

@stweil

libminizip-dev was added in Ubuntu Xenial (16.04), so the current Travis build environment which is based on Ubuntu Trusty does not provide it.

@Shreeshrii

libminizip-dev was added in Ubuntu Xenial (16.04), so the current Travis build environment which is based on Ubuntu Trusty does not provide it.

Why not use a different compression library that is available on other operating systems as well as on older Ubuntu versions?

@stweil

On my Debian system I find these libraries: minizip (supported since 16.04), libzip (supported since 12.04), zziplib (supported since 12.04), and libarchive (supported since 12.04). As far as I know, all use licenses which are compatible with Tesseract. I assume any of them could be used, and I expect that none of them is available as a binary for Windows (maybe also not for macOS), but I have not checked.

@stweil

The zip format reduces eng.traineddata from about 31 MiB to 16 MiB (a 48 % size reduction) by default. zip -9 improves the reduction to 49 %. Other compressed formats achieve even better compression ratios:

31887360 eng.traineddata.tar
31873501 eng.traineddata
18121906 eng.traineddata.lz4
16461487 eng.traineddata.zip (default)
16372645 eng.traineddata.zip (maximum compression)
15193532 eng.traineddata.tar.bz2
13274164 eng.traineddata.tar.xz
13273173 eng.traineddata.7z

75100160 mya.traineddata.tar
75085274 mya.traineddata
42274775 mya.traineddata.lz4
39468033 mya.traineddata.tar.bz2
36296750 mya.traineddata.tar.gz
36075469 mya.traineddata.zip
28097639 mya.traineddata.7z
27937332 mya.traineddata.tar.xz

@zdenop

Please move the discussion to the tesseract-dev forum. This is a significant change.

@egorpugin

@stweil

Please move the discussion to the tesseract-dev forum. This is a significant change.

See this discussion in the forum. I added a link to GitHub there.

@stweil

libarchive handles all formats

It is also supported by current Linux distributions and would be interesting if a compressed tar instead of zip is preferred. I added it to my previous post.

@egorpugin

@stweil

What about lz4?

It does not compress very well (see the result added to the list above).

@egorpugin

Actually I wanted to say lzma, which uses the .xz/.7z extensions.
Sorry. :)

@stweil

Rebased and added support for libzip.

@egorpugin

What libraries are currently in use in your PR?
libarchive? minizip? libzip?
I see libarchive in the build scripts, but not in the code.
Maybe it is worth using only one implementation (library)? I don't like multiple implementations of the same thing.

@stweil

This is experimental code, as there is still no decision whether compressed archives should be supported at all, and if so, with which format and which library.

The current code uses libzip; if that is not found, minizip. If neither of those is found, it uses the normal code. I prepared the code for further experiments, for example to support compressed tar archives with libarchive.

@stweil

As you can see here, the implementations for the two currently supported libraries are very similar.

@stweil

The latest code also supports libarchive (highest priority). With that library, all kinds of compressed archives should work (so far I have tested only zip).

@stweil changed the title from "RFC: Add initial support for traineddata files in zip format (don't merge)" to "RFC: Add initial support for traineddata files in zip and other compressed archive formats (don't merge)"

May 13, 2017

@stweil

As libarchive indeed supports all formats, I could compare the time needed for each format. Tesseract was run five times on each format with English on a simple hello-world text. Below are the results, sorted by time in seconds. Interpretation:

The file I/O from disk did not play a role in this test because of the Linux file cache and the SSD of my computer.

  0.13 eng.traineddata.tar
  0.14 eng.traineddata
  0.14 eng.traineddata.tar
  0.14 eng.traineddata.tar
  0.14 eng.traineddata.tar
  0.15 eng.traineddata
  0.15 eng.traineddata.tar
  0.15 eng.traineddata.lz4
  0.16 eng.traineddata
  0.16 eng.traineddata
  0.17 eng.traineddata.lz4
  0.17 eng.traineddata.lz4
  0.18 eng.traineddata
  0.18 eng.traineddata.lz4
  0.22 eng.traineddata.lz4
  0.29 eng.traineddata.zip
  0.29 eng.traineddata.zip
  0.29 eng.traineddata.zip
  0.30 eng.traineddata.zip
  0.30 eng.traineddata.zip
  0.97 eng.traineddata.7z
  0.98 eng.traineddata.7z
  0.98 eng.traineddata.7z
  0.99 eng.traineddata.7z
  0.99 eng.traineddata.tar.xz
  0.99 eng.traineddata.tar.xz
  1.00 eng.traineddata.tar.xz
  1.00 eng.traineddata.tar.xz
  1.00 eng.traineddata.tar.xz
  1.04 eng.traineddata.7z
  1.55 eng.traineddata.tar.bz2
  1.56 eng.traineddata.tar.bz2
  1.61 eng.traineddata.tar.bz2
  1.62 eng.traineddata.tar.bz2
  1.66 eng.traineddata.tar.bz2

@Shreeshrii

Please also try the test with a different language. Maybe one with the largest traineddata size, to see if file size has any impact on the relative speeds. Thanks.

@stweil

Test results with libarchive for mya.traineddata (the largest of all traineddata files). I did not test lz4, but added a test with gz format.

0.48 mya.traineddata.tar
0.49 mya.traineddata
0.49 mya.traineddata.tar
0.49 mya.traineddata.tar
0.49 mya.traineddata.tar
0.50 mya.traineddata
0.52 mya.traineddata
0.52 mya.traineddata.tar
0.54 mya.traineddata
0.54 mya.traineddata
0.79 mya.traineddata.tar.gz
0.80 mya.traineddata.tar.gz
0.80 mya.traineddata.tar.gz
0.82 mya.traineddata.zip
0.84 mya.traineddata.zip
0.85 mya.traineddata.zip
0.86 mya.traineddata.tar.gz
0.86 mya.traineddata.zip
0.88 mya.traineddata.tar.gz
0.90 mya.traineddata.zip
2.38 mya.traineddata.7z
2.38 mya.traineddata.7z
2.38 mya.traineddata.tar.xz
2.40 mya.traineddata.7z
2.41 mya.traineddata.tar.xz
2.45 mya.traineddata.7z
2.45 mya.traineddata.tar.xz
2.46 mya.traineddata.7z
2.46 mya.traineddata.tar.xz
2.49 mya.traineddata.tar.xz
3.69 mya.traineddata.tar.bz2
3.74 mya.traineddata.tar.bz2
3.75 mya.traineddata.tar.bz2
3.79 mya.traineddata.tar.bz2
3.84 mya.traineddata.tar.bz2

libzip gives similar results, but only supports the zip format:

0.83 mya.traineddata.zip
0.84 mya.traineddata.zip
0.87 mya.traineddata.zip
0.88 mya.traineddata.zip
0.93 mya.traineddata.zip

libminizip:

0.84 mya.traineddata.zip
0.84 mya.traineddata.zip
0.85 mya.traineddata.zip
0.87 mya.traineddata.zip
0.92 mya.traineddata.zip

libzzip:

0.75 mya.traineddata.zip
0.78 mya.traineddata.zip
0.78 mya.traineddata.zip
0.79 mya.traineddata.zip
0.84 mya.traineddata.zip

@egorpugin

Does lzma compress slower but better? Or does it also decompress slower?

@stweil

lzma created the xz files. 7zip and lzma gave the best compression ratios, but both also need some time for decompression (which is what matters for Tesseract): they need about 1.9 s more, but are still faster than bz2.

Please note that the current code for all formats reads all parts of the tessdata file, no matter whether they are used or not, so the decompression overhead could be reduced.

@Shreeshrii

@theraysmith wrote on 4/18/14

I have no objection to switching to zip (with no tar) for the tessdata files. That should be usable by everybody more easily.

and on 4/20/14

I spent some time looking at zlib. It doesn't seem to make it easy to randomly access named entities in a gzip file, unless I am missing something. The memory compress/uncompress functions are quite nice though.

For the next version it would be nice to:
Update tessdatamanager to cope with compressed components.
Eliminate fread/fscanf from file input code and allow everything to read from a memory buffer.
These can probably both be achieved with the TFile class that I added for 3.03.

This is a change in direction from my previous work with new classifier experiments, where I have been writing everything to use Serialize/DeSerialize and FILE streams, but this doesn't seem to be as portable as I had hoped, due to its reliance on fmemopen. It seems it would be better to make everything use memory buffers and push the file I/O responsibility out to TessDataManager/TFile, which could then just as easily deal with compressed files or in-memory data.

@stweil Do all the methods you tested support randomly accessing named entities?

@theraysmith Is there a particular reason for zip (with no tar)?

@stweil

The current Tesseract code reads the whole tessdata file into memory and gets all data from memory. My implementation for compressed archive files does that, too. Therefore random access is trivial: all component files are held in a vector of byte arrays.

@ghost

@stweil @amitdo @egorpugin have you tested zstd compression? I have tested it, and it's very fast. Also, if you add a dictionary to it, the compression ratio gets even better. I think it's a game changer.
https://github.com/facebook/zstd

zstd compression with a dictionary (benchmark chart not included).

@stweil

I have not tested it yet, but it looks like we get Zstandard support with libarchive. Pull request libarchive/libarchive#905 added Zstandard there.

@zdenop

AFAIR there was an intention to use already-used libraries, i.e. not to increase the number of dependencies.
Building Tesseract with VS without cppan on Windows is already painful...

@egorpugin

With the next libarchive release I'll add the zstd dependency to it in cppan, so Tesseract will get it automatically.
(libarchive is used inside cppan extensively.)
(libarchive is used inside cppan extensively.)

@stweil

Tesseract only needs to add a dependency on libarchive to get support for compressed archives.

@zdenop

So if I understand it right: if we compress data files with Zstandard, users on all platforms will need to compile libarchive + Zstandard...

@stweil

That's correct. Therefore I would still distribute the data files in zip format, which hopefully has good support on all platforms. Users who need maximum performance could then repack the data files they need with a different compression standard.

@Shreeshrii

@zdenop

The milestone is set to 4.1.0. Is it time to merge it? There have not been a lot of changes here in the last months...

@stweil

This requires libarchive-dev, libzip-dev or libminizip-dev.

Up to now, little endian tesseract works with the new format. More work is needed for training tools and big endian support.

Signed-off-by: Stefan Weil sw@weilnetz.de

@stweil

Pull request #2290 now includes the implementation with libarchive, so this proof of concept is obsolete and can be closed.