
Check out hanzi_lookup, a Rust + WebAssembly port of this library, for a faster and more modern component.

HanziLookupJS

Free, open-source Chinese handwriting recognition in Javascript.

The library is based on Jordan Kiang's HanziLookup. It contains data derived from Shaunak Kishore's Make Me a Hanzi, and an improved character recognition algorithm.

Online demo: https://gugray.github.io/hanzilookupjs

[Screenshot: HanziLookupJS demo]

Getting started

You can try the library right away by cloning the repository, serving it through a lightweight local web server such as Mongoose, and opening /demo/index.html in a browser. To use the library in your own application:

  $(document).ready(function () {
    // Only fetch data (large, takes long) when the page has loaded;
    // fileLoaded is your own callback, invoked once per data file
    HanziLookup.init("mmah", "../dist/mmah.json", fileLoaded);
    HanziLookup.init("orig", "../dist/orig.json", fileLoaded);
  });

  // Once the data has loaded: decompose the character drawn on the board
  // (_drawingBoard is the demo's drawing widget; substitute your own stroke source)
  var analyzedChar = new HanziLookup.AnalyzedCharacter(_drawingBoard.cloneStrokes());
  // Look up with MMAH data; 8 limits the number of matches returned
  var matcherMMAH = new HanziLookup.Matcher("mmah");
  matcherMMAH.match(analyzedChar, 8, function(matches) {
    // The matches array contains results, best first
    showResults($(".mmahLookupChars"), matches);
  });
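
The snippet above references three names that belong to the hosting page rather than to the library: fileLoaded, _drawingBoard (the demo's drawing board widget) and showResults. A minimal sketch of the two callbacks, assuming you simply want to wait for both data files before matching and then render the candidates, could look like the following; the names, the counter and the shape of a match object are illustrative, not part of the library's documented API:

  var filesLoaded = 0;

  function fileLoaded() {
    // Invoked once per HanziLookup.init call above; only start matching
    // once both data files have arrived
    filesLoaded++;
    if (filesLoaded == 2) {
      // Safe to create matchers / enable the lookup button here
    }
  }

  function showResults(elm, matches) {
    // Assumes each match exposes the recognized character as "character";
    // inspect the matches array in your own setup to confirm
    elm.empty();
    matches.forEach(function (match) {
      elm.append($("<span>").text(match.character));
    });
  }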

The two data files

The file orig.json contains the original data from Jordan Kiang's HanziLookup. Character entries contain the character's stroke count, substroke count, and a pointer into the substroke data. Each substroke is represented by a direction and a normalized length. 10,657 characters are encoded this way, but there are multiple entries for some to account for alternative ways of writing them, resulting in a total of 15,652 entries.

The file mmah.json is derived from Make Me a Hanzi's graphics.txt and encodes 9,507 characters. This file is richer because its substroke data also contains the normalized location (center point) of every substroke. The matching algorithm detects at runtime whether this positional information is present (non-zero) and factors it into the score: a substroke that is in the wrong place counts for less.
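
As a rough sketch of the idea only (this is not the library's actual scoring code, and the function and parameter names are made up for illustration), a position-aware score can down-weight a substroke whose shape matches but whose center point is far from where the reference data expects it:

  function positionAwareScore(shapeScore, inputCenter, refCenter) {
    // orig.json carries no position data, so refCenter may be absent/zero
    if (!refCenter) return shapeScore;
    var dx = inputCenter.x - refCenter.x;       // centers normalized to 0..1
    var dy = inputCenter.y - refCenter.y;
    var dist = Math.sqrt(dx * dx + dy * dy);
    return shapeScore * Math.max(0, 1 - dist);  // farther away counts for less
  }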

The conversion tool is a .NET Core application, included in the /mmah-convert subfolder. To repeat the conversion, you need to place graphics.txt from the MMAH repository in the /work subfolder.

Both data sets store substroke descriptors in an ArrayBuffer, Base64-encoded in the JSON file. Each substroke is represented by 3 bytes: (1) Direction in radians, with 0-2*PI normalized to 0-255; (2) Length normalized to 0-255, where 255 is the bounding square's full width; (3) Centerpoint X and Y, both normalized to 0-15, with X in the 4 higher bits.
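
As an illustration of this layout, the sketch below unpacks such a Base64 string into per-substroke values. Only the 3-byte encoding comes from the description above; the function name and how you obtain the Base64 string from the JSON are assumptions:

  function decodeSubstrokes(base64) {
    var raw = atob(base64);                        // Base64 -> binary string
    var bytes = new Uint8Array(raw.length);
    for (var i = 0; i < raw.length; i++) bytes[i] = raw.charCodeAt(i);
    var substrokes = [];
    for (var ofs = 0; ofs + 2 < bytes.length; ofs += 3) {
      substrokes.push({
        direction: bytes[ofs] / 255 * 2 * Math.PI, // radians, 0..2*PI
        length: bytes[ofs + 1] / 255,              // fraction of the bounding square's width
        centerX: bytes[ofs + 2] >> 4,              // high 4 bits, 0..15
        centerY: bytes[ofs + 2] & 0x0f             // low 4 bits, 0..15
      });
    }
    return substrokes;
  }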

License

This Javascript library is derived from Jordan Kiang's original HanziLookup. In keeping with the original, it is licensed under the GNU GPL. This license covers both the code and the original data in orig.json.

The data in mmah.json is ultimately derived, via Make Me a Hanzi's graphics.txt, from fonts released by Arphic Technology Co., Ltd.

You can redistribute and/or modify mmah.json under the terms of the Arphic Public License as published by Arphic Technology Co., Ltd. The license is reproduced in LICENSE-APL; you can also find it online at http://ftp.gnu.org/non-gnu/chinese-fonts-truetype/LICENSE.