GitHub - BYVoid/OpenCC: Library for conversion between Traditional and Simplified Chinese

Open Chinese Convert 開放中文轉換

Introduction 介紹

OpenCC

Open Chinese Convert (OpenCC, 開放中文轉換) is an open source project for conversions between Traditional Chinese, Simplified Chinese and Japanese Kanji (Shinjitai). It supports character-level and phrase-level conversion, character variant handling, and regional vocabulary variants across Mainland China, Taiwan and Hong Kong. This is not a translation tool between Mandarin and Cantonese, etc.

中文簡繁轉換開源項目,支持詞彙級別的轉換、異體字轉換和地區習慣用詞轉換(中國大陸、台灣、香港)及日本新字體轉換。不提供普通話與粵語之間的轉換。

Discussion (Telegram): https://t.me/open_chinese_convert

Features 特點

For details, see the OpenCC design philosophy (OpenCC 設計思想) and the regional vocabulary inclusion criteria (地區詞收錄標準).

Installation 安裝

Package Managers 包管理器

Prebuilt binaries 預編譯二進位檔

Usage 使用

Online 線上轉換

https://opencc.js.org/converter?config=s2t

Node.js

npm install opencc

The npm package supports Node.js >=20.17. It uses bundled Node-API prebuilds when available and falls back to a local node-gyp build when the current platform does not have a matching prebuild.

To install the npm CLI:

npm install -g opencc
opencc -c s2t.json -i input.txt -o output.txt

The npm CLI supports basic text conversion. Plugins, --inspect, and --segmentation require the native OpenCC CLI.

import { OpenCC } from 'opencc';

async function main() {
  const converter: OpenCC = new OpenCC('s2t.json');
  const result: string = await converter.convertPromise('汉字');
  console.log(result); // 漢字
}

main();

See demo.js and ts-demo.ts.

Python

pip install opencc (Windows, Linux, macOS)

import opencc
converter = opencc.OpenCC('s2t.json')
converter.convert('汉字')  # 漢字

C++

#include "opencc.h"

int main() {
  const opencc::SimpleConverter converter("s2t.json");
  converter.Convert("汉字");  // 漢字
  return 0;
}

Full example with Bazel

When OpenCC is embedded in a server binary or self-contained application, the JSON config can stay small while dictionary resources are loaded from explicit resource directories:

#include <memory>
#include <string>
#include <vector>

#include "SimpleConverter.hpp"

int main() {
  // Directories are searched in the order listed.
  auto resources = std::make_shared<opencc::FilesystemResourceProvider>(
      std::vector<std::string>{
          "/opt/my-app/opencc",
          "/opt/my-app/plugins/opencc-jieba",
          "/usr/share/opencc",
      });
  const opencc::SimpleConverter converter("s2t.json", resources);
  converter.Convert("汉字");
  return 0;
}

FilesystemResourceProvider searches the directories in the order given. Existing SimpleConverter("s2t.json") calls and the CLI continue to use the config file location, current directory, explicit paths, and installed OpenCC data directory as before.
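The idea of an ordered search can be illustrated with a minimal Python sketch. This is not OpenCC's implementation: find_resource is a hypothetical helper written only for illustration, and the temporary directories stand in for the application and system resource directories.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def find_resource(name: str, directories: Iterable[str]) -> Optional[Path]:
    """Return the first file called `name`, checking directories in the order given."""
    for directory in directories:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    # No directory contained the file.
    return None


with tempfile.TemporaryDirectory() as app_dir, tempfile.TemporaryDirectory() as system_dir:
    # Only the second directory holds the file, so the search falls through to it.
    Path(system_dir, "STPhrases.ocd2").write_bytes(b"placeholder")
    found = find_resource("STPhrases.ocd2", [app_dir, system_dir])
    print(found)
```

If the same file name exists in more than one directory, the earliest directory in the list wins, which is why application-specific directories are listed before the system-wide one in the C++ example.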

C

#include <string.h>

#include "opencc.h"

int main() {
  opencc_t opencc = opencc_open("s2t.json");
  const char* input = "汉字";
  char* converted = opencc_convert_utf8(opencc, input, strlen(input));  // 漢字
  opencc_convert_utf8_free(converted);
  opencc_close(opencc);
  return 0;
}

Full Document 完整文檔

Command Line

Segmentation and Inspection Modes

OpenCC CLI supports two diagnostic modes that output JSON instead of converted text:

--segmentation — Output segmentation result only (no conversion):

echo "他只看了几行日志,就一叶知秋,猜到整个系统是数据库连接池出了问题" | opencc -c s2twp.json --segmentation

{"input":"他只看了几行日志,就一叶知秋,猜到整个系统是数据库连接池出了问题","segments":["他","只看","了几行","日志",",就","一叶知秋",",猜到","整个","系统","是","数据库","连接池","出了","问题"]}

--inspect — Output full inspection result (segmentation + per-stage conversion + final output):

echo "他只看了几行日志,就一叶知秋,猜到整个系统是数据库连接池出了问题" | opencc -c s2twp.json --inspect

{"input":"他只看了几行日志,就一叶知秋,猜到整个系统是数据库连接池出了问题","segments":["他","只看","了几行","日志",",就","一叶知秋",",猜到","整个","系统","是","数据库","连接池","出了","问题"],"stages":[{"index":1,"segments":["他","只看","了幾行","日誌",",就","一葉知秋",",猜到","整個","系統","是","數據庫","連接池","出了","問題"]},{"index":2,"segments":["他","只看","了幾行","日誌",",就","一葉知秋",",猜到","整個","系統","是","資料庫","連線池","出了","問題"]},{"index":3,"segments":["他","只看","了幾行","日誌",",就","一葉知秋",",猜到","整個","系統","是","資料庫","連線池","出了","問題"]}],"output":"他只看了幾行日誌,就一葉知秋,猜到整個系統是資料庫連線池出了問題"}

Pretty-print with jq:

echo "他只看了几行日志,就一叶知秋,猜到整个系统是数据库连接池出了问题" | opencc -c s2twp.json --inspect | jq .

These modes are useful for diagnosing conversion issues:

  1. Use --segmentation to verify that the input is segmented as expected.
  2. Use --inspect to see which conversion stage produces an unexpected result.
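The second step can be partly automated. The following is a minimal sketch that reads the --inspect JSON and lists every segment a stage changed. The sample is the output shown above; changed_segments is a hypothetical helper, and it assumes each stage keeps the same number of segments as the initial segmentation, as is the case in that sample.

```python
import json

# Sample copied from the --inspect output shown above (s2twp.json).
INSPECT_OUTPUT = """
{"input":"他只看了几行日志,就一叶知秋,猜到整个系统是数据库连接池出了问题",
 "segments":["他","只看","了几行","日志",",就","一叶知秋",",猜到","整个","系统","是","数据库","连接池","出了","问题"],
 "stages":[
  {"index":1,"segments":["他","只看","了幾行","日誌",",就","一葉知秋",",猜到","整個","系統","是","數據庫","連接池","出了","問題"]},
  {"index":2,"segments":["他","只看","了幾行","日誌",",就","一葉知秋",",猜到","整個","系統","是","資料庫","連線池","出了","問題"]},
  {"index":3,"segments":["他","只看","了幾行","日誌",",就","一葉知秋",",猜到","整個","系統","是","資料庫","連線池","出了","問題"]}],
 "output":"他只看了幾行日誌,就一葉知秋,猜到整個系統是資料庫連線池出了問題"}
"""


def changed_segments(report):
    """Return (stage_index, position, before, after) for each segment a stage altered."""
    changes = []
    previous = report["segments"]
    for stage in report["stages"]:
        current = stage["segments"]
        # Assumption: segment counts match between consecutive stages.
        for position, (before, after) in enumerate(zip(previous, current)):
            if before != after:
                changes.append((stage["index"], position, before, after))
        previous = current
    return changes


report = json.loads(INSPECT_OUTPUT)
for stage_index, position, before, after in changed_segments(report):
    print(f"stage {stage_index}, segment {position}: {before} -> {after}")
```

A stage that produces no lines in this listing left the text untouched, so an unexpected result can be traced to the first stage where the segment in question appears.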

Rules:

Other Ports (Unofficial)

Configurations 配置文件

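Preset configurations 預設配置文件

The following configurations are still under development; contributions of new phrases are welcome:

The following configurations are for exploratory research only and are not recommended for production use: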

Specifying a configuration 指定配置文件

Set the OPENCC_DATA_DIR environment variable to load configuration files from a specific directory:

OPENCC_DATA_DIR=/path/to/your/config/dir opencc --help
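The same environment variable can be set when launching the CLI from a script. This is a minimal sketch, assuming the opencc executable is on PATH; opencc_env and run_opencc_help are hypothetical helper names, and the directory is the placeholder from the command above.

```python
import os
import shutil
import subprocess


def opencc_env(data_dir):
    """Copy the current environment and point OPENCC_DATA_DIR at data_dir."""
    env = dict(os.environ)
    env["OPENCC_DATA_DIR"] = data_dir
    return env


def run_opencc_help(data_dir):
    """Run `opencc --help` with OPENCC_DATA_DIR set; return None if opencc is not installed."""
    executable = shutil.which("opencc")
    if executable is None:
        return None
    return subprocess.run(
        [executable, "--help"],
        env=opencc_env(data_dir),
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    result = run_opencc_help("/path/to/your/config/dir")
    print("opencc not found" if result is None else result.stdout)
```

Passing a copied environment to the child process leaves the parent process's own environment unchanged.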

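Inline dictionary 內聯字典

A dictionary in a configuration file can use type: "inline" to define a small custom vocabulary directly in the JSON, without modifying any external dictionary file. For example, to add override rules at the front of group.dicts: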

{
  "conversion_chain": [
    {
      "dict": {
        "type": "group",
        "dicts": [
          {
            "type": "inline",
            "entries": {
              "麦旋风": "冰炫風",
              "服务器": "伺服器"
            }
          },
          { "type": "ocd2", "file": "STPhrases.ocd2" },
          { "type": "ocd2", "file": "STCharacters.ocd2" }
        ]
      }
    }
  ]
}

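Rules and limitations:

Note: the OpenCC 1.3.2+ parser supports a limited JSONC syntax (// and /* */ comments, and trailing commas). For compatibility across implementations, use strict JSON and do not rely on JSONC extensions.

More complete examples are available in examples/config/. That directory is for learning and customization reference only and is not part of the official built-in configuration list.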
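An inline-dictionary configuration like this one can also be generated from a script. The following is a minimal sketch that builds the fragment shown in this section and serializes it as strict JSON; with_inline_overrides is a hypothetical helper, and the result mirrors only the keys shown in that fragment, so a complete configuration may need additional keys.

```python
import json


def with_inline_overrides(entries, base_dicts):
    """Build a conversion chain whose dictionary group starts with an inline dictionary."""
    inline = {"type": "inline", "entries": dict(entries)}
    return {
        "conversion_chain": [
            # The inline dictionary is placed first in group.dicts.
            {"dict": {"type": "group", "dicts": [inline, *base_dicts]}}
        ]
    }


config = with_inline_overrides(
    {"麦旋风": "冰炫風", "服务器": "伺服器"},
    [
        {"type": "ocd2", "file": "STPhrases.ocd2"},
        {"type": "ocd2", "file": "STCharacters.ocd2"},
    ],
)

# ensure_ascii=False keeps the Chinese entries readable; json.dumps emits
# strict JSON, with no comments or trailing commas.
text = json.dumps(config, ensure_ascii=False, indent=2)
print(text)
```

Generating the file with a JSON serializer avoids relying on the JSONC extensions mentioned in the note.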

Experimental Plugins 試驗性插件

OpenCC 現已支援外部 C++ 分詞插件。當前第一個插件為 opencc-jieba,可通過 s2t_jieba.json、s2tw_jieba.json、s2hk_jieba.json、s2twp_jieba.json、tw2sp_jieba.json 等插件配置啓用。

OpenCC now supports external C++ segmentation plugins. The first plugin is opencc-jieba, which can be enabled through plugin-backed configs such as s2t_jieba.json, s2tw_jieba.json, s2hk_jieba.json, s2twp_jieba.json, and tw2sp_jieba.json.

注意:

Notes:

Build 編譯

Build with CMake

Linux & macOS

g++ 4.6+ or clang 3.2+ is required.

Windows Visual Studio:

Build with Bazel

Test 測試

Linux & macOS

Windows Visual Studio:

Test with Bazel

bazel test --test_output=all //src/... //data/... //python/... //test/...

Benchmark 基準測試

See doc/benchmark.md for details.

Projects using OpenCC 使用 OpenCC 的項目

Please update this list if your project uses OpenCC.

License 許可協議

Apache License 2.0

Third Party Libraries 第三方庫

Change History 版本歷史

Contributors 貢獻者

Please feel free to update this list if you have contributed to OpenCC.