unihan-etl¶
An ETL tool for the Unicode Han Unification (UNIHAN) database releases. unihan-etl is designed to fetch (download), unpack (unzip), and convert the database from the Unicode website into either a flattened, tabular format or a structured, hierarchical format.
unihan-etl serves dual purposes: as a Python library offering an API for accessing data as Python objects, and as a command-line interface (CLI) for exporting data into CSV, JSON, or YAML formats.
This tool is a component of the cihai suite of CJK-related projects. For a similar tool, see libUnihan.
As of v0.31.0, unihan-etl is compatible with UNIHAN Version 15.1.0 (released on 2023-09-01, revision 35).
The UNIHAN database¶
The UNIHAN database organizes data across multiple files, exemplified below:
U+3400 kCantonese jau1
U+3400 kDefinition (same as U+4E18 丘) hillock or mound
U+3400 kMandarin qiū
U+3401 kCantonese tim2
U+3401 kDefinition to lick; to taste, a mat, bamboo bark
U+3401 kHanyuPinyin 10019.020:tiàn
U+3401 kMandarin tiàn
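As a rough illustration of the raw layout, the sketch below groups these lines into one dictionary per codepoint. It is not unihan-etl's own loader: `group_by_codepoint` and `ucn_to_char` are hypothetical helper names, and the sample is retyped with explicit tabs because the UNIHAN source files are tab-delimited.

```python
from collections import defaultdict
from typing import Dict

# Sample rows copied from the excerpt above, written with explicit tabs.
RAW = """\
U+3400\tkCantonese\tjau1
U+3400\tkDefinition\t(same as U+4E18 丘) hillock or mound
U+3400\tkMandarin\tqiū
U+3401\tkCantonese\ttim2
U+3401\tkDefinition\tto lick; to taste, a mat, bamboo bark
U+3401\tkHanyuPinyin\t10019.020:tiàn
U+3401\tkMandarin\ttiàn
"""


def ucn_to_char(ucn: str) -> str:
    """Convert a 'U+XXXX' label to the character it names (hypothetical helper)."""
    return chr(int(ucn[2:], 16))


def group_by_codepoint(raw: str) -> Dict[str, Dict[str, str]]:
    """Collect (codepoint, field, value) rows into one dict per codepoint."""
    records: Dict[str, Dict[str, str]] = defaultdict(dict)
    for line in raw.splitlines():
        if not line or line.startswith("#"):  # skip blank and comment lines
            continue
        # Split on the first two tabs only, so values keep any inner whitespace.
        ucn, field, value = line.split("\t", 2)
        records[ucn][field] = value
    return dict(records)


records = group_by_codepoint(RAW)
print(ucn_to_char("U+3400"), records["U+3400"]["kMandarin"])
```

Every value is still an opaque string at this point; interpreting it depends on the field.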
Values vary in shape and structure depending on their field type.
kHanyuPinyin maps Unicode codepoints to Hànyǔ Dà Zìdiǎn, where 10019.020:tiàn represents an entry. To complicate matters, other values take more varied forms:
U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng
U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī
kHanyuPinyin supports multiple entries delimited by spaces. Within an entry, a colon (“:”) separates locations in the work from pinyin readings, and commas (“,”) separate multiple locations or readings. This is just one of 90 fields contained in the database.
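The delimiter rules above can be sketched in a few lines of Python. This is a hand-written illustration, not unihan-etl's expansion code: `parse_location` and `parse_hanyu_pinyin` are hypothetical names, and the digit layout of a location (volume, page, character, virtual) is assumed from the structured output shown later in this document, where 10019.020 becomes volume 1, page 19, character 2, virtual 0.

```python
from typing import Dict, List, Union

Location = Dict[str, int]
Entry = Dict[str, Union[List[Location], List[str]]]


def parse_location(loc: str) -> Location:
    """Split 'ABCDE.XYZ' into volume (A), page (BCDE), character (XY), virtual (Z)."""
    book, position = loc.split(".")
    return {
        "volume": int(book[0]),
        "page": int(book[1:]),
        "character": int(position[:2]),
        "virtual": int(position[2]),
    }


def parse_hanyu_pinyin(value: str) -> List[Entry]:
    """Expand one raw kHanyuPinyin value into a list of entries."""
    entries: List[Entry] = []
    for entry in value.split(" "):  # spaces delimit entries
        locations, readings = entry.split(":")  # colon: locations vs. readings
        entries.append(
            {
                "locations": [parse_location(loc) for loc in locations.split(",")],
                "readings": readings.split(","),  # commas delimit readings
            }
        )
    return entries


for raw in ("10019.020:tiàn", "10093.130:xī,lǔ 74609.020:lǔ,xī"):
    print(parse_hanyu_pinyin(raw))
```

Each field has its own rules, so a parser like this only covers kHanyuPinyin.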
Tabular, “Flat” output¶
CSV (default)¶
$ unihan-etl
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
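A minimal sketch of reading this flat CSV back with Python's standard `csv` module. The sample is inlined from the output above; in practice you would open the exported file instead. Note that a missing value arrives as an empty string.

```python
import csv
import io

# Inlined copy of the CSV output shown above.
CSV_TEXT = """\
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
by_char = {row["char"]: row for row in rows}  # index rows by character

print(by_char["㐁"]["kDefinition"])
print(repr(by_char["㐀"]["kHanyuPinyin"]))  # empty cell, not None
```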
With $ unihan-etl -F yaml --no-expand:
- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401
JSON¶
$ unihan-etl -F json --no-expand
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kCantonese": "jau1",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kCantonese": "tim2",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]
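For scripting, the flat JSON loads directly with the standard `json` module. The sketch below inlines the two records above and picks out the characters that have a kHanyuPinyin value; with `--no-expand`, a missing value is `null`, which Python reads as `None`.

```python
import json

# Inlined copy of the flat JSON output shown above.
FLAT_JSON = """
[
  {"char": "㐀", "ucn": "U+3400",
   "kDefinition": "(same as U+4E18 丘) hillock or mound",
   "kCantonese": "jau1", "kHanyuPinyin": null, "kMandarin": "qiū"},
  {"char": "㐁", "ucn": "U+3401",
   "kDefinition": "to lick; to taste, a mat, bamboo bark",
   "kCantonese": "tim2", "kHanyuPinyin": "10019.020:tiàn", "kMandarin": "tiàn"}
]
"""

records = json.loads(FLAT_JSON)

# Keep only the characters whose kHanyuPinyin is present.
with_hanyu = [r["char"] for r in records if r["kHanyuPinyin"] is not None]
print(with_hanyu)
```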
YAML¶
$ unihan-etl -F yaml --no-expand
- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401
Filter via the CLI with yq.
“Structured” output¶
Codepoints can pack a lot more detail. unihan-etl extracts these values in a uniform manner and prunes empty values.
To make this possible, unihan-etl exports to JSON, YAML, and Python lists/dicts.
Why not CSV?
Unfortunately, CSV is only suitable for storing table-like information. File formats such as JSON and YAML accept key-values and hierarchical entries.
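A small sketch of the difference, using only the standard library. The record is a trimmed, hand-written example: JSON round-trips the nested list unchanged, while the `csv` writer has to stringify it, so it comes back as plain text.

```python
import csv
import io
import json

# Trimmed example record with one multi-value field.
record = {
    "char": "㐁",
    "kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
}

# JSON keeps the list as a list.
restored = json.loads(json.dumps(record, ensure_ascii=False))

# CSV cells are flat text: the writer calls str() on the list.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))

print(type(restored["kDefinition"]).__name__)  # list
print(type(csv_row["kDefinition"]).__name__)   # str
```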
JSON¶
$ unihan-etl -F json
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": ["(same as U+4E18 丘) hillock or mound"],
    "kCantonese": ["jau1"],
    "kMandarin": {
      "zh-Hans": "qiū",
      "zh-Hant": "qiū"
    }
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
    "kCantonese": ["tim2"],
    "kHanyuPinyin": [
      {
        "locations": [
          {
            "volume": 1,
            "page": 19,
            "character": 2,
            "virtual": 0
          }
        ],
        "readings": ["tiàn"]
      }
    ],
    "kMandarin": {
      "zh-Hans": "tiàn",
      "zh-Hant": "tiàn"
    }
  }
]
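Because empty values are pruned, code that consumes the structured output should not assume every key exists. A minimal sketch, with the records above trimmed to the fields it uses; `hanyu_pages` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, List, Tuple

# Structured records from above, trimmed to the fields used here.
STRUCTURED_JSON = """
[
  {"char": "㐀", "ucn": "U+3400",
   "kMandarin": {"zh-Hans": "qiū", "zh-Hant": "qiū"}},
  {"char": "㐁", "ucn": "U+3401",
   "kHanyuPinyin": [
     {"locations": [{"volume": 1, "page": 19, "character": 2, "virtual": 0}],
      "readings": ["tiàn"]}
   ],
   "kMandarin": {"zh-Hans": "tiàn", "zh-Hant": "tiàn"}}
]
"""


def hanyu_pages(record: Dict[str, Any]) -> List[Tuple[int, int]]:
    """Return (volume, page) pairs, or an empty list when the field was pruned."""
    return [
        (loc["volume"], loc["page"])
        for entry in record.get("kHanyuPinyin", [])  # key may be absent
        for loc in entry["locations"]
    ]


records = json.loads(STRUCTURED_JSON)
pages = {r["char"]: hanyu_pages(r) for r in records}
print(pages)
```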
YAML¶
$ unihan-etl -F yaml
- char: 㐀
  kCantonese:
  - jau1
  kDefinition:
  - (same as U+4E18 丘) hillock or mound
  kMandarin:
    zh-Hans: qiū
    zh-Hant: qiū
  ucn: U+3400
- char: 㐁
  kCantonese:
  - tim2
  kDefinition:
  - to lick
  - to taste, a mat, bamboo bark
  kHanyuPinyin:
  - locations:
    - character: 2
      page: 19
      virtual: 0
      volume: 1
    readings:
    - tiàn
  kMandarin:
    zh-Hans: tiàn
    zh-Hant: tiàn
  ucn: U+3401
Features¶
automatically downloads UNIHAN from the internet
strives for accuracy with the specifications described in UNIHAN’s database design
export to JSON, CSV and YAML (requires pyyaml) via -F
configurable to export specific fields via -f
accounts for encoding conflicts due to the Unicode-heavy content
designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
core component and dependency of cihai, a CJK library
data package support
expansion of multi-value delimited fields in YAML, JSON and Python dictionaries
supports Python >= 3.7 and pypy
If you encounter a problem or have a question, please create an issue.
Installation¶
To download and build your own UNIHAN export:
$ pip install --user unihan-etl
or by pipx:
$ pipx install unihan-etl
Developmental releases¶
pip:
$ pip install --user --upgrade --pre unihan-etl
pipx:
$ pipx install --suffix=@next 'unihan-etl' --pip-args '\--pre' --force
// Usage: unihan-etl@next
Usage¶
unihan-etl offers customizable builds via its command-line arguments.
See unihan-etl CLI arguments for information on how you can specify columns, files, download URLs, and output destination.
To output CSV, the default format:
$ unihan-etl
To output JSON:
$ unihan-etl -F json
To output YAML:
$ pip install --user pyyaml
$ unihan-etl -F yaml
To output only the kDefinition field in a CSV:
$ unihan-etl -f kDefinition
To output multiple fields, separate with spaces:
$ unihan-etl -f kCantonese kDefinition
To output to a custom file:
$ unihan-etl --destination ./exported.csv
To output to a custom file (templated file extension):
$ unihan-etl --destination ./exported.{ext}
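If you drive the CLI from a script, the flags above can be assembled programmatically. This is a sketch only: `build_command` is a hypothetical helper that uses just the options shown in this section (`-F`, `-f`, `--destination`), and the actual `subprocess` call is left commented out because running the tool downloads UNIHAN.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(
    fmt: Optional[str] = None,
    fields: Sequence[str] = (),
    destination: Optional[str] = None,
) -> List[str]:
    """Assemble a unihan-etl argument list from the flags documented above."""
    argv = ["unihan-etl"]
    if fmt:
        argv += ["-F", fmt]  # output format: csv, json or yaml
    if fields:
        argv += ["-f", *fields]  # space-separated field names
    if destination:
        argv += ["--destination", destination]
    return argv


command = build_command("json", ["kCantonese", "kDefinition"], "./exported.{ext}")
print(shlex.join(command))

# To run it for real (requires unihan-etl to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the `{ext}` template.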
See unihan-etl CLI arguments for advanced usage examples.
Code layout¶
# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/

# output dir
{XDG data dir}/unihan_etl/
  unihan.json
  unihan.csv
  unihan.yaml   # (requires pyyaml)

# package dir
unihan_etl/
  core.py       # argparse, download, extract, transform UNIHAN's data
  options.py    # configuration object
  constants.py  # immutable data vars (field to filename mappings, etc)
  expansion.py  # extracting details baked inside of fields
  types.py      # type annotations
  util.py       # utility / helper functions

# test suite
tests/*
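To find these directories on a POSIX system, the sketch below applies the defaults from the XDG Base Directory specification (`~/.cache` and `~/.local/share`). It is an assumption that unihan-etl resolves its paths the same way; the package may rely on a platform-directory library and will differ on macOS and Windows. `xdg_dir` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import Mapping

# XDG variable name and spec-defined fallback (relative to $HOME) per kind.
_XDG = {
    "cache": ("XDG_CACHE_HOME", ".cache"),
    "data": ("XDG_DATA_HOME", ".local/share"),
}


def xdg_dir(kind: str, env: Mapping[str, str] = os.environ) -> Path:
    """Resolve an XDG base directory ('cache' or 'data') on POSIX systems."""
    variable, fallback = _XDG[kind]
    base = env.get(variable)
    # The spec says unset, empty, or relative values fall back to the default.
    if base and os.path.isabs(base):
        return Path(base)
    home = Path(env["HOME"]) if "HOME" in env else Path.home()
    return home / fallback


# Assumed locations, following the layout listed above.
cache_dir = xdg_dir("cache") / "unihan_etl"
output_dir = xdg_dir("data") / "unihan_etl"
print(cache_dir)
print(output_dir)
```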
API¶
The package is Python under the hood, so you can use its full API. Example:
>>> from unihan_etl.core import Packager
>>> pkgr = Packager()
>>> hasattr(pkgr.options, 'destination')
True
Developing¶
$ git clone https://github.com/cihai/unihan-etl.git
$ cd unihan-etl
Bootstrap your environment and learn more about contributing. We use the same conventions / tools across all cihai projects: pytest, sphinx, mypy, ruff, tmuxp, and file watcher helpers (e.g. entr(1)).