unihan-etl
unihan-etl is an ETL tool for Unicode’s Han Unification (UNIHAN) database releases. It retrieves (downloads), extracts (unzips), and transforms the database from Unicode’s website into either a flat, tabular format or a structured, tree-like format.
unihan-etl can be used as a Python library through its API, to retrieve data as a Python object, or through the CLI, to produce a CSV, JSON, or YAML file.
Part of the cihai project. Similar project: libUnihan.
UNIHAN Version compatibility (as of unihan-etl v0.10.0): 11.0.0 (released 2018-05-08, revision 25).
UNIHAN’s data is dispersed across multiple files in the format of:
U+3400 kCantonese jau1
U+3400 kDefinition (same as U+4E18 丘) hillock or mound
U+3400 kMandarin qiū
U+3401 kCantonese tim2
U+3401 kDefinition to lick; to taste, a mat, bamboo bark
U+3401 kHanyuPinyin 10019.020:tiàn
U+3401 kMandarin tiàn
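As a rough illustration of the transform step, the sketch below groups raw lines like the ones above into one record per codepoint. It is not unihan-etl's implementation: the helper names (`ucn_to_char`, `group_by_codepoint`) are hypothetical, and it assumes the raw files are tab-delimited with `#` comment lines.

```python
from collections import defaultdict

# Sample lines from above; the tab delimiters are an assumption about the raw files.
SAMPLE = """\
U+3400\tkCantonese\tjau1
U+3400\tkDefinition\t(same as U+4E18 丘) hillock or mound
U+3400\tkMandarin\tqiū
U+3401\tkCantonese\ttim2
U+3401\tkDefinition\tto lick; to taste, a mat, bamboo bark
U+3401\tkHanyuPinyin\t10019.020:tiàn
U+3401\tkMandarin\ttiàn
"""


def ucn_to_char(ucn):
    """Convert 'U+XXXX' notation to the character it names (hypothetical helper)."""
    return chr(int(ucn[2:], 16))


def group_by_codepoint(text):
    """Collect 'codepoint, field, value' lines into one dict per codepoint."""
    records = defaultdict(dict)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # maxsplit=2 keeps any further tabs inside the value intact
        ucn, field, value = line.split("\t", 2)
        record = records[ucn]
        record.setdefault("char", ucn_to_char(ucn))
        record.setdefault("ucn", ucn)
        record[field] = value
    return dict(records)


records = group_by_codepoint(SAMPLE)
print(records["U+3401"]["kDefinition"])
```

Fields that a codepoint lacks (kHanyuPinyin for U+3400 here) are simply absent from its record.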
Values vary in shape and structure depending on their field type. kHanyuPinyin maps Unicode codepoints to Hànyǔ Dà Zìdiǎn, where 10019.020:tiàn represents an entry. Complicating matters further, entries come in more variations:
U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng
U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī
kHanyuPinyin supports multiple entries delimited by spaces. A “:” (colon) separates locations in the work from pinyin readings. A “,” (comma) separates multiple locations or readings within an entry. This is just one of 90 fields contained in the database.
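The delimiter rules above can be sketched in a few lines of Python. This is an illustrative parser, not the code unihan-etl ships: `parse_location` and `parse_hanyu_pinyin` are hypothetical names, and the split of a location into volume, page, character, and virtual digits is an assumption inferred from the structured output this page shows for 10019.020.

```python
import re

# Assumed location layout: 1 digit volume, 4 digits page, ".",
# 2 digits character position, 1 digit virtual flag.
LOCATION = re.compile(
    r"(?P<volume>\d)(?P<page>\d{4})\.(?P<character>\d{2})(?P<virtual>\d)"
)


def parse_location(raw):
    """Turn '10019.020' into a dict of integers."""
    match = LOCATION.fullmatch(raw)
    if match is None:
        raise ValueError("unexpected location format: %r" % raw)
    return {key: int(value) for key, value in match.groupdict().items()}


def parse_hanyu_pinyin(raw):
    """Split a raw kHanyuPinyin value into entries of locations and readings."""
    entries = []
    for entry in raw.split():  # entries are delimited by spaces
        locations, readings = entry.split(":")  # colon: locations vs. readings
        entries.append(
            {
                "locations": [parse_location(loc) for loc in locations.split(",")],
                "readings": readings.split(","),
            }
        )
    return entries


print(parse_hanyu_pinyin("10093.130:xī,lǔ 74609.020:lǔ,xī"))
```

Each space-delimited entry becomes one dictionary, so the U+5364 value above yields two entries with two readings each.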
Tabular, “Flat” output
CSV (default), $ unihan-etl:
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
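Flat CSV like this can be read back with Python's standard `csv` module. The sketch below parses the sample rows from an in-memory string; with a real export you would presumably open the file with `newline=""` and `encoding="utf-8"` instead.

```python
import csv
import io

# The sample rows shown above, kept in memory for the sake of the example.
SAMPLE = '''\
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
'''

# DictReader uses the header row as keys and handles the quoted,
# comma-containing kDefinition value.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

for row in rows:
    print(row["char"], row["kMandarin"], repr(row["kHanyuPinyin"]))
```

Note that a missing value comes back as an empty string, so callers need to treat `""` as "no data" for that field.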
With $ unihan-etl -F yaml --no-expand:
- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401
With $ unihan-etl -F json --no-expand:
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kCantonese": "jau1",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kCantonese": "tim2",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]
“Structured” output
Codepoints can pack a lot more detail. unihan-etl carefully extracts these values in a uniform manner. Empty values are pruned.
To make this possible, unihan-etl exports to JSON, YAML, and Python lists/dicts.
Why not CSV?
Unfortunately, CSV is only suitable for storing table-like information. File formats such as JSON and YAML accept key-values and hierarchical entries.
JSON, $ unihan-etl -F json:
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": [
      "(same as U+4E18 丘) hillock or mound"
    ],
    "kCantonese": [
      "jau1"
    ],
    "kMandarin": {
      "zh-Hans": "qiū",
      "zh-Hant": "qiū"
    }
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": [
      "to lick",
      "to taste, a mat, bamboo bark"
    ],
    "kCantonese": [
      "tim2"
    ],
    "kHanyuPinyin": [
      {
        "locations": [
          {
            "volume": 1,
            "page": 19,
            "character": 2,
            "virtual": 0
          }
        ],
        "readings": [
          "tiàn"
        ]
      }
    ],
    "kMandarin": {
      "zh-Hans": "tiàn",
      "zh-Hant": "tiàn"
    }
  }
]
YAML, $ unihan-etl -F yaml:
- char: 㐀
  kCantonese:
  - jau1
  kDefinition:
  - (same as U+4E18 丘) hillock or mound
  kMandarin:
    zh-Hans: qiū
    zh-Hant: qiū
  ucn: U+3400
- char: 㐁
  kCantonese:
  - tim2
  kDefinition:
  - to lick
  - to taste, a mat, bamboo bark
  kHanyuPinyin:
  - locations:
    - character: 2
      page: 19
      virtual: 0
      volume: 1
    readings:
    - tiàn
  kMandarin:
    zh-Hans: tiàn
    zh-Hant: tiàn
  ucn: U+3401
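To make the difference between the flat and structured outputs concrete, here is a minimal sketch of how two fields could be expanded and empty values pruned. It is not the logic in unihan-etl's expansion.py: the function names are hypothetical, only kDefinition and kMandarin are handled (other fields pass through unchanged), and the kMandarin rule (the first reading is zh-Hans, the last is zh-Hant, a single reading serves both) is an assumption consistent with the output above.

```python
def expand_definition(raw):
    """Split a kDefinition value on ';' into a list of glosses."""
    return [part.strip() for part in raw.split(";")]


def expand_mandarin(raw):
    """Map a kMandarin value to zh-Hans / zh-Hant readings (assumed rule)."""
    readings = raw.split()
    return {"zh-Hans": readings[0], "zh-Hant": readings[-1]}


# Only two fields are expanded in this sketch.
EXPANDERS = {"kDefinition": expand_definition, "kMandarin": expand_mandarin}


def expand_record(flat):
    """Expand known fields and drop empty values from a flat record."""
    expanded = {}
    for field, value in flat.items():
        if value is None or value == "":
            continue  # empty values are pruned
        expander = EXPANDERS.get(field)
        expanded[field] = expander(value) if expander else value
    return expanded


flat = {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kMandarin": "tiàn",
    "kHanyuPinyin": None,
}
print(expand_record(flat))
```

A dispatch table keyed by field name keeps each field's rule separate, which matters when many fields each have their own syntax.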
Features
- automatically downloads UNIHAN from the internet
- strives for accuracy with the specifications described in UNIHAN’s database design
- export to JSON, CSV and YAML (requires pyyaml) via -F
- configurable to export specific fields via -f
- accounts for encoding conflicts due to the Unicode-heavy content
- designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
- core component and dependency of cihai, a CJK library
- data package support
- expansion of multi-value delimited fields in YAML, JSON and Python dictionaries
- supports Python 2.7, >= 3.5 and PyPy
If you encounter a problem or have a question, please create an issue.
Usage
unihan-etl offers customizable builds via its command line arguments.
See unihan-etl CLI arguments for information on how you can specify columns, files, download URLs, and output destination.
To download and build your own UNIHAN export:
$ pip install --user unihan-etl
To output CSV, the default format:
$ unihan-etl
To output JSON:
$ unihan-etl -F json
To output YAML:
$ pip install --user pyyaml
$ unihan-etl -F yaml
To output only the kDefinition field in a CSV:
$ unihan-etl -f kDefinition
To output multiple fields, separate with spaces:
$ unihan-etl -f kCantonese kDefinition
To output to a custom file:
$ unihan-etl --destination ./exported.csv
To output to a custom file (templated file extension):
$ unihan-etl --destination ./exported.{ext}
See unihan-etl CLI arguments for advanced usage examples.
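If you drive the CLI from a script, the options shown above can be assembled programmatically. The sketch below only uses flags that appear on this page (-F, -f, --destination, --no-expand); `build_command` and `run_export` are hypothetical helper names, and `run_export` assumes unihan-etl is installed and on your PATH.

```python
import subprocess


def build_command(fmt=None, fields=None, destination=None, expand=True):
    """Assemble an argv list for the unihan-etl CLI."""
    argv = ["unihan-etl"]
    if fmt is not None:
        argv += ["-F", fmt]  # output format: csv, json or yaml
    if fields:
        argv += ["-f", *fields]  # multiple fields are separate arguments
    if destination is not None:
        argv += ["--destination", destination]
    if not expand:
        argv.append("--no-expand")
    return argv


def run_export(**options):
    """Run the export; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(**options), check=True)


print(build_command(fmt="json", fields=["kCantonese", "kDefinition"]))
```

Passing the arguments as a list avoids shell quoting issues, for example with the `{ext}` placeholder in a templated destination.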
Code layout
# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/
# output dir
{XDG data dir}/unihan_etl/
  unihan.json
  unihan.csv
  unihan.yaml   # (requires pyyaml)
# package dir
unihan_etl/
  process.py    # argparse, download, extract, transform UNIHAN's data
  constants.py  # immutable data vars (field to filename mappings, etc)
  expansion.py  # extracting details baked inside of fields
  _compat.py    # python 2/3 compatibility module
  util.py       # utility / helper functions
# test suite
tests/*
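To locate the cache and output directories from a script, a sketch along the following lines may help. It follows the XDG Base Directory defaults (`~/.cache` and `~/.local/share`) and is an assumption about Linux-style systems, not necessarily how unihan-etl resolves paths on every platform. The function names are hypothetical.

```python
import os
from pathlib import Path


def xdg_dir(env_var, default):
    """Return the directory named by env_var, or the XDG default under $HOME."""
    value = os.environ.get(env_var)
    return Path(value) if value else Path.home() / default


def cache_dir():
    """Assumed location of the downloaded and extracted Unihan.zip."""
    return xdg_dir("XDG_CACHE_HOME", ".cache") / "unihan_etl"


def data_dir():
    """Assumed location of unihan.csv, unihan.json and unihan.yaml."""
    return xdg_dir("XDG_DATA_HOME", ".local/share") / "unihan_etl"


print(cache_dir())
print(data_dir() / "unihan.json")
```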
Developing
poetry is required for development.
git clone https://github.com/cihai/unihan-etl.git
cd unihan-etl
poetry install -E "docs test coverage lint format"
Makefile commands prefixed with watch_ will watch files and rerun.
Tests
poetry run py.test
Helpers: make test
Rerun tests on file change: make watch_test (requires entr(1))
Documentation
Default preview server: http://localhost:8039
cd docs/ and make html to build. make serve to start the HTTP server.
Helpers: make build_docs, make serve_docs
Rebuild docs on file change: make watch_docs (requires entr(1))
Rebuild docs and run server via one terminal: make dev_docs (requires above, and a make(1) with -j support, e.g. GNU Make)
Formatting / Linting
The project uses black and isort (one after the other) and runs flake8 via CI. See the configuration in pyproject.toml and setup.cfg:
make black isort: Run black first, then isort to handle import nuances
make flake8, to watch (requires entr(1)): make watch_flake8
Releasing
As of 0.11, poetry handles virtualenv creation, package requirements, versioning, building, and publishing. Therefore there are no setup.py or requirements files.
Update __version__ in __about__.py and pyproject.toml:
git commit -m 'build(unihan-etl): Tag v0.1.1'
git tag v0.1.1
git push
git push --tags
poetry build
poetry publish
- About unihan-etl
- About UNIHAN
- Command Line Interface
- API
- Frequently Asked Questions
- History
  - unihan-etl 0.11.0 (2020-08-09)
  - unihan-etl 0.10.4 (2020-08-05)
  - unihan-etl 0.10.3 (2019-08-18)
  - unihan-etl 0.10.2 (2019-08-17)
  - unihan-etl 0.10.1 (2017-09-08)
  - unihan-etl 0.10.0 (2017-08-29)
  - unihan-etl 0.9.5 (2017-06-26)
  - unihan-etl 0.9.4 (2017-06-05)
  - unihan-etl 0.9.3 (2017-05-31)
  - unihan-etl 0.9.2 (2017-05-31)
  - unihan-etl 0.9.1 (2017-05-27)
  - unihan-etl 0.9.0 (2017-05-26)
  - unihan-etl 0.8.1 (2017-05-20)
  - unihan-etl 0.8.0 (2017-05-17)
  - unihan-etl 0.7.4 (2017-05-14)
  - unihan-etl 0.7.3 (2017-05-13)
  - unihan-etl 0.7.2 (2017-05-13)
  - unihan-etl 0.7.1 (2017-05-12)
  - unihan-etl 0.7.0 (2017-05-12)
  - unihan-etl 0.6.3 (2017-05-11)
  - unihan-etl 0.6.2 (2017-05-11)
  - unihan-etl 0.6.1 (2017-05-10)
  - unihan-etl 0.6.0 (2017-05-10)
  - unihan-etl 0.5.1 (2017-05-08)
  - unihan-etl 0.5.0 (2017-05-08)
  - unihan-etl 0.4.2 (2017-05-07)
  - unihan-etl 0.4.1 (2017-05-07)
  - unihan-etl 0.4.0 (2017-05-07)
  - unihan-etl 0.3.0 (2017-04-17)