unihan-etl#

ETL tool for Unicode’s Han Unification (UNIHAN) database releases. unihan-etl retrieves (downloads), extracts (unzips), and transforms the database from Unicode’s website into either a flat, tabular format or a structured, tree-like format.

unihan-etl can be used as a Python library through its API, to retrieve data as a Python object, or through the CLI, to retrieve a CSV, JSON, or YAML file.

Part of the cihai project. Similar project: libUnihan.

UNIHAN Version compatibility (as of unihan-etl v0.10.0): 11.0.0 (released 2018-05-08, revision 25).

UNIHAN’s data is dispersed across multiple files in the format of:

U+3400  kCantonese  jau1
U+3400  kDefinition (same as U+4E18 丘) hillock or mound
U+3400  kMandarin   qiū
U+3401  kCantonese  tim2
U+3401  kDefinition to lick; to taste, a mat, bamboo bark
U+3401  kHanyuPinyin    10019.020:tiàn
U+3401  kMandarin   tiàn

Values vary in shape and structure depending on their field type. kHanyuPinyin maps Unicode codepoints to Hànyǔ Dà Zìdiǎn, where 10019.020:tiàn represents an entry. Some values are more involved:

U+5EFE  kHanyuPinyin    10513.110,10514.010,10514.020:gǒng
U+5364  kHanyuPinyin    10093.130:xī,lǔ 74609.020:lǔ,xī

kHanyuPinyin supports multiple entries delimited by spaces. “:” (colon) separates locations in the work from pinyin readings. “,” (comma) separates multiple entries/readings. This is just one of 90 fields contained in the database.
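The delimiter rules above can be sketched in a few lines of Python. This is a minimal illustration, not unihan-etl's own parser: `parse_location` and `parse_hanyu_pinyin` are hypothetical names, and the digit layout of a location (one digit for volume, four for page, two for character, one for virtual) is an assumption inferred from the structured output in this README, where 10019.020 corresponds to volume 1, page 19, character 2, virtual 0.

```python
import re

# Assumed layout of one location: volume (1 digit), page (4 digits),
# ".", character (2 digits), virtual (1 digit).
LOCATION = re.compile(
    r"(?P<volume>\d)(?P<page>\d{4})\.(?P<character>\d{2})(?P<virtual>\d)"
)


def parse_location(location):
    """Turn a string such as '10019.020' into a dict of integers."""
    match = LOCATION.fullmatch(location)
    if match is None:
        raise ValueError(f"unrecognized location: {location!r}")
    return {name: int(digits) for name, digits in match.groupdict().items()}


def parse_hanyu_pinyin(value):
    """Split a kHanyuPinyin value into entries of locations and readings."""
    entries = []
    for entry in value.split():  # entries are delimited by spaces
        locations, readings = entry.split(":")  # colon: locations vs readings
        entries.append(
            {
                "locations": [parse_location(loc) for loc in locations.split(",")],
                "readings": readings.split(","),  # comma: multiple readings
            }
        )
    return entries


parsed = parse_hanyu_pinyin("10093.130:xī,lǔ 74609.020:lǔ,xī")
print(len(parsed), parsed[1]["readings"])
```

Each space-delimited entry becomes one dict, so the U+5364 value yields two entries while the U+5EFE value yields one entry with three locations.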

Tabular, “Flat” output#

CSV (default)#

$ unihan-etl
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn

To preview in the CLI, try tabview or csvlens.
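The flat CSV can also be read with Python's standard `csv` module. The sketch below parses the two sample rows shown above from an in-memory string; the column names in `COLUMNS` are an assumption based on the order of values in that sample, so check them against the real export before relying on them.

```python
import csv
import io

# Two rows copied from the sample output above.
SAMPLE = (
    "㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū\n"
    '㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn\n'
)

# Assumed column order, inferred from the sample rows.
COLUMNS = ["char", "ucn", "kCantonese", "kDefinition", "kHanyuPinyin", "kMandarin"]

rows = list(csv.DictReader(io.StringIO(SAMPLE), fieldnames=COLUMNS))
for row in rows:
    # Quoted fields keep their embedded commas; empty fields come back as "".
    print(row["ucn"], row["kDefinition"])
```

Note that a missing value (kHanyuPinyin for U+3400) arrives as an empty string in CSV rather than as null.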


JSON#

$ unihan-etl -F json --no-expand
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kCantonese": "jau1",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kCantonese": "tim2",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]



YAML#

$ unihan-etl -F yaml --no-expand
- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401

Filter via the CLI with yq.

“Structured” output#

Codepoints can pack a lot more detail. unihan-etl extracts these values in a uniform manner and prunes empty values.

To make this possible, unihan-etl exports to JSON, YAML, and Python lists/dicts.
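As a rough illustration of what this expansion involves, the sketch below turns a flat record into a structured one for two fields. The function names are hypothetical and this is not the package's own expansion code. It assumes kDefinition senses are separated by semicolons and that a single kMandarin reading applies to both zh-Hans and zh-Hant (with two space-separated readings, the first is taken as zh-Hans and the second as zh-Hant).

```python
def expand_definition(value):
    """Split a kDefinition string into its semicolon-separated senses."""
    return [sense.strip() for sense in value.split(";")]


def expand_mandarin(value):
    """Map a kMandarin string to zh-Hans / zh-Hant readings (assumed rule)."""
    readings = value.split()
    return {"zh-Hans": readings[0], "zh-Hant": readings[-1]}


# Hypothetical registry of per-field expanders; other fields pass through.
EXPANDERS = {"kDefinition": expand_definition, "kMandarin": expand_mandarin}


def expand_record(flat):
    """Expand known fields and prune empty values."""
    expanded = {}
    for key, value in flat.items():
        if value in (None, ""):
            continue  # empty values are pruned
        expander = EXPANDERS.get(key)
        expanded[key] = expander(value) if expander else value
    return expanded


record = expand_record(
    {
        "char": "㐁",
        "ucn": "U+3401",
        "kDefinition": "to lick; to taste, a mat, bamboo bark",
        "kHanyuPinyin": None,
        "kMandarin": "tiàn",
    }
)
print(record)
```

Commas inside a definition are left alone; only the semicolon separates senses in this sketch.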

Why not CSV?

Unfortunately, CSV is only suitable for storing table-like information. File formats such as JSON and YAML accept key-values and hierarchical entries.


JSON#

$ unihan-etl -F json
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": ["(same as U+4E18 丘) hillock or mound"],
    "kCantonese": ["jau1"],
    "kMandarin": {
      "zh-Hans": "qiū",
      "zh-Hant": "qiū"
    }
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
    "kCantonese": ["tim2"],
    "kHanyuPinyin": [
      {
        "locations": [
          {
            "volume": 1,
            "page": 19,
            "character": 2,
            "virtual": 0
          }
        ],
        "readings": ["tiàn"]
      }
    ],
    "kMandarin": {
      "zh-Hans": "tiàn",
      "zh-Hant": "tiàn"
    }
  }
]


YAML#

$ unihan-etl -F yaml
- char: 㐀
  kCantonese:
    - jau1
  kDefinition:
    - (same as U+4E18 丘) hillock or mound
  kMandarin:
    zh-Hans: qiū
    zh-Hant: qiū
  ucn: U+3400
- char: 㐁
  kCantonese:
    - tim2
  kDefinition:
    - to lick
    - to taste, a mat, bamboo bark
  kHanyuPinyin:
    - locations:
        - character: 2
          page: 19
          virtual: 0
          volume: 1
      readings:
        - tiàn
  kMandarin:
    zh-Hans: tiàn
    zh-Hant: tiàn
  ucn: U+3401
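Records from the structured JSON export can be consumed with the standard library alone. The sketch below loads an abbreviated record in the shape shown in this section (a top-level list of objects is assumed) and collects kHanyuPinyin readings; `hanyu_readings` is a hypothetical helper. Because empty values are pruned, the code uses `.get()` rather than assuming every key is present.

```python
import json

# Abbreviated record in the structured shape shown in this section.
EXPORT = """
[
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
    "kHanyuPinyin": [
      {
        "locations": [{"volume": 1, "page": 19, "character": 2, "virtual": 0}],
        "readings": ["tiàn"]
      }
    ],
    "kMandarin": {"zh-Hans": "tiàn", "zh-Hant": "tiàn"}
  }
]
"""


def hanyu_readings(record):
    """Collect every kHanyuPinyin reading; a pruned field yields an empty list."""
    return [
        reading
        for entry in record.get("kHanyuPinyin", [])
        for reading in entry["readings"]
    ]


records = json.loads(EXPORT)
by_ucn = {record["ucn"]: record for record in records}  # index by codepoint
print(hanyu_readings(by_ucn["U+3401"]))
```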


Features#

  • automatically downloads UNIHAN from the internet

  • strives for accuracy with the specifications described in UNIHAN’s database design

  • export to JSON, CSV and YAML (requires pyyaml) via -F

  • configurable to export specific fields via -f

  • accounts for encoding conflicts due to the Unicode-heavy content

  • designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets

  • core component and dependency of cihai, a CJK library

  • data package support

  • expansion of multi-value delimited fields in YAML, JSON and python dictionaries

  • supports Python >= 3.7 and PyPy

If you encounter a problem or have a question, please create an issue.


Installation#

To download and build your own UNIHAN export:

$ pip install --user unihan-etl

or by pipx:

$ pipx install unihan-etl

Developmental releases#


pip:

$ pip install --user --upgrade --pre unihan-etl

pipx:

$ pipx install --suffix=@next 'unihan-etl' --pip-args '\--pre' --force
# Usage: unihan-etl@next


Usage#

unihan-etl offers customizable builds via its command line arguments.

See unihan-etl CLI arguments for information on how you can specify columns, files, download URLs, and output destination.

To output CSV, the default format:

$ unihan-etl

To output JSON:

$ unihan-etl -F json

To output YAML:

$ pip install --user pyyaml
$ unihan-etl -F yaml

To output only the kDefinition field in a CSV:

$ unihan-etl -f kDefinition

To output multiple fields, separate with spaces:

$ unihan-etl -f kCantonese kDefinition

To output to a custom file:

$ unihan-etl --destination ./exported.csv

To output to a custom file (templated file extension):

$ unihan-etl --destination ./exported.{ext}
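One way to picture the templated destination: the `{ext}` placeholder is filled with the name of the output format. The sketch below uses `str.format` to show the idea. `resolve_destination` is a hypothetical helper, and how unihan-etl resolves the template internally is an assumption here.

```python
def resolve_destination(template, fmt):
    """Fill an ``{ext}`` placeholder with the chosen output format."""
    return template.format(ext=fmt)


for fmt in ("csv", "json", "yaml"):
    # A destination without a placeholder is returned unchanged.
    print(resolve_destination("./exported.{ext}", fmt))
```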

See unihan-etl CLI arguments for advanced usage examples.

Code layout#

# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/

# output dir
{XDG data dir}/unihan_etl/
  unihan.yaml   # (requires pyyaml)

# package dir
unihan_etl/
  core.py       # argparse, download, extract, transform UNIHAN's data
  options.py    # configuration object
  constants.py  # immutable data vars (field to filename mappings, etc)
  expansion.py  # extracting details baked inside of fields
  types.py      # type annotations
  util.py       # utility / helper functions

# test suite
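To find where the cache and output directories land on a given machine, the XDG base directory rules can be approximated as below. This is a sketch for Linux-style systems only: `xdg_dir` is a hypothetical helper, and unihan-etl itself may resolve these paths differently (for example through a helper library, or with other defaults on macOS and Windows).

```python
import os
from pathlib import Path


def xdg_dir(env_var, fallback, app="unihan_etl"):
    """Return the app directory under an XDG base dir, or its default."""
    base = os.environ.get(env_var) or str(Path.home() / fallback)
    return Path(base) / app


# XDG defaults: ~/.cache for cache files, ~/.local/share for data files.
cache_dir = xdg_dir("XDG_CACHE_HOME", ".cache")
data_dir = xdg_dir("XDG_DATA_HOME", ".local/share")
print(cache_dir)
print(data_dir)
```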


API#

The package is Python under the hood, so you can use its full API. Example:

>>> from unihan_etl.core import Packager
>>> pkgr = Packager()
>>> hasattr(pkgr.options, 'destination')


Developing#

$ git clone https://github.com/cihai/unihan-etl.git
$ cd unihan-etl

Bootstrap your environment and learn more about contributing. We use the same conventions / tools across all cihai projects: pytest, sphinx, mypy, ruff, tmuxp, and file watcher helpers (e.g. entr(1)).
