# Constants - unihan_etl.constants

Source: https://unihan-etl.git-pull.com/api/constants/

---
myst:
  html_meta:
    "description lang=en": "Extract UNIHAN to CSV, JSON, etc."
    "keywords": "unihan_etl, unihan-etl, unihan, unihan extractor, cjk, cjk dictionary"
    "property=og:locale": "en_US"
---

# Constants - `unihan_etl.constants`

```{eval-rst}
.. automodule:: unihan_etl.constants
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Core - unihan_etl.core

Source: https://unihan-etl.git-pull.com/api/core/

---
myst:
  html_meta:
    "description lang=en": "Extract UNIHAN to CSV, JSON, etc."
    "keywords": "unihan_etl, unihan-etl, unihan, unihan extractor, cjk, cjk dictionary"
    "property=og:locale": "en_US"
---

# Core - `unihan_etl.core`

```{eval-rst}
.. automodule:: unihan_etl.core
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Expansion - unihan_etl.expansion

Source: https://unihan-etl.git-pull.com/api/expansion/

---
myst:
  html_meta:
    "description lang=en": "Extract UNIHAN to CSV, JSON, etc."
    "keywords": "unihan_etl, unihan-etl, unihan, unihan extractor, cjk, cjk dictionary"
    "property=og:locale": "en_US"
---

# Expansion - `unihan_etl.expansion`

```{eval-rst}
.. automodule:: unihan_etl.expansion
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# API Reference

Source: https://unihan-etl.git-pull.com/api/

(api)=

(reference)=

# API Reference

```{module} unihan_etl

```

:::{warning}
APIs are **not** considered stable before 1.0. They can break or be removed between minor versions.

If you need an API stabilized please [file an issue](https://github.com/cihai/unihan-etl/issues).
:::

## Core Modules

::::{grid} 1 2 3 3
:gutter: 2 2 3 3

:::{grid-item-card} Core
:link: core
:link-type: doc
ETL pipeline and Packager: download, normalize, export.
:::

:::{grid-item-card} Options
:link: options
:link-type: doc
Configuration dataclass for paths, formats, and field selection.
:::

:::{grid-item-card} Expansion
:link: expansion
:link-type: doc
Expand multi-value UNIHAN fields into structured data.
:::

:::{grid-item-card} Types
:link: types
:link-type: doc
Shared TypedDicts and type aliases.
:::

:::{grid-item-card} Constants
:link: constants
:link-type: doc
Field lists, manifests, and default paths.
:::

:::{grid-item-card} Utils
:link: utils
:link-type: doc
Helpers for progress, codepoints, and field resolution.
:::

::::

## Test Utilities

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} Test
:link: test
:link-type: doc
Legacy test harness.
:::

:::{grid-item-card} pytest plugin
:link: pytest-plugin
:link-type: doc
Fixtures for quick and full UNIHAN datasets in pytest.
:::

::::

```{toctree}
:hidden:

core
options
expansion
types
constants
utils
test
pytest-plugin
```

---

# Options - unihan_etl.options

Source: https://unihan-etl.git-pull.com/api/options/

---
myst:
  html_meta:
    "description lang=en": "Extract UNIHAN to CSV, JSON, etc."
    "keywords": "unihan_etl, unihan-etl, unihan, unihan extractor, cjk, cjk dictionary"
    "property=og:locale": "en_US"
---

# Options - `unihan_etl.options`

```{eval-rst}
.. automodule:: unihan_etl.options
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# pytest plugin

Source: https://unihan-etl.git-pull.com/api/pytest-plugin/

(pytest_plugin)=

# `pytest` plugin

unihan-etl ships a pytest plugin that downloads `UNIHAN.zip` once and reuses it across tests, plus an isolated home directory for cache and config setup.

The plugin is auto-discovered via the `pytest11` entry point, so installing `unihan-etl` is enough to make every fixture below available in your tests. See the [test suite](https://github.com/cihai/unihan-etl/tree/master/tests) for usage examples.

## Quick Start

Add a fixture name as a test parameter and pytest creates and injects it automatically. You never call fixtures yourself.
```python
def test_quick_packager(unihan_quick_packager) -> None:
    unihan_quick_packager.download()
    unihan_quick_packager.export()
    assert unihan_quick_packager.options.destination.exists()


def test_with_raw_snippet(unihan_quick_data: str) -> None:
    assert "kCantonese" in unihan_quick_data
```

## Which Fixture Do I Need?

- Use {fixture}`unihan_quick_packager` when you want a small, fast UNIHAN dataset for unit tests.
- Use {fixture}`unihan_full_packager` when you need the complete UNIHAN corpus.
- Use {fixture}`unihan_bootstrap_all` (autouse-wrapped) when you want both datasets pre-downloaded at session start.
- Use {fixture}`unihan_quick_data` when you only need a raw text snippet rather than a fully bootstrapped Packager.
- Override {fixture}`unihan_cache_path` (or {fixture}`unihan_project_cache_path`) to redirect where cached UNIHAN data lives.
- Override {fixture}`unihan_home_user_name` when you need a custom test user identity.

---

## Dataset Bootstrap

The primary injection points for tests that need a working UNIHAN dataset.

```{eval-rst}
.. autofixture:: unihan_etl.pytest_plugin.unihan_quick_packager

.. rubric:: Example

.. code-block:: python

    def test_quick(unihan_quick_packager) -> None:
        unihan_quick_packager.download()
        unihan_quick_packager.export()
        assert unihan_quick_packager.options.destination.exists()

.. autofixture:: unihan_etl.pytest_plugin.unihan_full_packager

.. autofixture:: unihan_etl.pytest_plugin.unihan_ensure_quick

.. autofixture:: unihan_etl.pytest_plugin.unihan_ensure_full

.. autofixture:: unihan_etl.pytest_plugin.unihan_bootstrap_all

.. rubric:: Example

.. code-block:: python

    # conftest.py
    import pytest

    @pytest.fixture(scope="session", autouse=True)
    def bootstrap(unihan_bootstrap_all) -> None:
        return None
```

## Dataset Options & Paths

Session-scoped fixtures exposing the dataset filesystem layout and the {class}`~unihan_etl.options.Options` objects that drive the {class}`~unihan_etl.core.Packager`.

```{eval-rst}
.. autofixture:: unihan_etl.pytest_plugin.unihan_quick_options

.. autofixture:: unihan_etl.pytest_plugin.unihan_full_options

.. autofixture:: unihan_etl.pytest_plugin.unihan_quick_path

.. autofixture:: unihan_etl.pytest_plugin.unihan_full_path

.. autofixture:: unihan_etl.pytest_plugin.unihan_quick_zip_path

.. autofixture:: unihan_etl.pytest_plugin.unihan_quick_zip
```

## Raw Data Accessors

Lower-level fixtures for tests that need to inspect or transform UNIHAN data without invoking the full Packager pipeline.

```{eval-rst}
.. autofixture:: unihan_etl.pytest_plugin.unihan_quick_data

.. autofixture:: unihan_etl.pytest_plugin.unihan_quick_fixture_files

.. autofixture:: unihan_etl.pytest_plugin.unihan_quick_columns

.. autofixture:: unihan_etl.pytest_plugin.unihan_quick_normalized_data

.. autofixture:: unihan_etl.pytest_plugin.unihan_quick_expanded_data
```

## Mock Zip Fixtures

Build a synthetic `Unihan.zip` on disk for tests that exercise the download/extract path without hitting the real corpus.

```{eval-rst}
.. autofixture:: unihan_etl.pytest_plugin.unihan_mock_zip

.. autofixture:: unihan_etl.pytest_plugin.unihan_mock_zip_path

.. autofixture:: unihan_etl.pytest_plugin.unihan_mock_zip_pathname

.. autofixture:: unihan_etl.pytest_plugin.unihan_mock_test_dir
```

## Cache Paths (Override Hooks)

Override these in your project's `conftest.py` to redirect where unihan-etl caches downloaded archives, extracted files, and intermediate fixture state.

```{eval-rst}
.. autofixture:: unihan_etl.pytest_plugin.unihan_user_cache_path
   :kind: override_hook

.. autofixture:: unihan_etl.pytest_plugin.unihan_project_cache_path
   :kind: override_hook

.. autofixture:: unihan_etl.pytest_plugin.unihan_cache_path
   :kind: override_hook

.. rubric:: Example

.. code-block:: python

    # conftest.py
    import pathlib
    import pytest

    @pytest.fixture(scope="session")
    def unihan_cache_path(tmp_path_factory: pytest.TempPathFactory) -> pathlib.Path:
        return tmp_path_factory.mktemp("unihan-cache")

.. autofixture:: unihan_etl.pytest_plugin.unihan_fixture_root
   :kind: override_hook
```

## Home & User Environment

Create an isolated filesystem home for the duration of the test session. Override {fixture}`unihan_home_user_name` to control the user identity.

```{eval-rst}
.. autofixture:: unihan_etl.pytest_plugin.unihan_home_path

.. autofixture:: unihan_etl.pytest_plugin.unihan_home_user_name
   :kind: override_hook

.. rubric:: Example

.. code-block:: python

    # conftest.py
    import pytest

    @pytest.fixture(scope="session")
    def unihan_home_user_name() -> str:
        return "ci-runner"

.. autofixture:: unihan_etl.pytest_plugin.unihan_user_path

.. autofixture:: unihan_etl.pytest_plugin.unihan_zshrc
```

## Function-Scoped Helpers

```{eval-rst}
.. autofixture:: unihan_etl.pytest_plugin.unihan_test_options
```

---

## Types

```{eval-rst}
.. autodata:: unihan_etl.pytest_plugin.UnihanTestOptions
```

---

## Configuration

These `conf.py` values control how fixture documentation is rendered:

```{eval-rst}
.. confval:: pytest_fixture_hidden_dependencies

   Fixture names to suppress from "Depends on" lists.

   Default: common pytest builtins (:external+pytest:std:fixture:`pytestconfig`, :external+pytest:std:fixture:`capfd`, :external+pytest:std:fixture:`capsysbinary`, :external+pytest:std:fixture:`capfdbinary`, :external+pytest:std:fixture:`recwarn`, :external+pytest:std:fixture:`tmpdir`, :external+pytest:std:fixture:`pytester`, :external+pytest:std:fixture:`testdir`, :external+pytest:std:fixture:`record_property`, ``record_xml_attribute``, :external+pytest:std:fixture:`record_testsuite_property`, :external+pytest:std:fixture:`cache`).

.. confval:: pytest_fixture_builtin_links

   URL mapping for builtin fixture external links in "Depends on" blocks.

   Default: links to pytest docs for :external+pytest:std:fixture:`tmp_path_factory`, :external+pytest:std:fixture:`tmp_path`, :external+pytest:std:fixture:`monkeypatch`, :external+pytest:std:fixture:`request`, :external+pytest:std:fixture:`capsys`, :external+pytest:std:fixture:`caplog`.

.. confval:: pytest_external_fixture_links

   URL mapping for external fixture cross-references. Default: ``{}``.
```

---

```{note}
All fixtures above are also auto-discoverable via:

.. autofixtures:: unihan_etl.pytest_plugin
   :order: source

Use ``autofixtures::`` in your own plugin docs to document every fixture from a module without listing each one manually.
```

---

# Test helpers - unihan_etl.test

Source: https://unihan-etl.git-pull.com/api/test/

---
myst:
  html_meta:
    "description lang=en": "Extract UNIHAN to CSV, JSON, etc."
    "keywords": "unihan_etl, unihan-etl, unihan, unihan extractor, cjk, cjk dictionary"
    "property=og:locale": "en_US"
---

# Test helpers - `unihan_etl.test`

```{eval-rst}
.. automodule:: unihan_etl.test
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Typings - unihan_etl.types

Source: https://unihan-etl.git-pull.com/api/types/

---
myst:
  html_meta:
    "description lang=en": "Extract UNIHAN to CSV, JSON, etc."
    "keywords": "unihan_etl, unihan-etl, unihan, unihan extractor, cjk, cjk dictionary"
    "property=og:locale": "en_US"
---

# Typings - `unihan_etl.types`

```{eval-rst}
.. automodule:: unihan_etl.types
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Utilities - unihan_etl.util

Source: https://unihan-etl.git-pull.com/api/utils/

---
myst:
  html_meta:
    "description lang=en": "Extract UNIHAN to CSV, JSON, etc."
    "keywords": "unihan_etl, unihan-etl, unihan, unihan extractor, cjk, cjk dictionary"
    "property=og:locale": "en_US"
---

# Utilities - `unihan_etl.util`

```{eval-rst}
.. automodule:: unihan_etl.util
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# unihan-etl download

Source: https://unihan-etl.git-pull.com/cli/download/

(cli-download)=

# unihan-etl download

Download and cache the UNIHAN database without exporting.

## Command

```{eval-rst}
.. argparse::
    :module: unihan_etl.cli
    :func: create_parser
    :prog: unihan-etl
    :path: download
```

## Examples

Download and cache without exporting:

```console
$ unihan-etl download
```

Force re-download:

```console
$ unihan-etl download --no-cache
```

---

# unihan-etl export

Source: https://unihan-etl.git-pull.com/cli/export/

(cli-export)=

# unihan-etl export

Export UNIHAN data to CSV, JSON, or YAML format.

## Command

```{eval-rst}
.. argparse::
    :module: unihan_etl.cli
    :func: create_parser
    :prog: unihan-etl
    :path: export
```

## Examples

Export all UNIHAN data to JSON:

```console
$ unihan-etl export -F json
```

Export specific fields:

```console
$ unihan-etl export -F json -f kDefinition kMandarin
```

---

# unihan-etl fields

Source: https://unihan-etl.git-pull.com/cli/fields/

(cli-fields)=

# unihan-etl fields

List available UNIHAN fields with their descriptions and source files.

## Command

```{eval-rst}
.. argparse::
    :module: unihan_etl.cli
    :func: create_parser
    :prog: unihan-etl
    :path: fields
```

## Examples

List all available fields:

```console
$ unihan-etl fields
```

List fields as JSON (for programmatic use):

```console
$ unihan-etl fields --json
```

List fields from a specific file:

```console
$ unihan-etl fields -i Unihan_Readings.txt
```

---

# unihan-etl files

Source: https://unihan-etl.git-pull.com/cli/files/

(cli-files)=

# unihan-etl files

List available UNIHAN source files.

## Command

```{eval-rst}
.. argparse::
    :module: unihan_etl.cli
    :func: create_parser
    :prog: unihan-etl
    :path: files
```

## Examples

List available UNIHAN source files:

```console
$ unihan-etl files
```

Include field names for each file:

```console
$ unihan-etl files --with-fields --json
```

---

# CLI Reference

Source: https://unihan-etl.git-pull.com/cli/

(cli)=

(commands)=

# CLI Reference

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} unihan-etl download
:link: download
:link-type: doc
Download and cache the UNIHAN database.
:::

:::{grid-item-card} unihan-etl export
:link: export
:link-type: doc
Export UNIHAN data to CSV, JSON, or YAML.
:::

:::{grid-item-card} unihan-etl search
:link: search
:link-type: doc
Look up character data by codepoint or field.
:::

:::{grid-item-card} unihan-etl fields
:link: fields
:link-type: doc
List available UNIHAN fields.
:::

:::{grid-item-card} unihan-etl files
:link: files
:link-type: doc
List available UNIHAN source files.
:::

::::

```{toctree}
:caption: Data Operations
:maxdepth: 1

export
download
search
```

```{toctree}
:caption: Information
:maxdepth: 1

fields
files
```

(cli-main)=

## Main command

The `unihan-etl` command is the entry point for all UNIHAN ETL operations. Use subcommands to export data, download the database, or query fields and files.

### Command

```{eval-rst}
.. argparse::
    :module: unihan_etl.cli
    :func: create_parser
    :prog: unihan-etl
    :nosubcommands:

    subparser_name : @replace
        See :ref:`cli-export`
```

## Global Options

- `-l, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}`: Logging level (default: INFO)
- `-V, --version`: Show version and exit

## Output Formats

The `fields`, `files`, and `search` commands support different output formats:

- **Table** (default): Human-readable formatted output
- `--json`: Pretty-printed JSON (entire result as array/object)
- `--ndjson`: Newline-delimited JSON (one record per line, ideal for LLM consumption)

---

# unihan-etl search

Source: https://unihan-etl.git-pull.com/cli/search/

(cli-search)=

# unihan-etl search

Look up character data in the UNIHAN database.

## Command

```{eval-rst}
.. argparse::
    :module: unihan_etl.cli
    :func: create_parser
    :prog: unihan-etl
    :path: search
```

## Examples

Look up a character by its form:

```console
$ unihan-etl search 一
```

Look up by UCN:

```console
$ unihan-etl search U+4E00
```

Look up by hex codepoint:

```console
$ unihan-etl search 4E00
```

Get JSON output for LLM consumption:

```console
$ unihan-etl search 一 --json
```

Filter to specific fields:

```console
$ unihan-etl search 一 -f kDefinition kMandarin
```

---

# Changelog

Source: https://unihan-etl.git-pull.com/history/

(history)=

```{include} ../CHANGES

```

---

# unihan-etl

Source: https://unihan-etl.git-pull.com/

(index)=

# unihan-etl

Download, search, and export Unicode's UNIHAN CJK character dataset. Normalizes raw Unicode data files into clean JSON, CSV, or YAML.

unihan-etl handles the data pipeline. For SQLAlchemy models, see [unihan-db](https://unihan-db.git-pull.com/). For end-user character lookups, see [cihai](https://cihai.git-pull.com/).

::::{grid} 1 2 3 3
:gutter: 2 2 3 3

:::{grid-item-card} Quickstart
:link: quickstart
:link-type: doc
Install and run your first export.
:::

:::{grid-item-card} CLI Reference
:link: cli/index
:link-type: doc
Every command, flag, and option.
:::

:::{grid-item-card} API Reference
:link: api/index
:link-type: doc
Core modules, types, and pytest plugin.
:::

::::

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} Topics
:link: topics/index
:link-type: doc
About UNIHAN, FAQ, and data format details.
:::

:::{grid-item-card} Contributing
:link: project/index
:link-type: doc
Development setup, code style, and release process.
:::

::::

## Install

```console
$ uv tool install unihan-etl
```

```console
$ pip install unihan-etl
```

## At a glance

Fetches raw UNIHAN data from unicode.org.

```console
$ unihan-etl download
```

Look up a character across all fields.

```console
$ unihan-etl search 好
```

Export the full dataset to JSON (also supports CSV, YAML).

```console
$ unihan-etl export -F json
```

```{toctree}
:hidden:

quickstart
cli/index
api/index
topics/index
internals/index
project/index
history
```

```{toctree}
:hidden:
:caption: More

migration
GitHub
```

---

# App directories - unihan_etl._internal.app_dirs

Source: https://unihan-etl.git-pull.com/internals/api/app_dirs/

---
myst:
  html_meta:
    "description lang=en": "Extract UNIHAN to CSV, JSON, etc."
    "keywords": "unihan_etl, unihan-etl, unihan, unihan extractor, cjk, cjk dictionary"
    "property=og:locale": "en_US"
---

# App directories - `unihan_etl._internal.app_dirs`

```{eval-rst}
.. automodule:: unihan_etl._internal.app_dirs
   :members:
   :undoc-members:
   :show-inheritance:
   :no-value:
```

---

# Internal API

Source: https://unihan-etl.git-pull.com/internals/api/

(internal_api)=

# Internal API

```{module} unihan_etl

```

:::{warning}
Be careful with these! Internal APIs are **not** covered by version policies. They can break or be removed between minor versions!

If you need an internal API stabilized please [file an issue](https://github.com/cihai/unihan-etl/issues).
:::

```{toctree}
:caption: Internal API
:maxdepth: 1

app_dirs
```

---

# Internals

Source: https://unihan-etl.git-pull.com/internals/

(internals)=

# Internals

```{warning}
Everything in this section is **internal implementation detail**. There is no stability guarantee. Interfaces may change or be removed without notice between any release.

If you are building an application with unihan-etl, use the [CLI](../cli/index.md) or the [public API](../api/index.md).
```

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} Internal Python API
:link: api/index
:link-type: doc
Internal module reference for contributors and plugin authors.
:::

::::

```{toctree}
:hidden:

api/index
```

---

# Migration notes

Source: https://unihan-etl.git-pull.com/migration/

(migration)=

```{currentmodule} unihan_etl

```

```{include} ../MIGRATION

```

---

# Code Style

Source: https://unihan-etl.git-pull.com/project/code-style/

(code-style)=

# Code Style

## Formatting and linting

unihan-etl uses [ruff](https://docs.astral.sh/ruff/) for both formatting and linting.

```console
$ uv run ruff format .
```

```console
$ uv run ruff check . --fix --show-fixes
```

## Type checking

[mypy](https://mypy.readthedocs.io/) runs in strict mode.

```console
$ uv run mypy src tests
```

## Docstrings

Follow [NumPy-style](https://numpydoc.readthedocs.io/en/latest/format.html) docstrings in reStructuredText format.

## Imports

- Begin every module with `from __future__ import annotations`.
- Prefer namespace imports for stdlib: `import typing as t`, `import pathlib`.
- Third-party packages may use `from X import Y`.

---

# Contributing

Source: https://unihan-etl.git-pull.com/project/contributing/

(contributing)=

# Contributing

unihan-etl is part of the [cihai project](https://cihai.git-pull.com/).
Development conventions, issue triage, and PR guidelines follow the shared cihai contributing guide:

>

## Quick start

```console
$ git clone https://github.com/cihai/unihan-etl.git
```

```console
$ cd unihan-etl
```

```console
$ uv sync --group dev
```

```console
$ uv run py.test
```

## Continuous testing

```console
$ uv run ptw .
```

---

# Project

Source: https://unihan-etl.git-pull.com/project/

(project)=

# Project

Information for contributors and maintainers.

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} Contributing
:link: contributing
:link-type: doc
Development setup, running tests, submitting PRs.
:::

:::{grid-item-card} Code Style
:link: code-style
:link-type: doc
Ruff, mypy, NumPy docstrings, import conventions.
:::

:::{grid-item-card} Releasing
:link: releasing
:link-type: doc
Release checklist and version policy.
:::

::::

```{toctree}
:hidden:

contributing
code-style
releasing
```

---

# Releasing

Source: https://unihan-etl.git-pull.com/project/releasing/

(releasing)=

# Releasing

## Version policy

unihan-etl follows [semantic versioning](https://semver.org/). Until 1.0, minor releases may include breaking API changes.

## Release checklist

1. Update `CHANGES` with new entries under the next version heading.
2. Bump the version in `pyproject.toml`.
3. Commit: `git commit -m "chore: release vX.Y.Z"`.
4. Tag: `git tag vX.Y.Z`.
5. Push: `git push --follow-tags`.
6. CI publishes to PyPI automatically on tagged pushes.

---

# Quickstart

Source: https://unihan-etl.git-pull.com/quickstart/

(quickstart)=

# Quickstart

## Installation

Ensure you have Python **>= 3.7**.

Using [uv]:

```console
$ uv add unihan-etl
```

Run the CLI once without a persistent install via `uvx`:

```console
$ uvx unihan-etl
```

Using [pip]:

```console
$ pip install --user unihan-etl
```

You can upgrade to the latest release with:

```console
$ pip install --user --upgrade unihan-etl
```

(developmental-releases)=

### Developmental releases

New versions of unihan-etl are published to PyPI as alpha, beta, or release candidates. Their version strings carry a suffix such as `a1`, `b1`, or `rc1`, respectively. For example, `0.27.0a1` is the first alpha release of `0.27.0` before general availability.

- [uv]:

  ```console
  $ uv add unihan-etl --prerelease allow
  ```

- [pip]\:

  ```console
  $ pip install --user --upgrade --pre unihan-etl
  ```

- [pipx]\:

  ```console
  $ pipx install --suffix=@next 'unihan-etl' --pip-args '\--pre' --force
  ```

  Then run `unihan-etl@next` with a subcommand, e.g. `unihan-etl@next export`.

- [uv tool install][uv-tools]:

  ```console
  $ uv tool install --prerelease allow unihan-etl
  ```

- [uvx][uvx]:

  ```console
  $ uvx --from 'unihan-etl' --prerelease allow unihan-etl
  ```

  Then rerun with your desired arguments, e.g. `uvx --prerelease allow unihan-etl export`.

Via trunk (can break easily):

- [pip]\:

  ```console
  $ pip install --user -e git+https://github.com/cihai/unihan-etl.git#egg=unihan-etl
  ```

- [pipx]\:

  ```console
  $ pipx install --suffix=@master 'unihan-etl @ git+https://github.com/cihai/unihan-etl.git@master' --force
  ```

- `uvx`\*:

  ```console
  $ uvx --from git+https://github.com/cihai/unihan-etl.git@master unihan-etl
  ```

\*`uvx --from` lets you run directly from a VCS URL.

[pip]: https://pip.pypa.io/en/stable/
[pipx]: https://pypa.github.io/pipx/docs/
[uv]: https://docs.astral.sh/uv/
[uv-tools]: https://docs.astral.sh/uv/concepts/tools/
[uvx]: https://docs.astral.sh/uv/guides/tools/

## Commands

```console
$ unihan-etl
```

## Pythonics

:::{seealso}
{ref}`unihan-etl API documentation <api>`.
:::

---

# Frequently Asked Questions

Source: https://unihan-etl.git-pull.com/topics/faq/

(faq)=

# Frequently Asked Questions

...

Why are some fields, e.g. _kTotalStrokes_, in lists when there's seemingly not any multi-value data?

: The word back from the developers of UNIHAN is they keep some fields
  multi-valued for future use.

  > Apparently at the moment there is only one record with two values for
  > the kTotalStrokes field in the Unihan database. However, the maintainers
  > of the data intend to populate the kTotalStrokes field as needed in the
  > future, and as documented in UAX #38.
  >
  > May 30, 2017 (Unicode 9.0)
  >
  > unihan-etl is designed to handle fields correctly and consistently
  > according to the documentation in the database.

---

# Topics

Source: https://unihan-etl.git-pull.com/topics/

(topics)=

# Topics

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} About UNIHAN
:link: unihan
:link-type: doc
The Unicode Han Database: scope, structure, and field categories.
:::

:::{grid-item-card} FAQ
:link: faq
:link-type: doc
Common questions about multi-value fields, export formats, and data quirks.
:::

::::

```{toctree}
:hidden:

unihan
faq
```

---

# About UNIHAN

Source: https://unihan-etl.git-pull.com/topics/unihan/

(unihan)=

# About UNIHAN

:::{seealso}

- [Wikipedia article](https://en.wikipedia.org/wiki/Han_unification)
- [UNIHAN database documentation][unihan database documentation]

:::

## Languages, Computers, and You

There are many languages and writing systems around the world. Computers internally use numbers to represent characters in writing systems. As computers became more prominent, hundreds of encoding systems were developed to handle writing systems from different regions.

No single encoding system covered all languages. Adding to the complexity, encodings conflicted with each other on the numbers assigned to characters. Any data decoded with the wrong standard would turn up as gibberish.
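
As a small illustration of the gibberish problem described above, the following sketch encodes one character and decodes it with the wrong standard. The choice of 好 and of the `big5` and `latin-1` codecs is purely illustrative; nothing here is part of unihan-etl.

```python
# Illustrative only: the sample character and codecs are assumptions.
text = "好"

# Unicode assigns the character a single number (its codepoint).
codepoint = hex(ord(text))

# Different encodings store that character as different byte sequences.
utf8_bytes = text.encode("utf-8")
big5_bytes = text.encode("big5")

# Decoding UTF-8 bytes with the wrong standard yields gibberish.
garbled = utf8_bytes.decode("latin-1")

# Decoding with the matching standard restores the character.
restored = utf8_bytes.decode("utf-8")

print(codepoint)
print(garbled)
print(restored)
```

The bytes are identical in both decode calls; only the standard used to interpret them differs.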

[Unicode][unicode] is a standard devised to provide a unique number for every character. This entails pulling together minds from around the world to assign codepoints. The _Unicode Consortium_ is a non-profit organization founded to develop, extend and promote use of the Unicode Standard.

## What is UNIHAN?

UNIHAN, short for [Han unification][han unification], is the effort of the consortium to assign codepoints to CJK characters. Any single [han character][han character] can have multiple historical or regional variants to account for, hence "unification".

```{image} _static/img/sword_variants.png
:width: 300px
:align: center
```

To do this, various sources of information are pulled together and cross-referenced to detail characteristics of the glyphs, and vet them through a thorough proofreading process.

It's an international effort, hallmarked by collaboration between researchers and groups like the [Ideographic Rapporteur Group][ideographic rapporteur group]. Glyphs once only noted in dictionaries and antiquity are set in stone with their own codepoints, carefully cross-referenced with information from, often multiple, distinct sources.

The advantage that UNIHAN provides to East Asian researchers, including sinologists and japanologists, linguists, analysts, language learners, and hobbyists cannot be overstated. Unbeknownst to users, it's used under the hood in many applications and websites.

The resulting standard has industrial ramifications downstream to software developers and computer users. When a version of Unicode is released, it is then incorporated downstream in software projects.

## The database

UNIHAN provides a database of its information, which is the culmination of CJK information that has been vetted and proofed painstakingly over years.

You can view the [UNIHAN Database documentation][unihan database documentation] to see where the information in each field is derived from.
For instance:

- [kCantonese](http://www.unicode.org/reports/tr38/#kCantonese): The Cantonese pronunciation(s) for this character using the [jyutping romanization][jyutping romanization].

  Bibliography:

  1. Casey, G. Hugh, S.J. Ten Thousand Characters: An Analytic Dictionary. Hong Kong: Kelley and Walsh, 1980 (kPhonetic).
  2. Cheung Kwan-hin and Robert S. Bauer, The Representation of Cantonese with Chinese Characters, Journal of Chinese Linguistics Monograph Series Number 18, 2002.
  3. Roy T. Cowles, A Pocket Dictionary of Cantonese, Hong Kong: University Press, 1999 (kCowles).
  4. Sidney Lau, A Practical Cantonese-English Dictionary, Hong Kong: Government Printer, 1977 (kLau).
  5. Bernard F. Meyer and Theodore F. Wempe, Student’s Cantonese-English Dictionary, Maryknoll, New York: Catholic Foreign Mission Society of America, 1947 (kMeyerWempe).
  6. 饒秉才, ed. 廣州音字典, Hong Kong: Joint Publishing (H.K.) Co., Ltd., 1989.
  7. 中華新字典, Hong Kong:中華書局, 1987.
  8. 黃港生, ed. 商務新詞典, Hong Kong: The Commercial Press, 1991.
  9. 朗文初級中文詞典, Hong Kong: Longman, 2001.

- [kHanYu](http://www.unicode.org/reports/tr38/#kHanYu): The position of this character in the Hanyu Da Zidian (HDZ) Chinese character dictionary.

  Bibliography:

  1. [‘Great Chinese Character Dictionary’ (in 8 Volumes)]. XU Zhongshu (Editor in Chief). Wuhan, Hubei Province (PRC): Hubei and Sichuan Dictionary Publishing Collectives, 1986-1990. ISBN: 7-5403-0030-2/H.16.

- [kHanyuPinyin](http://www.unicode.org/reports/tr38/#kHanyuPinyin): The 漢語拼音 Hànyǔ Pīnyīn reading(s) appearing in the edition of 《漢語大字典》 [Hànyǔ Dà Zìdiǎn][hànyǔ dà zìdiǎn] (HDZ) specified in the “kHanYu” property description (q.v.).

  Bibliography:

  - This data was originally input by 井作恆 Jǐng Zuòhéng
  - proofed by 聃媽歌 Dān Māgē (Magda Danish, using software donated by 文林 Wénlín Institute, Inc. and tables prepared by 曲理查 Qū Lǐchá),
  - and proofed again and prepared for the Unicode Consortium by 曲理查 Qū Lǐchá (2008-01-14).

Han Unification is a global effort.
And it's available free to the world.

[unicode]: https://en.wikipedia.org/wiki/Unicode
[han unification]: https://en.wikipedia.org/wiki/Han_unification
[ideographic rapporteur group]: https://en.wikipedia.org/wiki/Ideographic_Rapporteur_Group
[han character]: https://en.wikipedia.org/wiki/Chinese_characters
[unihan database documentation]: http://www.unicode.org/reports/tr38/
[jyutping romanization]: https://en.wikipedia.org/wiki/Jyutping
[hànyǔ dà zìdiǎn]: https://en.wikipedia.org/wiki/Hanyu_Da_Zidian

## The problem

It's difficult to readily take advantage of the UNIHAN database in its raw form.

UNIHAN comprises over 20 MB of character information, separated across multiple files. Within these files are _90_ fields, spanning 8 general categories of data. Within some of the fields, there are specific considerations to take into account to use the data correctly, for instance:

UNIHAN's values can contain references to its own codepoints, as in _kDefinition_:

```
U+3400 kDefinition (same as U+4E18 丘) hillock or mound
```

Values can be delimited by spaces, as in _kCantonese_:

```
U+342B kCantonese gun3 hung1 zung1
```

Or by spaces that separate values for different locales, as in _kMandarin_: "When there are two values, then the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). When there is only one value, it is appropriate for both.":

```
U+7E43 kMandarin běng bēng
```

Values can also be delimited by field-specific rules, as in _kDefinition_: "Major definitions are separated by semicolons, and minor definitions by commas.":

```
U+3402 kDefinition (J) non-standard form of U+559C 喜, to like, love, enjoy; a joyful thing
```

More complicated yet, _kHanyuPinyin_: "multiple locations for a given pīnyīn reading are separated by “,” (comma). The list of locations is followed by “:” (colon), followed by a comma-separated list of one or more pīnyīn readings.
Where multiple pīnyīn readings are associated with a given mapping, these are ordered as in HDZ (for the most part reflecting relative commonality). The following are representative records.":

```
U+3FCE kHanyuPinyin 42699.050:fèn,fén
U+34D8 kHanyuPinyin 10278.080,10278.090:sù
U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī
U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng
```

Data could be exported to a CSV, but users wouldn't be able to handle the delimited values and structured information held within.

Since CSV does not support structured information, a format that does support it is needed.

Even then, users may not want an export that expands the structured output of fields. So if a tool exists, exports should be configurable. Users could then export a field with `gun3 hung1 zung1` pristinely without turning it into list form.

---
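
To make the delimiter rules described under "The problem" concrete, here is a minimal sketch of how raw _kCantonese_ and _kHanyuPinyin_ values could be expanded into structured data. The helper names `parse_space_delimited` and `parse_hanyu_pinyin` are hypothetical and are not the API of `unihan_etl.expansion`; the sketch only follows the quoted rules (spaces between entries, a colon between locations and readings, commas within each list).

```python
from __future__ import annotations


def parse_space_delimited(value: str) -> list[str]:
    """Split a space-delimited value such as kCantonese (hypothetical helper)."""
    return value.split()


def parse_hanyu_pinyin(value: str) -> list[dict[str, list[str]]]:
    """Expand a raw kHanyuPinyin value (hypothetical helper).

    Entries are separated by spaces. Within an entry, a colon separates
    the comma-delimited locations from the comma-delimited readings.
    """
    entries = []
    for chunk in value.split():
        locations, _, readings = chunk.partition(":")
        entries.append(
            {
                "locations": locations.split(","),
                "readings": readings.split(","),
            }
        )
    return entries


cantonese = parse_space_delimited("gun3 hung1 zung1")
pinyin = parse_hanyu_pinyin("10093.130:xī,lǔ 74609.020:lǔ,xī")

print(cantonese)
print(pinyin)
```

The result is nested lists and mappings, which is the kind of structure that JSON or YAML can hold but a flat CSV cell cannot.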