Core - unihan_etl.core

Download UNIHAN, extract, transform, and load (ETL) it into a structured format, and export it.

unihan_etl.core.not_junk(line)

Return False for blank lines (a bare newline) and comment lines (those starting with #, as in the UNIHAN text files).

Parameters:

line (str)

Return type:

bool
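
The filter described above can be sketched as follows. This is a minimal stand-in, not the library's source: the name is_content_line is hypothetical, and it assumes comment lines begin with "#" as they do in the UNIHAN text files.

```python
def is_content_line(line: str) -> bool:
    """Hypothetical stand-in: False for a bare newline or a '#' comment line."""
    return line != "\n" and not line.startswith("#")


# Sample lines in the UNIHAN tab-separated layout.
sample = [
    "# Unihan_Readings.txt\n",
    "\n",
    "U+3400\tkCantonese\tjau1\n",
]

# Only the data line survives the filter.
kept = [line for line in sample if is_content_line(line)]
```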

unihan_etl.core.in_fields(c, fields)

Return True if string is in the default fields.

Parameters:
  • c

  • fields

Return type:

bool

unihan_etl.core.filter_manifest(files)

Return filtered UNIHAN_MANIFEST from list of file names.

Parameters:

files (list[str])

Return type:

UntypedUnihanData

unihan_etl.core.files_exist(path, files)

Return True if all files exist in specified path.

Parameters:
  • path

  • files

Return type:

bool
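
A check of this kind might look like the sketch below. The helper name all_present is hypothetical, and the sketch assumes files holds names relative to path.

```python
import os
import tempfile
from typing import Sequence


def all_present(path: str, files: Sequence[str]) -> bool:
    """Hypothetical stand-in: True only if every name exists under path."""
    return all(os.path.exists(os.path.join(path, name)) for name in files)


with tempfile.TemporaryDirectory() as tmp:
    # Create one of the two files we will ask about.
    with open(os.path.join(tmp, "Unihan_Readings.txt"), "w") as fh:
        fh.write("# sample\n")

    only_existing = all_present(tmp, ["Unihan_Readings.txt"])
    with_missing = all_present(tmp, ["Unihan_Readings.txt", "Unihan_Variants.txt"])
```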

exception unihan_etl.core.FieldNotFound

Bases: Exception

Raised if a field is not found in the file list.

exception unihan_etl.core.FileNotSupported

Bases: Exception

Raised if a requested field is not included in the current file list.

unihan_etl.core.get_files(fields)

Return the list of files required by the given fields. Acts as a simple dependency resolver.

Parameters:

fields (Sequence[str])

Return type:

list[str]
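
One way to picture the dependency resolution is a lookup from field names back to the files that contain them. The sketch below assumes a manifest shaped as a mapping of file name to field names. SAMPLE_MANIFEST is a small illustrative subset, files_for_fields is a hypothetical name, and KeyError stands in for whatever exception the library raises.

```python
from typing import Dict, List, Sequence

# Illustrative subset only; the real manifest covers many more files and fields.
SAMPLE_MANIFEST: Dict[str, List[str]] = {
    "Unihan_Readings.txt": ["kCantonese", "kDefinition", "kMandarin"],
    "Unihan_DictionaryLikeData.txt": ["kCangjie", "kFourCornerCode"],
}


def files_for_fields(
    fields: Sequence[str],
    manifest: Dict[str, List[str]] = SAMPLE_MANIFEST,
) -> List[str]:
    """Hypothetical resolver: collect every file that holds a requested field."""
    files = set()
    for field in fields:
        owners = [name for name, columns in manifest.items() if field in columns]
        if not owners:
            # Stand-in error type for this sketch.
            raise KeyError(f"field not found in manifest: {field}")
        files.update(owners)
    return sorted(files)


needed = files_for_fields(["kDefinition", "kCangjie"])
```

Requesting two fields from the same file yields that file once, because the results are collected in a set before sorting.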

unihan_etl.core.get_parser()

Return argparse.ArgumentParser instance for CLI.

Returns:

argument parser for CLI use.

Return type:

argparse.ArgumentParser

unihan_etl.core.has_valid_zip(zip_path)

Return True if valid zip exists.

Parameters:

zip_path (str or pathlib.Path) – absolute path to zip

Returns:

True if valid zip exists at path

Return type:

bool
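
A validity check along these lines can be built from the standard library alone. In this sketch the name looks_like_valid_zip is hypothetical, and "valid" is assumed to mean "an existing file that zipfile recognizes as an archive".

```python
import os
import tempfile
import zipfile


def looks_like_valid_zip(zip_path) -> bool:
    """Hypothetical check: path is an existing file and a readable zip archive."""
    return os.path.isfile(zip_path) and zipfile.is_zipfile(zip_path)


with tempfile.TemporaryDirectory() as tmp:
    # A real archive with one member.
    good = os.path.join(tmp, "Unihan.zip")
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("Unihan_Readings.txt", "# sample\n")

    # A file with a .zip name but plain-text content.
    bad = os.path.join(tmp, "not_a_zip.zip")
    with open(bad, "w") as fh:
        fh.write("plain text")

    # A path that does not exist at all.
    missing = os.path.join(tmp, "missing.zip")

    results = (
        looks_like_valid_zip(good),
        looks_like_valid_zip(bad),
        looks_like_valid_zip(missing),
    )
```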

unihan_etl.core.zip_has_files(files, zip_file)

Return True if zip has the files inside.

Parameters:
  • files

  • zip_file

Returns:

True if all files are listed in zipfile.ZipFile.namelist()

Return type:

bool
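
The membership test against zipfile.ZipFile.namelist() can be sketched as a subset check. The helper name archive_contains is hypothetical, and the archive is built in memory so the example needs no files on disk.

```python
import io
import zipfile
from typing import Sequence


def archive_contains(files: Sequence[str], zip_file: zipfile.ZipFile) -> bool:
    """Hypothetical stand-in: True if every name appears in the archive listing."""
    return set(files).issubset(zip_file.namelist())


# Build a small in-memory archive with two empty members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Unihan_Readings.txt", "")
    zf.writestr("Unihan_Variants.txt", "")

with zipfile.ZipFile(buffer) as zf:
    has_readings = archive_contains(["Unihan_Readings.txt"], zf)
    has_missing = archive_contains(["Unihan_Readings.txt", "Unihan_Missing.txt"], zf)
```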

unihan_etl.core.download(url, dest, urlretrieve_fn=<function urlretrieve>, reporthook=None, cache=True)

Download UNIHAN zip from URL to destination.

Parameters:
  • url (str or pathlib.Path) – URL to download from.

  • dest (pathlib.Path) – file path where download is to be saved.

  • urlretrieve_fn (UrlRetrieveFn) – function to download file

  • reporthook (ReportHookFn, Optional) – Function to write progress bar to stdout buffer.

  • cache (bool)

Returns:

Destination path the file was downloaded to.

Return type:

pathlib.Path
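
The injectable urlretrieve_fn parameter makes it possible to exercise a download routine without network access. The sketch below is not the library's implementation: fetch and fake_retrieve are hypothetical names, the URL is a placeholder, and the sketch assumes that cache=True means an existing file at dest is reused rather than downloaded again.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def fetch(url, dest: Path, urlretrieve_fn=urlretrieve, reporthook=None, cache=True) -> Path:
    """Hypothetical sketch: skip the download when cache is on and dest exists."""
    if cache and dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve_fn(str(url), str(dest), reporthook)
    return dest


calls = []


def fake_retrieve(url, filename, reporthook=None):
    """Test double with the same positional shape as urllib's urlretrieve."""
    calls.append(url)
    Path(filename).write_bytes(b"stub")
    return filename, None


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "downloads" / "Unihan.zip"
    url = "https://example.invalid/Unihan.zip"  # placeholder, never contacted
    fetch(url, dest, urlretrieve_fn=fake_retrieve)               # downloads
    fetch(url, dest, urlretrieve_fn=fake_retrieve)               # reuses the cached file
    fetch(url, dest, urlretrieve_fn=fake_retrieve, cache=False)  # downloads again
    download_count = len(calls)
```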

unihan_etl.core.load_data(files)

Extract zip and process information into CSVs.

Parameters:

files (list of str)

Returns:

combined data from files

Return type:

str

unihan_etl.core.extract_zip(zip_path, dest_dir)

Extract zip file. Return zipfile.ZipFile instance.

Parameters:
  • zip_path

  • dest_dir

Returns:

The extracted zip.

Return type:

zipfile.ZipFile

unihan_etl.core.normalize(raw_data, fields)

Return normalized data from UNIHAN data files.

Parameters:
  • raw_data (str) – combined text files from UNIHAN

  • fields (list of str) – list of columns to pull

Returns:

list of unihan character information

Return type:

list
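
To make the shape of the transformation concrete, here is a minimal sketch that turns tab-separated UNIHAN lines into one dict per character. It is not the library's code: normalize_sketch is a hypothetical name, and the output keys "ucn" and "char" and the use of None for absent fields are assumptions made for this example.

```python
from typing import Dict, List, Optional, Sequence

RAW = (
    "# sample lines in the UNIHAN tab-separated layout\n"
    "U+3400\tkCantonese\tjau1\n"
    "U+3400\tkMandarin\tqi\u016b\n"
    "U+3401\tkCantonese\ttim2\n"
)


def normalize_sketch(
    raw_data: str, fields: Sequence[str]
) -> List[Dict[str, Optional[str]]]:
    """Hypothetical stand-in: group (code point, field, value) lines per character."""
    rows: Dict[str, Dict[str, Optional[str]]] = {}
    for line in raw_data.splitlines():
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        ucn, field, value = line.split("\t", 2)
        if field not in fields:
            continue
        if ucn not in rows:
            # "U+3400" -> the character itself; every requested field starts as None.
            rows[ucn] = {"ucn": ucn, "char": chr(int(ucn[2:], 16))}
            rows[ucn].update({name: None for name in fields})
        rows[ucn][field] = value
    return list(rows.values())


items = normalize_sketch(RAW, ["kCantonese", "kMandarin"])
```

Each item carries every requested field, so a character with no kMandarin line still has that key, set to None.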

unihan_etl.core.expand_delimiters(normalized_data)

Return expanded multi-value fields in UNIHAN.

Parameters:

normalized_data (list of dict) – Expects data in list of hashes, per core.normalize()

Returns:

Items whose fields have delimiters and custom separation rules are expanded. This includes multi-value fields that do not use both, so that all fields stay consistent.

Return type:

list of dict
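
The simplest case of expansion, splitting space-separated values into lists, might look like the sketch below. Everything here is an assumption for illustration: expand_sketch is a hypothetical name, SPACE_DELIMITED is a hand-picked set rather than the library's configuration, and the reading values are illustrative rather than quoted from the database. Fields with custom separation rules are not covered.

```python
from typing import Any, Dict, List

# Assumption for illustration: these fields hold space-separated values.
SPACE_DELIMITED = {"kCantonese", "kMandarin"}


def expand_sketch(normalized_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical stand-in: split delimited fields into lists, leaving None as is."""
    expanded = []
    for item in normalized_data:
        row = dict(item)  # copy so the input is not mutated
        for field in SPACE_DELIMITED:
            if row.get(field) is not None:
                row[field] = row[field].split(" ")
        expanded.append(row)
    return expanded


data = [
    # Illustrative values only.
    {"ucn": "U+884C", "char": "\u884c", "kCantonese": "hang4 hong4", "kMandarin": None},
    {"ucn": "U+4E00", "char": "\u4e00", "kCantonese": "jat1", "kMandarin": None},
]
expanded = expand_sketch(data)
```

A single value also becomes a one-element list, which keeps the field's type the same across all items.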

unihan_etl.core.listify(data, fields)

Convert tabularized data to a CSV-friendly list.

Parameters:
  • data (list of dict)

  • fields (Sequence[str]) – keys/columns, e.g. ['kDictionary']

Return type:

ListifiedExport
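
A CSV-friendly list is typically a list of rows that csv.writer can consume directly. The sketch below assumes the first row is a header built from fields. The name listify_sketch is hypothetical and the exact row layout used by the library may differ.

```python
import csv
import io
from typing import Any, Dict, List, Sequence


def listify_sketch(data: List[Dict[str, Any]], fields: Sequence[str]) -> List[List[Any]]:
    """Hypothetical stand-in: header row first, then one row per item."""
    rows: List[List[Any]] = [list(fields)]
    rows.extend([item.get(field) for field in fields] for item in data)
    return rows


data = [
    {"ucn": "U+3400", "char": "\u3400", "kCantonese": "jau1"},
    {"ucn": "U+3401", "char": "\u3401", "kCantonese": "tim2"},
]
rows = listify_sketch(data, ["ucn", "char", "kCantonese"])

# The rows can be handed straight to the standard csv module.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
```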

unihan_etl.core.export_csv(data, destination, fields)

Export UNIHAN in flattened, CSV format.

Parameters:
  • data

  • destination

  • fields

Return type:

None

unihan_etl.core.export_json(data, destination)

Export UNIHAN in JSON format.

Parameters:
  • data

  • destination

Return type:

None

unihan_etl.core.export_yaml(data, destination)

Export UNIHAN in YAML format.

Parameters:
  • data

  • destination

Return type:

None

unihan_etl.core.is_default_option(field_name, val)

Return True if option is a unihan-etl default.

Parameters:
  • field_name

  • val

Return type:

bool

unihan_etl.core.validate_options(options)

Validate unihan-etl options.

Parameters:

options (Options)

Return type:

TypeGuard[Options]

class unihan_etl.core.Packager

Bases: object

Download, ETL, and customize an export of UNIHAN.

UNIHAN Documentation: http://www.unicode.org/reports/tr38/

unihan_etl.core.setup_logger(logger=None, level='DEBUG')

Configure logger for CLI use.

Parameters:
  • logger (Logger) – instance of logger

  • level (str) – logging level, e.g. 'DEBUG'

Return type:

None
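
Logger configuration of this kind usually attaches one stream handler and sets the level. The sketch below is an assumption-laden illustration, not the library's code: configure_logger, the logger name, and the format string are all hypothetical, and unlike the documented function it returns the logger for convenience.

```python
import logging
from typing import Optional


def configure_logger(
    logger: Optional[logging.Logger] = None, level: str = "DEBUG"
) -> logging.Logger:
    """Hypothetical stand-in: attach one stream handler and set the level."""
    if logger is None:
        logger = logging.getLogger("unihan_etl_sketch")
    # Avoid stacking duplicate handlers when called more than once.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)  # setLevel accepts level names such as "DEBUG"
    return logger


log = configure_logger(level="DEBUG")
log = configure_logger(log, level="INFO")
handler_count = len(log.handlers)
```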