Core - unihan_etl.core

Download UNIHAN, run ETL to transform it into a structured format, and export it.

unihan_etl.core.not_junk(line)[source]

Return False on blank lines and comment lines (lines starting with #).

Return type:

bool

Parameters:

line (str) –
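As a rough illustration of this kind of filter, the sketch below keeps only the data lines of a UNIHAN-style text file. It is a stand-in written for this page, not the library's code: `is_data_line` is a hypothetical name, and it assumes comment lines start with `#`, as they do in the UNIHAN text files.

```python
def is_data_line(line: str) -> bool:
    """Hypothetical stand-in: True for lines that are neither blank nor comments."""
    # Assumption: comments start with "#", as in the UNIHAN text files.
    return line != "\n" and not line.startswith("#")


sample = [
    "# Unihan_Readings.txt\n",
    "\n",
    "U+4E00\tkMandarin\tyī\n",
]

# Keep only the lines that carry data.
data_lines = [line for line in sample if is_data_line(line)]
print(data_lines)
```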

unihan_etl.core.in_fields(c, fields)[source]

Return True if string is in the default fields.

Return type:

bool

Parameters:
  • c –

  • fields –

unihan_etl.core.filter_manifest(files)[source]

Return filtered UNIHAN_MANIFEST from list of file names.

Return type:

UntypedUnihanData

Parameters:

files (List[str]) –

unihan_etl.core.files_exist(path, files)[source]

Return True if all files exist in specified path.

Return type:

bool

Parameters:
  • path –

  • files –

exception unihan_etl.core.FieldNotFound(field)[source]

Bases: Exception

Raise if field not found in file list.

Parameters:

field (str) –

Return type:

None

exception unihan_etl.core.FileNotSupported(field)[source]

Bases: Exception

Raise if field requested is not included in current file list.

Parameters:

field (str) –

Return type:

None

unihan_etl.core.get_files(fields)[source]

Return list of files required by fields. Simple dependency resolver.

Return type:

List[str]

Parameters:

fields (Sequence[str]) –
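A simple dependency resolver of this kind could look like the sketch below. It is only an illustration: `SAMPLE_MANIFEST` is a hypothetical, heavily trimmed mapping of file names to fields, `files_for_fields` is a made-up name, and a plain `KeyError` stands in for the `FieldNotFound` exception documented above.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical, heavily trimmed manifest: file name -> fields it provides.
SAMPLE_MANIFEST: Dict[str, Tuple[str, ...]] = {
    "Unihan_Readings.txt": ("kCantonese", "kDefinition", "kMandarin"),
    "Unihan_Variants.txt": ("kSimplifiedVariant", "kZVariant"),
}


def files_for_fields(fields: Sequence[str], manifest=SAMPLE_MANIFEST) -> List[str]:
    """Return the sorted file names needed to supply every requested field."""
    needed = set()
    for field in fields:
        providers = [name for name, provided in manifest.items() if field in provided]
        if not providers:
            # Stand-in for the documented FieldNotFound exception, so the
            # sketch needs no imports from the package.
            raise KeyError(field)
        needed.update(providers)
    return sorted(needed)


required = files_for_fields(["kMandarin", "kZVariant", "kDefinition"])
print(required)
```

Two fields from the same file resolve to a single entry, so each file is listed once.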

unihan_etl.core.get_parser()[source]

Return argparse.ArgumentParser instance for CLI.

Returns:

argument parser for CLI use.

Return type:

argparse.ArgumentParser

unihan_etl.core.has_valid_zip(zip_path)[source]

Return True if valid zip exists.

Parameters:

zip_path (str or pathlib.Path) – absolute path to zip

Returns:

True if valid zip exists at path

Return type:

bool
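One way to perform such a check with the standard library is sketched below. `looks_like_valid_zip` is a hypothetical helper, not the function documented here; it assumes "valid" means the path is an existing file that `zipfile.is_zipfile` recognizes.

```python
import tempfile
import zipfile
from pathlib import Path


def looks_like_valid_zip(zip_path) -> bool:
    """Hypothetical helper: True if zip_path is an existing, readable zip file."""
    path = Path(zip_path)
    return path.is_file() and zipfile.is_zipfile(path)


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "Unihan.zip"
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("Unihan_Readings.txt", "U+4E00\tkMandarin\tyī\n")

    bad = Path(tmp) / "broken.zip"
    bad.write_text("not a zip")  # right extension, wrong content

    missing = Path(tmp) / "missing.zip"

    results = (
        looks_like_valid_zip(good),
        looks_like_valid_zip(bad),
        looks_like_valid_zip(missing),
    )
    print(results)
```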

unihan_etl.core.zip_has_files(files, zip_file)[source]

Return True if zip has the files inside.

Parameters:
  • files –

  • zip_file –

Returns:

True if the files are inside of zipfile.ZipFile.namelist()

Return type:

bool
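The membership test against `zipfile.ZipFile.namelist()` can be sketched as follows. `zip_contains` is a hypothetical name, and the archive is built in memory purely for the demonstration.

```python
import io
import zipfile


def zip_contains(files, zip_file: zipfile.ZipFile) -> bool:
    """Hypothetical helper: True if every name in files is a member of the archive."""
    return set(files).issubset(zip_file.namelist())


# Build a small in-memory archive with two empty members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Unihan_Readings.txt", "")
    archive.writestr("Unihan_Variants.txt", "")

with zipfile.ZipFile(buffer) as archive:
    has_both = zip_contains(["Unihan_Readings.txt", "Unihan_Variants.txt"], archive)
    has_missing = zip_contains(["Unihan_IRGSources.txt"], archive)

print(has_both, has_missing)
```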

unihan_etl.core.download(url, dest, urlretrieve_fn=<function urlretrieve>, reporthook=None, cache=True)[source]

Download UNIHAN zip from URL to destination.

Parameters:
  • url (str or pathlib.Path) – URL to download from.

  • dest (pathlib.Path) – file path where download is to be saved.

  • urlretrieve_fn (UrlRetrieveFn) – function to download file

  • reporthook (ReportHookFn, Optional) – Function to write progress bar to stdout buffer.

  • cache (bool) –

Returns:

destination where file downloaded to.

Return type:

pathlib.Path
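The `urlretrieve_fn` parameter makes the download step replaceable, which is handy in tests. The sketch below shows the idea with a hypothetical `fetch` helper and a fake retriever that never touches the network. The meaning given to `cache` here (reuse an existing file) is an assumption, since this page does not describe that parameter.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def fetch(url, dest, urlretrieve_fn=urlretrieve, cache=True) -> Path:
    """Hypothetical helper: download url to dest unless a cached copy exists."""
    dest = Path(dest)
    if cache and dest.exists():
        return dest  # Assumption: cache=True means "reuse an existing file".
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve_fn(url, dest)
    return dest


calls = []


def fake_urlretrieve(url, filename):
    """Test double: record the call and write a placeholder file."""
    calls.append(url)
    Path(filename).write_bytes(b"placeholder")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "downloads" / "Unihan.zip"
    url = "http://www.unicode.org/Public/UNIDATA/Unihan.zip"
    first = fetch(url, target, urlretrieve_fn=fake_urlretrieve)
    second = fetch(url, target, urlretrieve_fn=fake_urlretrieve)  # served from cache
    third = fetch(url, target, urlretrieve_fn=fake_urlretrieve, cache=False)
    download_count = len(calls)
    same_path = first == second == third == target

print(download_count, same_path)
```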

unihan_etl.core.load_data(files)[source]

Load data from the given files as a single combined stream.

Return type:

FileInput[Any]

Parameters:

files (list of str) –

Returns:

combined data from files

Return type:

FileInput[Any]

unihan_etl.core.extract_zip(zip_path, dest_dir)[source]

Extract zip file. Return zipfile.ZipFile instance.

Parameters:
  • zip_path –

  • dest_dir –

Returns:

The extracted zip.

Return type:

zipfile.ZipFile
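Extracting an archive and handing back the `zipfile.ZipFile` object can be sketched with the standard library alone. `unpack` is a hypothetical helper written for this example, and the archive contents are made up.

```python
import tempfile
import zipfile
from pathlib import Path


def unpack(zip_path, dest_dir) -> zipfile.ZipFile:
    """Hypothetical helper: extract everything and return the ZipFile object."""
    archive = zipfile.ZipFile(zip_path)
    archive.extractall(dest_dir)
    return archive


with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "Unihan.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("Unihan_Readings.txt", "U+4E00\tkMandarin\tyī\n")

    dest_dir = Path(tmp) / "extracted"
    archive = unpack(zip_path, dest_dir)
    names = archive.namelist()
    archive.close()  # the caller owns the returned object and must close it
    extracted_text = (dest_dir / "Unihan_Readings.txt").read_text(encoding="utf-8")

print(names)
```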

unihan_etl.core.normalize(raw_data, fields)[source]

Return normalized data from UNIHAN data files.

Return type:

UntypedNormalizedData

Parameters:
  • raw_data (str) – combined text files from UNIHAN

  • fields (list of str) – list of columns to pull

Returns:

list of unihan character information

Return type:

list
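To make the shape of the result concrete, here is a minimal sketch of this kind of normalization. It assumes the tab-separated `UCN<TAB>field<TAB>value` layout of the UNIHAN text files and works on a list of lines; `normalize_lines` is a hypothetical stand-in, and details of the real function (such as how missing fields are represented) may differ.

```python
from typing import Dict, Iterable, List, Sequence


def normalize_lines(lines: Iterable[str], fields: Sequence[str]) -> List[Dict[str, str]]:
    """Hypothetical stand-in: group "UCN<TAB>field<TAB>value" lines per character."""
    items: Dict[str, Dict[str, str]] = {}
    for line in lines:
        if line == "\n" or line.startswith("#"):
            continue  # skip blank lines and comments
        ucn, field, value = line.rstrip("\n").split("\t", 2)
        if field not in fields:
            continue  # only pull the requested columns
        # "U+4E00" -> code point 0x4E00 -> the character itself
        item = items.setdefault(ucn, {"ucn": ucn, "char": chr(int(ucn[2:], 16))})
        item[field] = value
    return list(items.values())


raw = [
    "# sample lines in the UNIHAN text layout\n",
    "U+4E00\tkDefinition\tone; a, an; alone\n",
    "U+4E00\tkMandarin\tyī\n",
    "U+4E00\tkTotalStrokes\t1\n",
]
rows = normalize_lines(raw, fields=["kDefinition", "kMandarin"])
print(rows)
```

The `kTotalStrokes` line is dropped because that field was not requested.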

unihan_etl.core.expand_delimiters(normalized_data)[source]

Return expanded multi-value fields in UNIHAN.

Return type:

ExpandedExport

Parameters:

normalized_data (list of dict) – Expects data in list of hashes, per core.normalize()

Returns:

Items whose fields have delimiters and custom separation rules are expanded, including multi-value fields that do not use both (so all fields stay consistent).

Return type:

list of dict
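The consistency point can be illustrated with a small sketch: a multi-value field becomes a list even when it holds a single value. The rule set below is hypothetical and covers only space-separated values; the real expansion rules are per-field and are not described on this page.

```python
from typing import Dict, List

# Hypothetical rule set: fields whose values are space-separated lists.
SPACE_DELIMITED = {"kCantonese", "kJapaneseOn"}


def expand_item(item: Dict[str, str]) -> Dict[str, object]:
    """Return a copy of item with space-delimited fields split into lists."""
    expanded: Dict[str, object] = dict(item)
    for field, value in item.items():
        if field in SPACE_DELIMITED:
            expanded[field] = value.split(" ")
    return expanded


normalized: List[Dict[str, str]] = [
    {"ucn": "U+4E00", "char": "一", "kCantonese": "jat1", "kJapaneseOn": "ICHI ITSU"},
]
expanded = [expand_item(item) for item in normalized]
print(expanded)
```

Here `kCantonese` has one reading and `kJapaneseOn` has two, yet both come out as lists, so downstream code can treat the field uniformly.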

unihan_etl.core.listify(data, fields)[source]

Convert tabularized data to a CSV-friendly list.

Return type:

ListifiedExport

Parameters:
  • data –

  • fields –

unihan_etl.core.export_csv(data, destination, fields)[source]

Export UNIHAN in flattened, CSV format.

Return type:

None

Parameters:
  • data (UntypedNormalizedData) –

  • destination (StrPath) –

  • fields (ColumnData) –
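A flattened CSV export along these lines can be sketched with the standard `csv` module. This is an illustration under assumptions: `write_csv` is a hypothetical helper that writes to an in-memory stream rather than a destination path, and the rows are sample data.

```python
import csv
import io


def write_csv(rows, stream, fields):
    """Hypothetical helper: write a header plus one row per character."""
    writer = csv.DictWriter(
        stream,
        fieldnames=list(fields),
        extrasaction="ignore",  # silently drop keys that are not requested
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"ucn": "U+4E00", "char": "一", "kMandarin": "yī"},
    {"ucn": "U+4E8C", "char": "二", "kMandarin": "èr"},
]
buffer = io.StringIO()
write_csv(rows, buffer, fields=("ucn", "char", "kMandarin"))
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, open it with `newline=""` and an explicit `encoding="utf-8"` so the CJK characters and line endings survive unchanged.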

unihan_etl.core.export_json(data, destination)[source]

Export UNIHAN in JSON format.

Return type:

None

Parameters:
  • data (UntypedNormalizedData) –

  • destination (StrPath) –
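When serializing CJK data to JSON with the standard library, the `ensure_ascii` flag decides whether characters are written as-is or as `\uXXXX` escapes. The sketch below only demonstrates that `json` behaviour on sample data; this page does not say which setting `export_json` uses.

```python
import json

rows = [{"ucn": "U+4E00", "char": "一", "kMandarin": ["yī"]}]

# Default: non-ASCII characters are escaped.
escaped = json.dumps(rows)

# ensure_ascii=False keeps the characters readable in the output file.
readable = json.dumps(rows, ensure_ascii=False, indent=2)

# Both forms decode to the same data.
round_trip = json.loads(readable)
print(readable)
```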

unihan_etl.core.export_yaml(data, destination)[source]

Export UNIHAN in YAML format.

Return type:

None

Parameters:
  • data (UntypedNormalizedData) –

  • destination (StrPath) –

unihan_etl.core.is_default_option(field_name, val)[source]

Return True if option is a unihan-etl default.

Return type:

bool

Parameters:
  • field_name (str) –

  • val (Any) –

unihan_etl.core.validate_options(options)[source]

Validate unihan-etl options.

Return type:

TypeGuard[Options]

Parameters:

options (Options) –

class unihan_etl.core.Packager(options=Options(source='http://www.unicode.org/Public/UNIDATA/Unihan.zip', destination=PosixPath('/home/runner/.local/share/unihan_etl/unihan.csv'), zip_path=PosixPath('/home/runner/.cache/unihan_etl/downloads/Unihan.zip'), work_dir=PosixPath('/home/runner/.cache/unihan_etl/downloads'), fields=('ucn', 'char', 'kAccountingNumeric', 'kAlternateTotalStrokes', 'kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kCangjie', 'kCantonese', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCompatibilityVariant', 'kCowles', 'kDaeJaweon', 'kDefinition', 'kEACC', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kFrequency', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kGSR', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kHanYu', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_SSource', 'kIRG_TSource', 'kIRG_UKSource', 'kIRG_USource', 'kIRG_VSource', 'kJIS0213', 'kJa', 'kJapanese', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kLau', 'kMainlandTelegraph', 'kMandarin', 'kMatthews', 'kMeyerWempe', 'kMojiJoho', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kPseudoGB1', 'kRSAdobe_Japan1_6', 'kRSUnicode', 'kSBGY', 'kSMSZD2003Index', 'kSMSZD2003Readings', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kSpoofingVariant', 'kStrange', 'kTGH', 'kTGHZ2013', 'kTaiwanTelegraph', 'kTang', 'kTotalStrokes', 'kTraditionalVariant', 'kUnihanCore2020', 'kVietnamese', 'kVietnameseNumeric', 'kXHC1983', 'kXerox', 'kZVariant', 'kZhuangNumeric'), format='csv', input_files=['Unihan_DictionaryIndices.txt', 'Unihan_DictionaryLikeData.txt', 'Unihan_IRGSources.txt', 'Unihan_NumericValues.txt', 'Unihan_OtherMappings.txt', 
'Unihan_RadicalStrokeCounts.txt', 'Unihan_Readings.txt', 'Unihan_Variants.txt'], download=False, expand=True, prune_empty=True, cache=True, log_level='INFO'))[source]

Bases: object

Download, ETL, and customize an export of UNIHAN.

UNIHAN Documentation: http://www.unicode.org/reports/tr38/

__init__(options=Options(source='http://www.unicode.org/Public/UNIDATA/Unihan.zip', destination=PosixPath('/home/runner/.local/share/unihan_etl/unihan.csv'), zip_path=PosixPath('/home/runner/.cache/unihan_etl/downloads/Unihan.zip'), work_dir=PosixPath('/home/runner/.cache/unihan_etl/downloads'), fields=('ucn', 'char', 'kAccountingNumeric', 'kAlternateTotalStrokes', 'kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kCangjie', 'kCantonese', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCompatibilityVariant', 'kCowles', 'kDaeJaweon', 'kDefinition', 'kEACC', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kFrequency', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kGSR', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kHanYu', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_SSource', 'kIRG_TSource', 'kIRG_UKSource', 'kIRG_USource', 'kIRG_VSource', 'kJIS0213', 'kJa', 'kJapanese', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kLau', 'kMainlandTelegraph', 'kMandarin', 'kMatthews', 'kMeyerWempe', 'kMojiJoho', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kPseudoGB1', 'kRSAdobe_Japan1_6', 'kRSUnicode', 'kSBGY', 'kSMSZD2003Index', 'kSMSZD2003Readings', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kSpoofingVariant', 'kStrange', 'kTGH', 'kTGHZ2013', 'kTaiwanTelegraph', 'kTang', 'kTotalStrokes', 'kTraditionalVariant', 'kUnihanCore2020', 'kVietnamese', 'kVietnameseNumeric', 'kXHC1983', 'kXerox', 'kZVariant', 'kZhuangNumeric'), format='csv', input_files=['Unihan_DictionaryIndices.txt', 'Unihan_DictionaryLikeData.txt', 'Unihan_IRGSources.txt', 'Unihan_NumericValues.txt', 'Unihan_OtherMappings.txt', 'Unihan_RadicalStrokeCounts.txt', 
'Unihan_Readings.txt', 'Unihan_Variants.txt'], download=False, expand=True, prune_empty=True, cache=True, log_level='INFO'))[source]

Initialize UNIHAN Packager.

Parameters:

options (dict or Options) – options values to override defaults.

Return type:

None
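Overriding defaults with a dict can be pictured with the sketch below. `SketchOptions` is a hypothetical, trimmed-down stand-in that borrows four default values from the signature above, and `merge_options` is a made-up helper; the library's own merging logic is not shown on this page.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchOptions:
    """Hypothetical stand-in carrying a few of the defaults shown above."""

    format: str = "csv"
    expand: bool = True
    prune_empty: bool = True
    cache: bool = True


def merge_options(overrides=None, defaults=SketchOptions()):
    """Accept either an options object or a dict of values to override."""
    if isinstance(overrides, SketchOptions):
        return overrides
    # replace() raises TypeError for unknown keys, which catches typos early.
    return replace(defaults, **(overrides or {}))


merged = merge_options({"format": "json", "expand": False})
print(merged)
```

Values that are not overridden keep their defaults, so callers only need to pass what they want to change.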

options: Options

download(urlretrieve_fn=<function urlretrieve>)[source]

Download raw UNIHAN data if it does not already exist.

Return type:

None

Parameters:

urlretrieve_fn (function) – function to download file

export()[source]

Extract zip and process information into CSVs.

Return type:

Optional[UntypedNormalizedData]

classmethod from_cli(argv)[source]

Create Packager instance from CLI argparse arguments.

Parameters:

argv (list) – Arguments passed in via CLI.

Returns:

Packager instance built from the CLI arguments.

Return type:

Packager
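The general pattern of turning an `argv` list into option overrides can be sketched with `argparse`. Everything here is illustrative: the flag names are hypothetical rather than the real CLI's, and the result is a plain dict, not a Packager.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the flag names are illustrative, not the real CLI's."""
    parser = argparse.ArgumentParser(prog="sketch")
    parser.add_argument("--format", choices=["csv", "json", "yaml"])
    parser.add_argument("--destination")
    # default=None lets us tell "flag absent" apart from "flag given".
    parser.add_argument("--no-expand", dest="expand", action="store_false", default=None)
    return parser


def options_from_argv(argv):
    """Keep only the options the user actually passed on the command line."""
    args = build_parser().parse_args(argv)
    return {key: value for key, value in vars(args).items() if value is not None}


cli_options = options_from_argv(["--format", "json", "--no-expand"])
no_options = options_from_argv([])
print(cli_options)
```

Dropping unset values means the resulting dict contains overrides only, so defaults stay in one place.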

unihan_etl.core.setup_logger(logger=None, level='DEBUG')[source]

Configure logger for CLI use.

Return type:

None

Parameters:
  • logger (Logger) – instance of logger

  • level (str) – logging level, e.g. ‘DEBUG’
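A typical way to configure a logger for command-line use is sketched below with the standard `logging` module. `configure_logger` is a hypothetical stand-in; the handler and format string are assumptions, not the library's actual settings.

```python
import logging


def configure_logger(logger=None, level="DEBUG"):
    """Hypothetical stand-in: attach one stream handler and set the level."""
    if logger is None:
        logger = logging.getLogger()
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)  # setLevel accepts level names such as "DEBUG"


sketch_logger = logging.getLogger("unihan_etl_sketch")
configure_logger(sketch_logger, level="INFO")
configure_logger(sketch_logger, level="INFO")  # second call adds no handler
sketch_logger.info("logger configured")
```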