Core - unihan_etl.core
Download + ETL UNIHAN into structured format and export it.
- unihan_etl.core.filter_manifest(files)
Return filtered UNIHAN_MANIFEST from list of file names.
- exception unihan_etl.core.FieldNotFound(field)
Bases: Exception
Raised if a field is not found in the file list.
- Parameters:
field (str) –
- Return type:
None
- exception unihan_etl.core.FileNotSupported(field)
Bases: Exception
Raised if the requested field is not included in the current file list.
- Parameters:
field (str) –
- Return type:
None
- unihan_etl.core.get_files(fields)
Return list of files required by fields. Simple dependency resolver.
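The dependency resolution described for get_files can be pictured as a reverse lookup against a manifest that maps each file to the fields it provides. The following is only a sketch under assumptions: EXAMPLE_MANIFEST is a hypothetical, trimmed-down stand-in for UNIHAN_MANIFEST, and resolve_files is an illustrative helper name, not the library's implementation.

```python
# Hypothetical, trimmed-down manifest: file name -> fields found in that file.
# The real UNIHAN_MANIFEST is larger; these entries are for illustration only.
EXAMPLE_MANIFEST = {
    "Unihan_Readings.txt": ("kCantonese", "kDefinition", "kMandarin"),
    "Unihan_NumericValues.txt": ("kAccountingNumeric", "kOtherNumeric"),
}


def resolve_files(fields, manifest=EXAMPLE_MANIFEST):
    """Return the sorted list of files needed to cover ``fields``."""
    # Invert the manifest so each field points at the file that contains it.
    field_to_file = {
        field: file_name
        for file_name, file_fields in manifest.items()
        for field in file_fields
    }
    missing = [f for f in fields if f not in field_to_file]
    if missing:
        # The library documents a FieldNotFound exception for a similar
        # situation; KeyError stands in for it in this sketch.
        raise KeyError(f"Field(s) not found in manifest: {missing}")
    # A set removes duplicates when several fields live in the same file.
    return sorted({field_to_file[f] for f in fields})


print(resolve_files(["kDefinition", "kMandarin"]))
```

Two fields from the same file resolve to a single entry, which is the point of resolving dependencies before downloading or parsing anything.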
- unihan_etl.core.get_parser()
Return argparse.ArgumentParser instance for CLI.
- Returns:
argument parser for CLI use.
- Return type:
argparse.ArgumentParser
- unihan_etl.core.has_valid_zip(zip_path)
Return True if valid zip exists.
- Parameters:
zip_path (str or pathlib.Path) – absolute path to zip
- Returns:
True if valid zip exists at path
- Return type:
bool
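A check of this kind can be written with the standard library alone. The sketch below assumes that "valid" means the path exists and is recognized as a zip archive; looks_like_valid_zip is an illustrative name and may not match how has_valid_zip is implemented.

```python
import pathlib
import zipfile


def looks_like_valid_zip(zip_path):
    """Return True if ``zip_path`` exists and is recognized as a zip archive."""
    path = pathlib.Path(zip_path)
    if not path.is_file():
        return False
    # is_zipfile() inspects the file contents rather than the file extension.
    return zipfile.is_zipfile(path)
```

Checking existence first keeps the function from raising on a missing path, so callers can use the result directly to decide whether a download is needed.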
- unihan_etl.core.zip_has_files(files, zip_file)
Return True if zip has the files inside.
- Parameters:
zip_file (zipfile.ZipFile) –
- Returns:
True if files are inside of zipfile.ZipFile.namelist()
- Return type:
bool
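Since the return value is described in terms of zipfile.ZipFile.namelist(), the check amounts to a subset test. A minimal sketch, with archive_contains as an illustrative name and an in-memory archive built only for the demonstration:

```python
import io
import zipfile


def archive_contains(files, zip_file):
    """Return True if every name in ``files`` is a member of ``zip_file``."""
    return set(files).issubset(zip_file.namelist())


# Build a small in-memory archive to try it out.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Unihan_Readings.txt", "# placeholder\n")

with zipfile.ZipFile(buffer) as zf:
    print(archive_contains(["Unihan_Readings.txt"], zf))
    print(archive_contains(["Unihan_Variants.txt"], zf))
```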
- unihan_etl.core.download(url, dest, urlretrieve_fn=<function urlretrieve>, reporthook=None, cache=True)
Download UNIHAN zip from URL to destination.
- Parameters:
url (str or pathlib.Path) – URL to download from.
dest (pathlib.Path) – file path where download is to be saved.
urlretrieve_fn (UrlRetrieveFn) – function to download file
reporthook (ReportHookFn, Optional) – Function to write progress bar to stdout buffer.
cache (bool) –
- Returns:
destination where the file was downloaded to.
- Return type:
Path
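The injectable urlretrieve_fn makes the download step easy to stub out in tests. The sketch below mirrors the documented parameters but is not the library's code: fetch is an illustrative name, and the cache behaviour (reuse an existing file when cache is true) is an assumption, since the cache parameter is not described above.

```python
import pathlib
import urllib.request


def fetch(url, dest, urlretrieve_fn=urllib.request.urlretrieve,
          reporthook=None, cache=True):
    """Download ``url`` to ``dest`` unless a cached copy can be reused."""
    dest = pathlib.Path(dest)
    if cache and dest.is_file():
        # Assumed cache behaviour: an existing file is reused as-is.
        return dest
    # Make sure the target directory exists before writing into it.
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve_fn(url, str(dest), reporthook)
    return dest
```

Passing a fake urlretrieve_fn that writes a small file lets the surrounding logic be exercised without network access.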
- unihan_etl.core.extract_zip(zip_path, dest_dir)
Extract zip file. Return zipfile.ZipFile instance.
- Parameters:
zip_path (pathlib.Path) – filepath to extract.
dest_dir (pathlib.Path) – directory to extract to.
- Returns:
The extracted zip.
- Return type:
zipfile.ZipFile
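Extraction followed by returning the archive object can be sketched with zipfile directly. This is an illustration under assumptions, not the library's implementation; unpack is a hypothetical name, and the returned archive is left open for the caller to inspect and close.

```python
import pathlib
import zipfile


def unpack(zip_path, dest_dir):
    """Extract every member of ``zip_path`` into ``dest_dir``."""
    archive = zipfile.ZipFile(pathlib.Path(zip_path))
    # extractall() creates dest_dir and any nested directories as needed.
    archive.extractall(pathlib.Path(dest_dir))
    # The open archive is returned so callers can inspect namelist();
    # closing it is the caller's responsibility.
    return archive
```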
- unihan_etl.core.normalize(raw_data, fields)
Return normalized data from UNIHAN data files.
- unihan_etl.core.expand_delimiters(normalized_data)
Return expanded multi-value fields in UNIHAN.
- Parameters:
normalized_data (list of dict) – Expects data in list of hashes, per core.normalize()
- Returns:
Items that have fields with delimiters and custom separation rules are expanded, including multi-value fields that do not use both, so that all fields stay consistent.
- Return type:
ExpandedExport
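To make the idea of expansion concrete, here is a deliberately simplified sketch that handles only space-delimited fields. The field set SPACE_DELIMITED, the helper name split_multi_value, and the sample row are all assumptions for illustration; the library's own rules are per-field and more involved.

```python
# Assumption: these fields hold several space-separated values per character.
SPACE_DELIMITED = {"kCantonese", "kMandarin"}


def split_multi_value(rows, multi_fields=SPACE_DELIMITED):
    """Return copies of ``rows`` with multi-value fields split into lists."""
    expanded = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        for field in multi_fields:
            if field in new_row:
                # Always produce a list, even for a single value, so the
                # field has the same type in every row.
                new_row[field] = new_row[field].split(" ")
        expanded.append(new_row)
    return expanded


# Illustrative row; the values are placeholders, not checked against UNIHAN.
rows = [{"char": "好", "kCantonese": "hou2 hou3", "kMandarin": "hǎo"}]
print(split_multi_value(rows))
```

Turning a single value into a one-element list is what keeps every row consistent: downstream code can iterate over the field without checking its type first.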
- unihan_etl.core.export_csv(data, destination, fields)
Export UNIHAN in flattened CSV format.
- Parameters:
data (UntypedNormalizedData) –
destination (StrPath) –
fields (ColumnData) –
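A flattened CSV export of a list of dicts can be sketched with csv.DictWriter. Note the assumptions: write_csv is an illustrative name, and it writes to an open text stream rather than to a destination path as the documented function does.

```python
import csv
import io


def write_csv(rows, stream, fields):
    """Write ``rows`` (a list of dicts) as CSV with ``fields`` as the header."""
    # extrasaction="ignore" drops keys that are not among the chosen fields.
    writer = csv.DictWriter(stream, fields, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        # Fields missing from a row are written as empty cells (restval="").
        writer.writerow(row)


out = io.StringIO()
write_csv(
    [{"char": "一", "kTotalStrokes": "1"}],
    out,
    ["char", "kTotalStrokes", "kDefinition"],
)
print(out.getvalue())
```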
- unihan_etl.core.export_json(data, destination)
Export UNIHAN in JSON format.
- Parameters:
data (UntypedNormalizedData) –
destination (StrPath) –
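For the JSON case, the main practical concern with CJK data is encoding. The sketch below is an assumption-laden illustration (write_json is a hypothetical name, and the indent and ensure_ascii settings are choices made here, not documented behaviour of export_json).

```python
import json
import pathlib


def write_json(rows, destination):
    """Serialize ``rows`` to ``destination`` as UTF-8 JSON."""
    path = pathlib.Path(destination)
    with path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps CJK characters readable instead of \uXXXX.
        json.dump(rows, handle, ensure_ascii=False, indent=2)
```

Opening the file with an explicit encoding avoids depending on the platform default, which matters when the data is mostly non-ASCII.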
- unihan_etl.core.export_yaml(data, destination)
Export UNIHAN in YAML format.
- Parameters:
data (UntypedNormalizedData) –
destination (StrPath) –
- unihan_etl.core.is_default_option(field_name, val)
Return True if option is a unihan-etl default.
- unihan_etl.core.validate_options(options)
Validate unihan-etl options.
- Return type:
TypeGuard[Options]
- Parameters:
options (Options) –
- class unihan_etl.core.Packager(options=Options(
source='http://www.unicode.org/Public/UNIDATA/Unihan.zip',
destination=PosixPath('/home/runner/.local/share/unihan_etl/unihan.csv'),
zip_path=PosixPath('/home/runner/.cache/unihan_etl/downloads/Unihan.zip'),
work_dir=PosixPath('/home/runner/.cache/unihan_etl/downloads'),
fields=('ucn', 'char', 'kAccountingNumeric', 'kAlternateTotalStrokes', 'kBigFive', 'kCCCII', 'kCNS1986', 'kCNS1992', 'kCangjie', 'kCantonese', 'kCheungBauer', 'kCheungBauerIndex', 'kCihaiT', 'kCompatibilityVariant', 'kCowles', 'kDaeJaweon', 'kDefinition', 'kEACC', 'kFenn', 'kFennIndex', 'kFourCornerCode', 'kFrequency', 'kGB0', 'kGB1', 'kGB3', 'kGB5', 'kGB7', 'kGB8', 'kGSR', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kHanYu', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kIBMJapan', 'kIICore', 'kIRGDaeJaweon', 'kIRGHanyuDaZidian', 'kIRGKangXi', 'kIRG_GSource', 'kIRG_HSource', 'kIRG_JSource', 'kIRG_KPSource', 'kIRG_KSource', 'kIRG_MSource', 'kIRG_SSource', 'kIRG_TSource', 'kIRG_UKSource', 'kIRG_USource', 'kIRG_VSource', 'kJIS0213', 'kJa', 'kJapanese', 'kJapaneseKun', 'kJapaneseOn', 'kJinmeiyoKanji', 'kJis0', 'kJis1', 'kJoyoKanji', 'kKangXi', 'kKarlgren', 'kKorean', 'kKoreanEducationHanja', 'kKoreanName', 'kLau', 'kMainlandTelegraph', 'kMandarin', 'kMatthews', 'kMeyerWempe', 'kMojiJoho', 'kMorohashi', 'kNelson', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kPseudoGB1', 'kRSAdobe_Japan1_6', 'kRSUnicode', 'kSBGY', 'kSMSZD2003Index', 'kSMSZD2003Readings', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kSpoofingVariant', 'kStrange', 'kTGH', 'kTGHZ2013', 'kTaiwanTelegraph', 'kTang', 'kTotalStrokes', 'kTraditionalVariant', 'kUnihanCore2020', 'kVietnamese', 'kVietnameseNumeric', 'kXHC1983', 'kXerox', 'kZVariant', 'kZhuangNumeric'),
format='csv',
input_files=['Unihan_DictionaryIndices.txt', 'Unihan_DictionaryLikeData.txt', 'Unihan_IRGSources.txt', 'Unihan_NumericValues.txt', 'Unihan_OtherMappings.txt', 'Unihan_RadicalStrokeCounts.txt', 'Unihan_Readings.txt', 'Unihan_Variants.txt'],
download=False,
expand=True,
prune_empty=True,
cache=True,
log_level='INFO'))
Bases: object
Download, ETL, and customize an export of UNIHAN.
UNIHAN Documentation: http://www.unicode.org/reports/tr38/
- __init__(options=Options(...))
Initialize UNIHAN Packager. The default options are identical to those listed in the class signature above.
- download(urlretrieve_fn=<function urlretrieve>)
Download raw UNIHAN data if it does not already exist.
- Parameters:
urlretrieve_fn (function) – function to download file