Core - unihan_etl.core¶
Download + ETL UNIHAN into structured format and export it.
-
unihan_etl.core.not_junk(line)¶
Return False on newlines and C-style comments.
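For illustration, a line filter of this kind might be sketched as below. The name `not_junk_sketch` is hypothetical, and the sketch assumes comment lines begin with `#`, as they do in Unihan data files. It is not the library's actual implementation.

```python
def not_junk_sketch(line: str) -> bool:
    """Return False for bare newlines and comment lines (assumed to start with '#')."""
    return line != "\n" and not line.startswith("#")


# Illustrative input resembling the layout of a Unihan data file.
sample_lines = [
    "# Unihan comment line\n",
    "\n",
    "U+3400\tkCantonese\tjau1\n",
]
kept = [line for line in sample_lines if not_junk_sketch(line)]
```

Only the tab-separated data line survives the filter.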
-
unihan_etl.core.in_fields(c, fields)¶
Return True if string is in the default fields.
-
unihan_etl.core.filter_manifest(files)¶
Return filtered UNIHAN_MANIFEST from list of file names.
-
unihan_etl.core.files_exist(path, files)¶
Return True if all files exist in specified path.
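A check like this can be sketched with `pathlib`. The name `files_exist_sketch` is hypothetical and the body is an assumption about the behavior described above, not the library's code.

```python
import pathlib
import tempfile


def files_exist_sketch(path, files):
    """Return True only if every name in `files` is a file under `path`."""
    base = pathlib.Path(path)
    return all((base / name).is_file() for name in files)


with tempfile.TemporaryDirectory() as tmp:
    (pathlib.Path(tmp) / "a.txt").write_text("x")
    present = files_exist_sketch(tmp, ["a.txt"])
    missing = files_exist_sketch(tmp, ["a.txt", "b.txt"])
```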
-
exception unihan_etl.core.FileNotSupported¶
Bases: Exception
Raise if field requested is not included in current file list.
-
unihan_etl.core.get_files(fields)¶
Return list of files required by fields. Simple dependency resolver.
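The idea of resolving fields to files can be sketched as follows. `MANIFEST_SKETCH` is a small hand-written mapping standing in for the real manifest, and `get_files_sketch` and `FileNotSupportedSketch` are hypothetical names. Raising on an unknown field is an assumption of this sketch.

```python
class FileNotSupportedSketch(Exception):
    """Stand-in for an error raised when no known file provides a field."""


# Hand-written excerpt for illustration; not the library's UNIHAN_MANIFEST.
MANIFEST_SKETCH = {
    "Unihan_Readings.txt": ["kCantonese", "kDefinition", "kMandarin"],
    "Unihan_DictionaryIndices.txt": ["kHanYu"],
}


def get_files_sketch(fields):
    """Return the sorted list of files needed to provide every requested field."""
    files = set()
    for field in fields:
        for filename, file_fields in MANIFEST_SKETCH.items():
            if field in file_fields:
                files.add(filename)
                break
        else:
            # No file in the mapping lists this field.
            raise FileNotSupportedSketch(f"No file provides field {field!r}")
    return sorted(files)


needed = get_files_sketch(["kDefinition", "kHanYu", "kMandarin"])
```

Two fields from the same file resolve to a single entry, so each file appears only once in the result.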
-
unihan_etl.core.get_parser()¶
Return argparse.ArgumentParser instance for CLI.
- Returns:
argument parser for CLI use.
- Return type:
argparse.ArgumentParser
-
unihan_etl.core.has_valid_zip(zip_path)¶
Return True if valid zip exists.
- Parameters:
zip_path (str or pathlib.Path) – absolute path to zip
- Returns:
True if valid zip exists at path
- Return type:
bool
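One way such a check could be written uses the standard library's `zipfile.is_zipfile`. The name `has_valid_zip_sketch` is hypothetical; the library may validate the archive differently.

```python
import pathlib
import tempfile
import zipfile


def has_valid_zip_sketch(zip_path):
    """Return True if `zip_path` exists and looks like a zip archive."""
    path = pathlib.Path(zip_path)
    return path.is_file() and zipfile.is_zipfile(str(path))


with tempfile.TemporaryDirectory() as tmp:
    good = pathlib.Path(tmp) / "good.zip"
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("hello.txt", "hi")
    bad = pathlib.Path(tmp) / "bad.zip"
    bad.write_text("not a zip")
    results = (
        has_valid_zip_sketch(good),
        has_valid_zip_sketch(bad),
        has_valid_zip_sketch(pathlib.Path(tmp) / "missing.zip"),
    )
```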
-
unihan_etl.core.zip_has_files(files, zip_file)¶
Return True if zip has the files inside.
- Parameters:
files (list of str) – files inside zip file
zip_file (zipfile.ZipFile)
- Returns:
True if all files are in zipfile.ZipFile.namelist()
- Return type:
bool
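A membership test against `zipfile.ZipFile.namelist()` can be sketched as below. `zip_has_files_sketch` is a hypothetical name, and the in-memory archive exists only for the demonstration.

```python
import io
import zipfile


def zip_has_files_sketch(files, zip_file):
    """Return True if every name in `files` appears in the archive's namelist()."""
    return set(files).issubset(zip_file.namelist())


# Build a small archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Unihan_Readings.txt", "")
    zf.writestr("Unihan_Variants.txt", "")

with zipfile.ZipFile(buffer) as zf:
    has_one = zip_has_files_sketch(["Unihan_Readings.txt"], zf)
    has_missing = zip_has_files_sketch(["Unihan_Readings.txt", "nope.txt"], zf)
```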
-
unihan_etl.core.download(url, dest, urlretrieve_fn=<function urlretrieve>, reporthook=None, cache=True)¶
Download UNIHAN zip from URL to destination.
- Parameters:
url (str or pathlib.Path) – URL to download from.
dest (pathlib.Path) – file path where download is to be saved.
urlretrieve_fn (UrlRetrieveFn) – function to download file
reporthook (ReportHookFn, Optional) – Function to write progress bar to stdout buffer.
cache (bool)
- Returns:
destination where file downloaded to.
- Return type:
pathlib.Path
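The injectable `urlretrieve_fn` parameter makes a function like this testable without network access. The sketch below mirrors the documented signature, but `download_sketch` and `fake_urlretrieve` are hypothetical, the URL is a placeholder, and treating an existing file as a cache hit is an assumption about what `cache=True` means.

```python
import pathlib
import tempfile
from urllib.request import urlretrieve


def download_sketch(url, dest, urlretrieve_fn=urlretrieve, reporthook=None, cache=True):
    """Fetch `url` into `dest`, skipping the fetch when a cached copy exists."""
    dest = pathlib.Path(dest)
    if cache and dest.is_file():
        return dest  # assumption: an existing file counts as a cache hit
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve_fn(url, dest, reporthook)
    return dest


calls = []


def fake_urlretrieve(url, filename, reporthook=None):
    """Test double: record the call and write placeholder bytes."""
    calls.append(url)
    pathlib.Path(filename).write_bytes(b"placeholder bytes")


with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / "downloads" / "Unihan.zip"
    first = download_sketch("https://example.invalid/Unihan.zip", target, fake_urlretrieve)
    second = download_sketch("https://example.invalid/Unihan.zip", target, fake_urlretrieve)
    downloaded_ok = target.is_file()
```

The second call returns without invoking the download function because the destination file already exists.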
-
unihan_etl.core.load_data(files)¶
Extract zip and process information into CSVs.
- Parameters:
files (list of str)
- Returns:
combined data from files
- Return type:
-
unihan_etl.core.extract_zip(zip_path, dest_dir)¶
Extract zip file. Return zipfile.ZipFile instance.
- Parameters:
zip_path (pathlib.Path) – filepath to extract.
dest_dir (pathlib.Path) – directory to extract to.
- Returns:
The extracted zip.
- Return type:
zipfile.ZipFile
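Extraction with the standard library can be sketched as follows. `extract_zip_sketch` is a hypothetical name; the sketch returns the open `zipfile.ZipFile`, so the caller is responsible for closing it.

```python
import pathlib
import tempfile
import zipfile


def extract_zip_sketch(zip_path, dest_dir):
    """Extract every member of `zip_path` into `dest_dir` and return the ZipFile."""
    archive = zipfile.ZipFile(zip_path)
    archive.extractall(dest_dir)
    return archive


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = pathlib.Path(tmp)
    zip_path = tmp_path / "sample.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("data.txt", "hello")
    out_dir = tmp_path / "out"
    archive = extract_zip_sketch(zip_path, out_dir)
    names = archive.namelist()
    archive.close()
    extracted_text = (out_dir / "data.txt").read_text()
```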
-
unihan_etl.core.normalize(raw_data, fields)¶
Return normalized data from UNIHAN data files.
-
unihan_etl.core.expand_delimiters(normalized_data)¶
Return expanded multi-value fields in UNIHAN.
- Parameters:
normalized_data (list of dict) – Expects data in list of hashes, per core.normalize()
- Returns:
Items which have fields with delimiters and custom separation rules will be expanded, including multi-value fields not using both (so all fields stay consistent).
- Return type:
list of dict
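As a rough illustration of expanding multi-value fields, the sketch below splits space-delimited strings into lists. `expand_delimiters_sketch` and `SPACE_DELIMITED_SKETCH` are hypothetical, the choice of fields is an assumption, and the readings are placeholder values. The library's per-field separation rules are not reproduced here.

```python
# Hypothetical choice of fields to split; the real set of multi-value fields may differ.
SPACE_DELIMITED_SKETCH = {"kCantonese", "kMandarin"}


def expand_delimiters_sketch(normalized_data):
    """Split space-delimited string values into lists for the chosen fields."""
    expanded = []
    for row in normalized_data:
        new_row = dict(row)  # copy so the input rows are left untouched
        for field in SPACE_DELIMITED_SKETCH:
            value = new_row.get(field)
            if isinstance(value, str):
                new_row[field] = value.split(" ")
        expanded.append(new_row)
    return expanded


rows = [{"char": "X", "kCantonese": "reading1 reading2", "kMandarin": "reading3"}]
result = expand_delimiters_sketch(rows)
```

A field holding a single value still becomes a one-element list, which keeps every multi-value field the same shape.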
-
unihan_etl.core.listify(data, fields)¶
Convert tabularized data to a CSV-friendly list.
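One plausible reading of "CSV-friendly list" is a header row followed by one row per record. The sketch below follows that assumption; `listify_sketch` is a hypothetical name and the library's actual output shape may differ.

```python
def listify_sketch(data, fields):
    """Return a header row followed by one row per record, in `fields` order."""
    rows = [list(fields)]
    for record in data:
        # Missing fields become empty strings so every row has the same width.
        rows.append([record.get(field, "") for field in fields])
    return rows


records = [{"char": "㐀", "ucn": "U+3400"}, {"char": "㐁"}]
table = listify_sketch(records, ["char", "ucn"])
```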
-
unihan_etl.core.export_csv(data, destination, fields)¶
Export UNIHAN in flattened, CSV format.
- Parameters:
data (UntypedNormalizedData)
destination (StrPath)
fields (ColumnData)
- Return type:
None
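A flattened CSV export can be sketched with the standard `csv` module. `export_csv_sketch` is a hypothetical name, and using `csv.DictWriter` is a choice made for this sketch rather than a statement about the library's implementation.

```python
import csv
import pathlib
import tempfile


def export_csv_sketch(data, destination, fields):
    """Write `data` (a list of dicts) to `destination` with `fields` as columns."""
    with open(destination, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(data)


with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / "unihan.csv"
    export_csv_sketch([{"char": "㐀", "ucn": "U+3400"}], out, ["char", "ucn"])
    with open(out, newline="", encoding="utf-8") as f:
        written = list(csv.reader(f))
```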
-
unihan_etl.core.export_json(data, destination)¶
Export UNIHAN in JSON format.
- Parameters:
data (UntypedNormalizedData)
destination (StrPath)
- Return type:
None
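A JSON export can be sketched with the standard `json` module. `export_json_sketch` is a hypothetical name; passing `ensure_ascii=False` is an assumption made here so that CJK characters stay readable in the output file.

```python
import json
import pathlib
import tempfile


def export_json_sketch(data, destination):
    """Write `data` to `destination` as UTF-8 JSON without escaping CJK characters."""
    with open(destination, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / "unihan.json"
    records = [{"char": "㐀", "ucn": "U+3400"}]
    export_json_sketch(records, out)
    raw_text = out.read_text(encoding="utf-8")
    round_tripped = json.loads(raw_text)
```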
-
unihan_etl.core.export_yaml(data, destination)¶
Export UNIHAN in YAML format.
- Parameters:
data (UntypedNormalizedData)
destination (StrPath)
- Return type:
None
-
unihan_etl.core.is_default_option(field_name, val)¶
Return True if option is a unihan-etl default.
-
unihan_etl.core.validate_options(options)¶
Validate unihan-etl options.
-
class unihan_etl.core.Packager¶
Bases: object
Download, ETL, and customize an export of UNIHAN.
UNIHAN Documentation: http://www.unicode.org/reports/tr38/
-
unihan_etl.core.setup_logger(logger=None, level='DEBUG')¶
Configure logger for CLI use.
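Logger setup for a CLI might look like the sketch below. `setup_logger_sketch` is a hypothetical name, the format string is an arbitrary choice, and returning the logger is a convenience of this sketch that the documented signature does not promise.

```python
import logging


def setup_logger_sketch(logger=None, level="DEBUG"):
    """Attach a stream handler once and set the level on `logger`."""
    if logger is None:
        logger = logging.getLogger()
    if not logger.handlers:
        # Guard against adding duplicate handlers on repeated calls.
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


log = setup_logger_sketch(logging.getLogger("unihan_etl_sketch"))
log = setup_logger_sketch(log, level="INFO")
```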