Utilities - unihan_etl.util

Utilities for parsing UNIHAN’s data and structures.

unihan_etl.util.ucn_to_unicode(ucn)

Return a Python Unicode value from a UCN.

Converts a Unicode Universal Character Number (e.g. "U+4E00" or "4E00") to a Python Unicode string (u'\u4e00').

Return type:

str

Parameters:

ucn (str) –

>>> ucn_to_unicode("U+4E00")
'一'
>>> ucn_to_unicode("4E00")
'一'
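
The conversion described above can be sketched with built-ins alone. This is a minimal illustration, not the library's implementation; the name ucn_to_char is hypothetical, and it assumes the input is either bare hexadecimal digits or digits prefixed with "U+".

```python
def ucn_to_char(ucn: str) -> str:
    """Hypothetical helper: turn 'U+4E00' or '4E00' into the character it names."""
    # Drop an optional "U+" prefix so only the hexadecimal digits remain.
    hex_digits = ucn[2:] if ucn.upper().startswith("U+") else ucn
    # Parse the digits as a base-16 code point, then map it to a one-character str.
    return chr(int(hex_digits, 16))


print(ucn_to_char("U+4E00"))
print(ucn_to_char("4E00"))
```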

unihan_etl.util.ucnstring_to_python(ucn_string)

Return a Unicode UCN (e.g. "U+4E00") as UTF-8 encoded bytes (b'\xe4\xb8\x80') rather than as a native Python str (u'\u4e00').

Return type:

bytes

Parameters:

ucn_string (str) –

>>> ucnstring_to_python("U+4E00")
b'\xe4\xb8\x80'
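
One way to get from a UCN string to the bytes shown above is to substitute each "U+XXXX" token with its character and then encode the result. The sketch below is an assumption about how such a conversion could work, not the library's code; UCN_PATTERN and ucn_string_to_utf8 are hypothetical names, and the pattern assumes each UCN has 4 to 6 hex digits and is not immediately followed by further hex digits.

```python
import re

# Assumed token shape: "U+" followed by 4 to 6 hexadecimal digits.
UCN_PATTERN = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def ucn_string_to_utf8(text: str) -> bytes:
    """Hypothetical helper: replace every UCN token in text, then encode as UTF-8."""

    def _to_char(match: "re.Match[str]") -> str:
        # The captured group holds the hex digits of the code point.
        return chr(int(match.group(1), 16))

    return UCN_PATTERN.sub(_to_char, text).encode("utf-8")


print(ucn_string_to_utf8("U+4E00"))
```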

unihan_etl.util.ucnstring_to_unicode(ucn_string)

Return ucnstring as Unicode.

Return type:

str

Parameters:

ucn_string (str) –

>>> ucnstring_to_unicode('U+4E00')
'一'
>>> ucnstring_to_unicode('U+4E01')
'丁'
>>> ucnstring_to_unicode('U+0030')
'0'
>>> ucnstring_to_unicode('U+0031')
'1'
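
The str returned here and the bytes returned by ucnstring_to_python for the same input differ only by UTF-8 encoding. The short sketch below checks that relationship with built-in calls only; it does not call the library.

```python
# UTF-8 bytes for U+4E00, as shown in the ucnstring_to_python example.
encoded = b"\xe4\xb8\x80"

# Decoding yields the one-character str shown in the examples above.
decoded = encoded.decode("utf-8")
print(decoded)

# ord() recovers the code point; hex() shows it in the form used by a UCN.
print(hex(ord(decoded)))

# Encoding again round-trips to the original bytes.
assert decoded.encode("utf-8") == encoded
```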

unihan_etl.util._dl_progress(count, block_size, total_size, out=sys.stdout)

Report download progress to out. Adapted from MIT-licensed code: https://github.com/okfn/dpm-old/blob/master/dpm/util.py

Modification for testing: http://stackoverflow.com/a/4220278

Return type:

None

Parameters:
  • count (int) –

  • block_size (int) –

  • total_size (int) –

  • out (IO[str]) –

>>> _dl_progress(0, 1, 10)
Total size: 10b
>>> _dl_progress(0, 100, 942_200)
Total size: 942Kb
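
The (count, block_size, total_size) parameters match the callback shape that urllib.request.urlretrieve passes to its reporthook argument. The sketch below shows a hook of that shape with an injectable output stream, which makes it testable without a download. The name report_progress and its percentage format are hypothetical and differ from the output shown above.

```python
import io
import sys
from typing import IO


def report_progress(
    count: int, block_size: int, total_size: int, out: IO[str] = sys.stdout
) -> None:
    """Hypothetical reporthook: write the share of the download received so far."""
    if total_size <= 0:
        # Some servers do not report a size; fall back to a byte counter.
        out.write(f"\rDownloaded {count * block_size} bytes")
    else:
        # The last block may overshoot the total, so cap the value at 100.
        percent = min(100, count * block_size * 100 // total_size)
        out.write(f"\r{percent:3d}%")
    out.flush()


# Passing an in-memory stream captures the output instead of printing it.
buffer = io.StringIO()
report_progress(5, 100, 1000, out=buffer)
print(repr(buffer.getvalue()))
```

A hook like this could be passed as urllib.request.urlretrieve(url, filename, reporthook=report_progress), where url and filename are supplied by the caller.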

unihan_etl.util.merge_dict(d, u)

Return updated dict.

Return type:

TypeVar(T, bound= Mapping[str, t.Any])

Parameters:
  • d –

  • u –

Returns:

Updated dictionary

Return type:

dict

Notes

Thanks: http://stackoverflow.com/a/3233356
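
A common way to update one dictionary from another while preserving nested keys is a recursive merge. The sketch below assumes that behavior for illustration; merge_nested is a hypothetical name, and the library's merge_dict may differ in detail.

```python
from collections.abc import Mapping
from typing import Any, Dict


def merge_nested(base: Dict[str, Any], updates: Mapping) -> Dict[str, Any]:
    """Hypothetical helper: update base in place from updates, recursing into mappings."""
    for key, value in updates.items():
        if isinstance(value, Mapping):
            # Merge into the existing nested dict, or start a new one if absent.
            existing = base.get(key)
            target = existing if isinstance(existing, dict) else {}
            base[key] = merge_nested(target, value)
        else:
            # Non-mapping values simply overwrite what was there.
            base[key] = value
    return base


merged = merge_nested({"a": {"x": 1}, "b": 2}, {"a": {"y": 3}, "b": 4})
print(merged)
```

Unlike dict.update, this keeps "x" under "a" while adding "y", because nested mappings are merged rather than replaced.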

unihan_etl.util.get_fields(d)

Return list of fields from dict of {filename: ['field', 'field1']}.

Return type:

List[str]

Parameters:

d (UntypedUnihanData) –
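
Collecting fields from a mapping of that shape amounts to flattening the per-file lists. The sketch below assumes duplicates are dropped and the result is sorted, which the text above does not state; collect_fields and the file names are hypothetical placeholders.

```python
from typing import Dict, List


def collect_fields(files: Dict[str, List[str]]) -> List[str]:
    """Hypothetical helper: flatten {filename: [field, ...]} into a single list."""
    # A set comprehension removes fields listed under more than one file;
    # sorting makes the order independent of dict iteration order.
    return sorted({field for fields in files.values() for field in fields})


print(collect_fields({"a.txt": ["field", "field1"], "b.txt": ["field1", "field2"]}))
```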