Utilities - unihan_etl.util

Utilities for parsing UNIHAN’s data and structures.

unihan_etl.util.ucn_to_unicode(ucn)

Return a Python Unicode value from a UCN.

Converts a Unicode Universal Character Number (e.g. "U+4E00" or "4E00") to a Python Unicode string ('\u4e00').

>>> ucn_to_unicode("U+4E00")
'一'
>>> ucn_to_unicode("4E00")
'一'
Parameters:

ucn (str)

Return type:

str
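The conversion described above can be pictured as parsing the hexadecimal code point and handing it to the built-in `chr`. The following is a minimal sketch under that assumption, not the library's implementation; `ucn_to_char` is a hypothetical name used only for illustration.

```python
def ucn_to_char(ucn: str) -> str:
    """Hypothetical stand-in: convert "U+4E00" or "4E00" to its character."""
    # Drop the optional "U+" prefix, then parse the remainder as hexadecimal.
    hex_digits = ucn[2:] if ucn.upper().startswith("U+") else ucn
    # chr() maps an integer code point to a one-character str.
    return chr(int(hex_digits, 16))


char_with_prefix = ucn_to_char("U+4E00")
char_without_prefix = ucn_to_char("4E00")
```

Both calls yield the same single-character string, since the prefix only affects how the input is written.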

unihan_etl.util.ucnstring_to_python(ucn_string)

Return Unicode UCN (e.g. “U+4E00”) as UTF-8 encoded bytes (b'\xe4\xb8\x80').

>>> ucnstring_to_python("U+4E00")
b'\xe4\xb8\x80'
Parameters:

ucn_string (str)

Return type:

bytes
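The bytes in the example above are the UTF-8 encoding of the character that the UCN names. A rough sketch of how such a result could be produced is shown here; it assumes a regular-expression substitution followed by UTF-8 encoding, and `ucn_text_to_bytes` is a hypothetical helper rather than the library's code.

```python
import re


def ucn_text_to_bytes(text: str) -> bytes:
    """Hypothetical stand-in: replace each "U+XXXX" token, then encode as UTF-8."""

    def _replace(match: re.Match) -> str:
        # Group 1 holds the hexadecimal digits after "U+".
        return chr(int(match.group(1), 16))

    converted = re.sub(r"U\+([0-9A-Fa-f]+)", _replace, text)
    return converted.encode("utf-8")


encoded = ucn_text_to_bytes("U+4E00")
decoded = encoded.decode("utf-8")
```

Decoding the result with UTF-8 recovers the character as a `str`, which is the form the `str`-returning helpers on this page produce.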

unihan_etl.util.ucnstring_to_unicode(ucn_string)

Return a UCN string (e.g. "U+4E00") as a Unicode str.

>>> ucnstring_to_unicode('U+4E00')
'一'
>>> ucnstring_to_unicode('U+4E01')
'丁'
>>> ucnstring_to_unicode('U+0030')
'0'
>>> ucnstring_to_unicode('U+0031')
'1'
Parameters:

ucn_string (str)

Return type:

str

unihan_etl.util._dl_progress(count, block_size, total_size, out=sys.stdout)

Report download progress to out.

Adapted from https://github.com/okfn/dpm-old/blob/master/dpm/util.py (MIT License), with a modification for testing based on http://stackoverflow.com/a/4220278.

>>> _dl_progress(0, 1, 10)
Total size: 10b
>>> _dl_progress(0, 100, 942_200)
Total size: 942Kb
Parameters:

count (int)

block_size (int)

total_size (int)

out (IO[str])

Return type:

None
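The `(count, block_size, total_size)` argument triple has the same shape as the `reporthook` callback of `urllib.request.urlretrieve`, and the `out` parameter lets a test capture the output instead of writing to the terminal. The sketch below illustrates that pattern only; `report_progress` is a hypothetical function, its output format is invented for the example, and it assumes `total_size` is positive.

```python
import io
import sys
from typing import IO


def report_progress(
    count: int, block_size: int, total_size: int, out: IO[str] = sys.stdout
) -> None:
    """Hypothetical hook: write the percentage downloaded so far to `out`."""
    downloaded = count * block_size
    # Cap at 100 because the final block may overshoot the total size.
    percent = min(int(downloaded * 100 / total_size), 100)
    out.write(f"{percent:3d}%\n")


# Capture the output in memory, as a test might do.
buffer = io.StringIO()
for block in range(5):
    report_progress(block, 25, 100, out=buffer)
```

Passing an in-memory stream keeps the function testable without patching `sys.stdout`.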

unihan_etl.util.merge_dict(d, u)

Return updated dict.

Parameters:

d (dict)

u (dict)

Returns:

Updated dictionary

Return type:

dict
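The entry above does not spell out the merge semantics. One common reading of "updated dict" is a recursive update, where nested dicts are merged key by key instead of being replaced wholesale. The sketch below assumes that reading; `merge_nested` is a hypothetical name and may differ from what `merge_dict` actually does.

```python
from typing import Any, Dict, Mapping


def merge_nested(d: Dict[str, Any], u: Mapping[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in: recursively update `d` in place with `u`."""
    for key, value in u.items():
        if isinstance(value, dict):
            # Merge into the existing nested dict, or start a fresh one.
            existing = d.get(key)
            if not isinstance(existing, dict):
                existing = {}
            d[key] = merge_nested(existing, value)
        else:
            # Non-dict values simply overwrite.
            d[key] = value
    return d


base = {"a": {"x": 1}, "b": 2}
merged = merge_nested(base, {"a": {"y": 3}, "b": 4})
```

Under this assumption the nested key `x` survives the update, which a plain `dict.update` would not preserve.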

unihan_etl.util.get_fields(d)

Return list of fields from dict of {filename: ['field', 'field1']}.

Parameters:

d (UntypedUnihanData)

Return type:

list[str]
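As an illustration of the input shape described above, the following sketch flattens a mapping of filenames to field lists into a single list. Deduplicating and sorting the result are assumptions made for the example, and `collect_fields` and the filenames are hypothetical.

```python
from typing import Dict, List


def collect_fields(d: Dict[str, List[str]]) -> List[str]:
    """Hypothetical stand-in: gather every field named in any file's list."""
    # A set comprehension drops duplicates; sorted() gives a stable order.
    return sorted({field for fields in d.values() for field in fields})


fields = collect_fields(
    {
        "file_a.txt": ["field", "field1"],
        "file_b.txt": ["field1", "field2"],
    }
)
```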