unicode normalization

paraphasing Montoya: 'I do not think this word is what you think it is.'

January 28, 2022

3 min, 480 words

The problem...

You feel you're on top of unicode, because all your database-data is encoded in utf-8, and your search-terms are encoded in utf-8, and internally you're decoding to unicode for everything -- all-good. But... you have a search term, and you know the search term is in the database, yet no match is found!!

Example...

>>> database_words
['hola', 'más', 'qué']
>>> 
>>> 'foo' in database_words
False
>>> 
>>> 'hola' in database_words
True
>>> 
>>> 'más' in database_words  # search word typed in via mac 'option-e, a' method
True
>>> 
>>> 'qué' in database_words  # search word typed in via mac 'option-e, e' method
False  # ???
>>>

The above confusing situation is because in the 'database' (database_words), the last word's accented-letter is two combined unicode-characters, but the search-term is using a single unicode-character for the accented letter.

Here's how database_words was created:

>>> database_words = ['hola', 'm' + '\u00e1' + 's', 'qu' + '\u0065' + '\u0301']
>>> database_words
['hola', 'más', 'qué']

And this is how you'd explicitly create the single unicode-character search term:

>>> 'qu' + '\u00e9'
'qué'
>>>

(On the Mac, that accented-e is created via typing option-e then e, which produces the above single-unicode character.)

There is no way, just by looking, that you can tell the true nature of that accented 'e'.

examining characters, if curious...

>>> import unicodedata
>>> 
>>> for c in 'qué':  # typed in via option-e method which produces `'qu' + '\u00e9'`
...     print( f'unicode-name, ``{unicodedata.name(c)}``' )
... 
unicode-name, ``LATIN SMALL LETTER Q``
unicode-name, ``LATIN SMALL LETTER U``
unicode-name, ``LATIN SMALL LETTER E WITH ACUTE``
>>> 

>>> for c in 'qué':  # pasted in from database_words, which uses the two combined unicode-characters: `'qu' + '\u0065' + '\u0301'`
...     print( f'unicode-name, ``{unicodedata.name(c)}``' )
... 
unicode-name, ``LATIN SMALL LETTER Q``
unicode-name, ``LATIN SMALL LETTER U``
unicode-name, ``LATIN SMALL LETTER E``
unicode-name, ``COMBINING ACUTE ACCENT``
>>>

Dealing with this weird reality...

Good tools, like solr, often enable you to not have to worry about this -- a search-term with characters composed one way will generally find content composed the other way.

If you normalize-->decompose data in the database, and normalize-->decompose the search term, all will work.

The code below normalizes by decomposing single-unicode characters into multiple-unicode characters. Note that we start with the single unicode-character, and transform it into the two combined unicode-characters.

>>> decomposed_output = unicodedata.normalize( 'NFKD', 'qu' + '\u00e9' )
>>> 
>>> for c in decomposed_output:
...     print( f'unicode-name, ``{unicodedata.name(c)}``' )
... 
unicode-name, ``LATIN SMALL LETTER Q``
unicode-name, ``LATIN SMALL LETTER U``
unicode-name, ``LATIN SMALL LETTER E``
unicode-name, ``COMBINING ACUTE ACCENT``
>>>

So one overall solution...

Save data in a normalized-->decomposed way...

>>> decomposed_database_words = []
>>> 
>>> for word in database_words:
...     decomposed_word = unicodedata.normalize( 'NFKD', word )
...     decomposed_database_words.append( decomposed_word )
... 
>>> decomposed_database_words
['hola', 'más', 'qué']
>>>

...and then normalize-->decompose the search-term...

>>> search_term = 'qu' + '\u00e9'  # demonstrates search term starts with the single unicode character
>>> 
>>> decomposed_search_term = unicodedata.normalize( 'NFKD', search_term )
>>>

...then the normalized-->decomposed search term will find the data...

>>> decomposed_search_term in decomposed_database_words
True
>>>