NFC Normalization of Å #8
BeRo1985/pucu#8
Decomposing and normalizing the letter Å (Latin Capital Letter A with Ring Above, codepoint `$00C5`) produces the sequence `$0041 $030A`. This is correct. However, composing the sequence `$0041 $030A` produces the codepoint `$212B` (Angstrom Sign). `$00C5` and `$212B` are equivalent codepoints, but their normal form is `$00C5`, so the composition is wrong.

This is a pretty big problem, as Å is a fairly common letter in Scandinavian languages (which is how I discovered the problem).
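For reference, here is the behavior a conforming normalizer should produce, demonstrated with Python's `unicodedata` module rather than PUCU:

```python
import unicodedata

# U+00C5 (Å) decomposes to U+0041 (A) + U+030A (combining ring above).
assert unicodedata.normalize("NFD", "\u00C5") == "\u0041\u030A"

# Composing that sequence must yield U+00C5, not U+212B (Angstrom Sign).
assert unicodedata.normalize("NFC", "\u0041\u030A") == "\u00C5"

# U+212B is canonically equivalent to U+00C5; its normal form is U+00C5.
assert unicodedata.normalize("NFC", "\u212B") == "\u00C5"

print("ok")  # prints "ok"
```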
The problem, unfortunately, isn't isolated to Å.
I've now run a unit test against the test cases in the Unicode character database. The results are not good...
The crash on decomposition is an endless loop in the normalization loop. For example, try decomposing Ḕ̄ (Latin Capital Letter E with Macron and Grave + Combining Macron, codepoint sequence `$1E14 $0304`). The result should be `$0045 $0304 $0300 $0304`.

The cause of the Å problem is that PUCU fails to take equivalent codepoints into account when the `PUCUUnicodeCharacterCompositionMap` table is built in `PUCUConvertUnicode.ResolveCompositions`. As far as I can tell, the function uses the codepoint→decomposition table to build a decomposition→codepoint mapping by hashing all the decomposition values to their codepoints. The problem is that it assumes that if A maps to B, then B must map to A. As I've demonstrated, this isn't the case for all codepoints.
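The non-bijectivity is easy to show: more than one codepoint can share the same full canonical decomposition, so a naive reverse map is ambiguous. A small sketch (not PUCU's actual code) that builds such a reverse map and collects the collisions:

```python
import sys
import unicodedata

# The expected decomposition from the example above:
assert unicodedata.normalize("NFD", "\u1E14\u0304") == "\u0045\u0304\u0300\u0304"

# Build a naive "full decomposition -> codepoint" reverse map, the way the
# issue describes, and record decompositions shared by several codepoints.
reverse = {}
collisions = {}
for cp in range(sys.maxunicode + 1):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # skip surrogates
    nfd = unicodedata.normalize("NFD", chr(cp))
    if nfd != chr(cp):  # the character has a canonical decomposition
        if nfd in reverse:
            collisions.setdefault(nfd, [reverse[nfd]]).append(cp)
        else:
            reverse[nfd] = cp

# Å (U+00C5) and the Angstrom Sign (U+212B) both end up at U+0041 U+030A,
# so "A maps to B" does not imply "B maps back to A".
print([hex(c) for c in sorted(collisions["\u0041\u030A"])])  # ['0xc5', '0x212b']
```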
I have verified that removing duplicates from `PUCUUnicodeCharacterCompositionMap` (keeping the entry with the lowest codepoint) solves the problem for Å (and 31 other test cases), but 260 other cases still fail the composition test.

I suspect that the correct way to solve this is to keep the equivalence state along with the decomposition data, so it can be used when generating the composition table.
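For comparison, one standard-conforming way to build a composition table (cf. UAX #15) is to take only *pair* canonical decompositions: singletons (e.g. U+212B → U+00C5) and compatibility decompositions never produce composition entries. A hedged sketch, again in Python rather than Pascal; note that a fully correct table must additionally honor `CompositionExclusions.txt`, which the `unicodedata` module does not expose, so this sketch alone is not complete:

```python
import sys
import unicodedata

compose = {}
for cp in range(sys.maxunicode + 1):
    decomp = unicodedata.decomposition(chr(cp))
    if not decomp or decomp.startswith("<"):
        continue  # no decomposition, or a compatibility one ("<compat>", ...)
    parts = decomp.split()
    if len(parts) != 2:
        continue  # singleton decomposition: never a composition entry
    pair = (int(parts[0], 16), int(parts[1], 16))
    compose.setdefault(pair, cp)  # first (lowest) codepoint wins

# U+0041 + U+030A now composes to U+00C5, never U+212B.
print(hex(compose[(0x0041, 0x030A)]))  # 0xc5
```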
I found another Unicode library and removed everything not related to normalization, to make its tables smaller.
Now I have a normalization-only library: https://github.com/benibela/internettools/blob/master/data/bbnormalizeunicode.pas
You could test whether that works better in these cases.
Thanks but I gave up and wrote my own implementation from scratch:
https://gitlab.com/anders.bo.melander/pascaltype2/-/blob/master/Source/PascalType.Unicode.pas?ref_type=heads#L820
It passes all 19,074 compose/decompose tests of the Unicode database.