NFC Normalization of Å #8


Open

opened 2023-05-20 00:13:03 +00:00 by andersmelander · 4 comments

andersmelander commented

2023-05-20 00:13:03 +00:00

(Migrated from github.com)

Decomposing and normalizing the letter Å (Latin Capital Letter A with Ring Above, codepoint $00C5) produces the sequence $0041 $030A. This is correct.
However, composing the sequence $0041 $030A produces the codepoint $212B (Angstrom Sign).

$00C5 and $212B are canonically equivalent codepoints, but their [normal form](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) is $00C5, so the composed result is wrong.

> For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining [ring above](https://en.wikipedia.org/wiki/Ring_above) "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").

This is a pretty big problem, as Å is a fairly common letter in Scandinavian languages (which is how I discovered it).
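For reference, the expected round trip can be checked against an independent implementation. The following is a minimal sketch that uses Python's standard `unicodedata` module as the reference; it does not exercise PUCU, and the helper name `round_trip` is made up for illustration.

```python
import unicodedata

A_RING = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM = "\u212B"      # ANGSTROM SIGN
DECOMPOSED = "A\u030A"   # U+0041 followed by COMBINING RING ABOVE


def round_trip(text):
    """Return (NFD form, NFC form of that NFD form) for `text`."""
    nfd = unicodedata.normalize("NFD", text)
    nfc = unicodedata.normalize("NFC", nfd)
    return nfd, nfc


for ch in (A_RING, ANGSTROM):
    nfd, nfc = round_trip(ch)
    # Both inputs decompose to the same sequence and recompose to U+00C5.
    print([f"U+{ord(c):04X}" for c in nfd], f"U+{ord(nfc):04X}")
```

Both iterations print `['U+0041', 'U+030A'] U+00C5`, which is the behaviour the composition step is expected to reproduce.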


andersmelander commented

2023-05-21 02:16:11 +00:00

(Migrated from github.com)

The problem, unfortunately, isn't isolated to Å.

I've now run a unit test against the test cases in the [Unicode character database](https://unicode.org/Public/UNIDATA/). The results are not good...

| Operation | Passed | Failed | Crash |
| :---- | ----: | ----: | ----: |
| Decomposition & normalization | 17,023 | 205 | ~1,846 |
| Composition | 18,782 | 292 | 0 |
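A tally like the one above could be gathered with a small harness. The sketch below assumes the test cases use the semicolon-separated layout of `NormalizationTest.txt` (source; NFC; NFD; NFKC; NFKD), which the thread does not state explicitly. The three sample lines are hand-written in that layout rather than copied from the file, and `run_tests` is a hypothetical helper. Python's `unicodedata.normalize` stands in for the function under test.

```python
import unicodedata

# Hand-written samples in the assumed column layout: source;NFC;NFD;NFKC;NFKD;
SAMPLE = """\
@Part0 # sample header line, skipped by the parser
00C5;00C5;0041 030A;00C5;0041 030A; # A with ring above
212B;00C5;0041 030A;00C5;0041 030A; # Angstrom sign
1E14 0304;1E14 0304;0045 0304 0300 0304;1E14 0304;0045 0304 0300 0304; # E macron grave + macron
"""


def parse_field(field):
    """Turn '0041 030A' into the corresponding string."""
    return "".join(chr(int(h, 16)) for h in field.split())


def run_tests(lines, normalize):
    """Check NFC and NFD of each source string; return a pass/fail/crash tally."""
    tally = {"passed": 0, "failed": 0, "crash": 0}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("@"):
            continue
        source, nfc, nfd = (parse_field(f) for f in line.split(";")[:3])
        for form, expected in (("NFC", nfc), ("NFD", nfd)):
            try:
                ok = normalize(form, source) == expected
            except Exception:
                tally["crash"] += 1
            else:
                tally["passed" if ok else "failed"] += 1
    return tally


print(run_tests(SAMPLE.splitlines(), unicodedata.normalize))
# {'passed': 6, 'failed': 0, 'crash': 0}
```

Note that catching exceptions does not detect an endless loop; counting hangs as crashes would additionally need a timeout or a watchdog around each call.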

The crashes on decomposition are caused by an endless loop in the normalization step. For example, try decomposing Ḕ ̄ (Latin Capital Letter E with Macron and Grave + Combining Macron, codepoint sequence $1E14 $0304). The result should be $0045 $0304 $0300 $0304.
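To illustrate why this step does not need an open-ended loop, here is a minimal sketch of canonical decomposition followed by canonical ordering, written against Python's `unicodedata` tables. It is not PUCU's code and makes no claim about where PUCU's loop goes wrong. The function names are made up, and Hangul syllables (which decompose algorithmically) are deliberately left out.

```python
import unicodedata


def decompose(text):
    """Recursively apply canonical decomposition mappings (no Hangul handling)."""
    out = []
    for ch in text:
        mapping = unicodedata.decomposition(ch)
        # Mappings starting with '<' are compatibility mappings; skip them for NFD.
        if mapping and not mapping.startswith("<"):
            expanded = "".join(chr(int(h, 16)) for h in mapping.split())
            out.extend(decompose(expanded))
        else:
            out.append(ch)
    return out


def canonical_order(chars):
    """Stable-sort each run of non-starters by combining class, in one pass."""
    result, run = [], []
    for ch in chars:
        if unicodedata.combining(ch) == 0:
            result.extend(sorted(run, key=unicodedata.combining))
            run = []
            result.append(ch)
        else:
            run.append(ch)
    result.extend(sorted(run, key=unicodedata.combining))
    return result


def nfd_sketch(text):
    return "".join(canonical_order(decompose(text)))


print([f"{ord(c):04X}" for c in nfd_sketch("\u1E14\u0304")])
# ['0045', '0304', '0300', '0304']
```

Because the three marks after the E all have the same combining class, the stable sort leaves them in their original order, and the single pass over the input always terminates.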


andersmelander commented

2023-05-29 01:37:26 +00:00

(Migrated from github.com)

The cause of the Å problem is that PUCU fails to take equivalent codepoints into account when the PUCUUnicodeCharacterCompositionMap table is built in PUCUConvertUnicode.ResolveCompositions.

As far as I can tell, the function uses the codepoint->decomposition table to build a decomposition->codepoint mapping by hashing all the decomposition values to their codepoints. The problem here is that it assumes that if A maps to B then B must map to A. As I've demonstrated, this isn't the case for all codepoints.
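The failure mode described above can be reproduced in a few lines. This is a sketch of the general mechanism only, assuming a plain "last write wins" map; it is not taken from `PUCUConvertUnicode.ResolveCompositions`, and `naive_composition_map` is a made-up name. Python's `unicodedata` supplies the decomposition data.

```python
import unicodedata


def naive_composition_map(codepoints):
    """Invert codepoint -> full decomposition without checking for collisions."""
    table = {}
    for cp in codepoints:
        decomposed = unicodedata.normalize("NFD", chr(cp))
        if decomposed != chr(cp):
            # Assumption: a later entry silently replaces an earlier one.
            table[decomposed] = cp
    return table


table = naive_composition_map([0x00C5, 0x212B])
print(f"U+{table['A' + chr(0x030A)]:04X}")
# U+212B
```

Both U+00C5 and U+212B share the decomposition U+0041 U+030A, so whichever entry is stored last decides what the sequence composes to.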

I have verified that removing duplicates from PUCUUnicodeCharacterCompositionMap (and keeping the entry with the lowest codepoint) solves the problem for Å (and 31 other test cases), but there are still 260 other cases that fail the composition test.

I suspect that the correct method of solving this is to keep the equivalence state along with the decomposition data so it can be used when generating the composition table.
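One possible way to carry that information into the composition table is sketched below. It relies on my reading of the Unicode normalization rules, namely that only two-codepoint canonical decompositions of characters that are themselves stable under NFC should become composition pairs. Whether this matches the "equivalence state" meant above, or fixes the remaining failures in PUCU, is not something the thread confirms. The function name is made up, and Python's `unicodedata` supplies the data.

```python
import unicodedata


def build_composition_table(limit=0x110000):
    """Map (first, second) codepoint pairs to their composite."""
    table = {}
    for cp in range(limit):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        if not mapping or mapping.startswith("<"):
            continue  # no decomposition, or a compatibility mapping
        parts = tuple(int(h, 16) for h in mapping.split())
        if len(parts) != 2:
            continue  # singleton such as U+212B -> U+00C5: never composed to
        if unicodedata.normalize("NFC", ch) != ch:
            continue  # not stable under NFC, so treated as excluded from composition
        table[parts] = cp
    return table


table = build_composition_table()
print(f"U+{table[(0x0041, 0x030A)]:04X}")
# U+00C5
```

With this filter U+212B never enters the table, because its own decomposition is the single codepoint U+00C5 rather than a pair.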


benibela commented

2023-08-29 15:14:56 +00:00

(Migrated from github.com)

I found another Unicode library and removed everything not related to normalization to make its tables smaller.

Now I have a normalization-only library: https://github.com/benibela/internettools/blob/master/data/bbnormalizeunicode.pas

You could test whether that works better in these cases.


andersmelander commented

2023-08-29 16:57:58 +00:00

(Migrated from github.com)

Thanks, but I gave up and wrote my own implementation from scratch:
https://gitlab.com/anders.bo.melander/pascaltype2/-/blob/master/Source/PascalType.Unicode.pas?ref_type=heads#L820

It passes all 19,074 compose/decompose tests of the Unicode database.
