Detecting Normalization Forms (Unicode)
Last updated at 8:57 am UTC on 10 December 2015
The Unicode Character Database supplies properties that allow implementations to quickly determine whether a string x is in a particular Normalization Form—for example, isNFC(x). This is, in general, many times faster than normalizing and then comparing.
The data to do this check comes from DerivedNormalizationProps.txt.
http://www.unicode.org/reports/tr15/tr15-43.html#Detecting_Normalization_Forms
Java version
public int quickCheck(String source) {
short lastCanonicalClass = 0;
int result = YES;
for (int i = 0; i < source.length(); ++i) {
int ch = source.codepointAt(i);
if (Character.isSupplementaryCodePoint(ch)) ++i;
short canonicalClass = getCanonicalClass(ch);
if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
return NO; }
int check = isAllowed(ch);
if (check == NO) return NO;
if (check == MAYBE) result = MAYBE;
lastCanonicalClass = canonicalClass;
}
return result;
}
Not yet implemented in Squeak. To be added.