The default Java collation rules ignore whitespace. This is unfortunate because, for instance, Czech collation rules treat the space as significant. The RuleBasedCollator uses CollationRules.DEFAULTRULES (Sun-specific) and appends the locale-specific rules at the end. The default rules only take spaces into account in the second-order comparison. There are two ways to fix this:
- append a rule for the space ("& ' ' < x,z" in the sample below); all characters that should be collated after the space must then be stated explicitly
- change CollationRules.DEFAULTRULES: add a rule for the space before the rule for '_'
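The statement that the default rules only consider the space at the second comparison level can be checked by lowering the collator strength to PRIMARY. The following is a minimal sketch under stated assumptions: the class name SpaceStrengthCheck and the helper compareAt are made up for illustration, and Locale.forLanguageTag("cs") (Java 7 and later) is assumed to select the same Czech collator as new Locale("cs").

```java
import java.text.Collator;
import java.util.Locale;

public class SpaceStrengthCheck {

    // Hypothetical helper: compares two strings with the default Czech
    // collator after setting the requested comparison strength.
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("cs"));
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // PRIMARY: only first-order differences are considered.
        System.out.println("primary:  " + compareAt(Collator.PRIMARY, "j a", "ja"));
        // TERTIARY: second- and third-order differences are considered as well.
        System.out.println("tertiary: " + compareAt(Collator.TERTIARY, "j a", "ja"));
    }
}
```

If the default rules behave as described, the PRIMARY comparison should report 0 (the space is skipped), while the TERTIARY comparison should report a non-zero value (the space only acts as a tie-breaker).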
Note: The sample below only shows how to fix the problem; it does not include all Czech-specific collation rules.
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.Locale;

public class CzechSpaceCollation {
    public static void main(String[] args) throws ParseException {
        Locale locale = new Locale("cs");
        RuleBasedCollator defaultCollator = (RuleBasedCollator) Collator.getInstance(locale);
        // The rules contain many non-ASCII characters, so the printed string is not fully legible.
        final String rules = defaultCollator.getRules();
        System.out.println("rules: " + rules);

        // Solution 1: correct sorting, but all characters must be explicitly specified.
        // This sample only specifies that x and z are after space;
        // other characters will be before space.
        RuleBasedCollator collator1 = new RuleBasedCollator(rules + "& ' ' < x,z");

        // Solution 2: add a rule for space before the rule for '_'.
        RuleBasedCollator collator2 = new RuleBasedCollator(
                rules.replaceAll("<'\u005f'", "<' '<'\u005f'"));

        String[] testStr = { "ja", "j p", "j z" };

        String[] test = testStr.clone();
        Arrays.sort(test, defaultCollator);
        // Print the sorted copy, not the unsorted input array.
        System.out.println("default sorting: " + Arrays.toString(test));

        test = testStr.clone();
        Arrays.sort(test, collator1);
        System.out.println("partially correct: " + Arrays.toString(test));

        test = testStr.clone();
        Arrays.sort(test, collator2);
        System.out.println("should be correct: " + Arrays.toString(test));
    }
}
The output on JDK 6.0:
rules: (rules with garbage characters due to unicode)
The default sorting is wrong; the space is sorted after 'a':
default sorting: [ja, j p, j z]
First solution, with only two characters listed explicitly, so its limitation is apparent:
partially correct: [j z, ja, j p]
Correct solution with changed default collation rules:
should be correct: [j p, j z, ja]
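
The effect of giving the space its own first-order position can also be seen in isolation, without the Sun default rules. The following is a minimal sketch, not a Czech collation: the class name MiniSpaceRules, the helper sortWithToyRules and the toy rule string are all made up for illustration, and the rule set only covers the characters used in the test strings.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

public class MiniSpaceRules {

    // Hypothetical toy rule set: the quoted space is the first character
    // with a first-order position, followed by the letters used in the test.
    static final String TOY_RULES = "< ' ' < a < j < p < z";

    // Returns a sorted copy of the input using the toy rules.
    static String[] sortWithToyRules(String... input) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(TOY_RULES);
            String[] sorted = input.clone();
            Arrays.sort(sorted, collator);
            return sorted;
        } catch (ParseException e) {
            // Only reached if the rule string itself is malformed.
            throw new IllegalStateException("invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortWithToyRules("ja", "j p", "j z")));
    }
}
```

Because the space is compared before any letter here, the three test strings should come out in the same order as the "should be correct" line above. Characters missing from the toy rules are not handled in any meaningful way, which is why the real fix modifies the full default rules instead.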