Java collator and spaces

Java's default collation rules ignore whitespace. This is unfortunate because, for instance, Czech collation rules take spaces into account. RuleBasedCollator uses CollationRules.DEFAULTRULES (Sun-specific) and appends the locale-specific rules at the end. The default rules treat the space as ignorable: it only takes part in the second-order comparison. There are two ways to fix this:

  • append new rules at the end with "& ' '", but then every character that should be collated after the space must be listed explicitly
  • modify CollationRules.DEFAULTRULES: add a rule for the space before the rule for '_'
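The effect of the default rules can be checked with a small sketch before applying either fix. It assumes a JDK whose default rules treat the space as ignorable, as described above; the class and method names are made up for this illustration.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class SpaceIgnoredDemo {
    // Fresh Czech collator; getInstance hands out a new copy on every call
    static Collator czech() {
        return Collator.getInstance(Locale.forLanguageTag("cs"));
    }

    // True when the collator sees no primary-level difference between a and b
    static boolean equalAtPrimary(String a, String b) {
        Collator collator = czech();
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    // Sorts a copy of the input with the unmodified default collator
    static String[] sortedWithDefault(String... input) {
        String[] copy = input.clone();
        Arrays.sort(copy, czech());
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(equalAtPrimary("j p", "jp"));
        System.out.println(Arrays.toString(sortedWithDefault("j z", "ja", "j p")));
    }
}
```

On such a JDK the first line is expected to be true, and the sorted array is expected to start with "ja", because the space does not take part in the primary comparison.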

Note: This is just a sample of how to fix the problem; it doesn't include all Czech-specific collation rules.

import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorSpaces {
    public static void main(String[] args) throws ParseException {
        Locale locale = new Locale("cs");
        // getInstance is declared to return Collator; the cast assumes the Sun implementation
        RuleBasedCollator defaultCollator = (RuleBasedCollator) Collator.getInstance(locale);

        final String rules = defaultCollator.getRules();
        // the rules contain many non-ASCII characters, so the output is not fully legible
        System.out.println("rules: " + rules);

        // solution 1: correct sorting, but all characters must be listed explicitly;
        // this sample only specifies that x and z come after space,
        // all other characters still sort before space
        RuleBasedCollator collator1 = new RuleBasedCollator(rules + "& ' ' < x,z");

        // solution 2: add a rule for space before the rule for '_'
        RuleBasedCollator collator2 = new RuleBasedCollator(rules.replaceAll(
                "<'\u005f'", "<' '<'\u005f'"));

        String[] testStr = { "ja", "j p", "j z" };

        String[] test = testStr.clone();
        Arrays.sort(test, defaultCollator);
        // print the sorted copy, not the original array
        System.out.println("default sorting: " + Arrays.toString(test));

        test = testStr.clone();
        Arrays.sort(test, collator1);
        System.out.println("partially correct: " + Arrays.toString(test));

        test = testStr.clone();
        Arrays.sort(test, collator2);
        System.out.println("should be correct: " + Arrays.toString(test));
    }
}

The output on JDK 6.0:

rules: (rules with garbage characters due to unicode)
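If the rules need to be read, one option is to escape everything outside printable ASCII before printing. The following is a minimal sketch; RuleEscaper and escapeNonAscii are hypothetical names, not part of the JDK.

```java
public class RuleEscaper {
    // Keeps printable ASCII as is and writes every other char as a
    // backslash-u escape with four hex digits
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch >= 0x20 && ch < 0x7f) {
                out.append(ch);
            } else {
                out.append(String.format("\\u%04x", (int) ch));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // the last character of the literal is a c with caron
        System.out.println(escapeNonAscii("& c < \u010d"));
    }
}
```

In the sample above this would be used as System.out.println("rules: " + RuleEscaper.escapeNonAscii(rules)), which should also make it easier to spot the fragment that the second solution replaces.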

Default sorting is wrong: the space is ignored, so "j p" and "j z" are sorted after "ja":

default sorting: [ja, j p, j z]

The first solution, with only two characters listed explicitly, which makes the remaining problem obvious:

partially correct: [j z, ja, j p]

The correct solution, with modified default collation rules:

should be correct: [j p, j z, ja]
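
Because CollationRules.DEFAULTRULES is Sun-specific, the replaceAll call in the second solution silently changes nothing when the rules do not contain the '_' fragment it looks for. A hedged sketch of a guard follows; SpaceRuleInserter and its method are hypothetical names, and the fragment <'_' is taken from the sample code, not from any documented contract.

```java
public class SpaceRuleInserter {
    // Fragment the sample code looks for in the default rules (assumed to be present)
    static final String UNDERSCORE_RULE = "<'_'";

    // Inserts a primary rule for space before every underscore rule,
    // and fails loudly when the fragment is missing instead of returning the rules unchanged
    static String insertSpaceBeforeUnderscore(String rules) {
        if (!rules.contains(UNDERSCORE_RULE)) {
            throw new IllegalArgumentException(
                    "fragment " + UNDERSCORE_RULE + " not found in collation rules");
        }
        // String.replace works on literal text, so no regex escaping is needed
        return rules.replace(UNDERSCORE_RULE, "<' '" + UNDERSCORE_RULE);
    }

    public static void main(String[] args) {
        // shortened, made-up rule string used only to show the transformation
        System.out.println(insertSpaceBeforeUnderscore("<a<'_'<b"));
    }
}
```

The result can then be passed to new RuleBasedCollator(...) as in the second solution; on a JDK with different default rules the exception shows that the fix was not applied.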