[JavaScript] Is it possible to sort mixed numbers, accented, non-accented and uppercase letters?

For example, in this code taken from another programming forum, I replaced [^a-zA-Z] with [^a-z\u00E0-\u00FC]:

var reA = /[^a-z\u00E0-\u00FC]/g;
var reN = /[^0-9]/g;

function sortAlphaNum(a, b) 
{
  var aA = a.replace(reA, "");
  var bA = b.replace(reA, "");
  if (aA === bA) 
  {
    var aN = parseInt(a.replace(reN, ""), 10);
    var bN = parseInt(b.replace(reN, ""), 10);
    return aN === bN ? 0 : aN > bN ? 1 : -1;
  } 
  else 
  {
    return aA > bA ? 1 : -1;
  }
}
console.log(
["3", "2", "10", "40", "6", "4", "30", "33", "1", "Gustavo", "julho", "Klaus", "Ιαπωνία", "keyboard", "სკოლა", "último", "árbol", "γυναίκα", "uma", "água", "Argentina", "Ángelo", "argelino", "unido", "женщина", "κήπος", "друг", "дом", "ბაღი", "люди", ].sort(sortAlphaNum)
)

I also tried Intl.Collator().compare, changing sort(sortAlphaNum) into sort(Intl.Collator(sortAlphaNum).compare), but then the numbers are in the wrong order.

Without Intl.Collator().compare, the numbers order is correct, but the alphabetical order of non-accented, accented and upper letters is incorrect. See the output:

0: "1"
1: "2"
2: "3"
3: "4"
4: "6"
5: "10"
6: "30"
7: "33"
8: "40"
9: "Ιαπωνία"
10: "სკოლა"
11: "γυναίκα"
12: "женщина"
13: "κήπος"
14: "друг"
15: "дом"
16: "ბაღი"
17: "люди"
18: "argelino"
19: "julho"
20: "keyboard"
21: "Klaus"
22: "Ángelo"
23: "Argentina"
24: "uma"
25: "unido"
26: "Gustavo"
27: "água"
28: "árbol"
29: "último"

With Intl.Collator().compare, the numbers order is incorrect, but the alphabetical order of non-accented, accented and upper letters is correct. See the output:

0: "1"
1: "10"
2: "2"
3: "3"
4: "30"
5: "33"
6: "4"
7: "40"
8: "6"
9: "água"
10: "Ángelo"
11: "árbol"
12: "argelino"
13: "Argentina"
14: "Gustavo"
15: "julho"
16: "keyboard"
17: "Klaus"
18: "último"
19: "uma"
20: "unido"
21: "γυναίκα"
22: "Ιαπωνία"
23: "κήπος"
24: "дом"
25: "друг"
26: "женщина"
27: "люди"
28: "ბაღი"
29: "სკოლა"

Hi!

As surprising as it might sound, alphabetical sorting is a very hard problem (well, depending on the language), and although tools have seen huge improvements over the last decade, there are still many traps. The first problems come from diacritics (e.g. é, è, ê, ë, î, à, ç and the few others we have in French, then all the other common characters in Western Europe like ß, ñ, ..., then the Nordic and Eastern European ones; and if you start looking at Unicode pages, you’ll see that there are always new ones you didn’t know about).

Then you have the ligatures. Should cœur be inside the coeu range, or after the regular letters, ...

Then you also have words composed of multiple parts, separated by an apostrophe, a hyphen, or even a space. For example, vice versa in French is a single word; you’ll never find one of its parts alone. aujourd’hui is a single word too, and there are many compound words with hyphens. (See part 3 of the link below for many examples, correctly sorted, and you’ll feel the pain.)

The second problem, and I don’t know if it’s specific to French or if other languages have the same challenge, is that the rules change per domain. I was shocked when I learned this 15 years ago (I had to fix a sorting bug because we had ignored it). It affects how you order acronyms, proper nouns versus common nouns, and so on; from memory, legal material uses a different ordering than general dictionaries. Acronyms were a real headache in my use case (we had to produce the same deterministic ordering as the book our companion software was built for, but they never told us their rules :-/ )

All that to say that localeCompare and similar tools could do well, or could still have issues, depending on the solution you’re trying to build. Hopefully, you won’t need more than that. But just in case, I found this reference super helpful a few years ago. It’s the result of an analysis done in Canada (we have both English and French as official languages here): they needed to set the rules for deterministic sort algorithms and weren’t able to get anything from the common dictionaries.

www-clips.imag.fr/geta/gilles.serasset/tri-du-fra…

The document is in French, but I assume automatic translation should work pretty well, especially for part 2 (the rules) and part 4 (the algorithm). The main idea is to use multiple keys:

the first key is an ascii lower case representation of the word,
then the other keys are used to correctly handle the specifics when the characters are not the same as in that first key (uppercase, spaces, diacritics, ligatures),
at the end you have a numeric index
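To make the multiple-keys idea a bit more concrete, here is a minimal sketch. It is not the algorithm from the document above, only an illustration of the principle under my own assumptions: the helper names (buildKeys, compareKeys, multiKeySort) are made up, the ligature list is limited to œ and æ, and the tie-breaking order (unaccented before accented, uppercase before lowercase) is an arbitrary choice, not the official French rule.

```javascript
// Hypothetical helper: build the list of sort keys for one word.
function buildKeys(word, index) {
  // Expand a couple of ligatures (assumption: only œ and æ matter here).
  const expanded = word
    .replace(/œ/g, "oe").replace(/Œ/g, "OE")
    .replace(/æ/g, "ae").replace(/Æ/g, "AE");
  // Ignore spaces, apostrophes and hyphens so compound words sort as one word.
  const joined = expanded.replace(/[\s'’-]/g, "");
  // Decompose accented letters into base letter + combining mark.
  const decomposed = joined.normalize("NFD");
  // Key 1: lowercase, no diacritics.
  const primary = decomposed.replace(/\p{M}/gu, "").toLowerCase();
  // Key 2: lowercase with diacritics, used only when key 1 is equal.
  const accents = decomposed.toLowerCase();
  // Key 3: original case, used only when keys 1 and 2 are equal.
  // Key 4: the original position, as a last-resort numeric index.
  return [primary, accents, decomposed, index];
}

// Compare two key lists, key by key, stopping at the first difference.
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Hypothetical entry point: returns a new sorted array.
function multiKeySort(words) {
  return words
    .map((word, index) => ({ word, keys: buildKeys(word, index) }))
    .sort((x, y) => compareKeys(x.keys, y.keys))
    .map((entry) => entry.word);
}

console.log(multiKeySort(["côte", "vice versa", "cœur", "cote", "Cote"]));
// → [ 'cœur', 'Cote', 'cote', 'côte', 'vice versa' ]
```

The point is that the later keys only come into play when the earlier ones are equal, which is what keeps the ordering deterministic.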

I hope you won’t have to do this (I had to, before the glorious UTF-8 days, and this was not the most interesting stuff to do ;-) But it’s useful to know the traps and pitfalls, and have pointers if the existing tools don’t let you get the exact ordering you need.

For French, we are lucky that this work has since been used and integrated into ISO/IEC 10646-1, so the locale-based sort should work in French. I’m pretty sure other languages have their own specifics that I don’t know about, and I have no idea how much of this has been standardized, and therefore how well locale-based sorting works there. But at least you know that there’s a risk, so you can integrate this into your tests.
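For the case in the question, a locale-based sort can probably handle the numbers too. The sketch below assumes the standard numeric option of Intl.Collator is enough for your data; sortMixed is just a name I made up, and the exact order between scripts (Latin, Greek, Cyrillic, Georgian) depends on the locale and on the runtime's ICU data, so check it against your own expectations.

```javascript
// Hypothetical helper: locale-aware sort that compares digit runs by value.
// Pass undefined as the locale to use the runtime's default locale.
function sortMixed(items, locale) {
  // numeric: true makes "2" sort before "10" instead of after it.
  const collator = new Intl.Collator(locale, { numeric: true });
  // Copy first so the caller's array is not mutated.
  return [...items].sort(collator.compare);
}

console.log(
  sortMixed(["10", "2", "árbol", "1", "Argentina", "água", "33", "4"], "en")
);
```

A few fixed arrays like this one, with the order you expect, are an easy way to integrate the risk into your tests.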
