How does localeCompare() actually work?

This method reminds me of sort(), which orders its array elements by character code (ASCII/UTF-16 value) unless you pass your own comparison function.

But localeCompare() compares alphabetically, and it takes in 2 strings. I’m really baffled as to how its comparison actually works. Why do the examples below give a positive value, meaning referenceStr goes after compareString?

const a = 'CHhea'; 
const b = 'CCHhn'; 

console.log(a.localeCompare(b)); // 1

const c = 'réservé'; // with accents, lowercase
const d = 'RESERVE'; // no accents, uppercase

console.log(c.localeCompare(d));  // 1

referenceStr.localeCompare(compareString):

Negative when the referenceStr occurs before compareString
Positive when the referenceStr occurs after compareString
Returns 0 if they are equivalent

It is a function for use with sort, not an alternative to it: a ready-made comparison function that exists so that you don’t need to write your own.

someArrayOfWords.sort((a, b) => a.localeCompare(b))
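To see what that buys you over the default, here is a minimal sketch comparing the two. The word list is made up for illustration, and the 'en' locale is passed explicitly (an assumption) so the result doesn't depend on system settings.

```javascript
// Illustrative word list (not from the thread).
const words = ['Zebra', 'apple', 'Mango', 'banana'];

// Default sort compares UTF-16 code units, so every uppercase letter
// lands before every lowercase one.
const byCodeUnit = [...words].sort();

// Passing localeCompare as the comparator gives alphabetical order.
const byLocale = [...words].sort((a, b) => a.localeCompare(b, 'en'));

console.log(byCodeUnit); // [ 'Mango', 'Zebra', 'apple', 'banana' ]
console.log(byLocale);   // [ 'apple', 'banana', 'Mango', 'Zebra' ]
```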

So it’s

Positive when the referenceStr occurs after compareString

Like

referenceString.localeCompare(compareString)

So for your examples:

"CHhea".localeCompare("CCHhn")

C is same as C (0)
H comes after C (1)
Stop, return 1

"réservé".localeCompare("RESERVE")

r and R are the same base letter, and so is every other pair, so the accents decide: é comes after e (1)
Stop, return 1
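One way to check which difference decides the second example is to test the pieces separately. This is a rough sketch that assumes the 'en' locale and an ICU-backed engine (current browsers and Node.js); the variable names are only for illustration.

```javascript
// Math.sign is used because only the sign of the result is guaranteed.
const caseOnly   = Math.sign('reserve'.localeCompare('RESERVE', 'en'));
const accentOnly = Math.sign('réservé'.localeCompare('reserve', 'en'));
const both       = Math.sign('réservé'.localeCompare('RESERVE', 'en'));

// sensitivity: 'base' ignores both accents and case.
const baseOnly = 'réservé'.localeCompare('RESERVE', 'en', { sensitivity: 'base' });

console.log(caseOnly);   // -1: lowercase sorts first when only case differs
console.log(accentOnly); //  1: é sorts after e
console.log(both);       //  1: the accent difference outranks the case difference
console.log(baseOnly);   //  0: same base letters
```

So under those assumptions the positive result comes from the accents, not from the case of the first letter.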

So for
"CHhea".localeCompare("CCHhn")
The first chars of both strings are equal (0), so we go to the next char of referenceString and compare "H" to the first char of compareString? Or the second char of compareString?

Also, for the second example, doesn’t "r" come before "R"?

const a = 'r';
const b = 'R'; 

console.log(a.localeCompare(b)); // -1 (negative, so come before)

Well, it’s just alphabetical, that’s what I’m describing: “CCHhn” comes before “CHhea”, and you only need to look at the first two letters to see that’s true (second letter against second letter, i.e. "H" against "C"). There’s no complicated logic, it’s just how the alphabet works.
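The "look at the first two letters" idea can be sketched as a loop over matching positions. firstDifference is a hypothetical helper written for this post, not something built in, and it is a simplification: real collation looks at the whole string, so it only mirrors localeCompare for plain cases like this one.

```javascript
// Hypothetical helper: walks both strings position by position and reports
// the first position where the characters compare unequal.
function firstDifference(a, b, locale = 'en') {
  const length = Math.min(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const result = a[i].localeCompare(b[i], locale);
    if (result !== 0) {
      return { index: i, chars: [a[i], b[i]], result: Math.sign(result) };
    }
  }
  return null; // no difference within the shared length
}

const diff = firstDifference('CHhea', 'CCHhn');
console.log(diff); // { index: 1, chars: [ 'H', 'C' ], result: 1 }
```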

By character code it’s the other way round: "R" (82) comes before "r" (114). With localeCompare, the default ordering in most engines puts the lowercase letter first when case is the only difference, which is why your snippet gives -1 (unless you specify you want it to be case-insensitive in the options you can pass to localeCompare, in which case they’re the same). It’s arbitrary: if you’re putting letters in order and you distinguish between upper and lower case, then either the upper- or lowercase version has to come first, and character codes pick the former (I don’t know about non-English dictionaries). See for example:

Locale compare just means words will be compared in dictionary order, alphabetically, using the locale specified (by default, whichever one is set on the system you are using), rather than by character code.

So at a basic level, it just allows a sort function to sort alphabetically, and do so correctly regardless of language.
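As a quick illustration of the locale making a difference, the same two characters can come out in a different order depending on the language. This sketch assumes the engine ships German ('de') and Swedish ('sv') locale data, which current browsers and Node.js builds normally do.

```javascript
// In German, ä is treated as a variant of a, so it sorts before z.
const german = Math.sign('ä'.localeCompare('z', 'de'));

// In Swedish, ä is a separate letter near the end of the alphabet, after z.
const swedish = Math.sign('ä'.localeCompare('z', 'sv'));

console.log(german);  // -1
console.log(swedish); //  1
```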

In many cases, there will be a text file on your computer somewhere which is just a list of rules and exceptions for your locale. The system (and in turn a program that requires the information from it, i.e. in this case a browser) will load this file. In turn, the information held within it will then be made available to things like the browser’s i18n API (of which localeCompare is a part).


For more specific examples:

Take the locale I use most often, “en-GB”, because I live in the UK. The modern English alphabet has 26 letters. But some words may use accented characters, for example naïve or café.

Say I have a program that accepts user input, and I want to allow someone to type either cafe or café, or naive or naïve.

If I then want to sort the words that the user typed, I want the program to treat the accented and unaccented spellings as the same letters, so I either write the logic myself or I just use localeCompare.
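A minimal sketch of that case, assuming the "en-GB" locale mentioned above. The typed array and the sameWord helper are made up for illustration; the sensitivity option is a standard localeCompare option.

```javascript
// Pretend these are words a user typed.
const typed = ['naïve', 'cafe', 'naive', 'café'];

// Default comparison: accented spellings sort right next to the plain ones.
const sorted = [...typed].sort((a, b) => a.localeCompare(b, 'en-GB'));
console.log(sorted); // [ 'cafe', 'café', 'naive', 'naïve' ]

// sensitivity: 'base' ignores accents and case, so both spellings count as equal.
const sameWord = (a, b) =>
  a.localeCompare(b, 'en-GB', { sensitivity: 'base' }) === 0;

console.log(sameWord('cafe', 'café'));  // true
console.log(sameWord('cafe', 'naive')); // false
```

Note that without the option the two spellings are kept adjacent but are not reported as equal, so which form to use depends on whether you are sorting or matching.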


Similarly, there are a number of words in English which use letters that are not in the standard alphabet. For example, the word mediæval. “æ” is an actual letter, it’s not a ligature. It’s just that it’s mainly been replaced by the digraph “ae”. Again, in a program, I would want to treat mediaeval and mediæval as the same thing.


Similarly, there are UK names and placenames that use letters not in the standard modern alphabet. Irish Gaelic (18-letter alphabet) makes heavy use of acute accents on certain characters, and very occasionally a dot to indicate the séimhiú (afaik usage of the dot has almost completely disappeared tho).

Béal Feirste

The Scottish Gaelic alphabet (23 letters) includes grave-accented versions of vowels.

Abhainn a’ Ghlinne Mhòir

Welsh uses a circumflex to indicate long vowels (and also uses both acute and grave accents).

Caersŵs

If I have a list of placenames, I likely want those in the correct alphabetical order, regardless of whether their spelling uses letters that do not exist in the standard alphabet.
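Here is a rough sketch of the placename case. Bedford and Cardiff are added purely for illustration so there is something to sort against, and "en-GB" is an assumption; Intl.Collator exposes the same kind of locale-aware comparison as localeCompare, as a reusable function.

```javascript
// Two names from above plus two illustrative unaccented ones.
const places = ['Caersŵs', 'Bedford', 'Béal Feirste', 'Cardiff'];

// Default sort: é has a higher character code than any plain letter,
// so "Béal Feirste" ends up after "Bedford".
const byCode = [...places].sort();

// Locale-aware sort: é is ordered as an e, so "Béa..." comes before "Bed...".
const collator = new Intl.Collator('en-GB');
const byAlphabet = [...places].sort(collator.compare);

console.log(byCode);     // [ 'Bedford', 'Béal Feirste', 'Caersŵs', 'Cardiff' ]
console.log(byAlphabet); // [ 'Béal Feirste', 'Bedford', 'Caersŵs', 'Cardiff' ]
```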
