19 August 2022

Correctly sorting data containing accented characters (a.k.a. the Côte d'Ivoire and Türkiye issue)

This will be short and sweet.  Hopefully it will be useful to some of you,

As you may or may not know, the country formerly known as Turkey recently changed its official name to Türkiye, even in English.  However, when sorting by country name, it should appear between Tunisia and Turkmenistan rather than later in the list.  (Official U.N. sorted list can be found here.)

In the weeks following this change, I was asked on at least five separate occasions to look at sorting issues that arose.  Please accept my apologies in advance if you already know what I'm about to write.  I, myself, thought that it was common knowledge, but recent experience has shown otherwise.

Below, I'll explain how I correctly sort by country within Oracle.  This technique can also apply to any other data (names, etc.) that need to take account of accented characters when sorting.  I hope that it's useful to some of you.

Note: What comes below applies principally to Latin alphabets, I have not tested on non-Latin alphabets, but I suspect that a similar approach exists.

First, demonstrating the incorrect sort

This is what some use by default, the problem is that accented characters are sorted after non-accented letters.  

So we can see that both Côte d'Ivoire and Türkiye are incorrectly sorted here and are placed after their peers.

SQL> select country_name from vw_temp_countries order by country_name; 

COUNTRY_NAME                                                                     
Cook Islands
Costa Rica
Croatia
Cuba
Curaçao
Cyprus
Czechia
Côte d'Ivoire
Tunisia
Turkmenistan
Tuvalu
Türkiye

13 rows selected. 


Now, a correct sort

With this sort, accented characters are taken into account and sorted appropriately.  The Oracle NLSSORT function is used to ensure the correct sort order.  The NLS_SORT parameter is set to swiss, as this setting accommodates most languages with Latin characters.

Here we can see that both Côte d'Ivoire and Türkiye are correctly sorted and are placed in their correct order.

SQL> select country_name from vw_temp_countries 
          order by nlssort(country_name, 'NLS_SORT = swiss');

COUNTRY_NAME                                                                     
Cook Islands
Costa Rica
Côte d'Ivoire
Croatia
Cuba
Curaçao
Cyprus
Czechia
Tunisia
Türkiye
Turkmenistan
Tuvalu

13 rows selected. 



That's it, as I promised "short and sweet" 😀

No comments:

Post a Comment