
With POSIX compliant sorts and uniqs (GNU uniq is currently not compliant in that regard), there's a difference in that sort uses the locale's collating algorithm to compare strings (will typically use strcoll() to compare strings) while uniq checks for byte-value identity (will typically use strcmp())¹.

In some locales, especially on GNU systems, there are different characters that sort the same. For instance, in the en_US.UTF-8 locale on a GNU system, all the ①②③④⑤⑥⑦⑧⑨⑩ characters² and many others sort the same because their sort order is not defined. The 0123456789 Arabic digits sort the same as their Eastern Arabic-Indic counterparts (٠١٢٣٤٥٦٧٨٩).

For sort -u, ① sorts the same as ② and 0123 the same as ٠١٢٣, so sort -u would retain only one of each, while for uniq (not GNU uniq, which uses strcoll() (except with -f)), ① is different from ② and 0123 different from ٠١٢٣, so uniq would consider all 4 unique.
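A minimal sketch for checking the locale effect described above on your own system. The helper name dedup_counts is hypothetical, and the en_US.UTF-8 locale is assumed to be installed. The counts printed for that locale depend on the libc version and on whether uniq uses strcoll(), so no output is claimed for it here; in the C locale every comparison is byte-wise and all four lines stay distinct.

```shell
# dedup_counts is a hypothetical helper for this sketch: it counts the lines
# left by "sort -u" and by "sort | uniq" under the locale given as $1.
dedup_counts() {
    locale_name=$1
    # Two circled digits plus the same number in Western and Eastern Arabic digits.
    input=$(printf '%s\n' '①' '②' '0123' '٠١٢٣')
    with_u=$(printf '%s\n' "$input" | env LC_ALL="$locale_name" sort -u | wc -l)
    with_uniq=$(printf '%s\n' "$input" | env LC_ALL="$locale_name" sort |
        env LC_ALL="$locale_name" uniq | wc -l)
    # $(( ... )) strips the padding some wc implementations add.
    echo "$locale_name: sort -u=$(($with_u)) sort|uniq=$(($with_uniq))"
}

dedup_counts C            # byte-wise comparison: nothing collapses
dedup_counts en_US.UTF-8  # assumption: this locale exists on the system
```

If the two counts differ for a locale, sort and uniq disagree there about which lines are equal.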
sort | uniq existed before sort -u, and is compatible with a wider range of systems, although almost all modern systems do support -u; it is specified by POSIX. It's mostly a throwback to the days when sort -u didn't exist (and people don't tend to change their methods if the way that they know continues to work; just look at ifconfig).

The two were likely merged because removing duplicates within a file requires sorting (at least, in the standard case), and is an extremely common use case of sort. sort -u is also faster internally as a result of being able to do both operations at the same time (and because it doesn't require IPC (inter-process communication) between uniq and sort). Especially if the file is big, sort -u will likely use fewer intermediate files to sort the data.

On my system I consistently get results like this:

    $ dd if=/dev/urandom of=/dev/shm/file bs=1M count=100
    104857600 bytes (105 MB) copied, 8.95208 s, 11.7 MB/s
    $ time sort /dev/shm/file | uniq >/dev/null

sort -u also doesn't mask the return code of sort, which may be important (in modern shells there are ways to get this, for example, bash's $PIPESTATUS array, but this wasn't always true).
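The return-code point can be seen with a small sketch, assuming the path in the variable missing does not exist so that sort fails. The variable names are illustrative only. With the pipeline, the shell reports the status of the last command (uniq); with sort -u, it reports the status of sort itself.

```shell
# Assumption: this path does not exist, so sort cannot open it.
missing=/nonexistent/input.txt

# Pipeline: $? is the exit status of uniq, which succeeds on empty input,
# so the failure of sort is hidden (unless the shell has pipefail enabled).
sort "$missing" 2>/dev/null | uniq
pipeline_status=$?

# Single command: $? is the exit status of sort itself.
sort_u_status=0
sort -u "$missing" 2>/dev/null || sort_u_status=$?

echo "sort | uniq exit status: $pipeline_status"
echo "sort -u exit status:     $sort_u_status"

# In bash, "${PIPESTATUS[0]}" read immediately after the pipeline would
# recover the status of sort; it is not available in plain POSIX sh.
```

A script that checks $? after sort | uniq would therefore continue as if the sort had succeeded, while the same check after sort -u catches the failure.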
