
For a list of BASHing data 2 blog posts see the index page.


How to force a preferred array sort in AWK

I use GNU AWK 5 (5.2.1) and there's something a little odd about it, as noted by Redditor "Paul_Pedant" and as demonstrated below.

According to the GNU AWK manual, if you loop through an array with "for (i in array)", the order in which the index strings are scanned is undefined:

By default, the order in which a ‘for (indx in array)’ loop scans an array is not defined; it is generally based upon the internal implementation of arrays inside awk.

Because the scanning order is not defined, GNU AWK offers a way to force certain commonly used orders, as shown in a previous BASHing data post. You can order the scan by index string or value string, by numbers or letters, and either ascending or descending. To do this you add a PROCINFO["sorted_in"]=... statement to your AWK command.
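As a quick reminder of what that looks like, here's a minimal sketch using two of the ready-made orders, "@val_num_desc" and "@ind_str_asc". The sample data and the shell variable names are made up for illustration, and I call "gawk" explicitly because PROCINFO["sorted_in"] only has this effect in GNU AWK.

```shell
# Hypothetical sample input, written to a temporary file
tmp=$(mktemp)
printf '%s\n' 'melons 2' 'apples 3' 'parsnips 5' 'beetroots 6' 'apples 1' > "$tmp"

# "@val_num_desc": scan by value, compared as numbers, largest first
by_value=$(gawk 'BEGIN {PROCINFO["sorted_in"]="@val_num_desc"}
{a[$1]+=$2}
END {for (i in a) print i, a[i]}' "$tmp")

# "@ind_str_asc": scan by index string, compared as text, A to Z
by_index=$(gawk 'BEGIN {PROCINFO["sorted_in"]="@ind_str_asc"}
{a[$1]+=$2}
END {for (i in a) print i, a[i]}' "$tmp")

printf 'By value:\n%s\n\nBy index:\n%s\n' "$by_value" "$by_index"
rm -f "$tmp"
```

The only difference between the two commands is the string assigned to PROCINFO["sorted_in"] in the BEGIN block.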

Well, the scanning order might be undefined, but what I find when I loop through an array by its index strings is that the output order is the same regardless of the input order of those strings. Here are just a few examples; try a few of your own as tests:

[Screenshot: terminal examples of array scans, not reproduced here]
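If you'd like to run such a test yourself, here's a minimal sketch. The shell function name "scan_order" and the variable names are my own inventions for this example. It feeds the same four strings to AWK in two different orders and prints the scan order each time. I'm not showing expected output, because the scan order is undefined and could differ between AWK versions (or even between sets of strings).

```shell
# Print the order in which AWK scans a set of index strings,
# given those strings as arguments in some input order
scan_order() {
    printf '%s\n' "$@" | awk '{a[$1]=1} END {for (i in a) print i}'
}

order1=$(scan_order beetroots parsnips apples melons)
order2=$(scan_order melons apples parsnips beetroots)

printf 'Input order 1 gives:\n%s\n\nInput order 2 gives:\n%s\n' "$order1" "$order2"

# Report whether the two scan orders match
if [ "$order1" = "$order2" ]; then
    echo "Same scan order"
else
    echo "Different scan order"
fi
```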

So... at least in my tests, AWK follows its own internal rules on ordering and applies them consistently to a given set of index strings, regardless of their original order. But there's a way to get around the rules.

Here's a table of produce items and their counts, "fruveg":

apples 3
melons 2
beetroots 6
melons 2
beetroots 2
melons 2
beetroots 4
parsnips 2
parsnips 5
beetroots 6
apples 3
parsnips 3
apples 1
apples 2
parsnips 2

To tally up the counts by item, I can do:

awk '{a[$1]+=$2} END {for (i in a) print i, a[i]}' fruveg

[Screenshot: output of the tallying command, not reproduced here]

Now suppose I want to force the output order so that vegetables come first, followed by fruits, with both in alphabetical order. I can't do that with the ready-made PROCINFO["sorted_in"]=... options. It's possible to write a purpose-built AWK function that the PROCINFO["sorted_in"]=... statement can refer to, but I find the AWK manual's instructions on how to do this hard to understand.
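For what it's worth, here's a minimal sketch of what such a function might look like, based on my reading of the GNU AWK manual: the function receives the index and value of two elements (i1, v1, i2, v2) and returns a negative number, zero or a positive number to say which element comes first. The function name "veg_first" and the lookup array "grp" are hypothetical names chosen for this example, and the vegetable/fruit grouping still has to be typed in by hand. This needs GNU AWK, so I call "gawk" explicitly.

```shell
# The "fruveg" table, written to a temporary file
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
apples 3
melons 2
beetroots 6
melons 2
beetroots 2
melons 2
beetroots 4
parsnips 2
parsnips 5
beetroots 6
apples 3
parsnips 3
apples 1
apples 2
parsnips 2
EOF

sorted=$(gawk '
# Comparison function: lower group number first, then index A to Z
function veg_first(i1, v1, i2, v2) {
    if (grp[i1] != grp[i2]) return grp[i1] - grp[i2]
    return (i1 < i2) ? -1 : (i1 > i2)
}
BEGIN {
    # Hand-assigned groups: 0 = vegetable, 1 = fruit
    grp["beetroots"]=0; grp["parsnips"]=0; grp["apples"]=1; grp["melons"]=1
    PROCINFO["sorted_in"]="veg_first"
}
{a[$1]+=$2}
END {for (i in a) print i, a[i]}' "$tmp")

printf '%s\n' "$sorted"
rm -f "$tmp"
```

With the table above this prints beetroots 18, parsnips 12, apples 9 and melons 6, in that order.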

An easier workaround (for me, anyway) is to put the index strings into an array created by AWK's split function. AWK will assign index numbers to the strings in the order I provide them:

awk 'BEGIN {n=split("beetroots parsnips apples melons",b)
for (i=1;i<=n;i++) print i,b[i]}'

1 beetroots
2 parsnips
3 apples
4 melons

If I add this to my tallying-up command, I get just the output order I want:

awk 'BEGIN {n=split("beetroots parsnips apples melons",b)}
{a[$1]+=$2}
END {for (i=1;i<=n;i++) print b[i],a[b[i]]}' fruveg

beetroots 18
parsnips 12
apples 9
melons 6

Rather than looping through array "a" with "for (i in a)" in the END block, I'm stepping through array "b" in numerical order and using each of its value strings as an index into "a", so that a[b[i]] returns the tally for item b[i].

I won't pretend this is a general solution or that it will be so simple for longer lists of index strings, but it's a workable way to force a preferred sort order.
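For longer lists, one possible variation (a sketch only, not something I've tested widely) is to keep the preferred order in a separate file, one index string per line, and read that file first with the NR==FNR idiom. The file names "order" and "data" and the extra "turnips" line are invented for this example. Items missing from the order file are printed at the end, in whatever order AWK scans them.

```shell
# Hypothetical files in a temporary directory
dir=$(mktemp -d)
printf '%s\n' beetroots parsnips apples melons > "$dir/order"
printf '%s\n' 'apples 3' 'melons 2' 'beetroots 6' 'parsnips 2' \
    'beetroots 2' 'turnips 4' > "$dir/data"

result=$(awk '
# First file: remember the preferred order in array b
NR==FNR {b[++n]=$1; next}
# Second file: tally the counts by item in array a
{a[$1]+=$2}
END {
    # Print listed items in the preferred order, removing each from a
    for (i=1; i<=n; i++)
        if (b[i] in a) {print b[i], a[b[i]]; delete a[b[i]]}
    # Print whatever is left (items not in the order file)
    for (k in a) print k, a[k]
}' "$dir/order" "$dir/data")

printf '%s\n' "$result"
rm -rf "$dir"
```

The "if (b[i] in a)" check skips items that are in the order file but never appear in the data. Note that the NR==FNR test assumes the order file isn't empty.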




Last update: 2024-10-18
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License