


A muggle's guide to AWK arrays: 4

In this final part of the series (see also parts 1, 2, 3), I show how GNU AWK (gawk) 4 has made sorting the output from an array both easier and more tweak-able than in other AWK versions.

As explained in part 1, when you scan the contents of an array (or part of an array) with a loop, the index and value strings aren't in any particular order. For example, suppose I build an array "a" from this space-separated table, which I'll call "sample":

1 watermelon
2 cherry
3 apple
4 peach
5 banana

I'll define the fruit names (in field 2) as the index strings and the numbers (in field 1) as the value strings (a[$2]=$1). When I scan the array the sorting is higgledy-piggledy:

[Image: scan1]

There were two ways to sort this output in the past. The simple way was to pipe the output to the sort command with appropriate options. Alternatively, you could call up GNU AWK's "asort" or "asorti" functions, whose correct use was (and is) a bit complicated.

With GNU AWK (gawk) 4 your simplified sorting helper is PROCINFO["sorted_in"]. This element of the built-in PROCINFO array can be set to one of 11 pre-set values:

"@ind_str_asc"
"@ind_str_desc"
"@ind_num_asc"
"@ind_num_desc"

"@val_str_asc"
"@val_str_desc"
"@val_num_asc"
"@val_num_desc"

"@unsorted"
"@val_type_asc"
"@val_type_desc"

As you might guess, "ind" and "val" refer to the index and value strings of the array, respectively. "str" means "string", "num" means "number", "asc" means "ascending" and "desc" means "descending". If I want to sort the array output in alphabetical order of index strings (fruit names, field 2), I choose "@ind_str_asc", which means "index strings, treated as strings, in ascending order":

awk 'BEGIN {PROCINFO["sorted_in"]="@ind_str_asc"} \
{a[$2]=$1} END {for (i in a) print i, a[i]}' sample

[Image: scan2]

To get reverse alphabetical order of the index strings I use "@ind_str_desc":

[Image: scan3]

Notice that I put the PROCINFO["sorted_in"]=... statement in the BEGIN section. That's my habit, but you could also put it in the END section, or even in the main action of the command, so long as it's set before the "for" loop runs:

[Image: scan4]

To sort the output in increasing and decreasing numerical orders of the value strings (the numbers, field 1):

[Image: scan5]

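In these examples I treated strings as strings ("str") and numbers as numbers ("num"), which is logical. What would happen if I made a mistake and sorted the index strings (fruit names) as numbers? Nothing different: the GNU AWK manual says that in a numeric sort, non-numeric values are "positioned" like zero, so the fruit names all tie, and gawk appears to settle the ties in string order. If I sort the value strings (which are numbers) by mistake as strings, I still get what looks like a numeric sort, but only because the values 1 to 5 are single digits, for which string order and numeric order coincide. With multi-digit values a string sort would put "10" before "9".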

Things get interesting if the index or value strings are of mixed type. Here's the table "mixed":

1 watermelon
2 cherry
apple 3
peach 4
banana 5

With the array again defined as a[$2]=$1, a numerical sort of the index strings puts the strings first, and a string sort puts the numbers first:

[Image: scan6]

AWK behaves similarly when the value strings are of mixed type. The "@val_type..." setting behaves like the "@val_str..." one (numbers come first):

[Image: scan7]

The last pre-set value for PROCINFO["sorted_in"], "@unsorted", doesn't do any sorting at all. For more information about this method of scanning and sorting an array, see the relevant page in the GNU AWK 4 manual.


Last update: 2019-09-20
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License