banner

For a full list of BASHing data blog posts see the index page.     RSS


A muggle's guide to AWK arrays: 2

Part 1 of this series looked at a couple of AWK array basics, namely index strings and value strings. This post deals with a very common use of AWK arrays, namely filtering or modifying a file based on information in another file.


Getting multiple files into a single AWK command. AWK commands typically have the form

awk 'pattern {action}' file

and AWK will only carry out the action on a line in the file if the line matches the pattern.

Further, AWK can process more than one file in a single command. It processes the first file given as an argument, then smoothly moves on to the second file and so on:

awk '[some command in here]' file1 file2...

These two rules can be combined to get AWK to build an array from one file, then use the array in a pattern for action on the second file. The standard way of doing this looks like:

awk 'FNR==NR {build an array from file1; next} \
pattern involving array {action on file2}' file1 file2

It's a command structure that puzzles some AWK users so I'll break it down. There are two action/pattern parts to this command. The pattern in the first part is FNR==NR. FNR is a built-in AWK variable that counts the number of lines processed in the currently viewed file. NR counts the total number of lines that AWK processes in the whole command no matter how many files are listed as arguments.

Suppose "file1" has 3 lines and "file2" has 4 lines. Here are the running tallies of FNR and NR as AWK processes first "file1", then "file2":

FNR    NR
1      1      file1 being processed      FNR equals NR
2      2      file1 being processed
3      3      file1 being processed
1      4      file2 being processed      FNR no longer equals NR
2      5      file2 being processed
3      6      file2 being processed
4      7      file2 being processed

AWK starts off processing the first line in "file1". It checks for the pattern FNR==NR and as you can see in the table above, FNR does equal NR for that first line of "file1". The first action can proceed, which is to build an array based on the contents of that first line, then jump to the next line without doing anything else in the command (next).

The same happens with line 2 and line 3. When AWK runs out of lines in "file1", it smoothly moves to "file2" and starts processing the pattern, but FNR==NR no longer applies. FNR is 1 but NR is 4 and the {build array; next} action doesn't happen. Instead AWK moves to the second part of the command and processes the first line in "file2" (and subsequent lines) according to the pattern/action pattern involving array {action on file2}.


Is something from file1 in file2? This is a grep-like use of AWK, similar to a grep -f command. AWK is much faster than grep with big files, however, as demonstrated in an earlier BASHing data post. Here are two small demonstration files:

file1:
 
Denmark
Finland
Iceland
Norway
Sweden

file2 (tab-separated):

CountryPopulationLand Area (sq km)
Germany82438639348560
United Kingdom66959016241930
France65480710547557
Italy59216525294140
Spain46441049498800
Ukraine43795220579320
Poland38028278306230
Romania19483360230170
Netherlands1713290833720
Belgium1156278430280
Greece11124603128900
Czechia1063058977240
Portugal1025466691590
Sweden10053135410340
Hungary965536190530
Belarus9433874202910
Austria876620182409
Serbia873340787460
Switzerland860825939516
Bulgaria6988739108560
Denmark577522442430
Finland5561389303890
Hungary965536190530
Belarus9433874202910
Austria876620182409

For the question "Are any of the countries in file1 included in file2?" the command is:

awk -F"\t" 'FNR==NR {a[$0]; next} $1 in a' file1 file2

Here the first action builds an array "a" from "file1" using the whole line ($0) as an index string. No value string is defined, so the value strings are empty. In the second part of the command (which only works on "file2") the pattern is $1 in a, which means The string in field 1 of "file2" is also an index string in array "a". No action is specified, so by default AWK prints any line in "file2" that matches the pattern. The results will be in country order as found in "file2":

arr2-1

I can also get the lines from "file2" where field 1 is not an index string in the array, by writing !($1 in a) as the pattern:

awk -F"\t" 'FNR==NR {a[$0]; next} !($1 in a)' file1 file2

arr2-2

1 file, 2 passes. You can give AWK the same file twice as separate arguments. That means you can process a file once to build an array, then go through the same file a second time to do something with the information stored in the array:

awk 'FNR==NR {build an array from file; next} \
pattern involving array {action on file}' file file

In a 2018 BASHing data post I showed how this trick was ideal for finding partial duplicates. For example, in the tab-separated file "table" you could look for records in which fields 2,5 and 8 were duplicated:

awk -F"\t" 'FNR==NR {a[$2,$5,$8]++; next} a[$2,$5,$8]>1' table table

In the first pass, AWK builds an array "a" from "table" using a combination of fields 2, 5 and 8 as an index string, and keeps a count (++) of the unique 2-5-8 combinations. In the second pass, AWK checks the array to see if the value string in "a" (the count from the first pass) is greater than one for fields 2, 5 and 8 in the current line. If yes, that line is printed.

The commas separating the fields in the index string tell AWK to use its own, hidden separator (it's the character with hex value 1c) to link the field 2, field 5 and field 8 strings. We don't need to know what that separator is, because the same separator is also used in the second part of the command to link field contents. You could make the separator explicit, if you wanted, for example by building "a" as a[$2 FS $5 FS $8]++ or a[$2":"$5":"$8]++. Commas are simpler.


Last update: 2019-07-12
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License