For a full list of BASHing data blog posts see the index page.

Add an issues field to a data table

Some of the data tables I audit have an "Issues" field that flags problems with individual records. It's a kind of self-reporting "Look! I've got an error!" feature.

I wondered if I could do something similar with my favourite data-processing language, AWK, and this post documents my tinkerings. Although the 10-record sample table I've used for demonstration purposes could be checked by eye, many of the tables I work with have tens of thousands of records and more than 100 fields. The commands shown here are a little fiddly but work well, regardless of table size.

Data table. In the tab-separated table (TSV) "countries", Area is in km² and PopDens is in persons/km².

Country	Pop	PopYr	Area	PopDens
Afghanistan	38,928,346	2020	652,860	60
El Salvador	6,486,205	2020	20,720	313
Slovenia	2,078,938	2200	20,140	93
USA	331,002,651	2020	9,147,420	36
Yemen	29,825,964	2020	527,970	56
Chile	19,116,201	2020	743,532	26
United States	331,002,651	2020	9,147,420	36
San Marino	33,931	2020		566
Nigeria	200,963,599	2019	910,770	221
Canada	37,742,154	2020	9,093,510	4

Issues table. I'll back up the original table and create a working one ("countriesI") at the same time by paste-ing the field name "Issues" as the header of a blank sixth field. Adding a header+blank field this way with paste can be done anywhere in a table, as explained in an earlier BASHing data post. The screenshot below shows that after the header, all the records in "countriesI" end in a blank field delimited by a tab (catted as "^I").

paste countries <(echo "Issues") > countriesI

Get field names. In several of the tests below I'll be adding something like "Problem with field 5" to Issues. It would be a lot more friendly to report "Problem with PopDens". To do that I'll store the first five field names in the array "fld", so I can call them up later in the command. The field numbers are the array's index strings and the field names are the array's value strings. In the next screenshot the array-building is the main action, and in an END statement I print out the index strings and their corresponding value strings:

awk -F"\t" '{for (i=1;i<=(NF-1);i++) \
{if (NR==1) fld[i]=$i}} \
END {for (i in fld) print i, fld[i]}' countriesI

Test for missing data items. The "countries" table shouldn't have any blanks. The following command checks for missing data items field by field in "countriesI" and reports blanks in Issues. The Issues field is redefined in this command if there's a blank found, so AWK needs to be given an output field separator as well as an input one (-F"\t" -v OFS="\t").

The test for blanks is actually a test for absence of an alphanumeric character ($i !~ /[[:alnum:]]/, and it works on each of the 5 fields defined in the loop for (i=1;i<=(NF-1);i++). The final "1" ensures that every line is printed; it's short for "if 1 {print}" and the condition "1" is always true.

Because other tests will be done on the same table (see below), I use GNU AWK's -i inplace feature to edit "countriesI" in place. The semicolon after the Issues entry is there to separate the latest note from any others found in later tests on the same record. For the same reason, field 6 is redefined as the old field 6 concatenated with the new note.

awk -F"\t" -v OFS="\t" -i inplace \
'{for (i=1;i<=(NF-1);i++) \
{if (NR==1) fld[i]=$i; \
if (NR>1 && $i !~ /[[:alnum:]]/) \
$6=$6"Missing "fld[i]";"}} 1' countriesI

Right-hand side of edited "countriesI":

Test for incorrect date. The population figures in the table should all date from 2020, so the next test simply looks at field 3 to confirm that "2020" is the entry. If not, there's a note in Issues.

awk -F"\t" -v OFS="\t" -i inplace \
'{for (i=1;i<=(NF-1);i++) {if (NR==1) fld[i]=$i} \
{if (NR>1 && $3 != "2020") \
$6=$6fld[3]" is not 2020;"}} 1' countriesI

Right-hand side of edited "countriesI":

Test for incorrect calculation. Does the table have the correct population density, which should be Pop divided by Area, rounded to the nearest whole number? (Spoiler: it's wrong for Slovenia.)

A test that divides field 2 by field 4 will be meaningless if either field is empty (and will give a fatal error "attempt to divide by zero" if field 4 doesn't contain a non-zero number), so a condition for the test is that neither field is blank ($2 != "" && $4 != "").

Another quirk of the data is that the figures contain comma separators. These have to be removed temporarily, in this case with gensub. The comma-less figure from field 2 is sprintfed as variable "m", and from field 4 as "n".

The quotient "m/n" is formatted as a whole number with sprintf "%.f" and the result stored in the variable "x" for convenience. If field 5 is not equal to the calculated quotient (if ($5 != x)), the Issues field gets a note with both the correct and the incorrect values for the quotient.

awk -F"\t" -v OFS="\t" -i inplace \
'{for (i=1;i<=NF;i++) {if (NR==1) {fld[i]=$i}} \
{if (NR>1 && $2 != "" && $4 != "") \
{m=sprintf("%d",gensub(",","","g",$2)); \
n=sprintf("%d",gensub(",","","g",$4)); \
x=sprintf("%.f",m/n); \
if ($5 != x) $6=$6fld[2]"/"fld[4]" is "x", not "$5";"}}} 1' countriesI

Right-hand side of edited "countriesI":

Test for duplicate records. This command is a bit elaborate but delivers a neat result. The assumption here is that field 1 contains a unique identifier for the record, which is true in this case, so field 1 can be ignored when looking for duplicates.

AWK processes the file "countriesI" twice. In the first pass (FNR==NR) it creates an array "dup" whose index strings are a concatenation of fields 2 through 5. The "++" after the array statement means that AWK keeps a count of the number of times each unique concatenation of fields appears in the file.

In the second pass (next) AWK first sets the variable "n" to the count found by "dup" for that particular concatenation of fields. If the count is greater than 1 (indicating duplicates), then AWK starts a counter "c" at that record. It then rebuilds field 6 with the the number (from "c") out of "n" for which that record is a duplicate, followed by the contents of the first three of the duplicated fields. If there's more than one set of duplicated records in the table, those quoted fields will identify which duplicate ("1 of 2", "2 of 2") goes with which.

awk -F"\t" -v OFS="\t" -i inplace \
'FNR==NR {dup[$2,$3,$4,$5]++; next} \
{n=dup[$2,$3,$4,$5]; \
{if (n>1) \
{++c; $6=$6"Duplicate "c" of "n" with "$2" | "$3" | "$4" etc;"}}} 1' \
countriesI countriesI

Right-hand side of edited "countriesI":

Summing up. Yes, I can do tests with AWK and add "issues" notes to a data table, but putting code like the above into a script would only be worthwhile if the table to be checked was in a standardised form and I could apply a standardised group of tests. Alas, that's not the case with my auditing work, where every table is different and the problems I look for depend on the nature of the data supplied!

Last update: 2020-05-20
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License