
For a list of BASHing data 2 blog posts see the index page.


How to ignore everything but numbers

I wanted to check a table similar to the tab-separated one below (which I'll call "survey") but which had many more fields and a couple of thousand records. The four fields in "survey" are a unique record ID (field 1), a pipe-separated tally of specimens by sex (field 2), a pipe-separated tally of specimens by life stage (field 3) and a total number of specimens (field 4). In some records there's no pipe in field 2 or 3, because only one of the possibilities is tallied.

By "check" I mean that I wanted to ensure that the males + females total in field 2 equalled the number of adults in field 3, and that the adults + juveniles in field 3 equalled the field 4 total.

001	6 males | 5 females	11 adults | 5 juveniles	16
002	2 males | 1 female	4 adults | 4 juveniles	8
003	2 females	3 adults | 4 juveniles	7
004	1 male | 1 female	2 adults | 1 juvenile	3
005	3 females	3 adults | 5 juveniles	8
006	1 male | 4 females	5 adults	5
007	1 male | 3 females	4 adults | 2 juveniles	6
008	1 male | 1 female	2 adults |1 juvenile	3
009	6 males | 10 females	16 adults | 37 juveniles	54
010	2 males | 2 females	4 adults	4

The check was easy to do with AWK because, when AWK does arithmetic on a string, it uses the number at the start of the string and ignores everything after it as noise.
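Here's a minimal sketch of that rule on its own, using two strings in the style of field 2. The helper name "sum_pair" is made up for this illustration. A missing or non-numeric piece simply counts as zero:

```shell
# sum_pair: add two strings the way AWK sees them
# (hypothetical helper name, for illustration only)
sum_pair() {
    awk -v x="$1" -v y="$2" 'BEGIN { print x + y }'
}

sum_pair "6 males" " 5 females"   # prints 11
sum_pair "2 females" ""           # prints 2 (the empty string counts as 0)
```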

For example, I'll split field 2 using "|" as separator and put the resulting pieces into an array "a". I'll then print the separate pieces followed by the arithmetical total of the pieces:

awk -F"\t" '{split($2,a,"|"); print $1,a[1],a[2],a[1]+a[2]}' OFS=" | " survey


Notice that in doing the sum, AWK has ignored the non-numeral bits following the numbers. That wouldn't have been the case if other text had come before the numbers, because a string that doesn't begin with a number is treated as 0:

printf "6 gazelles | 3 lions" | awk '{split($0,a,"|"); print a[1]+a[2]}'
 
printf "I saw 6 gazelles | and 3 lions" | awk '{split($0,a,"|"); print a[1]+a[2]}'
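If the numbers really are buried in the text, one possible workaround is to pull out the first run of digits with AWK's match function before doing the sum. This is only a sketch, and it assumes unsigned whole numbers. The names "sum_embedded" and "first_int" are made up for this illustration:

```shell
# sum_embedded: split the input on "|" and add the first whole number
# found anywhere in each of the first two pieces
sum_embedded() {
    awk '
    # first_int: return the first run of digits in s as a number, or 0 if none
    function first_int(s) {
        return match(s, /[0-9]+/) ? substr(s, RSTART, RLENGTH) + 0 : 0
    }
    { split($0, a, "|"); print first_int(a[1]) + first_int(a[2]) }'
}

printf "I saw 6 gazelles | and 3 lions" | sum_embedded   # prints 9
```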


A GNU AWK (gawk) trick is to convert a string beginning with a number to just the number, with the strtonum function. Again, this won't extract a number from the middle of a string:

printf "216 gazelles" | awk '{print strtonum($0)}'
 
printf "I saw 216 gazelles" | awk '{print strtonum($0)}'
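Where gawk isn't available, adding zero forces the same leading-number conversion for plain decimal values in other AWKs. A minimal sketch follows, with "to_num" a made-up helper name. Note that strtonum also treats a leading "0x" as hexadecimal and a leading "0" as octal, which this sketch doesn't attempt:

```shell
# to_num: print the numeric value AWK assigns to each input line
# (hypothetical helper; "+ 0" forces string-to-number conversion)
to_num() {
    awk '{ print $0 + 0 }'
}

printf "216 gazelles\n" | to_num         # prints 216
printf "I saw 216 gazelles\n" | to_num   # prints 0
```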


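So my check on "males + females = adults?" would be as follows. The strtonum is needed because b[1] is a string like "11 adults ", and AWK compares a number with a string as strings, not as numbers: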

awk -F"\t" '{split($2,a,"|"); split($3,b,"|"); if ((a[1]+a[2]) != strtonum(b[1])) print}' survey


And my check on "adults + juveniles = total specimens?" would be:

awk -F"\t" '{split($3,a,"|"); if ((a[1]+a[2]) != $4) print}' survey
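The two checks could also be run in a single pass, with a note on which one failed. The sketch below is one way to do it, not the only one. It uses "+ 0" in place of strtonum so that it doesn't depend on gawk, and it builds a three-record sample file in the same layout as "survey" so that it can be run on its own. The name "check_survey" and the messages are made up for this illustration:

```shell
# Sample records in the same four-field, tab-separated layout as "survey"
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
    001 "6 males | 5 females" "11 adults | 5 juveniles" 16 \
    002 "2 males | 1 female" "4 adults | 4 juveniles" 8 \
    009 "6 males | 10 females" "16 adults | 37 juveniles" 54 > "$sample"

# check_survey: run both checks on the file named in $1
check_survey() {
    awk -F"\t" '{
        split($2, s, "|")   # s[1], s[2]: tallies by sex
        split($3, g, "|")   # g[1], g[2]: tallies by life stage
        # "+ 0" makes both sides numbers, so the comparison is numeric
        if (s[1] + s[2] != g[1] + 0) print $1 ": sexes do not sum to adults"
        if (g[1] + g[2] != $4 + 0)   print $1 ": stages do not sum to total"
    }' "$1"
}

check_survey "$sample"
rm -f "$sample"
```

With the sample above, record 002 fails the first check (2 + 1 is not 4) and record 009 fails the second (16 + 37 is not 54).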


See here for more cautions on using strtonum.


Next post:
2025-05-30   What a long, strange trip it's been


Last update: 2025-05-23
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License