
How to ignore everything but numbers
I wanted to check a table similar to the tab-separated one below (which I'll call "survey") but which had many more fields and a couple of thousand records. The four fields in "survey" are a unique record ID (field 1), a pipe-separated tally of specimens by sex (field 2), a pipe-separated tally of specimens by life stage (field 3) and a total number of specimens (field 4). In some records there's no pipe in field 2 or 3 and only one of the possibilities is tallied.
By "check" I mean that I wanted to ensure that the males + females total in field 2 equalled the number of adults in field 3, and that the adults + juveniles in field 3 equalled the field 4 total.
001	6 males | 5 females	11 adults | 5 juveniles	16
002	2 males | 1 female	4 adults | 4 juveniles	8
003	2 females	3 adults | 4 juveniles	7
004	1 male | 1 female	2 adults | 1 juvenile	3
005	3 females	3 adults | 5 juveniles	8
006	1 male | 4 females	5 adults	5
007	1 male | 3 females	4 adults | 2 juveniles	6
008	1 male | 1 female	2 adults |1 juvenile	3
009	6 males | 10 females	16 adults | 37 juveniles	54
010	2 males | 2 females	4 adults	4
The check was easy to do with AWK, because when AWK does arithmetic on a string it uses only the number at the start of the string and ignores whatever follows.
For example, I'll split field 2 using "|" as separator and put the resulting pieces into an array "a". I'll then print the separate pieces followed by the arithmetical total of the pieces:
awk -F"\t" '{split($2,a,"|"); print $1,a[1],a[2],a[1]+a[2]}' OFS=" | " survey
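Records with no pipe in the field still work with this approach. The sketch below (using two strings copied from field 2 of "survey", run in a BEGIN block so no input file is needed) prints the number of pieces returned by split and the sum. When there is only one piece, a[2] is unset and counts as zero:

```shell
# One piece: a[2] does not exist, so it contributes 0 to the sum
one=$(awk 'BEGIN { n = split("2 females", a, "|"); print n, a[1] + a[2] }')
# Two pieces: both leading numbers are used
two=$(awk 'BEGIN { n = split("6 males | 5 females", a, "|"); print n, a[1] + a[2] }')
echo "$one"    # 1 2
echo "$two"    # 2 11
```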

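Notice that in doing the sum, AWK has ignored the non-numeral bits following the numbers. The number has to come first in each piece, though (leading spaces are allowed). If other text precedes the number, AWK treats the piece as zero. Of the two commands here, the first prints 9 and the second prints 0: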
printf "6 gazelles | 3 lions" | awk '{split($0,a,"|"); print a[1]+a[2]}'
printf "I saw 6 gazelles | and 3 lions" | awk '{split($0,a,"|"); print a[1]+a[2]}'
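If the numbers really are buried in the text, one possible workaround (not the method used in this post) is to pull out the first run of digits in each piece with AWK's match function and the RSTART and RLENGTH variables it sets. A minimal sketch, assuming each piece contains at most one whole number of interest:

```shell
embedded=$(printf 'I saw 6 gazelles | and 3 lions\n' | awk '{
    n = split($0, a, "|")
    total = 0
    for (i = 1; i <= n; i++)
        if (match(a[i], /[0-9]+/))                    # first run of digits in this piece
            total += substr(a[i], RSTART, RLENGTH)    # add just those digits
    print total
}')
echo "$embedded"    # 9
```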

A GNU AWK (gawk) trick is to use the strtonum function to convert a string beginning with a number to just the number. Again, this won't extract a number from the middle of a string. (The commands below assume that "awk" on your system is gawk.)
printf "216 gazelles" | awk '{print strtonum($0)}'
printf "I saw 216 gazelles" | awk '{print strtonum($0)}'
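Where gawk isn't available, adding zero to the string should force the same kind of conversion in any POSIX AWK. This is a sketch of that alternative, not something the checks in this post depend on:

```shell
# Number at the start of the string: converted to 216
lead=$(printf '216 gazelles\n' | awk '{print $0 + 0}')
# Number in the middle of the string: the string converts to 0
mid=$(printf 'I saw 216 gazelles\n' | awk '{print $0 + 0}')
echo "$lead"    # 216
echo "$mid"     # 0
```

One difference to keep in mind: strtonum treats a string with a leading "0" as octal and a leading "0x" as hexadecimal, so the two approaches can disagree on zero-padded values.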

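So my check on "males + females = adults?" would be the following. The strtonum call matters, because comparing the sum directly with the string in b[1] would be a string comparison, not a numeric one: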
awk -F"\t" '{split($2,a,"|"); split($3,b,"|"); if ((a[1]+a[2]) != strtonum(b[1])) print}' survey
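To see why the right-hand side needs converting, here is a small sketch that compares the number 11 with the first piece of a field 3 string, first as it is and then after forcing it to a number. It uses "+ 0" in place of strtonum so that it runs in any AWK; the printed 0 and 1 are AWK's false and true:

```shell
# b[1] is the string "11 adults ", so this is a string comparison and fails
as_string=$(awk 'BEGIN { split("11 adults | 5 juveniles", b, "|"); print (11 == b[1]) }')
# b[1] + 0 is the number 11, so this is a numeric comparison and succeeds
as_number=$(awk 'BEGIN { split("11 adults | 5 juveniles", b, "|"); print (11 == b[1] + 0) }')
echo "$as_string"    # 0
echo "$as_number"    # 1
```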

And my check on "adults + juveniles = total specimens?" would be:
awk -F"\t" '{split($3,a,"|"); if ((a[1]+a[2]) != $4) print}' survey
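The two checks could also be run in a single pass, with a note saying which one failed. This is only a sketch: it builds a three-record sample of "survey" in a temporary file, the variable names (s, g, sexes, stages) and the message wording are my own, and it uses "+ 0" rather than strtonum so that it doesn't depend on gawk:

```shell
# Small tab-separated sample: three records copied from "survey"
tmp=$(mktemp)
printf '001\t6 males | 5 females\t11 adults | 5 juveniles\t16\n' > "$tmp"
printf '002\t2 males | 1 female\t4 adults | 4 juveniles\t8\n' >> "$tmp"
printf '009\t6 males | 10 females\t16 adults | 37 juveniles\t54\n' >> "$tmp"

report=$(awk -F"\t" '{
    split($2, s, "|"); split($3, g, "|")
    sexes  = s[1] + s[2]          # males + females
    stages = g[1] + g[2]          # adults + juveniles
    if (sexes != (g[1] + 0)) print $1 ": sexes " sexes " vs adults " (g[1] + 0)
    if (stages != ($4 + 0)) print $1 ": stages " stages " vs total " ($4 + 0)
}' "$tmp")
echo "$report"
rm -f "$tmp"
```

With this sample, the report flags record 002 (3 by sex against 4 adults) and record 009 (53 by life stage against a total of 54).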

See here for more cautions on using strtonum.
Last update: 2025-05-23
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.