For a list of BASHing data 2 blog posts see the index page.

Does string A contain string B? Ask AWK's index

String-in-string queries are pretty useful in data checking. For example, here's a PSV ("names") with full names and name components. Do all the component names agree with the full names, or was there a data entry or similar error?

An easy way to pull out any disagreements is with GNU AWK's index function:

awk -F"|" 'index($1,$2)==0 || index($1,$3)==0' names
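As a rough sketch of what the command does, the lines below build a small made-up PSV and run the same test on it. The file contents and the variable name "demo" are assumptions for illustration; they are not the "names" file used in this post.

```shell
# Made-up sample data (hypothetical; not the post's "names" file)
demo=$(mktemp)
printf '%s\n' \
  'Mary Ann Smith|Mary|Smith' \
  'John Q. Public|Jon|Public' \
  'Li Wei|Li|Wei' > "$demo"

# Print any line whose field 2 or field 3 does not occur in field 1
awk -F"|" 'index($1,$2)==0 || index($1,$3)==0' "$demo"
```

Only the second line should be printed, because "Jon" does not occur anywhere in "John Q. Public".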

index(itemA,itemB) looks for itemB within itemA. If it doesn't find it, it returns zero. If it does, it returns the character position in itemA at which itemB begins:
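A minimal sketch of those return values, using made-up strings. The helper "pos" is a hypothetical wrapper written for this example and is not part of AWK.

```shell
# pos: hypothetical helper that prints index(itemA,itemB) for two shell strings
pos() { awk -v a="$1" -v b="$2" 'BEGIN {print index(a,b)}'; }

pos "Mary Ann Smith" "Ann"     # "Ann" begins at character 6
pos "Mary Ann Smith" "Smith"   # "Smith" begins at character 10
pos "Mary Ann Smith" "Anne"    # not found, so 0
```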

In biological taxonomy, a scientific name is usually given as (for example) Felis catus Linnaeus, 1758, where "Felis catus" is the Latin name and "Linnaeus, 1758" is the Latin name's author and the year in which the Latin name was first published. In Darwin Core data tables these can be in separate fields, or the authorship can be included in the name field:

scientificName = Felis catus
scientificNameAuthorship = Linnaeus, 1758

scientificName = Felis catus Linnaeus, 1758
scientificNameAuthorship = Linnaeus, 1758

For the second case I use index to check that the authorship string is exactly the same in the two fields. My function is:

authID() { awk -F"\t" -v sciN="$2" -v sciNA="$3" -v ID="$4" 'NR==1 {print $ID FS $sciN FS $sciNA; next} $sciN != "" && $sciNA != "" && index($sciN,$sciNA)==0 {print $ID FS $sciN FS $sciNA}' "$1" | barsep; }

"barsep" is an alias for sed 's/\t/ | /g', which replaces all tabs with [space][pipe][space].

authID takes four arguments: the filename, the number of the scientificName field, the number of the scientificNameAuthorship field and the number of the field holding the record's unique ID. It prints the header line and then any records in which the authorship string is not found within scientificName, skipping records where either name field is empty. In this example, the tab-separated table "25Krecs" has field 1 with a unique ID number, field 63 with scientificName (species + authorship) and field 64 with scientificNameAuthorship (authorship alone), so the command is authID 25Krecs 63 64 1.

The disagreement in ID 3591539 is between "1899" and "1889"; the comma after "typhlops" is a formatting error.
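To try authID without a large table, here is a sketch on a made-up three-field file. The table contents and the variables "tab" and "tsv" are assumptions for illustration. Because bash does not expand aliases in non-interactive scripts by default, barsep is written here as a function, with a literal tab passed to sed as a portable stand-in for \t.

```shell
# barsep as a function; same effect as the alias, with a literal tab for portability
tab=$(printf '\t')
barsep() { sed "s/${tab}/ | /g"; }

# authID as defined in the post
authID() { awk -F"\t" -v sciN="$2" -v sciNA="$3" -v ID="$4" 'NR==1 {print $ID FS $sciN FS $sciNA; next} $sciN != "" && $sciNA != "" && index($sciN,$sciNA)==0 {print $ID FS $sciN FS $sciNA}' "$1" | barsep; }

# Made-up table (hypothetical records): ID, scientificName, scientificNameAuthorship
tsv=$(mktemp)
printf 'ID\tscientificName\tscientificNameAuthorship\n' > "$tsv"
printf '1\tFelis catus Linnaeus, 1758\tLinnaeus, 1758\n' >> "$tsv"
printf '2\tCanis lupus Linnaeus, 1758\tLinnaeus, 1785\n' >> "$tsv"
printf '3\tPanthera leo\t\n' >> "$tsv"

# Arguments: file, scientificName field, scientificNameAuthorship field, ID field
authID "$tsv" 2 3 1
```

This should print the header line plus record 2 only: the years differ in record 2, and record 3 is skipped because its authorship field is empty.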

Last update: 2025-02-28
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License