banner

For a full list of BASHing data blog posts see the index page.     RSS


Comparing strings more clearly

In a recent data audit, field 19 of a TSV contained a scientific name, and field 20 contained another version of the name plus the scientific authority for that name. In most cases the two name versions were the same, like this:

Anadyomene stellata      Anadyomene stellata (Wulfen) C.Agardh

In other cases the two versions weren't the same. Sometimes the species or subspecies names differed, sometimes the genus names and sometimes both:

Ceramium fastigiatum      Ceramium cimbricum H.E.Petersen
Codium fragile subsp. tomentosoides
      Codium fragile subsp. fragile (Suringar) Hariot
Boergeseniella thuyoides      Vertebrata thuyoides (Harvey) Kuntze
Acrosorium uncinatum      Cryptopleura ramosa (Hudson) L.Newton

I used AWK to select out the field 19/field 20 pairs where the names differed. To demonstrate this command I'll use a simplified TSV called "demo", with fake scientific names:

IDNameAlt_name+authority
001Primium vulgareSecundum vulgare Müller
002PrimiumPrimium De Blas
003Trivius latum scotensisTrivius latus scotensis Baker
004PrimiaPrimium De Blas
005Secundum vulgareSecundum vulgare Müller
006Primia vulgarisSecundum vulgare Müller
007Trivius latus scotensisTrivius latus scotensis Baker
008Primium latum scotenseTrivius latus scotensis Baker

awk -F"\t" 'NR>1 {n=split($2,a," "); split($3,b," "); \
for (i=1;i<=n;i++) if (a[i] != b[i]) {print; next}}' demo

Core parts of this command are explained below.

tinker1

AWK users may be wondering why I didn't just test the longer field to see if it matched the shorter field, like this: $3 !~ $2. That's OK for strings with only alphanumeric charcters, but the matching will throw fatal errors if the shorter string contains something that looks like an invalid range expression or an incompletely round-bracketed one. It will also return a non-match if the shorter field contains "*". The command I used is longer but safer.

This worked fine, but it didn't tell me which of the names were different. A bit of tinkering with AWK led me to a nice couple of solutions. The first method selects the lines with name changes and colorises the "before" and "after" words:

awk -F"\t" 'NR>1 {n=split($2,a," "); split($3,b," "); \
for (i=1;i<=n;i++) {if (a[i] != b[i]) \
{sub(a[i],"\033[1;31m"a[i]"\033[0m",$2); \
sub(b[i],"\033[1;31m"b[i]"\033[0m",$3)}}} /\033/' \
OFS=" | " demo

tinker2

There are two condition-action parts to the command. The main one (inside the 3 nested curly brackets) has NR>1 its condition, so it only applies to lines after the header.
 
The main action begins by splitting fields 2 and 3 into arrays "a" and "b", respectively. Each array element will contain a space-separated word. In addition, the total number of words in field 2 is counted and stored in the variable "n". (Besides splitting strings, split also returns the number of array elements it creates.)
 
The main action now starts a for loop which iterates the variable "i" through "n" (for (i=1;i<=n;i++)). In other words, what happens next will only happen for the words in field 2, and any additional words in field 3 will be ignored.
 
For each word in field 2, AWK checks to see if it differs from its corresponding word in field 3 (if (a[i] != b[i])). If that's true, then AWK substitutes a colorised version of the word, in both fields. The substitution is done with the sub function and uses ANSI color escapes for the colorising (e.g. sub(a[i],"\033[1;31m"a[i]"\033[0m",$2))
 
With all those jobs done, AWK turns to the second condition (/\033/). If the escape character is found in the line, the line is printed (default action).
 
As a final touch for clarity, AWK is told with OFS=" | " as a "pseudo-argument" that the output field separator is space, pipe, space.

This first solution shows me the differences between names if I happen to be in a terminal, but it doesn't produce something I can store in a text file. The second solution does that job:

awk -F"\t" 'NR>1 {n=split($2,a," "); split($3,b," "); \
for (i=1;i<=n;i++) if (a[i] != b[i]) \
f = (!f) ? a[i]"|"b[i] : f", "a[i]"|"b[i]} \
f {print $0 "\n " f; f=""}' demo

tinker3

This command begins like the first one, but changes the action to be done if the words are different. The action is to define a variable "f" with a ternary "if, do this, else, do that" test. If "f" doesn't exist yet, it's defined as the field 2 word separated from its field 3 version by a pipe. If "f" does exist, it's defined as the existing "f" followed by a comma and a space, then the latest two versions separated by a pipe. That closes the action.
 
The second condition/action begins with the condition "f", which means "if f exists", and "f" only exists for those lines in which names differ between fields 2 and 3. The action to be taken is simply to print the whole line, a newline and a couple of spaces, then "f". After the printing, "f" is reset to an empty string in preparation for processing the next line in the file.


Last update: 2020-12-09
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License