How to do a both/neither/one/other tally - updated

A routine check I do on data tables is to see if paired fields are either both filled or both empty in each record. By "paired fields" I mean, for example, a latitude field and a longitude field. It doesn't make much sense to have a latitude without a longitude, or vice versa!

I wrote a command one-liner to tally up the "both filled", "both empty", "only the first filled" and "only the second filled" entries, and I've put the command in a function. To show the function at work, below is a tab-separated latitude/longitude table called "latlon". I've packed "latlon" in an HTML table with two records per row to save space on this webpage. To get the 40 lines (a header line plus 39 records) in serial order, copy the table and paste it as a text file, then pass the file to sed to convert the third tab in each line to a newline:

sed 's/\t/\n/3' < file > latlon
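To see what that sed command does, here is a small sketch on two packed rows copied from the table below. It assumes GNU sed, which understands "\t" and "\n" in the "s" command, and it uses printf to generate the tabs. The file names "packed" and "unpacked" are made up for the demonstration.

```shell
# Two packed rows: each holds two 3-field records joined by a tab
printf '002\t-29.4148\t153.3504\t003\t-28.8635\t153.5639\n014\t-28.8635\t\t015\n' > packed

# Replace the 3rd tab on every line with a newline (GNU sed syntax)
sed 's/\t/\n/3' < packed > unpacked

# Show the unpacked records, one per line
cat unpacked
```

Each packed row becomes two lines, so the two rows above give four records. A record with empty trailing fields, like 015 here, ends up as a bare ID.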

ID	Latitude	Longitude	001		153.4273
002	-29.4148	153.3504	003	-28.8635	153.5639
004	-29.5384	150.3841	005	-28.6467	147.6139
006	-29.0830	147.1818	007
008	-29.5489	152.3187	009	-28.4787	153.5510
010	-28.8002	153.5840	011	-29.0908	153.4327
012	-29.4000	153.3500	013	-29.3931	153.2331
014	-28.8635		015
016	-29.3696		017	-20.3500	148.8000
018	-27.2999	152.8861	019
020	-28.7551		021	-29.6001	152.1833
022	-28.8275	153.5329	023	-22.8167	149.8833
024			025	-28.1910
026		153.6139	027	0	0
028	-28.9334	149.3500	029	-27.0850	152.9762
030	-29.3640	150.9910	031	-26.9334	152.9500
032	-28.8502	153.0471	033
034	-28.9368	153.1744	035	-27.0237	152.9499
036	-29.3981	153.3686	037		153.1725
038	-29.1724	150.8508	039	-29.4869	153.3713

Here's my function, called "fldpair". It takes three arguments: filename, number of first field in pair, number of second field in pair. The commands are explained below.

UPDATE. In the first version of this blog post, "fldpair" didn't distinguish between an empty field and a field containing zero ("0"). The revised function only looks for genuinely empty fields. For more information, see this BASHing data post.

fldpair() { awk -F"\t" -v one="$2" -v two="$3" 'NR==1 {x=$one; y=$two; next} ($one!="") && ($two!="") {both++} ($one!="") && ($two=="") {oneonly++} ($one=="") && ($two!="") {twoonly++} ($one=="") && ($two=="") {neither++} END {print "neither\t"neither"\n"x" only\t"oneonly"\n"y" only\t"twoonly"\nboth\t"both}' "$1" | sed 's/\t$/\t0/' | column -t -s $'\t'; }

The result for "latlon":
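To re-create that result, the sketch below rebuilds "latlon" from the packed table and runs the function on fields 2 and 3. Two things are my assumptions rather than part of the original post: the "|" characters stand in for tabs so the listing can be copied safely (tr turns them back into tabs), and the field numbers 2 and 3 are simply where Latitude and Longitude sit in this table. GNU sed is assumed for the "\t" and "\n" escapes.

```shell
# Packed rows from the table above, with "|" standing in for each tab.
# tr restores the tabs, then sed splits each row at its 3rd tab.
cat <<'EOF' | tr '|' '\t' | sed 's/\t/\n/3' > latlon
ID|Latitude|Longitude|001||153.4273
002|-29.4148|153.3504|003|-28.8635|153.5639
004|-29.5384|150.3841|005|-28.6467|147.6139
006|-29.0830|147.1818|007
008|-29.5489|152.3187|009|-28.4787|153.5510
010|-28.8002|153.5840|011|-29.0908|153.4327
012|-29.4000|153.3500|013|-29.3931|153.2331
014|-28.8635||015
016|-29.3696||017|-20.3500|148.8000
018|-27.2999|152.8861|019
020|-28.7551||021|-29.6001|152.1833
022|-28.8275|153.5329|023|-22.8167|149.8833
024|||025|-28.1910
026||153.6139|027|0|0
028|-28.9334|149.3500|029|-27.0850|152.9762
030|-29.3640|150.9910|031|-26.9334|152.9500
032|-28.8502|153.0471|033
034|-28.9368|153.1744|035|-27.0237|152.9499
036|-29.3981|153.3686|037||153.1725
038|-29.1724|150.8508|039|-29.4869|153.3713
EOF

# The function exactly as given in this post
fldpair() { awk -F"\t" -v one="$2" -v two="$3" 'NR==1 {x=$one; y=$two; next} ($one!="") && ($two!="") {both++} ($one!="") && ($two=="") {oneonly++} ($one=="") && ($two!="") {twoonly++} ($one=="") && ($two=="") {neither++} END {print "neither\t"neither"\n"x" only\t"oneonly"\n"y" only\t"twoonly"\nboth\t"both}' "$1" | sed 's/\t$/\t0/' | column -t -s $'\t'; }

# Arguments: filename, first paired field, second paired field
fldpair latlon 2 3
```

If the table has been transcribed correctly, the tallies should come out as 5 for "neither", 4 for "Latitude only", 3 for "Longitude only" and 27 for "both", which add up to the 39 records. Note that record 027, with "0" in both fields, counts as "both".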

Some results for paired fields from a real-world table, "prep":

The only "gotcha" I'm aware of for "fldpair" is that an empty field really has to be empty. If an apparently empty field actually contains a space or spaces of some kind, the AWK command will count it as filled. Please also note that the AWK command assumes the table is tab-separated.
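One way to guard against this gotcha is to look for space-only fields before tallying. The following is a minimal sketch, not part of "fldpair": the function name "blankcheck" and the sample file "demo" are made up for illustration, and the test only catches fields made entirely of plain spaces. Other invisible characters, such as a no-break space, would need their own test.

```shell
# Report fields that look empty but contain only plain spaces.
# Usage: blankcheck file  (prints "record N, field M" for each hit)
blankcheck() { awk -F"\t" '{for (i=1; i<=NF; i++) if ($i ~ /^ +$/) print "record "NR", field "i}' "$1"; }

# Invented sample: the third line has a single space in field 3,
# while the fourth line has genuinely empty fields 2 and 3
printf 'ID\tLatitude\tLongitude\n001\t-29.4148\t153.3504\n002\t-28.8635\t \n003\t\t\n' > demo

blankcheck demo
```

Record numbers here count the header as record 1, so the hit is reported for record 3. Genuinely empty fields are not reported, because the pattern needs at least one space.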

awk -F"\t" -v one="$2" -v two="$3"
   AWK is told what the table's field separator is (tab in this case), and the AWK variables "one" and "two" are given the values of the function's second and third arguments, namely the numbers of paired field 1 and paired field 2.

NR==1 {x=$one; y=$two; next}
   The instruction for the first line of the table (the header) is to store the name of the first paired field in the variable "x" and the name of the second paired field in the variable "y". Once that's done, AWK moves to the next line.

($one!="") && ($two!="") {both++}
   If both of the paired fields are non-empty, the variable "both" is incremented by one.

($one!="") && ($two=="") {oneonly++}
   If the first paired field is non-empty but the second one is empty, increment "oneonly" by one.

($one=="") && ($two!="") {twoonly++}
   If the second paired field is non-empty but the first one is empty, increment "twoonly" by one.

($one=="") && ($two=="") {neither++}
   If both of the paired fields are empty, increment "neither" by one.

END {print "neither\t"neither"\n"x" only\t"oneonly"\n"y" only\t"twoonly"\nboth\t"both}'
   With all lines processed, print the tallies for the four possible results with tab-separated labels. The "only" labels get their field names from "x" and "y".

"$1"
   This is the file which AWK has to process (the first argument for the function).

| sed 's/\t$/\t0/'
   An AWK variable that is never incremented stays uninitialised and prints as an empty string, so a tally with no hits would be blank. I pipe the AWK output to sed and convert any blank tallies to "0". An alternative would be to initialise the four incrementing variables to zero.

| column -t -s $'\t'
   I pipe the sed output to the column command to tidy the result.
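The alternative mentioned in the sed step, initialising the counters, might look like the sketch below. "fldpair0" is a made-up name for this variant, not something from the original function, and the three-line sample file "nozero" is invented for the demonstration. With the four counters set to zero in a BEGIN block, an unused tally prints as "0" and the sed step is no longer needed.

```shell
# Variant of fldpair with counters initialised in BEGIN, so no sed clean-up
fldpair0() { awk -F"\t" -v one="$2" -v two="$3" 'BEGIN {both=oneonly=twoonly=neither=0} NR==1 {x=$one; y=$two; next} ($one!="") && ($two!="") {both++} ($one!="") && ($two=="") {oneonly++} ($one=="") && ($two!="") {twoonly++} ($one=="") && ($two=="") {neither++} END {print "neither\t"neither"\n"x" only\t"oneonly"\n"y" only\t"twoonly"\nboth\t"both}' "$1" | column -t -s $'\t'; }

# Invented sample: both records are fully filled, so three tallies stay at zero
printf 'ID\tLatitude\tLongitude\n001\t-29.4148\t153.3504\n002\t-28.8635\t153.5639\n' > nozero

fldpair0 nozero 2 3
```

Which version to use is a matter of taste: the BEGIN block makes the command a little longer, while the sed step adds another process to the pipeline.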

Last update: 2020-08-19
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.