For a full list of BASHing data blog posts, see the index page.
Pseudo-blank ("empty") records and fields
It's easy to find records in a data table that are entirely blank or only contain tabs or whitespace. Just ask AWK to print the line numbers of any records that have no fields:
awk 'NF==0 {print NR}' datatable
or
awk '!NF {print NR}' datatable
Blank records are fairly rare in real-world datasets. You're much more likely to find records that contain one or more essential data items (like a unique record ID) and nothing else, or maybe a "?" or "-" or two. These are "pseudo-blank" records. Three examples are on lines 4, 9 and 12 in the tab-separated table below, "fileB":
RecordID | Last_name | First_name | CustomerID | Modified |
10001 | Albee | Michael | mel141a | 2018-08-01 |
10002 | Allan | Rebecca | war206a | 2018-08-01 |
10003 | ??? | 2018-08-01 | ||
10004 | Allan | ? | war206b | 2018-08-01 |
10005 | Allen | Georgina | mel097a | 2018-08-01 |
10006 | Amado | Paolina | mel141b | 2018-08-01 |
10007 | Andrewartha | Herbert | bri129a | 2018-08-01 |
10008 | 2018-08-01 | |||
10009 | Andrews | Samuel | syd322a | 2018-08-01 |
10010 | Antill | Jessica | hob111a | 2018-08-01 |
10011 | - | - | - | 2018-08-01 |
Cases like this are messy, but a general strategy for finding pseudo-blank records might be to count the fields with unexpected content, like fields without alphanumeric characters, then tally up the records by their suspect field counts:
awk -F"\t" 'BEGIN {c=0} {for (i=1;i<=NF;i++) if ($i !~ /[[:alnum:]]/) {c++}} {print "lines with "c" suspect fields"; c=0}' datatable | sort -r | uniq -c
In the first part of the command, AWK goes through each tab-separated field in turn, testing it to see if it lacks alphanumeric characters. If it does, the variable "c" is incremented from zero, which was the value initialized in the BEGIN statement. When all fields in the record have been tested, AWK prints the field count "c", then resets "c" to zero. The output of the AWK command is fed to sort -r and uniq -c for a field count tally.
The next step is to print the records with the highest suspect field counts:
A data table might have a complete header line but only blank data items in some fields. These fields aren't completely empty, because they have a field name in the header, so they're "pseudo-blank". Finding these fields is fairly straightforward with AWK, and to do this I use the "empties" script explained below. To demo the script I'll use "fileC":
Field1 | Field2 | Field3 | Field4 | Field5 |
aaa | aaa | |||
aaa | aaa | aaa | ||
aaa | aaa | |||
aaa | aaa | aaa | ||
aaa | aaa | |||
aaa | aaa |
The core of the script is a command that finds any fields that are empty below the header, and lists their field numbers:
awk -F"\t" 'NR>1 {for (i=1;i<=NF;i++) a[i]+=length($i)} END {for (j in a) {if (a[j]==0) print j}}' datatable
AWK here ignores the header line ("NR>1") and goes through the fields on each line one by one. For each field the "length" function calculates the number of characters and adds that number to an array "a" indexed by field number. An empty field will have a total character number of zero, which is what AWK looks for in the END statement when it walks through the array "a".
The complete "empties" script is shown below. The list of empty fields is stored in a temp file ("/tmp/flds"), as is a numbered list of all fields in the table ("/tmp/allflds"). If there are no empty fields in the table, the script reports that and exits. Otherwise, an AWK command uses an array to add a colon and a field name to each of the empty field numbers, saving the output as the file "[filename]_emptyfields". The number of empty fields is reported on screen, and the script asks if I'd like the data table with the empty fields removed. If I say yes, the new table is built with a name I choose.
#!/bin/bash
awk -F"\t" 'NR>1 {for (i=1;i<=NF;i++) a[i]+=length($i)} END {for (j in a) {if (a[j]==0) print j}}' "$1" > /tmp/flds
head -n1 "$1" | tr '\t' '\n' | nl -w1 > /tmp/allflds
if [ ! -s /tmp/flds ]; then
echo
echo "No empty fields in $1 table" && rm /tmp/flds /tmp/allflds && exit
else
awk -F"\t" 'FNR==NR {b[$1]=$2; next} $1 in b {print $1":"b[$1]}' /tmp/allflds /tmp/flds > "$1"_emptyfields
echo
echo "$(wc -l < "$1"_emptyfields) empty field(s) in "$1" table"
echo
read -p "Build table with non-empty fields only? (y/n) " build
if [[ "$build" == "y" ]]; then
read -p "What name for the new table? " name
cut --complement -f$(paste -d',' -s /tmp/flds) "$1" > "$name"
else
rm /tmp/flds /tmp/allflds && exit
fi
fi
rm /tmp/flds /tmp/allflds
exit
This script is much faster than the "empties" script I published in 2017 on the Linux Rain blog and in version 1 of A Data Cleaner's Cookbook.
Last update: 2020-01-13
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License