On this page:
|
On the Structure 1 page:
|
TSV This marker means that the recipe only works with tab-separated data tables.
Blank fields
A data table might have a complete header but only blank data items in some fields. These fields aren't completely empty because they have a field name in the header, so they're "pseudo-blank". The following script (I call it "empties") will identify any pseudo-blank fields and list them with their field numbers in a new file, "table_emptyfields". It also offers the option of building a new table without the pseudo-blank fields.
Interactive script to find empty fields and build a new data table without them TSV
(More information here)
#!/bin/bash
awk -F"\t" 'NR>1 {for (i=1;i<=NF;i++) a[i]+=length($i)} END {for (j in a) {if (a[j]==0) print j}}' "$1" > /tmp/flds
head -n1 "$1" | tr '\t' '\n' | nl -w1 > /tmp/allflds
if [ ! -s /tmp/flds ]; then
echo
echo "No empty fields in $1" && rm /tmp/flds /tmp/allflds && exit
else
awk -F"\t" 'FNR==NR {b[$1]=$2; next} $1 in b {print $1":"b[$1]}' /tmp/allflds /tmp/flds > "$1"_emptyfields
echo
echo "$(wc -l < "$1"_emptyfields) empty field(s) in "$1" table"
echo
read -p "Build table with non-empty fields only? (y/n) " build
if [[ "$build" == "y" ]]; then
read -p "What name for the new table? " name
cut --complement -f"$(paste -d',' -s /tmp/flds)" "$1" > "$name"
else
rm /tmp/flds /tmp/allflds && exit
fi
fi
rm /tmp/flds /tmp/allflds
exit 0
The "empties" script runs fairly slowly with big tables, so I use a modification that has a progress bar generated by the pv utility. The first AWK command in the script is replaced by:
pv -w 50 -pbt "$1" | awk -F"\t" 'NR>1 {for (i=1;i<=NF;i++) a[i]+=length($i)} END {for (j in a) {if (a[j]==0) print j}}' > /tmp/flds
Blank records
Print a list of record numbers for completely blank records,
or for records containing only tabs or whitespace
awk 'NF==0 {print NR}' table
awk '!NF {print NR}' table
Blank records are fairly rare in real-world data tables. More common are records that contain one or more essential data items (like a unique record ID) and nothing else, or maybe a "?" or "-" or two. These are "pseudo-blank" records. Examples are on lines 4, 5, 9 and 12 in the tab-separated table below, called "table1".
The "susscount" function prints a tally of records with "suspect" fields containing no alphanumeric characters.
Tally records with suspiciously "empty" fields TSV
(Earlier version explained here)
susscount() { awk -F"\t" '{c=0; for (i=1;i<=NF;i++) if ($i !~ /[[:alnum:]]/) {c++}} {print "records with "c" suspect fields"}' "$1" | sort -r | uniq -c; }
Note that the "susscount" function will return a high number of lines if the data table contains many blank data items. It is mainly for use with tables that would normally be well-filled.
To inspect the records with suspect fields, modify the function to print all records with a suspect field count equal to N, where N is based on the result of "susscount". The modified function is "showsuss".
Print records with suspiciously "empty" fields TSV
showsuss() { awk -F"\t" -v cnt="$2" '{c=0; for (i=1;i<=NF;i++) if ($i !~ /[[:alnum:]]/) {c++}} c==cnt {print}' "$1"; }