For a full list of BASHing data blog posts see the index page.
How to find the missing parts of a series
My wife has a data table with a unique serial number for each of its records. The table gets frequent edits and she wanted to check which of the numbers might be missing as a result of past deletions. There are thousands of numbers, so eyeballing for gaps in the series isn't practical.
She gave me a text file ("UIs") with an unsorted list of the numbers. To discover the first and last ones I did a numerical sort of "UIs" (sort -n) and piped the result to sed to print just the first and last lines in the sorted list:
sort -n UIs | sed -n '1p;$p'
I then used comm to compare the sorted list of numbers in "UIs" with the sorted full range of numbers from 1 to 7255, built with seq. The -3 option for comm suppresses the lines the two lists have in common. I found 24 missing numbers:
comm -3 <(sort UIs) <(sort <(seq 1 7255)) | column
Well, that was easy, but it was just a series of numbers. What if the series wasn't so simple? How to build a complete series of other kinds of strings, to compare with the series from which items might be missing?
Product codes. Below is an unsorted list of serial product codes, filename "SKUs". One is missing from the series.
BRM109Z
BRM112Z
BRM99Z
BRM116Z
BRM96Z
BRM102Z
BRM117Z
BRM101Z
BRM120Z
BRM104Z
BRM98Z
BRM105Z
BRM118Z
BRM114Z
BRM119Z
BRM115Z
BRM111Z
BRM107Z
BRM108Z
BRM100Z
BRM110Z
BRM113Z
BRM97Z
To get the start and finish codes I'll use sort and sed again. A simple sort would put "BRM99Z" after "BRM100Z", so I've used sort's -V option for sorting version numbers:
sort -V SKUs | sed -n '1p;$p'
I'll build the complete series of product codes with a BASH for loop and echo:
comm -3 <(sort SKUs) <(sort <(for i in {96..120}; do echo "BRM"$i"Z"; done))
A hierarchical series. I see these occasionally in my data auditing work. For example, a biologist has 6 serially numbered samples from 4 different methods at 3 different sites, or 72 samples in all. Each sample is coded Site-Method-SampleNumber. In "samples" (below), the sites are coded BAR, MOG and SPA; the methods are coded BRU, PIT, SIE and SWE; and one of the 72 samples is missing:
MOG-SIE-5 | BAR-SWE-2 | SPA-SWE-4 | SPA-SWE-2 |
SPA-SIE-1 | BAR-PIT-5 | SPA-PIT-5 | MOG-SWE-6 |
MOG-BRU-4 | MOG-PIT-4 | BAR-BRU-2 | BAR-SWE-6 |
BAR-BRU-6 | MOG-SWE-5 | BAR-SWE-3 | SPA-PIT-3 |
MOG-SIE-6 | MOG-SIE-2 | SPA-BRU-1 | BAR-BRU-5 |
BAR-SWE-4 | SPA-SIE-3 | MOG-PIT-5 | MOG-PIT-6 |
MOG-SIE-3 | SPA-SIE-4 | MOG-SWE-1 | BAR-SIE-5 |
SPA-PIT-1 | SPA-PIT-4 | BAR-BRU-3 | BAR-SWE-5 |
MOG-BRU-3 | BAR-PIT-3 | MOG-SIE-4 | BAR-PIT-4 |
BAR-SIE-1 | SPA-BRU-2 | MOG-SWE-4 | BAR-PIT-1 |
BAR-BRU-4 | SPA-SIE-5 | BAR-SIE-4 | BAR-PIT-2 |
MOG-PIT-1 | SPA-BRU-3 | SPA-BRU-6 | SPA-SWE-6 |
MOG-BRU-2 | MOG-PIT-2 | BAR-SWE-1 | BAR-BRU-1 |
BAR-SIE-3 | BAR-PIT-6 | SPA-SIE-6 | MOG-PIT-3 |
SPA-BRU-5 | SPA-SWE-5 | MOG-BRU-1 | MOG-SWE-3 |
SPA-SWE-1 | MOG-BRU-5 | SPA-SIE-2 | SPA-PIT-6 |
BAR-SIE-2 | SPA-BRU-4 | BAR-SIE-6 | MOG-SIE-1 |
MOG-BRU-6 | SPA-PIT-2 | SPA-SWE-3 |
To "reconstitute" this table as a simple unsorted list, copy/paste it into a text file and do
sed 's/\t/\n/g' file | awk NF > samples.
To build and check the complete series, I'll use three BASH arrays and iterate through them as nested entities, echoing the output as a suitably formatted string:
site=(BAR MOG SPA); method=(BRU PIT SIE SWE); sample=(1 2 3 4 5 6)
comm -3 <(sort samples) \
<(sort <(for i in ${site[@]}; do \
for j in ${method[@]}; do \
for k in ${sample[@]}; do \
echo $i"-"$j"-"$k; done; done; done))
Update. Reader Markus Weihs suggests a much simpler solution to the hierarchical series problem, using BASH brace expansion. The command
echo {BAR,MOG,SPA}-{BRU,PIT,SIE,SWE}-{1..6}
builds all the possible combinations as a space-separated series:
The space-separated series could be turned into a list with either tr " " "\n" or xargs -n 1, but an even simpler method is to use printf:
comm -3 <(sort samples) \
<(sort <(printf "%s\n" {BAR,MOG,SPA}-{BRU,PIT,SIE,SWE}-{1..6})
Last update: 2021-02-03
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License