About this blog
BASHing data is a companion blog to A Data Cleaner's Cookbook. The blog is a place for demonstrations and trials of command-line data "ops". The operations might include analysing, archiving, auditing, cleaning, de-duplicating, encoding, entering, migrating, querying, reformatting, reporting and storing.
This page introduces the first BASHing data series, which includes 200 posts and ran from 2018 to 2022. Like A Data Cleaner's Cookbook, the first series is archived in Zenodo and can be downloaded for offline use. The second series of BASHing data began in 2024 and is a separate website.
About me
I'm a data auditor and retired zoologist.
Robert Mesibov, West Ulverstone, Tasmania, Australia
robert.mesibov@gmail.com
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Topic categories:
- Data auditing, cleaning and processing
- Characters and encoding
- Data formatting
- Data analysis examples
- AWK tips and tricks
- BASH tips and tricks
- Useful programs for command-line data ops
- Data entry and display
- The Windows and spreadsheet worlds
- Miscellaneous stuff
Posts by category (most recent post first):
Data auditing, cleaning and processing
People are the best data cleaners (2022-04-08)
Between spreadsheets and Big Data analytics is the command line
Search for (exact) strings; report line, column and context (2022-03-09)
A coloured grep for data tables
Detecting truncations: another sometimes successful method (2021-12-15)
This is a difficult job and every command-line trick helps
A quick cross-file comparison with AWK (2021-11-10)
AWK neatly does a tricky data comparison
Duplicate records differing only in unique identifiers - updated (2021-10-27)
A much-improved method for finding these partial duplicates
Some regex tests with grep, sed and AWK (2021-10-20)
Speed tests for different search/filter cases
How to do replacements based on multiple field values (2021-10-06)
Command-line repairs with a powerfully simple tool
There's data missing - please explain (2021-06-30)
A blank entry can have hidden meanings
The curious world of check digits (2021-06-16)
How they work, and code to validate an ABN
The Incrementing Fill-Down Error (2021-05-26)
Another data crime with spreadsheeting as the prime suspect
A data checker's checklist (2021-05-12)
A draft outline of topics for the next online resource
How to fix "one2many" data issues (2021-03-17)
Command-line repairs for a surprisingly common type of error
DIY primary/foreign key relationships, again — updated (2021-03-25)
A script to check for primary/foreign key issues
Four kinds of data anomalies (2021-02-24)
Anomalies might be out of range, out of place, out of match or out of date
How to find the missing parts of a series (2021-02-03)
Command-line solutions for a simple and two more complicated cases
How to build a multi-file fields concordance (2020-12-23)
Clearly show which fields have the same name in two or more files
Check the day of year, given a date (2020-11-18)
Comparing ISO 8601 dates with their day numbers
How to keep an eye on field numbers (2020-11-04)
Put the field numbers on a digital Post-it note with YAD
Three kangaroos in the ocean (2020-09-30)
Ridiculous outliers can sometimes be worth salvaging
Finding one-to-many entries in a data table (updated) (2020-09-16)
Too many B's for each A?
Checking DIY primary/foreign key relationships (2020-09-02)
Problems when primary and foreign keys are hand-built
How to do a both/neither/one/other tally - updated (2020-09-06)
A simple check on paired fields (like latitude and longitude) in a data table
How to find almost-duplicates (2020-07-01)
Two methods that work with some (but not all) data tables
Add an issues field to a data table (2020-05-20)
How to get records to self-report their problems
Spellchecking scientific names on the command line (2020-05-06)
How to build and use a dictionary of scientific names
Targeted string replacements with sed and AWK (2020-04-08)
Avoid the dangers in globally replacing A with B
A curious pair of data ops (2020-03-18)
Multiple pivots and keying the unreadable
Moving averages with AWK (2020-03-04)
A command for adding moving averages to a table
Topping and tailing, and the slowness of GNU sort — updated (2019-11-08)
GNU sort can be a rate-limiting step in a pipeline
How to guess the field separator in a table (2019-10-04)
Count up the likely field separators in the header line with AWK
Long, narrow tables vs short, wide ones (2019-08-16)
Three tests of processing speed show that table shape doesn't matter
A bulk replacement GUI with YAD (2019-08-02)
A shell script for "normalising" pseudo-duplicates in a data table
Finding malformed markup (2019-07-19)
How to identify messed-up HTML tags in non-HTML documents
Leading and trailing whitespace (2019-06-28)
How to find and delete "fore and aft" whitespace within fields in a data table
Growing the Cookbook's "broken" function (2019-05-31)
A more informative way to tally up the number of fields in a data table
How to delete, insert and replace whole lines (2019-05-12)
Use line addresses to target just the right lines
How to delete, insert and replace whole fields (2019-05-05)
Cut and paste are usually the right tools for these jobs
Comparing fields across two tables (2019-02-03; updated)
A script to check for changes in a field
How many fruits in 5 apples, 3 oranges, 1 pear and 17 lemons? (2018-12-16)
On the command line, you can ignore everything but the numbers
Finding changepoints in a list, revisited (2018-12-06)
Using AWK to find where values change in a list
How to find distances between lat/lons for geochecking (2018-11-07)
When you're looking for big differences, an approximate method is fine
Bird watching with AWK and grep (2018-10-24)
Showing off the fastest way to search a text file for strings in another file
How to validate ISO 8601 dates without regex (2018-10-05)
Check for format and content errors in YYYY-MM-DD fields with AWK
Fightin' fields (2018-09-30)
Finding disagreements between data fields can be challenging
Fuzzy matching in practice (2018-09-23)
Tips for approximate matching with tre-agrep
48 sea levels and a trope for your terminal (2018-08-11)
A bulk string replacement with AWK, and that ACCESS DENIED thing
Pseudo-blank ("empty") records and fields (2018-08-04)
How to find not-quite-empty rows and columns in a data table
Time series ops (2018-07-23)
Using AWK to summarise time series data
Partial duplicates (2018-07-14)
One way to find "pseudoduplicated" records
Truncated data items (2018-07-04)
Detecting truncations, such as a 100-character string clipped to 50 characters
Compare parts of strings (2018-05-22)
How to use AWK's "split" function to compare parts of strings
Characters and encoding
Gremlin detection bigly improved and a NUL problem avoided (2021-12-08)
The gremlin detector script has been rebuilt from scratch
How to watermark a UTF-8 plain text file (2021-11-24)
Use an inconspicuous Unicode character and a placement code
How to find mixed Latin+Cyrillic words (2021-09-29)
Blue Latin and red Cyrillic letters in words containing both
Show Unicode code points for UTF-8 characters (2021-09-15)
Convert a character to its code point (\uxxxx) with shell tools
Yet another gremlin: the zero-width space (2021-09-01)
How to find and kill it before it does mischief
What is +ACY- doing in the data? (2021-07-14)
A strange encounter with UTF-7
Mojibake madness (2021-05-19)
Spectacular examples of gibberish from recent data audits
Mojibake bonanza (2020-12-16)
New mojibake origin puzzles from a museum database
Encoding detection smackdown (2020-09-23)
enca vs file vs iconv vs isutf8 vs uchardet
Character equivalence classes 2: the nature of equivalence (2020-06-24)
What does "something like" actually mean?
Character equivalence classes 1: search and replace (2020-06-17)
How to find "something like" a character
More mojibake fun (2020-04-01)
Easy-to-hard examples of translating from gibberish
Hunting gremlins (2020-01-22)
A script to make invisible gremlin characters visible
Build your own character class inventories — updated (2019-12-27)
Find out what [:alpha:] and [:cntrl:] mean in your system
Introducing the replo (2019-11-01)
Character replacements by computers can be reversible, reconstructable or researchable
An unexpected character replacement (2019-10-18)
Strange replacements of non-ASCII characters by R
Return of the mojibake detective (2019-07-05)
Three new cases of mysterious character corruptions
Quotes as characters (2019-04-07; updated 2019-05-26)
How to recognise the nine different kinds of single and double quotes
How to choose special characters, revisited (2019-03-24)
Scripting a little GUI for copying/pasting your most often-used special characters
iconv and illegal input sequences (2018-09-13)
Getting around a roadblock in changing the character encoding of a file
SCI and 62;c62;c62;c... (2018-08-25)
A control character causes strange behaviour in GUI terminals
Mojibake detective work (2018-08-06)
A close look at some character encoding problems
Question marks that aren't really question marks (2018-07-27)
Some question marks show that a program doesn't understand a character's encoding
Combo characters (2018-06-09)
How to deal with Unicode's combining characters
Data formatting
gron the JSON flattener (2022-03-23)
Flattened JSON can be worked with shell tools
How to flatten ("unpivot") a data table (2022-03-16)
Make a table into a list of values by row and column
Auto-incrementing version letters (2022-03-02)
Solutions for building 101c, 101d, 102a, 102b...
A dog-cat-horse-turtle problem (2022-01-19)
Seven solutions and counting for this one problem
Tidy tables for data processing (2022-01-12)
Preparing data for programs that don't care about "pretty"
Building an ODT on the command line (2021-12-29)
Bypass the GUI by starting with HTML
Making a transect into a point and circle (2021-12-22)
Convert a WKT linestring to point-radius in metres
What's wrong with my footprintWKT? (2021-11-17)
About WKT and unexplained "invalid" results in GBIF
On visual contrast and QR codes (2021-11-03)
Boosting contrast makes blurry QR codes readable
TSV to CSV on the CLI (if you really have to) (2021-10-13)
How to build an RFC4180-compliant CSV from a TSV
zbarimg and blurry QR codes (2021-08-25)
Surprisingly well-blurred codes are still readable
Two data formatting tweaks (updated) (2021-08-11)
Handy ways to make tab-separated fields more obvious
Reverse or shuffle a string in a particular field (2021-07-07)
Shell tools or AWK can do this, or a mix
"Firstname Lastname" to "Lastname, Firstname", with complications (2021-06-23)
Name parsing and formatting is rarely simple
CSV to table, table to CSV (2021-06-02)
How to pivot and "de-pivot" a CSV table
Converting a list to a presence/absence table (2021-02-10)
Re-formatting is easy with tidy, well-structured data
ASCII score bars and a gorblimey command (2021-01-27)
How to build a string of characters and their complement
Form text and placeholders (2021-01-13)
Form letters, diaries and mail merge in plain text
Comparing strings more clearly (2020-12-09)
How to make and emphasise a string comparison between fields
Re-format blah,YYYYMMDD,blah as blah,YYYY,MM,DD,blah (2020-12-02)
How to do it with sed or AWK: 7 methods
How to stack columns (2020-11-25)
Turn a "columnated" table into a straight up-and-down one
Building a data table from a sentence (2020-10-07)
How to expand a condensed data structure
Spotting spaces, and AWK's view of emptiness (2020-09-09)
A simple way to show and count plain whitespaces, and "non-empty" vs "non-empty and non-zero" in AWK
How to number copy/pasted commands (2020-08-05)
A neat way to number and indent commands and their outputs
Sharing data and metadata together (2020-07-29)
How not to lose a data table's metadata
A quick repair job on a dislocated table (2020-07-15)
Fixing a table with displaced fields
Extra commas in a CSV (2020-07-08)
How to safely delete just the excess commas
Join consecutive lines if condition applies (2020-06-03)
Simple ways to fix embedded newlines
Printing repeats within repeats, and splitting a list into columns (2020-05-27)
Why I use pr rather than column for some columnating jobs
How to move selected lines within a file (2020-05-13)
No need to cut and paste, use the command line
Dealing with an all-CAPS/first-CAP jumble (2020-04-29)
How to normalise a mix of WORDS and Words
How to be uncertain with dates (2020-02-12)
A skeptical look at some of ISO 8601's new extensions
JSON Lines: record-style JSON (2020-01-29)
A bridge between table-style data and standard JSON
Emphasising text in the terminal (2019-12-13)
Making selected strings stand out with ANSI codes
Embedded newlines without a clue (2019-11-15)
Without clear markers for field fragments, you need to be creative
Add leading zeroes that aren't really leading (2019-09-13)
How to format numbers when they're inside non-numeric strings
A GUI to re-order fields in a table (2019-08-30)
A shell script for building a new table with reordered fields
The lat/lon floating point delusion (2019-08-09)
That big building is at -33.8903169365705 151.198409720645? Really?
Renumber a list after inserting a line — updated (2019-07-27)
A handy function for inserting and renumbering
Data from dingbats: copying down (2019-02-24)
Copying down is easy in a spreadsheet, but it's also possible on the command line
Fancy numbering of records (2019-02-17)
On the command line, you can number a list of records any way you like
Reformatting a list, cleverly (2019-01-27)
Create horizontal lists from a vertical one
Horizontal sorting within a field (2019-01-13)
There are two different ways to sort a field "horizontally", but neither of them is simple.
Changing the month format: a fairly general solution (2018-12-30)
Build a look-up table and use the starting and finishing format in an AWK command
Putting information into a table from the table's filename (2018-12-13)
The example adds a date from the filename to each record in the table
Unwrap your fasta (2018-12-01)
How to concatenate the sequence lines in FASTA files
Repair job: separate the tandem repeats (2018-10-26)
How to split a tandem repeat between fields
Too many lat/lon digits (2018-06-30)
Rounding off latitude/longitude data to an appropriate number of significant figures
Embedded newlines (2018-06-23)
How to safely remove embedded newlines
Data analysis examples
Online shopping and a one2many tweak (2022-02-23)
How to group product purchases by customer
Are you 10000 days old yet? (2022-01-05)
Three command-line ways to find out
Batch triangulation on the command line (2021-06-09)
Locate a point given the distances to two other, located points
Hunting Excel date twins (2021-03-09)
Microsoft's choice of starting dates leads to duplicate records
The myth of equinoctial gales (2020-10-14)
Real-world wind data don't show equinoctial gales
What's wrong with these records? (2020-08-26)
Tinkering with "present in these records, absent in those"
Checking date components across fields (2020-04-15)
Does "date" agree with "year", "month" and "day"?
Life tables (2020-03-11)
A sober look at the probability of dying in Australia
Data quality in iNaturalist downloads (2020-02-05)
Top marks for data from the citizen-science iNaturalist project
Steady as she goes, in Darwin (2019-10-25)
The daily temperatures in Darwin (Australia) are remarkably constant
Two ugly CSVs (2019-04-28)
Open but messy data from the Australian Electoral Commission and Companies House
Dog and cat data (2019-03-31)
A command-line exploration of five public datasets
Data with bulges (2019-03-10)
Three cases of unexpectedly large values in a data item
Two special data validations (2019-03-03)
Is that tree correctly located? Is that list of names and addresses truly regular?
Drugs on the command line (2019-01-06)
A disappointing dive into drugs data from the US Food and Drug Administration
Has the rainfall pattern in my hometown changed? (2018-12-23)
No obvious trends in number, length or intensity of rainfall events in recent years
Fun with BOM data (2018-07-11)
Weather watching with wget and gnuplot
Pivoting airlines (2018-06-03)
Using arrays of arrays to build a pivot table with AWK
AWK tips and tricks
How to use patsplit (GNU AWK) (2022-02-02)
Another way to split a string with AWK
Combinations from 2 lists: speed trials (2021-12-01)
Comparing two ways to build Cartesian products
Building a molar mass calculator (2021-03-24)
A shell script with AWK doing the chemical formula parsing
Updating a file from a lookup table (2020-11-11)
How to use an AWK array for lookup operations
How to use flags in AWK (revisited) (2020-10-21)
Flags are handy for defining AWK's working range of records
The easy-going syntax of AWK commands (2020-02-26)
AWK is flexible and tolerant in its command rules
Another surprising AWK trick (2019-12-06)
Strings or numbers? It depends on what you're doing with them.
A muggle's guide to AWK arrays: 4 (2019-09-20)
Easier and more flexible ways to sort array outputs
A muggle's guide to AWK arrays: 3 (2019-08-23)
Reformatting and table joining using arrays
A muggle's guide to AWK arrays: 2 (2019-07-12)
Working with two files, or the same file twice
A muggle's guide to AWK arrays: 1 (2019-06-07)
Array naming, index strings and value strings
A surprising AWK trick (2018-05-27)
A clever way to avoid using a flag in AWK
BASH tips and tricks
Put an editable command at the next prompt (2021-09-08)
Two ways to send an unfinished command to a prompt
How to bookmark directories in the shell (2020-06-10)
A couple of functions is all it takes
Brace expansion with variables and arrays: eval to the rescue (2020-04-22)
eval, a BASH built-in, solves brace expansion problems
Getting around a subshell problem (2020-01-15)
Something strange happens with buffering in a subshell
Working around the BASH brace expansion rule (2019-06-14)
How to build Cartesian string products in BASH
The magic of BASH string expansion (2019-05-19)
A simple trick that allows AWK and sed to use BASH as an interpreter
Avoiding senior moments with command-line functions (2018-11-13)
The trick is to make the documentation available on the CLI
Useful programs for command-line data ops
Revisiting a command-line translator (2021-07-28)
A handy tweak for the translate-shell program
VisiData: a table explorer for the terminal (2019-10-11)
Display, sort, reformat and more with this CLI utility
Transpose, pivot and bin with GNU Datamash 1.4 (2019-05-24)
Do complex data transformations more easily with Datamash
Parsing scientific names (gnparser) (2019-01-20)
Scientific names are much harder to parse than personal names
Data entry and display
Mapping with gnuplot, part 3 (2022-04-06)
How to create and animate "layers" on a gnuplot basemap
Mapping with gnuplot, part 2 (2022-03-30)
How to build a good-quality, fixed-scale basemap with gnuplot
An AWK histogram with scaling (2021-09-22)
In these histograms, bar length is scaled to the longest bar length
CSV viewers for CSV haters (2021-08-18)
Two CLI tools and one GUI
Visualising data as a PGM image (2021-08-04)
A not-very-successful experiment
A sunset surprise (2021-02-17)
Data graphics help to explain a puzzling phenomenon
Changing TTY prompt, font and colors (2020-02-19)
How to prettify your virtual terminals
Data validation on entry with YAD (2019-11-29)
In praise of lookup lists for data entry, with help from YAD dialogs
Plotting data in the terminal with gnuplot (2019-06-21)
A separate graphic is much better than an in-terminal plot
Making pictures with data (2019-04-14)
How to display data bytes as image bytes
Mapping with gnuplot (2018-10-31)
How to use gnuplot to put data points on a basemap
How to enter nothing in a database (2018-10-18)
If you have nothing to say, say nothing
Displaying data from table fragments (2018-09-06)
One way to build a tidy table from a jumble of data
A record pager built with YAD (2018-08-18; updated 2018-09-09)
How to turn a YAD dialog into a GUI viewer/pager for records in a data table
GUI ways to view and edit big text files (2018-07-31)
glogg, gvim, Geany and csvpad, but not spreadsheets
YAD repeat and edit (2018-05-21)
How to avoid re-entering data in a YAD data entry form
The Windows and spreadsheet worlds
Apple + Microsoft = character confusion (2022-02-09)
Saving a .docx to plain text can fail in odd ways
Spreadsheet annoyance no. 3: quotes have priority (2021-01-20)
Beware of unmatched quotes in data items
A grizzle about captive data (2020-07-22)
Don't confuse data with the Windows software that contains it
Spreadsheet annoyance no. 2 (2019-04-21)
Spreadsheets make dates out of entries that aren't dates, but that's not all they mess up
The trouble with Windows CRLF (2019-03-17)
Windows line endings are a pain in the ... terminal
Getting data out of Excel safely (2019-02-10)
Watch out for embedded linebreaks, comma problems and character encoding issues
Curse of the CSV monster (2018-07-18)
CSV with broken records to TSV
Miscellaneous stuff
DNA-style frameshift cryptography (2022-02-16)
Secret messaging inspired by biology
Scripting a temperature notifier (2022-01-26)
How cold did it get last night, and how cold is it now?
The data worker's guide to psiphiorrhea (2021-07-21)
Too many decimal places? There's a name for that
The little museum and its data (2021-06-01)
Love affairs between science and IT don't always end well
A short rant about Python, R and UNIX (2020-10-28)
Why would you clean data with Python or R?
A data table thousands of years old (2020-08-12)
Modern record-keeping in ancient Mesopotamia
Second Tuesday of each month and a BASHing data century (2020-03-25)
ncal and the 100th blog post
Msot popele can undreatnsd tihs setennce (2019-12-20)
Garbling and ungarbling with shell scripts
Python and shell tools (2019-11-22)
A comparison of three data operations
A command-line "Countdown" (UK) companion (2019-09-27)
Fast solver for anagram puzzles, and a puzzle generator
Getting data from an Enphase Envoy S (2019-09-06)
Two user-accessible JSON files with performance data
Data on clay (2018-09-20)
Cheap data storage for thousands of years? Check.
Ancient glyphs in your terminal? Check.