
About this blog
This is the second series (2024 onwards) of the BASHing data blog. The first series of 200 posts (2018-2022) and this one are companion websites to A Data Cleaner's Cookbook. Like the first series, the current blog is a place for demonstrations and trials of command-line data "ops". The operations might include analysing, archiving, auditing, cleaning, de-duplicating, encoding, entering, migrating, querying, reformatting, reporting, storing, etc.
The first BASHing data series and A Data Cleaner's Cookbook are still online, but they are also archived in Zenodo and can be downloaded for offline use. The first 75 posts in this second BASHing data series are likewise archived in Zenodo.
About me
I'm a data auditor and retired zoologist.
Robert Mesibov, West Ulverstone, Tasmania, Australia
mesibov@datafix.com.au
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Topic categories:
- Data auditing, cleaning and processing
- Characters and encoding
- Data formatting
- Data analysis examples
- Data entry and display
- Useful programs for command-line data ops
- AWK tips and tricks
- BASH tips and tricks
- Miscellaneous stuff
Posts by category (most recent post first):
Data auditing, cleaning and processing
Does string A contain string B? Ask AWK's index (2025-02-28)
Demonstrating a handy use for the index function
Extract the year from a date string without using the date command (2025-02-21)
Demonstrating 2 methods and 4 variations for each
Adding the missing keys and values in a key-value series (2025-01-24)
Easily done with an AWK array
Replace the last N occurrences of a pattern in a string (2025-01-03)
Drive sed with a for loop
Numbering duplicates by appearance order and date order (2024-12-20)
How to violate the "one-field-one-kind-of-information" principle of databasing
Another embedded newlines fix (2024-11-29)
If all records begin with the same string, there's an interesting AWK solution
Merging tables with (some) shared fields (2024-11-15)
Get the fields into alignment with datamash and the join command
Timing a CSV to TSV operation (2024-11-01)
How to quickly and easily compare process times?
Documenting edits with a before-and-after report (2024-09-13)
A tweak to make the output more informative
Find the first, last, nth and first+last occurrence of a string (2024-06-21)
Showing the easiest ways I know to do these jobs
Extract successive pairs from a list, and rapidly grow a list (2024-05-03)
How to do it, but be careful with the "yes" command
Post- and pre-incrementing (var++ and ++var) with AWK (2024-04-26)
Pre or post? Sometimes it doesn't matter
Finding near-duplicate spelling variants (2024-04-05)
How to search for ä/ae-type duplicates
Table in a PDF to a TSV, on the command line (2024-03-29)
Use the pdftotext utility and clean up with sed and AWK
Finding identifier codes with and without extra characters (2024-02-02)
A command-line solution for finding near-duplicate values
Characters and encoding
Beware these characters in a terminal (2025-06-20)
Really annoying behaviour for CLI users
How to hide a number in plain sight (2025-06-13)
A simple cryptographic trick
The ìèñëèâñüêå mystery (2025-01-31)
The killer was... the Microsoft Corporation
A Unicode normalisation problem (2025-01-10)
How to get rid of full-width characters
The Web's most familiar gibberish: â€™ (2024-11-22)
Unfortunately, it isn't going to go away anytime soon
Mojibake, anyone? (2024-07-19)
More delightful examples from real-world data audits
How to detect and convert those baffling ruffians (2024-06-28)
Beware of Latin ligatures
A text full of nulls - what happened? (2024-06-07)
Hint: Microsoft Windows encoding
Print a character as a variable with BASH printf (2024-03-22)
There's a right way and a wrong way, but both work
Counterfeit spaces: the NBSP menace (2024-03-01)
How to visualise and replace (or delete) NBSPs
Mojibake with 2 hearts and 52 bytes (2024-02-09)
Encoding ping-pong between UTF-8 and Windows-1252
Data formatting
Multiple-line records to a simple table (2025-05-16)
A single AWK command regularises an irregular set of records
How to add trailing spaces and zeroes (2025-04-25)
Spaces easy, zeroes tricky
Extreme reformatting: a vertical calendar (2025-04-18)
It took a surprising amount of work to build vertically
Text processing with xargs and jot (2025-03-28)
Demonstrating niche uses for these two utilities
Munging the Atlas of Living Australia table format (2024-12-06)
Why is the header in a separate file?
USV: The Unicode Separated Values format (2024-10-11)
It's new and interesting
Line spacing tricks - updated (2024-07-12/2024-09-03)
sed, AWK and grep are your friends
Archiving images: TIFF vs PPM (2024-07-04)
Which format will be more easily readable in 1000 years?
DataMatrix codes and data content (2024-04-19)
Squeezing lots of information into a tiny graphic
CSV to JSON to CSV, awkwardly (2024-04-12)
Recovering CSV data from an awful JSON file
Convert Microsoft serial day numbers to YYYY-MM-DD (2024-02-23)
Easy, if you remember that 1900-02-29 didn't happen
Data analysis examples
How to ignore everything but numbers (2025-05-23)
If AWK sees a number first, it thinks arithmetically
Find all data points "X" km or less from a given point (2025-03-14)
A command-line alternative to working with a GIS program
Permutations and combinations of pairs with AWK (2025-03-07)
Easy ways to get results with and without repetition
Anatomy of a data analysis (2024-10-25)
5 million records dissected with BASH arrays
Summing by type in a table (2024-09-20)
What to do if the table layout is awful
Minimum, maximum and range by group (2024-05-24)
GNU datamash is great, but sometimes more is needed
Data entry and display
Data entry with unknown data categories (2025-07-04)
What to do when you don't know the fields in advance
Rename time-series files for chronological sorting (2025-04-04)
Dealing with the "document15176075861143268989.pdf" problem
Four exercises with data art (2025-02-14)
Colorful fun with PPMs
Pretty-printing a table in the terminal - updated (2024-11-08; 2025-02-21)
Three little-known CLI programs and a tip about less
A plotting-in-terminal solution: sixels and mlterm (2024-10-04)
With some terminals, sixel graphics are wonderfully easy to use
Millipedes and maps (2024-08-16)
A script to automate some map-making for the Web
Searching a pick-list with YAD (2024-08-09)
YAD can display form options from a list
Middle-click paste a series of numbers or letters (2024-07-26)
A neat trick that might be handy someday
Mapping with gnuplot, part 5 (2024-03-15)
Building a dialog for choosing data to be mapped
Mapping with gnuplot, part 4 (2024-03-08)
Showing a much-improved way to build a basemap
Useful programs for command-line data ops
Making an archive job a lot easier (2025-06-06)
Selectively unzip and rename, all on the command line
Two more tweaks for the ranger file manager (2025-05-09)
Wrap text in preview, and improve the default colors
csvlens: a delimited text file viewer for the terminal (2025-05-02)
TL;DR: it works very well as a viewer!
New code for my translation box (2025-03-21)
Translations on the fly, item by item
MAD about the median (2025-01-17)
That's Median Absolute Deviation, a useful statistic
7 ways to get the source code of a webpage (2024-09-27)
Not the same as Web scraping
Escaping from Microsoft Excel on the command line (2024-08-30)
xlsx2csv, in2csv, ssconvert and unoconv
How to crunch a grawlix (2024-08-02)
Demonstrating an unusual use for crunching
Five useful tweaks for the ranger file manager (2024-06-14)
Easy ways to make this CLI utility even better
Polyglot and round-trip translations (2024-05-31)
Flexible translations with translate-shell
GNU datamash and months (2024-02-16)
How to help datamash over the month-sorting hurdle
AWK tips and tricks
Five ways to pass a shell variable to AWK (2025-06-27)
Two simple ways and three clever tricks
AWK's view of existence (2025-02-07)
Empty vs zero: beware the difference
How to force a preferred array sort in AWK (2024-10-18)
Use a second array to control the first
Find a word, plus words either side of the matching word (2024-08-23)
It might be easier with AWK than with grep
AWK one-liners to multi-liners (2024-05-10)
A little-known "pretty print" option
BASH tips and tricks
The script command for tinkerers (2025-07-11) ⇜ LATEST
A handy way to record command history
A launcher for occasionally used applications (2025-04-11)
A DIY desktop tool for speed and efficiency
Sorting camels, kebabs, pascals and snakes (2024-12-13)
Word case has a powerful effect on sorting
Miscellaneous stuff
What a long, strange trip it's been (2025-05-30)
Computer users were computer programmers, to begin with
The browser-as-text-editor trick (2024-12-27)
It's a simple trick and can save you having to open a separate text editor
Tools of my trades (2024-09-06)
23 GUI and 72 CLI programs I need for my work
The curious world of UUIDs (2024-05-17)
What they are and how to tinker with them