About this blog

This is the second series (2024 onward) of the BASHing data blog. The first series of 200 posts (2018-2022) and this one are companion websites to A Data Cleaner's Cookbook. Like the first series, the current blog is a place for demonstrations and trials of command-line data "ops". The operations might include analysing, archiving, auditing, cleaning, de-duplicating, encoding, entering, migrating, querying, reformatting, reporting, storing, etc.

The first BASHing data series and A Data Cleaner's Cookbook are still online, but they are also archived in Zenodo and can be downloaded for offline use. The first 75 posts in this second BASHing data series are likewise archived in Zenodo.

This website has an RSS feed.

About me

I'm a data auditor and retired zoologist.

Robert Mesibov, West Ulverstone, Tasmania, Australia
mesibov@datafix.com.au

The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Posts by category (most recent post first):

Data auditing, cleaning and processing

Does string A contain string B? Ask AWK's index (2025-02-28)
     Demonstrating a handy use for the index function

Extract the year from a date string without using the date command (2025-02-21)
     Demonstrating 2 methods and 4 variations for each

Adding the missing keys and values in a key-value series (2025-01-24)
     Easily done with an AWK array

Replace the last N occurrences of a pattern in a string (2025-01-03)
     Drive sed with a for loop

Numbering duplicates by appearance order and date order (2024-12-20)
     How to violate the "one-field-one-kind-of-information" principle of databasing

Another embedded newlines fix (2024-11-29)
     If all records begin with the same string, there's an interesting AWK solution

Merging tables with (some) shared fields (2024-11-15)
     Get the fields into alignment with datamash and the join command

Timing a CSV to TSV operation (2024-11-01)
     How to quickly and easily compare process times?

Documenting edits with a before-and-after report (2024-09-13)
     A tweak to make the output more informative

Find the first, last, nth and first+last occurrence of a string (2024-06-21)
     Showing the easiest ways I know to do these jobs

Extract successive pairs from a list, and rapidly grow a list (2024-05-03)
     How to do it, but be careful with the "yes" command

Post- and pre-incrementing (var++ and ++var) with AWK (2024-04-26)
     Pre or post? Sometimes it doesn't matter

Finding near-duplicate spelling variants (2024-04-05)
     How to search for ä/ae-type duplicates

Table in a PDF to a TSV, on the command line (2024-03-29)
     Use the pdftotext utility and clean up with sed and AWK

Finding identifier codes with and without extra characters (2024-02-02)
     A command-line solution for finding near-duplicate values


Characters and encoding

Beware these characters in a terminal (2025-06-20)
     Really annoying behaviour for CLI users

How to hide a number in plain sight (2025-06-13)
     A simple cryptographic trick

The ìèñëèâñüêå mystery (2025-01-31)
     The killer was... the Microsoft Corporation

A Unicode normalisation problem (2025-01-10)
     How to get rid of full-width characters

The Web's most familiar gibberish: ’ (2024-11-22)
     Unfortunately, it isn't going to go away anytime soon

Mojibake, anyone? (2024-07-19)
     More delightful examples from real-world data audits

How to detect and convert those baffling ruffians (2024-06-28)
     Beware of Latin ligatures

A text full of nulls - what happened? (2024-06-07)
     Hint: Microsoft Windows encoding

Print a character as a variable with BASH printf (2024-03-22)
     There's a right way and a wrong way, but both work

Counterfeit spaces: the NBSP menace (2024-03-01)
     How to visualise and replace (or delete) NBSPs

Mojibake with 2 hearts and 52 bytes (2024-02-09)
     Encoding ping-pong between UTF-8 and Windows-1252


Data formatting

Multiple-line records to a simple table (2025-05-16)
     A single AWK command regularises an irregular set of records

How to add trailing spaces and zeroes (2025-04-25)
     Spaces easy, zeroes tricky

Extreme reformatting: a vertical calendar (2025-04-18)
     It took a surprising amount of work to build vertically

Text processing with xargs and jot (2025-03-28)
     Demonstrating niche uses for these two utilities

Munging the Atlas of Living Australia table format (2024-12-06)
     Why is the header in a separate file?

USV: The Unicode Separated Values format (2024-10-11)
     It's new and interesting

Line spacing tricks - updated (2024-07-12; 2024-09-03)
     sed, AWK and grep are your friends

Archiving images: TIFF vs PPM (2024-07-04)
     Which format will be more easily readable in 1000 years?

DataMatrix codes and data content (2024-04-19)
     Squeezing lots of information into a tiny graphic

CSV to JSON to CSV, awkwardly (2024-04-12)
     Recovering CSV data from an awful JSON file

Convert Microsoft serial day numbers to YYYY-MM-DD (2024-02-23)
     Easy, if you remember that 1900-02-29 didn't happen


Data analysis examples

How to ignore everything but numbers (2025-05-23)
     If AWK sees a number first, it thinks arithmetically

Find all data points "X" km or less from a given point (2025-03-14)
     A command-line alternative to working with a GIS program

Permutations and combinations of pairs with AWK (2025-03-07)
     Easy ways to get results with and without repetition

Anatomy of a data analysis (2024-10-25)
     5 million records dissected with BASH arrays

Summing by type in a table (2024-09-20)
     What to do if the table layout is awful

Minimum, maximum and range by group (2024-05-24)
     GNU datamash is great, but sometimes more is needed


Data entry and display

Data entry with unknown data categories (2025-07-04)
     What to do when you don't know the fields in advance

Rename time-series files for chronological sorting (2025-04-04)
     Dealing with the "document15176075861143268989.pdf" problem

Four exercises with data art (2025-02-14)
     Colorful fun with PPMs

Pretty-printing a table in the terminal - updated (2024-11-08; 2025-02-21)
     Three little-known CLI programs and a tip about less

A plotting-in-terminal solution: sixels and mlterm (2024-10-04)
     With some terminals, sixel graphics are wonderfully easy to use

Millipedes and maps (2024-08-16)
     A script to automate some map-making for the Web

Searching a pick-list with YAD (2024-08-09)
     YAD can display form options from a list

Middle-click paste a series of numbers or letters (2024-07-26)
     A neat trick that might be handy someday

Mapping with gnuplot, part 5 (2024-03-15)
     Building a dialog for choosing data to be mapped

Mapping with gnuplot, part 4 (2024-03-08)
     Showing a much-improved way to build a basemap


Useful programs for command-line data ops

Making an archive job a lot easier (2025-06-06)
     Selectively unzip and rename, all on the command line

Two more tweaks for the ranger file manager (2025-05-09)
     Wrap text in preview, and improve the default colors

csvlens: a delimited text file viewer for the terminal (2025-05-02)
     TL;DR: it works very well as a viewer!

New code for my translation box (2025-03-21)
     Translations on the fly, item by item

MAD about the median (2025-01-17)
     That's Median Absolute Deviation, a useful statistic

7 ways to get the source code of a webpage (2024-09-27)
     Not the same as Web scraping

Escaping from Microsoft Excel on the command line (2024-08-30)
     xlsx2csv, in2csv, ssconvert and unoconv

How to crunch a grawlix (2024-08-02)
     Demonstrating an unusual use for crunching

Five useful tweaks for the ranger file manager (2024-06-14)
     Easy ways to make this CLI utility even better

Polyglot and round-trip translations (2024-05-31)
     Flexible translations with translate-shell

GNU datamash and months (2024-02-16)
     How to help datamash over the month-sorting hurdle


AWK tips and tricks

Five ways to pass a shell variable to AWK (2025-06-27)
     Two simple ways and three clever tricks

AWK's view of existence (2025-02-07)
     Empty vs zero: beware the difference

How to force a preferred array sort in AWK (2024-10-18)
     Use a second array to control the first

Find a word, plus words either side of the matching word (2024-08-23)
     It might be easier with AWK than with grep

AWK one-liners to multi-liners (2024-05-10)
     A little-known "pretty print" option


BASH tips and tricks

The script command for tinkerers (2025-07-11) ⇜ LATEST
     A handy way to record command history

A launcher for occasionally used applications (2025-04-11)
     A DIY desktop tool for speed and efficiency

Sorting camels, kebabs, pascals and snakes (2024-12-13)
     Word case has a powerful effect on sorting


Miscellaneous stuff

What a long, strange trip it's been (2025-05-30)
     Computer users were computer programmers, to begin with

The browser-as-text-editor trick (2024-12-27)
     It's a simple trick and can save you having to open a separate text editor

Tools of my trades (2024-09-06)
     23 GUI and 72 CLI programs I need for my work

The curious world of UUIDs (2024-05-17)
     What they are and how to tinker with them