About this blog

This is the second series (2024 onwards) of the BASHing data blog. The first series of 200 posts (2018-2022) and this one are companion websites to A Data Cleaner's Cookbook. Like the first series, the current blog is a place for demonstrations and trials of command-line data "ops". The operations might include analysing, archiving, auditing, cleaning, de-duplicating, encoding, entering, migrating, querying, reformatting, reporting, storing, etc.

The first BASHing data series and A Data Cleaner's Cookbook are still online, but they are also archived in Zenodo and can be downloaded for offline use.

This website has an RSS feed.

About me

I'm a data auditor and retired zoologist.

Robert Mesibov, West Ulverstone, Tasmania, Australia
robert.mesibov@gmail.com

The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Posts by category (most recent post first):

Data auditing, cleaning and processing

Numbering duplicates by appearance order and date order (2024-12-20) ⇜ LATEST
     How to violate the "one-field-one-kind-of-information" principle of databasing

Another embedded newlines fix (2024-11-29)
     If all records begin with the same string, there's an interesting AWK solution

Merging tables with (some) shared fields (2024-11-15)
     Get the fields into alignment with datamash and the join command

Timing a CSV to TSV operation (2024-11-01)
     How to quickly and easily compare process times?

Documenting edits with a before-and-after report (2024-09-13)
     A tweak to make the output more informative

Find the first, last, nth and first+last occurrence of a string (2024-06-21)
     Showing the easiest ways I know to do these jobs

Extract successive pairs from a list, and rapidly grow a list (2024-05-03)
     How to do it, but be careful with the "yes" command

Post- and pre-incrementing (var++ and ++var) with AWK (2024-04-26)
     Pre or post? Sometimes it doesn't matter

Finding near-duplicate spelling variants (2024-04-05)
     How to search for ä/ae-type duplicates

Table in a PDF to a TSV, on the command line (2024-03-29)
     Use the pdftotext utility and clean up with sed and AWK

Finding identifier codes with and without extra characters (2024-02-02)
     A command-line solution for finding near-duplicate values


Characters and encoding

The Web's most familiar gibberish: ’ (2024-11-22)
     Unfortunately, it isn't going to go away anytime soon

Mojibake, anyone? (2024-07-19)
     More delightful examples from real-world data audits

How to detect and convert those baffling ruffians (2024-06-28)
     Beware of Latin ligatures

A text full of nulls - what happened? (2024-06-07)
     Hint: Microsoft Windows encoding

Print a character as a variable with BASH printf (2024-03-22)
     There's a right way and a wrong way, but both work

Counterfeit spaces: the NBSP menace (2024-03-01)
     How to visualise and replace (or delete) NBSPs

Mojibake with 2 hearts and 52 bytes (2024-02-09)
     Encoding ping-pong between UTF-8 and Windows-1252


Data formatting

Munging the Atlas of Living Australia table format (2024-12-06)
     Why is the header in a separate file?

USV: The Unicode Separated Values format (2024-10-11)
     It's new and interesting

Line spacing tricks - updated (2024-07-12/2024-09-03)
     sed, AWK and grep are your friends

Archiving images: TIFF vs PPM (2024-07-04)
     Which format will be more easily readable in 1000 years?

DataMatrix codes and data content (2024-04-19)
     Squeezing lots of information into a tiny graphic

CSV to JSON to CSV, awkwardly (2024-04-12)
     Recovering CSV data from an awful JSON file

Convert Microsoft serial day numbers to YYYY-MM-DD (2024-02-23)
     Easy, if you remember that 1900-02-29 didn't happen


Data analysis examples

Anatomy of a data analysis (2024-10-25)
     5 million records dissected with BASH arrays

Summing by type in a table (2024-09-20)
     What to do if the table layout is awful

Minimum, maximum and range by group (2024-05-24)
     GNU datamash is great, but sometimes more is needed


Data entry and display

Pretty-printing a table in the terminal (2024-11-08)
     Three little-known CLI programs and a tip about less

A plotting-in-terminal solution: sixels and mlterm (2024-10-04)
     With some terminals, sixel graphics are wonderfully easy to use

Millipedes and maps (2024-08-16)
     A script to automate some map-making for the Web

Searching a pick-list with YAD (2024-08-09)
     YAD can display form options from a list

Middle-click paste a series of numbers or letters (2024-07-26)
     A neat trick that might be handy someday

Mapping with gnuplot, part 5 (2024-03-15)
     Building a dialog for choosing data to be mapped

Mapping with gnuplot, part 4 (2024-03-08)
     Showing a much-improved way to build a basemap


Useful programs for command-line data ops

7 ways to get the source code of a webpage (2024-09-27)
     Not the same as Web scraping

Escaping from Microsoft Excel on the command line (2024-08-30)
     xlsx2csv, in2csv, ssconvert and unoconv

How to crunch a grawlix (2024-08-02)
     Demonstrating an unusual use for crunching

Five useful tweaks for the ranger file manager (2024-06-14)
     Easy ways to make this CLI utility even better

Polyglot and round-trip translations (2024-05-31)
     Flexible translations with translate-shell

GNU datamash and months (2024-02-16)
     How to help datamash over the month-sorting hurdle


AWK tips and tricks

How to force a preferred array sort in AWK (2024-10-18)
     Use a second array to control the first

Find a word, plus words either side of the matching word (2024-08-23)
     It might be easier with AWK than with grep

AWK one-liners to multi-liners (2024-05-10)
     A little-known "pretty print" option


BASH tips and tricks

Sorting camels, kebabs, pascals and snakes (2024-12-13)
     Word case has a powerful effect on sorting


Miscellaneous stuff

Tools of my trades (2024-09-06)
     23 GUI and 72 CLI programs I need for my work

The curious world of UUIDs (2024-05-17)
     What they are and how to tinker with them