About this website
A Data Cleaner's Cookbook went online on 23 October 2016. I corrected and updated it frequently over the next three years. At the end of 2019 I began re-organising the site and adding new recipes and examples from the companion blog, BASHing data. I did a third revision at the beginning of 2022. Past versions of the Cookbook are available as downloads in Zenodo.
If you find mistakes in the Cookbook or have suggestions for better recipes, please email me.
Please note that in 2024 a special version of this Cookbook went online for workers using the Darwin Core biodiversity standard.
Robert Mesibov, West Ulverstone, Tasmania, Australia
robert.mesibov@gmail.com
Latest update: 2024-09-12
About the companion blog
On the BASHing data blog I write about:
- Data auditing, cleaning and processing
- Characters and encoding
- Data formatting
- Data analysis
- AWK tips and tricks
- BASH tips and tricks
- Useful programs for command-line data ops
- Data entry and display
- The Windows and spreadsheet worlds
- Miscellaneous stuff
The blog has more examples than the Cookbook does, and more background information on data-cleaning recipes. I recommend browsing the BASHing data posts for ideas on command-line data operations that don't involve data cleaning.
BASHing data is in two series. The first series of 200 posts ran from 2018 to 2022 and, like the Cookbook, has been archived in Zenodo. The second series started in early 2024.
About me
I'm a data auditor and retired scientist, and I've been working with data tables for more than 50 years. I started with printed columns on paper (and a calculator) before moving to spreadsheets and relational databases (Microsoft Access, FileMaker Pro, MySQL, SQLite).
In 2012 I discovered the AWK language and realised that every processing job I had ever done with data tables could be done faster and more simply on the command line. Since then my data tables have been stored as plain text and managed with command-line tools, especially AWK.
In case you're wondering "Which Linux?", I run MX on my desktop and my work laptop. MX Linux is based on Debian (stable) and is supplemented with handy new utilities and a good selection of apps. I highly recommend MX as an all-purpose Linux distro.
Contact me directly if you would like a quote on a data auditing or data cleaning job. I'm also happy to quote on training data workers in command-line methods.
About the banner image
The webpage banner shows a detail from a painting by the 17th-century Flemish artist David Rijckaert III. I like the look of concentration on the alchemist's face as he refers to a text. Working with the command line isn't alchemy, but sometimes it seems like magic.
Legal stuff
The text and images on this website are my own work. They are copyright and are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. You are welcome to use or copy the information and images on this website for non-commercial purposes, but please attribute that use to this source.
Please note that the software commands on this website are provided "as is", without warranty of any kind, express or implied, including fitness for particular purposes. In no event shall the website author be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software commands on this website.