For a full list of BASHing data blog posts see the index page.

A data checker's checklist

The BASHing data blog has been in recess while I worked on a new resource for digital data checkers and cleaners. I've now got something like an outline of topics for that resource, which I'm offering below. Comments from readers would be very welcome on things I've left out, and on things with which data workers would firmly disagree. BASHing data, meanwhile, will continue with occasional posts on miscellaneous topics (like next week's post on some spectacular mojibake).

The new resource will help data workers build data tables that cause the least trouble for downstream data users and processing applications. It explains what to look for in a data table but not how to look. There's no code in the new resource, and no software recommendations. The data-working community is very diverse and includes Excel, R, Python and AWK/BASH wizards. Different workers will have different preferred strategies for checking data tables and for cleaning them. Each to their own!

Basic layout

A data table is a rectangular array of data items
Each field (column) contains one and only one kind of data
All records (rows) have the same number of fields
The first record has field names
No entirely blank records and no entirely blank fields
No extra metadata lines outside the field/record array
Missing data items are just that: missing (empty strings)

Formatting and characters

The table is plain, unformatted text
Encoding is UTF-8
Lines end with the linefeed character, not carriage return + linefeed
The field-separating character does not appear within any data item
Data items contain only letters, numbers, simple punctuation and plain whitespace
No invisible control characters, no-break spaces or soft hyphens
No leading, trailing or excess whitespace within data items
No linebreaks within data items
Each character has just one encoding, preferably the simplest
Combining characters are normalised to single character equivalents
Lost characters ("?", "�" and mojibake) are replaced with originals
No unnecessary quoting of data items
No markup or markdown
Paired characters such as "(" and ")" are correctly matched within data items

Fields

An index field has a code for each record
No duplicate entries in the index field
No missing-but-expected fields
No empty fields (field is named but empty in all records)

Records

No blank records
No records split over two or more lines
No exact duplicate records
Partial duplicate records only where expected

Data items

All data items valid for their field
All data items appropriate for their field
All data items correctly and consistently formatted for their field
No pseudo-duplicated data items
No missing-but-expected data items
No truncated data items
No "filler" for missing data
Dates? Include date field formatted as ISO 8601
Coordinates? Include decimal latitude and decimal longitude fields
Check for "copy-down" errors from table previously in spreadsheet
Check for "field shift" errors from table previously in spreadsheet

Cross-table and between-table relationships

No "unilaterally" missing data items in paired fields
No disagreements between related fields
No disagreements of the min > max (!) kind
Fields shared between tables have the same fieldname
Fields shared between tables have referential integrity

Last update: 2021-05-12
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License