For a full list of BASHing data blog posts see the index page.
A data checker's checklist
The BASHing data blog has been in recess while I worked on a new resource for digital data checkers and cleaners. I've now got something like an outline of topics for that resource, which I'm offering below. Comments from readers would be very welcome on things I've left out, and on things with which data workers would firmly disagree. BASHing data, meanwhile, will continue with occasional posts on miscellaneous topics (like next week's post on some spectacular mojibake).
The new resource will help data workers build data tables that cause the least trouble for downstream data users and processing applications. It explains what to look for in a data table but not how to look. There's no code in the new resource, and no software recommendations. The data-working community is very diverse and includes Excel, R, Python and AWK/BASH wizards. Different workers will have different preferred strategies for checking data tables and for cleaning them. Each to their own!
Basic layout
- A data table is a rectangular array of data items
- Each field (column) contains one and only one kind of data
- All records (rows) have the same number of fields
- The first record has field names
- No entirely blank records and no entirely blank fields
- No extra metadata lines outside the field/record array
- Missing data items are just that: missing (empty strings)
Formatting and characters
- The table is plain, unformatted text
- Encoding is UTF-8
- Lines end with the linefeed character, not carriage return + linefeed
- The field-separating character does not appear within any data item
- Data items contain only letters, numbers, simple punctuation and plain whitespace
- No invisible control characters, no-break spaces or soft hyphens
- No leading, trailing or excess whitespace within data items
- No linebreaks within data items
- Each character has just one encoding, preferably the simplest
- Combining characters are normalised to single character equivalents
- Lost characters ("?", "�" and mojibake) are replaced with originals
- No unnecessary quoting of data items
- No markup or markdown
- Paired characters such as "(" and ")" are correctly matched within data items
Fields
- An index field has a code for each record
- No duplicate entries in the index field
- No missing-but-expected fields
- No empty fields (field is named but empty in all records)
Records
- No blank records
- No records split over two or more lines
- No exact duplicate records
- Partial duplicate records only where expected
Data items
- All data items valid for their field
- All data items appropriate for their field
- All data items correctly and consistently formatted for their field
- No pseudo-duplicated data items
- No missing-but-expected data items
- No truncated data items
- No "filler" for missing data
- Dates? Include date field formatted as ISO 8601
- Coordinates? Include decimal latitude and decimal longitude fields
- Check for "copy-down" errors from table previously in spreadsheet
- Check for "field shift" errors from table previously in spreadsheet
Cross-table and between-table relationships
- No "unilaterally" missing data items in paired fields
- No disagreements between related fields
- No disagreements of the min > max (!) kind
- Fields shared between tables have the same fieldname
- Fields shared between tables have referential integrity
Last update: 2021-05-12
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License