On this page:
|
On the Data tables 2 page:
|
On the Data tables 3 page:
|
On the Data tables 4 page:
|
What's a "data table"?
On this website a data table is a plain text file containing records broken into fields, with one record per line. The data in the records are strings with letters, numerals, punctuation or spaces. Each field in a record contains just one data item. Record number 1 in the table is the header and it contains the names of the fields.
Above is a small data table as it looks in the Geany text editor. The table has 5 fields and 11 records. Some of the data items in the "Landline" field are empty strings — they're missing, but the table holds places for them.
The screenshot shows fields separated by the tab character, making this table a tab-separated value file, or TSV. For reasons listed in the next section, TSVs are best for data checking and cleaning. A TSV is what you get when you copy/paste a spreadsheet into a plain-text file (see below), and most databases have TSV as an export option. If you're checking or cleaning a plain-text table with field separators other than tabs, such as commas in a CSV, it's best to convert the table to a TSV first, as described below. The original field separators can be put back later if needed.
Recipes on this website that only work with TSVs are marked with the icon TSV. It's also assumed that data tables are in UTF-8 encoding.
Why TSV?
There are four good reasons to use the tab character to separate (or delimit) the fields in a data table:
To avoid comma hassles. Comma-separated value (or 'CSV') text files are more popular than TSVs, but they can be troublesome because a comma within a data item is no different from a comma used to separate data items. This trouble is avoided by double-quoting data items containing a comma. Any pre-existing double quotes in the data item then have to be double-quoted, as well. In practice, sometimes only data items containing commas are double-quoted, sometimes data items with spaces are also double-quoted, and sometimes all data items are double-quoted:
Three ways to quote data items in a CSV:
"Smith, Andrew",0417 563 294,"""The Croft""",Yan Yean,Vic
"Smith, Andrew","0417 563 294","""The Croft""","Yan Yean",Vic
"Smith, Andrew","0417 563 294","""The Croft""","Yan Yean","Vic"
The same data items in a TSV:
Smith, Andrew 0417 563 294 "The Croft" Yan Yean Vic
Unlike commas, tabs are unlikely to be present within data items. It doesn't matter how many commas or spaces there are in a TSV file, they won't be mistaken for field separators, and data items in a TSV with commas, spaces or double quotes don't need to be double-quoted.
For ease of displaying fields. Tabs make it easier to see fields as separate blocks when a data table is viewed in a text editor or a terminal. In the screenshot above, note how the Geany text editor shows the tab character as a light-coloured arrow. The text editor Gedit does the same with its 'Draw Spaces' plug-in.
You can get clear separation of tab-separated data items in a terminal by adjusting the default tab spacing with the tabs command. In this screenshot from my terminal, I set the tab spacing to 12 characters before cat-ing the TSV "contacts":
For ease of text-spreadsheet interchange. Spreadsheet programs use the tab character as field delimiter in copy/pasting, which is a great time-saver when swapping between text-only and spreadsheet format. See below for details.
No need to specify separator with two basic utilities. The tab character is the default field separator for the cut and paste commands, so you can use those commands without specifying a delimiter. Both commands are very handy in managing data tables.
Spreadsheet to TSV
If you copy a rectangular array of cells in a spreadsheet and paste from the clipboard into a plain-text text editor, the data items in the cells will automatically be tab-separated. This method saves you having to choose between the spreadsheet export options for field separator, and will avoid the quoting of data items.
To select a rectangular array of cells (in Microsoft Excel, LibreOffice Calc or Gnumeric), first make the top left cell active. If that cell is "A1", do this with Ctrl + Home. To complete the selection, use Ctrl + Shift + End.
If you copy and paste cells from Microsoft Excel, the encoding may be Windows-1252 (it depends on the settings in your text editor). To convert to UTF-8 encoding see here. Exporting directly from Excel to text may be an easier option, but see this BASHing data blog post for some cautions.
There are dozens of plain-text text editors. My favourites are Geany and gedit.
CSV to TSV
CSV formatting can vary from application to application and I don't know of a single command or CSV tool that will convert every CSV "in the wild" to a TSV without errors. A good first choice, though, is the csvformat converter from the "csvkit" group of utilities (in Linux repositories).
csvformat -T table.csv > table.tsv
csvformat will also neatly build a TSV from a "CSV" file that uses semicolons or pipes as field separators. Use the -d option to specify the separator:
csvformat -d ";" -T semicolontable.csv > semicolontable.tsv
csvformat -d "|" -T pipetable.csv > pipetable.tsv
You can also try the "c2t" function (below), which converts most CSV variants to TSV. The function takes table as an argument and generates a TSV named "tsv_table".
Convert correctly formatted CSV to TSV
c2t() { sed 's/""/@@@/g' "$1" | awk -v FPAT='[^,]*|"[^"]*"' -v OFS="\t" '{$1=$1; gsub(/"/,""); print}' | sed 's/\t@@@\t/\t\t/g;s/@@@/"/g' > tsv_"$1"; }
For more information on CSVs and CSV tools see these BASHing data posts:
Curse of the CSV monster (2018-07-18)
CSV to table, table to CSV (2021-06-02)
CSV viewers for CSV haters (2021-08-18)
TSV to CSV on the CLI (if you really have to) (2021-10-13)