For a list of BASHing data 2 blog posts see the index page.
Escaping from Microsoft Excel on the command line
Back in 2019 I blogged about getting out of Excel safely, or in other words converting an Excel worksheet into a plain-text, tab-separated file (TSV) suitable for data checking on the command line.
Under Linux I can use either a GUI method or a CLI one to safely escape from Excel. The GUI one is simple: open the Excel worksheet in LibreOffice Calc or Gnumeric and select all the populated cells with Ctrl+Home and Shift+Ctrl+End. Copy the selected cells to the clipboard and paste the data in a new (blank) file in a capable text editor, like Geany. The result is a TSV in the text editor's default encoding (UTF-8 in my case).
For demonstration purposes I downloaded the sample Excel file "Employee Sample Data.xlsx" from the Spreadsheet Guru website. The spreadsheet has 14 fields and 1000 records plus a one-line header. I opened it in Calc and pasted the active cells as "calc1.tsv" in my text editor.
One issue I've found with the GUI method is that the date format in the TSV can vary with the (non-Excel) spreadsheet program used to open the Excel file. If I open the demo file in Calc I get dates formatted as "8/04/2016" by default, for example, while in Gnumeric the same date is "8/4/16" by default, and those two strings appear in their respective copy-pasted TSVs. I can do post-pasting reformatting of the TSV on the command line, but I need to know what date format I'm dealing with.
There are 4 different CLI methods I've tried for Excel escapes, as demonstrated below.
xlsx2csv is a project of developer Dilshod Temirkhodjaev, with other contributors. The -d tab option makes the output tab-separated, and you can also set an output date format with -f. The command
xlsx2csv -d tab -f "%Y-%m-%d" 'Employee Sample Data.xlsx' > dilshod.tsv
builds a TSV with ISO 8601 dates, as shown in these screenshots:
I got the ISO 8601 date format I wanted, but the salary has been stripped of its "$" and thousands separator, and the integer percentage "15%" has been converted to "0.15". Note, however, that there was a trailing space in the copy-paste "calc1.tsv" file after "$141,604", and xlsx2csv has removed it.
The csvkit bundle of utilities includes in2csv, which converts dates to ISO 8601 by default but behaves like xlsx2csv with regard to currency and percentages. It only outputs CSV, but you can combine in2csv with the csvformat -T command to build a TSV as follows:
in2csv 'Employee Sample Data.xlsx' | csvformat -T > csvkit.tsv
The command-line tool ssconvert is included in the Gnumeric spreadsheet package. I've had no success with getting ssconvert to export with tab separators, but with the following command chain I can get a TSV:
ssconvert 'Employee Sample Data.xlsx' gnum1.csv; csvformat -T < gnum1.csv > gnum2.tsv
Notice that I've again lost the currency and percentage formatting, and the date format has defaulted to YYYY/MM/DD. I'm not sure where that comes from, as it's not the default in my locale, and I haven't seen a date/time formatting option for ssconvert.
The fourth command-line tool I've looked at is unoconv, which works with LibreOffice but needs to be installed separately. I found it easiest to again generate a CSV and convert that to TSV:
unoconv -f csv 'Employee Sample Data.xlsx'; csvformat -T < 'Employee Sample Data.csv' > lo2.tsv
unoconv preserves not only the date format shown in Calc, but also the currency and percentage formatting. It also includes that strange trailing space in the salary data item (not present in the spreadsheet cell).
Summing up, the best CLI escaper for my purposes is xlsx2csv, since the datasets I audit don't normally include formatting characters like "$" and "%" within data items, and the program reliably outputs TSV and ISO 8601 dates. xlsx2csv is also very fast and handles large and multi-worksheet Excel files easily.
Last update: 2024-08-30
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License