For a full list of BASHing data blog posts, see the index page.
Getting data out of Excel safely
This isn't an Excel-bashing post. Excel is perfectly OK for what it does, and millions of people happily use Excel every day. Their data look and behave just fine in an Excel spreadsheet. Then the data get exported from Excel as plain text for use in various other applications, and Bad Things Happen.
In my experience, those Bad Things mainly have to do with commas, embedded linebreaks and character encoding. Let me turn that on its head: if you export your Excel spreadsheet as plain text with (1) tab-separated fields, (2) no embedded linebreaks and (3) UTF-8 encoding, you probably won't be creating unnecessary work for other users of your data.
(1) Why tab-separated fields (TSV)? What's wrong with comma-separated (CSV)? Please see Curse of the CSV monster. Unless you know that the next user of your data absolutely, definitely requires the CSV format, export your data from Excel in TSV format.
Pro tip: if you use the filename suffix ".csv" instead of ".tsv" or ".txt" for your TSV, spreadsheet programs like Excel should open the file without field-separation errors, and no one looking at the spreadsheet will know you used tabs instead of commas in your text file.
(2) Although you can remove embedded linebreaks from a text file exported from Excel, it's easier to do the job within the spreadsheet. In all recent versions of Excel, bring up the "Replace" dialog with Ctrl + h. With the cursor in the "Find what" box, type either Ctrl + j or Alt + 010, where the "010" is entered from the numeric keypad on your keyboard. In both cases you're entering a linefeed character. This will appear in the "Find what" box as a tiny dot. Next, enter your preferred separator in the "Replace with" box. For example, if you have in a cell the following list
apples
bananas
cherries
and if you want those listed items converted (in the same cell) to
apples, bananas, cherries
then enter a comma and a space in the "Replace with" box. Now click on the "Replace all" button and all linebreaks within cells in that worksheet will turn into comma+space separated strings of items.
(3) The "house encoding" in the Microsoft world is Windows-1252. Unfortunately, Microsoft doesn't make conversion from Windows-1252 to UTF-8 easy. In Excel 2019 one of the "Save as" choices is "CSV UTF-8 (Comma delimited)", which is said to convert the default Windows-1252 encoding to UTF-8. Neither in Excel 2019 or in any earlier Excel version is there a "TSV UTF-8 (Tab delimited)" choice.
Here's a recommended procedure: choose "Text (Tab delimited) (*.txt)" from the "Save as" options list, then go to the "Tools" menu in the same dialog window, go to "Web options" under "Tools", go to the "Encoding" tab, and from the "Save this document as:" dropdown menu, select "Unicode (UTF-8)". Name your text file and export.
I tried this recently with Excel 2016 (build 16.0.4738.1000) and it didn't work. The exported text file was tab-separated, but the encoding was still Windows-1252. You may have better luck.
Plan B begins by selecting the full range of used rows and columns in your Excel worksheet (from which you have already removed any embedded linebreaks), and copying them and pasting them into a blank text file. Like other spreadsheet applications, Excel uses the tab character as default field separator when copying and pasting, and the result is a tab-separated text version of your worksheet.
The next step in Plan B is to convert the tab-separated text file from Windows-1252 to UTF-8. There are online encoding converters, but for larger files a better bet is to open the file in a good text editor and use the editor's encoding conversion tool. A popular choice with Windows users is Notepad++, but be aware that there may be some problems with Notepad++, like this one and this one.
If all else fails, ask your friendly neighbourhood Linux user to convert your tab-separated, no-embedded-linebreaks, Windows-1252-encoded text file to UTF-8 for you. She'll just open a terminal and do
iconv -f windows-1252 -t utf-8 < file.txt > utf8file.txt
See also the character encoding pages in A Data Cleaner's Cookbook and this BASHing data post about illegal byte sequences.
Last update: 2020-01-13
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License