Encoding detection smackdown
In A Data Cleaner's Cookbook I've been recommending the file command for detecting whether or not a file is in UTF-8 encoding, mainly because file also reports line endings. I had some doubts, though, so I decided to run a couple of simple tests on file and some other command-line encoding detectors.
Preparation. The starting file for both tests is a modified, plain-text version of Tolstoy's War and Peace, downloaded from the Project Gutenberg website. The file is called "wap" and is a single line in UTF-8 encoding with about 3,200,000 characters.
For the first test I "salted" my UTF-8 file "wap" with a byte that is not valid UTF-8 on its own: hex 80 (the euro sign in Windows-1252 encoding). I used this "salt" to replace the single character at position 1, or position 10, or position 100, and so on up to position 1,000,000, and named each file after the nominal position of the "salt": "wap1", "wap10", etc. For the second test I put the UTF-8 byte order mark (hex ef bb bf) at the start of "wap" and of each "salted" file, renaming them "wapBOM", "wap1BOM", "wap10BOM", etc.
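The preparation could be scripted roughly as follows. This is only a sketch, not the commands actually used for the tests: "salt" and "add_bom" are hypothetical helper names, positions are treated as 1-based byte offsets (an assumption, since a character can occupy more than one byte in UTF-8), and a short stand-in string takes the place of the real "wap" file.

```shell
#!/bin/sh
# Sketch only: "salt" and "add_bom" are hypothetical helpers.

# Copy file $1 to $3 and overwrite the byte at 1-based offset $2 with hex 80
salt() {
    src=$1 pos=$2 out=$3
    cp "$src" "$out"
    # \200 is octal for hex 80; conv=notrunc keeps the rest of the file intact
    printf '\200' | dd of="$out" bs=1 seek=$((pos - 1)) conv=notrunc 2>/dev/null
}

# Write a copy of file $1 to $2 with the UTF-8 BOM (hex ef bb bf) prepended
add_bom() {
    { printf '\357\273\277'; cat "$1"; } > "$2"
}

# Short stand-in for the real "wap" file
tmp=$(mktemp -d)
printf 'Well, Prince, so Genoa and Lucca are now just family estates' > "$tmp/wap"

for pos in 1 10; do
    salt "$tmp/wap" "$pos" "$tmp/wap$pos"
done

add_bom "$tmp/wap" "$tmp/wapBOM"
add_bom "$tmp/wap10" "$tmp/wap10BOM"
```

With the full-size file the loop would simply run over 1, 10, 100 and so on up to 1000000.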
The five programs I tested (in alphabetical order) were:
- enca with language set to "none"
- file with the --mime-encoding option
- iconv converting from and to UTF-8
- isutf8 from the "moreutils" package
- uchardet
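For reference, the five detectors can be invoked along these lines. This is a sketch under assumptions: "detect_all" is a hypothetical wrapper, any tool that is not installed is skipped, and the exact command lines used for the tests may have differed.

```shell
#!/bin/sh
# Sketch only: "detect_all" is a hypothetical helper that runs
# whichever of the five detectors are installed on a single file.
detect_all() {
    f=$1
    if command -v enca >/dev/null 2>&1; then
        printf 'enca:     %s\n' "$(enca -L none "$f" 2>&1 | head -n 1)"
    fi
    if command -v file >/dev/null 2>&1; then
        # -b drops the filename from the report
        printf 'file:     %s\n' "$(file -b --mime-encoding "$f")"
    fi
    if command -v iconv >/dev/null 2>&1; then
        # Only the exit status matters here, so the converted text is discarded
        if iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
            echo 'iconv:    valid UTF-8'
        else
            echo 'iconv:    conversion failed'
        fi
    fi
    if command -v isutf8 >/dev/null 2>&1; then
        if isutf8 "$f" >/dev/null 2>&1; then
            echo 'isutf8:   valid UTF-8'
        else
            echo 'isutf8:   invalid UTF-8'
        fi
    fi
    if command -v uchardet >/dev/null 2>&1; then
        printf 'uchardet: %s\n' "$(uchardet "$f")"
    fi
    return 0
}

# Tiny valid UTF-8 sample (c3 a9 is "e" with an acute accent)
sample=$(mktemp)
printf 'caf\303\251\n' > "$sample"
detect_all "$sample"
```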
Round 1. file, iconv and isutf8 all found something wrong with the salted files. Note that iconv and isutf8 report nothing and exit with status 0 if the file is valid UTF-8.
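Because a clean file gives exit status 0, the check is easy to use in a script. A minimal sketch, assuming iconv is installed: "is_utf8" is a hypothetical wrapper name and the two sample files are small stand-ins for "wap" and a salted copy.

```shell
#!/bin/sh
# Sketch only: "is_utf8" is a hypothetical wrapper that relies on
# iconv's exit status and discards the converted text.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

tmp=$(mktemp -d)
printf 'na\303\257ve\n' > "$tmp/good"     # valid UTF-8 (c3 af is "i" with diaeresis)
printf 'na\200ve\n' > "$tmp/salted"       # a lone hex 80 byte is not valid UTF-8

for f in "$tmp/good" "$tmp/salted"; do
    if is_utf8 "$f"; then
        echo "$(basename "$f"): valid UTF-8"
    else
        echo "$(basename "$f"): invalid UTF-8"
    fi
done
```

Where isutf8 is installed, it could be swapped in for the iconv call in the same way.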
enca and uchardet both failed. enca thought the salted files "wap100K" and "wap1M" were OK UTF-8, while uchardet got all the salted files wrong.
Round 2. Neither file, iconv nor isutf8 was fooled by the initial byte order mark.
No change with enca, and uchardet took my word for it (my BOM) and said "wap1BOM" was also UTF-8.
Conclusion. file and isutf8 are both good choices for detecting whether or not a file contains only UTF-8 characters. I'll continue using file because I like the output messages, as in this screenshot from the Cookbook:
Last update: 2020-09-23
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.