Quality applications allow a user to specify how a text file is encoded when opening it, but often include an "auto" option. These characters, known as a byte-order mark, help some versions of Microsoft Excel understand that the CSV file is a UTF-8 encoded file.
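As a rough illustration of producing such a file, here is a minimal Python sketch that serializes rows to CSV and prepends the UTF-8 byte-order mark. The helper name `csv_bytes_with_bom` and the sample rows are made up for this example. It relies on Python's standard `utf-8-sig` codec, which writes the bytes EF BB BF before the text.

```python
import csv
import io


def csv_bytes_with_bom(rows):
    """Serialize rows to CSV text, then encode as UTF-8 with a leading BOM.

    Hypothetical helper for illustration only.
    """
    buffer = io.StringIO(newline="")  # keep the csv module's own \r\n line endings
    csv.writer(buffer).writerows(rows)
    # The "utf-8-sig" codec prepends the three BOM bytes EF BB BF when encoding.
    return buffer.getvalue().encode("utf-8-sig")


data = csv_bytes_with_bom([["name", "city"], ["Zoë", "Kraków"]])
```

Whether a given Excel version then picks the right encoding may still depend on how the file is opened, so treat this as a starting point rather than a guarantee.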
Also be aware that some versions of Excel behave differently depending on whether you double-click the file to open it or open it via the menu.

Why does it matter if it opens correctly in Excel?
If there is no BOM, it is possible to guess whether the text is UTF-16 and its byte order by searching for ASCII characters (i.e.
(But Excel also does

@Heinzi I learnt a long time ago that you cannot really win when working with CSV and Excel.

The BOM for little-endian UTF-32 (FF FE 00 00) is the same pattern as a little-endian UTF-16 BOM (FF FE) followed by a NUL character, an unusual example of the BOM being the same pattern in two different encodings.
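Because of that overlap, a BOM sniffer has to test the longer UTF-32 signatures before the UTF-16 ones. The following Python sketch shows one way to do it. The function name `sniff_bom` and the returned encoding labels are choices made for this example, while the `codecs.BOM_*` constants come from the standard library.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) starts with
# the UTF-16 LE BOM (FF FE), so checking UTF-16 first would shadow it.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found.

    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Note the residual ambiguity: UTF-16 LE text whose first character after the BOM is NUL would be reported as UTF-32 LE. That case is rare in practice, but a sniffer like this cannot rule it out.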
This is the CSV format Apple’s Numbers exports by default, UTF-8 sans BOM.
To communicate which byte order was in use, U+FEFF (the byte-order mark) was used at the start of the stream as a magic number that is not logically part of the text the stream represents.

If they need a BOM (mainly micro-sloth) then you need to add one, but UTF-8 + BOM ≠ UTF-8.

Even though CSV is apparently easier to generate, there are so many compatibility issues, especially if you stray out of pure 7-bit ASCII, that I would very, very strongly recommend you generate actual XLSX if the goal is for users to open it in Excel (rather than re-import it in some other software, in which case you will have to give options for separators, encoding, etc.).

Nothing in the question states Excel needs to be able to parse the generated file...

There's also widespread software requiring a BOM: Excel needs a BOM to correctly identify a CSV file as UTF-8 rather than "ANSI", i.e., the local compatibility locale. It's simply a lousy CSV-reader.
But what format do the programs want?
The MS developers are not that bad (POSIX compliance, now UTF-8 support).

Because of these considerations, heuristic analysis can detect with high confidence whether UTF-8 is in use, without requiring a BOM. If a user selects "auto", some UTF-8 files that don't have a BOM may be misidentified as using some other encoding.

I encountered a problem with a BOM (Byte Order Mark) at the front of UTF-8 string data.
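Both points can be sketched in a few lines of Python. This is only an approximation of what "heuristic analysis" might mean here: it treats data as UTF-8 if it decodes without error, which is a common simple check and not necessarily what any particular application does. The function names `looks_like_utf8` and `strip_utf8_bom` are invented for this example.

```python
import codecs


def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte sequence.

    Hypothetical helper: strict decoding is used as the heuristic.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Pure ASCII input also passes `looks_like_utf8`, since ASCII is valid UTF-8. The check only becomes informative once the data contains bytes above 0x7F. When decoding to text rather than bytes, `data.decode("utf-8-sig")` removes a leading BOM in one step.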
If we try it again with a …

It's unnecessary (UTF-8 has no byte order, unlike UTF-16/32) and not recommended in the Unicode standard. It's also quite rare to see UTF-8 with BOM "in the wild", so unless you have a valid reason (e.g.

A large number (i.e. far higher than random chance) in the same order is a very good indication of UTF-16, and whether the 0 is in the even or odd bytes indicates the byte order.

Nevertheless, the BOM can be used to indicate the encoding of the text that follows it.

In UTF-7, the fourth byte of the BOM, before encoding as … SCSU allows other encodings of U+FEFF; the shown form is the signature recommended in UTR #6.
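The zero-byte heuristic for BOM-less UTF-16 mentioned above can be sketched as follows. This is a simplified illustration: the function name `guess_utf16_byte_order` and the 0.4 threshold are arbitrary choices for this example, not values taken from any standard or library.

```python
def guess_utf16_byte_order(data: bytes, threshold: float = 0.4):
    """Guess UTF-16 byte order from where zero bytes cluster.

    Hypothetical helper; the threshold is an arbitrary illustrative value.
    Returns "utf-16-le", "utf-16-be", or None if there is no clear signal.
    """
    pairs = len(data) // 2
    if pairs == 0:
        return None
    # Count zero bytes at even and odd offsets over complete 2-byte units.
    even_zeros = data[0:2 * pairs:2].count(0)
    odd_zeros = data[1:2 * pairs:2].count(0)
    # ASCII characters in little-endian UTF-16 store the code in the first
    # (even-offset) byte and a zero in the second (odd-offset) byte.
    if odd_zeros / pairs >= threshold and even_zeros / pairs < threshold:
        return "utf-16-le"
    # Big-endian is the mirror image: zeros land on even offsets.
    if even_zeros / pairs >= threshold and odd_zeros / pairs < threshold:
        return "utf-16-be"
    return None
```

The guess only works for text with a fair share of ASCII-range characters. UTF-16 text written mostly in other scripts contains few zero bytes and would return `None` here.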
Programs that interpret UTF-16 as a byte-based encoding may display a garbled mess of characters, but ASCII characters would be recognizable because the low byte of the UTF-16 representation is the same as the ASCII code and therefore would be displayed the same. The picture below shows the bytes used in a sequence of two-byte characters.