What is the difference between files encoded with and without a BOM?

Question

Asked 11 years, 2 months ago

Viewed 15,042 times

7

A long time ago I ran into problems converting files between ISO-*, UTF-8, ANSII and other encodings. I started researching ways to solve them and found several, using both tools and programming languages, but one thing I never really understood is:
What is the structural difference between a file without a BOM and one with a BOM?

2

@DBX8 but I believe the answer to this question may also answer that one.

– Silvio Andorinha

2014/05/21 at 14:53
ANSI is spelled with one I; ASCII is spelled with two. ;-)

– Sony Santos

2020/07/23 at 09:53

3 answers

10

UTF-8 with a BOM (byte order mark) has the bytes EF BB BF at the beginning of the file. Apart from that there is no difference, at least officially, between UTF-8 and UTF-8 with a BOM. Although it is used in practice, according to the Unicode Standard the byte order mark is not recommended for UTF-8 files.

Section 3.10, Unicode Encoding Schemes, item D95, says (in free translation):

Its use at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme.
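
As a minimal sketch of that structural difference, assuming Python 3 and its standard `utf-8-sig` codec (the sample string and variable names are only illustrative), the two encodings of the same text differ in nothing but the three leading bytes:

```python
import codecs

text = "olá"  # illustrative sample text

plain = text.encode("utf-8")         # UTF-8 without a BOM
with_bom = text.encode("utf-8-sig")  # UTF-8 with a BOM prepended

# The only difference is the 3-byte prefix EF BB BF.
assert with_bom == codecs.BOM_UTF8 + plain
prefix_hex = with_bom[:3].hex()

# "utf-8-sig" drops a leading BOM when decoding;
# plain "utf-8" keeps it as the character U+FEFF.
decoded_sig = with_bom.decode("utf-8-sig")
decoded_plain = with_bom.decode("utf-8")
```

Decoding with `utf-8-sig` also accepts input that has no BOM, which is why it is a convenient choice when the origin of a file is unknown.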

1

Byte order does not vary in UTF-8. But of course, there may be the problem of finding out that the text is in UTF-8 in the first place...

– marcus

2014/05/21 at 14:50
True. I will correct my answer.

– Oralista de Sistemas

2014/05/21 at 16:11

by marcus • **2,131** points · Answer 1 · 2014-05-21T15:03:51+00:00

The BOM (byte order mark) was created to solve a UTF-16 problem (and also a UTF-32 one, although that format is rarely used to save files).

Since each character in UTF-16 consists of 2 bytes (or, in rarer cases, a pair of 2-byte units), the bytes can be ordered in two ways: byte 1, byte 2; or byte 2, byte 1 (at least nobody argues about the order of the bits...). Little-endian architectures will prefer UTF-16LE (LE = little endian), which uses the order "byte 2, byte 1", the most natural one for the processor, and big-endian architectures will prefer UTF-16BE.

To tell the two kinds of UTF-16 apart, a BOM is placed at the beginning of the file. It is a character (U+FEFF) that cannot be confused with its byte-swapped "inverse", so reading it reveals the byte order of the rest of the file.
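
A small sketch of the two byte orders and of how a reader could use the BOM to choose between them, assuming Python 3; `decode_utf16` is a hypothetical helper written for this illustration, not a standard function:

```python
import codecs

ch = "A"  # U+0041

be = ch.encode("utf-16-be")  # big endian: most significant byte first
le = ch.encode("utf-16-le")  # little endian: least significant byte first

# The BOM is the character U+FEFF written in the file's own byte order.
bom_be = "\ufeff".encode("utf-16-be")  # bytes FE FF
bom_le = "\ufeff".encode("utf-16-le")  # bytes FF FE

# Reading big-endian bytes as little-endian yields a different character.
misread = be.decode("utf-16-le")


def decode_utf16(data: bytes) -> str:
    """Hypothetical helper: choose the byte order from the leading BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    raise ValueError("no UTF-16 BOM found")
```

In practice Python's built-in `utf-16` codec already does this when decoding; the helper only spells out the decision.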

UTF-8 was designed differently: the order of its bytes does not depend on the computer's architecture. Hence, many consider a BOM unnecessary in UTF-8 files.

The BOM, which takes 2 bytes in UTF-16, takes 3 bytes when encoded in UTF-8. So some programs, even though a BOM is not recommended in UTF-8, ended up adopting it anyway: when they open a file and find those 3 special bytes, they know it is probably a UTF-8 file (it is very rare for a text to start with ï»¿, which is how the BOM appears when it is read as cp1252).
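
The byte counts and the cp1252 appearance can be checked with a short sketch, assuming Python 3; `looks_like_utf8_with_bom` is a hypothetical name for the kind of check such programs might perform:

```python
bom_utf16 = "\ufeff".encode("utf-16-be")  # 2 bytes: FE FF
bom_utf8 = "\ufeff".encode("utf-8")       # 3 bytes: EF BB BF

# Read as cp1252, the UTF-8 BOM shows up as three visible characters.
as_cp1252 = bom_utf8.decode("cp1252")


def looks_like_utf8_with_bom(data: bytes) -> bool:
    """Hypothetical check: does the data start with the UTF-8 BOM?"""
    return data.startswith(b"\xef\xbb\xbf")
```

A positive result is only a strong hint: a cp1252 text could in principle begin with those three characters, it is just very unlikely.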

Now, as to whether or not you should use a BOM in your files, the debate gets a little philosophical, because there are pros and cons...

by Oralista de Sistemas • **23,115** points · Answer 2 · 2014-05-21T14:20:11+00:00

BOM stands for byte order mark.

In our world people cannot agree on many things, not even on whether the least significant bits of a byte should go on the left or on the right. Believe me, there are heated discussions, full of personal attacks, about which form is best.

Something similar happens with certain encodings. Some characters are represented by more than one byte; in UTF-32, for example, every character takes four bytes. Some people prefer the least significant bytes of each character on the left, others on the right.

Since neither way can be adopted as the universal one, we sometimes need to tell a parser the order in which the bytes should be read. We do this with the BOM. Without a BOM, the parser literally has to guess how to read the text. That’s why texts sometimes come out "broken" without it.

The BOM is commonly given in the preamble, that is, the first few bytes of a text (two in UTF-16, three in UTF-8, four in UTF-32). The parser uses them to determine which encoding was used.
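
A rough sketch of how a parser might use those leading bytes, assuming Python 3 and the BOM constants from its standard `codecs` module; `sniff_bom` and `decode_with_bom` are hypothetical names, and falling back to UTF-8 when no BOM is found is an assumption of this example:

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding, bom_length); use `default` when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Skip the BOM, if any, and decode the rest with the detected encoding."""
    encoding, skip = sniff_bom(data, default)
    return data[skip:].decode(encoding)
```

For example, `decode_with_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le"))` gives back `"hi"`, while data without any BOM is simply decoded with the default.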

As DBX8 noted in his reply, this should be irrelevant to UTF-8, whose code units are single bytes (a character may take from one to four of them, but always in the same order). The only advantage of a BOM in UTF-8 is that it helps the parser recognise the encoding used.