Problem reading UTF-8 files in PHP and directly in the browser


Viewed 781 times

2

I need to save a .txt file in UTF-8 with PHP. Saving the file is simple; however, when the file is opened directly in the browser (for example www.site.com/file.txt), the browser does not recognize the UTF-8 encoding and the characters get garbled, ending up like this: [screenshot]

After searching the internet, I found some solutions where people recommend adding a BOM (byte order mark) to the file. Adding the BOM is simple; any one of these three lines does it.

fwrite($file, "\xEF\xBB\xBF"); 

fwrite($file, pack("CCC",0xef,0xbb,0xbf)); 

fwrite($file, chr(0xEF).chr(0xBB).chr(0xBF)); 

After I added BOM the browser correctly interpreted all the characters and this solved the problem. The end result was more or less like this:

[screenshot]

But over time users have reported a problem. The contents of this file can be displayed in two ways on my system. The first is direct viewing in the browser, as shown above, and the second is normal viewing inside an HTML page. That HTML page has a <textarea> with the file's text, and when someone uses CTRL + A to select everything and CTRL + C to copy the entire contents displayed in the <textarea>, the BOM control character is copied along with it. When the copied content is pasted, the result looks more or less like this:

[screenshot]

As you can see, the first character of the pasted text is a '?', which here stands for an invisible control character: the BOM.

My question is:

  • Is there any way to force the browser to read the file in UTF-8 without the need to add BOM to the file?

OR

  • Is there any way to read and display the contents of the .txt file in PHP without the BOM being shown along with it?
  • Doesn't fwrite($file, utf8_encode(...)); solve this?

  • @jcbrtl unfortunately not. That was the first method I tested, and it does not solve the problem.

  • Have you tried mb_convert_encoding? mb_convert_encoding($content, 'UTF-8')

  • I believe that when you copy, it picks up garbage allocated in memory. Are you copying some space before the text?

  • @Williamsilva It cannot be memory garbage, because it happens to ALL users, and when I remove the BOM the character is no longer copied along, as I explained above.

  • Eduardo, which character in the generated file is the ? being displayed in place of?

  • @Andréfilipe this character is not displayed in the file because it is a control character. If you open the file in VS Code the character does not appear; however, if you print the contents of the file in an HTML page, copy all the text and paste it into VS Code, the character appears. It's a bit complicated to explain, but some users are complaining because they like to copy everything straight from the site and paste it into their text editors, and this character comes along.

  • You did set <meta charset="utf-8"> in the <head> of the HTML page where the text of the UTF-8 encoded file is displayed, right?

  • @jcbrtl I believe that has absolutely nothing to do with the problem, but to answer your question: yes, I did.

  • @Adirkuhn I tried using mb_convert_encoding, but to no avail. In the main tab it looks good, but in the raw text it loses the UTF-8 encoding. Main tab: http://prntscr.com/o8575c. Raw text: http://prntscr.com/o857fc

  • What are the main tab and the raw text? lol

  • @Adirkuhn The main tab would be my website (I expressed myself badly). The raw text would be the file itself, that is, the contents of the file opened directly in the browser, for example www.site.com/.txt.

  • Unfortunately, at no point did you say you wanted to view a static .txt file directly in the browser. I don't think that, before your last edit (8 minutes ago), anyone could have deduced your problem and solved it. Just for the record, in your reply you wrote .htacess, but it has two "c"s: .htaccess. I hope you take this as constructive criticism. The part about displaying a static document directly was essential to understanding the problem, so always describe exactly what you did.

  • @Guilhermenascimento yes, I understand; I wrote the question the wrong way. I created it 9 days ago, and since then I've been studying various things and have learned a lot about the subject. The day I created it I thought my problem was in saving the file, and only later did I discover that the problem was actually in reading the file. I think I owe you an apology.

  • @Guilhermenascimento I thought it was impossible to change how the browser interprets a static file, so I didn't even mention that in the question. I thought the only option was to change how the file is saved, lol, but as I said, that was my mistake, from lack of knowledge.

  • PS: the downvotes are not mine (nor on any of the answers). I only commented to guide you on future questions: regardless of how much you understand about something, you always have to explain it step by step (however briefly) and say exactly where the failure showed up. Good luck, my friend.


3 answers

-1


Answering the two questions.


1) Is there any way to force the browser to read the file in UTF-8 without the need to add BOM to the file?

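A: Yes, it is possible to "force" the browser to interpret the contents of the text file as UTF-8 by adding the rule Header set Content-Type "text/html; charset=utf-8" to the .htaccess file (Apache). The charset=utf-8 parameter is what fixes the encoding; for a plain .txt file, text/plain; charset=utf-8 is the more accurate value.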


2) Is there any way to read and display the contents of the .txt file in PHP without the BOM being displayed along with it?

A: More or less. There is no native PHP function for reading a file's contents without the BOM. Whenever we use file_get_contents() or similar, PHP returns ALL the contents of the file, including the BOM, so to fix this we can use a workaround ("gambiarra") and remove the BOM from the start of the string. Note that substr() counts bytes and the UTF-8 BOM is three bytes (EF BB BF), so three bytes must be removed, not one. Example: $texto = substr(file_get_contents("arquivo.txt"), 3);
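
The substr() workaround above removes the first bytes unconditionally, so it would also eat real text from a file that has no BOM. A minimal sketch of a safer variant, assuming the file is UTF-8, is shown below. The function name stripUtf8Bom is hypothetical and not part of PHP or of the original answer.

```php
<?php
// Hypothetical helper (not a PHP built-in): removes a leading UTF-8 BOM, if present.
function stripUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF"; // the UTF-8 BOM is the three bytes EF BB BF

    // strncmp() compares raw bytes, so this only matches a BOM at the very start.
    if (strncmp($text, $bom, 3) === 0) {
        return substr($text, 3); // drop exactly the three BOM bytes
    }

    return $text; // no BOM: return the text unchanged
}

// Usage, with "arquivo.txt" as in the example above:
// $texto = stripUtf8Bom(file_get_contents("arquivo.txt"));
// echo htmlspecialchars($texto, ENT_QUOTES, 'UTF-8');
```

With this check in place, the text placed in the <textarea> would no longer start with the BOM, whether or not the file on disk has one.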

-2

The solution

Create the files without a BOM, and configure your web server to serve UTF-8 files by default. What is happening is that the server is configured for another encoding or, if it is not configured at all, browsers interpret the files as ISO-8859-1, which garbles the text.

That’s because you’re actually facing two different problems:

1. A UTF-8 file may or may not have a BOM

The BOM clearly indicates that a file is in UTF-8, but it is not required, and it is not even recommended, precisely because it causes the kind of problem you are having. See: https://en.wikipedia.org/wiki/Byte_order_mark

Then one possible solution is to ensure that you are writing a valid UTF-8 file without BOM.
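
As a rough sketch of that idea, the function below checks that the text is valid UTF-8 and drops a leading BOM before writing. The name writeUtf8NoBom is hypothetical, and the validity check via preg_match() with the /u modifier is one possible choice, not something prescribed in this thread.

```php
<?php
// Hypothetical helper: writes $text to $path as UTF-8 without a BOM.
function writeUtf8NoBom(string $path, string $text): bool
{
    // With the /u modifier, preg_match() returns false on invalid UTF-8 input.
    if (preg_match('//u', $text) !== 1) {
        return false; // refuse to write text that is not valid UTF-8
    }

    // Drop a leading BOM (bytes EF BB BF) if the caller's text carries one.
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        $text = substr($text, 3);
    }

    // file_put_contents() returns the byte count, or false on failure.
    return file_put_contents($path, $text) !== false;
}
```

A file written this way still needs the server to announce charset=utf-8, since nothing in the file itself signals the encoding anymore.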

2. Do you control the web page that displays the file?

If yes, just add code that strips the BOM. After all, a BOM is invalid in the middle of any text. A web page is text, and the character is probably being output after the headers, which is invalid (in a text page).

If not, well... report the error to whoever controls the page. Even if they do not control where the file comes from or how it is generated, they have to make sure the BOM character never appears in the middle of a text.

  • What you said makes perfect sense. I am working with 2 scenarios: the first is MY SITE, the second is the BROWSER. On my site, the text of the file is displayed normally, with or without the BOM. However, if I open the file directly in the browser, using for example www.site.com/.txt, the characters get garbled if I do not add the BOM.

  • If I do NOT add the BOM, this is what my website looks like: https://prnt.sc/o8575c and this is the raw text in the browser: https://prnt.sc/o857fc . If I add the BOM, the browser interprets everything normally; however, my site gets that CTRL + A -> CTRL + C problem.

  • A browser problem gets solved in the browser. Try these solutions: https://stackoverflow.com/questions/40587629/how-to-force-browser-to-use-utf-8 .

  • There’s no way I can touch the browser. When I mentioned "raw text" in the browser, I was referring to the browser opening and interpreting the original text file. Here is an example of a "raw text" file that the browser opens and interprets: https://tools.ietf.org/rfc/rfc793.txt

  • From what I understand, with the .htaccess file we can force the browser to interpret the file as UTF-8, correct? Would this work in all browsers? As far as I know, .htaccess files only exist on Apache servers. What if I were using Lighttpd or Nginx?

  • You cannot change the browser, but you can instruct it. So try the solutions proposed in that link, not only in the answers but in the question as well. Yes, it would work in all browsers, with .htaccess or with the other solutions. For other servers, use the options from the question that do not depend on the server, or look for the equivalent directive for those servers.

  • But what I would really do is serve the file with a combination of header('Content-type: text/plain; charset=utf-8'); and then fopen and fpassthru.

  • I tried to follow the instructions in the question you recommended and in other related ones, but to no avail. I added the .htaccess file with the recommended options, but even then the browser does not interpret the characters as UTF-8. Here is the result: http://prntscr.com/o966y2

  • It is impossible to add any line of code, because the file is being opened directly in the browser, you see? On my website everything works perfectly; the problem is when opening the original file directly in the browser. Returning to the previous discussion, as I was saying, if I add the BOM the browser interprets everything perfectly (as you can see: http://prntscr.com/o9685g). However, if I add the BOM, there is that "problem" with CTRL + A -> CTRL + C.

  • I believe that our problem is then how to make the browser "open" the files with the UTF-8 encoding without the BOM being written in the file.

  • I think I’ve solved the problem by adding Header set Content-Type "text/html; charset=utf-8" in .htaccess. I will run some more tests, but I think this will solve the problem. As soon as possible I will edit the topic and add an answer. It seems that the problem is solved. But I won’t jump to conclusions before I do all the tests.
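
The suggestion in the comments above to combine header('Content-type: text/plain; charset=utf-8') with fopen and fpassthru could look roughly like the sketch below. It assumes requests for the file are routed through a PHP script instead of being served statically, which the thread does not confirm. The function name serveUtf8Text is hypothetical.

```php
<?php
// Hypothetical helper: streams a text file with an explicit UTF-8 charset header.
function serveUtf8Text(string $path): bool
{
    // Open in binary mode so the bytes are passed through untouched.
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        http_response_code(404); // file missing or unreadable
        return false;
    }

    // The charset parameter tells the browser how to decode the body,
    // so the file itself does not need a BOM.
    header('Content-Type: text/plain; charset=utf-8');

    fpassthru($handle); // write the rest of the file to the output
    fclose($handle);
    return true;
}

// Usage (the path is an assumption for illustration):
// serveUtf8Text(__DIR__ . '/file.txt');
```

Since the header is sent by PHP itself, this approach does not depend on Apache, Lighttpd or Nginx configuration.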


-2

  • :( It didn’t work. On my site it looks good, but in the raw text of the browser it loses the UTF-8 encoding. My website: https://prnt.sc/o8575c. Raw text: https://prnt.sc/o857fc
