(attention: the question-specific example is not an encoding problem - this answer is for this example. For encoding problems see the @Leo response)
In this case, it is not a problem of encoding:
The characters you are viewing as squares are the "ESC" character (ASCII code 27/0x1b/033 - exactly what is generated by the button with the same name)
What happens is that in the terminal, this character is used as a prefix for sequences ANSI, which allow the text in the terminal to appear colored, bold, or even the cursor to be repositioned, etc...
When these sequences are sent to the terminal, the terminal program "consumes" the ESC and the other characters in the sequence, and changes the color, etc... However, when they are sent to a file, the characters are saved as they arrive, in full.
If you have this generated file printed in the terminal (cat nomearquivo.txt
), you will see that it is not only readable, but all the text that was generated colored by the original commands will appear colored in the terminal. The text editor you are using does not recognize the sequences (in fact, I don’t know any that use ANSI sequences to allow color .txt file editing) - and they are displayed in full.
All the most common ones have the format of "ESC[<;numero>..." (In this case the "opens bracket" [
is literal, that character even, not a representation of an optional snippet).
This allows you to create a regular expression that can filter these strings and leave only the text. Note that other control characters such as " x08" (Backspace) or " x0d" (carreiage Return) can still stay - these sequences are similar to ANSI, and are used to return the cursor on the same line, to, for example, update bars or progress percentages without changing line.
That being said, the command below can be used directly on the theme to filter the ANSI sequences. (I used Python instead of sed
shorter in these cases, because I don’t know how to represent the "ESC" character on the command line for the sed
- in Python the regular expression is inside a string, and any character can be represented using \x
or \u
escape):
python3 -c "import re;print(re.sub(r'\x1b+\[[\d;]*?[a-zA-Z]', '', open('nomearquivo.txt').read()))" > arquivo_corrigido.txt
The concepts and commands you put in this answer are pretty cool - but the problem with this particular issue isn’t about encoding - it’s about adding ANSI sequences to the file. This response however can be pretty cool for more people coming here - if you add a note that in this particular case the problem was not encoding, this answer could help even more people!
– jsbueno