Is there an extra parameter for the linux script command for characters to be saved with the correct encoding?

Question

Is there an extra parameter for the linux script command for characters to be saved with the correct encoding?

Asked 5 years, 7 months ago

Viewed 81 times

1

When using the command script filename.txt is saved in the created file everything that is generated in the terminal, but when saving, the text is encoded in several points, points that I needed to read by another project that I am working on.

2 answers

1

(attention: the question-specific example is not an encoding problem - this answer is for this example. For encoding problems see the @Leo response)

In this case, it is not a problem of encoding: The characters you are viewing as squares are the "ESC" character (ASCII code 27/0x1b/033 - exactly what is generated by the button with the same name)

What happens is that in the terminal, this character is used as a prefix for sequences ANSI, which allow the text in the terminal to appear colored, bold, or even the cursor to be repositioned, etc...

When these sequences are sent to the terminal, the terminal program "consumes" the ESC and the other characters in the sequence, and changes the color, etc... However, when they are sent to a file, the characters are saved as they arrive, in full.

If you have this generated file printed in the terminal (cat nomearquivo.txt), you will see that it is not only readable, but all the text that was generated colored by the original commands will appear colored in the terminal. The text editor you are using does not recognize the sequences (in fact, I don’t know any that use ANSI sequences to allow color .txt file editing) - and they are displayed in full.

All the most common ones have the format of "ESC[<;numero>..." (In this case the "opens bracket" [ is literal, that character even, not a representation of an optional snippet).

This allows you to create a regular expression that can filter these strings and leave only the text. Note that other control characters such as " x08" (Backspace) or " x0d" (carreiage Return) can still stay - these sequences are similar to ANSI, and are used to return the cursor on the same line, to, for example, update bars or progress percentages without changing line.

That being said, the command below can be used directly on the theme to filter the ANSI sequences. (I used Python instead of sed shorter in these cases, because I don’t know how to represent the "ESC" character on the command line for the sed - in Python the regular expression is inside a string, and any character can be represented using \x or \u escape):

python3 -c "import re;print(re.sub(r'\x1b+\[[\d;]*?[a-zA-Z]', '', open('nomearquivo.txt').read()))" > arquivo_corrigido.txt

Browser other questions tagged linux script character-encoding terminal

You are not signed in. Login or sign up in order to post.

by LEo • **136** points · Answer 1 · 2019-12-12T12:28:15+00:00

A program is writing a file using a certain encoding and another program is reading using another encoding. There is no markup in the file to indicate which type of encoding was used. What you can do is a polite guess. Use the file to verify which is the encoding of the generated file. Example:

$ file 001.txt 
001.txt: ASCII text
$ file 002.txt 
002.txt: UTF-8 Unicode text

The program that will read the file should read using the encoding in which the file was written. Suppose your file has been written using UTF-8 and your program is reading using ISO 8859-1 encoding. You can convert from one encoding to the other using the iconv.

$ iconv -f UTF-8 -t ISO8859-1 002.txt > 002_iso.txt
$ file 002_iso.txt 
002_iso.txt: ISO-8859 text

Depending on the case it may be necessary to use the option TRANSLIT, when a character cannot be represented in the target encoding, using this option will cause characters to be mapped to the characters visually nearest.

$ iconv -f UTF-8 -t ISO8859-1//TRANSLIT 002.txt > 002_iso.txt