Script to remove the BOM signature from UTF-8 files


Viewed 573 times

3

I have several files saved as UTF-8 with BOM, and stray tokens are being generated at the beginning of the pages. This breaks the reading of JSON files and throws off the layout of HTML components. It is almost impossible to track down because the tokens are invisible. I googled a way to convert all the files to UTF-8 without BOM and found a Perl script that removes the BOM signature, but it didn’t work. Could someone help? I need a script that converts all the project files.

More information about the problem and the script can be found here

My solution for now is to dig through the files and save each one as UTF-8 without BOM, but there are a lot of files, so I thought of using a script. I just have no idea how to write it.

My temporary workaround for the JSON token problem (POG, i.e. a quick hack):

1. I take the substring starting at the first { found, because the tokens are generated before that brace. This solves it for the moment, but it is a kludge.

json = json.substring(json.indexOf("{"),json.length);
objeto = $.parseJSON(json);
  • 1

    Does it have to be in Perl? And do you really need a full script, or are you already working in some language where you could just embed a small chunk of code? Removing the BOM is trivial: read the first three bytes of the file and, if they match a certain pattern, create a new file from the 4th byte onward. Assuming UTF-8, of course.

  • It doesn’t have to be in Perl; I’ve never actually programmed in Perl, so any language will do. I use Linux, and there may even be a program that does this, but I don’t know of one. Thanks @mgibsonbr, it seems simple to do for a single file now that you have explained it. I think the hard part will be identifying the file format and searching through all the project directories.

  • I can try to write one if there is no ready-made script.

  • Are you using JavaScript? Would code that strips the BOM in JavaScript, right before you use the data, solve it? (Or do you really need to convert the files locally?)

  • 1

    I want to remove the BOM from all the project files. The JavaScript was just an example of how I managed to get around the problem in one particular case. What I wanted was code similar to the Perl script on the site I mentioned, but I couldn’t make that one work here.

  • I will try to follow your hint soon and write the script in Python or C.

  • I’m writing an answer, I thought it might be something simpler.

  • Go ahead and answer. Your answer may be useful.


2 answers

3


A UTF-8 file with BOM is simply a UTF-8 encoded file whose first 3 bytes are EF BB BF. Identifying the BOM is therefore a matter of reading the first 3 bytes and checking whether they match this pattern. To remove the BOM, just copy the rest of the file to the output, leaving out those 3 bytes.

A heavily simplified example in Python 3 (disclaimer: I haven’t tested it!) would be:

import os, sys

def tem_bom(arq):
    with open(arq, mode="rb") as f:
        bom = f.read(3)
        resto = f.read()
        if bom == b"\xef\xbb\xbf":
            return True, resto
        else:
            return False, bom + resto

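def copiar_pasta(origem, destino, copiar_sempre=True):
    # Create the destination folder (and any missing parents) before writing into it.
    os.makedirs(destino, exist_ok=True)

    for nome in os.listdir(origem):
        path1 = os.path.join(origem, nome)
        path2 = os.path.join(destino, nome)

        if os.path.isfile(path1):
            bom, resto = tem_bom(path1)
            # Files without a BOM are copied only when copiar_sempre is True.
            if bom or copiar_sempre:
                with open(path2, "wb") as f:
                    f.write(resto)
                if bom:
                    print("Corrigido arquivo {}".format(path1))

        elif os.path.isdir(path1):
            # Recurse; the call creates path2 itself.
            copiar_pasta(path1, path2, copiar_sempre)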

if __name__ == "__main__":
    copiar_pasta(sys.argv[1], sys.argv[2])

This example takes a source folder and recursively copies all of its files to a destination folder. Every file that has a BOM is copied without it. I did it this way (without changing anything in the original folder) so as not to risk overwriting anything important; just make sure the destination is a new, empty folder. Adapt as necessary.
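One possible adaptation, for anyone who would rather fix the files where they are instead of copying them: the sketch below walks a folder with os.walk and rewrites only the files that start with EF BB BF. The function name strip_bom_in_place and the default list of extensions are assumptions made for illustration, not part of the script above. Because it overwrites files, it is safer to run it on a backup or on a project under version control.

```python
import os

BOM = b"\xef\xbb\xbf"  # the three bytes of the UTF-8 BOM


def strip_bom_in_place(root, extensions=(".js", ".json", ".html")):
    """Rewrite files under root without their leading BOM; return the paths changed.

    The name and the default extensions are hypothetical; adjust them to the project.
    """
    extensions = tuple(extensions)  # str.endswith needs a tuple, not a list
    changed = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            if not name.lower().endswith(extensions):
                continue  # leave other file types (images, binaries) untouched
            path = os.path.join(folder, name)
            with open(path, "rb") as f:
                data = f.read()
            if data.startswith(BOM):
                # Overwrite the file with everything after the first 3 bytes.
                with open(path, "wb") as f:
                    f.write(data[len(BOM):])
                changed.append(path)
    return changed
```

Calling strip_bom_in_place with the project folder returns the list of files that were rewritten, which can be printed to check what was touched.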

  • 1

    I tested it here and it looks like it did exactly what I needed, @mgibsonbr, thank you so much. It is almost 100%, though: the BOM was apparently removed, but the functions is_file and is_dir need to be corrected to isfile and isdir, respectively. There is also another problem: when copying a file that sits inside a subfolder, the program doesn’t create that folder in the destination directory first, so it reports that the copy is not possible because the destination folder does not exist. Nothing that a small fix can’t make perfect.

  • 1

    Thanks for the corrections! I fixed them here, and also added an option to the function so that files without a BOM are not copied (in case the folder is large, has many files, etc.). You only have to call the function passing this parameter as False.

  • Now it’s right, @mgibsonbr. I just ran the script on the project and it worked: all the damn tokens are gone. Thanks again. The script gets a 10.

0

#!/usr/bin/perl -pi
s/^(\xEF\xBB\xBF)//;  ## remove BOM !

This version changes the files themselves. Example of use:

rmbom *.js

or

perl rmbom file1 file2 *.js dir/*
