How to discover the original encoding of a filename (or any string)?

I have a number of files that seem to have been generated on different operating systems, because the character encoding of their names seems to vary between them.

Some names display their accents correctly for me, both on OS X and on Linux (with the terminal configured for UTF-8 in both cases), while others come out mangled. For example, there is a name I see as APRESENTAÇO_MAR_2015 where the word clearly should be APRESENTAÇÃO.

Looking more closely at the problematic ÇO section, I found the following 5 values (in hexadecimal):

0xC2
0x80
0x43
0x327
0x4F

I tried converting the string with iconv, varying the input and output encodings, but I could not get the desired result (ÇÃO). How can I discover the original encoding of these names and fix them? I have many files with this problem and would like to fix them programmatically.
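
For context, the values above can be dumped with something like the rough sketch below (PHP purely for illustration, since any language with byte-level string access would do; the directory path is a placeholder):

<?php
// Rough sketch: print every file name in a directory followed by its raw
// bytes in hexadecimal, so the stored encoding can be inspected.
$dir = '/path/to/files'; // placeholder

foreach (scandir($dir) as $name) {
    if ($name === '.' || $name === '..') {
        continue;
    }
    // bin2hex() works on the raw bytes exactly as the filesystem returns them.
    echo $name . "\t" . strtoupper(implode(' ', str_split(bin2hex($name), 2))) . PHP_EOL;
}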

  • Well, I think your question is related to PHP; to find out the encoding you can use the mb_detect_encoding() function.

  • I can actually solve this in any programming language, even in the shell, I just don’t know how. I’ll test this PHP function and report the result.

  • @Guilhermelopes mb_detect_encoding returns UTF-8, which does match my current system. However, I believe that is not the original encoding of the file name.

  • Right, I really don’t know whether it returns the original encoding of the file name. I did some tests, take a look: https://ideone.com/VJrQks

  • It seems it is not possible; take a look here: http://programmers.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file

  • But from what I understand that question is about the contents of the files, and my problem is with their names. I understand that a name carries no encoding information at all (unlike what a file’s header may contain), but I am looking for a way to at least infer the encoding so I can fix the names of the problem files (a sketch combining the suggestions here appears right after these comments).

  • I solved a similar problem using the ucsdet_detect() function from the ICU library. With it you can detect the charset of any string. In my case it was for processing a file that contained lines in different charsets.

  • @bfavaretto, I understand you want to figure out the encoding of some string. Does this help: http://stackoverflow.com/questions/910793/detect-encoding-and-make-everything-utf-8

  • It’s complicated because it depends on the environment. For example, if you’re going to solve this in PHP there are specific techniques; with Visual Studio it’s another way; in Java or at the Windows prompt, yet another. It would be better to specify which environment and which tools you want to use.

  • I’m finding that it will be impossible, or very complicated, since there seems to be encoding applied on top of encoding in these names. I will close the question. @Danielomine

  • @Cantoni I tried what it says there, but got even more confused. Now I think it’s getting clearer, but the question may not be answerable. See my comment to Daniel Omine, right above. Thank you!

  • @bfavaretto, it really is an annoying problem to deal with. It’s the kind of problem where the time you would spend writing an automatic solution makes it not worth doing.

  • @Cantoni That’s exactly how I feel. Either I’ll fix it by hand, or I’ll ask the client (who sent me the files) to sort it out himself.

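A rough sketch of what the comments suggest, assuming the problem names are valid UTF-8 stored in decomposed (NFD) form, which the C (0x43) followed by a combining cedilla (U+0327) in the bytes above hints at. It uses mb_detect_encoding as suggested plus the intl extension’s Normalizer to recompose names to NFC; it cannot recover genuinely lost characters (the 0xC2 0x80 pair decodes to the control character U+0080, so whatever was there originally is probably gone), and the directory path is a placeholder:

<?php
// Rough sketch combining the suggestions above: show what mb_detect_encoding()
// reports for each name and, when a name is valid UTF-8 but in decomposed
// (NFD) form, recompose it to NFC. Requires the intl extension (Normalizer).
$dir = '/path/to/files'; // placeholder

foreach (scandir($dir) as $name) {
    if ($name === '.' || $name === '..') {
        continue;
    }

    $guess = mb_detect_encoding($name, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true);
    echo $name . ' => detected: ' . ($guess !== false ? $guess : 'unknown') . PHP_EOL;

    if ($guess === 'UTF-8' && !Normalizer::isNormalized($name, Normalizer::FORM_C)) {
        $fixed = Normalizer::normalize($name, Normalizer::FORM_C);
        if ($fixed !== false && $fixed !== $name) {
            echo '  would rename to: ' . $fixed . PHP_EOL;
            // rename($dir . '/' . $name, $dir . '/' . $fixed); // uncomment to apply
        }
    }
}

From the shell, the convmv tool should be able to do the same NFC normalization (something like convmv --nfc -f utf8 -t utf8 -r on the directory), but I have not tested it on these files.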

1 answer

-1

Navigate to the folder and type:

file -I meuarquivo.extensao

If the value comes from a database, you can use PHP to check whether it is valid in a given charset using the mb_check_encoding function.

if (mb_check_encoding($row['campo'], 'UTF-8')) {
    echo "verdadeiro";
} else {
    echo "falso";
}
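
The same check can be applied to the file names themselves rather than a database field; a rough sketch (the directory path is a placeholder):

<?php
// Rough sketch: list every file name in a directory that is not valid UTF-8.
$dir = '/path/to/files'; // placeholder

foreach (scandir($dir) as $name) {
    if ($name === '.' || $name === '..') {
        continue;
    }
    if (!mb_check_encoding($name, 'UTF-8')) {
        echo 'Not valid UTF-8: ' . bin2hex($name) . PHP_EOL;
    }
}
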
  • The file -I gave application/zip; charset=binary (the file is a pptx).
