How to get the encoding type of a file?

Here is my code:

string text = File.ReadAllText($@"{pathname}", Encoding.UTF8);

I have several .txt files in different encodings. With the line above, special characters are not displayed correctly when a file's encoding is not UTF-8.

Before running File.ReadAllText, how can I find out the file's encoding?

For example: ANSI, Unicode, UTF-8, etc.

Something like this:

if (pathname == Encoding.ASCII)
{
    string text = File.ReadAllText($@"{pathname}", Encoding.ASCII);
}
else if (pathname == Encoding.UTF8)
{
    string text = File.ReadAllText($@"{pathname}", Encoding.UTF8);
}

1 answer


You have to read the file to find out its encoding, so it probably pays to read it in a different way. If you open it with a StreamReader and read at least part of it, you can get the detected encoding from the CurrentEncoding property. It is said not to be reliable, though.
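A minimal sketch of the StreamReader approach described above. The class and method names (`EncodingProbe`, `DetectWithStreamReader`) and the `fallback` parameter are hypothetical, introduced only for illustration. Note that the reader only recognizes a byte order mark; for a file without one, `CurrentEncoding` simply stays at the fallback you passed in, which is one reason the result cannot be fully trusted.

```csharp
using System.IO;
using System.Text;

static class EncodingProbe
{
    // Hypothetical helper: returns the encoding StreamReader settles on for the file.
    public static Encoding DetectWithStreamReader(string path, Encoding fallback)
    {
        // detectEncodingFromByteOrderMarks: true makes the reader look for a BOM
        // at the start of the file; without a BOM it keeps using 'fallback'.
        using (var reader = new StreamReader(path, fallback, detectEncodingFromByteOrderMarks: true))
        {
            // CurrentEncoding is only updated once something has been read,
            // so peek at the first character before asking for it.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

It could then be used as `File.ReadAllText(pathname, EncodingProbe.DetectWithStreamReader(pathname, Encoding.UTF8))`, at the cost of opening the file twice.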

If that does not work well for you, you can try a library such as chardetsharp, UDE, Nchardet, or Architect Shack. I don’t know them and I don’t know how trustworthy they are.

There are answers on SO with code that tries to do the job: here, here and here.

It also helps to understand more about the BOM (byte order mark).
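To make the BOM concrete, here is a hedged sketch that inspects the first bytes of a file and maps the standard Unicode byte order marks to .NET encodings. `BomSniffer` and `DetectBom` are hypothetical names, not part of any library mentioned here. A `null` result only means that no BOM was found; it says nothing about which encoding the file actually uses.

```csharp
using System.IO;
using System.Text;

static class BomSniffer
{
    // Hypothetical helper: returns the encoding implied by a BOM, or null if there is none.
    public static Encoding DetectBom(string path)
    {
        var bom = new byte[4];
        int read;
        using (var stream = File.OpenRead(path))
        {
            read = stream.Read(bom, 0, bom.Length);
        }

        // UTF-32 LE must be tested before UTF-16 LE because both start with FF FE.
        // (A UTF-16 LE file whose first character is U+0000 would be misread here.)
        if (read >= 4 && bom[0] == 0xFF && bom[1] == 0xFE && bom[2] == 0x00 && bom[3] == 0x00)
            return Encoding.UTF32;
        if (read >= 4 && bom[0] == 0x00 && bom[1] == 0x00 && bom[2] == 0xFE && bom[3] == 0xFF)
            return new UTF32Encoding(bigEndian: true, byteOrderMark: true);
        if (read >= 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
            return Encoding.UTF8;
        if (read >= 2 && bom[0] == 0xFF && bom[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little endian
        if (read >= 2 && bom[0] == 0xFE && bom[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big endian
        return null;                          // no BOM: the encoding is still unknown
    }
}
```

Files saved as ANSI, and many UTF-8 files, carry no BOM at all, so this check alone cannot settle the question.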

There will always be cases where the detection gets it wrong.
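One way to reduce such mistakes, when the set of possible encodings is small, is to decode strictly as UTF-8 and fall back only if the bytes are not valid UTF-8. The sketch below assumes the files are either UTF-8 or a single known single-byte encoding; that assumption, the `fallback` parameter and the names `NarrowedReader` and `ReadUtf8OrFallback` are illustrative and not from the original answer. It cannot tell two single-byte encodings apart, and pure ASCII text decodes the same either way.

```csharp
using System.IO;
using System.Text;

static class NarrowedReader
{
    // Hypothetical helper: try strict UTF-8 first, otherwise decode with 'fallback'.
    public static string ReadUtf8OrFallback(string path, Encoding fallback)
    {
        byte[] bytes = File.ReadAllBytes(path);

        // throwOnInvalidBytes: true raises an exception on malformed UTF-8
        // instead of silently substituting U+FFFD replacement characters.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            // GetString keeps a leading BOM as U+FEFF, so trim it if present.
            return strictUtf8.GetString(bytes).TrimStart('\uFEFF');
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: assume the caller-supplied legacy encoding.
            return fallback.GetString(bytes);
        }
    }
}
```

For example, `NarrowedReader.ReadUtf8OrFallback(pathname, Encoding.GetEncoding("iso-8859-1"))` treats anything that is not valid UTF-8 as Latin-1. On .NET Core and later, Windows code pages such as 1252 are only available after registering `CodePagesEncodingProvider`.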

  • Is there any way to know whether the detection got it wrong? Oh no, that puts me in a tough spot. I will try one of the libraries you pointed out.

  • Not reliably, partly because the text may be in a malformed encoding. It is only reliable if you can guarantee that the file is never corrupted and that it has a header indicating the encoding, which is what you may get with some standard encodings. If you know that you will only have two or three different encodings, and which ones, that helps a little because you can narrow the choices. Some encodings can be ambiguous.

  • I see. Can you tell me which is the safest way to find out the encoding?

  • One of those listed above, but I don’t know which one. You can then refine it by hand to reduce errors. Test it, though, because it may work well with the kind of files you have.

  • I see. The best library (in my opinion) is UDE, because it reports the encoding and a confidence level for the file, returned as a float. That matches what you wrote in your comment: the text may be in a malformed encoding.

  • Shall we delete the comments?
