ReadAllText returns Chinese characters

3

I have a 55 GB SQL file that needs to be run line by line. Since I cannot load a file of that size, I split it into 543 files of 100 MB each, named Imagens<NumeroSequencial>.sql.

So I wrote code that reads the files one by one, extracts the lines into a list, and then runs the commands line by line.

The third-party program that splits the file writes the pieces in UTF-16 BE, which I understand to be Encoding.Unicode in C#. When I read the files whose names end in an even number and debug the following code, I get Chinese characters, but the files with an odd number read normally. The strange part is that the odd-numbered files have the same encoding as the even-numbered ones: they all came from the same software and the same split.

string s_unicode = File.ReadAllText(path,Encoding.Unicode);
// returns: 猀漀挀挀攀爀

What could possibly be going on? I've tried using UTF-8, but then it returns a lot of \0\0\0\0\0\0.

A sample file for testing is available at this link.

2 answers

2

If you already know the encoding used in the file, why not use it to read the file?

string s_unicode = File.ReadAllText(path, Encoding.GetEncoding("UTF-16BE"));

You can also read the lines into an array, which might come in handy:

string[] s_unicode = File.ReadAllLines(path, Encoding.GetEncoding("UTF-16BE"));

Code tested and working!
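As a side note, .NET also exposes UTF-16 BE directly as the static property Encoding.BigEndianUnicode, so the name lookup can be skipped. A minimal sketch, assuming the file really is UTF-16 BE; the class and method names are placeholders invented for illustration:

```csharp
using System.IO;
using System.Text;

public static class BigEndianReader
{
    // Hypothetical helper: reads a whole file as UTF-16 big-endian text.
    public static string ReadBigEndian(string path)
    {
        // Encoding.BigEndianUnicode is the built-in UTF-16 BE encoding,
        // whereas Encoding.Unicode is UTF-16 little-endian.
        return File.ReadAllText(path, Encoding.BigEndianUnicode);
    }
}
```

It would be called as BigEndianReader.ReadBigEndian(path), with path pointing at one of the split files.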

2

I don't know how you extracted the text or what format you used to parse the original file, but you can try to solve the problem by reading these "broken" files as binary, removing all 0x00 bytes, and parsing the result as a string.

The excerpt shown as Chinese ideograms corresponds to the following UTF-16 code units: 0x7300 0x6F00 0x6300 0x6300 0x6500 0x7200.
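To see where those values come from, here is a small sketch that encodes a string as UTF-16 BE and then decodes the very same bytes with Encoding.Unicode, which in .NET is UTF-16 little-endian. The class and method names are made up for illustration, and this only reproduces the symptom; it does not prove how the split files were actually written:

```csharp
using System.Text;

public static class MisalignedDemo
{
    // Encodes text as UTF-16 BE, then decodes the same bytes as UTF-16 LE.
    public static string DecodeWithWrongByteOrder(string text)
    {
        // For "soccer" the bytes are 00 73 00 6F 00 63 00 63 00 65 00 72.
        byte[] bigEndianBytes = Encoding.BigEndianUnicode.GetBytes(text);

        // Read as little-endian, each pair becomes 0x7300, 0x6F00, 0x6300, ...
        return Encoding.Unicode.GetString(bigEndianBytes);
    }
}
```

Calling MisalignedDemo.DecodeWithWrongByteOrder("soccer") returns 猀漀挀挀攀爀, the same text shown in the question. A little-endian file that starts with one stray byte produces the same pairing.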

If you remove the 0x00 bytes and decode what remains as single-byte text, the result is soccer.
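A rough sketch of that idea follows. It assumes the file contains only ASCII-range characters, because dropping every 0x00 byte would damage any character outside that range; the class and method names are hypothetical:

```csharp
using System.Linq;
using System.Text;

public static class ZeroStripper
{
    // Hypothetical helper: drops every 0x00 byte and decodes the rest as ASCII.
    // Assumption: the original text is ASCII-only; other characters would be corrupted.
    public static string StripZeroBytes(byte[] raw)
    {
        byte[] withoutZeros = raw.Where(b => b != 0x00).ToArray();
        return Encoding.ASCII.GetString(withoutZeros);
    }
}
```

For a real file the input could come from File.ReadAllBytes(path), which should be manageable for a 100 MB piece.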

EDIT: To remove the first 0x00 from the file, try the following code:

using System;
using System.IO;
using System.Text;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Writing Imagens1.sql now.");
            File.WriteAllBytes("Imagens1.sql", new byte[] { 0x00, 0x73, 0x00, 0x6F, 0x00, 0x63, 0x00, 0x63, 0x00, 0x65, 0x00, 0x72, 0x00 });

            Console.WriteLine("Reading Imagens1.sql and dumping it into Imagens1.new.sql");
            // File.Create truncates any existing output; File.OpenWrite would leave stale trailing bytes.
            using (BinaryReader br = new BinaryReader(File.OpenRead("Imagens1.sql")))
            using (BinaryWriter bw = new BinaryWriter(File.Create("Imagens1.new.sql")))
            {
                byte[] buffer;

                br.ReadByte(); // This should be the leading 0x00 that we want to get rid of.

                // Copy the rest of the file in 1 KB chunks.
                while ((buffer = br.ReadBytes(1024)).Length > 0)
                {
                    bw.Write(buffer);
                }
            } // The using statements close both streams.

            Console.WriteLine("Finished!");
            Console.Read();
        }
    }
}
  • Marcelo, I have a fair amount of programming experience, but I couldn't work out how to apply this in practice. Would it be too much trouble to give me an example?

  • There must be a 0x00 byte at the very beginning of the file. Just remove it.
