Remove whitespace and line breaks from a string

Viewed 8,648 times

0

I’m building a Web API that generates XML. This XML is read several times a day, so on the first request I serialize the whole object and save the XML to disk, and for the next 24 hours it is read from disk instead of being serialized again.

I do this because it is accessed often, the XML files are large (some up to 300 MB), and the information can be cached for 24 hours.

The problem is the description field. I believe it could be 'compressed', or better, I could try to minify the XML before writing it to disk. For now I’m trying to remove the whitespace and line breaks from just that field, which should already save a good few megabytes.

I’m using Web API in C#, Redis, and MSSQL.

Today I send it like this:

    <description><![CDATA[SOBRADO

Área Terreno: 8 x 28
Área Construída: 170m&sup2;

Pavimento Superior:
2 dormitórios sendo 1 dormitorio com armario embutido planejado e um maste
banheiro
jardim de inverno
sacada


Pavimento Térreo:
2 salas
Copa
Cozinha
Corredor lateral
jardim na frente
quintal

Edícula:
1 dormitórios
banheiro
lavanderia
deposito

4 vagas

IPTU R$ 1.200,00 anual]]></description>

I would like to send it like this:

<description><![CDATA[SOBRADO Área Terreno: 8 x 28    Área Construída: 170m&sup2;...

I use two functions to try to clean up the text, but the result is not what I would like.

description = Biblioteca.RemoveTroublesomeCharacters(Biblioteca.CorrigeDescricao(imovel.Descricao)),

internal static string CorrigeDescricao(string descricao)
{
    var tab = '\u0009';
    descricao = descricao.Replace("  ", " ");
    descricao = descricao.Replace("=\r\n", "");
    descricao = descricao.Replace(";\r\n", "");
    descricao = descricao.Replace("\t", " ");
    descricao = descricao.Replace(tab.ToString(), "");
    return RemoveHtml(descricao);
}

And

internal static string RemoveTroublesomeCharacters(string inString)
{
    if (inString == null) return null;

    var newString = new StringBuilder();
    char ch;

    for (int i = 0; i < inString.Length; i++)
    {
        ch = inString[i];
        // remove any characters outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        //if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
        //if using .NET version prior to 4, use above logic
        if (XmlConvert.IsXmlChar(ch)) //this method is new in .NET 4
        {
            newString.Append(ch);
        }
    }
    return newString.ToString();
}
  • 1

    And what’s your problem?

  • I want to remove whitespace and line breaks to try to minify the file. Some files are 300 MB, so if I manage to remove some characters it could already mean around 20 MB less in the end.

  • @Dorathoto, what kind of application will consume your service? Are they all .NET?

  • No, I think most of them are in Java, but I don’t know their technologies.

4 answers

2

NEVER treat XML as text. There are terms for people who do this which, although technical and even used in books, would get me banned from here if I used them ;)

Instead, encapsulate everything you want in XML as an object that is serializable. Then use the Framework’s XML classes to generate the XML when writing, or read from a file. This will not only keep XML compact, it will ensure good formatting and save you hours of development.

Start with this class: XmlWriter.
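
As a rough sketch of what that could look like for the description field from the question: XmlWriter with Indent = false emits no formatting whitespace between elements. The class and method names here are made up for illustration, and the XML declaration is omitted only to keep the output short. Note that the writer does not touch whitespace inside the text itself.

```csharp
using System.Text;
using System.Xml;

public static class CompactXmlSketch
{
    // Writes a single <description> element with the text wrapped in CDATA.
    // The element name mirrors the question; everything else is illustrative.
    public static string WriteDescription(string descricao)
    {
        var settings = new XmlWriterSettings
        {
            Indent = false,            // no pretty-printing whitespace between elements
            OmitXmlDeclaration = true  // assumption: keeps the sketch output short
        };

        var sb = new StringBuilder();
        using (var writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("description");
            writer.WriteCData(descricao);
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

In a real feed the same writer would be pointed at a FileStream and would emit the whole document, not a single element.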

  • I didn’t even notice he was building the XML by hand; I thought he just wanted to remove characters from Descricao. In any case, he should not implement anything like this himself: Web API will return JSON or XML depending on the Accept header sent by the client.

  • 1

    I don’t treat it as text; I do a full serialization of my XML.

  • @Dorathoto, what we’re trying to say is that the RemoveTroublesomeCharacters method seems unnecessary; JSON.NET should be able to handle this without any intervention. You would only need to define a model, populate it, and return it. Web API will serialize the model as XML if the client asks for XML through the Accept header.

  • I use RemoveTroublesomeCharacters only to ensure UTF-8 encoding. I think I was unclear in the question: I do it the correct way and serialize properly with Web API, as it should be. The point is that, besides that, I also save the XML physically to disk, and these files have become large.

  • I edited the question; it was not clear enough.

2


Dorathoto, I believe that compressing the entire response is better than removing spaces from the string.

The easiest way to do it without configuring it directly in IIS is to install the following NuGet package: Microsoft ASP.NET Web API Compression

PM> Install-Package Microsoft.AspNet.WebApi.Extensions.Compression.Server

Then add the following configuration to your Web API startup:

GlobalConfiguration.Configuration.MessageHandlers.Insert(0, 
    new ServerCompressionHandler(
        new GZipCompressor(), 
        new DeflateCompressor())); 

Original Response (EN)

  • It looks interesting. In my first test, though, it seemed to have a big impact on performance, which is a very delicate point in my case. I even had to bring in Redis to try to speed up a process that was already heavy.

  • @Dorathoto, as I said, this is a way to apply compression directly in code. Another way would be to enable it directly in IIS: search for WebAPI GZip on IIS, try it, and share the result with us.
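
Since the question is also about the size of the cached file, the same GZip idea could in principle be applied to the XML before it is stored. This is not what the package above does. It is only a minimal sketch using the standard System.IO.Compression classes, with hypothetical class and method names, working in memory so that the round trip is easy to check.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipSketch
{
    // Compresses the XML text into a GZip byte array (UTF-8 encoded).
    public static byte[] Compress(string xml)
    {
        var raw = Encoding.UTF8.GetBytes(xml);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
            {
                gzip.Write(raw, 0, raw.Length);
            } // disposing the GZipStream flushes the final compressed block
            return output.ToArray();
        }
    }

    // Restores the original XML text from the GZip byte array.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For files of several hundred megabytes, the streams would normally be chained to a FileStream instead of holding everything in memory, and the CPU cost mentioned in the comment above would need to be measured.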

2

You can solve the problem of excess line breaks and spaces with a regex:

Removing excess whitespace

pattern : (\s)\s+
replace : $1

This matches any run of two or more whitespace characters and replaces it with a single one. Note that the run is replaced with the first character found, which is why only the first \s is inside the capture group.

Example

'teste de quebra    '
'de linha     '

Applying would look like this:

'teste de quebra de linha '

because it merged the ' \n' run and replaced it with ' ', since ' ' was the first character found.

Removing line breaks

pattern : (\n){2,}
replace : $1

They are similar but not the same, because this one only considers line breaks. You may need to change it to (\r?\n){2,}, because on Windows some IDEs still add the carriage return.

Example

'quebra de linha     '

'em duas     '

Applying gets like this:

'quebra de linha'
'em duas'
  • very interesting...

  • How would you do it in C#? Description = Regex.Replace(Description, @"(\n){2,}");

  • @Dorathoto Regex rgx = new Regex(pattern); String result = rgx.Replace(descricao, replacement);
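
Fleshing out that last comment, a minimal C# sketch might look like the following. The class and method names are hypothetical. CollapseWhitespace deliberately swaps in a simpler pattern, \s+ replaced by a fixed single space, so that lone line breaks and tabs are flattened too, which is what the single-line output in the question asks for. CollapseBlankLines uses the line-break pattern with $1 and allows for an optional carriage return.

```csharp
using System.Text.RegularExpressions;

public static class DescricaoCleanup
{
    // Any run of whitespace: spaces, tabs, \r, \n.
    private static readonly Regex AnyWhitespaceRun = new Regex(@"\s+", RegexOptions.Compiled);

    // Two or more consecutive line breaks (LF or CRLF).
    private static readonly Regex RepeatedLineBreaks = new Regex(@"(\r?\n){2,}", RegexOptions.Compiled);

    // Flattens the text to one line with single spaces between words.
    public static string CollapseWhitespace(string descricao)
    {
        if (descricao == null) return null;
        return AnyWhitespaceRun.Replace(descricao, " ").Trim();
    }

    // Keeps line breaks but removes blank lines between them.
    public static string CollapseBlankLines(string descricao)
    {
        if (descricao == null) return null;
        return RepeatedLineBreaks.Replace(descricao, "$1");
    }
}
```

In the question's code this could replace the chain of Replace calls, for example description = DescricaoCleanup.CollapseWhitespace(imovel.Descricao).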

-1

I had a similar problem, but I needed the description to have at most one line break and at most one space between words, including tabs. Below is a pattern that can solve your problem:

Example text

Using the pattern with regex replace (regex used with Ignore Case in C#)

Result of the description with the regex applied
