Java converts accented characters to "strange" characters

Question

Asked 4 years, 5 months ago

Viewed 204 times

1

I am facing a somewhat strange problem. I have a MySQL 5.6 database with a table whose longblob column stores HTML text in compressed (ZIP) form. When my website sends a request to the backend (I use Spring Boot 2.3.3, JPA and Java 11.0.9 Amazon Corretto), the backend looks up this database record, decompresses it and returns only the HTML text to my website.

When I do this on my local machine it works perfectly, but when I do it on the server, with the same version of Java, it does not work: the backend converts accented characters to "strange characters".

This is an example of the text saved in the database:

<p style="text-indent:0pt;margin-top:0pt;margin-bottom:0pt;"><span style="color:#000000;font-weight:bold;">&nbsp;SAIBAM</span><span style="color:#000000;"> quantos a presente </span><span style="color:#000000;font-weight:bold;">Escritura Pública de Cessão e Transferência de Posse</span><span style="color:#000000;"> virem que, sendo aos æData_lav1&gt;, neste Distrito de Itaió, 2º do município e comarca de Itaiópolis, Estado de Santa Catarina, neste Ofício de Notas, sito às margens da Rodovia SC 477, s/n, perante mim, Tabelião de Notas, partes entre si, justas e contratadas a saber</span></p>

This is the result obtained on the server:

It can be verified that all accented letters have been converted to "strange characters".

I tried changing the database connection to UTF-8, without success:

connection...&useUnicode=yes&characterEncoding=UTF-8

spring.jpa.properties.hibernate.connection.characterEncoding=utf-8
spring.jpa.properties.hibernate.connection.CharSet=utf-8
spring.jpa.properties.hibernate.connection.useUnicode=true

This is the method that decompresses the content:

public String convertToEntityAttribute(byte[] compactado) {
    if(compactado == null){
        return "";
    }

    final int BUFFER_SIZE = 1024;

    try {
        ByteArrayInputStream is = new ByteArrayInputStream(compactado);

        GZIPInputStream gis = new GZIPInputStream(is, BUFFER_SIZE);
        StringBuilder builder = new StringBuilder();

        byte[] data = new byte[BUFFER_SIZE];
        int bytesRead;

        while ((bytesRead = gis.read(data)) != -1) {
            builder.append(new String(data, 0, bytesRead, Charset.defaultCharset()));
        }

        gis.close();
        is.close();

        return builder.toString();
    } catch (IOException e) {
        return "";
    }
}

Does anyone have any idea what it might be?

2

Have you checked whether the server is sending the correct character encoding headers? It needs to tell the browser that the content is in UTF-8. Judging by your screenshot, the browser does not know this and is trying to display the text as if it were Latin-1.

– bfavaretto

2021/03/04 at 18:12
Hello @bfavaretto, I don't believe that is the problem: I logged the result of builder.toString() right before the return in my decompression method, and the text already contains the strange characters at that point.

– Everton

2021/03/04 at 18:25
2

Charset.defaultCharset(). Have you tried StandardCharsets.UTF_8? You seem to be setting UTF-8 elsewhere; if the server has another default, e.g., CP-1252, the conversion will fail.

– Anthony Accioly

2021/03/04 at 18:55
2

Hello @Anthonyaccioly, that was exactly it. I don't know how I hadn't thought of it before, haha. Thank you so much!

– Everton

2021/03/04 at 19:03

1 answer

by Anthony Accioly • **20,516** points · Answer 1 · 2021-03-04T21:14:19+00:00

In the following line:

builder.append(new String(data, 0, bytesRead, Charset.defaultCharset()));

The defaultCharset method returns the system's default charset. You are forcing UTF-8 when reading the data from the database. If your server is configured with any other default charset (e.g., CP-1252 on Windows), there is a mismatch between the encodings and the String constructor can produce garbage.

You can solve this problem by hard-coding StandardCharsets.UTF_8:

builder.append(new String(data, 0, bytesRead, StandardCharsets.UTF_8));
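To see the mismatch in isolation, here is a minimal sketch. The class name, the helper method, and the sample word ("Cessão", taken from the text in the question) are only illustrative, and ISO-8859-1 stands in for whatever non-UTF-8 default the server might have. It decodes the same UTF-8 bytes twice, once with the wrong charset and once with the right one:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {

    // Decodes the given bytes with the given charset (illustrative helper).
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The accented letter is written as a Unicode escape so that this
        // source file stays ASCII-only and compiles the same everywhere.
        byte[] stored = "Cess\u00e3o".getBytes(StandardCharsets.UTF_8);

        // Shows what Charset.defaultCharset() resolves to on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Wrong charset: each of the two bytes of the accented letter
        // is turned into a separate character.
        String wrong = decode(stored, StandardCharsets.ISO_8859_1);

        // Right charset: the two bytes are read back as one character.
        String right = decode(stored, StandardCharsets.UTF_8);

        System.out.println("wrong: " + wrong.length() + " chars");
        System.out.println("right: " + right.length() + " chars");
    }
}
```

Running this on both the local machine and the server, and comparing the "default charset" line, is a quick way to confirm whether the two environments really differ.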

An additional tip: for small strings on Java 9+, you don't need all this juggling with intermediate buffers and a StringBuilder. The InputStream class has a readAllBytes method that does what you need. Reading and converting the whole byte array at once is more efficient because it avoids creating and discarding intermediate strings.
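Beyond efficiency, decoding everything in one step also sidesteps a correctness risk that is not mentioned above: when bytes are decoded buffer by buffer, a multi-byte UTF-8 character can straddle two buffers and be decoded as replacement characters. The sketch below is only an illustration under assumptions (the class and method names are made up, and a tiny chunk size is used to force the split):

```java
import java.nio.charset.StandardCharsets;

class ChunkedDecodingDemo {

    // Decodes the bytes chunk by chunk, similar to a read loop that
    // converts each filled buffer to a String on its own.
    static String decodeInChunks(byte[] bytes, int chunkSize) {
        StringBuilder builder = new StringBuilder();
        for (int start = 0; start < bytes.length; start += chunkSize) {
            int length = Math.min(chunkSize, bytes.length - start);
            builder.append(new String(bytes, start, length, StandardCharsets.UTF_8));
        }
        return builder.toString();
    }

    // Decodes the whole array in a single step.
    static String decodeWhole(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Four characters, two of them accented: six bytes in UTF-8.
        String original = "a\u00e7\u00e3o";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        // With 2-byte chunks, the first accented letter is cut in half.
        String chunked = decodeInChunks(bytes, 2);
        String whole = decodeWhole(bytes);

        System.out.println("chunked equals original: " + chunked.equals(original));
        System.out.println("whole equals original:   " + whole.equals(original));
    }
}
```

With a 1024-byte buffer the split is much rarer than in this forced example, but it can still happen whenever a character happens to fall on a buffer boundary.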

Here is a possible implementation for the method convertToEntityAttribute:

public String convertToEntityAttribute(byte[] compactado) {
    if(compactado == null) {
        return "";
    }

    try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(compactado))) {
        return new String(gis.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
        // You may want to at least log the exception here
        return "";
    }
}
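A quick way to check the approach end to end is a round trip. This is a sketch under assumptions: the question does not show how the data is written, so the compress method here is a hypothetical counterpart that encodes the text as UTF-8 before gzipping it, and the class and method names are made up for the example. IOExceptions are rethrown as UncheckedIOException only to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipTextRoundTrip {

    // Hypothetical write side: encode as UTF-8, then gzip the bytes.
    static byte[] compress(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed at this point, so the output is complete.
        return out.toByteArray();
    }

    // Read side: gunzip, then decode with the same explicit charset.
    static String decompress(byte[] compressed) {
        try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gis.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Accented letters are escaped so the source file stays ASCII-only.
        String original = "Cess\u00e3o e Transfer\u00eancia";
        String restored = decompress(compress(original));
        System.out.println("round trip ok: " + original.equals(restored));
    }
}
```

Because both sides name the charset explicitly, the result no longer depends on the default charset of the machine that runs the code.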