Error converting HTML to PDF using XMLWorkerHelper

When exporting an HTML file to PDF with iTextSharp and XMLWorker, an error sometimes occurs saying that a certain tag is not closed. While searching I found the post "How to Convert HTML to Valid XHTML?" (which is about JavaScript). It says the input should be converted to XHTML first, because that guarantees the tags are properly formed.

My application queries a SQL table that stores saved HTML files. When I try to turn them into PDF, the error saying that a certain tag is not closed occurs. Below is the code I use to export to PDF:

using System.IO;
using System.IO.Compression;
using System.Web.Mvc;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;

public ActionResult GetPdfFileZiped(ProcessamentoRegistros pProcessamentoRegistros)
{
    pProcessamentoRegistros.IdProcessamentoDiario = 1;
    pProcessamentoRegistros.IdRegistro = 1;
    pProcessamentoRegistros.IdServico = 2;
    ProcessamentoRegistros _processamento = _IRepositorio.ObterProcessamentoRegistros(pProcessamentoRegistros);

    // Render the stored HTML into an in-memory PDF (landscape A4).
    var doc = new Document(PageSize.A4.Rotate());
    var stream = new MemoryStream();
    var pw = PdfWriter.GetInstance(doc, stream);
    var minhaStringHTML = _processamento.DocumentoHtml.Trim();

    doc.Open();

    using (var srHtml = new StringReader(minhaStringHTML))
    {
        // <-- THE ERROR OCCURS HERE: the HTML structure is sometimes not well formed
        XMLWorkerHelper.GetInstance().ParseXHtml(pw, doc, srHtml);
    }
    doc.Close();

    // Wrap the generated PDF in a zip archive and return it as a download.
    using (var compressedFileStream = new MemoryStream())
    {
        using (var zipArchive = new ZipArchive(compressedFileStream, ZipArchiveMode.Update, false))
        {
            var zipEntry = zipArchive.CreateEntry("MeuPDFZipado.pdf");
            using (var originalFileStream = new MemoryStream(stream.ToArray()))
            {
                using (var zipEntryStream = zipEntry.Open())
                {
                    originalFileStream.CopyTo(zipEntryStream);
                }
            }
        }
        return new FileContentResult(compressedFileStream.ToArray(), "application/zip") { FileDownloadName = "Filename.zip" };
    }
}

For example, the img tag below is not closed, and I have no control over its formatting. The same error occurs with some other tags:

<IMG border="0" src="https://www.sifge.caixa.gov.br/Empresa/Crf/images/caixa.gif" width=180 height=44>

Below is the full HTML:

<HTML>

<HEAD>
<META NAME="GENERATOR" Content="Microsoft Visual Studio 6.0">
<script language=javascript>
//function MudarPagina() {
//  window.history.back();
//}
</script>
</HEAD>
<!--body bgcolor=white onBlur=MudarPagina();-->
<body bgcolor=white>
    <FORM method="post" style="BACKGROUND-COLOR: white">
    <!--FORM name="Imprimir" method="post" style="BACKGROUND-COLOR: white"-->
<br>    
<table>
<tr>
<td align=center><a href="javascript:window.print();"><IMG src="https://www.sifge.caixa.gov.br/Empresa/Crf/images/botimprimir.gif" border=0></a>
<a href="javascript:window.history.back();"><IMG src="https://www.sifge.caixa.gov.br/Empresa/Crf/images/botvoltar.gif" border=0></a></td>
</tr>

<tr><td>

<table width="75%" CELLSPACING=0 CELLPADDING=10 border=1 align=center bordercolorlight="#FFFFFF" bordercolordark="#CCCCCC">


<tr>
<td>    

    <TABLE WIDTH=100% BORDER=0 CELLSPACING=0 CELLPADDING=0 style="color: black" class=txtcentral>
        <tr>
            <td align=left><IMG border="0" src="https://www.sifge.caixa.gov.br/Empresa/Crf/images/caixa.gif" width=180 height=44></td>
        </tr>

        <tr><td colspan=2>&nbsp</td></tr>

        <tr>
            <td align=rigth><span style="font-size: 13pt" align=center><strong>Certificado de Regularidade do FGTS - CRF</strong></span></td>
        </tr>
    </table>

    <TABLE WIDTH=100% BORDER=0 CELLSPACING=0 CELLPADDING=0 style="color: black" class=txtcentral>

        <tr><td colspan=2>&nbsp</td></tr>
        <tr><td colspan=2>&nbsp</td></tr>

        <tr>
            <TD width=22%><font style=" font-family: Verdana;font-size:10pt"><strong>Inscrição:</strong></font></TD>
            <TD ><font style=" font-family: Verdana;font-size:8pt">08439659/0001-50</font></TD>
        </tr>
        <tr>
            <td width=22% valign=top nowrap><font style=" font-family: Verdana;font-size:10pt"><strong>Razão Social:</strong></font></TD>
            <td><font style=" font-family: Verdana;font-size:8pt">CPFL ENERGIAS RENOVAVEIS S A</font></TD>
        </tr>

        <tr>
            <td width=22% nowrap><font style=" font-family: Verdana;font-size:10pt"><strong>Nome Fantasia:</strong></font></TD>
            <td ><font style=" font-family: Verdana;font-size:8pt">CPFL RENOVAVEIS</font></TD>
        </tr>

        <tr>
            <td width=22% valign=top><font style=" font-family: Verdana;font-size:10pt"><strong>Endereço:</strong></font></TD>
            <td ><font style=" font-family: Verdana;font-size:8pt">AV DOUTOR CARDOSO DE MELO   1184   ANDAR 7 / VILA OLIMPIA / SAO PAULO / SP / 4548-004</font></TD>
        </tr>

        <tr><td colspan=2>&nbsp</td></tr>
        <tr><td colspan=2>&nbsp</td></tr>

        <tr>
            <TD colspan=2 style="text-align: justify"><font style=" font-family: Verdana;font-size:10pt">A Caixa Econômica Federal, no uso da atribuição que lhe confere o Art. 7, da
            Lei 8.036, de 11 de maio de 1990, certifica que, nesta data, a empresa acima identificada
            encontra-se em situação regular perante o Fundo de Garantia do Tempo de Serviço - FGTS.
            </font>
            </TD>
        </tr>

        <tr><td colspan=2>&nbsp</td></tr>
        <tr><td colspan=2>&nbsp</td></tr>

        <tr>
            <td style="text-align: justify" colspan=2><font style=" font-family: Verdana;font-size:10pt">O presente Certificado não servirá de prova contra cobrança de quaisquer débitos referentes
            a contribuições e/ou encargos devidos, decorrentes das obrigações com o FGTS.</font>
            </td>
        </tr>

        <tr><td colspan=2>&nbsp</td></tr>
        <tr><td colspan=2>&nbsp</td></tr>


        <tr>
            <td colspan=2><font style=" font-family: Verdana;font-size:10pt"><strong>Validade: </strong>28/02/2017 a 29/03/2017</font></TD>
        </tr>
        <tr><td colspan=2>&nbsp</td></tr>

        <tr>
            <td colspan=2><font style=" font-family: Verdana;font-size:10pt"><strong>Certificação Número: </strong>2017022805233090232330</font></TD></TR>

        <tr><td colspan=2>&nbsp</td></tr>
        <tr><td colspan=2>&nbsp</td></tr>

        <tr>
            <TD colspan=2><font style=" font-family: Verdana;font-size:10pt">Informação obtida em 15/03/2017, às 17:14:51.</font></TD>
        </tr>

        <tr><td colspan=2>&nbsp</td></tr>
        <tr><td colspan=2>&nbsp</td></tr>

        <tr>
            <TD style="text-align: justify" colspan=2><font style=" font-family: Verdana;font-size:10pt">A utilização deste Certificado
                para os fins previstos em Lei está condicionada à verificação de
                autenticidade no site da Caixa: <strong>www.caixa.gov.br</strong></font></TD>
            </tr>
    </TABLE>
</form>

</td></tr></table>

</td>
</tr>

</table> 

<script language=javascript>
//window.print();
</script>   
</BODY>
</HTML>

How do I get around this problem? Can I parse the HTML and turn it into XHTML? Is there another free alternative for converting this HTML to PDF that preserves the tags' styles?

1 answer

How do I get around this problem?

The right way to get around the problem is to attack its root: fix the HTML so that the tool can work properly. For example, you can use the W3C validator to check whether the HTML you pass in has errors.
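As a quick local check before handing the markup to XMLWorker, you can test whether it parses as XML at all. The sketch below uses only `System.Xml.Linq` from the .NET base library. The names `WellFormedCheck` and `FindXmlError` are made up for illustration, and the sample strings are shortened versions of the img tag from the question. Note that this checks XML well-formedness, not HTML validity, so it is not a substitute for the W3C validator.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class WellFormedCheck
{
    // Returns null when the markup parses as XML,
    // otherwise the parser's error message.
    public static string FindXmlError(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }

    public static void Main()
    {
        // The img element is never closed, so </td> does not match it.
        string unclosed = "<td><IMG border=\"0\" src=\"caixa.gif\"></td>";
        // Same element written as a self-closing tag.
        string closed = "<td><img border=\"0\" src=\"caixa.gif\" /></td>";

        Console.WriteLine(FindXmlError(unclosed) ?? "well-formed");
        Console.WriteLine(FindXmlError(closed) ?? "well-formed");
    }
}
```

The first call reports a parser error and the second prints "well-formed". Unquoted attribute values such as width=180 and entities without a trailing semicolon such as &nbsp are rejected in the same way.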

Can I parse the HTML and turn it into XHTML?

I have no experience with the tool, but try TidyManaged.

Below is an example of its use:

using System;
using TidyManaged;

public class Test
{
  public static void Main(string[] args)
  {
    using (Document doc = Document.FromString("<hTml><title>test</tootle><body>asd</body>"))
    {
      doc.ShowWarnings = false;
      doc.Quiet = true;
      doc.OutputXhtml = true;
      doc.CleanAndRepair();
      string parsed = doc.Save();
      Console.WriteLine(parsed);
    }
  }
}

The output will be something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
<title>test</title>
</head>
<body>
asd
</body>
</html>

It is probably also possible to do something like this with the W3C API.

Is there another free alternative for converting this HTML to PDF that preserves the tags' styles?

The problem is not the PDF generation but the HTML (the root problem, as I said above). If something prevents you from fixing the HTML at its source, you can use a tool like the one above to parse the HTML and correct the errors it finds. That is not 100% reliable, though: some errors may go undetected.
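Putting the two pieces together, a minimal sketch of the repair-then-convert approach could look like the code below. It uses only the TidyManaged and iTextSharp calls that already appear in this thread. The class name `HtmlToPdf`, the method name `Convert`, and the two aliases are made up for illustration (both libraries define a type called `Document`, so the aliases keep them apart). I have not run this against the HTML in the question, so treat it as a starting point rather than a verified fix.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;
using ITextDocument = iTextSharp.text.Document;  // alias: avoids the clash with TidyManaged.Document
using TidyDocument = TidyManaged.Document;

public static class HtmlToPdf
{
    // Repairs the raw HTML with Tidy, then renders the resulting XHTML to PDF bytes.
    public static byte[] Convert(string rawHtml)
    {
        string xhtml;
        using (TidyDocument tidy = TidyDocument.FromString(rawHtml))
        {
            tidy.ShowWarnings = false;
            tidy.Quiet = true;
            tidy.OutputXhtml = true;   // ask Tidy for XHTML output
            tidy.CleanAndRepair();
            xhtml = tidy.Save();
        }

        using (var stream = new MemoryStream())
        {
            var pdf = new ITextDocument(PageSize.A4.Rotate());
            PdfWriter writer = PdfWriter.GetInstance(pdf, stream);
            pdf.Open();
            using (var reader = new StringReader(xhtml))
            {
                XMLWorkerHelper.GetInstance().ParseXHtml(writer, pdf, reader);
            }
            pdf.Close();
            return stream.ToArray();
        }
    }
}
```

In the controller from the question, the returned bytes would take the place of `stream.ToArray()` when filling the zip entry.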

  • Your solutions were very good, but in the current situation I have no way to eliminate the root problem. I solved it by using the Pechkin DLL instead: [https://github.com/gmanny/Pechkin]. Next I intend to make some improvements to the zip archive compression, and I will open another post for that.

  • Pechkin looks interesting, thanks for sharing.
