Convert docx file to pdf without losing formatting?

I am converting a DOCX file to PDF using the docx4j API, but the converted PDF does not keep the original formatting of the text.

Dependencies:

    <!-- docx4j -->
    <dependency>
        <groupId>org.docx4j</groupId>
        <artifactId>docx4j</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.docx4j</groupId>
        <artifactId>docx4j-ImportXHTML</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>1.6.1</version>
    </dependency>
    <dependency>
        <groupId>org.capaxit.textimage</groupId>
        <artifactId>TextImageGen</artifactId>
        <version>2.0-SNAPSHOT</version>
    </dependency>
    <dependency>
        <groupId>com.googlecode.jaxb-namespaceprefixmapper-interfaces</groupId>
        <artifactId>JAXBNamespacePrefixMapper</artifactId>
        <version>2.2.4</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>com.sun.xml.bind</groupId>
        <artifactId>jaxb-impl</artifactId>
        <version>2.2.11</version>
    </dependency>
    <dependency>
        <groupId>org.glassfish.jaxb</groupId>
        <artifactId>jaxb-runtime</artifactId>
        <version>2.2.11</version>
    </dependency>
    <dependency>
        <groupId>org.plutext</groupId>
        <artifactId>jaxb-xslfo</artifactId>
        <version>1.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.docx4j</groupId>
        <artifactId>docx4j-export-fo</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.docx4j</groupId>
        <artifactId>xhtmlrenderer</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.xmlgraphics</groupId>
        <artifactId>xmlgraphics-commons</artifactId>
        <version>2.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.avalon.framework</groupId>
        <artifactId>avalon-framework-api</artifactId>
        <version>4.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.avalon.framework</groupId>
        <artifactId>avalon-framework-impl</artifactId>
        <version>4.3.1</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.4</version>
    </dependency>

Method that performs file replacement and conversion:

    @Path("fichaCaptacao")
    @GET
    @Produces({"application/pdf"})
    public Response fichaCaptacao(@Context ServletContext servletContext) throws Exception {
        // Exclude context init from timing
        org.docx4j.wml.ObjectFactory foo = org.docx4j.jaxb.Context.getWmlObjectFactory();

        // Font regex (optional)
        // Set regex if you want to restrict to some defined subset of fonts
        // Here we have to do this before calling createContent,
        // since that discovers fonts
        String outputFile = "/home/desenvolvimento/qimob.git/qimob-web/src/main/webapp/resources/templates/contratos/OUT_VariableReplace.docx";
        // Set the regex if you want to restrict discovery to a group of fonts
        String regex = null;
        regex = ".*(Courier New|Arial|Times New Roman|Comic Sans|Georgia|Impact|Lucida Console|Lucida Sans Unicode|Palatino Linotype|Tahoma|Trebuchet|Verdana|Symbol|Webdings|Wingdings|MS Sans Serif|MS Serif).*";

        PhysicalFonts.setRegex(regex);

        String docInputStream = servletContext.getRealPath("/") + "/resources/templates/contratos/CONTRATO_LOCACAO_IMOVEL_RESIDENCIAL.docx";
        InputStream docxInputStream = new FileInputStream(docInputStream);

        WordprocessingMLPackage tmpPkg = null;

        tmpPkg = WordprocessingMLPackage.load(docxInputStream);

        MainDocumentPart documentPart = tmpPkg.getMainDocumentPart();

        HashMap<String, String> mappings = new HashMap<>();
        mappings.put("contratante", "Omar Mota");
        mappings.put("naturalidade", "Goiás-GO");
        mappings.put("nacionalidade", "Brasileiro");

        documentPart.variableReplace(mappings);
        // Refresh the values of DOCPROPERTY fields
        FieldUpdater updater = new FieldUpdater(tmpPkg);
        updater.update(true);

        // Set up font mapper (optional)
        Mapper fontMapper = new IdentityPlusMapper();
        tmpPkg.setFontMapper(fontMapper);

        // FO exporter setup (required)
        // .. the FOSettings object
        final FOSettings foSettings = Docx4J.createFOSettings();
        foSettings.setWmlPackage(tmpPkg);

        // Document format:
        // The default implementation of the FORenderer that uses Apache Fop will output
        // a PDF document if nothing is passed via
        foSettings.setApacheFopMime(FOSettings.MIME_PDF);
        // apacheFopMime can be any of the output formats defined in org.apache.fop.apps.MimeConstants eg org.apache.fop.apps.MimeConstants.MIME_FOP_IF or
        // FOSettings.INTERNAL_FO_MIME if you want the fo document as the result.
        //foSettings.setApacheFopMime(FOSettings.INTERNAL_FO_MIME);

        // Specify whether PDF export uses XSLT or not to create the FO
        // (XSLT takes longer, but is more complete).

//      // Save it
//      if (true) {
//          SaveToZipFile saver = new SaveToZipFile(tmpPkg);
//          saver.save(outputFile);
//      } else {
//          System.out.println(XmlUtils.marshaltoString(documentPart.getJaxbElement(), true,
//                  true));
//      }

//      PdfSettings pdfSettings = new PdfSettings();
//      OutputStream out = new FileOutputStream(new File("/home/desenvolvimento/Documents/conversao.pdf"));
//      PdfConversion converter = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(tmpPkg);
//      converter.output(out,pdfSettings);

        ResponseBuilder builder = Response.ok(
                new StreamingOutput() {
                    public void write(OutputStream output) throws IOException, WebApplicationException {
                        try {
                            Docx4J.toFO(foSettings, output, Docx4J.FLAG_EXPORT_PREFER_XSL);
                        } catch (Docx4JException e) {
                            throw new WebApplicationException(e);
                        }
                    }
                }
        );

        // Clean up, so any ObfuscatedFontPart temp files can be deleted
        if (tmpPkg.getMainDocumentPart().getFontTablePart() != null) {
            tmpPkg.getMainDocumentPart().getFontTablePart().deleteEmbeddedFontTempFiles();
        }
        // This would also do it, via finalize() methods
        updater = null;
        tmpPkg = null;

        return builder.build();
//      // Prefer the exporter, that uses a xsl transformation
//      // Docx4J.toFO(foSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
//
//      // Prefer the exporter, that doesn't use a xsl transformation (= uses a visitor)
//      // .. faster, but not yet at feature parity
//      // Docx4J.toFO(foSettings, os, Docx4J.FLAG_EXPORT_PREFER_NONXSL);
//
//      System.out.println("Saved: " + outputfilepath);
//

    }
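
A minimal sketch of one way to sanity-check the regex passed to `PhysicalFonts.setRegex` above, using only the standard library. It assumes (this is an assumption, not something the log confirms) that the regex is applied to each font's file location rather than to its display name. The class name `FontRegexCheck`, the helper `accepted`, and the three file paths are hypothetical; real font paths differ per system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class FontRegexCheck {

    // Same regex as in the method above.
    static final String REGEX =
            ".*(Courier New|Arial|Times New Roman|Comic Sans|Georgia|Impact|Lucida Console"
            + "|Lucida Sans Unicode|Palatino Linotype|Tahoma|Trebuchet|Verdana|Symbol"
            + "|Webdings|Wingdings|MS Sans Serif|MS Serif).*";

    /** Returns the candidates that the regex accepts as a whole-string match. */
    static List<String> accepted(String regex, List<String> candidates) {
        Pattern pattern = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (pattern.matcher(candidate).matches()) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical font file locations, for illustration only.
        List<String> paths = Arrays.asList(
                "file:/usr/share/fonts/truetype/example/Arial.ttf",
                "file:/usr/share/fonts/truetype/example/Times_New_Roman.ttf",
                "file:/usr/share/fonts/truetype/example/calibri.ttf");
        for (String path : accepted(REGEX, paths)) {
            System.out.println("accepted: " + path);
        }
    }
}
```

With these made-up paths only the `Arial.ttf` entry is accepted: the underscores in `Times_New_Roman.ttf` do not match the spaces in the regex, and Calibri is not listed in the regex at all.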

Log:

    15:24:27,217 INFO  [org.docx4j.openpackaging.contenttype.ContentTypeManager] (default task-41) Detected WordProcessingML package
    15:24:27,217 INFO  [org.docx4j.openpackaging.io3.Load3] (default task-41) Instantiated package of type org.docx4j.openpackaging.packages.WordprocessingMLPackage
    15:24:27,218 INFO  [org.docx4j.openpackaging.io3.Load3] (default task-41) package read;  elapsed time: 3 ms
    15:24:27,218 INFO  [org.docx4j.openpackaging.parts.JaxbXmlPart] (default task-41) Lazily unmarshalling /word/document.xml
    15:24:27,224 INFO  [org.docx4j.openpackaging.parts.DocPropsCorePart] (default task-41) unmarshalling org.docx4j.openpackaging.parts.DocPropsCorePart
    15:24:27,224 INFO  [org.docx4j.openpackaging.parts.DocPropsExtendedPart] (default task-41) unmarshalling org.docx4j.openpackaging.parts.DocPropsExtendedPart
    15:24:27,225 INFO  [org.docx4j.model.fields.FieldUpdater] (default task-41) 

    Simple Fields in /word/document.xml
    ============= 
    Found 0 simple fields 

     Complex Fields in /word/document.xml
    ============== 
    Found 0 fields 

    15:24:27,225 WARN  [org.docx4j.fonts.IdentityPlusMapper] (default task-41) WARNING! SubstituterWindowsPlatformImpl works best on Windows.  To get good results on other platforms, you'll probably  need to have installed Windows fonts.
    15:24:27,227 INFO  [org.docx4j.fonts.RunFontSelector] (default task-41) rPrDefault/rFonts referenced Calibri
    15:24:27,227 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Calibri' is not mapped to a physical font. 
    15:24:27,236 INFO  [org.docx4j.openpackaging.parts.WordprocessingML.FontTablePart] (default task-41) Writing temp embedded fonts 1463077467236
    15:24:27,236 WARN  [org.docx4j.fonts.IdentityPlusMapper] (default task-41) - - No physical font for: Calibri
    15:24:27,236 WARN  [org.docx4j.fonts.Mapper] (default task-41) Overwriting existing fontMapping: arial
    15:24:27,236 WARN  [org.docx4j.fonts.IdentityPlusMapper] (default task-41) - - No physical font for: Times New Roman
    15:24:27,244 INFO  [org.docx4j.fonts.RunFontSelector] (default task-41) rPrDefault/rFonts referenced Calibri
    15:24:27,244 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Calibri' is not mapped to a physical font. 
    15:24:27,252 INFO  [org.docx4j.openpackaging.parts.WordprocessingML.FontTablePart] (default task-41) Writing temp embedded fonts 1463077467252
    15:24:27,254 INFO  [org.docx4j.convert.out.common.preprocess.FieldsCombiner] (default task-41) starting
    15:24:27,255 INFO  [org.docx4j.convert.out.common.preprocess.CoverPageSectPrMover] (default task-41) No need to move sectPr 
    15:24:27,261 WARN  [org.docx4j.openpackaging.parts.WordprocessingML.DocumentSettingsPart] (default task-41) No w:settings/w:compat element
    15:24:27,265 INFO  [org.docx4j.model.structure.PageDimensions] (default task-41) No cols in this section; defaulting.
    15:24:27,266 INFO  [org.docx4j.fonts.RunFontSelector] (default task-41) rPrDefault/rFonts referenced Calibri
    15:24:27,266 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Calibri' is not mapped to a physical font. 
    15:24:27,266 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Calibri is not mapped!
    15:24:27,280 INFO  [org.docx4j.XmlUtils] (default task-41) Using org.apache.xalan.transformer.TransformerImpl
    15:24:27,280 INFO  [org.docx4j.convert.out.common.AbstractConversionContext] (default task-41) /pkg:package
    15:24:27,286 INFO  [org.docx4j.fonts.RunFontSelector] (default task-41) rPrDefault/rFonts referenced Calibri
    15:24:27,286 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Calibri' is not mapped to a physical font. 
    15:24:27,294 INFO  [org.docx4j.openpackaging.parts.WordprocessingML.FontTablePart] (default task-41) Writing temp embedded fonts 1463077467294
    15:24:27,294 INFO  [org.docx4j.convert.out.common.preprocess.FieldsCombiner] (default task-41) starting
    15:24:27,294 INFO  [org.docx4j.convert.out.common.preprocess.CoverPageSectPrMover] (default task-41) No need to move sectPr 
    15:24:27,296 INFO  [org.docx4j.model.structure.PageDimensions] (default task-41) No cols in this section; defaulting.
    15:24:27,296 INFO  [org.docx4j.fonts.RunFontSelector] (default task-41) rPrDefault/rFonts referenced Calibri
    15:24:27,296 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Calibri' is not mapped to a physical font. 
    15:24:27,296 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Calibri is not mapped!
    15:24:27,299 INFO  [org.docx4j.XmlUtils] (default task-41) Using org.apache.xalan.transformer.TransformerImpl
    15:24:27,299 INFO  [org.docx4j.convert.out.common.AbstractConversionContext] (default task-41) /pkg:package
    15:24:27,303 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Times New Roman' is not mapped to a physical font. 
    15:24:27,307 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Times New Roman' is not mapped to a physical font. 
    15:24:27,310 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Times New Roman' is not mapped to a physical font. 
    15:24:27,313 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Times New Roman' is not mapped to a physical font. 
    15:24:27,315 INFO  [org.docx4j.fonts.RunFontSelector] (default task-41) rPrDefault/rFonts referenced Calibri
    15:24:27,315 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Calibri' is not mapped to a physical font. 
    15:24:27,317 WARN  [org.docx4j.fonts.fop.util.FopConfigUtil] (default task-41) Document font Calibri is not mapped to a physical font!
    15:24:27,317 WARN  [org.docx4j.fonts.fop.util.FopConfigUtil] (default task-41) Document font Times New Roman is not mapped to a physical font!
    15:24:27,322 WARN  [org.apache.fop.apps.FOUserAgent] (default task-41) Font "Calibri,normal,400" not found. Substituting with "any,normal,400".
    15:24:27,327 WARN  [org.apache.fop.apps.FOUserAgent] (default task-41) The contents of fo:region-body on page 4 exceed its viewport by 42211 millipoints. (See position 1:449)
    15:24:27,327 WARN  [org.apache.fop.apps.FOUserAgent] (default task-41) The contents of fo:region-body on page 3 exceed its viewport by 42211 millipoints. (See position 1:449)
    15:24:27,327 WARN  [org.apache.fop.apps.FOUserAgent] (default task-41) The contents of fo:region-body on page 2 exceed its viewport by 42211 millipoints. (See position 1:449)
    15:24:27,327 WARN  [org.apache.fop.apps.FOUserAgent] (default task-41) The contents of fo:region-body on page 1 exceed its viewport by 42211 millipoints. (See position 1:449)
    15:24:27,331 INFO  [org.docx4j.org.apache.xml.serializer.ToXMLStream] (default task-41) Using repackaged ToXMLStream
    15:24:27,331 INFO  [org.docx4j.org.apache.xml.serializer.ToXMLStream] (default task-41) Using repackaged ToXMLStream
    15:24:27,340 INFO  [org.docx4j.model.images.AbstractConversionImageHandler] (default task-41) Wrote @src='file:/tmp/6ccc1fe4-53c9-4661-b078-78c79a9a95d8image1.jpeg
    15:24:27,350 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Times New Roman' is not mapped to a physical font. 
    15:24:27,481 INFO  [org.docx4j.fonts.RunFontSelector] (default task-41) rPrDefault/rFonts referenced Calibri
    15:24:27,481 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Calibri' is not mapped to a physical font. 
    15:24:27,489 WARN  [org.docx4j.fonts.fop.util.FopConfigUtil] (default task-41) Document font Calibri is not mapped to a physical font!
    15:24:27,489 WARN  [org.docx4j.fonts.fop.util.FopConfigUtil] (default task-41) Document font Times New Roman is not mapped to a physical font!
    15:24:27,509 WARN  [org.apache.fop.apps.FOUserAgent] (default task-41) Font "Symbol,normal,700" not found. Substituting with "Symbol,normal,400".
    15:24:27,509 WARN  [org.apache.fop.apps.FOUserAgent] (default task-41) Font "ZapfDingbats,normal,700" not found. Substituting with "ZapfDingbats,normal,400".
    15:24:27,510 WARN  [org.apache.fop.apps.FOUserAgent] (default task-41) Font "Arial,normal,700" not found. Substituting with "Arial,normal,400".
    15:24:27,521 WARN  [org.apache.fop.apps.FOUserAgent] (default task-41) Font "Calibri,normal,400" not found. Substituting with "any,normal,400".
    15:24:27,535 WARN  [org.apache.fop.apps.FOUserAgent] (default task-41) The contents of fo:inline line 1 exceed the available area in the inline-progression direction by 23379 millipoints. (See position 3:11147)
    15:24:27,561 INFO  [org.apache.fop.apps.FOUserAgent] (default task-41) Rendered page #1.
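
The WARN lines above repeat the same few font names many times. A minimal sketch, using only the standard library, that collects the distinct names from lines containing `is not mapped to a physical font`. The class name `UnmappedFonts` and the method `find` are hypothetical; the sample lines are copied from the log above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnmappedFonts {

    // Matches the RunFontSelector warning and captures the quoted font name.
    private static final Pattern UNMAPPED =
            Pattern.compile("Font '([^']+)' is not mapped to a physical font");

    /** Returns the distinct unmapped font names, in order of first appearance. */
    static Set<String> find(Iterable<String> logLines) {
        Set<String> fonts = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher matcher = UNMAPPED.matcher(line);
            if (matcher.find()) {
                fonts.add(matcher.group(1));
            }
        }
        return fonts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "15:24:27,227 INFO  [org.docx4j.fonts.RunFontSelector] (default task-41) rPrDefault/rFonts referenced Calibri",
                "15:24:27,227 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Calibri' is not mapped to a physical font. ",
                "15:24:27,303 WARN  [org.docx4j.fonts.RunFontSelector] (default task-41) Font 'Times New Roman' is not mapped to a physical font. ");
        System.out.println(find(sample)); // prints [Calibri, Times New Roman]
    }
}
```

Run against the full log, this reduces the repeated warnings to the two fonts that appear in them, Calibri and Times New Roman.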

The process files are available here

The PDF file is the converted result and the DOCX file is the original.

If anyone can help me with this, I’d be grateful!
