How to manipulate docx files using python?


I created two .docx files, but I can't write to them from Python, although I can do it with .txt files. Is it possible to create a .docx file and write to it from a Python script?

  • Yes, as long as you follow the format that docx defines. Text files are simple because they hold raw text. A docx is probably a text file in XML with a structure, not such a simple one, that must be followed for readers to understand it. If you follow that format, it will work normally.

  • @Andersoncarloswoss it is a zip whose content is described by XML. Don't forget that several files can be embedded in it, hence the need for a container format such as zip.

1 answer

The point is that documents such as .docx, the old .doc, or LibreOffice's .odt are orders of magnitude more complex than a plain .txt file. Although it is easy to understand how they work (.docx and .odt, not .doc), and Quesado's comment on the question is correct, they are complex files to handle. Both are containers of type .zip with one or more .xml files inside. (Both formats can also be produced as a single .xml file, without the .zip, but that form is not widely used; even LibreOffice took a while to support non-zipped .odt.)

In particular, no programming language is obliged to ship tools for handling these file types: that support comes from third-party modules. Fortunately, Python has an ecosystem of modules that includes tools able to manipulate these files directly, and they are easy to install with the "pip" tool.

Note that the two technologies used by docx and odt are indirectly supported by the Python standard library: the zipfile module lets you extract, update and create zip files, and xml.etree lets you read and write XML files, treating the tags as nodes. The problem is that the structure, and what goes inside the XML in each case, is complex, whether you manage to write to it or not. Just writing a program that can properly open the .zip container of an odt or docx, parse the internal XML, update it and create a new version of the docx or odt would take an experienced and motivated programmer most of a day.
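To give an idea of what "doing it by hand" with the standard library looks like, here is a minimal sketch that writes a bare-bones docx with zipfile. The function name write_minimal_docx is mine, and the three parts below are, as far as I know, the smallest set a word processor will accept; treat that as an assumption to verify with your own reader, not as a full implementation of the format. There are no styles, tables or images here.

```python
import zipfile
from xml.sax.saxutils import escape

# Namespace of the WordprocessingML main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Declares the content type of each part inside the package.
CONTENT_TYPES = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

# Tells the reader which part is the main document.
RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""


def write_minimal_docx(path, paragraphs):
    """Write a docx holding one unstyled paragraph per string (hypothetical helper)."""
    # Each paragraph is <w:p>, holding a run <w:r>, holding the text <w:t>.
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">{}</w:t></w:r></w:p>'.format(escape(text))
        for text in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="{}"><w:body>{}</w:body></w:document>'
    ).format(W_NS, body)

    # The docx itself is just a zip archive containing these XML parts.
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", RELS)
        archive.writestr("word/document.xml", document)
```

Calling write_minimal_docx("hello.docx", ["First paragraph", "Second paragraph"]) should produce a file with two plain paragraphs. Anything beyond that (formatting, lists, headers) means learning a lot more of the XML structure, which is exactly the work the dedicated modules save you.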

The modules for updating and creating these files already do that dirty work, but updating the contents of such a file remains complicated: you have to know how to create the Python objects that represent the kind of data you want to insert (paragraph, table, etc.) and how to insert them at the right place, and that only after learning to use the library to read the contents of the file.

The "python-docx" module will probably let you do whatever you want, but I suggest taking a good look at its documentation before trying to use it.

Alternatively, if you need to create text with different styles, colors, etc., it can be much simpler to create an HTML document with a suitable stylesheet and use either a command-line tool or Selenium driving a browser to read and render that HTML and generate PDF output (or the HTML with its stylesheet can be opened directly by Microsoft Word or LibreOffice itself). Documentation for HTML + CSS is much easier to find, it has thousands of times more examples, and as a bonus you can still use one of the many Python template tools to define the "skeleton" and general form of the documents your program needs to generate, and create only the "filling" with the relevant data in Python code. I would suggest jinja2, a widely used template package in Python web programming.
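A rough sketch of the template idea, assuming jinja2 is installed (pip install jinja2). The template text, the render_report function and the sample fields are all made up for illustration; the step that turns the HTML into a PDF is left out, since it depends on the tool you choose.

```python
from jinja2 import Environment

# The "skeleton" of the document: fixed HTML and CSS plus placeholders.
TEMPLATE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
  h1 { color: navy; }
  .total { font-weight: bold; }
</style>
</head>
<body>
<h1>{{ title }}</h1>
<ul>
{% for item in items %}<li>{{ item.name }}: {{ item.price }}</li>
{% endfor %}</ul>
<p class="total">Total: {{ total }}</p>
</body>
</html>
"""


def render_report(title, items):
    """Fill the skeleton with data and return the HTML as a string (illustrative)."""
    # autoescape=True makes characters such as < and & in the data safe in HTML.
    env = Environment(autoescape=True)
    template = env.from_string(TEMPLATE)
    total = sum(item["price"] for item in items)
    return template.render(title=title, items=items, total=total)
```

The Python side only supplies the "filling", for example render_report("Price list", [{"name": "Pen", "price": 2}]); the returned string can then be written to an .html file and opened in a browser or word processor.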

  • But if I have a docx document, how do I access the zip file that corresponds to it? I had heard that docx is based on XML, but I never knew how to access those XML files.

  • On Windows, the system relies on the file extension to pick the application, so just rename your file from ".docx" to ".zip" and open it with the app you normally use for ZIP files. On Linux and Mac, just open the file in your zip app; you don't even need to rename it.
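The same inspection can be done from Python without renaming anything, since zipfile does not care about the extension. A minimal sketch using only the standard library; the function names are mine, and it assumes the main text lives in word/document.xml, which is the usual location in a docx:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace prefix used by paragraph (w:p) and text (w:t) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def list_parts(path):
    """Return the names of the files stored inside the docx container."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def read_paragraphs(path):
    """Return the plain text of each paragraph found in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t elements.
        paragraphs.append("".join(node.text or "" for node in para.iter(W_NS + "t")))
    return paragraphs
```

list_parts("some.docx") shows the XML files inside the container, and read_paragraphs("some.docx") pulls out the text only, ignoring all formatting.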
