Using Python to Extract Equations, Figures and Other LaTeX File Items

I'm working on a personal project that involves making an article written in LaTeX as clean as possible before sending it out for translation.

To increase my productivity, avoid information leakage, and make things easier for the translator, who is not very familiar with this kind of markup, I decided to send the original text in a "cleaner" form, for example without the equations. Later I can extend the same idea to figures and other elements.

Problem: after extracting the desired items (equations, for example), two new files are saved: one containing only the equations and a clean one without them. I have already put together the code below, which works. The challenge is: (1) each extraction should leave a reference behind, so that the same equation can be put back in the same place once the translated text is received; (2) a second script is needed to restore the original equations.

Any suggestions for a strategy to address this challenge?

Here is the current working code, which does not yet handle items (1) and (2) above.

print('Start of script')

# (begin, end) markers of every environment to extract
lista = [('begin{equation}', 'end{equation}'),
         ('begin{equation*}', 'end{equation*}'),
         ('begin{eqnarray}', 'end{eqnarray}'),
         ('begin{eqnarray*}', 'end{eqnarray*}'),
         ('begin{align}', 'end{align}'),
         ('begin{align*}', 'end{align*}')]

with open('document.tex', 'r', encoding='utf-8') as infile:
    inOrig = infile.readlines()

# First pass: copy every equation block to its own file,
# one environment type at a time
with open('document_equacoes.tex', 'w', encoding='utf-8') as outfile_eq:
    for begin, end in lista:
        print('Examining ' + begin + ' and ' + end)
        extract_block = False
        for line in inOrig:
            if begin in line:
                extract_block = True
            if extract_block:
                outfile_eq.write(line)
            if end in line:
                extract_block = False
                outfile_eq.write("%------------------------------------------\n\n")

# Second pass, kept separate to make the logic easier to follow:
# write every line that lies outside an equation block to the clean file
with open('document_limpo.tex', 'w', encoding='utf-8') as outfile_tex:
    extract_block = False
    for line in inOrig:
        is_marker = False  # True when the line holds a begin or end marker
        for begin, end in lista:
            if begin in line:
                extract_block = True
                is_marker = True
            if end in line:
                extract_block = False
                is_marker = True
        if not (extract_block or is_marker):
            outfile_tex.write(line)

print('End of script')

The LaTeX document I used for testing is shown below. To match the code above, it must be saved as "document.tex".

\documentclass{article}
\usepackage[utf8]{inputenc} % Disponibiliza acentos.
\usepackage[english,brazil]{babel}
\usepackage{lipsum}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\title{Titulo do Artigo}
\author{Nome do Autor}
\begin{document}
\maketitle
\begin{abstract}
    \lipsum[1]
\end{abstract}
    \section{Primeira Seção}
    \lipsum[2-4]
\section[Exemplo de Fórmula]{Fórmula}
Neste trecho existe um exemplo de como aparece geralmente 
uma equação. A primeira equação é a de Báskara, conforme (\ref{eq:bask01})
\begin{equation}
    x = \frac{-b \pm \sqrt{b^2-4ac}}{2a}
    \label{eq:bask01}
\end{equation}
Outra forma de expressar as fórmulas que também precisam ser verificadas abaixo
\begin{eqnarray*}
    x =& a^b\\
    y =& h^{\pi.r}
\end{eqnarray*}
A seguinte é muito parecida com a de Báskara e pode ser visto em (\ref{eq:bask02}), no entanto não existe em literatura.
\begin{equation}
    x = \frac{-b/2 \pm \sqrt{b^2-4ac}}{2acb}
    \label{eq:bask02}
\end{equation}
Outra forma de expressar as fórmulas que também precisam ser verificada
\begin{eqnarray}
    x &=& a^b\\
    y &=& h^{\pi.r}
\end{eqnarray}
Finalmente outro método com $k_2$ tal como (\ref{eq:seqEq})
\begin{align}
    k_1&= s^2\\
    k_2&= k^2 \label{eq:seqEq}
\end{align}
Fim das descrições gerais
\end{document}

  • You have the string of the equation, right? Compute its MD5 hash (hashlib library) and build a dictionary where the key is the hash and the value is the text of the string. You can save this dictionary to disk as a JSON file (json library), and where the equation was, leave a LaTeX comment containing only the hash.

  • That seems like a good solution. I had also thought of using a tag. MD5 seems even better, because I don't have to "invent" a different tag for each element I want to extract. I'll have to test it. I'm not very familiar with JSON, but I believe it will be better because it is structured. Thanks for the tip.

  • The MD5 key will also help you spot identical equations, and a LaTeX comment tag, besides being easy to identify, does not break compilation of the document. Work with dictionaries and use json.load() to load the dictionary from disk and json.dump() to save it.
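
The approach suggested in the comments above could be sketched roughly as follows. This is only a minimal illustration, not tested code from the thread: the function names (extract_blocks, restore_blocks, save_store, load_store) and the placeholder format "% EXTRACTED <hash>" are made up for the example, and it assumes the translator leaves those comment lines untouched. It also uses a regular expression instead of the line-by-line loop from the question.

```python
import hashlib
import json
import re

# Environments to pull out; mirrors the list used in the question.
ENVIRONMENTS = ["equation", "equation*", "eqnarray", "eqnarray*", "align", "align*"]

# Matches a whole \begin{env} ... \end{env} block, across several lines.
_ENV_ALT = "|".join(re.escape(name) for name in ENVIRONMENTS)
BLOCK_RE = re.compile(r"\\begin\{(" + _ENV_ALT + r")\}.*?\\end\{\1\}", re.DOTALL)

# Hypothetical placeholder: a LaTeX comment that carries only the MD5 hash.
PLACEHOLDER_RE = re.compile(r"% EXTRACTED ([0-9a-f]{32})")


def extract_blocks(text):
    """Return (clean_text, store); store maps MD5 hash -> original block."""
    store = {}

    def replace(match):
        block = match.group(0)
        key = hashlib.md5(block.encode("utf-8")).hexdigest()
        store[key] = block  # identical blocks share one entry
        return "% EXTRACTED " + key

    return BLOCK_RE.sub(replace, text), store


def restore_blocks(clean_text, store):
    """Put every stored block back where its placeholder comment is."""
    # A function is used as the replacement so that backslashes in the
    # LaTeX source are inserted literally.
    return PLACEHOLDER_RE.sub(lambda m: store[m.group(1)], clean_text)


def save_store(store, path):
    """Write the hash -> block dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(store, fh, indent=2, ensure_ascii=False)


def load_store(path):
    """Read the hash -> block dictionary back from a JSON file."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)
```

In this sketch, the first script would read document.tex, call extract_blocks, write the clean text to document_limpo.tex and pass the dictionary to save_store. The second script would call load_store, read the translated file, and write the result of restore_blocks. A placeholder that the translator deleted or altered would not be restored, so it may be worth checking that every key in the dictionary still appears in the translated text.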

No answers
