Wrong formatting when extracting text from a LaTeX-generated PDF

4

In my program I need to open a PDF file and extract the text it contains. But when I extract it, the text comes out badly formatted. For example:
Please to `my fam ? read by ? measure efforts
When the right thing would be:
I thank my family for not measuring efforts

This only occurs when the PDF is generated by LaTeX. When it is generated by Word, the text is normal.
The code I’m using to open the PDF is:

int i = 1; // n is the number of pages
PdfReader reader = new PdfReader(diretorio);
while(i<=n){
   conteudo+=PdfTextExtractor.getTextFromPage(reader, i);
   i++;
}

I know it has to do with encoding, but I don’t know how to solve it or what to do!
Keep in mind that the PDFs will not be generated by me.

  • 1

    If you copy and paste the text directly with the mouse, how does it look? Could you paste it here for us to see?

  • @Math Actually, the example I quoted was already copied and pasted.

2 answers

3

Solution 1: Avoid the problem :)

Edit the LaTeX source and add:

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}    %% <<<<<< this line
\begin{document}
...

The generated PDF no longer causes problems!

Solution 2:

Write a postprocessor for the mangled text extracted from the PDF and restore the accents by successive replacements. Bad idea...

Update:

If solution 1 is not an option: I don’t know a decent way to do it. I normally use tools like pdftotext, which, applied to a bad PDF coming from LaTeX (MPVL), produce output like the following for pdftotext mpvl.pdf:

Jo˜ao 
Resumo
fam´ılia esfor¸co

and after piping the output through | fix-mpvl:

João
Resumo
família esforço

In my case, fix-mpvl does many things, among them:

#!/usr/bin/perl 
use utf8::all;

while(<>){
  s/eˆ/ê/g; s/ˆe/ê/g;
  s/aˆ/â/g; s/ˆa/â/g;
  s/oˆ/ô/g; s/ˆo/ô/g;
  s/e´/é/g; s/´e/é/g;
  s/a´/á/g; s/´a/á/g;
  s/o´/ó/g; s/´o/ó/g;
  s/u´/ú/g; s/´u/ú/g;
  s/a˜/ã/g; s/˜a/ã/g;
  s/o˜/õ/g; s/˜o/õ/g;
  s/n˜/ñ/g; s/˜n/ñ/g;
  s/ı´/í/g; s/´ı/í/g;
  s/c¸/ç/g; s/¸c/ç/g;
  print $_;
}
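
Since the question uses Java, the same replacement idea can be sketched there as well. This is only a minimal sketch, not a complete fix: the class name AccentFixer and the method fix are hypothetical, and it assumes the extracted text contains the spacing accents ´ (U+00B4), ˆ (U+02C6), ˜ (U+02DC) and ¸ (U+00B8) placed before the letter, as in the pdftotext output above. Instead of listing every pair, it moves the accent after the letter as a Unicode combining mark and lets java.text.Normalizer compose the final character.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccentFixer {
    // Spacing accents found in the extracted text, mapped to combining marks.
    // Assumption: these four are the only accents that need repair.
    private static final Map<Character, Character> COMBINING = Map.of(
        '\u00B4', '\u0301',  // acute accent
        '\u02C6', '\u0302',  // circumflex
        '\u02DC', '\u0303',  // tilde
        '\u00B8', '\u0327'   // cedilla
    );

    // A spacing accent immediately followed by a letter.
    private static final Pattern ACCENT_THEN_LETTER =
        Pattern.compile("([\u00B4\u02C6\u02DC\u00B8])(\\p{L})");

    public static String fix(String text) {
        Matcher m = ACCENT_THEN_LETTER.matcher(text);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            char mark = COMBINING.get(m.group(1).charAt(0));
            // LaTeX uses a dotless i (U+0131) under accents; restore a plain i.
            String letter = m.group(2).replace('\u0131', 'i');
            // Emit letter + combining mark in place of accent + letter.
            m.appendReplacement(sb, Matcher.quoteReplacement(letter + mark));
        }
        m.appendTail(sb);
        // NFC composes letter + combining mark into one precomposed character.
        return Normalizer.normalize(sb.toString(), Normalizer.Form.NFC);
    }
}
```

Calling AccentFixer.fix(conteudo) on the extracted string would turn "fam´ılia esfor¸co" into "família esforço". Unlike the Perl script, this sketch does not handle an accent that comes after its letter, and it carries the same risk as any postprocessing: it can only repair the patterns it knows about.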
  • Different Pdfs will be used and I will not be the one creating them.

  • 1

    If you open the PDF generated by LaTeX in Acrobat Reader, is the text already wrong? If it is already wrong and you don’t generate the document, there’s not much you can do. As João already said, postprocessing is very difficult and a bad idea (because you cannot know what kinds of errors may show up, and you take on a huge responsibility that is not yours).

  • @Luizvieira No, the error only appears after I extract the text from the PDF. When I open the PDF in Acrobat it looks normal. I believe the problem is in the encoding of the PDF generated by LaTeX.

  • 1

    Ah, okay, that’s important information that wasn’t very clear in the question. :)

  • 3

    The problem is that, without fontenc, accented characters are built by overlaying two or more glyphs. (In LaTeX you can ask for an "a" with two cedillas and three accents: it looks great, but it is actually several separate elements.)

  • 2

    @Jjoao In my opinion, your last comment is worth being included in the answer. It’s very important (and interesting - I didn’t know). If you have any official reference to such overlap, even better! P.S.: I like the acronym "MPVL". :)

  • @Jjoao Where should I put this code in order to test it?

1

You can try another library, Apache PDFBox. The advantage is that its features are already available in a runnable jar, so you can first test whether it works on your files. If it does, you can integrate the classes into your source code.

You can download directly from the Maven repository here

Download the jar and run the following command on your PDF file:

java -jar pdfbox-app-x.y.z.jar ExtractText [OPTIONS] <inputfile> [Text file]

You will have to look up the parameters here yourself, since I don’t know the specifics of your project. Take a look at the -encoding option; the answer to your question may be there. By default LaTeX uses OT1, if I’m not mistaken.

If you can extract the text correctly with the command, then you can add the library as a dependency:

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>1.8.9</version>
</dependency>

And use an example that does text extraction in Java.
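
As a rough illustration of what such an example could look like with the 1.8.9 dependency above, here is a minimal sketch. The class name ExtractWithPdfBox and the method extract are hypothetical. Note that in the 1.8.x line PDFTextStripper lives in org.apache.pdfbox.util (later major versions moved it to org.apache.pdfbox.text). Whether the result is any better than the iText output for LaTeX-generated files is something you would have to check on your own PDFs.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class ExtractWithPdfBox {

    // Returns the text of every page of the PDF at the given path.
    public static String extract(String path) throws IOException {
        PDDocument document = PDDocument.load(new File(path));
        try {
            // With default settings the stripper processes all pages in order.
            return new PDFTextStripper().getText(document);
        } finally {
            // Always release the file, even if extraction fails.
            document.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: pass the PDF path as the first argument.
        System.out.println(extract(args[0]));
    }
}
```

This replaces the page-by-page loop from the question, because getText walks the whole document itself.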

  • I don’t understand what you mean by running the command on the PDF file. Where do I run this command?

  • @Peace in the terminal!
