Wrong formatting when opening a LaTeX-generated PDF

Question

Asked 10 years, 2 months ago

In my program I need to open a PDF file and extract the text it contains. But when I open the PDF, the text comes out badly formatted. For example:
Please to `my fam ? read by ? measure efforts
when the correct text would be:
I thank my family for not measuring efforts

This only happens when the PDF is generated by LaTeX. When it is generated by Word, the text is fine.
The code I'm using to open the PDF is:

int i = 1; // n is the number of pages
PdfReader reader = new PdfReader(diretorio);
while(i<=n){
   conteudo+=PdfTextExtractor.getTextFromPage(reader, i);
   i++;
}

I know it has to do with encoding, but I don't know how to solve it or what to do.
Keep in mind that the PDFs will not be generated by me.

If you copy and paste the text by hand, with the mouse, how does it look? Could you paste it here for us to see?

– Math

2015/05/01 at 15:13

@Math Actually, the example I quoted was already copied and pasted.

– Pacíficão

2015/05/04 at 17:53

2 answers

by JJoao • **5,113** points · Answer 1 · 2015-04-29T11:18:46+00:00

Solution 1: Avoid the problem :)

Edit the LaTeX source and add:

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}    %% <<<<<< this line
\begin{document}
...

The generated PDF no longer has the problem!

Solution 2:

Write a postprocessor for the mangled text coming from the PDF and restore the accents by successive replacements -- bad idea...

Update:

If solution 1 is not an option: I don't know how to do this properly. Normally I use tools like pdftotext, which, applied to a bad PDF coming from LaTeX (MPVL), gives output like the following:

pdftotext mpvl.pdf

Jo˜ao 
Resumo
fam´ılia esfor¸co

and after piping it through | fix-mpvl

João
Resumo
família esforço

In my case, fix-mpvl does many things, among them:

#!/usr/bin/perl 
use utf8::all;

while(<>){
  s/eˆ/ê/g; s/ˆe/ê/g;
  s/aˆ/â/g; s/ˆa/â/g;
  s/oˆ/ô/g; s/ˆo/ô/g;
  s/e´/é/g; s/´e/é/g;
  s/a´/á/g; s/´a/á/g;
  s/o´/ó/g; s/´o/ó/g;
  s/u´/ú/g; s/´u/ú/g;
  s/a˜/ã/g; s/˜a/ã/g;
  s/o˜/õ/g; s/˜o/õ/g;
  s/n˜/ñ/g; s/˜n/ñ/g;
  s/ı´/í/g; s/´ı/í/g;
  s/c¸/ç/g; s/¸c/ç/g;
  print $_;
}
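
Since the question is about Java, the same idea could be sketched there. This is only a rough port of the substitutions above, not the full fix-mpvl. The class name AccentFixer is made up for illustration, and the code points used for the spacing accents (U+02C6, U+00B4, U+02DC, U+00B8) and the dotless i (U+0131) are an assumption about what the extractor emits; check them against your own output.

```java
public class AccentFixer {
    // Each row: base letter, spacing accent, composed character.
    // Order mirrors the Perl script. The accent code points are assumed,
    // not taken from any library documentation.
    private static final String[][] RULES = {
        {"e", "\u02C6", "\u00EA"}, // e circumflex
        {"a", "\u02C6", "\u00E2"}, // a circumflex
        {"o", "\u02C6", "\u00F4"}, // o circumflex
        {"e", "\u00B4", "\u00E9"}, // e acute
        {"a", "\u00B4", "\u00E1"}, // a acute
        {"o", "\u00B4", "\u00F3"}, // o acute
        {"u", "\u00B4", "\u00FA"}, // u acute
        {"a", "\u02DC", "\u00E3"}, // a tilde
        {"o", "\u02DC", "\u00F5"}, // o tilde
        {"n", "\u02DC", "\u00F1"}, // n tilde
        {"\u0131", "\u00B4", "\u00ED"}, // dotless i + acute becomes i acute
        {"c", "\u00B8", "\u00E7"}, // c cedilla
    };

    // Applies every rule in both orders (letter+accent, then accent+letter),
    // exactly like the pairs of s/// substitutions in the Perl version.
    public static String fix(String text) {
        String result = text;
        for (String[] rule : RULES) {
            String letter = rule[0];
            String accent = rule[1];
            String composed = rule[2];
            result = result.replace(letter + accent, composed)
                           .replace(accent + letter, composed);
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample taken from the pdftotext output shown above.
        String broken = "Jo\u02DCao fam\u00B4\u0131lia esfor\u00B8co";
        System.out.println(fix(broken));
    }
}
```

You would call AccentFixer.fix(...) on the text returned for each page. Like the Perl version, it is order-sensitive and can guess wrong when an accent sits between two letters that both have a rule, which is one more reason solution 2 is a bad idea when solution 1 is available.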

by GabrielOshiro • **256** points · Answer 2 · 2015-05-05T18:32:36+00:00

You can try another library, Apache PDFBox. The advantage is that its features are already available as a command-line jar, so you can first test whether it works on your files and, if it does, integrate the classes into your source code.

You can download it directly from the Maven repository.

Download the jar and run the following command on your PDF file:

java -jar pdfbox-app-x.y.z.jar ExtractText [OPTIONS] <inputfile> [Text file]

You will have to look up the parameters yourself, since I don't have the specifics of your project. Take a look at the -encoding option; the answer to your question may be there. By default LaTeX uses the OT1 font encoding, if I'm not mistaken.

If the command extracts the text correctly, you can then add the library as a dependency:

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>1.8.9</version>
</dependency>

Then follow an example that does text extraction in Java.