How to read page by page from a PDF with Pdfbox

Good afternoon.

I need to extract data from a PDF file, reading it one page at a time. Can anyone help me? Thank you.

public static void main(String args[]) {
    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    File file = new File("C\\testes\\teste.pdf");
    try {
        PDFParser parser = new PDFParser(new FileInputStream(file)); // FileInputStream is flagged as an error here
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(5);
        String parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

1 answer

Does it work like this?

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class SuaClasse {
    public static void main(String args[]) {
        File file = new File("C\\testes\\teste.pdf");
        try {
            PDFParser parser = new PDFParser(new RandomAccessBufferedFileInputStream(file));
            parser.parse();
            COSDocument cosDoc = parser.getDocument();
            PDFTextStripper pdfStripper = new PDFTextStripper();
            // try-with-resources closes the document once every page has been read.
            try (PDDocument pdDoc = new PDDocument(cosDoc)) {
                // Page numbers are 1-based; restrict the stripper to one page per pass.
                for (int i = 1; i <= pdDoc.getNumberOfPages(); i++) {
                    pdfStripper.setStartPage(i);
                    pdfStripper.setEndPage(i);
                    String parsedText = pdfStripper.getText(pdDoc);
                    System.out.println("Page " + i + ": " + parsedText);
                }
            }
        } catch (IOException e) {
            // Handle the exception appropriately.
            e.printStackTrace();
        }
    }
}
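
While the loop above runs, PDFBox may log font warnings such as "No Unicode mapping for .notdef" between pages. The timestamped "WARNING:" format suggests the messages go through java.util.logging, so one possible way to hide them is to raise the level of the `org.apache.pdfbox` logger before extracting. This is only a sketch under that assumption; the class name `QuietPdfBoxLogging` and the method `silenceWarnings` are made up for illustration, and a project that routes logging through another backend would need that backend's configuration instead.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietPdfBoxLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a level set on an unreferenced logger can be lost after GC.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    /** Raises the threshold so WARNING messages from PDFBox classes are dropped. */
    public static void silenceWarnings() {
        PDFBOX_LOGGER.setLevel(Level.SEVERE);
    }

    public static void main(String[] args) {
        silenceWarnings();
        // Child loggers inherit the level of "org.apache.pdfbox".
        Logger fontLogger = Logger.getLogger("org.apache.pdfbox.pdmodel.font.PDSimpleFont");
        System.out.println(fontLogger.isLoggable(Level.WARNING)); // prints "false"
        System.out.println(fontLogger.isLoggable(Level.SEVERE));  // prints "true"
    }
}
```

Calling `QuietPdfBoxLogging.silenceWarnings()` at the start of `main` would leave errors visible while dropping the warnings. Hiding them does not repair the unmapped characters, which may still be missing from the extracted text.
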
  • The problem is that `FileInputStream` keeps being flagged as an error, and I don't know why. The IDE asks me to add the java.io.FileInputStream import, but once I add it, it tells me to convert FileInputStream to RandomAccessRead, and that conversion fails as well.

  • @R.Santos I updated the answer. Does it work now?

  • It extracted the PDF contents, thank you. But it printed a warning between pages: "Jun 23, 2016 5:54:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode WARNING: No Unicode mapping for .notdef (9) in font Times-Bold". Do you know what that means?

  • One more question: the PDFs will have different numbers of pages, and I need the information from all of the pages. Can you help me with that? Thanks for your help so far.

  • @R.Santos The warning appears because PDFBox found a symbol in the font that it cannot map to Unicode. I don't know how to fix it, but unless something goes badly wrong it should be safe to ignore for now. As for the number of pages, I updated my answer.

  • Perfect, thanks again for the help.

  • I'm continuing with this same code because I need to pull specific information out of these PDF pages, and for that I'm using Matcher and Pattern. Could you tell me how to read each PDF page line by line? Thanks.

  • @R.Santos No, I wouldn’t know. At least not yet. I recommend that you post a new question about it, and link to it here.

  • Thank you. If you find anything, please post it here.
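
On the line-by-line question in the comments: the text returned for one page is an ordinary `String`, so one possible approach is to split it on line breaks and run the `Matcher` over each line. The sketch below uses only the Java standard library. The class `PageLineMatcher`, the method `extract`, the sample text, and the `Invoice:` pattern are all invented for illustration; real page text depends on the PDF, and the pattern passed in is assumed to have at least one capture group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageLineMatcher {
    /** Returns capture group 1 of every line of pageText that the pattern matches. */
    public static List<String> extract(String pageText, Pattern pattern) {
        List<String> found = new ArrayList<>();
        // \R matches any line break (\n, \r\n or \r), whichever the text contains.
        for (String line : pageText.split("\\R")) {
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                found.add(m.group(1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Stand-in for one page's extracted text; not taken from a real PDF.
        String pageText = "Invoice: 1234\nCustomer: Ana\r\nInvoice: 5678\n";
        Pattern invoice = Pattern.compile("^Invoice:\\s*(\\d+)");
        System.out.println(extract(pageText, invoice)); // prints [1234, 5678]
    }
}
```

Inside the page loop, the same method could be called as `extract(parsedText, invoice)` for each page. Extracted lines do not always follow the visual layout of the PDF, so patterns may need adjusting per document.
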
