Good afternoon.
I wonder if anyone can help me. I need to extract data from a PDF file, and I need to read the file page by page. Any help would be appreciated, thank you.
public static void main(String[] args) {
    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    File file = new File("C:\\testes\\teste.pdf");
    try {
        PDFParser parser = new PDFParser(new FileInputStream(file)); // FileInputStream is flagged as an error here
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(5);
        String parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
The problem is that at "FileInputStream" the IDE keeps flagging an error and I do not know why. It asks me to add the java.io.FileInputStream import, but once I add it, it tells me to convert the FileInputStream to RandomAccessRead, and when I do that conversion, the conversion itself fails.
– R.Santos
@R.Santos I updated the answer. How about now?
– Victor Stafusa
It did return the PDF contents, thank you. But it printed a warning between each page: "Jun 23, 2016 5:54:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode WARNING: No Unicode mapping for .notdef (9) in font Times-Bold". Would you know what that is?
– R.Santos
One more question: the PDFs will have different numbers of pages, and I need the information from all of the pages. Can you help me with that? Thanks for your help so far.
– R.Santos
@R.Santos As for this warning, it appears because the library found a character in the font that it cannot map to a Unicode symbol. I do not know how to fix this, but unless something goes very wrong, it should be safe to ignore it for now. As for the number of pages, I updated my answer.
– Victor Stafusa
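If the warning is only noise, one option is to raise the log level rather than fix the font mapping. The sketch below assumes PDFBox's logging ends up in java.util.logging, which the "Jun 23, 2016 5:54:43 PM ... WARNING:" format of the message suggests; if another logging backend is on the classpath, this will have no effect. The class and method names (QuietPdfBoxLogging, silenceWarnings) are made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class QuietPdfBoxLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a level set on an otherwise unreferenced logger can be lost after GC.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    // Assumption: messages from org.apache.pdfbox.* are routed through
    // java.util.logging. Only SEVERE and above will pass after this call.
    static void silenceWarnings() {
        PDFBOX_LOGGER.setLevel(Level.SEVERE);
    }

    public static void main(String[] args) {
        silenceWarnings();
        // Child loggers without their own level inherit it from "org.apache.pdfbox".
        Logger fontLogger = Logger.getLogger("org.apache.pdfbox.pdmodel.font.PDSimpleFont");
        System.out.println("WARNING enabled: " + fontLogger.isLoggable(Level.WARNING));
        System.out.println("SEVERE enabled: " + fontLogger.isLoggable(Level.SEVERE));
    }
}
```

Calling silenceWarnings() once, before the PDF is opened, should be enough. Note that this hides every PDFBox warning, not just the .notdef one, so it may be worth leaving warnings on while debugging.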
Perfect, thanks a lot for the help.
– R.Santos
I’m continuing with this same code, because I need to pull certain pieces of information out of these PDF pages, and for that I’m using Matcher and Pattern. Could you tell me how to read each PDF page line by line? Thanks.
– R.Santos
@R.Santos No, I wouldn’t know. At least not yet. I recommend that you post a new question about it, and link to it here.
– Victor Stafusa
Thank you. If you figure anything out, please post it here.
– R.Santos
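On the line-by-line follow-up: since pdfStripper.getText(pdDoc) returns a plain String, one possible approach is to split that text on line breaks and run the Pattern against each line. The sketch below uses only the JDK, so the page text is a hard-coded stand-in for what getText would return for a single page. The class name, the sample text, and the "label: value" pattern are all hypothetical; the real pattern depends on how the PDFs are laid out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageLineMatcher {
    // Hypothetical pattern: a label, a colon, then a value, e.g. "Name: Maria Silva".
    private static final Pattern FIELD =
            Pattern.compile("\\s*(\\w[\\w ]*?)\\s*:\\s*(.+?)\\s*");

    // Splits the page text into lines and returns {label, value} for each matching line.
    static List<String[]> extractFields(String pageText) {
        List<String[]> fields = new ArrayList<>();
        // \R matches any line-break sequence (\n, \r\n, ...), available since Java 8.
        for (String line : pageText.split("\\R")) {
            Matcher m = FIELD.matcher(line);
            if (m.matches()) {
                fields.add(new String[] { m.group(1), m.group(2) });
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Stand-in for the text of one page.
        String pageText = "Report\nName: Maria Silva\nTotal: 150.00\n";
        for (String[] field : extractFields(pageText)) {
            System.out.println(field[0] + " -> " + field[1]);
        }
    }
}
```

To process one page at a time, the same method could be called on the text obtained with setStartPage and setEndPage both set to the same page number. Lines in the extracted text follow the order the stripper emits them in, which may not always match the visual layout of the page.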