How to convert a PDF file to TXT in Java?

Viewed 2,745 times

6

Is there a way in Java to convert a PDF file to a TXT file?

  • 1

    The content of a PDF can vary a lot, so there is no way to extract something exactly standardized; many PDF documents were generated from .doc files, for example. There should be a way, yes, but it won't be easy. This is just a hint of what lies ahead of you; I'll search and see if there is a library.

  • I'm already aware of this obstacle, but it would help if you could show an example of one way to do it. I'd be grateful, @Guilhermenascimento

1 answer

5


You can try the iText library, which has ready-made features for extracting text from PDF files. One way to do this would be:

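import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

// Extracts the text of every page of the PDF and writes it to the TXT file (iText 5 API).
public void parsePdf(String pdf, String txt) throws IOException {
    PdfReader reader = new PdfReader(pdf);
    // try-with-resources flushes and closes the writer even if extraction fails
    try (PrintWriter out = new PrintWriter(new FileOutputStream(txt))) {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        // PDF page numbers start at 1
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            TextExtractionStrategy strategy =
                    parser.processContent(i, new SimpleTextExtractionStrategy());
            out.println(strategy.getResultantText());
        }
    } finally {
        reader.close();
    }
}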

Here the pdf parameter is the path of the PDF file to extract the text from, and the txt parameter is the path of the target TXT file.
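If you just want the TXT file to sit next to the PDF with the same base name, the second argument can be derived from the first. The following is only a minimal sketch using the standard library; the class TxtTargetPath and the method txtPathFor are hypothetical names made up for illustration and are not part of iText. It assumes the argument is a path to a file (not a bare root directory).

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper (not part of iText): derives the target TXT path
// from the source PDF path, e.g. to build the second argument of parsePdf.
final class TxtTargetPath {

    static String txtPathFor(String pdf) {
        Path source = Paths.get(pdf);
        String name = source.getFileName().toString();
        // Drop a trailing ".pdf" (case-insensitive) if present
        String base = name.toLowerCase(Locale.ROOT).endsWith(".pdf")
                ? name.substring(0, name.length() - 4)
                : name;
        // Place the TXT file in the same directory as the PDF
        return source.resolveSibling(base + ".txt").toString();
    }

    public static void main(String[] args) {
        // prints report.txt
        System.out.println(txtPathFor("report.pdf"));
    }
}
```

With a helper like this, a call could look like parsePdf(pdf, TxtTargetPath.txtPathFor(pdf)). A name without a .pdf extension simply gets .txt appended.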

This snippet was taken from a ready-made example written by the iText developer. The example, as well as the resulting TXT, can be found at this link.

  • The PDFBox library can also help you.
