How to use the JavaScript pdf.js library in Selenium with Java via the JavascriptExecutor class

I found a library that does exactly what I need: it extracts the text from a PDF and returns it as a String. http://git.macropus.org/2011/11/pdftotext/example/ https://github.com/hubgit/hubgit.github.com/tree/master/2011/11/pdftotext

From what I researched (a lot), the repository below appears to be the latest version of pdf.js. However, after opening the PDF file in the browser, I cannot get the library loaded and then use its methods to copy the text. https://github.com/mozilla/pdf.js

I researched this for two days in a row. I'm not very knowledgeable about JavaScript, but I found this approach, https://stackoverflow.com/questions/1554280/extract-text-from-pdf-in-javascript, which seems ideal to implement. However, I was unable to adapt it to Selenium's JavascriptExecutor.

Here is my attempt, which tries to call the library the same way the index page of the first example does: http://git.macropus.org/2011/11/pdftotext/example/

    driver.get("file:///C:/Users/user/Desktop/arquivo.pdf");

    JavascriptExecutor jse = (JavascriptExecutor) driver;

    String script1 = "id=\"pdf-js\"";
    String script2 = "src=\"projeto/src/test/resources/js/pdf.js\"";
    String script3 = "PDFJS.workerSrc = cslight/src/test/resources/js/pdf.js";
    String script4 = "src=\"/projeto/src/test/resources/js/app.js\"";
    String script5 = "var app = new App;";

    jse.executeScript(script1);
    jse.executeScript(script2);
    jse.executeScript(script3);
    jse.executeScript(script4);
    jse.executeScript(script5);

It fails with this error:

Exception in thread "main" org.openqa.selenium.WebDriverException: unknown error: PDFJS is not defined

(Session info: chrome=65.0.3325.181)
(Driver info: chromedriver=2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7),platform=Windows NT 10.0.14393 x86_64) (WARNING: The server did not provide any stacktrace information)
Command duration or timeout: 0 milliseconds
Build info: version: '3.5.3', revision: 'a88d25fe6b', time: '2017-08-29T12:42:44.417Z'
System info: host: 'NC0048', ip: '10.13.30.196', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_161'
Driver info: org.openqa.selenium.chrome.ChromeDriver
Capabilities [{mobileEmulationEnabled=false, hasTouchScreen=false, platform=XP, acceptSslCerts=false, acceptInsecureCerts=false, webStorageEnabled=true, browserName=chrome, takesScreenshot=true, javascriptEnabled=true, platformName=XP, setWindowRect=true, unexpectedAlertBehaviour=, applicationCacheEnabled=false, rotatable=false, networkConnectionEnabled=false, chrome={chromedriverVersion=2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7), userDataDir=C:\Users\ICARO~1.PRA\AppData\Local\Temp\scoped_dir17892_11337}, takesHeapSnapshot=true, pageLoadStrategy=normal, unhandledPromptBehavior=, databaseEnabled=false, handlesAlerts=true, version=65.0.3325.181, browserConnectionEnabled=false, nativeEvents=true, locationContextEnabled=true, cssSelectorsEnabled=true}]
Session ID: 757fa21a22500f6618317bc12d5799ce
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.openqa.selenium.remote.ErrorHandler.createThrowable(ErrorHandler.java:215)
    at org.openqa.selenium.remote.ErrorHandler.throwIfResponseFailed(ErrorHandler.java:167)
    at org.openqa.selenium.remote.http.JsonHttpResponseCodec.reconstructValue(JsonHttpResponseCodec.java:40)
    at org.openqa.selenium.remote.http.AbstractHttpResponseCodec.decode(AbstractHttpResponseCodec.java:82)
    at org.openqa.selenium.remote.http.AbstractHttpResponseCodec.decode(AbstractHttpResponseCodec.java:45)
    at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:164)
    at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:82)
    at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:646)
    at org.openqa.selenium.remote.RemoteWebDriver.executeScript(RemoteWebDriver.java:582)
    at br.com.conductor.test.Generictester.Ester(Generictester.java:40)
    at br.com.conductor.test.Generictester.main(Generictester.java:61)
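
A note on the error above, as far as I can tell: each string given to executeScript is run as JavaScript, and the strings in the attempt look like attributes copied from an HTML script tag. None of them loads pdf.js, so PDFJS is still undefined when script3 runs. Below is a minimal sketch of building a loader script in Java. The class name PdfJsLoader, the helper names and the http://localhost:8000/js/pdf.js address are assumptions for illustration. The sketch only builds the string; it does not show that Chrome's built-in PDF viewer page lets an injected script read the PDF.

```java
public class PdfJsLoader {

    /** Escapes a value so it can sit inside a single-quoted JavaScript string literal. */
    static String jsQuote(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /**
     * Builds JavaScript that appends a script tag for the given URL and, once it has
     * loaded (or failed), invokes the callback that Selenium's executeAsyncScript
     * passes as the last argument.
     */
    public static String buildLoaderScript(String scriptUrl) {
        return String.join("\n",
            "var done = arguments[arguments.length - 1];",
            "var tag = document.createElement('script');",
            "tag.src = " + jsQuote(scriptUrl) + ";",
            "tag.onload = function () { done(true); };",
            "tag.onerror = function () { done(false); };",
            "document.head.appendChild(tag);");
    }

    public static void main(String[] args) {
        // Hypothetical location of pdf.js; replace it with wherever the file is really served from.
        String script = buildLoaderScript("http://localhost:8000/js/pdf.js");
        System.out.println(script);

        // With Selenium on the classpath, the string would be used roughly like this
        // (a script timeout may need to be set first via driver.manage().timeouts()):
        // Object loaded = ((JavascriptExecutor) driver).executeAsyncScript(script);
    }
}
```

If the callback reports true, a later executeScript call such as "return typeof PDFJS !== 'undefined';" could confirm that the library is available before anything else uses it.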

1 answer

Here are two APIs that you can add to your Maven project to read PDFs:

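    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>itextpdf</artifactId>
        <version>5.5.13</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.9</version>
    </dependency>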

https://developers.itextpdf.com/examples/itext-action-second-edition/chapter-1

https://pdfbox.apache.org/2.0/examples.html

    package testcases;

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
    import org.apache.pdfbox.io.RandomAccessRead;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.junit.Test;

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;

    public class Pdftest {

        private final String pdfUrl = "http://files.isec.pt/DOCUMENTOS/SERVICOS/BIBLIO/teses/Tese_Mest_Marcio-Carvalho.pdf";
        private final String pdfPath = "/home/diamaral/Documentos/diamaral/test.pdf";

        // Reads the text of the first page of a remote PDF with iText.
        @Test
        public void lerConteudoPdfUsandoApiIText() throws IOException {
            PdfReader pdfReader = new PdfReader(pdfUrl);

            System.out.println("\n\n---------API ITEXT-----------------------------"
                    + PdfTextExtractor.getTextFromPage(pdfReader, 1));
            pdfReader.close();
        }

        // Reads the text of every page of a local PDF with PDFBox.
        @Test
        public void lerPdfUsandoApiPdfBox() throws IOException {
            RandomAccessRead doc = new RandomAccessBufferedFileInputStream(new File(pdfPath));
            PDFParser parser = new PDFParser(doc);
            parser.parse();
            PDDocument pdfDoc = parser.getPDDocument();
            PDFTextStripper stripper = new PDFTextStripper();
            System.out.println("\n\n---------API PDFBOX-----------------------------"
                    + stripper.getText(pdfDoc));
            pdfDoc.close();
        }

    }
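
The PDFBox test above reads from a local path, while the question starts from a URL opened with driver.get. One way to bridge the two, sketched with the JDK only: map a file: URL straight to its path and download anything else to a temporary file. PdfFetcher and toLocalFile are hypothetical names, and a plain download like this does not carry the browser's cookies or login session.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class PdfFetcher {

    /**
     * Returns a local file holding the PDF found at the given URL.
     * A file: URL is mapped directly to its path; any other URL is downloaded to a temp file.
     */
    public static Path toLocalFile(String url) throws IOException {
        URI uri = URI.create(url);
        if ("file".equalsIgnoreCase(uri.getScheme())) {
            return Paths.get(uri);
        }
        Path target = Files.createTempFile("downloaded-", ".pdf");
        try (InputStream in = uri.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in file so the sketch runs without a real PDF or a network connection.
        Path sample = Files.createTempFile("sample-", ".pdf");
        Files.write(sample, "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII));

        Path local = toLocalFile(sample.toUri().toString());
        System.out.println(local + " exists: " + Files.exists(local));
    }
}
```

In a Selenium test the URL could come from driver.getCurrentUrl(), and the returned path's toFile() could take the place of new File(pdfPath) in the PDFBox test.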

  • Sensational! Solved my problem in minutes. Thank you very much!
