How to search for a snippet in a remote PDF?

Is there a way to search for an excerpt or a word in a PDF that is hosted on the internet? I looked into cURL and a few libraries, but found nothing. What I want is roughly this:

I have a website where the user enters a name, for example John. My script would then check whether the name John appears in the file http://www.bu.ufsc.br/ArtigoCientifico.pdf and tell me whether it does or not.

Is there a way to do that? Does anyone know a library, or can anyone point me in the right direction?

1 answer

The PdfParser library (http://www.pdfparser.org/) lets you extract text from PDF files.

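require 'vendor/autoload.php'; // assumes PdfParser was installed with Composer

$url = 'http://www.bu.ufsc.br/ArtigoCientifico.pdf';
$nome = 'João';

// Download the PDF; file_get_contents returns false on failure
$content = file_get_contents($url);
if ($content === false) {
    die('Could not download the PDF.');
}

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseContent($content);
$text = $pdf->getText();

// strpos is case-sensitive and returns false when there is no match
if (strpos($text, $nome) !== false) {
    // name found
}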
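
The strpos check above is case-sensitive, and extracted PDF text can contain line breaks in the middle of a name. A minimal sketch of a more tolerant comparison follows, assuming the mbstring extension is available and the text is UTF-8. The function name textContainsName is hypothetical and is not part of PdfParser.

```php
<?php
// Hypothetical helper (not part of PdfParser): case-insensitive,
// whitespace-tolerant search in text that has already been extracted.
function textContainsName(string $text, string $name): bool
{
    // Collapse runs of whitespace so a name split across lines still matches
    $normalize = fn(string $s): string => trim(preg_replace('/\s+/u', ' ', $s) ?? $s);

    $haystack = $normalize($text);
    $needle = $normalize($name);

    // An empty search term never counts as a match
    if ($needle === '') {
        return false;
    }

    // mb_stripos ignores case and handles multibyte characters such as "ã"
    return mb_stripos($haystack, $needle, 0, 'UTF-8') !== false;
}
```

It could replace the strpos condition, for example: if (textContainsName($text, $nome)) { ... }.
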
  • That’s right, thanks man!

  • Just one more question: suppose the PDF link on the page is not a direct one, for example www.bu.ufsc.br/downloadArtigo.php?id=19898454, and the PDF is only downloaded when that link is clicked. Does file_get_contents work the same way in that case?

  • file_get_contents will work the same way in this case, because it fetches whatever the URL returns, regardless of the extension.

  • Hmmm, got it, thank you very much.

  • Now another question (if you have an answer, I can post it as a separate question for you to answer). I noticed that the site generates the PDF download ID automatically, and I would rather not enter it into the system every time. Is there a way to call file_get_contents with the link only up to id= and still get the download, some kind of workaround?

  • If the id identifies which document to download, you need to know the id. However, if there is a page that contains a link to the article, and that page is not generated dynamically by JavaScript, you can fetch the page with file_get_contents and find the link with preg_match or preg_match_all. For a very complex page, you can use https://code.google.com/p/phpquery/ instead.

  • Gee, that would be ideal, but the damn page uses JavaScript.

  • Since the page is built by JavaScript, I don’t know how to do it with PHP alone. I only know of http://casperjs.org/ used together with http://phantomjs.org/. They are not PHP tools, but you can write scripts for them and run those from PHP with functions like exec, provided your hosting allows you to install and run programs. PhantomJS is a standalone program without a graphical interface that loads pages and executes JavaScript, and CasperJS is a front end for PhantomJS that makes common tasks easier.

  • I’ll take a look. Thanks!
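
For the case discussed in the comments where the listing page is static HTML, the preg_match_all approach could look roughly like the sketch below. It assumes the links have the form downloadArtigo.php?id=<digits>, as in the example URL from the comments. The function names findDownloadLinks and looksLikePdf are hypothetical. The second helper relies on the fact that a PDF file normally begins with the signature %PDF-, which is a cheap way to confirm that a download script really returned a PDF and not an HTML error page.

```php
<?php
// Hypothetical helper: collect links of the form
// downloadArtigo.php?id=<digits> from static (non-JavaScript) HTML.
function findDownloadLinks(string $html): array
{
    // Capture the href value; accepts single or double quotes
    $pattern = '/href\s*=\s*["\']([^"\']*downloadArtigo\.php\?id=\d+)["\']/i';
    preg_match_all($pattern, $html, $matches);

    // Decode entities such as &amp; and drop duplicate links
    $links = array_map('html_entity_decode', $matches[1]);
    return array_values(array_unique($links));
}

// Hypothetical helper: check for the "%PDF-" signature at the start
// of the downloaded bytes, whatever extension the URL had.
function looksLikePdf(string $bytes): bool
{
    return strncmp($bytes, '%PDF-', 5) === 0;
}
```

A possible flow is to call file_get_contents on the listing page, pass the HTML to findDownloadLinks, download each link found, and hand the content to the PDF parser only when looksLikePdf returns true.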
