Good afternoon. I’m developing a routine for searching text in PDF files. My idea is to distribute the search across threads, one per file, to reduce the response time. The SearchPDFText method below searches and returns the files correctly, but when I compare a run that processes the files in separate threads with a run that processes them one by one on the application’s main thread, the average response time is the same.
int ctFilesToSearch;
int ctFilesSearching;
/// <summary>
/// Files where the search text was found.
/// </summary>
List<FileInfo> lFilesGood;
Queue<FileInfo> qFilesToSearch;

public SearchEngine()
{
    lFilesGood = new List<FileInfo>();
}

public IEnumerable<FileInfo> SearchPDFText(IEnumerable<FileInfo> lFiles, string searchText)
{
    try
    {
        qFilesToSearch = new Queue<FileInfo>(lFiles);
        int totalFiles = lFiles.Count();
        ctFilesToSearch = totalFiles;
        ctFilesSearching = 0;
        while (ctFilesSearching < totalFiles)
        {
            ctFilesSearching++;
            Thread tr = new Thread(() => Search(qFilesToSearch.Dequeue(), searchText));
            tr.Start(); // Multi-threaded.
            //Search(qFilesToSearch.Dequeue(), searchText); // One-by-one processing.
        }
        while (ctFilesToSearch > 0) ; // Waits until all files have been processed.
        return lFilesGood;
    }
    catch { throw; }
}

private void Search(FileInfo file, string searthText)
{
    if (SearchPdfFile(file.FullName, searthText))
        lFilesGood.Add(file);
    ctFilesToSearch--;
}

private bool SearchPdfFile(string fileName, string searthText)
{
    bool textFound = false;
    if (File.Exists(fileName))
    {
        using (PdfReader pdfReader = new PdfReader(fileName))
        {
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                if (currentPageText.Contains(searthText))
                {
                    textFound = true;
                    break;
                }
            }
        }
    }
    return textFound;
}
Note: I am using the following namespaces:
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
My test processed 200 PDF files of 16 pages each. The response time in both scenarios averaged 1m 40s. I expected parallel processing to give a much better result. Is my approach to achieving parallelism correct?
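Separately from the timing question, the code above shares the queue, the result list and the counter between threads without synchronization, and it waits in a busy loop. A minimal sketch of one way to keep the one-thread-per-file approach while making it safe is shown here. The class name and the `containsText` delegate are hypothetical; the delegate stands in for a call such as SearchPdfFile(file.FullName, searchText). This is not claimed to change the measured time, only to remove the races.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

public static class ThreadedSearchSketch
{
    // 'containsText' is a hypothetical stand-in for SearchPdfFile(file.FullName, searchText).
    public static List<FileInfo> Search(IEnumerable<FileInfo> files, Func<FileInfo, bool> containsText)
    {
        var found = new List<FileInfo>();
        var gate = new object();          // protects 'found'
        var threads = new List<Thread>();

        foreach (FileInfo file in files)
        {
            // The file is chosen on the calling thread, so no shared queue is needed.
            FileInfo current = file;
            var thread = new Thread(() =>
            {
                if (containsText(current))
                {
                    lock (gate) { found.Add(current); } // List<T>.Add is not thread-safe
                }
            });
            threads.Add(thread);
            thread.Start();
        }

        // Join blocks until each thread ends, without spinning on a counter.
        foreach (Thread thread in threads)
            thread.Join();

        return found;
    }
}
```

Joining the threads replaces both the shared counter and the empty `while` loop, which otherwise keeps one core busy doing nothing useful.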
Does this SearchPdfFile method perform an I/O-bound task?
– Gabriel Coletta
@Gabrielcoletta, the SearchPdfFile method uses the iTextSharp DLL to open the PDF and locate the searchText. From the System.IO namespace I use only the File.Exists method.
– RaphaSheep
Just to understand the difference between your expectation and reality, how did you imagine dividing the work into more than one thread could improve the "response time"?
– Bacco
Given the number of PDFs to analyze, maybe it would be more efficient to assign one thread to each PDF, instead of several threads to a single PDF, and move on to the next files as soon as one finishes!
– Lodi
The problem is as follows: 200 files is a small workload, with or without threads. The problem arises when you start reading the files from disk. That work is I/O bound, so your thread sits idle waiting for the read to finish before it can continue processing. If you want better performance, I recommend asynchronous programming, so that other work can proceed while the file is being read.
– Gabriel Coletta
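A rough sketch of what the asynchronous suggestion above could look like: each file is read with an asynchronous FileStream, so no thread is blocked while the disk is busy, and the text search runs once the bytes are in memory. The class name, the helper `ReadAllBytesAsync` and the `containsText` delegate are hypothetical; the delegate stands in for whatever parses the PDF bytes and looks for the search text. Whether this actually improves the response time depends on the disk and on how much of the time is spent parsing, so it would need to be measured.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class AsyncSearchSketch
{
    // Reads the whole file without blocking a thread while the disk is busy.
    private static async Task<byte[]> ReadAllBytesAsync(string path)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, 4096, useAsync: true))
        {
            var buffer = new byte[stream.Length];
            int offset = 0;
            while (offset < buffer.Length)
            {
                int read = await stream.ReadAsync(buffer, offset, buffer.Length - offset);
                if (read == 0) break; // unexpected end of file
                offset += read;
            }
            return buffer;
        }
    }

    // 'containsText' is a hypothetical predicate that searches the in-memory file content.
    public static async Task<List<string>> SearchAsync(IEnumerable<string> paths, Func<byte[], bool> containsText)
    {
        // Start all reads; each continuation searches its own file once it is loaded.
        List<Task<KeyValuePair<string, bool>>> tasks = paths.Select(async path =>
        {
            byte[] content = await ReadAllBytesAsync(path);
            return new KeyValuePair<string, bool>(path, containsText(content));
        }).ToList();

        KeyValuePair<string, bool>[] results = await Task.WhenAll(tasks);
        return results.Where(r => r.Value).Select(r => r.Key).ToList();
    }
}
```

Note that this starts every read at once. With a very large set of files it would probably make sense to cap how many are in flight, for example with a SemaphoreSlim.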
@Bacco, my idea was to have several files being searched "at the same time", instead of waiting to finish the search in a file and then start the search in the next file.
– RaphaSheep
@Lodi, that’s what I did... Each file is being searched in a single thread. I don’t want to wait to finish searching a file to start the next one, but to search several files in parallel.
– RaphaSheep
@Gabrielcoletta, I used 200 files for the test... In production it can reach 2 GB of PDF files... I imagined my approach was already asynchronous, since after calling tr.Start() the system continues processing without waiting for a return. Isn’t it?
– RaphaSheep
The PC does nothing "at the same time"; it simply "splits its attention" when you create threads. Imagine a coffee packer at a conveyor belt, packing boxes. When you create a new "thread", the same operator looks after two belts. You may eventually make better use of several cores of the same PC in a scenario like this, but only if you disregard the other things the machine already has to do. And even then, every employee (core) needs to be looking after their own belt. (OK, I’ve oversimplified, but let’s leave the technicalities to whoever is going to answer officially.)
– Bacco
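To make the "one belt per employee" picture above concrete, here is a minimal sketch that caps the number of files searched at once at the number of cores, instead of starting one thread per file. The class name and the `containsText` delegate are hypothetical; the delegate stands in for a call such as SearchPdfFile(file.FullName, searchText). Using Environment.ProcessorCount as the limit is only an assumption for illustration; the best value depends on what else the machine is doing.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class BoundedParallelSearchSketch
{
    // 'containsText' is a hypothetical stand-in for SearchPdfFile(file.FullName, searchText).
    public static List<FileInfo> Search(IEnumerable<FileInfo> files, Func<FileInfo, bool> containsText)
    {
        // ConcurrentBag allows Add from several threads without an explicit lock.
        var found = new ConcurrentBag<FileInfo>();

        // At most one file per core is processed at any moment.
        var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

        // Parallel.ForEach returns only after every file has been processed.
        Parallel.ForEach(files, options, file =>
        {
            if (containsText(file))
                found.Add(file);
        });

        return found.ToList();
    }
}
```

The order of the returned files is not guaranteed, since ConcurrentBag does not preserve insertion order.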
@Raphasheep Not necessarily. Search() will run in a thread, as you say, but at some point that thread will call an I/O-bound method (reading the file, for example) and wait for the file to be loaded before continuing. This is probably why running sequentially or concurrently makes no significant difference.
– Gabriel Coletta
@Gabrielcoletta, I understand, thank you very much for the explanation. So what would be the best option to optimize the response time in my scenario? Maybe using Task instead of Thread? Or would that not change anything?
– RaphaSheep