Hello,
I'm developing a multithreaded crawler in which each job (thread) fetches X sites and analyzes certain content with the jsoup library. The sites are all accessible. The problem is that the final results are never the same: one run of the crawler resolves 200 contents, the next only 180. Looking at the logs I see that I am intermittently receiving 500 or 400 responses; the next run goes fine, and the run after that again gives a random result. jsoup code (executed by each thread):
try {
    Connection.Response resp = Jsoup.connect( url )
            .userAgent( "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21" )
            .timeout( 5000 )
            .ignoreHttpErrors( true )
            .execute( );
    doc = null;
    if( resp.statusCode( ) == 200 )
        doc = resp.parse( );
    else {
        log.info( "return url[" + url + "] statusCode == " + resp.statusCode( ) );
        return;
    }
} catch( Exception e ) {
    log.error( "[Jsoup] get response url[" + url + "] exception = ", e );
    return;
}
String title = doc.title( ); // get page title
Elements links = doc.select( "img" ); // get all <img /> tags
for( Element imgItem : links ) {
    if( numImgsbyUrl != -1 && countImg == numImgsbyUrl )
        break;
    String titleImg = getAttribute( imgItem, "title" );
    String src      = getAttribute( imgItem, "src" );
    String width    = getAttribute( imgItem, "width" );
    String height   = getAttribute( imgItem, "height" );
    String alt      = getAttribute( imgItem, "alt" );
    log.debug( "[Tag Images] title[" + titleImg + "] width[" + width + "] height[" + height + "] alt[" + alt + "]" );
    if( !checkTerms( src, titleImg, width, height, alt ) )
        continue;
    resultsImg.add( new ImageSearchResult( titleImg ) );
    if( numImgsbyUrl != -1 ) countImg++;
}
log.debug( "Number of results = [" + resultsImg.size( ) + "] to url[" + url + "]" );
The first execution got 183 contents, the second 233, and the third 203. I have five threads running in parallel over 100 sites. I don't know if I'm being blocked because of so many jsoup hits. Any idea what might be happening?
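One idea I still want to test is retrying transient failures before giving up on a URL, since the 500/400 responses seem intermittent. A minimal sketch of that, assuming a hypothetical helper fetchWithRetry (the attempt count and delays are arbitrary guesses, not part of my current code):

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper: retries intermittent 4xx/5xx responses a few times
// with a growing pause; returns null if the URL never answers with 200.
static Document fetchWithRetry( String url, int maxAttempts ) {
    for( int attempt = 1; attempt <= maxAttempts; attempt++ ) {
        try {
            Connection.Response resp = Jsoup.connect( url )
                    .userAgent( "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21" )
                    .timeout( 5000 )
                    .ignoreHttpErrors( true )
                    .execute( );
            if( resp.statusCode( ) == 200 )
                return resp.parse( ); // success: hand back the parsed document
            // transient server error (e.g. 500): wait a bit and try again
            Thread.sleep( 1000L * attempt );
        } catch( InterruptedException ie ) {
            Thread.currentThread( ).interrupt( );
            return null;
        } catch( Exception e ) {
            // network hiccup: also worth another attempt
        }
    }
    return null; // caller logs and skips this URL
}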
Master thread code:
ExecutorService pool = Executors.newFixedThreadPool( NThreads );
CountDownLatch doneSignal;
...
// the SAX parser
UserHandler userhandler = new UserHandler( );
XMLReader myReader = XMLReaderFactory.createXMLReader( );
myReader.setContentHandler( userhandler );
myReader.parse( new InputSource( new URL( url ).openStream( ) ) );
resultOpenSearch = userhandler.getItems( );
...
doneSignal = new CountDownLatch( resultOpenSearch.size( ) );
List< Future< List< ContentsResult > > > submittedJobs = new ArrayList< >( );
for( ItemXML item : resultOpenSearch ) { // search <img> tag information
    Future< List< ContentsResult > > job = pool.submit( new CrawlerParser( doneSignal ) );
    submittedJobs.add( job );
}
try {
    isAllDone = doneSignal.await( timeout, TimeUnit.MILLISECONDS );
    if( !isAllDone )
        cleanUpThreads( submittedJobs );
} catch( InterruptedException e1 ) {
    cleanUpThreads( submittedJobs ); // cancel whatever is still running
}
// get image results from the search
for( Future< List< ContentsResult > > job : submittedJobs ) {
    try {
        // before doing a get, check whether the job is done
        if( !isAllDone && !job.isDone( ) ) {
            // cancel the job and continue with the others
            job.cancel( true );
            continue;
        }
        List< ContentsResult > result = job.get( ); // wait for a processor to complete
        if( result != null && !result.isEmpty( ) ) {
            log.debug( "Result = " + result.size( ) );
            imageResults.addAll( result );
        }
    } catch( ExecutionException cause ) {
        log.error( "ContentsResultsController", cause ); // exception thrown inside a job
    } catch( InterruptedException e ) {
        log.error( "ContentsResultsController", e );
    }
}
With only this information we can guess at the problem, but it will be difficult to find a real solution to what you are facing. Can you provide a minimal, complete and verifiable example of your code?
– Sorack
I've already added the HTML parser code; it's pretty simple. The code posted above is what drives the thread processing, and the parser is what each thread executes.
– user2989745
@user2989745 There are several strange things in this code. If you are using an executor, why use this doneSignal to do some kind of manual synchronization? None of that is necessary: in the end, when you call job.get(), the Future waits for the job to complete. My suspicion is that you have set a very low timeout and are cancelling the jobs too early. Since access to websites over the network is often relatively slow, you will always get random results. If you can, add the relevant parts of the code that are missing above and we can help more. Hug!
– utluiz
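To illustrate, a minimal sketch of what I mean: let each Future do the waiting via get() with a timeout, with no CountDownLatch at all. Only CrawlerParser, ContentsResult and the surrounding variables come from your code; the rest is illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

ExecutorService pool = Executors.newFixedThreadPool( NThreads );
List< Future< List< ContentsResult > > > jobs = new ArrayList< >( );
for( ItemXML item : resultOpenSearch )
    jobs.add( pool.submit( new CrawlerParser( item ) ) ); // no latch needed

for( Future< List< ContentsResult > > job : jobs ) {
    try {
        // get() already blocks until this job finishes or the timeout expires
        List< ContentsResult > result = job.get( timeout, TimeUnit.MILLISECONDS );
        if( result != null && !result.isEmpty( ) )
            imageResults.addAll( result );
    } catch( TimeoutException te ) {
        job.cancel( true ); // too slow: give up on this job only
    } catch( ExecutionException | InterruptedException e ) {
        log.error( "job failed", e );
    }
}
pool.shutdown( );

Note that here the timeout applies per job rather than to the whole batch; if you need a single global deadline, ExecutorService.invokeAll( tasks, timeout, unit ) gives you that in one call.
– utluiz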
Yes, I realize the Signal is unnecessary since get is blocking, but after some research I wasn't sure whether get alone would let me receive the results of all the threads. The timeout is at 100000; I'll increase it and report back. Thank you ;)
– user2989745
However, I removed the Signal logic and the problem continues...
– user2989745
I edited with some more code, but I still can't solve the problem. Any help?
– user2989745