Search for words in almost 1 million files


I am looking for methods/ideas that can help me solve my problem. I have a folder/file structure that my program generates; so far, so good. The problem is that this folder already contains more than 900 thousand files. Each of these files is very small, about 1 KB, with a header and a body of text.

Currently the search the software uses is basic: it literally opens file by file and searches for the word. But imagine the delay... even on an SSD, searching for the word "saude" took me more than 8 minutes.

I did some tests to see if reducing the number of files would help, but I noticed it would help little; the search would still take minutes.

The current idea (without using any database) is to index manually, with an external process responsible for that, grouping words of 3 or more characters per folder, reducing the search from millions of files to a few thousand, but it could still take a few seconds in some cases.
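As a rough sketch of that manual indexing idea, the external process could build an inverted index (word -> list of files) along these lines. The folder path, the 3-character cutoff and the in-memory map are simplifications; a real version would normalize the words and persist the index to disk:

```cpp
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

int main() {
    // word -> list of files that contain it (the inverted index)
    std::unordered_map<std::string, std::vector<std::string>> index;

    // Indexing pass: done once (or incrementally) by an external process.
    for (const auto& entry : fs::recursive_directory_iterator("C:/MyData")) {
        if (!entry.is_regular_file()) continue;
        std::ifstream in(entry.path());
        const std::string path = entry.path().string();
        std::string word;
        while (in >> word) {
            if (word.size() < 3) continue;               // only words with 3+ characters
            auto& files = index[word];
            if (files.empty() || files.back() != path)   // avoid duplicates from the same file
                files.push_back(path);
        }
    }

    // Query pass: a lookup instead of scanning 900 thousand files.
    const std::string term = "saude";
    auto it = index.find(term);
    std::cout << term << " found in "
              << (it == index.end() ? 0 : it->second.size()) << " files\n";
    return 0;
}
```

Persisted per folder or per word prefix, this is essentially a hand-rolled version of the index that a full-text engine maintains automatically.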

I was also thinking about how Windows indexing of file contents works.

I did some small tests to see where the time goes in the search:

Time to open files + read: 465.948 / Time searching only: 264.318 / Found: 2921
Time to open files + read: 788.992 / Time searching only: 599.093 / Found: 2921
Time to open files + read: 834.300 / Time searching only: 572.496 / Found: 2921
Time to open files + read: 709.464 / Time searching only: 539.053 / Found: 2921
Time to open files + read: 857.443 / Time searching only: 761.121 / Found: 2921
Time to open files + read: 909.440 / Time searching only: 602.000 / Found: 2921
Time to open files + read: 865.306 / Time searching only: 499.046 / Found: 2921

The test was done on only 1000 files. The first value in each line includes opening and reading the file; the second is only the time spent searching the contents (in my test I used strstr).
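For reference, the per-file scan being measured could be approximated by something like this sketch (the folder path and the search term are placeholders; it assumes plain text files):

```cpp
#include <cstdio>
#include <cstring>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

int main() {
    const char* term = "saude";
    int found = 0;

    for (const auto& entry : fs::recursive_directory_iterator("C:/MyData")) {
        if (!entry.is_regular_file()) continue;

        // "open + read" part of the measurement
        std::ifstream in(entry.path(), std::ios::binary);
        std::string content((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());

        // "searching" part of the measurement
        if (std::strstr(content.c_str(), term) != nullptr)
            ++found;
    }
    std::printf("Found: %d\n", found);
    return 0;
}
```

As the numbers above suggest, even the pure string search over data already in memory is a large share of the total, so any approach that opens and scans every file per query will stay slow; the gain has to come from not touching most of the files at all.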

Is there any method to make this faster without using a database? I am not sure a database would solve the case, since with an average of 200 characters per file there would be millions of characters available to search. If it is not possible without a database, what would be the general idea? Can a database handle this data volume well?

  • where is the code?

  • Ever think about using the Windows Search API?

  • @Leandroangelo Good one; actually no, I hadn't. Would it be something like enabling the option to index file contents and then querying it from code in the application?

  • That's it; if memory serves, you open an OleDb connection to the search indexer and write the query as plain SQL.

  • @Leandroangelo I'll look into it, thanks for the tip. A search that takes minutes is quite a problem for me!

  • https://docs.microsoft.com/en-us/windows/desktop/search/-search-3x-wds-overview


2 answers


@Kevin Kouketsu, the best technique for this scenario is a full-text search engine. There are relational databases with this feature, such as PostgreSQL. There are also Solr and Elasticsearch, which offer full-text search and, above all, the real-time indexing you mentioned; both of the latter are based on the Lucene project.
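To make the full-text idea concrete, here is a minimal sketch using SQLite's FTS5 module (chosen here only because it is self-contained; PostgreSQL, Solr and Elasticsearch expose the same concept). The database name, table layout, file path and sample text are hypothetical, and it assumes an SQLite build with FTS5 enabled:

```cpp
#include <cstdio>
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("index.db", &db);

    // One virtual table holds the searchable text; the path is stored but not indexed.
    sqlite3_exec(db,
        "CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(path UNINDEXED, body);",
        nullptr, nullptr, nullptr);

    // The external indexing process inserts one row per file (path + its ~200 characters).
    sqlite3_exec(db,
        "INSERT INTO docs(path, body) VALUES ('C:/MyData/0001.txt', 'cabecalho ... saude ...');",
        nullptr, nullptr, nullptr);

    // Query: the FTS index returns only the files containing the word.
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT path FROM docs WHERE docs MATCH ?1;", -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, "saude", -1, SQLITE_TRANSIENT);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        std::printf("%s\n", reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)));

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}
```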

  • I'll take a look at that full-text search mechanism. I saw that SQLite and SQL Server also have it. SQL Server would be my first option because we already use Microsoft software in the company. Thanks for the tip!

  • Apache Solr (https://lucene.apache.org/solr/guide/7_5/) is designed for exactly your use case! You will be surprised by this tool!

My solution, as suggested by Leandro Angelo in the comments, was to use the Windows Search API. I implemented an IFilter in a DLL and registered it so the indexer could use it for my files.

With that, I could open an OleDb connection, create new properties, and run queries. It became fast and works very well.
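A minimal sketch of the query side, using the Search.CollatorDSO OLE DB provider via ADO from C++ (the scope path and the search term are placeholders; it assumes the IFilter is registered and the folder is inside the indexed locations):

```cpp
#include <iostream>
#include <comdef.h>
#import "C:\Program Files\Common Files\System\ado\msado15.dll" no_namespace rename("EOF", "adoEOF")

// Query the Windows Search index through its OLE DB provider.
static void QueryIndex()
{
    _ConnectionPtr conn(__uuidof(Connection));
    conn->Open(L"Provider=Search.CollatorDSO;Extended Properties='Application=Windows';",
               L"", L"", adConnectUnspecified);

    _RecordsetPtr rs(__uuidof(Recordset));
    rs->Open(L"SELECT System.ItemPathDisplay FROM SystemIndex "
             L"WHERE SCOPE='file:C:/MyData' AND CONTAINS(*, 'saude')",
             _variant_t((IDispatch*)conn, true),
             adOpenForwardOnly, adLockReadOnly, adCmdText);

    while (!rs->adoEOF) {
        _bstr_t path(rs->Fields->GetItem(L"System.ItemPathDisplay")->GetValue());
        std::wcout << (const wchar_t*)path << L"\n";
        rs->MoveNext();
    }
    rs->Close();
    conn->Close();
}

int main()
{
    if (FAILED(CoInitialize(nullptr))) return 1;
    try {
        QueryIndex();
    } catch (const _com_error& e) {
        std::wcerr << L"COM error: " << e.ErrorMessage() << L"\n";
    }
    CoUninitialize();
    return 0;
}
```

The same SystemIndex query can also be issued from .NET through System.Data.OleDb with the same connection string.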
