I’ll describe what you need to do, if I can, later I’ll try to write some code. But I already say that you can not do it more precisely because your lines have no fixed size. So you have to do more or less what a database does, with the difference that the database organized the data for this.
Have you learned how to randomly access the file on your previous question.
You’ll need to know the file size, something like System.IO.FileInfo(path).Length
. View documentation.
You will also need at Indexof()
to find the line breaking.
Search strategy
As you cannot read everything to memory you will have to read on data pages. My suggestion is that these pages have 4096 bytes. It is the size of the memory page currently and the probable size of the data page (cluster) disk. Then reading and storing a smaller size than this will be wasteful because this is the minimum size that will be read physically on the disk. With smaller sizes you will end up reading more than once the same data. It is true that in the next few times the operating system will probably pick up from the cache and not the disk. This size will not even make tickle in memory, after all after finishing using this buffer it can be discarded by the GC.
Dividing the total file size by the size of the buffer you will find the amount of existing pages in the archive.
You will read these pages that will certainly contain several lines (by the pattern you have shown, I am not saying that it is guaranteed that you will, but it seems that yes).
Since you have no way to set the line size you will work with the pages and apply binary search on the pages that will act as the basic search element.
You will probably read the first page (as indicated in the other question I answered), the last, the middle and then successively will read the half of the current half going to the previous or later half as compared.
How you get to the page. Think of the page numbered zero to the last minus one. For example, if you have 20 pages, they will be numbered from 0 to 19. And the beginning of each page is its number times the size. The first will be 0 X 4096, ie position 0 of the file. Page 10 will be 10 X 4096 and it should start reading at position 40960. It will always read 4096 bytes according to the size of the buffer. It doesn’t matter that the last one probably won’t have all buffer filled.
On each page read you will have to search for all ISBN codes. From what I understand, it always comes after a line break, so go look with IndexOf
by the line break(s) character(s). Here is to take the next 14 characters. This is the ISBN. Here you compare to what you are looking for. This will be a loop to find all the line breaks and hence the code that comes right after. Inside the page it is probably best to leave the binary search aside and do the sequential search. As a hint know that exists IndexOf()
with offset to seek an intermediate position, that is, to find a string starting at a specific position.
If anyone is the same, you found and can terminate the search algorithm (all that is left is to get the whole line). If the last code found inside the page is smaller than what you are looking for, you will look for it in the later goal. If the first code found inside the page is larger than you’re looking for, you’ll look in the previous half. If you have already looked at every possible page (there are no more pages between the current and the last read) and have not found on any page, the code does not exist.
There is the possibility of a code starting on this page and ending only on the other. You will have to analyze this and if you detect that the code is not complete read the next page on disk to pick up the rest of the code.
This is what you’ll probably have to do as well when you find the line you want. There is a reasonable probability that the line you want and think is starting on one page but it only ends (that is, there is another line break) on the other page, so you have to read the other page to have the entire line.
All I’ve described is the binary search process, no secrets. What is different is the strategy of searching for pages to give a stable information of breaking parts of the file.
Completion
It is understandable that this is not ideal, but under the circumstances this is the possible solution. I didn’t go into so much detail, there might be some things that can be done in a better way, some extra checks. To do an exercise, it is more or less that, to put in production has to architect better.
Can you ensure that these numbers at the beginning of each row are in ascending order? So you showed are not. There is no way.
– Maniero
Sorry I’ll change the print, yes they are already ordered.
– William Pereira
But you showed that this is not guaranteed.
– Maniero
I already created a method to sort, then I call the method, then I will perform the binary search, I had sent the print before calling the sorting method. Now you’re right.
– William Pereira
But if you scan the entire file to sort, why not throw it into a database?
– Maniero
It turns out it’s a college activity, I have to work with files only.. I have to do binary research on it. If the length of the lines were fixed, I could do it. I would like to know how to do binary search when the file lines (the records) have no fixed size.
– William Pereira
I’m going to describe a basic algorithm or try to make a pseudo code so you code right.
– Maniero
Okay, that’ll help a lot!
– William Pereira
If you have any questions, comment there in the answer, but if you have more specific questions about how to do any operation in particular, maybe it is better to open a new question with the doubt, otherwise this question would be too wide.
– Maniero
@williamhk2 can’t you index the lines in that sort? It would be nice to order, take advantage and save the position of the lines beginnings in another txt or even in memory. So you could make the conversion number of line x position at the time of the search.
– Bacco