How to do binary search in a txt file?

I have a txt file that stores data about several books (example below). I would like to do a binary search that returns the line whose first field (the book’s ISBN) matches the search term, without loading the entire file into memory (since the file may be very large). Is it possible to do this kind of search, in C# or Java?

Note: the lines have varying lengths.

[image: sample book records from the file]

  • Can you ensure that the numbers at the beginning of each line are in ascending order? The ones you showed are not; without that, there is no way.

  • Sorry, I’ll change the screenshot; yes, they are already sorted.

  • But you showed that this is not guaranteed.

  • I already wrote a method to sort; I call it first and then perform the binary search. I had taken the screenshot before calling the sort method. You were right at the time.

  • But if you scan the entire file to sort it, why not put it into a database?

  • It turns out this is a college assignment; I have to work with files only, and I have to do a binary search on them. If the lines had a fixed length I could do it. I would like to know how to do a binary search when the file’s lines (the records) have no fixed size.

  • I’m going to describe a basic algorithm, or try to write pseudocode, so you can implement it.

  • Okay, that’ll help a lot!

  • If you have any questions, comment on the answer; but if you have more specific questions about how to do a particular operation, it may be better to open a new question, otherwise this one would become too broad.

  • @williamhk2 can’t you index the lines during that sort? It would be worth taking advantage of the ordering pass to also save the starting position of each line, in another txt file or even in memory. Then, at search time, you could convert a line number into a file position.


2 answers


I’ll describe what you need to do and, if I can, I’ll try to write some code later. But I can already say that you cannot do a classic binary search, precisely because your lines have no fixed size. So you will have to do more or less what a database does, with the difference that a database organizes the data for this purpose.

You learned how to access the file randomly in your previous question.

You’ll need to know the file size, something like new System.IO.FileInfo(path).Length in C#. See the documentation.

You will also need IndexOf() to find the line breaks.

Search strategy

Since you cannot read everything into memory, you will have to read the data in pages. My suggestion is pages of 4096 bytes: that is the typical memory-page size today and the probable size of a disk data page (cluster). Reading and storing less than that would be wasteful, because it is the minimum amount physically read from the disk; with smaller sizes you would end up reading the same data more than once (although on subsequent reads the operating system would probably serve it from cache rather than from disk). This size will barely register in memory; once you finish using the buffer, it can be collected by the GC.

Dividing the total file size by the buffer size gives you the number of pages in the file.

You will read these pages, each of which will certainly contain several lines (judging by the pattern you showed — I am not saying it is guaranteed, but it seems so).

Since you have no way to fix the line size, you will work with pages and apply the binary search to the pages, which act as the basic search unit.

You will probably read the first page (as shown in the other question I answered), the last, then the middle, and then successively read the middle of the current half, moving to the previous or later half according to the comparison.

How do you get to a page? Think of the pages numbered from zero to the last minus one. For example, with 20 pages they are numbered 0 to 19, and the start of each page is its number times the page size. The first starts at 0 × 4096, i.e. position 0 of the file. Page 10 starts at 10 × 4096, so reading begins at position 40960. You always read 4096 bytes, the buffer size; it doesn’t matter that the last page probably won’t fill the whole buffer.
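The page arithmetic above can be sketched as follows (the class and method names are mine, not from the question):

```java
import java.io.File;

public class Paginas {
    static final int TAMANHO_PAGINA = 4096;

    // start offset of a page: its 0-based number times the page size
    static long inicioDaPagina(long numero) {
        return numero * TAMANHO_PAGINA;
    }

    // number of pages: file size divided by page size, rounded up
    static long totalDePaginas(File arquivo) {
        return (arquivo.length() + TAMANHO_PAGINA - 1) / TAMANHO_PAGINA;
    }
}
```

For example, `inicioDaPagina(10)` gives 40960, matching the calculation above; a 5000-byte file has 2 pages, the second only partially filled.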

On each page read you will have to look for all the ISBN codes. From what I understand, each one comes right after a line break, so search with IndexOf for the line-break character(s) and take the next 14 characters: that is the ISBN, which you compare with what you are looking for. This becomes a loop that finds every line break and hence the code that comes right after it. Inside a page it is probably best to set the binary search aside and scan sequentially. As a hint, there is an IndexOf() overload that takes an offset, that is, one that starts looking for a string at a specific position.

If any code is equal, you have found it and can end the search (all that is left is to fetch the whole line). If the last code found inside the page is smaller than the one you are looking for, continue in the later half. If the first code found inside the page is larger, continue in the previous half. If you have already examined every possible page (there are no pages left between the current one and the last one read) and found nothing, the code does not exist.
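The whole procedure described above can be sketched in Java as below. This is a simplified illustration, not official code from the answer: it assumes lines sorted by a leading code, '\n' line breaks, and a single-byte encoding, and it reads each page with a small overlap so a code that starts near the end of a page can still be compared whole (page-boundary cases get only this minimal treatment).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class BuscaPorPaginas {
    static final int PAGINA = 4096;

    // binary search over fixed-size pages; returns the whole line or null
    static String buscar(RandomAccessFile arq, String codigo) throws IOException {
        int tam = codigo.length();
        long paginas = (arq.length() + PAGINA - 1) / PAGINA;
        long baixo = 0, alto = paginas - 1;
        while (baixo <= alto) {
            long meio = (baixo + alto) / 2;
            String pg = lerPagina(arq, meio, tam);
            // sequential scan inside the page: a code follows each line break
            int i = (meio == 0) ? 0 : pg.indexOf('\n') + 1;
            if (i == 0 && meio > 0) i = -1;     // no line break found in this page
            String primeiro = null, ultimo = null;
            while (i >= 0 && i <= PAGINA && i + tam <= pg.length()) {
                String atual = pg.substring(i, i + tam);
                if (primeiro == null) primeiro = atual;
                ultimo = atual;
                if (atual.equals(codigo)) return linhaCompleta(arq, meio * (long) PAGINA + i);
                int quebra = pg.indexOf('\n', i);
                i = (quebra < 0) ? -1 : quebra + 1;
            }
            if (ultimo == null || codigo.compareTo(ultimo) > 0) baixo = meio + 1;
            else if (codigo.compareTo(primeiro) < 0) alto = meio - 1;
            else return null;                   // within this page's range, but absent
        }
        return null;
    }

    // reads one page plus a small overlap so codes crossing the boundary stay whole
    static String lerPagina(RandomAccessFile arq, long numero, int sobra) throws IOException {
        byte[] buf = new byte[PAGINA + sobra];
        arq.seek(numero * (long) PAGINA);
        int lidos = arq.read(buf);
        return new String(buf, 0, Math.max(lidos, 0), StandardCharsets.ISO_8859_1);
    }

    // reads the full line starting at the given position, even across a page boundary
    static String linhaCompleta(RandomAccessFile arq, long posicao) throws IOException {
        arq.seek(posicao);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = arq.read()) != -1 && c != '\n') sb.append((char) c);
        return sb.toString();
    }
}
```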

There is also the possibility of a code starting on one page and ending only on the next. You will have to check for this, and if you detect that a code is incomplete, read the next page from disk to pick up the rest of it.

You will probably have to do the same when you find the line you want. There is a reasonable chance that the line starts on one page but only ends (that is, the next line break appears) on the following page, so you must read that page as well to get the entire line.

Everything I’ve described is the ordinary binary search process, no secrets. What is different is the page strategy, which gives you stable units for splitting the file into parts.

Conclusion

Understandably this is not ideal, but under the circumstances it is the feasible solution. I didn’t go into much detail; some things could be done in a better way, and some extra checks are missing. For an exercise it is more or less this; for production it would need a better architecture.

  • OK, I will try it this way; if I find a better solution I will post it here. Thank you!!

  • @williamhk2 I restructured the answer. I think it makes things clearer and easier and gives better performance. Keep commenting if you need more help.


The best approach to searching a text file depends on how often the file is modified, how long it takes to scan it, and how often the search functionality is used.

Depending on the file size and read time, a binary search may be inefficient in the sense that it makes poor use of the CPU, the system buffers, and the reading characteristics of the medium in question. Random access reads are not free. See this article.

Indexing

If the file is not updated frequently, you can build an in-memory index that maps each ISBN to the position of its line in the file.

See a simple example in Java:

//reads the file and keeps an index from ISBN to line position
static class IndiceArquivo {

    //position of a line within the file
    static class PosicaoLinha {
        long inicio;
        int tamanho;
        PosicaoLinha(long inicio, int tamanho) {
            this.inicio = inicio;
            this.tamanho = tamanho;
        }
    }

    //the index
    Map<String, PosicaoLinha> indicePorIsbn = new HashMap<>();

    //the file
    Path arquivo;

    //constructor
    IndiceArquivo(Path arquivo) {
        this.arquivo = arquivo;
        montarIndice();
    }

    //reads all ISBNs, recording where each line starts and how many bytes it occupies
    void montarIndice() {
        try (BufferedReader reader = Files.newBufferedReader(arquivo, StandardCharsets.UTF_8)) {
            String linha;
            long posicao = 0;
            while ((linha = reader.readLine()) != null) {
                //measure the line in the same encoding used to read the file
                int tamanho = linha.getBytes(StandardCharsets.UTF_8).length;
                indicePorIsbn.put(linha.substring(0, linha.indexOf('-')), new PosicaoLinha(posicao, tamanho));
                //assumes the file uses the platform line separator
                posicao += tamanho + System.getProperty("line.separator").length();
            }
        } catch (IOException e) {
            throw new RuntimeException("Error reading the file!");
        }
    }

    //finds the indexed position of the book's line, then retrieves the full line from the file
    Optional<String> getLinhaIsbn(String isbn) {
        return Optional.ofNullable(isbn)
                .map(k -> indicePorIsbn.get(k))
                .map(p -> {
                    //the channel is closed by try-with-resources
                    try (SeekableByteChannel canal = Files.newByteChannel(arquivo)) {
                        ByteBuffer buffer = ByteBuffer.allocate(p.tamanho);
                        canal.position(p.inicio).read(buffer);
                        return new String(buffer.array(), StandardCharsets.UTF_8);
                    } catch (IOException e) {
                        throw new RuntimeException("Error reading the file!");
                    }
                });
    }
}

And to test:

public static void main(String[] args) throws Exception {
    //creates a sample file
    Path arquivo = Files.createTempFile("meu", "teste");
    Files.write(arquivo, (
                    "123456789-titulo 1-autor 1-2012" + System.getProperty("line.separator") +
                    "234567890-titulo 2-autor 2-2013" + System.getProperty("line.separator") +
                    "345678901-titulo 3-autor 3-2014"
            ).getBytes("UTF-8"));

    //builds the index
    IndiceArquivo indice = new IndiceArquivo(arquivo);

    //searches and checks the result
    Optional<String> resultadoBusca = indice.getLinhaIsbn("234567890");
    if (resultadoBusca.isPresent()) {
        System.out.println("Found: " + resultadoBusca.get());
    } else {
        System.out.println("Not found");
    }
}

Reading all

If the file is not really that large and it does not need to be read very often, simply read the whole file using a stream:

static Optional<String> buscarIsbn(String isbn) throws Exception {
    Path arquivo = Paths.get("/meu/arquivo/grande.txt");
    String inicioLinha = isbn + "-";
    try (Stream<String> lines = Files.lines(arquivo)) {
        return lines.filter(linha -> linha.startsWith(inicioLinha)).findFirst();
    } catch (IOException e) {
        throw new Exception("Error reading the file!");
    }
}

Example of use:

public static void main(String[] args) throws Exception {
    Optional<String> resultadoBusca = buscarIsbn("1234567890");
    if (resultadoBusca.isPresent()) {
        //found
    } else {
        //not found
    }
}

Copy the file with aligned lines

Another alternative is to write a copy of the file with every line padded to the same length, so that the binary search becomes straightforward:

static void alinharArquivo(int tamanhoMaximoLinha) throws Exception {
    Path arquivoOriginal = Paths.get("/meu/arquivo/grande.txt");
    Path arquivoAlinhado = Paths.get("/meu/arquivo/grande-alinhado.txt");
    try (
            Stream<String> linhas = Files.lines(arquivoOriginal);
            PrintWriter saida = new PrintWriter(Files.newBufferedWriter(arquivoAlinhado))) {
        //left-justifies each line, padding it with spaces to the fixed width
        linhas.forEachOrdered(l -> saida.format("%-" + tamanhoMaximoLinha + "s%n", l));
    } catch (IOException e) {
        throw new Exception("Error reading the file!");
    }
}
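With every record padded to the same width, line i starts exactly at byte i × width, and the classic binary search applies. A sketch under that assumption (the class name is mine; it assumes a single-byte encoding and that the width includes any line separator):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class BuscaAlinhada {

    // classic binary search over a file of fixed-width lines
    static String buscar(RandomAccessFile arquivo, int larguraLinha, String isbn) throws IOException {
        long linhas = arquivo.length() / larguraLinha;
        long baixo = 0, alto = linhas - 1;
        byte[] buffer = new byte[larguraLinha];
        while (baixo <= alto) {
            long meio = (baixo + alto) / 2;
            arquivo.seek(meio * larguraLinha);      // line i starts at i * width
            arquivo.readFully(buffer);
            String linha = new String(buffer, StandardCharsets.ISO_8859_1).trim();
            String codigo = linha.substring(0, linha.indexOf('-'));
            int cmp = codigo.compareTo(isbn);
            if (cmp == 0) return linha;
            if (cmp < 0) baixo = meio + 1;          // search the later half
            else alto = meio - 1;                   // search the earlier half
        }
        return null;
    }
}
```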

Considerations

If a feature relies on a large amount of data, is user-critical and often used, the best way out is to consider migrating to a suitable database or technology.

If depending on text files is unavoidable, another possibility is to split the file logically into groups, so as to limit the amount of data scanned per search. For example, if you split the large file into ten smaller files according to the first digit of the ISBN, each search could determine which file to look in and ignore the other nine.
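Picking the right partition is then a simple mapping from the ISBN’s first digit to a file name. A minimal sketch (the naming scheme livros-<digito>.txt is hypothetical):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class Particoes {

    // chooses the partition file for an ISBN by its first digit
    static Path arquivoPara(String isbn) {
        return Paths.get("livros-" + isbn.charAt(0) + ".txt");
    }
}
```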

However, also consider that any solution like this may create severe limitations in the future. For example, if it becomes necessary to search by author, you will still have to scan all ten files.

Therefore, I would say that in general terms the best approach is to use a database and the second best is to use an index. If the index is too large for the memory, you could use another file to store it in positional format so that it is feasible to do a binary search.
