You have to read the file in chunks. Something like this:
$file = fopen("teste.txt",'r');
$count = 0;
while (($line = fgets($file, 4096)) !== false) { // I would probably use a larger value, never a smaller one
$count++;
}
fclose($file);
I set a limit of 4096 bytes because, without one, you are at risk if the file is very large and does not have enough line breaks to keep the chunks small. This solution is not perfect: a line longer than the limit is read in several pieces and counted more than once. A better one would need a much more sophisticated algorithm.
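The long-line weakness of the fgets() approach can be reduced by counting only the pieces that actually end in a newline. The following is a minimal sketch under that assumption, not a production implementation; the function name countLinesFgets and its parameters are hypothetical.

```php
<?php
// Hypothetical helper: counts lines with fgets() without overcounting
// lines that are longer than the read limit.
function countLinesFgets(string $path, int $limit = 4096): int
{
    $file = fopen($path, 'r');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $count = 0;
    $pending = false; // true while the current line has not been terminated yet
    // fgets() returns at most $limit - 1 bytes, stopping early at a newline.
    while (($piece = fgets($file, $limit)) !== false) {
        if (substr($piece, -1) === "\n") {
            $count++;         // this piece closes a line
            $pending = false;
        } else {
            $pending = true;  // the line continues, or the file ends without a newline
        }
    }
    fclose($file);
    // A final line with no trailing newline is still counted as a line.
    return $pending ? $count + 1 : $count;
}
```

Memory use stays bounded by the limit, because only one piece is held at a time.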
I thought of another approach, which also has problems:
$file = fopen("teste.txt",'rb');
$count = 0;
while (!feof($file)) {
$chunk = fread($file, 4096); // I would probably use a larger value, never a smaller one
$count += substr_count($chunk, "\n");
}
fclose($file);
I put it on GitHub for future reference.
A line break can consist of more than one character, so one character may end up in one chunk and the other in the next chunk. When that happens, the break is not counted.
Production-ready solutions would have to take this into account and handle it when it happens. This is easier to solve in the second algorithm, which also has the advantage of never filling up memory.
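One possible way to handle a break that is split across two chunks in the fread() version is to carry the last few bytes of each chunk over to the next read. This is only a sketch: the function name countLineBreaks and its parameters are hypothetical, and it assumes a non-empty break sequence that cannot overlap itself, such as "\r\n".

```php
<?php
// Hypothetical helper: counts occurrences of a (possibly multi-byte) line break
// while reading the file in fixed-size chunks.
function countLineBreaks(string $path, string $eol = "\r\n", int $chunkSize = 4096): int
{
    $file = fopen($path, 'rb');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $keep = strlen($eol) - 1; // bytes that could be the start of a split break
    $carry = '';
    $count = 0;
    while (!feof($file)) {
        $chunk = fread($file, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        // Prepend the tail of the previous chunk so a split break becomes whole again.
        $buffer = $carry . $chunk;
        $count += substr_count($buffer, $eol);
        // The carry is shorter than the break, so no break is ever counted twice.
        $carry = $keep > 0 ? substr($buffer, -$keep) : '';
    }
    fclose($file);
    return $count;
}
```

Memory use is still limited to one chunk plus a carry of at most strlen($eol) - 1 bytes.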
Run tests to find the best chunk size. I used 4 KB because it is the memory page size and the most common file system cluster size. Smaller sizes will perform worse and carry more risk of cutting a line in half, which hurts both algorithms. Larger sizes can give much better results. I would venture to say the bigger the better, but that depends on the hardware, OS, usage pattern, etc. A size can look good in testing and still cause problems in production. If you could read the whole file at once, that would always be the fastest option and be risk-free.
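A rough way to compare chunk sizes is to time the same count with each candidate size. The sketch below is an assumption about how such a test could look, not a rigorous benchmark; countNewlines and timeChunkSizes are hypothetical names, and hrtime() requires PHP 7.3 or later.

```php
<?php
// Hypothetical helper: counts "\n" bytes by reading the file in chunks of $chunkSize.
function countNewlines(string $path, int $chunkSize): int
{
    $file = fopen($path, 'rb');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $count = 0;
    while (!feof($file)) {
        $chunk = fread($file, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $count += substr_count($chunk, "\n");
    }
    fclose($file);
    return $count;
}

// Hypothetical helper: times countNewlines() once for each candidate chunk size.
function timeChunkSizes(string $path, array $sizes): array
{
    $results = [];
    foreach ($sizes as $size) {
        $start = hrtime(true); // monotonic clock, in nanoseconds
        $lines = countNewlines($path, $size);
        $results[$size] = [
            'lines' => $lines,
            'ms'    => (hrtime(true) - $start) / 1e6,
        ];
    }
    return $results;
}
```

Calling timeChunkSizes('teste.txt', [4096, 65536, 1048576]) returns the line count and the elapsed milliseconds for each size. The first run may be slower because the operating system has not cached the file yet, so it is worth repeating the measurement several times on the target machine.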
Guilhermelautert raised the point that \n is just the line feed and therefore would not cause a problem with the break. But on Windows the line break is \r\n (I have no way to test). One of two things must be true: either PHP treats the \n in the code as the full line break, and what I described happens, or this code would not work properly on Windows and would need \r\n in the code to catch the break correctly, which would bring back the same problem of the break sequence being split between chunks.
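This question can be checked without a Windows machine, since a file with \r\n breaks can be written on any system. The sketch below assumes the file is opened in binary mode ('rb'), so no translation of line endings takes place; in a PHP double-quoted string, "\n" is a single line feed byte, and every \r\n sequence contains exactly one of them.

```php
<?php
// Create a temporary file that uses Windows-style line breaks.
$path = tempnam(sys_get_temp_dir(), 'crlf');
file_put_contents($path, "one\r\ntwo\r\nthree\r\n");

$file = fopen($path, 'rb'); // binary mode: bytes are read exactly as stored
$count = 0;
while (!feof($file)) {
    $chunk = fread($file, 4096);
    if ($chunk === false || $chunk === '') {
        break;
    }
    $count += substr_count($chunk, "\n"); // counts the LF byte of each \r\n
}
fclose($file);
unlink($path);

echo $count, PHP_EOL; // prints 3
```

Under these assumptions, counting only "\n" gives the right number of breaks for this file, and a single byte cannot be split between chunks.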
Take a look at the answer to the question Efficiently Counting on SOen.
– David
@David now you've caught me, I have an answer in English there, hahaha... But since I want to help the Brazilian/Portuguese-speaking community, I thought I'd put it here too
– Wallace Maxters
@David it is not wrong to do this here, but some people think it is bad faith. In fact, a lot of what is already on SOen is repeated here
– Wallace Maxters
I also think it is fine to have it repeated, because someone who only understands English more or less may miss some detail when reading the answer, and here in Portuguese it is clearer :)
– David
If you are feeling brave... you can take a look here: http://code.metager.de/source/xref/gnu/coreutils/src/wc.c. As far as I have seen... wc works by reading chunks and going through them character by character.
– Nelson Teixeira