How to know the number of lines a large file has in php?

Asked 9 years, 2 months ago

Viewed 2,166 times

6

How do I know the number of lines in a file using PHP?

I know there are functions like file(), which returns all the lines of the file in an array. We could simply call count() on the result, but the problem is that I need to do this for a 60 MB file, and I don't think using file() is a good idea in that case.

Is there some other way to do it?

How can I find out, for example, how many lines a 2 GB file has without exhausting PHP's memory?

Is there a smarter way to count the lines of a file in PHP?

3

Take a look at the answer to the question "Efficiently Counting" on SOEN.

– David

2016/05/13 at 17:01
2

@David, now you've got me: I have an answer in English there, hahaha... But since I want to help the Brazilian/Portuguese-speaking community, I thought I'd post it here too.

– Wallace Maxters

2016/05/13 at 17:02
@David, it's not wrong to do this here, but some people consider it bad faith. In fact, a lot of what is already on SOEN is repeated here.

– Wallace Maxters

2016/05/13 at 17:04
I also think it's right to have repeats, because someone who only understands English more or less may miss some detail when interpreting the answer, and here in Portuguese it is clearer :)

– David

2016/05/13 at 17:05
If you're feeling brave, you can take a look here: http://code.metager.de/source/xref/gnu/coreutils/src/wc.c. As far as I have seen, wc works by reading chunks and going through them character by character.

– Nelson Teixeira

2016/05/13 at 18:28

4 answers

3

You have to read it in chunks. Something like this:

$file = fopen("teste.txt",'r');
$count = 0;
while (!feof($file)) {
    $line = fgets($file, 4096); // I would probably use a larger value, never a smaller one
    $count++;
}
fclose($file);

I set a limit of 4096 bytes because, without one, you are at risk if the file is very big and does not have enough line breaks to keep the chunks small. This solution is not perfect: a line longer than the limit is returned in several pieces and counted more than once. A better one would need a much more sophisticated algorithm.

I then thought of another one, which also has problems:
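Because fgets() with a length returns at most length - 1 bytes, a long line comes back in several pieces. One way around the double counting is to count a line only when the piece read ends in "\n", plus one for a final line with no trailing break. A minimal sketch, assuming a hypothetical helper name countLinesWithFgets that is not part of the original answer:

```php
<?php
// Hypothetical helper (not from the original answer): counts lines with a
// length-limited fgets() without counting long lines more than once.
function countLinesWithFgets(string $path, int $maxLen = 4096): int
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Unable to open $path");
    }

    $count = 0;
    $partial = false; // true while we are inside a line with no "\n" seen yet

    // fgets() returns at most $maxLen - 1 bytes and stops after a "\n".
    while (($piece = fgets($handle, $maxLen)) !== false) {
        if (substr($piece, -1) === "\n") {
            $count++;          // the line is complete
            $partial = false;
        } else {
            $partial = true;   // line continues, or the file ends without "\n"
        }
    }

    fclose($handle);

    // A last line without a trailing line break still counts as a line.
    return $partial ? $count + 1 : $count;
}
```

Calling countLinesWithFgets("teste.txt") would then return the count; memory use stays bounded by the length limit no matter how long a single line is.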

$file = fopen("teste.txt",'rb');
$count = 0;
while (!feof($file)) {
    $chunk = fread($file, 4096); // I would probably use a larger value, never a smaller one
    $count += substr_count($chunk, "\n");
}
fclose($file);

I put it on GitHub for future reference.

A line break can be more than one character long, and one character may end up in one chunk and the other in the next. Then it won't be counted.

Production-ready solutions would have to take this into account and handle it when it happens. This is easier to solve in the second algorithm, which also has the advantage of never filling up memory.

Run tests to find the best chunk size. I used 4 KB because it is the memory page size and the most common file system cluster size. Smaller values will be worse and carry more risk of cutting a line in half, which disturbs both algorithms. Larger values can give much better results. I would venture that bigger is better, but it depends on the hardware, OS, usage pattern, etc. Something can look good in testing and still cause problems in production. If you could read the whole file at once, that would always be the fastest and risk-free option.

Guilherme Lautert raised the point that \n is just the line feed and therefore would not cause a problem at the break. But on Windows the line break is \r\n (I have no way to test). One of two things is true: either PHP treats the \n in the code as the full line break and what I described happens, or this code would not work properly on Windows, requiring \r\n in the code to catch the break correctly, which would again risk the break sequence being split across two chunks.
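For reference, "\n" in a double-quoted PHP string is always the single LF byte, so counting "\n" alone already covers both Unix (\n) and Windows (\r\n) files. The boundary problem only appears if a lone "\r" should also count as a break. A minimal sketch of the chunk-boundary check, assuming a hypothetical helper name countLineBreaks and an arbitrary 64 KB default chunk size:

```php
<?php
// Hypothetical helper (not from the original answer): counts line breaks
// chunk by chunk, treating "\r\n", "\n" and a lone "\r" as one break each,
// even when "\r\n" is split across two chunks.
function countLineBreaks(string $path, int $chunkSize = 65536): int
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Unable to open $path");
    }

    $count = 0;
    $endedWithCr = false; // did the previous chunk end in "\r"?

    while (($chunk = fread($handle, $chunkSize)) !== false && $chunk !== '') {
        // A leading "\n" completes a "\r\n" pair whose "\r" was already
        // counted at the end of the previous chunk, so drop it.
        if ($endedWithCr && $chunk[0] === "\n") {
            $chunk = substr($chunk, 1);
        }
        $endedWithCr = $chunk !== '' && substr($chunk, -1) === "\r";

        // Every "\n" and every "\r" is a break, except that each "\r\n"
        // pair would be counted twice, so subtract the pairs once.
        $count += substr_count($chunk, "\n")
                + substr_count($chunk, "\r")
                - substr_count($chunk, "\r\n");
    }

    fclose($handle);
    return $count;
}
```

This returns the number of line breaks, not lines: a file whose last line has no trailing break has one more line than the value returned.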

1

Similar to the one on SOEN, hahaha. +1

– Wallace Maxters

2016/05/13 at 17:05
But why 4096? Isn't it enough to just use fgets?

– Wallace Maxters

2016/05/13 at 17:07
These stream functions are making me more excited to work with procedural PHP.

– Wallace Maxters

2016/05/13 at 17:18
This second solution seems ideal to me, because it reuses the memory from chunk to chunk. But about "The line break can have more than one character and stay one character in a Chunck and the other in the next Chunck. Then it won't count.": why wouldn't it? Even if I had \n\n and each one ended up in a different chunk, it would count 1+1.

– Guilherme Lautert

2016/05/13 at 17:30
A small detail (I also get this confused): is it "Chunck" or "Chunk"?

– Wallace Maxters

2016/05/13 at 17:33
The problem is that \n is abstract. In fact, a single \n may stand for a specific CR byte followed by an LF. Only together are they the line break; separately, each one is nothing. I imagine the algorithm does not consider them separately (on Windows, for example); if it does, it will count two breaks when in fact there is only one.

– Maniero

2016/05/13 at 17:34
1

@Wallacemaxters yes, I kept reading it and feeling that something was wrong, but I couldn't see what :)

– Maniero

2016/05/13 at 17:35
@bigown, in this case wouldn't the PHP_EOL constant solve it (I think it changes according to the OS)?

– Wallace Maxters

2016/05/13 at 17:38
@bigown if you use fgets without the second parameter, it will read the whole line. Isn't that good enough? After all, if I want to count the lines, their size doesn't matter, right?

– Wallace Maxters

2016/05/13 at 17:39
But the CR (\r) is the "carriage return": on Windows it only serves to move the pointer back to the beginning of the line, otherwise it would just go down. This depends on the reading software, because some already apply one when they see the other. The actual line break is the LF (\n); the CR (\r) is just attached to it.

– Guilherme Lautert

2016/05/13 at 17:40
@Wallacemaxters I don't know PHP's exact mechanism, but as far as I know \n is precisely PHP_EOL, which changes according to the OS. What if the line is 2 GB long? That is, what if there is no break at all? Or there are only a few and the lines still get too big? I wrote an algorithm that solves your problem. Being so crude, it causes another one. Nothing hard to solve, especially in the second algorithm: I'd just have to check the last character of one chunk and the first of the next. If together they form a line break, it counts; otherwise it doesn't. If I find the patience, I'll do it, and I'll see whether that's all there is to it :)

– Maniero

2016/05/13 at 17:42
1

@Guilhermelautert I'm not as sure as you that PHP works like that. I would need to test. If it does, there would still be a problem if the line break is \r\n and the text contains a lone \n that is not actually a line break.

– Maniero

2016/05/13 at 17:49


by zwitterion • **2,876** points · Answer 1 · 2016-05-13T19:54:35+00:00

Both methods (fgets() and file()) use a loop to read the file, which is inevitable. Either implicitly or explicitly, there will be a loop going through all the lines of the file.

But you only want to know the number of lines, so the file size does not matter: you only keep a counter, not the lines themselves. Do this:

$myfile = fopen("meuArquivo.txt", "r") or die("Unable to open file!");
$count = 0;
// fgets() advances the file pointer; without a read the loop would never reach the end
while (fgets($myfile) !== false) {
  $count++;
}
fclose($myfile);
echo $count;

by Wallace Maxters • **102,340** points · Answer 2 · 2016-05-13T17:17:27+00:00

As a lover of OOP in PHP, I would do this with the SplFileObject class:

$file = new \SplFileObject('file.extension', 'r');
$file->seek(PHP_INT_MAX);
echo $file->key() + 1;

I use PHP_INT_MAX to seek to the last line of the file, which is possible because SplFileObject implements SeekableIterator.

Then, since line numbering starts at 0, I have to add 1 to get the right value.

Another detail: since I'm using SplFileObject, a large file is iterated line by line, which saves memory and makes it possible to count a giant file without freezing the script.
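The line-by-line iteration can also be used directly instead of seeking. A minimal sketch, assuming only non-empty lines should be counted (a different measure from the seek approach above) and using a hypothetical helper name countNonEmptyLines:

```php
<?php
// Hypothetical helper (not from the original answer): counts the non-empty
// lines of a file by iterating a SplFileObject one line at a time.
function countNonEmptyLines(string $path): int
{
    $file = new SplFileObject($path, 'r');

    // DROP_NEW_LINE strips the line break, SKIP_EMPTY skips lines that are
    // then empty, and READ_AHEAD is needed for SKIP_EMPTY to take effect
    // during iteration.
    $file->setFlags(
        SplFileObject::READ_AHEAD
        | SplFileObject::SKIP_EMPTY
        | SplFileObject::DROP_NEW_LINE
    );

    // iterator_count() walks the iterator without storing the lines.
    return iterator_count($file);
}
```

Only one line is held in memory at a time, so the size of the file does not matter, although a single very long line still has to fit in memory.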

by Guilherme Lautert • **15,097** points · Answer 3 · 2016-05-13T17:13:37+00:00

As a lover of regex, I propose:

$content = file_get_contents("file_name");          //  reads the whole file
$content = preg_replace('~[^\n]~', '', $content);   //  removes everything that is not a line break (\n)
print_r(strlen($content)+1);                        //  counts how many bytes are left, +1 because the end of the file has no \n
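The +1 assumes the file does not end with a line break. A minimal sketch of the same idea using substr_count() instead of a regex, which adds the extra line only when that assumption holds; the helper name countLinesInString is hypothetical:

```php
<?php
// Hypothetical helper (not from the original answer): counts the lines in a
// string already loaded into memory, with or without a trailing "\n".
function countLinesInString(string $content): int
{
    if ($content === '') {
        return 0; // an empty file has no lines
    }

    $breaks = substr_count($content, "\n");

    // If the content ends with "\n", every line is already terminated;
    // otherwise the last, unterminated line must be added.
    return substr($content, -1) === "\n" ? $breaks : $breaks + 1;
}
```

Like the regex version, this is meant to be fed the result of file_get_contents(), so the whole file still has to fit in memory.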