How do I find out how many lines a large file has in PHP?

6

How do I know the number of lines in a file using PHP?

I know there are functions like file, which returns all lines of the file in an array, so we could just call count on the result. The problem is that I need to do this for a 60 MB file, and I don't think using file is a good idea in that case.

Is there some other way to do it?

How can I find out, for example, how many lines a 2 GB file has without blowing PHP's memory limit?

Is there a smarter way to count the lines of a file in PHP?

  • 3

    Take a look at the answer to the question Efficiently Counting on SOEN.

  • 2

    @David, now you've got me: I have an answer in English there, hahaha... But since I want to help the Brazilian/Portuguese-speaking community, I thought I'd post it here too.

  • @David it is not wrong to do this here, but some people see it as bad faith. In fact, a lot of what is already on SOEN is repeated here.

  • I also think it's fine to have it repeated, because someone who only partly understands English may miss some detail when interpreting the answer, and here in Portuguese it is clearer :)

  • If you're feeling brave, you can take a look here: http://code.metager.de/source/xref/gnu/coreutils/src/wc.c. As far as I could see, wc works by reading chunks and going through them character by character.

4 answers

5

Both methods (fgets() and file()) use a loop to read the file, which is inevitable. Whether implicitly or explicitly, there will be a loop going through all the lines of the file.

But you only want the number of lines, so the file size does not matter: the only thing kept in memory is a counter. Do this:

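$myfile = fopen("meuArquivo.txt", "r") or die("Unable to open file!");
$count = 0; // start the counter explicitly
while (!feof($myfile)) {
  fgets($myfile); // read one line and discard it; this advances the file pointer
  $count++;
}
fclose($myfile);
echo $count;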
  • I think there's a mistake there. Without fgets, the loop never moves on to the next line, so it runs forever.

  • Hi Wallace. I don't think so. See: fgets - Gets line from file pointer. (He doesn't want the content of the line, but the loop still goes through it. He just wants to know how many lines the file has.) And while(!feof ... means: as long as the end of the file has not been reached, move the pointer to the next line. The loop's end condition is the negation of feof(). http://php.net/manual/en/function.fgets.php and http://php.net/manual/en/function.feof.php
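For reference, here is a minimal sketch of the loop discussed in these comments, written so that it stops on the return value of fgets() instead of on feof(). The function name countLinesWithFgets and the temporary file are hypothetical and only serve as illustration; with this loop condition a trailing line break does not add an extra line to the count.

```php
<?php
// Hypothetical helper: counts lines by reading one line at a time.
function countLinesWithFgets(string $path): int
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Unable to open file: $path");
    }

    $count = 0;
    // fgets() returns false when there is nothing left to read,
    // so the loop body runs exactly once per line.
    while (fgets($handle) !== false) {
        $count++;
    }
    fclose($handle);

    return $count;
}

// Usage with a throwaway file (assumed content, for illustration only).
$path = tempnam(sys_get_temp_dir(), 'lines');
file_put_contents($path, "first\nsecond\nthird\n");
echo countLinesWithFgets($path), PHP_EOL; // 3
unlink($path);
```

Only one line is held in memory at a time, so the memory used depends on the longest line rather than on the file size.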

3


You have to read the data in chunks. Something like this:

$file = fopen("teste.txt",'r');
$count = 0;
while (!feof($file)) {
    $line = fgets($file, 4096); // I would probably use a larger value, never a smaller one
    $count++;
}
fclose($file);

I set a limit of 4096 bytes because you are at risk if the file is very big and does not have enough line breaks to keep the chunks small. This solution is not perfect. A better one would need a much more sophisticated algorithm.

I also thought of another one, which has problems too:

$file = fopen("teste.txt",'rb');
$count = 0;
while (!feof($file)) {
    $chunk = fread($file, 4096); // I would probably use a larger value, never a smaller one
    $count += substr_count($chunk, "\n");
}
fclose($file);

I put it on GitHub for future reference.

A line break can consist of more than one character, with one character ending up in one chunk and the other in the next. In that case it would not be counted.

Production-ready solutions would have to take this into account and handle it when it happens. This is easier to solve in the second algorithm, which also has the advantage of never filling up memory.

Run tests to find the best chunk size. I used 4 KB because it is the memory page size and the most common file system cluster size. Smaller values will be worse and carry more risk of cutting a line in half, which upsets both algorithms. Larger values can give much better results. I would venture to say the bigger the better, but it depends on the hardware, OS, usage pattern, etc. It may do well in testing and still cause problems in production. If you could read the whole file at once, that would always be the fastest and risk-free option.

Guilhermelautert raised the point that \n is just the line feed and therefore would not cause a problem at the chunk boundary. But on Windows the line break is \r\n (I have no way to test this). One of two things is true: either PHP treats the \n in code as the full line break, and what I described happens, or this code would not work properly on Windows and would need \r\n in the code to catch the break correctly, which would bring the same problem of the break sequence being split between chunks.
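On the split-delimiter concern: in a PHP double-quoted string, "\n" is the single line-feed byte, and a file opened in 'rb' mode is read without newline translation. A rough sketch of the second algorithm under those conditions follows. It assumes lines end in LF or CRLF (CR-only files are not handled), and the function name countLinesInChunks and the 64 KB default are hypothetical choices, not something from the answer above.

```php
<?php
// Hypothetical helper: counts lines by reading fixed-size chunks.
function countLinesInChunks(string $path, int $chunkSize = 65536): int
{
    $handle = fopen($path, 'rb'); // binary mode: no newline translation
    if ($handle === false) {
        throw new RuntimeException("Unable to open file: $path");
    }

    $count = 0;
    $lastByte = "\n"; // so that an empty file yields zero lines
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        // "\n" is one byte, so a chunk boundary can never split it;
        // a Windows "\r\n" pair still contributes exactly one "\n".
        $count += substr_count($chunk, "\n");
        $lastByte = $chunk[strlen($chunk) - 1];
    }
    fclose($handle);

    // Count a final line that has no trailing line feed.
    return $lastByte === "\n" ? $count : $count + 1;
}

// Usage: a tiny chunk size forces "\r\n" pairs across chunk boundaries.
$path = tempnam(sys_get_temp_dir(), 'chunks');
file_put_contents($path, "a\r\nb\r\nc");
echo countLinesInChunks($path, 2), PHP_EOL; // 3
unlink($path);
```

Memory use is bounded by the chunk size regardless of how long the individual lines are, which is the advantage the answer attributes to the second algorithm.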

  • 1

    Similar to the one on SOEN, hahaha +1

  • But why 4096? Isn't it enough to just use fgets?

  • These stream functions are making me more excited to play with the procedural side of PHP

  • I believe this second solution is the ideal one, because it reuses memory in chunks. But about "The line break can have more than one character, with one character in one chunk and the other in the next. Then it won't be counted." Why wouldn't it? Even if there is \n\n and each \n lands in a different chunk, it will count 1+1.

  • A small detail (I also get this confused): is it "Chunck" or "Chunk"?

  • The problem is that \n is abstract. In fact, a single \n may actually be two specific bytes, a CR followed by an LF. Only together are they the line break; separately, they are nothing. I imagine the algorithm does not treat them separately (on Windows, for example); if it does, it will count two breaks when in fact there is only one.

  • 1

    @Wallacemaxters yeah, I was reading it and thinking something was wrong, but I couldn't see what :)

  • @bigown wouldn't the PHP_EOL constant solve this case (I think it changes according to the OS)?

  • @mustache if you use fgets without the second parameter, it will read the whole line. Isn't that good enough? After all, if I want to count the lines, their size doesn't matter, right?

  • But the CR is the "carriage return": on Windows it only serves to move the cursor back to the beginning of the line, otherwise it would just go down. This depends on the software reading the file, because some programs already apply an LF when they see a CR. The actual line break is the LF (\n); the CR (\r) is just attached to it.

  • @Wallacemaxters I don't know PHP's exact mechanism, but as far as I know \n is precisely PHP_EOL, which changes according to the OS. And what if the line is 2 GB long? That is, what if the file has no line break at all? Or if it has a few and the lines still get too big? I wrote an algorithm that solves your problem. Being so crude, it causes another one. Nothing hard to solve, especially in the second algorithm. I would just have to check the last character of one chunk and the first character of the next. If together they form a line break, it counts; otherwise it doesn't. If I find the patience, I'll do it. And I'll check whether that is all there is to it :)

  • 1

    @Guilhermelautert I'm not as sure as you are that PHP works like this. I would need to test it. Even if it does, there would still be a problem when the line break is \r\n and the text contains a lone \n that is not actually a line break.


3

As an OOP lover in PHP, I would do this with the SplFileObject class:

$file = new \SplFileObject('file.extension', 'r');
$file->seek(PHP_INT_MAX);
echo $file->key() + 1; 

I use PHP_INT_MAX to seek to the last line of the file, which is possible because SplFileObject implements SeekableIterator.

Then, since line numbering starts at 0, I have to add 1 to get the right value.

Another detail: since I'm using SplFileObject, the large file is iterated line by line, which saves memory and makes it possible to count a giant file without locking up the script.

  • This should be the answer marked as accepted. Using the other answer, you could also do: while (($line = fgets($file, 8192)) !== false) { $count++; }.

2

As a regex lover, I propose:

$content = file_get_contents("file_name");          //  READS THE WHOLE FILE
$content = preg_replace('~[^\n]~', '', $content);   //  REMOVES EVERYTHING THAT IS NOT A LINE BREAK (\n)
print_r(strlen($content)+1);                        //  COUNTS HOW MANY BYTES ARE LEFT, +1 BECAUSE THE END OF THE FILE HAS NO \n
  • 1

    Man, this technique is interesting, it makes sense. My concern is with 2 GB files: in that case it would probably crash the PC if you tried to run it in PHP.

  • I imitated you... "OOP Lover"
