How to discover the CSV delimiter by PHP?

Asked

Viewed 124 times

4

A CSV file can vary the delimiters. I’ve seen it used ; (point-and-comma), , (comma) and up to \t (tab).

In PHP, I know it is possible to capture CSV values by informing the delimiter:


$delimitador = ';';

$arquivo = fopen('meu_csv.csv', 'r');

$cabecalho = fgetcsv($arquivo, 0, $delimitador);

if (feof($arquivo)) {
    exit('Você importou um CSV vazio');
}


while ($linha = fgetcsv($arquivo, 0, $delimitador)) {
    $linhas[] = array_combine($cabecalho, array_map('trim', $linha));
}

However, instead of using a fixed delimiter, I need to figure out which delimiter is used in a CSV file. Is there any way to do this in PHP? What would be the most appropriate way?

  • 3

    Phospreadsheet infers the delimiter with a statistical approach: https://github.com/PHPOffice/PhpSpreadsheet/blob/02f37d4f7e3580c07e33ad912da2b859bcceec/src/PhpSpreadsheet/Reader/Csv/Delimiter.php#L67

1 answer

6

Use strpos or preg_match may conflict, as some CSV may come like this:

"foo,bar","Nome";"Telefone"

Or else:

"foo,bar";"Nome","Telefone"

Depending on the type of check as:

$handle = fopen('arquivo.csv', 'r');

preg_match('/[;,\t]/', fgets($handle), $matches);

$delimitador = $matches[0] ?? null;

var_dump($delimitador);

In the example above testing with "foo,bar";"Nome","Telefone" will generate that output:

string(1) ","

When the right thing would be semicolon ;, detect the exhaust (I don’t mean the exhaust itself, which is something that can be adjusted, but is usually used the \) is somewhat complicated to do manually, the ideal would be to use the own fgetcsv(), but of course anyway have to inform the delimiter before.

All editors I used before opening the file ask the user to inform the delimiter before, usually giving some suggestions before or even a custom suggestion, for example on Calc (Libreoffice):

Calc (LibreOffice)

So the idea is that the user enters the delimiter and then his script validates the validator using the itself fgetcsv(), then in front-end can do something like:

<select name="delimitador">
    <option value=";">;</option>
    <option value=",">,</option>
    <option value="\t">\t (TAB)</option>
</select>

You can add custom values, since the delimiter may vary, this is just an example.

In the back end you can validate so:

$delimitador = $_POST['delimitador'];

if (!in_array($delimitador, [ '.', ',', "\t" ])) {
    die('Delimitador inválido!');
}

$arquivo = fopen('meu_csv.csv', 'r') or die('Não pode ler o arquivo');

// Obtém o cabeçalho conforme instrução do usuário
$cabecalho = fgetcsv($arquivo, 0, $delimitador);

// Se o delimitador não tiver um numero aceitável de linhas irá falhar
if (count($cabecalho) < 2) {
    die('Delimitador inválido para esse arquivo!');
}

print_r($cabecalho);

This way will count if returned at least 2 columns, a value you can adjust too, but here fits only for example. The fgetcsv() recognizes what this escaped.

Remembering that you can adjust other parameters:

fgetcsv( resource $stream , int $length = 0 , string $separator = "," , string $enclosure = '"' , string $escape = "\\" ) : array

What you can pass to the user to configure before sending the file, but still use the same "validation" process, provided the minimum number of columns.

I will not say that it has no way of detecting, but the amount of work that will generate creating a script like this will make the process slower and possibly flawed, apart from the complication of separator, enclosure or escape customized in the user’s CSV document. And yes, there are ready libs that promise to detect this, but it seems to me a cannon to kill ants and if in fact this was to detect automatically was really good would not be necessary programs like Calc (libreoffice) allow user to configure in preview (if I’m not mistaken Msexcel also has something like this).

But even if you want to insist on it I recommend limiting the allowed delimiters and still use the fgetcsv combined rewind for every test, something like:

$delimitadoresPermitidos = array( ',', ';', "\t" );

$arquivo = fopen('meu_csv.csv', 'r') or die('Não pode ler o arquivo');

$detectado = null;
$cabecalho = null;

foreach ($delimitadores as $delimitador) {

   // Para cada delimitador testado é necessário voltar a primeira linha
   rewind($arquivo);

   $cabecalho = fgetcsv($arquivo, 0, $delimitador);

   // Se retornar false é porque existe algum erro ou o documento possivelmente esta vazio
   if ($cabecalho === false) {
       die('Arquivo provavelmente vazio');
   }

   // Se o delimitador testado retornar 2 ou mais colunas então irá usar esse delimitador
   if (count($cabecalho) > 1) {
       $detectado = $delimitador;
       break;
   }
}

if ($detectado === null) {
    die('Documento inválido');
}

// Cabeçalho
print_r($cabecalho);

// Lê outras linhas usando o delimitador detectado
while (($linha = fgetcsv($handle, 0, $detectado)) !== FALSE) {
    print_r($linha);
}

Browser other questions tagged

You are not signed in. Login or sign up in order to post.