"Convert" an HTML list to Mysql with PHP

Asked

Viewed 616 times

3

I have a large list in HTML with about 106000 lines of code. Where these lines are records and these records are subdivided into:

  • 6 lines of information about a game (As name and year of publication)
  • 1 line break

Therefore, each "record" in my HTML has 7 lines and therefore I have more than 15000 different records. The data looks like this because it was removed from a list on the web and treated with PHP to be as pleasant as possible.

HTML data are presented as follows::

<h4>Jogo: Area 51</h4>
<li>Região: 2 - </li>
<li>Sistema: 8 - Sony PlayStation</li>
<li>Ano: 2003</li>
<li>Publicadoras: 1190 - Midway, 730 - GT Interactive</li>
<li>Desenvolvedora: 1165 - Mesa Logic</li>
<!-- Quebra de linha, não é comentada assim no código original -->

Therefore, it is practically impossible to put all data in my database manually and accurately in a form of at least convert to XML (or a way to already play everything straight in BD), so that somehow you can pass to the BD that is Mysql.

PS: In the middle of the process, it is important that the browser does not stop and stop responding so that there is no addition of repeated data!

  • 1

    It would be nice a definitive guide of those around here.

  • In fact, I can only support.. hahaha

  • 1

    I think only with the DOM can solve your case. And you could keep trying until you answer ;)

  • Now it’s time to sleep, I’ve been breaking my head for a long time. Leave the HTML like this, cute already gave a lot of trampo. hahaha

  • 1

    @Gabrieltadramainginski sleep? What is this? Programmers do not sleep :)

  • Your list looks like this: http://pastebin.com/muEge8cw?

  • It has several ways to solve. A simple and fast medium, just use explode(), strpos(), substr() and so on. In less than 5 minutes I can finish this.

  • Yes @Marcosv, just like that.

  • @gmsantos students out of test days sleep HSUAHS

  • @Danielomine, I thought about doing this, but I imagine it is an extremely slow operation, because it would give several vectors with thousands of positions, but if you think it is possible, send bullet! :)

  • Can you make this list (file) available so that I have a better view and can give you an answer on your case? Maybe modify it to make these fields encapsulated by a <div> to make data capture easier.

  • List - Note: I will probably use only the Ids, and there are some data that have 2 developers or 2 publishers or 2 regions.

  • Gabriel, the performance in this case is indifferent. Using vector, using ER.. by the way, ER, depending on the complexity is much slower. But I didn’t understand the "shoot".. I mean, you expect me to make a script? rsrsrs the job is yours, right?? rsrsrs

  • Well, the joke here is to suggest solutions and well, you said, "In less than five minutes I can finish this.".

Show 9 more comments

2 answers

2


I quickly made a parser for your case here:

txt games.

<h4>Jogo: Area 51 : teste de : no nome do jogo</h4>
<li>Região: 2 - </li>
<li>Sistema: 8 - Sony PlayStation</li>
<li>Ano: 2003</li>
<li>Publicadoras: 1190 - Midway, 730 - GT Interactive</li>
<li>Desenvolvedora: 1165 - Mesa Logic</li>

<h4>Jogo: Area 52</h4>
<li>Região: 3 - </li>
<li>Sistema: 9 - Sony PlayStation 2</li>
<li>Ano: 2004</li>
<li>Publicadoras: 1191 - Midway, 730 - GT Interactive</li>
<li>Desenvolvedora: 1166 - Mesa Logic</li>

<h4>Jogo: Area 53</h4>
<li>Região: 4 - </li>
<li>Sistema: 10 - Sony PlayStation 3</li>
<li>Ano: 2005</li>
<li>Publicadoras: 1192 - Midway, 730 - GT Interactive</li>
<li>Desenvolvedora: 1167 - Mesa Logic</li>

PHP:

$lines = file('jogos.txt');

$jogos = array();
foreach ($lines as $line_num => $line) {
    $line = strip_tags($line);
    if (substr($line, 0, 5) == 'Jogo:') $i++;
    list($key, $value) = preg_split('/:/', $line, 2);
    if ($value) $jogos[$i][$key] = $value;
}

print_r($jogos);

Return:

Array
(
    [1] => Array
        (
            [Jogo] => Area 51 : teste de : no nome do jogo
            [Região] => 2 - 
            [Sistema] => 8 - Sony PlayStation
            [Ano] => 2003
            [Publicadoras] => 1190 - Midway, 730 - GT Interactive
            [Desenvolvedora] => 1165 - Mesa Logic
        )
    [2] => Array
        (
            [Jogo] => Area 52
            [Região] => 3 - 
            [Sistema] => 9 - Sony PlayStation 2
            [Ano] => 2004
            [Publicadoras] => 1191 - Midway, 730 - GT Interactive
            [Desenvolvedora] => 1166 - Mesa Logic
        )
    [3] => Array
        (
            [Jogo] => Area 53
            [Região] => 4 - 
            [Sistema] => 10 - Sony PlayStation 3
            [Ano] => 2005
            [Publicadoras] => 1192 - Midway, 730 - GT Interactive
            [Desenvolvedora] => 1167 - Mesa Logic
        )
)

Now just use the array to generate the Insert in the BD...

Note: I tested with a file with 11.5mb, 384257 lines and 54894 records and generated the array in less than 15 seconds.


Example with ID separation:

PHP:

$lines = file('jogos.txt');

$jogos = array();
foreach ($lines as $line_num => $line) {
    $line = strip_tags($line);
    if (substr($line, 0, 5) == 'Jogo:') ++$i;
    list($key, $value) = preg_split('/:/', $line, 2);
    $value = trim($value);
    if ($value) {
        if (strpos($value, '-') !== false) {
            if (strpos($value, ',') !== false) {
                $j = 0;
                foreach(explode(', ', $value) as $item) {
                    list($id, $str) = preg_split('/ -/', $item, 2);
                    $str = trim($str);
                    $jogos[$i][$key][$j]['id_'.$key] = $id;
                    if ($str) $jogos[$i][$key][$j][$key] = $str;
                    $j++;
                }
            } else {
                list($id, $str) = preg_split('/ -/', $value, 2);
                $str = trim($str);
                $jogos[$i]['id_'.$key] = $id;
                if ($str) $jogos[$i][$key] = $str;
            }
        } else {
            $jogos[$i][$key] = $value;
        }
    }
}

Return:

Array
(
    [1] => Array
        (
            [Jogo] => Area 51 : teste de : no nome do jogo
            [id_Região] => 2
            [id_Sistema] => 8
            [Sistema] => Sony PlayStation
            [Ano] => 2003
            [Publicadoras] => Array
                (
                    [0] => Array
                        (
                            [id_Publicadoras] => 1190
                            [Publicadoras] => Midway
                        )

                    [1] => Array
                        (
                            [id_Publicadoras] => 730
                            [Publicadoras] => GT Interactive
                        )

                )

            [id_Desenvolvedora] => 1165
            [Desenvolvedora] => Mesa Logic
        )

    [2] => Array
        (
            [Jogo] => Area 52
            [id_Região] => 3
            [id_Sistema] => 9
            [Sistema] => Sony PlayStation 2
            [Ano] => 2004
            [Publicadoras] => Array
                (
                    [0] => Array
                        (
                            [id_Publicadoras] => 1191
                            [Publicadoras] => Midway
                        )

                    [1] => Array
                        (
                            [id_Publicadoras] => 730
                            [Publicadoras] => GT Interactive
                        )

                )

            [id_Desenvolvedora] => 1166
            [Desenvolvedora] => Mesa Logic
        )

    [3] => Array
        (
            [Jogo] => Area 53
            [id_Região] => 4
            [id_Sistema] => 10
            [Sistema] => Sony PlayStation 3
            [Ano] => 2005
            [Publicadoras] => Array
                (
                    [0] => Array
                        (
                            [id_Publicadoras] => 1192
                            [Publicadoras] => Midway
                        )

                    [1] => Array
                        (
                            [id_Publicadoras] => 730
                            [Publicadoras] => GT Interactive
                        )

                )

            [id_Desenvolvedora] => 1167
            [Desenvolvedora] => Mesa Logic
        )

)
  • @Gabrieltadramainginski change the explode to preg_split limited in 2, And notice is not a problem, probably not seen here because I have disabled notices, turn off the notices and see if it works ok...

  • Thanks for the answer and attention! But I got some problems: - The game may have ': ' in your name, but this I can solve alone. - Notice: Undefined offset: 1 in C: xampp htdocs RETROPLUS data-to-base.php on line 13, where the line 13 is the one that pulls the array of the explode. I made a print_r of the explode and it returned the last vector so: Array ( [0] => ).

  • I just replaced it so you wouldn’t be talking to yourself.. xD I’ll see what I can do here!

  • @Gabrieltadramainginski I edited the answer using the limited preg_split, tested and now breaks only in the first : of the line.

  • still the problem of the game, but I changed the foreach(explode(', ', $value) as $item) { for foreach(preg_split('/, /', $value, 2) as $item) {, leaving in a way that I can proceed alone! Thank you very much for your help, will be in the project’s thanks! :)

2

I would like to propose an alternative that aims to screw with a screwdriver and not with a hammer, that is, to syntactically analyze a hierarchical structure with a parser for real.

This is one of the few situations where the excessive verbosity of GIFT does not hinder the solution of the problem. However, for this solution to work according it is necessary that HTML is semantically formulated. For this reason I will be assuming an HTML that contains <UL tags>:

<h4>Jogo: Area 51</h4>
<ul>
<li>Região: 2 - </li>
<li>Sistema: 8 - Sony PlayStation</li>
<li>Ano: 2003</li>
<li>Publicadoras: 1190 - Midway, 730 - GT Interactive</li>
<li>Desenvolvedora: 1165 - Mesa Logic</li>
</ul>

The solution:

$dom = new DOMDocument;
$dom -> loadHTML( $html );

$data = array();

foreach( $dom -> getElementsByTagName( 'ul' ) as $node ) {

    if( $node -> hasChildNodes() ) {

        foreach( $node -> childNodes as $children ) {

            $nodeValue = trim( $children -> nodeValue );

            if( ! empty( $nodeValue ) ) {

                $structure = preg_split(

                    '/(.*?):\s+(.*?)/', $nodeValue, -1,

                    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
                );

                $data[ spl_object_hash( $node ) ][ $structure [ 0 ] ] = $structure[ 1 ];
            }
        }
    }
}

We iterate all <UL> elements through DOM::getelementsbytagname(). Of all that $Node will work only with the property value Domnode::nodeValue

Here opens the first possibility of gambiarra. We could explode the line breaks of this value and mount the indexes of the array directly. But we’re parsing syntactically, so this is wrong, and so we need to iterate on the list children.

We could instead get all the lists, get directly the children (<LI>), but this would require additional code and personally would make less logical sense.

To avoid mistakes, notices and cia. we will check if there are we-children, even if we are seeing that these exist. For this, we use Domnode::hasChildNodes() and, if they exist, we will work with the value of the property Domnode::childNodes

From that point on we are no longer analyzing syntactically but manipulating the text of the nodes. Let’s break every string, already cleaned with Trim(), separating the possible labels from their value.

It’s not within the scope of the topic to explain the ER, but as you can see it’s quite simple.

When adding to the array $date we need a way to make every single piece of information unique. We could do a little trick with a manually incremented counter but since we are with many objects in play, I opted for spl_object_hash() which returns a unique numerical sequence for each object at runtime, that is, each time you update p gina, will be other values.

This is purely structural, when iterating this array to insert into the database just ignore the value of the first key. Simple as this!

  • Thank you for giving me your time @Brunoaugusto! My HTML was not that standardized, which, as you yourself said, would not be possible. As I am running the project on localhost, without worrying about the quality of the code, I used the solution that would need less or no adaptation of my HTML code. But I recognize that your solution is the most ideal if I considered using it without any 'gambiarra''.

  • Perfect, but still, unless mistaken, it is possible to fix the HTML to be imported at runtime with the DOM itself, before reading its data. I’ll see if I can put something together here and edit the answer.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.