Substring(string.IndexOf) is returning unwanted parts


I'm scraping a music site. I want to extract only two pieces of information: the artist and the song. They are in this snippet:

<div class="nowOnAir">
            <a href="http://www.radioitalia.it/artista/edoardo_bennato/1.php" onclick="javascript:loadUrl(this.href);return false;" class="autore" title="Scopri tutto su edoardo bennato">
                edoardo bennato            </a><br />
            <span>le ragazze fanno grandi sogni</span>

        </div>

Artist = edoardo bennato

Song = le ragazze fanno grandi sogni

I'm trying to extract them like this:

string musica = resposta.Substring(resposta.IndexOf("<span>"), resposta.IndexOf("</span>"));
string artista = resposta.Substring(resposta.IndexOf("autore"), resposta.IndexOf("</a><br />"));

For the artist that is fine, since I know there is more content to skip. For the song, though, I expected the result to be exactly right, yet the variable ends up with the following content:

"<span>le ragazze fanno grandi sogni</span>\n            \n        </div>\n     \t\n        \n        \n                \n        \n        \n        \n        \n        \n        \n        \n        \n        \n        \n        \n        \n        \n        <div class=\"iTunes\">\n        \n                \n           <a href=\"http://www.amazon.it/gp/redirect.html?camp=2025&creative=165953&location=http%3A%2F%2Fwww.amazon.it%2Fgp%2Fsearch%3Fkeywords%3Dsolo%252Cclaudio%2Bbaglioni%26url%3Dsearch-alias%253Ddigital-music&linkCode=xm2&tag=radiital-21&SubscriptionId=AKIAINZG7TF6TOXSKWSQ\" target=\"_blank\">\n           <img src=\"http://static.ritalia.nohup.it/img/2014/acquista_amazon.jpg\" title=\"Acquista su Amazon\"  alt=\"Acquista su Amazon\" />\n           </a>\n        \n        \n        \n\t\t       <!--http://clk.tradedoubler.com/click?p=24373&a=1945182&url= -->\n        \t<a style=\"background:none;\" href=\"https://itunes.apple.com/it/album/solo/id956867691?i=956867694&uo=4\" target=\"_blank\"><img src=\"http://static.ritalia.nohup.it/img/2013/Download_on_iTunes_Badge_IT_110x40_0824.png\" title=\"Scarica su itunes\"  alt=\"scarica\"/></a>\n       \n           \t\t</div>\n\t\t\n\t\t<script>\n        $(document).ready(function(){\n            var mostra=0;\n            $(\".last5\").mousedown(function(){\n                if(mostra==0){\n                    $(\".songs\").fadeIn(\"fast\");\t\n                    mostra=1;\t\n                }else{\n                    $(\".songs\").fadeOut(\"fast\");\t\n                    mostra=0;\t\n                }\n            });\n        \n        \n        });\n\t\t\n        </script>\n       \n        \n        \n        \n     \n    <div class=\"fotoArtista\">    \n       \n    \t<a href=\"http://www.radioitalia.it/multimedia/galleria/artista/1/claudio_baglioni/684.php\" onclick=\"javascript:loadUrl(this.href);return false;\" title=\"Guarda tutte le foto di claudio baglioni\">Foto: 53</a>\n   \n        \n    \t<a 
href=\"http://www.radioitalia.it/multimedia/video/artista/1/claudio_baglioni/1999.php\"  onclick=\"javascript:loadUrl(this.href);return false;\" title=\"Guarda tutte i video di claudio baglioni\">Video: 35</a>\n\t\n    \t\n    </div>\n        <div class=\"newsArtista\">\n    \t<a href=\"http://www.radioitalia.it/news/1/index.php\"  onclick=\"javascript:loadUrl(this.href);return false;\">\n    \t    Tutte le news\n        </a>\n\t</div>\n        \n        \n        \n        \n        \n        <div class=\"correlati\">\n            <h3>Artisti consigliati</h3>\n            <ul>\n                                <li><a href=\"http://www.radioitalia.it/artista/emma/1.php\"  onclick=\"javascript:loadUrl(this.href);return false;\" title=\"Emma\"><img src=\"http://static.ritalia.nohup.it/img/icons/artista/55827c6812054.jpg\" border=\"0\" ></a></li>\n                                <li><a href=\"http://www.radioitalia.it/artista/marco_mengoni/1.php\"  onclick=\"javascript:loadUrl(this.href);return false;\" title=\"Marco Mengoni\"><img src=\"http://sta"

What is wrong?

1 answer


You are indeed computing the wrong positions.

The start index does not account for the length of the text you are searching for. IndexOf returns the position where the match begins, so if you search for <span> you have to move 6 characters ahead to avoid picking up the search string itself.

The second parameter of Substring expects the number of characters to take, not an end position. So find the position of the string that marks the end and subtract from it what has already been skipped, in this case the value of the first parameter. That gives you a character count rather than a position.

Thus:

using static System.Console;

public class Program {
    public static void Main() {
        var resposta = @"<div class=""nowOnAir"">
            <a href=""http://www.radioitalia.it/artista/edoardo_bennato/1.php"" onclick=""javascript:loadUrl(this.href);return false;"" class=""autore"" title=""Scopri tutto su edoardo bennato"">
                edoardo bennato            </a><br />
            <span>le ragazze fanno grandi sogni</span>

        </div>";
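        // Start right after "<span>" (6 characters) so the tag itself is not included.
        var inicio = resposta.IndexOf("<span>") + 6;
        // The second argument is a length: end position minus start position.
        var musica = resposta.Substring(inicio, resposta.IndexOf("</span>") - inicio);
        // "autore" is also 6 characters long; the rest of the tag still ends up in the result.
        inicio = resposta.IndexOf("autore") + 6;
        var artista = resposta.Substring(inicio, resposta.IndexOf("</a><br />") - inicio);
        WriteLine(musica);
        WriteLine(artista);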
    }
}

See it working on ideone and on .NET Fiddle. I also put it on GitHub for future reference.

Note that the artist result is still wrong, as you yourself recognized. Adapt it to what you need; you now know where you were going wrong.
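One possible way to adapt the artist extraction is sketched below. It is only an illustration: the `Extrator` class and the `Entre` helper are hypothetical names, not part of .NET, and the sample string is a shortened version of the snippet from the question (the href is replaced by a placeholder). The idea is to anchor the search at `class="autore"`, then take the text between the `>` that closes the tag and the following `</a>`.

```csharp
using System;

public static class Extrator {
    // Hypothetical helper: returns the trimmed text between the first `abre`
    // found at or after position `de` and the next `fecha`, or null if either is missing.
    public static string Entre(string texto, string abre, string fecha, int de = 0) {
        if (de < 0) return null;                       // the anchor was not found by the caller
        int inicio = texto.IndexOf(abre, de, StringComparison.Ordinal);
        if (inicio < 0) return null;
        inicio += abre.Length;                         // skip the opening marker itself
        int fim = texto.IndexOf(fecha, inicio, StringComparison.Ordinal);
        if (fim < 0) return null;
        return texto.Substring(inicio, fim - inicio).Trim();   // second argument is a length
    }

    public static void Main() {
        // Shortened sample; "x" stands in for the real href.
        var resposta = "<div class=\"nowOnAir\">\n<a href=\"x\" class=\"autore\" title=\"Scopri tutto su edoardo bennato\">\n    edoardo bennato   </a><br />\n<span>le ragazze fanno grandi sogni</span>\n</div>";
        int ancora = resposta.IndexOf("class=\"autore\"", StringComparison.Ordinal);
        Console.WriteLine(Entre(resposta, ">", "</a>", ancora));          // edoardo bennato
        Console.WriteLine(Entre(resposta, "<span>", "</span>", ancora));  // le ragazze fanno grandi sogni
    }
}
```

Searching for the closing marker only from `inicio` onward also avoids a negative length when the same closing text appears earlier in the page.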

One detail: parsing third-party pages is asking for trouble, unless the page's author guarantees it will never change. Only do it as a last resort.

  • I agree, any change in the page's markup will break my code, but since this is a personal project, if the page changes I'll change my code too. No worries.
