How do I extract content from the Web (web scraping) with C#?

Viewed 748 times

4

I recently learned how to do web scraping. It works on some sites but not on others. I noticed that some of the URLs where it fails contain a "#". What does that mean?

Here is an example of a site where this happens: https://www.meusresultados.com/jogo/IV9KYMDp/#h2h;Overall

Also, is there a way to scrape this site?

I usually do this:

var wc = new WebClient();
wc.Encoding = Encoding.UTF8;
var pagina = wc.DownloadString(url);

var htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml(pagina);

Then I search the parsed document for what I want.

  • 2

    Yes, there is a way... but we'd have to see what you are doing...

  • What do you mean, see what I'm doing?

  • 1

    Exactly... I needed to see the code you're using. It isn't working because this code only receives the HTML the server returns for the request. Many sites rely on JavaScript and load the data after the page has loaded, so the HTML you receive has no data in it yet.

  • So how can I get this data?

  • It's a hack, but you could use a WebBrowser control: after the DocumentComplete event, wait a while (for the JavaScript to load) and then read the browser's HTML. I took a quick look at the site and could not find the addresses the AJAX requests go to, so I think the hack is the only option.

  • 1

    I tried the WebBrowser approach, but it didn't work. It must be complicated.
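To illustrate the point made in the comments about only receiving the server's HTML: the part of a URL after "#" is the fragment, which the browser keeps for itself (here, presumably for the page's JavaScript) and does not send in the HTTP request. The sketch below uses only `System.Uri`; the `FragmentDemo` class and `RequestedPart` helper are hypothetical names chosen for this example, not part of the original code.

```csharp
using System;

public static class FragmentDemo
{
    // Returns the URL without its fragment, i.e. the part a client
    // such as WebClient actually requests from the server.
    public static string RequestedPart(string url)
    {
        var uri = new Uri(url);
        return uri.GetLeftPart(UriPartial.Query);
    }

    public static void Main()
    {
        var uri = new Uri("https://www.meusresultados.com/jogo/IV9KYMDp/#h2h;Overall");

        // The fragment stays on the client side.
        Console.WriteLine(uri.Fragment);

        // This is what the server sees.
        Console.WriteLine(RequestedPart(uri.OriginalString));
    }
}
```

So `DownloadString` on this URL most likely returns the same base page regardless of what follows the "#", and the head-to-head data would only appear once the page's scripts run.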

1 answer

3


Here is a web scraper that, starting from one URI, collects every link to other URIs it can reach:

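using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public class WebScraper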
{
    public static void Main(string[] args)
    {
        string url = args[0];

        foreach (string anotherUrl in GetScrapedUrls(url))
        {
            Console.WriteLine(anotherUrl);
        }
    }

    private static bool IsValidChunk(string chunk)
    {
        bool result = true;

        result = result && chunk.First() != '#';
        result = result && !chunk.Contains("clicklogger");
        result = result && !chunk.StartsWith("https");
        result = result && !chunk.Contains("captcha");
        result = result && !chunk.Contains("counter");

        return result;
    }

    private static IEnumerable<string> GetScrapedUrls(string url)
    {
        Uri myUri;
        if (Uri.TryCreate(url, UriKind.Absolute, out myUri))
        {
            yield return myUri.AbsoluteUri;

            WebClient client = new WebClient();
            string content = client.DownloadString(myUri);

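            // Accept <html> with attributes or in any letter case, even at index 0.
            if (!string.IsNullOrEmpty(content) && content.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0)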
            {
                MatchCollection matches =
                    Regex.Matches(content, @"<a[^>]+?href\s*?=\s*?['""]([^'""]+)['""]");

                foreach (Match match in matches)
                {
                    string chunk = match.Groups[1].Value;

                    if (IsValidChunk(chunk))
                    {
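                        // Resolve the link against the current page so that
                        // relative hrefs become absolute URLs.
                        Uri resolved;
                        string oneMoreUrl = Uri.TryCreate(myUri, chunk, out resolved)
                            ? resolved.AbsoluteUri
                            : chunk;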

                        foreach (string evenOneMoreUrl in GetScrapedUrls(oneMoreUrl))
                        {
                            yield return evenOneMoreUrl;
                        }
                    }
                }
            }
        }
    }
}
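The link-extraction step of the answer above can also be tried without any network access. The following is a minimal sketch, not the answer author's method: it reuses the same `href` regular expression, resolves each link against a base URI with `Uri.TryCreate`, and skips duplicates, which a recursive crawler would need in order to avoid visiting the same page forever. The `LinkExtractor` class, the `ExtractLinks` method, and the `example.com` URL are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same href pattern as in the answer above.
    private static readonly Regex HrefPattern =
        new Regex(@"<a[^>]+?href\s*?=\s*?['""]([^'""]+)['""]", RegexOptions.IgnoreCase);

    // Returns the distinct absolute URLs linked from the given HTML,
    // in the order they first appear.
    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        var seen = new HashSet<string>();
        var links = new List<string>();

        foreach (Match match in HrefPattern.Matches(html))
        {
            string href = match.Groups[1].Value;

            // Skip same-page anchors such as "#top".
            if (href.StartsWith("#", StringComparison.Ordinal))
            {
                continue;
            }

            Uri resolved;
            if (Uri.TryCreate(baseUri, href, out resolved) && seen.Add(resolved.AbsoluteUri))
            {
                links.Add(resolved.AbsoluteUri);
            }
        }

        return links;
    }

    public static void Main()
    {
        string html = "<html><body><a href=\"/a\">A</a> <a href='b.html'>B</a> "
                    + "<a href=\"#top\">Top</a> <a href=\"/a\">A again</a></body></html>";

        foreach (string link in ExtractLinks(html, new Uri("https://example.com/dir/page.html")))
        {
            Console.WriteLine(link);
        }
    }
}
```

Note that this only finds links present in the HTML the server returns. Links or data that the page adds later through JavaScript, as discussed in the comments on the question, will not appear in the downloaded string.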
