How to extract content from the Web (Web scraping) with C#?

Question

Asked 7 years, 2 months ago

I recently learned how to do web scraping. It works on some sites, but not on others. I noticed that some of the URLs I can't scrape contain a "#". What does that mean?

Here is an example of a site where this happens: https://www.meusresultados.com/jogo/IV9KYMDp/#h2h;Overall

Also, is there a way to scrape this site?

I usually do this:

var wc = new WebClient();
wc.Encoding = Encoding.UTF8;
var pagina = wc.DownloadString(url);

var htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml(pagina);

Then I search the document for what I want.

There is a way, sure... but we need to see what you are doing...

– Rovann Linhalis

2018/06/21 at 18:59
What do you mean, see what I'm doing?

– Diogo Sousa

2018/06/21 at 19:00
Exactly... we needed to see the code you're using. It isn't working because this code only receives the HTML that the server returns for a single request. Many sites use JavaScript and load their data after the page has loaded, so the HTML you receive contains no data yet.

– Rovann Linhalis

2018/06/21 at 19:04
So how can I proceed to get this data?

– Diogo Sousa

2018/06/21 at 19:27
It's a hack... but you could use a WebBrowser: after the DocumentComplete event you wait a while (for the JavaScript to load) and then read the HTML from the browser. I took a quick look at the site and couldn't find the addresses the AJAX requests are made to, so I think the hack is the only option.

– Rovann Linhalis

2018/06/21 at 19:41
I tried with a WebBrowser, but it didn't work. It must be complicated.

– Diogo Sousa

2018/06/21 at 19:58

1 answer

by Marcelo Shiniti Uchimura • **3,302** points · Answer 1 · 2018-06-21T22:02:17+00:00

Here is a web scraper that, starting from one URI, collects every reference to other URIs:

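using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public class WebScraper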
{
    public static void Main(string[] args)
    {
        string url = args[0];

        foreach (string anotherUrl in GetScrapedUrls(url))
        {
            Console.WriteLine(anotherUrl);
        }
    }

    private static bool IsValidChunk(string chunk)
    {
        bool result = true;

        result = result && chunk.First() != '#';
        result = result && !chunk.Contains("clicklogger");
        result = result && !chunk.StartsWith("https");
        result = result && !chunk.Contains("captcha");
        result = result && !chunk.Contains("counter");

        return result;
    }

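    private static IEnumerable<string> GetScrapedUrls(string url)
    {
        // Remember visited pages so that pages linking to each other do not recurse forever.
        return GetScrapedUrls(url, new HashSet<string>());
    }

    private static IEnumerable<string> GetScrapedUrls(string url, HashSet<string> visited)
    {
        Uri myUri;
        if (Uri.TryCreate(url, UriKind.Absolute, out myUri) && visited.Add(myUri.AbsoluteUri))
        {
            yield return myUri.AbsoluteUri;

            string content = null;
            try
            {
                using (WebClient client = new WebClient())
                {
                    content = client.DownloadString(myUri);
                }
            }
            catch (WebException)
            {
                // The request failed: there is nothing to parse for this page.
            }

            // Case-insensitive check that also accepts "<html lang=...>" and a tag at index 0.
            if (!string.IsNullOrEmpty(content) &&
                content.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0)
            {
                MatchCollection matches =
                    Regex.Matches(content, @"<a[^>]+?href\s*?=\s*?['""]([^'""]+)['""]");

                foreach (Match match in matches)
                {
                    string chunk = match.Groups[1].Value;
                    Uri nextUri;

                    // Resolve the link against the current page, so relative links become absolute.
                    if (IsValidChunk(chunk) && Uri.TryCreate(myUri, chunk, out nextUri))
                    {
                        foreach (string evenOneMoreUrl in GetScrapedUrls(nextUri.AbsoluteUri, visited))
                        {
                            yield return evenOneMoreUrl;
                        }
                    }
                }
            }
        }
    }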
}
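
On the "#" from the question: everything from the "#" onward is the URI fragment. An HTTP client does not send the fragment to the server, so a plain download only ever gets the page before the "#". On that site, the part after the "#" is most likely read by the page's own JavaScript, which fits the comments about data being loaded after the page. Below is a minimal sketch that splits the example URL with System.Uri. FragmentDemo, RequestedPart and FragmentPart are names made up for this illustration; they are not part of the scraper above.

```csharp
using System;

public static class FragmentDemo
{
    // The part of the URL that is actually requested from the server
    // (scheme, host, path and query; the fragment is left out).
    public static string RequestedPart(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Query);
    }

    // The fragment, including the leading '#'. It stays on the client side.
    public static string FragmentPart(string url)
    {
        return new Uri(url).Fragment;
    }
}
```

For the URL in the question, RequestedPart gives https://www.meusresultados.com/jogo/IV9KYMDp/ and FragmentPart gives #h2h;Overall, so WebClient.DownloadString only ever sees the first of the two.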