How to collect data from a web page?

Viewed 1,874 times

5

Web data collection, or web scraping, is a form of data mining that extracts data from websites and converts it into structured information for further analysis. Share your ideas and more efficient alternatives for accomplishing this task.

1 answer

10

One effective way to get data from a document is the Regex class. Using immutable regular expressions to capture the desired data, we can run queries against documents or even a web page.

As a demo, we will create a program that gets the number of votes your question received on Stack Overflow.

Screenshot - Votes

First, we will locate the vote count in the page's source code, then create a regular expression that finds it in any Stack Overflow question.

PERFORMING WEB SCRAPING MANUALLY

In the Chrome browser, right-click the place where the information you want appears. Click Inspect to open the browser's Element Inspector.

How to open the Element Inspector

In the Element Inspector, right-click the element and copy the string containing the desired text.

Copy the string of the desired text snippet

As a result we get the following string: <span itemprop="upvoteCount" class="vote-count-post ">0</span>

We can see that the question's vote count is currently zero (0) and that it sits between two strings.
First String: <span itemprop="upvoteCount" class="vote-count-post ">
Second String: </span>
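The observation that the value sits between two known strings can be sketched with plain string operations, before any regex is involved. This is only an illustration: the class BetweenDemo and the method Between are hypothetical names that are not part of the project.

```csharp
using System;

static class BetweenDemo
{
    // Returns the text between the first occurrence of `first` and the next
    // occurrence of `second`, or null when either marker is missing.
    public static string Between(string source, string first, string second)
    {
        int start = source.IndexOf(first, StringComparison.Ordinal);
        if (start < 0) return null;
        start += first.Length;

        int end = source.IndexOf(second, start, StringComparison.Ordinal);
        if (end < 0) return null;

        return source.Substring(start, end - start);
    }

    static void Main()
    {
        string first = "<span itemprop=\"upvoteCount\" class=\"vote-count-post \">";
        string second = "</span>";
        string html = first + "0" + second;

        Console.WriteLine(Between(html, first, second)); // prints 0
    }
}
```

This works for one fixed pair of markers, but it says nothing about what the value in between should look like. The regular expression built next adds that check.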

CREATING THE REGULAR EXPRESSION

With this information in hand, we can now develop our regular expression. To make the work easier, we will use a website to test the expression in action.

Open the site regexstorm.net and paste our string into Input. Paste the same string into Pattern, replacing only the number zero with the regular expression \d+, as the image shows.

Regexstorm.net

EXPLAINING THE REGULAR EXPRESSION

Regular expressions, as the name suggests, are expressions that describe patterns that recur in a text. In this case we use \d, which matches any decimal digit, together with the quantifier +, which tells the engine to match that digit one or more times. So if the vote count goes up or down, it will still be found, as long as it is written with digits only (a negative count such as -1 would not match).

To see how it works, we will add more data to Input. In the Table tab in the page footer, we can confirm that the regex finds each of the different values in the string.

Regex example
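The same test can be reproduced in C# instead of on the website. The sketch below is a minimal example, not part of the project: the class PatternDemo, the method FindAll and the sample input values are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PatternDemo
{
    // The span copied from the page, with the number replaced by \d+.
    const string Pattern =
        "<span itemprop=\"upvoteCount\" class=\"vote-count-post \">\\d+</span>";

    // Returns every piece of the input that matches the pattern.
    public static List<string> FindAll(string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    static void Main()
    {
        string open = "<span itemprop=\"upvoteCount\" class=\"vote-count-post \">";
        // Three numeric values and one non-numeric value (sample data).
        string input = open + "0</span>\n" + open + "15</span>\n"
                     + open + "1874</span>\n" + open + "abc</span>";

        foreach (string hit in FindAll(input))
        {
            Console.WriteLine(hit);
        }
    }
}
```

With this sample input, the three spans holding digits are reported and the one holding abc is skipped, because \d+ requires at least one decimal digit.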

For a list of the available regular expression elements and examples of their use, see the Regex reference.

DEVELOPING OUR WEB SCRAPING

Create a Console Application project named Simple Web Scraping. To keep the work organized, we will create a folder called Tools in the project and, inside it, two class files: DownloadWebPage.cs and RegexTools.cs.

Screenshot of the Solution Explorer

DOWNLOADING A WEB PAGE

The class DownloadWebPageString, in DownloadWebPage.cs, will be responsible for downloading a web page and returning the result as a String. We will use the class WebClient to download the page.

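using System;
using System.Net;

class DownloadWebPageString
{
    // Downloads the page at _url and returns its HTML as a string.
    public static String Run(String _url)
    {
        try
        {
            // WebClient is IDisposable, so release it once the download ends.
            using (WebClient webClient = new WebClient())
            {
                return webClient.DownloadString(_url);
            }
        }
        catch (Exception)
        {
            // On failure this message is returned in place of the page,
            // so a later regex search will simply find no match.
            return "Could not download the web page.";
        }
    }
}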
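Because the download step returns an error message instead of throwing, a malformed address is easy to miss. One optional safeguard is to check that the address is an absolute http or https URL before calling Run. The sketch below is an assumption about how that could look: UrlCheck and IsAbsoluteHttpUrl are hypothetical names, and example.com is only a placeholder host.

```csharp
using System;

static class UrlCheck
{
    // True only for absolute http:// or https:// addresses.
    public static bool IsAbsoluteHttpUrl(string url)
    {
        Uri uri;
        return Uri.TryCreate(url, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }

    static void Main()
    {
        // A site-relative path is not enough to download a page.
        Console.WriteLine(IsAbsoluteHttpUrl("/questions/302606/"));               // prints False
        // Placeholder absolute URL, used only to show the positive case.
        Console.WriteLine(IsAbsoluteHttpUrl("https://example.com/questions/1/")); // prints True
    }
}
```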

GETTING DATA FROM A STRING USING REGEX

The class RegexTools will be responsible for collecting data from a String holding the web page we downloaded. It will also implement a text-removal step so that we receive only the information we actually want.

We will do this with the Replace method that C# provides for strings. It replaces a given piece of text with something else, which in our case is simply an empty string, removing the parts of the matched text we do not need.

using System;
using System.Text.RegularExpressions;

class RegexTools
{
    String text;

    // Loads the document that later calls to Run will search.
    public void NewDocument(String _text)
    {
        text = _text;
    }

    // Returns the first match of the pattern, or an empty string if there is none.
    public String Run(String _regularExpression)
    {
        var regex = new Regex(_regularExpression);
        return regex.Match(text).Value;
    }

    // Returns the first match with every occurrence of _replaceClearFirst removed.
    public String Run(String _regularExpression, String _replaceClearFirst)
    {
        return Run(_regularExpression).Replace(_replaceClearFirst, "");
    }

    // Returns the first match with both _replaceClearFirst and _replaceClearSecond removed.
    public String Run(String _regularExpression, String _replaceClearFirst, String _replaceClearSecond)
    {
        return Run(_regularExpression, _replaceClearFirst).Replace(_replaceClearSecond, "");
    }
}
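As an alternative to stripping the surrounding text with Replace, the number can be captured directly with a regex capturing group. This is a different technique from the one used in RegexTools, shown only for comparison. The class CaptureGroupDemo and the method ExtractVotes are hypothetical names, and the sample HTML is made up.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureGroupDemo
{
    // Same pattern as before, but the digits are wrapped in parentheses
    // so they become capturing group 1.
    const string Pattern =
        "<span itemprop=\"upvoteCount\" class=\"vote-count-post \">(\\d+)</span>";

    // Returns only the digits inside the span, or null when nothing matches.
    public static string ExtractVotes(string html)
    {
        Match match = Regex.Match(html, Pattern);
        return match.Success ? match.Groups[1].Value : null;
    }

    static void Main()
    {
        string html =
            "<div><span itemprop=\"upvoteCount\" class=\"vote-count-post \">10</span></div>";
        Console.WriteLine(ExtractVotes(html)); // prints 10
    }
}
```

The group avoids passing the opening and closing strings a second time, and checking match.Success separates "no match" from a real value.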

RUNNING THE PROGRAM

In the class Program, generated automatically when we created the project, we will add the following code to the method Main. The comments explain how the program works.

static void Main(string[] args)
{
    Console.WriteLine("WEB SCRAPING - HOW TO COLLECT WEB DATA?");

    // SEARCH SETTINGS FOR THE DOCUMENT
    // The address is shown here as a site-relative path; pass the full URL of the question when running.
    String paginaWeb = DownloadWebPageString.Run("/questions/302606/");
    String regularExpression = "<span itemprop=\"upvoteCount\" class=\"vote-count-post \">\\d+</span>"; // FINDS THE DECIMAL DIGITS IN THE STRING USING THE REGULAR EXPRESSION: \d+
    String replaceClearFirst = "<span itemprop=\"upvoteCount\" class=\"vote-count-post \">"; // START OF THE TEXT TO ERASE
    String replaceClearSecond = "</span>"; // END OF THE TEXT TO ERASE

    // GETTING THE DATA
    RegexTools regexTools = new RegexTools();
    regexTools.NewDocument(paginaWeb); // LOAD REGEXTOOLS WITH THE DOCUMENT
    String countVotes = regexTools.Run(regularExpression, replaceClearFirst, replaceClearSecond); // GET THE NUMBER OF VOTES
    Console.WriteLine("Number of votes: " + countVotes);
    Console.ReadKey();
}
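The value returned by Run is still a string, and it is empty when the pattern finds nothing (for example, when the download failed). If the count is needed as a number, one possible follow-up is to parse it with int.TryParse. The sketch below is an assumption, not part of the project: VoteParser and TryParseVotes are hypothetical names.

```csharp
using System;
using System.Globalization;

static class VoteParser
{
    // Converts the extracted text to an int. NumberStyles.None accepts
    // digits only, which mirrors what \d+ can match.
    public static bool TryParseVotes(string countVotes, out int votes)
    {
        return int.TryParse(countVotes, NumberStyles.None,
                            CultureInfo.InvariantCulture, out votes);
    }

    static void Main()
    {
        int votes;
        // "10" stands in for a successful extraction, "" for no match.
        Console.WriteLine(TryParseVotes("10", out votes)
            ? "Votes: " + votes : "No vote count found"); // prints Votes: 10
        Console.WriteLine(TryParseVotes("", out votes)
            ? "Votes: " + votes : "No vote count found"); // prints No vote count found
    }
}
```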

To build and run the project in the Visual Studio 2017 IDE, click ▶ Iniciar (Start).

Compiling in Visual Studio 2017

After it builds, the console window shows the result in the following image.

Result

The project is available for download on my GitHub: Simple-Web-Scraping

  • 1

    Top answer!!!

  • Make good use of it; I hope it is useful to you someday. I'm just getting started on Stack Overflow and hope to create more relevant content soon. Thank you, @W.Faustino.
