A simple way to get data from a document is to use the Regex class. Using immutable regular expressions to describe the data we want, we can query documents or even a web page.
As a demo, we will create a program that gets the number of votes your question received on Stack Overflow.
First we will find where the vote count appears in the source code of the web page, and then we will create a regular expression that finds it in any question on Stack Overflow.
PERFORMING WEB SCRAPING MANUALLY
In the Chrome browser, right-click the place on the page that shows the information you want and click Inspect to open the browser's element inspector.
In the element inspector, right-click the highlighted element and copy the markup that contains the desired text.
As a result we get the following string: <span itemprop="upvoteCount" class="vote-count-post ">0</span>
We can see that the number of votes on the question is currently zero (0) and that it sits between two strings.
First String: <span itemprop="upvoteCount" class="vote-count-post ">
Second String: </span>
CREATING THE REGULAR EXPRESSION
With this information in hand we can now develop our regular expression. To make the work easier, we will use a website to test the expression in operation.
Open the site regexstorm.net and paste our string into Input. Paste it into Pattern as well, replacing only the number zero with the regular expression \d+, as can be seen in the image.
EXPLAINING THE REGULAR EXPRESSION
Regular expressions are, as the name suggests, expressions that describe patterns that repeat in a text. In this case we use the expression \d, which matches any decimal digit, together with the quantifier +, which means the preceding element must occur one or more times. So if the vote count goes up or down, even to a different number of digits, it will still be found.
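The behaviour of \d+ described above can be checked with a minimal sketch like the one below. The class name DigitPatternDemo and the method ContainsVoteCount are hypothetical names used only for illustration; the pattern itself is the one built from the two strings in the text.

```csharp
using System;
using System.Text.RegularExpressions;

static class DigitPatternDemo
{
    // The two fixed strings from the page, with \d+ in place of the number.
    const string Pattern =
        "<span itemprop=\"upvoteCount\" class=\"vote-count-post \">\\d+</span>";

    // Returns true when the input contains the vote-count span
    // with at least one digit inside it.
    public static bool ContainsVoteCount(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    static void Main()
    {
        string open = "<span itemprop=\"upvoteCount\" class=\"vote-count-post \">";
        string close = "</span>";

        Console.WriteLine(ContainsVoteCount(open + "0" + close));    // True
        Console.WriteLine(ContainsVoteCount(open + "1234" + close)); // True
        Console.WriteLine(ContainsVoteCount(open + close));          // False: + requires at least one digit
    }
}
```

The same pattern matches a one-digit and a four-digit count, and only fails when there is no digit at all between the two strings.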
To see how it behaves, we will add more data to Input. The Table tab in the footer of the page then shows that the Regex was able to find the different values in the string.
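The experiment with several values in Input can be reproduced in C# with Regex.Matches, which returns every match rather than only the first. This is a sketch under assumptions: the class AllMatchesDemo, the method FindAll and the sample vote counts are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class AllMatchesDemo
{
    const string Pattern =
        "<span itemprop=\"upvoteCount\" class=\"vote-count-post \">\\d+</span>";

    // Collects the text of every match of the pattern, in order of appearance.
    public static List<string> FindAll(string input)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(input, Pattern))
        {
            results.Add(match.Value);
        }
        return results;
    }

    static void Main()
    {
        string open = "<span itemprop=\"upvoteCount\" class=\"vote-count-post \">";
        string close = "</span>";

        // Three spans with made-up vote counts of different lengths.
        string input = open + "0" + close + "\n"
                     + open + "27" + close + "\n"
                     + open + "1500" + close;

        foreach (string found in FindAll(input))
        {
            Console.WriteLine(found);
        }
    }
}
```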
For a list of the available regular expression elements and examples of their use, see the Regex reference.
DEVELOPING OUR WEB SCRAPING
Create a Console Application project with the name Simple Web Scraping. To organize our work, we will create a folder in the project called Tools and, within it, two classes: DownloadWebPage.cs and RegexTools.cs.
DOWNLOADING A WEB PAGE
The class DownloadWebPageString, in the file DownloadWebPage.cs, will be responsible for downloading a web page and returning the result in a String. We will use the class WebClient to download the web page.
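using System;
using System.Net;

class DownloadWebPageString
{
    // Downloads the page at _url and returns its contents as a string.
    static public String Run(String _url)
    {
        try
        {
            // WebClient is disposable, so release it once the download finishes.
            using (WebClient webClient = new WebClient())
            {
                return webClient.DownloadString(_url);
            }
        }
        catch (Exception)
        {
            // On failure, the error message is returned in place of the page content.
            return "Could not download the web page.";
        }
    }
}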
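WebClient is marked obsolete in .NET 6 and later, so on newer runtimes the same download step is usually written with HttpClient. The sketch below is one possible variant, not the code from the project: the class PageDownloader and the method DownloadOrNull are hypothetical names, and returning null on failure is an assumption made here so that a caller can tell a failed download apart from real page content. On .NET Framework it also assumes a reference to System.Net.Http.

```csharp
using System;
using System.Net.Http;

static class PageDownloader
{
    // A single HttpClient instance, reused for every request.
    static readonly HttpClient client = new HttpClient();

    // Returns the page body, or null when the URL is not an absolute
    // http/https address or the request fails.
    public static string DownloadOrNull(string url)
    {
        Uri uri;
        bool isWebUrl = Uri.TryCreate(url, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
        if (!isWebUrl)
        {
            return null;
        }

        try
        {
            // Blocking on the task keeps this sketch synchronous,
            // like the Run method in the text.
            return client.GetStringAsync(uri).GetAwaiter().GetResult();
        }
        catch (HttpRequestException)
        {
            return null; // network error or non-success status code
        }
        catch (OperationCanceledException)
        {
            return null; // the request timed out
        }
    }

    static void Main()
    {
        // An invalid address exercises the failure path without any network access.
        string page = DownloadOrNull("not a url");
        Console.WriteLine(page == null ? "Download failed." : page);
    }
}
```

With this shape, the calling code checks for null before handing the text to the regular expression, instead of searching inside an error message.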
GETTING DATA FROM A STRING USING REGEX
The class RegexTools will be responsible for collecting data from a String, which will be the web page we download. It will also implement a simple substitution step so that we receive only the information we actually want.
We will do this with the Replace function available for string handling in C#. It replaces a particular piece of text with something else, which in our case is simply an empty string, so the unwanted parts of the matched text are removed.
using System;
using System.Text.RegularExpressions;

class RegexTools
{
    String text;

    // Loads the document that will be searched.
    public void NewDocument(String _text)
    {
        text = _text;
    }

    // Returns the first match of the pattern (an empty string if there is none)
    // and also writes it to the console.
    public String Run(String _regularExpression)
    {
        var regex = new Regex(_regularExpression);
        var match = regex.Match(text);
        String resultRegex = match.ToString();
        Console.WriteLine(resultRegex);
        return resultRegex;
    }

    // Returns the first match with every occurrence of _replaceClearFirst removed.
    public String Run(String _regularExpression, String _replaceClearFirst)
    {
        var regex = new Regex(_regularExpression);
        var match = regex.Match(text);
        String resultRegex = match.ToString();
        String resultReplaceFirst = resultRegex.Replace(_replaceClearFirst, "");
        return resultReplaceFirst;
    }

    // Returns the first match with both _replaceClearFirst and _replaceClearSecond removed.
    public String Run(String _regularExpression, String _replaceClearFirst, String _replaceClearSecond)
    {
        var regex = new Regex(_regularExpression);
        var match = regex.Match(text);
        String resultRegex = match.ToString();
        String resultReplaceFirst = resultRegex.Replace(_replaceClearFirst, "");
        String resultReplaceSecond = resultReplaceFirst.Replace(_replaceClearSecond, "");
        return resultReplaceSecond;
    }
}
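As an alternative to removing the surrounding strings with Replace, the number can be taken directly from a capture group by wrapping \d+ in parentheses. This is a different technique from the one RegexTools uses, shown only as a sketch; the class CaptureGroupDemo, the method ExtractVotes and the sample HTML are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureGroupDemo
{
    // The same pattern as in the text, with parentheses around \d+
    // so that the digits form capture group 1.
    const string Pattern =
        "<span itemprop=\"upvoteCount\" class=\"vote-count-post \">(\\d+)</span>";

    // Returns the digits of the first match, or null when nothing matches.
    public static string ExtractVotes(string document)
    {
        Match match = Regex.Match(document, Pattern);
        return match.Success ? match.Groups[1].Value : null;
    }

    static void Main()
    {
        // Made-up fragment standing in for a downloaded page.
        string html =
            "<div><span itemprop=\"upvoteCount\" class=\"vote-count-post \">42</span></div>";

        Console.WriteLine(ExtractVotes(html)); // 42
    }
}
```

The capture group avoids the two Replace calls and makes the "no match" case explicit, at the cost of a slightly longer pattern.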
RUNNING THE PROGRAM
In the class Program, generated automatically when we create the project, we will add the following commented code to the Main method; the comments explain how the program works.
static void Main(string[] args)
{
    Console.WriteLine("WEB SCRAPING – HOW TO COLLECT WEB DATA?");

    // SEARCH SETTINGS FOR THE DOCUMENT
    String paginaWeb = DownloadWebPageString.Run("/questions/302606/");
    String regularExpression = "<span itemprop=\"upvoteCount\" class=\"vote-count-post \">\\d+</span>"; // FINDS DECIMAL DIGITS IN THE STRING USING THE REGULAR EXPRESSION: \d+
    String replaceClearFirst = "<span itemprop=\"upvoteCount\" class=\"vote-count-post \">"; // THE START OF THE TEXT TO ERASE
    String replaceClearSecond = "</span>"; // THE END OF THE TEXT TO ERASE

    // GETTING THE DATA
    RegexTools regexTools = new RegexTools();
    regexTools.NewDocument(paginaWeb); // LOADS THE DOCUMENT INTO REGEXTOOLS
    String countVotes = regexTools.Run(regularExpression, replaceClearFirst, replaceClearSecond); // GETS THE NUMBER OF VOTES
    Console.WriteLine("Number of votes: " + countVotes);
    Console.ReadKey();
}
To build and run the project in Visual Studio 2017, click ▶ Iniciar (Start).
Once the program runs, the console window shows the result, as in the following image.
The project is available for download on my Github: Simple-Web-Scraping
Top Response!!!
– Wilson Faustino
Make good use of it; I hope it is useful to you someday. I'm just starting out on Stack Overflow and hope to create more relevant content soon. Thank you, @W.Faustino.
– Guilherme Lima