How to count the number of candidates on this page? Python 3.6

Viewed 314 times

2

A simple thing: I need to count how many candidates there are in the table on this page, for example: http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=01GV&id_grupo=70 In this example there are 110 names. I need to obtain that number programmatically, and I have to do it for a large number of pages with the same structure. Here's what I've tried:

from bs4 import BeautifulSoup
import requests
import string
import re
import urllib
r = requests.get('http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=01GV&id_grupo=70')
soup = BeautifulSoup(r.text, "html.parser")
contador = 0
for node in soup.findAll(".XXX-XX<"):
  contador = contador+1
print(contador)  

However, it does not find these characters, even though they are there, in the Cpf column for example. How can I do this?

  • Wouldn't it help to count how many <td> elements there are in the table <table id="Sisu">?

  • Are you going to iterate over all the pages (http://.... &id_grupo=70, http://.... &id_grupo=71 ...) and accumulate the number of candidates in a single variable?

3 answers

1

print(len(re.findall('XXX-XX', str(soup))))
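
To expand on this one-liner: soup.findAll(".XXX-XX<") in the question searches for tags named ".XXX-XX<", not for text, which is why it finds nothing. Below is a minimal sketch that counts matching text nodes instead. It assumes the masked CPF suffix XXX-XX appears exactly once per candidate row; the sample HTML and the helper name count_masked_cpfs are made up for illustration and are not taken from the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the real page (structure is an assumption).
SAMPLE_HTML = """
<table id="sisu">
  <tr><th>Name</th><th>Cpf</th></tr>
  <tr><td>Candidate A</td><td>123.456.XXX-XX</td></tr>
  <tr><td>Candidate B</td><td>987.654.XXX-XX</td></tr>
</table>
"""

def count_masked_cpfs(html):
    """Count text nodes that contain the masked CPF suffix."""
    soup = BeautifulSoup(html, "html.parser")
    # string= matches against text nodes; a bare first argument matches tag names.
    return len(soup.find_all(string=re.compile(r"XXX-XX")))

print(count_masked_cpfs(SAMPLE_HTML))
```

If the same marker could appear elsewhere on the page (for instance in a footer), counting table rows is probably the more robust choice.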
  • 4

    Please help improve the quality of community content by elaborating on your answer. Very short answers are rarely clear enough to add anything to the discussion. If your answer relies on some function or feature of the language or tool, link to the official documentation or to reliable material that supports it. Searching the community for similar questions can also be useful to point out other approaches to the problem.

0

import re
import urllib.request

# "websiteURL" is a placeholder; replace it with the address of the page.
request = urllib.request.Request("websiteURL")

response = urllib.request.urlopen(request)

# read() returns bytes in Python 3, so decode before applying a str pattern
responseContent = response.read().decode("utf-8", errors="replace")

# From what I noticed, you only need the content of the first column of each row, hence this regular expression and findall
match = re.findall(r'<tr[^>]*>\s*<td[^>]*>(.*?)</td>', responseContent, re.S)

# Once you have the codes you can do whatever you like: count, print...
for code in match:
    print(code)

I believe the code works; if there is any error you should be able to correct it easily, since the logic for what you want is here.
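
The regular-expression idea can be checked on a small inline sample without touching the network. This is only a sketch: the sample markup is invented, and the pattern is one possible variant that assumes each row opens with a <tr> immediately followed (apart from whitespace) by its first <td>.

```python
import re

# Hypothetical sample; the real page's markup may differ.
SAMPLE_HTML = """
<table>
<tr><td>0001</td><td>Candidate A</td></tr>
<tr>
  <td>0002</td><td>Candidate B</td>
</tr>
</table>
"""

# <tr>, optional whitespace, then the first <td>.
# (.*?) is non-greedy, so the capture stops at the first closing </td>.
FIRST_CELL = re.compile(r"<tr[^>]*>\s*<td[^>]*>(.*?)</td>", re.S)

first_cells = FIRST_CELL.findall(SAMPLE_HTML)
print(first_cells)
print(len(first_cells))
```

Regular expressions are fragile against nested tags or attribute changes, so an HTML parser is usually the safer tool when the markup is not fully under your control.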

0


If you only need to find out how many candidates there are:

import requests
from bs4 import BeautifulSoup as bs

req = requests.get('http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=01GV&id_grupo=70')
soup = bs(req.text, 'html.parser')
rows = soup.select('#sisu tr')
print(len(rows[1:])) # 110

BeautifulSoup installation: pip install beautifulsoup4

Note that I use rows[1:] because the first row (<tr>) contains the column names (I don't think it counts as a candidate).
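
Since the question mentions a large number of pages with the same structure, the same row count could be wrapped in a function and summed over several pages. This is a rough sketch under assumptions: the helper names count_candidates and count_all are mine, the group list in the example call is hypothetical, and it presumes every page uses the same #sisu table with a single header row.

```python
import requests
from bs4 import BeautifulSoup

# Base address taken from the question; the query string selects course and group.
BASE_URL = "http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/"

def count_candidates(html):
    """Count the rows of the #sisu table, excluding the header row."""
    rows = BeautifulSoup(html, "html.parser").select("#sisu tr")
    # max() guards against pages where the table is missing or empty.
    return max(len(rows) - 1, 0)

def count_all(course, groups):
    """Sum the candidate counts over several pages (performs network requests)."""
    total = 0
    for group in groups:
        resp = requests.get(
            BASE_URL,
            params={"id_curso": course, "id_grupo": group},
            timeout=30,
        )
        resp.raise_for_status()
        total += count_candidates(resp.text)
    return total

# Example call; the list of groups is hypothetical:
# print(count_all("01GV", ["70", "71"]))
```

Keeping the parsing in count_candidates, separate from the download, makes it possible to test the counting logic on saved HTML before requesting hundreds of pages.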
