Translate Captcha

I’m developing a routine that looks up vehicle debts and restrictions on the DMV (Detran-SP) site, and to download the document I have to get past a captcha validation, which is keeping me up at night. Has anyone managed to get past this validation using VBA, or found some other way to solve this kind of "problem"?

The site I’m working on is this: http://www.detran.sp.gov.br/wps/portal/portaldetran/cidadao/veiculos/servicos/pesquisaDebitosRestricoesVeiculos

So far I show a message asking the user to type the captcha and press OK so the process can continue. Everything else is practically done; it is only this part that is bothering me a lot.

  • M Marins, your question is not clear. The title implies you want to "translate" a captcha, but then you describe something that sounds like "breaking" the captcha of a site. What is your real need? Please describe it more clearly and objectively so we can help you.

1 answer


Captcha breaking is not a trivial problem; you need to know a bit of image processing, even for the simple cases.

This captcha image is simple to break compared to most captchas out there. The letters stand out well from the background and are well separated from each other.

The algorithm would be more or less:

  1. Separate each letter
  2. Recognize each letter - the letters don't seem to vary in rotation, so a simple exact match between the cropped letter and a pre-sorted alphabet should be enough.

Unfortunately I don't know VB, so I'll write the code in Python, but I believe it is simple to adapt to Visual Basic using Emgu, a .NET wrapper for OpenCV.

Step 1 - Separate letters

I'll take an example from the page and run the algorithm step by step.

[image: original]

These steps are known in image processing as background removal. The idea of this technique is to paint everything that is not important in a single color, and to highlight the important objects in another, so that it is simple to get the coordinates of the objects of interest (Wikipedia has more information - Background Subtraction).

The first step is to turn the color image into grayscale.

import cv2
import numpy

# load the captcha image and convert it to grayscale
image = cv2.imread('captcha.png')
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)

The result is the image below:

[image: gray]

After that, to highlight the white letters, we'll use a morphological operation called dilation. It makes the letters "chubby", reinforcing the area of the objects of interest.

kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))
dilated = cv2.dilate(gray, kernel)

The result applied to the grayscale image is the image below:

[image: dilated]

Now that we have the letters highlighted, we can turn the background black and the letters white just by looking at their color: pixels below the value 127 are painted black, and the rest are painted white. This technique is called thresholding.

_, bw = cv2.threshold(dilated, 127, 255, cv2.THRESH_BINARY)

The result of the threshold on the dilated image is shown below:

[image: black & white]

Now the background is all black and the letters are all white. Next we have to separate each letter, and for this OpenCV already has a handy function, which gives the same label to all neighboring pixels that have the same color.

total, markers = cv2.connectedComponents(bw)

total will hold the number of components, including the background, and markers will be an image with the connected components (each letter) painted with the same value. It is drawn below:

[image: markers]

Now we just need to find the coordinates of each letter. For this we use a function called findContours, which finds the outline of each letter.

# filter the components, keeping only those with more than 10 px and fewer than 1000 px
images = [numpy.uint8(markers == i) * 255
          for i in range(total)
          if 10 < numpy.uint8(markers == i).sum() < 1000]

# make a copy of the black and white image, just for visualization
img = cv2.cvtColor(bw, cv2.COLOR_GRAY2RGB)


# draw rectangles around each component
color = (255, 0, 0)

for label in images:
    # find the contours of each component
    contours, _ = cv2.findContours(label, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

    # compute the rectangle around the contours
    (x, y, w, h) = cv2.boundingRect(contours[0])

    # and draw it
    cv2.rectangle(img, (x, y), (x + w, y + h), color, 1)

The result is the rectangles drawn around each letter, as in the image below:

[image: black and white - rectangles]

Now, with the x, y, width and height of each letter, it is trivial to crop them and save them to a directory. These letters should then be sorted in some way (all the 'a's in the same directory, all the 'b's in another, and so on).
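The cropping above can be sketched with plain NumPy slicing. Here `rects` is assumed to be the list of `(x, y, w, h)` boxes collected from `boundingRect` in the previous loop (this helper is not in the original answer, just an illustration):

```python
import numpy as np

def crop_letters(bw, rects):
    # return one sub-image per bounding box, ordered left to right
    return [bw[y:y + h, x:x + w]
            for (x, y, w, h) in sorted(rects, key=lambda r: r[0])]
```

Each crop can then be written to its class directory (for example with cv2.imwrite) to build the sorted letter base.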

Step 2 - Do the recognition

Now that you have a good base of sorted letters and know how to separate the letters in a new captcha, just crop each letter and compare it with all the letters in your base, by sheer brute force. OpenCV has a function that does this, called matchTemplate. It offers several methods to compute the difference between two images; from experience, I usually use the TM_CCOEFF_NORMED method.

Assuming you have one cropped letter you want to recognize, and a list of images as templates, you can use the function below, which will give you the best match.

# find the best template for a letter
def search_for_letter(image, letter, templates):
    # with TM_CCOEFF_NORMED, higher scores mean better matches
    best = -1

    pos  = None

    for template in templates:
        # search for the template in the image, using the chosen method
        match = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)

        # get the score and the location of the template
        minVal, maxVal, minLoc, maxLoc = cv2.minMaxLoc(match)

        if maxVal > best:
            pos = {
                'error': maxVal,
                'location': maxLoc,
                'letter': letter
            }
            best = maxVal

    return pos

Now you can use this function in a loop to find all the letters contained in the image, something like the function below:

# iterate over all the letters to find
# the best result
def search(file, templates):
    matches = []

    # this cut_and_binarize is the whole of step 1
    image = cut_and_binarize(file)

    for letter in templates:
        pos = search_for_letter(image, letter, templates[letter])

        if pos is not None:
            matches.append(pos)

    # sort by match score, best first
    matches = sorted(matches, key=lambda x: x['error'], reverse=True)

    # take the 6 best matches and sort them by X
    return sorted(matches[:6], key=lambda x: x['location'][0])

With this, I believe you can get more than 90% recognition on this captcha. As I said at the beginning, understanding image processing is important for automating captcha recognition, but once you know the basic techniques, the work is simple.
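As a final usage sketch, the matches returned by search (already sorted by X) can be joined into the captcha text. The dictionary keys follow the search_for_letter code above; the example matches and scores below are made up for illustration:

```python
def matches_to_text(matches):
    # `matches` is already sorted by the X coordinate, so joining
    # the letters gives the captcha text in reading order
    return ''.join(m['letter'] for m in matches)

# hypothetical matches, in the shape returned by search()
matches = [
    {'letter': 'a', 'location': (5, 3), 'error': 0.97},
    {'letter': 'x', 'location': (25, 3), 'error': 0.95},
]
print(matches_to_text(matches))  # ax
```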
