How to use deep learning to parse forms with addresses?

I have an app I need to import personal data into. I often receive Excel or CSV/TXT files with fields such as name, address, email, phone, etc. File formatting and column order vary, and sometimes there are empty fields. What can help an algorithm understand the fields is that within each file of N entries, every entry has the same column organization. The format varies between files, not within a file.

I can do this by hand, usually with RegExp-based scripts, but those always end up with a large "custom made" component, i.e. I still need to handle the data manually.

How could I use JavaScript and deep learning to teach a program to recognize the fields, format them so that my application can consume them, and eventually flag poorly populated fields once the program is confident about what kind of field each one should be?

Example of input, where each row is an example of how columns can be in a given file:

// nome 1 (first name), nome 2 (last name), telefone (phone), email, campos de morada (address fields)
["joao", "pereira", "215548808", "[email protected]", "rua das peras", "2890", "campo alegre"]

// nome 1, nome 2, data de nascimento (date of birth), email, codigo postal (postal code), morada (address), telefone
["maria", "conceição", "10051978", "[email protected]", "2400", "rua de porto alegre", "98337449"]

// nome completo (full name), morada completa (full address), mail pessoal (personal email), mail trabalho (work email), telefone fixo (landline), telemovel (mobile)
["andreia pires", "rua do jardim nr10 3988 porto", "[email protected]", "[email protected]", "070234382", "013387484"]

And the fields my app uses are:

nome 1 | nome 2 | email | telefone | morada | codigo postal 
  • Hi Sérgio. I'm afraid this problem of yours is a bit difficult. Alphanumeric fields, in one column or more, can be treated with n-grams to obtain the probability that a set of words forms an address (a categorization or tagging problem). But the purely numeric fields (telephone, date of birth without /, postal code, etc.) will be very difficult to treat. I have little experience in language processing, but I don't see enough statistical variation between them to be able to differentiate them.

  • It may be easier for you to use a simple expert system (a rule-based system): you create a set of IF-THEN rules that work together to gradually refine what each column is most likely to be (based on measurements of the content of the fields). Your system pre-classifies the columns and presents them to the user with the automatic suggestions, but the user always decides how to treat them. It is important that the user decides because, even with state-of-the-art technology, what you get are probabilities, never certainties.

  • The measurements of the fields will be quite individual (number of characters, whether it has letters or only digits, whether it contains the word "rua" ("street"), etc.) and will be heuristics that you, as the domain expert, will create (see the sketch after these comments). Apparently there are some rules engines for JavaScript; it is worth checking out: http://stackoverflow.com/questions/3430885/lightweight-rules-engine-in-javascript Good luck! :)

  • @Luizvieira thanks for the tips! You pointed out paths and solutions that I don't have the knowledge to reach alone; it will help the research for this problem a lot. It makes sense to use n-grams, I didn't even know the name of that line of analysis. Giving weights to strings, or possibly to numeric types, is one of the paths I will explore. Thank you! (As for the tags, I don't know whether EN or PT is better, maybe I'll post on Meta about this)

  • Not at all, Sérgio. It's as far as I can help with this kind of problem. Let's wait and see if someone actually posts an answer. :) As for the tags I don't have a very well formed opinion, but I tend to think that in English they may be more appropriate. It might be better to discuss it on Meta.
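To make the rule-based suggestion from the comments concrete, here is a minimal sketch in R (the language used in the answer below). Every rule and the guess_type / guess_column names are illustrative assumptions, not any library's API:

# Minimal heuristic classifier; each rule is an assumption the domain
# expert would refine over time.
guess_type <- function(x) {
  if (grepl("@", x))                  return("email")
  if (grepl("^[0-9]{4}$", x))         return("codigo postal")
  if (grepl("^[0-9]{8,9}$", x))       return("telefone")
  if (grepl("\\brua\\b", tolower(x))) return("morada")
  if (!grepl("[0-9]", x))             return("nome")
  "desconhecido"  # unknown: leave the decision to the user
}

# Tag a whole column by majority vote over its values
guess_column <- function(column) {
  names(which.max(table(vapply(column, guess_type, character(1)))))
}

guess_column(c("rua das peras", "rua de porto alegre"))  # "morada"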

1 answer

This question is cool, but the answer is almost a whole project in itself. Deep learning can solve your problem; with JavaScript, I don't know how to answer. I will give an answer in R, which is easily adaptable to Python, and then point out some libs with which you might be able to do it in JavaScript.

Let's go.

Data collection

As with any machine learning project, you will need a database with some already-classified information. Fortunately, in your case it should be simple to build a large database without much effort.

Here, I made a collection of information:

  • I picked up a list of names approved in the 2014 FUVEST exam and separated them into first and last names
  • I got a list of street names in SP (São Paulo)

I created a database that looks like this:

# A tibble: 10 × 2
        tipo                              valor
       <chr>                              <chr>
1        rua rua doutor heitor pereira carrilho
2        rua               rua hipólito vieites
3  sobrenome                     fogaca galdino
4       nome                             rafael
5        rua             rua ida câmara stefani
6  sobrenome                       alves duraes
7  sobrenome                       keiko sonoda
8  sobrenome             barcellos mano rinaldi
9       nome                             victor
10       rua                rua angelo catapano

In the end this database has 60k observations divided between first name (nome), surname (sobrenome), and street (rua). You can add other types of data you want, like phone, zip code, etc. Here I didn't, to keep things simple.
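For reference, here is a sketch of how such a table could be assembled, assuming hypothetical input files fuvest_2014.txt and ruas_sp.txt with one entry per line:

library(tibble)
library(stringr)

# Hypothetical raw lists: approved names and street names
aprovados <- tolower(readLines("fuvest_2014.txt"))
ruas      <- tolower(readLines("ruas_sp.txt"))

# Split full names into first name and surname(s)
nome      <- word(aprovados, 1)
sobrenome <- word(aprovados, 2, -1)

df <- tibble(
  tipo  = c(rep("nome", length(nome)),
            rep("sobrenome", length(sobrenome)),
            rep("rua", length(ruas))),
  valor = c(nome, sobrenome, ruas)
)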

Data processing

The way the data is, it is not suitable to be consumed by a deep learning model. We need an array that we will call X. This array must have 3 dimensions: (n, maxlen, len_char_tab), in which:

  • n is the number of observations you have
  • maxlen is the maximum number of characters an observation can have
  • len_char_tab is the number of distinct characters in the whole database

That is, I turn each sequence like "abc" into a one-hot matrix like this:

1 0 0
0 1 0
0 0 1
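As a quick sanity check, this is one way to produce that matrix in R for the toy three-character alphabet:

# One-hot encode the characters of "abc" against the alphabet c("a", "b", "c")
chars <- c("a", "b", "c")
t(sapply(strsplit("abc", "")[[1]], function(ch) as.integer(chars == ch)))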

I’ve turned this database into what I need as follows:

library(purrr)
library(stringr)
library(keras)  # pad_sequences() comes from keras

# Table of all distinct characters in the database
char_table <- str_split(df$valor, "") %>%
  unlist() %>%
  unique()

# Turn each string into a vector of character indices
vec <- map(
  df$valor,
  ~ unlist(str_split(.x, "")) %>%
    map_int(~ which(.x == char_table))
)

# Pad every sequence with zeros up to the longest one
maxlen <- max(map_int(vec, length))
vec <- pad_sequences(vec, maxlen = maxlen)

# One-hot encode each index against 0:64 (0 = padding, 1-64 = char_table)
vec <- apply(vec, c(1, 2), function(x) as.integer(x == 0:64))
vec <- aperm(vec, c(2, 3, 1))  # reorder to (n, maxlen, len_char_tab)

My object vec is now the array X I was discussing, with dimensions 60023 × 58 × 65.

We also need a matrix called Y with dimensions (n, n_types), where n is your sample size and n_types is the number of distinct types. Entry (i, j) of this matrix is 1 if observation i is of type j and 0 otherwise.

I did it that way:

# One-hot encode the types: one row per observation, one column per type
all_res <- unique(df$tipo)
res <- sapply(df$tipo, function(x) as.integer(x == all_res)) %>% t()

The object res is the matrix Y I described, with dimensions 60023 × 3.

Model definition

Now let's use keras to define an LSTM. I'm not going to try to explain what an LSTM is because it's quite difficult, and Colah has already explained it 100x better than anyone else would. Read more here: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

The code to define the model is below:

library(keras)
model <- keras_model_sequential()
model %>%
  layer_lstm(units = 128, input_shape = c(58, 65)) %>%  # (maxlen, len_char_tab)
  layer_dense(units = 3) %>%                            # one unit per type
  layer_activation("softmax")                           # probabilities over types

model %>% compile(
  optimizer = "adam",
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)

Model training

Training the model is the easy part:

model %>% fit(
  x = vec, y = res,
  validation_split = 0.1,  # hold out 10% of the data for validation
  shuffle = TRUE,
  batch_size = 32
  # epochs defaults to 10 in keras for R, as in the log below
)

Result:

Train on 54020 samples, validate on 6003 samples
Epoch 1/10
54020/54020 [==============================] - 372s - loss: 0.0966 - acc: 0.9707 - val_loss: 0.0070 - val_acc: 0.9992

In a single epoch the model managed to get virtually all the observations I left for validation right. Of course, my database is much simpler than yours: it's tidy, and most streets start with the word "rua", which helps a lot. Expect worse results than this, but maybe not that much worse.

Usage

In your case, after training the model on a database like this, I would apply the model's predictions to each value of each column, see which type appears most often (first name, surname, or street), and mark the column as being of that type.
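A minimal sketch of that column-tagging step, reusing the objects built above (model, char_table, maxlen, all_res); the classify_column name is mine, not a library function, and it assumes every character in the column already appears in char_table:

classify_column <- function(column) {
  # Same preprocessing used to build the training array
  idx <- map(column, ~ map_int(unlist(str_split(.x, "")),
                               ~ which(.x == char_table)))
  x <- pad_sequences(idx, maxlen = maxlen)
  x <- aperm(apply(x, c(1, 2), function(v) as.integer(v == 0:64)), c(2, 3, 1))
  probs <- predict(model, x)        # one row of type probabilities per value
  types <- all_res[max.col(probs)]  # most probable type for each value
  names(which.max(table(types)))    # majority vote across the column
}

classify_column(c("rua das peras", "rua de porto alegre"))  # "rua"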

But what about JavaScript?

Database

I made the database available at this link: https://drive.google.com/open?id=0B9I1XHoC4uO6anJTbmJzelppeFE

You can read in R using df <- readRDS("df.rds").
