This question is cool, but the answer is almost a whole work project on its own. It is possible to solve your problem with deep learning. I don't know how to answer it with JavaScript, so I will give an answer in R, which is easily adaptable to Python, and then point out some libraries that might let you do it in JavaScript.
Let's go.
Data collection
As with any machine learning project, you will need a database with some already-classified information. Fortunately, in your case it should be simple to build a large database without much effort.
Here is how I collected the information:
- I took a list of names of people approved in the 2014 FUVEST exam and separated them into first names and surnames
- I took a list of street names in São Paulo (SP)
I created a database that looks like this:
# A tibble: 10 × 2
   tipo      valor
   <chr>     <chr>
 1 rua       rua doutor heitor pereira carrilho
 2 rua       rua hipólito vieites
 3 sobrenome fogaca galdino
 4 nome      rafael
 5 rua       rua ida câmara stefani
 6 sobrenome alves duraes
 7 sobrenome keiko sonoda
 8 sobrenome barcellos mano rinaldi
 9 nome      victor
10 rua       rua angelo catapano
In the end this database has 60k records divided between name (nome), surname (sobrenome) and street (rua). You can add other types of data you want, like phone, zip code, etc. I didn't do that here, to keep things simple.
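This is not how I actually assembled the database, but a minimal sketch of its structure, assuming hypothetical vectors nomes, sobrenomes and ruas (in practice they would come from the FUVEST list and the SP street list):

library(dplyr)
library(tibble)

nomes      <- c("rafael", "victor")
sobrenomes <- c("fogaca galdino", "alves duraes")
ruas       <- c("rua hipólito vieites", "rua angelo catapano")

# One row per value, with its type in the `tipo` column
df <- bind_rows(
  tibble(tipo = "nome",      valor = nomes),
  tibble(tipo = "sobrenome", valor = sobrenomes),
  tibble(tipo = "rua",       valor = ruas)
)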
Data processing
As it stands, the data is not suitable to be consumed by a deep learning model. We need an array that we will call X. This array must have 3 dimensions: (n, maxlen, len_char_tab), in which:
- n is the number of observations you have
- maxlen is the maximum number of characters an observation can have
- len_char_tab is the number of distinct characters in the whole database
That is, I turn each sequence like "abc" into a one-hot encoded matrix like this:
1 0 0
0 1 0
0 0 1
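Just to make the encoding concrete, a minimal sketch assuming the character table is only c("a", "b", "c"):

char_table <- c("a", "b", "c")
chars <- unlist(strsplit("abc", ""))
# Each row is the one-hot vector of one character of "abc"
t(sapply(chars, function(ch) as.integer(char_table == ch)))
#>   [,1] [,2] [,3]
#> a    1    0    0
#> b    0    1    0
#> c    0    0    1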
I’ve turned this database into what I need as follows:
library(purrr)
library(stringr)
library(keras)   # pad_sequences() comes from keras

# All distinct characters found in the database
char_table <- stringr::str_split(df$valor, "") %>%
  unlist() %>%
  unique()

# Turn each string into a sequence of integer character indices
vec <- map(
  df$valor,
  ~unlist(str_split(.x, "")) %>%
    map_int(~which(.x == char_table))
)

# Pad every sequence with zeros up to the length of the longest one
maxlen <- max(map_int(vec, length))
vec <- pad_sequences(vec, maxlen = maxlen)

# One-hot encode each index (0 is the padding; here length(char_table) is 64,
# so indices run from 0 to 64) and rearrange to (n, maxlen, len_char_tab)
vec <- apply(vec, c(1, 2), function(x) as.integer(x == 0:64))
vec <- aperm(vec, c(2, 3, 1))
Here my object vec is the array X that I was discussing, and it has the following dimensions: 60023 58 65.
We also need a matrix called Y, which will have the following dimensions: (n, n_types). n is your sample size and n_types is the number of distinct types. Cell (i, j) of this matrix is 1 if observation i is of type j and 0 otherwise.
I did it like this:
# The distinct types (nome, sobrenome, rua) in a fixed order
all_res <- unique(df$tipo)
# One-hot encode the type of each observation (one row per observation)
res <- sapply(df$tipo, function(x) as.integer(x == all_res)) %>% t()
The object res is the matrix Y that I mentioned, and it has dimensions 60023 × 3.
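As a side note, an equivalent way to build this matrix (a sketch, assuming the keras package is loaded) is to map each type to an integer and use to_categorical():

# Map each type to an integer starting at 0 and one-hot encode it
tipos <- factor(df$tipo, levels = all_res)
res   <- to_categorical(as.integer(tipos) - 1, num_classes = length(all_res))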
Model definition
Now let's use keras to define an LSTM.
I'm not going to try to explain what an LSTM is, because it is quite difficult and Colah has already explained it 100x better than anyone else could. Read more here.
The code to define the model is below:
library(keras)
model <- keras_model_sequential()
model %>%
  # input shape = (maxlen, len_char_tab) = (58, 65)
  layer_lstm(units = 128, input_shape = c(58, 65)) %>%
  # one output unit per type (nome, sobrenome, rua), with softmax probabilities
  layer_dense(3) %>%
  layer_activation("softmax")
model %>% compile(
  optimizer = "adam",
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)
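If you want to check the architecture before training (an optional step, not part of the original recipe), you can print a summary of the model:

# Shows each layer's output shape and parameter count
summary(model)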
Model training
Training the model is the easy part:
model %>% fit(
  x = vec, y = res,
  validation_split = 0.1,  # hold out 10% of the data for validation
  shuffle = TRUE,
  batch_size = 32
)
Result:
Train on 54020 samples, validate on 6003 samples
Epoch 1/10
54020/54020 [==============================] - 372s - loss: 0.0966 - acc: 0.9707 - val_loss: 0.0070 - val_acc: 0.9992
In a single epoch the model managed to correctly classify virtually all of the observations I set aside for validation. Of course, my database is much simpler than yours: it is a toy example, and most street entries start with the word "rua", which helps a lot. Expect worse results than this, but maybe not that much worse.
Usage
In your case, after training the model on a database like this, I would apply the predictions to each of the columns, see which result appears most often (name, surname or address), and mark that column as being of that type.
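A minimal sketch of this idea, where encode_column() is a hypothetical helper that applies the same preprocessing used to build vec (character indices, padding, one-hot encoding) to the values of a single column:

classify_column <- function(valores) {
  x_col <- encode_column(valores)               # (n, 58, 65) array
  probs <- predict(model, x_col)                # (n, 3) matrix of probabilities
  pred  <- all_res[apply(probs, 1, which.max)]  # predicted type for each value
  names(sort(table(pred), decreasing = TRUE))[1]  # the most frequent type wins
}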
But what about JavaScript?
Database
I left the database available at this link: https://drive.google.com/open?id=0B9I1XHoC4uO6anJTbmJzelppeFE
You can read it in R using df <- readRDS("df.rds").
Hi Sergio. I'm afraid this problem of yours is a bit difficult. Alphanumeric fields, in one or more columns, may be treated with n-grams in order to obtain the probability that a set of words forms an address (a categorization or tagging problem). But it will be very difficult to handle the purely numeric fields (telephone, date of birth without "/", postal code, etc.). I have little experience with language processing, but I don't see enough statistical variation between them to be able to differentiate them. – Luiz Vieira
It may be easier for you to use a simple expert system (a rule-based system): you create a set of IF-THEN rules that work together to gradually refine what each column most likely is (based on measurements of the content of the fields). Your system pre-classifies the columns and presents them to the user with the automatic suggestions, but the user always decides how to treat each column. It is important for the user to decide, because even using state-of-the-art technology, what you will get are probabilities, never certainties.
– Luiz Vieira
The measurements of the fields will be quite specific (number of characters, whether it contains letters or only digits, whether it contains the word "rua", etc.) and will be heuristics that you, as a domain expert, will create. Apparently there are some rule engines for JavaScript. It is worth checking out: http://stackoverflow.com/questions/3430885/lightweight-rules-engine-in-javascript Good luck! :)
– Luiz Vieira
@Luizvieira thanks for the tips! You pointed me toward paths and solutions that I don't have the knowledge to work out alone; it will help this research a lot. It makes sense to use n-grams, I didn't even know the name of that line of analysis. Giving weights to strings, or possibly to numeric types, is one of the paths I will explore. Thank you! (As for the tags, I don't know whether EN or PT is better; maybe I'll post on Meta about this.)
– Sergio
Not at all, Sérgio. That's as far as I can help with this kind of problem. Let's wait and see if someone actually posts an answer. :) As for the tags, I don't have a very well-formed opinion, but I tend to think that in English they may be more appropriate. It would be better to discuss that on Meta anyway.
– Luiz Vieira
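To illustrate the kind of rule-based heuristics described in these comments, here is a minimal sketch (the rules and thresholds are made up for illustration; a real system would use many more measurements tuned by a domain expert):

# Guess a column's type from simple, hand-written rules
guess_column_type <- function(valores) {
  valores <- tolower(valores)
  if (all(grepl("^[0-9() +-]+$", valores)))              return("telefone")
  if (mean(grepl("\\brua\\b|\\bav\\b", valores)) > 0.5)  return("endereco")
  if (mean(nchar(valores) <= 15) > 0.8)                  return("nome")
  "desconhecido"
}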