Word classification

I’m trying to classify the words of a dictionary that Tesseract produces when it analyzes an image containing standardized text, such as these:

Note: The "CLASSIFICAÇÃO" column has been added to illustrate the desired classification. Rows where it is blank are unwanted words.

Dictionary 1:

+-----------+---------+----------+----------+------+------+-------+--------+-----------------+---------------+
| block_num | par_num | line_num | word_num | left | top  | width | height |      text       | CLASSIFICAÇÃO |
+-----------+---------+----------+----------+------+------+-------+--------+-----------------+---------------+
|         1 |       1 |        1 |        1 |   51 |  150 |    76 |     56 | E,              |               |
|         1 |       1 |        1 |        2 |  156 |  146 |   169 |     52 | Rafael          | NOME          |
|         1 |       1 |        1 |        3 |  354 |  147 |   260 |     51 | Bernardo        | NOME          |
|         1 |       1 |        1 |        4 |  639 |  147 |   215 |     51 | Silveira.       | NOME          |
|         1 |       1 |        1 |        5 |  879 |  149 |   120 |     49 | GPS:            |               |
|         1 |       1 |        1 |        6 | 1024 |  149 |   460 |     57 | 753.553.554-20, | CPF           |
|         1 |       1 |        2 |        1 |   50 |  236 |   203 |     65 | prcis           |               |
|         1 |       1 |        2 |        2 |  279 |  236 |   226 |     51 | resolver        |               |
|         1 |       1 |        2 |        3 |  526 |  251 |   120 |     36 | ese             |               |
|         1 |       1 |        2 |        4 |  672 |  236 |   289 |     65 | problma.        |               |
|         1 |       1 |        2 |        5 |  989 |  239 |   143 |     48 | Date:           |               |
|         1 |       1 |        2 |        6 | 1157 |  238 |   334 |     56 | 02/01/2019      | DATA          |
|         2 |       1 |        1 |        1 |   51 |  414 |   357 |     51 | Nascimento:     |               |
|         2 |       1 |        1 |        2 |  433 |  416 |   334 |     56 | 24/07/1997      | NASCIMENTO    |
|         2 |       1 |        1 |        3 |  913 |  414 |   175 |     66 | Sino:           |               |
|         2 |       1 |        1 |        4 | 1116 |  416 |   131 |     49 | Ledo            | SIGNO         |
|         3 |       1 |        1 |        1 |   51 |  594 |   134 |     49 | Mee:            |               |
|         3 |       1 |        1 |        2 |  213 |  592 |   203 |     51 | Rebeca          | MÃE           |
|         3 |       1 |        1 |        3 |  445 |  592 |   179 |     51 | Louise          | MÃE           |
|         3 |       1 |        1 |        4 |  651 |  592 |   174 |     51 | Betina          | MÃE           |
|         3 |       1 |        2 |        1 |   51 |  681 |    93 |     51 | Pul:            |               |
|         3 |       1 |        2 |        2 |  169 |  681 |   124 |     51 | Caio            | PAI           |
|         3 |       1 |        2 |        3 |  320 |  681 |   178 |     51 | Heitor          | PAI           |
|         3 |       1 |        2 |        4 |  522 |  684 |   227 |     48 | Lorenzo         | PAI           |
|         3 |       1 |        2 |        5 |  774 |  681 |   199 |     51 | Silveira        | PAI           |
|         4 |       1 |        1 |        1 |   48 |  859 |   212 |     51 | idade:          |               |
|         4 |       1 |        1 |        2 |  288 |  858 |   214 |     52 | Pindaré         | CIDADE        |
|         4 |       1 |        1 |        3 |  529 |  859 |   162 |     50 | Mirim           | CIDADE        |
|         4 |       1 |        1 |        4 |  719 |  888 |    18 |      4 | -               |               |
|         4 |       1 |        1 |        5 |  765 |  862 |    96 |     47 | MA              | ESTADO        |
|         4 |       1 |        2 |        1 |   51 |  948 |   279 |     64 | Endeco:         |               |
|         4 |       1 |        2 |        2 |  358 |  951 |   101 |     48 | Rua             | ENDEREÇO      |
|         4 |       1 |        2 |        3 |  485 |  948 |   222 |     59 | Grande,         | ENDEREÇO      |
|         4 |       1 |        2 |        4 |  730 |  950 |   105 |     49 | 498             | ENDEREÇO      |
|         4 |       1 |        2 |        5 |  861 |  977 |    18 |      4 | =               |               |
|         4 |       1 |        2 |        6 |  904 |  950 |   193 |     49 | Centro          | BAIRRO        |
|         4 |       1 |        3 |        1 |   51 | 1039 |    91 |     49 | :               |               |
|         4 |       1 |        3 |        2 |  168 | 1039 |   373 |     49 | 32.622.441-5    | RG            |
|         4 |       1 |        3 |        3 |  672 | 1039 |   111 |     49 | Cor:            |               |
|         4 |       1 |        3 |        4 |  810 | 1037 |   185 |     66 | laranja         | COR           |
|         5 |       1 |        1 |        1 |   51 | 1306 |   114 |     49 | Não             |               |
|         5 |       1 |        1 |        2 |  191 | 1303 |    63 |     52 | ha              |               |
|         5 |       1 |        1 |        3 |  281 | 1303 |   252 |     67 | ninguém         |               |
|         5 |       1 |        1 |        4 |  559 | 1319 |   107 |     50 | que             |               |
|         5 |       1 |        1 |        5 |  689 | 1319 |   121 |     36 | ame             |               |
|         5 |       1 |        1 |        6 |  834 | 1319 |    27 |     36 | a               |               |
|         5 |       1 |        1 |        7 |  886 | 1304 |    98 |     51 | dor             |               |
|         5 |       1 |        1 |        8 | 1007 | 1319 |    96 |     50 | por             |               |
|         5 |       1 |        1 |        9 | 1124 | 1304 |    37 |     51 | si              |               |
|         5 |       1 |        1 |       10 | 1186 | 1303 |    74 |     60 | sO,             |               |
|         5 |       1 |        1 |       11 | 1286 | 1319 |   106 |     50 | qUe             |               |
|         5 |       1 |        1 |       12 | 1416 | 1319 |    27 |     36 | a               |               |
|         5 |       1 |        1 |       13 | 1490 | 1304 |   209 |     65 | busquE          |               |
|         5 |       1 |        2 |        1 |   44 | 1404 |    29 |     36 | e               |               |
|         5 |       1 |        2 |        2 |   97 | 1389 |   177 |     65 | qeira           |               |
|         5 |       1 |        2 |        3 |  298 | 1388 |   144 |     60 | té-la,          |               |
|         5 |       1 |        2 |        4 |  468 | 1389 |   403 |     65 | simplesmente    |               |
|         5 |       1 |        2 |        5 |  897 | 1404 |    96 |     50 | por             |               |
|         5 |       1 |        2 |        6 | 1014 | 1404 |    83 |     36 | ser             |               |
|         5 |       1 |        2 |        7 | 1118 | 1389 |   139 |     51 | dor...          |               |
+-----------+---------+----------+----------+------+------+-------+--------+-----------------+---------------+

Dictionary 2:

+-----------+---------+----------+----------+------+------+-------+--------+-----------------+---------------+
| block_num | par_num | line_num | word_num | left | top  | width | height |      text       | CLASSIFICAÇÃO |
+-----------+---------+----------+----------+------+------+-------+--------+-----------------+---------------+
|         1 |       1 |        1 |        2 |  161 |   22 |   190 |     53 | Otavio          | NOME          |
|         1 |       1 |        1 |        3 |  372 |   22 |   174 |     53 | Victor          | NOME          |
|         1 |       1 |        1 |        4 |  566 |   25 |   196 |     49 | Castro.         | NOME          |
|         1 |       1 |        1 |        5 |  787 |   25 |   120 |     49 | CPF:            |               |
|         1 |       1 |        1 |        6 |  933 |   25 |   459 |     56 | 639.335.496-80, | CPF           |
|         1 |       1 |        1 |        7 | 1421 |   24 |   202 |     64 | preciso         |               |
|         1 |       1 |        2 |        1 |   58 |  112 |   226 |     51 | resolver        |               |
|         1 |       1 |        2 |        2 |  305 |  127 |   121 |     36 | esse            |               |
|         1 |       1 |        2 |        3 |  451 |  112 |   289 |     65 | problema.       |               |
|         1 |       1 |        2 |        4 |  768 |  115 |   143 |     49 | Data:           |               |
|         1 |       1 |        2 |        5 |  936 |  114 |   334 |     56 | 03/04/2019      | DATA          |
|         2 |       1 |        1 |        1 |   59 |  289 |   357 |     52 | Nascimento:     |               |
|         2 |       1 |        1 |        2 |  441 |  292 |   334 |     57 | 26/01/1997      | NASCIMENTO    |
|         2 |       1 |        1 |        3 |  921 |  290 |   175 |     66 | Signo:          |               |
|         2 |       1 |        1 |        4 | 1120 |  289 |   227 |     66 | Aquario         | SIGNO         |
|         3 |       1 |        1 |        1 |   59 |  470 |   135 |     49 | Mae:            |               |
|         3 |       1 |        1 |        2 |  216 |  468 |   144 |     51 | Aline           | MÃE           |
|         3 |       1 |        1 |        3 |  384 |  470 |   120 |     49 | Sara            | MÃE           |
|         3 |       1 |        2 |        1 |   59 |  557 |    93 |     51 | Pai:            |               |
|         3 |       1 |        2 |        2 |  180 |  555 |   106 |     53 | Luis            | PAI           |
|         3 |       1 |        2 |        3 |  312 |  557 |   165 |     51 | Pedro           | PAI           |
|         3 |       1 |        2 |        4 |  504 |  557 |   263 |     66 | Henrique        | PAI           |
|         4 |       1 |        1 |        1 |   56 |  735 |   212 |     51 | Cidade:         |               |
|         4 |       1 |        1 |        2 |  296 |  734 |   222 |     66 | Macapa          | CIDADE        |
|         4 |       1 |        1 |        3 |  544 |  764 |    19 |      4 | -               |               |
|         4 |       1 |        1 |        4 |  586 |  737 |    77 |     48 | AP              | ESTADO        |
|         4 |       1 |        2 |        1 |   59 |  824 |   279 |     64 | Endereco:       |               |
|         4 |       1 |        2 |        2 |  361 |  824 |   231 |     51 | Avenida         | ENDEREÇO      |
|         4 |       1 |        2 |        3 |  621 |  824 |   239 |     51 | Primeiro        | ENDEREÇO      |
|         4 |       1 |        2 |        4 |  884 |  824 |    68 |     52 | de              | ENDEREÇO      |
|         4 |       1 |        2 |        5 |  979 |  824 |   154 |     59 | Maio,           | ENDEREÇO      |
|         4 |       1 |        2 |        6 | 1162 |  825 |    98 |     50 | 149             | ENDEREÇO      |
|         4 |       1 |        2 |        7 | 1286 |  853 |    18 |      4 | -               |               |
|         4 |       1 |        2 |        8 | 1331 |  824 |   224 |     52 | Buritizal       | BAIRRO        |
|         4 |       1 |        3 |        1 |   59 |  915 |    91 |     49 | RG:             |               |
|         4 |       1 |        3 |        2 |  176 |  914 |   373 |     50 | 33.101.777-5    | RG            |
|         4 |       1 |        3 |        3 |  680 |  915 |   112 |     49 | Cor:            |               |
|         4 |       1 |        3 |        4 |  814 |  914 |   159 |     50 | verde           | COR           |
|         5 |       1 |        1 |        1 |   43 | 1203 |   114 |     48 | Nao             |               |
|         5 |       1 |        1 |        2 |  184 | 1199 |    61 |     51 | ha              |               |
|         5 |       1 |        1 |        3 |  273 | 1199 |   251 |     66 | ninguém         |               |
|         5 |       1 |        1 |        4 |  552 | 1215 |   105 |     50 | que             |               |
|         5 |       1 |        1 |        5 |  682 | 1215 |   121 |     36 | ame             |               |
|         5 |       1 |        1 |        6 |  826 | 1215 |    27 |     36 | a               |               |
|         5 |       1 |        1 |        7 |  878 | 1200 |    98 |     51 | dor             |               |
|         5 |       1 |        1 |        8 |  999 | 1215 |    97 |     50 | por             |               |
|         5 |       1 |        1 |        9 | 1116 | 1200 |    36 |     51 | si              |               |
|         5 |       1 |        1 |       10 | 1179 | 1199 |    74 |     60 | s,              |               |
|         5 |       1 |        1 |       11 | 1278 | 1214 |   107 |     51 | que             |               |
|         5 |       1 |        1 |       12 | 1408 | 1215 |    27 |     36 | aA              |               |
|         5 |       1 |        1 |       13 | 1482 | 1200 |   210 |     65 | busque          |               |
|         5 |       1 |        2 |        1 |   30 | 1298 |    30 |     36 | E               |               |
|         5 |       1 |        2 |        2 |   83 | 1284 |   178 |     64 | queira          |               |
|         5 |       1 |        2 |        3 |  285 | 1282 |   142 |     60 | te-la,          |               |
|         5 |       1 |        2 |        4 |  455 | 1283 |   403 |     65 | simplesmente    |               |
|         5 |       1 |        2 |        5 |  883 | 1298 |    95 |     50 | por             |               |
|         5 |       1 |        2 |        6 | 1000 | 1298 |    83 |     36 | ser             |               |
|         5 |       1 |        2 |        7 | 1104 | 1282 |   139 |     52 | dor...          |               |
+-----------+---------+----------+----------+------+------+-------+--------+-----------------+---------------+
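
The block_num, par_num, line_num and word_num columns are enough to rebuild the printed lines, which helps when a word's class depends on the label that precedes it. A minimal sketch, assuming a few rows copied by hand from the second table (the `rows` list and variable names are only for illustration):

```python
import pandas as pd

# A few rows copied from the second table; only the columns needed here.
rows = [
    (3, 1, 1, 1, "Mae:"),
    (3, 1, 1, 2, "Aline"),
    (3, 1, 1, 3, "Sara"),
    (3, 1, 2, 1, "Pai:"),
    (3, 1, 2, 2, "Luis"),
    (3, 1, 2, 3, "Pedro"),
    (3, 1, 2, 4, "Henrique"),
]
df = pd.DataFrame(rows, columns=["block_num", "par_num", "line_num", "word_num", "text"])

# Words that share block, paragraph and line numbers belong to one printed line.
line_keys = ["block_num", "par_num", "line_num"]
lines = (
    df.sort_values(line_keys + ["word_num"])
      .groupby(line_keys)["text"]
      .apply(" ".join)
      .tolist()
)
print(lines)
```

Each rebuilt line starts with its field label, so the label can later serve as context for the words that follow it.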

I tried using scikit-learn to classify one field at a time, but I can’t tell which model to use for my dataset (linear, regression, ...).

Here’s what I have so far:

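import pandas as pd
from sklearn.svm import LinearSVC

# Load the Tesseract output exported as a semicolon-separated file
df = pd.read_csv("main.csv", sep=';', lineterminator='\r')

# Encode each distinct word as an integer code so the model accepts it
df['text'] = df['text'].astype('category').cat.codes

x = df[["block_num","par_num","line_num","word_num","left","top","width","height","text"]]
# Rows without a label are the unwanted words; give them an explicit class
# ("NONE" is a placeholder name) because fit() cannot handle missing labels
y = df['CLASSIFICAÇÃO'].fillna('NONE')

clf = LinearSVC()
clf.fit(x, y)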
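
One possible weakness of the snippet above is that category codes turn each distinct word into an arbitrary integer, so the model learns nothing about what a word looks like and has no code for unseen words. The following is a rough sketch of an alternative, not a tested solution: character n-grams of the word, plus the most recent "Label:" token on the same line as context, fed to a LinearSVC with class_weight="balanced" so the many unwanted rows do not dominate. The toy rows, the "NONE" class name and the label_ctx column are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy rows taken from the second table; a real run would load every labelled page.
rows = [
    (1, 1, 1, 5, "CPF:", ""), (1, 1, 1, 6, "639.335.496-80,", "CPF"),
    (1, 1, 2, 4, "Data:", ""), (1, 1, 2, 5, "03/04/2019", "DATA"),
    (3, 1, 1, 1, "Mae:", ""), (3, 1, 1, 2, "Aline", "MÃE"), (3, 1, 1, 3, "Sara", "MÃE"),
    (3, 1, 2, 1, "Pai:", ""), (3, 1, 2, 2, "Luis", "PAI"), (3, 1, 2, 3, "Pedro", "PAI"),
    (5, 1, 1, 4, "que", ""), (5, 1, 1, 5, "ame", ""),
]
cols = ["block_num", "par_num", "line_num", "word_num", "text", "CLASSIFICAÇÃO"]
df = pd.DataFrame(rows, columns=cols)

# Unwanted words become an explicit class ("NONE" is a hypothetical name).
y = df["CLASSIFICAÇÃO"].replace("", "NONE")

# Context feature: the most recent "Label:" token on the same line, if any.
line_keys = ["block_num", "par_num", "line_num"]
df["label_ctx"] = df["text"].where(df["text"].str.endswith(":"))
df["label_ctx"] = df.groupby(line_keys)["label_ctx"].ffill().fillna("")

# Character n-grams capture the shape of a word (digits, dots, slashes, ...).
features = ColumnTransformer([
    ("word", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)), "text"),
    ("ctx", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)), "label_ctx"),
])
clf = make_pipeline(features, LinearSVC(class_weight="balanced"))
clf.fit(df[["text", "label_ctx"]], y)

pred = clf.predict(df[["text", "label_ctx"]])
print(list(zip(df["text"], pred)))
```

Predicting on the training rows, as done here, only shows that the pipeline runs; measuring accuracy would need labelled pages kept out of training.
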
  • No training data?

  • Actually, I would also like tips on building the training data because, as you can see, my candidate dataset contains far more rows classified as "useless" (the rows with an empty classification in the dictionaries) than labelled ones, and this may end up skewing the classifier.

  • Dude, in AI classification we usually rely on ready-made training datasets; building your own is complicated. My experience is with image classification, and scikit-learn itself offers datasets for that. In your case, the problem will be finding ready-made datasets in English.

  • Several sites on the internet offer datasets of various kinds, for example Gee. You can also search with Google Dataset Search.

  • The data already "exists": as I said in the question, it is the output of Tesseract when reading an image with a defined layout. The question is how to optimize the accuracy of the classifier.

  • Yes, I know the data already exists. When I say dataset, I mean the TRAINING dataset. This kind of classification without a training dataset is, if not impossible, at least impractical.

  • One suggestion would be to take a dictionary and check whether each word is in it: if it is, classify it as a hit; if not, as an error.

  • Looking again at the text you present, it is hard to give a suggestion. It seems you want to classify words within a particular context rather than decide whether the OCR read them correctly; without knowing the conditions, that is difficult.
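
Because the layout is fixed, a rule-based pass needs no training data and could serve as a baseline or as a way to pre-label pages. The sketch below rests on several assumptions: the label list and value formats are inferred only from the two sample tables, labels are matched fuzzily with difflib to tolerate small OCR errors, and the function names are hypothetical. It does not handle the unlabelled name at the top of the page or the words after a "-" separator (state, neighbourhood).

```python
import difflib
import re

# Field labels as printed on the form (lower-cased, colon removed) -> target class.
LABELS = {
    "cpf": "CPF", "data": "DATA", "nascimento": "NASCIMENTO", "signo": "SIGNO",
    "mae": "MÃE", "pai": "PAI", "cidade": "CIDADE", "endereco": "ENDEREÇO",
    "rg": "RG", "cor": "COR",
}

# Value formats inferred from the two sample tables only.
CPF_RE = re.compile(r"\d{3}\.\d{3}\.\d{3}-\d{2}")
RG_RE = re.compile(r"\d{2}\.\d{3}\.\d{3}-\d")
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")
FIELD_PATTERNS = {"CPF": CPF_RE, "RG": RG_RE, "DATA": DATE_RE, "NASCIMENTO": DATE_RE}


def field_for_label(token):
    """Map a 'Label:' token to a class, tolerating small OCR errors."""
    key = token.rstrip(":").lower()
    match = difflib.get_close_matches(key, list(LABELS), n=1, cutoff=0.6)
    return LABELS[match[0]] if match else None


def classify_line(tokens):
    """Classify the words of one line; '' marks an unwanted word."""
    result, current = [], None
    for token in tokens:
        if token.endswith(":"):
            current = field_for_label(token)
            result.append("")
            continue
        value = token.strip(".,;")
        # Self-describing values are recognised even when the label was misread.
        if CPF_RE.fullmatch(value):
            result.append("CPF")
        elif RG_RE.fullmatch(value):
            result.append("RG")
        elif current in FIELD_PATTERNS:
            ok = FIELD_PATTERNS[current].fullmatch(value)
            result.append(current if ok else "")
        else:
            result.append(current or "")
    return result


print(classify_line(["Nascimento:", "24/07/1997", "Sino:", "Ledo"]))
print(classify_line(["GPS:", "753.553.554-20,"]))
```

Some misreads are too far from any label for fuzzy matching to recover (for example "Pul:" for "Pai:"), so the words after them would stay unclassified; position on the page (block_num, line_num) might be a more reliable cue in those cases.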
