How to return the most appearing words in a column?

Asked

Viewed 1,113 times

6

I have this table below which has two columns, being id and description:

CREATE TABLE myBigTable (
    id INT(11) AUTO_INCREMENT PRIMARY KEY,
    description TEXT NOT NULL
)

After entering some records, I made a select simple and resulted in this below:

+----+-----------------------+
| id |     description       |
+----+-----------------------+
| 1  | joão de santo cristo  |
| 2  | eduardo e mô nica     |
| 3  | santo cristo joão     |
| 4  | cristo tadeu joão     |
| 5  | juazeiro do joão      |
+----+-----------------------+

Would like a select that returns the amount of times each word appears in a given column, for example in the column description. Below follows the return in which it would be desired, containing the word and the quantity. See:

+------------+----------+
|    work    |    qnt   |
+------------+----------+
|    joão    |    4     |
|   cristo   |    3     |
|   santo    |    2     |
|    ...     |   ...    |
+------------+----------+

I did a small test using LIKE, but I have to put the word in which I want to return

SELECT count(*) as qnt FROM `phrase` WHERE description LIKE "%joao%"

Return:

+----------+
|   qnt    |
+----------+
|    4     |
+----------+

How would a select to return the words that appear most in a given column? It is possible to do this using only database resources?

Note: it need not necessarily be the amount of times a word appears, but rather, if possible, the amount of lines it meets.

  • You’ve been listening a lot Urban Legion - 2 :P

  • @Bonifazio! =)

  • I find a very complex question, I do not know if there is a performative way to do this :P first would have to explode every word of each line, and then count using LIKE "% string %" and excluding those that have already entered the count...

  • It would be a mixture of this function with this logic

  • You have a maximum length of words. For example thousand? Did not answer the question Xp

  • @Brunocosta does not have the maximum size, but I would need at most 100, which would be enough to make decisions. And now, responded?!

  • the point of your test is if Voce has the same word in the same field ex. joão de são joão del rey, would have to count 2x right?

  • with the same problem if solve brand solved in which

Show 3 more comments

2 answers

4

I found a solution online, I give all the credit to author of the article who found an ingenious solution to split a string into Mysql. However the solution of it requqer that Voce create a table of numbers with as many entries as the number of words that Voce supports.

Here is an example:

Scheme:

Create table Data2(
  id int AUTO_INCREMENT primary key,
  nome nvarchar(255)
);

insert into Data2 (nome) values ('ola');
insert into Data2 (nome) values ('ola adeus');
insert into Data2 (nome) values ('adeus ola');
insert into Data2 (nome) values ('gelados ola');
insert into Data2 (nome) values ('viva');

create table numbers (
  n int
);

insert into numbers values (1);
insert into numbers values (2);
--...

Query:

select nome, count(1) totalLinhas from (
  select id, nome, count(1) as total from ( 
    select 
      id, 
      substring_index(
        substring_index(nome, ' ', n), 
        ' ', 
        -1
      ) as nome
    from Data2
    join numbers
      on char_length(nome) 
        - char_length(replace(nome, ' ', '')) 
        >= n - 1
  ) t1
  group by id, nome
)t2
group by nome

sqlfiddle

  • But there in the context I speak : "...return the amount of times each word appears... "

  • @Acklay Can do Join with the table itself, if you trust that she has sufficient records

  • I tested the fiddle, and it’s returning only 4x ola, it shouldn’t be 6x?

  • @Brunocosta now that I read the media lyrics of his post, I think I’ll have to ask that question then ... kkkk

1

I don’t know how this would look in Mysql, I’m without the bank here to do the test, but in sql server would be like this;

declare @tabela table
(
    id int,
    description varchar(500)
) 

declare @tabelaAux table
(
    description varchar(500)
) 

declare @texto varchar(200), @textoAux varchar(200), @Index int,@Separador varchar(2) = ' '

--insert into @tabela values
--( 1,'joão de santo cristo')  
--,(2,'eduardo e mô nica')    
--,(3,'santo cristo joão')    
--,(4,'cristo tadeu joão')   
--,(5,'juazeiro do joão') 

insert into @tabela values 
( 1,'joão de santo cristo joão carlos joão joão de santo cristo joão carlos joão  ') ,
(2,'eduardo e mô nica') ,
(3,'joão santo joão cristo cristo  cristo cristo  joão') ,
(4,'cristo tadeu joão') ,
(5,'juazeiro do joão')


-- Cursor para percorrer os nomes dos objetos 
DECLARE cursor_tabela CURSOR FOR
select description from @tabela

    -- Abrindo Cursor para leitura
    OPEN cursor_tabela

    -- Lendo a próxima tabela
    FETCH NEXT FROM cursor_tabela INTO @texto

    -- Percorrendo linhas do cursor (enquanto houverem)
    WHILE @@FETCH_STATUS = 0
    BEGIN
         SET @Index = charIndex(@Separador,@texto)
         WHILE (@Index > 0) BEGIN  
            set @textoAux = SubString(@texto,1,@Index-1);
            insert into @tabelaAux( description) values(@textoAux)
            set @texto =  SubString(@texto, @Index+1, len(@texto) - @Index)
            SET @Index = charIndex(@Separador,@texto)

        END
        insert into @tabelaAux( description) values(@texto)

    -- Lendo a próxima linha
    FETCH NEXT FROM cursor_tabela INTO @texto
    END

-- Fechando Cursor para leitura
CLOSE cursor_tabela

-- Desalocando o cursor
DEALLOCATE cursor_tabela

select count(description)total,  description from @tabelaAux
group by description

Exit;

total   description
2   
2   carlos
7   cristo
2   de
1   do
1   e
1   eduardo
11  joão
1   juazeiro
1   mô
1   nica
3   santo
1   tadeu
  • I tested with this mass and did not count right man: insert into @tabela values
( 1,'joão de santo cristo joão carlos joão') 
,(2,'eduardo e mô nica') 
,(3,'joão santo joão cristo joão') 
,(4,'cristo tadeu joão') 
,(5,'juazeiro do joão')

  • 2

    @andrepaulo, truth, corrected.

  • I am using mysql, and I will try to take a look to convert it to there. The solution of our friend below in the other answer seems to work for few records, now tested for 78688 lines and took so long that the time expired... I will continue testing.

  • @Acklay, this is an issue that requires a lot of processing, if you’re going to do the query constantly, it would be ideal to have a table that stores each word separately with a field that adds up each time it’s included + 1, then it’s just to do the select. to always have it updated after a Trigger would be interesting.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.