How does Bag of Words work and where is it used?

8

I have recently been researching artificial intelligence and found some articles that mention a "bag of words", but I don’t know what it is, and I couldn’t find anything in Portuguese about it.

What is this "bag of words", and in which cases does it apply? If possible, please include sources.

  • A bag is a mathematical concept. It is something like a set, except that it can contain repeated elements and it ignores ordering

  • Do you have an easy link to any of these articles, to help contextualize the use of this bag?

  • Unfortunately not :/

  • I didn’t quite understand your explanation. Could you give examples of cases where it applies? @Jeffersonquesado

  • 1

    Sets: {0, 1, 2} U {2, 4} = {0, 1, 2, 4}; bags: {0, 1, 2} U {2, 4} = {0, 1, 2, 2, 4}; also bags: {0, 1, 3, 1, 2} - {0, 1} = {3, 1, 2}

  • For real-world applications, see https://en.wikipedia.org/wiki/Bag_(Mathematics)?wprov=sfsi1

  • 1

    @Jeffersonquesado Cool, thanks for pointing me in a direction. It does seem to be related to the term "bag of words", but it’s still not what I was looking for.

  • Does Google Scholar return anything for "bag of words"?

  • I accidentally found this: https://en.m.wikipedia.org/wiki/Bag-of-words_model; seems relevant

  • It seems that the third section of this report describes something about "bag of words": http://conteudo.icmc.usp.br/CMS/Arquivos/arquivos_enviados/BIBLIOTECA_113_RT_209.pdf

  • @Jeffersonquesado I didn’t know Google Scholar existed, lol. That Wikipedia article has some interesting things, but I still haven’t found what I was looking for...

  • Was the technical report I linked after the Wikipedia link more to the point? It discusses document classification and NLP using "bag of words", and it even includes a table of example documents with the word stems "cas", "filh" and others

  • 1

    @Jeffersonquesado I think I got where I wanted to go. I posted an answer that I think illustrates the subject well.

  • good response =)

1 answer

10


Explanation

The bag-of-words model is a simplified representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order, but keeping multiplicity.
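To make "ignores order, keeps multiplicity" concrete, here is a minimal sketch in Python. It assumes that words are simply lowercase tokens separated by whitespace, and the two sentences and the `bag_of_words` helper are made up for illustration only:

```python
from collections import Counter


def bag_of_words(text):
    """Return the bag (multiset) of lowercase, whitespace-separated tokens."""
    # Assumption: splitting on whitespace is a good enough tokenizer here.
    return Counter(text.lower().split())


# Hypothetical sentences: same words, different order.
a = bag_of_words("the cat chased the dog")
b = bag_of_words("the dog chased the cat")

print(a == b)    # True: word order is ignored
print(a["the"])  # 2: multiplicity is kept
```

`collections.Counter` is used because it behaves like a multiset: it maps each word to the number of times it occurs and returns 0 for words that are absent.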

Example of Implementation

The following models a text document using bag-of-words.

Here are two simple text documents:

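(1) John gosta de assistir filmes. Mary também gosta de filmes. ("John likes to watch movies. Mary also likes movies.")

(2) John também gosta de assistir jogos de futebol. ("John also likes to watch football games.")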

Based on these two text documents, a list of the distinct words (the vocabulary) is constructed as follows:

[ 
    "John" , 
    "gosta" , 
    "de" , 
    "assistir" , 
    "filmes" , 
    "Mary" , 
    "também" , 
    "futebol" , 
    "jogos" 
]
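As a rough sketch of how this list could be built in Python, and how each document can then be turned into a vector of word counts: the tokenizer (runs of word characters, case preserved) and the names `tokenize` and `to_vector` are assumptions of this example, not something taken from the source. The vocabulary here is kept in order of first appearance, so "jogos" comes before "futebol", unlike in the list above. The order is arbitrary as long as it is used consistently.

```python
import re
from collections import Counter

documents = [
    "John gosta de assistir filmes. Mary também gosta de filmes.",
    "John também gosta de assistir jogos de futebol.",
]


def tokenize(text):
    # Assumption: a token is a run of word characters; punctuation is
    # dropped and case is kept as written.
    return re.findall(r"\w+", text)


# Build the vocabulary: each distinct word once, in order of first appearance.
vocabulary = []
for doc in documents:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)


def to_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vectors = [to_vector(doc) for doc in documents]

print(vocabulary)
# ['John', 'gosta', 'de', 'assistir', 'filmes', 'Mary', 'também', 'jogos', 'futebol']
print(vectors[0])  # [1, 2, 2, 1, 2, 1, 1, 0, 0]
print(vectors[1])  # [1, 1, 2, 1, 0, 0, 1, 1, 1]
```

Each position in a vector corresponds to one vocabulary word, so document (1) has a 2 for "gosta", "de" and "filmes", and a 0 for the words that only occur in document (2).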

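It is also common to weight each word by how widely it is spread across the documents, for example with a linear factor:

linear(tj) = 1 − d(tj)/N

where tj is the word being weighted, d(tj) is the number of documents (or phrases) in which the word appears, and N is the total number of documents or phrases. A word that appears in every document gets a weight of 0, and the weight approaches 1 as the word becomes rarer.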
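A small sketch of this formula applied to the two example documents. It assumes that d(tj) counts the documents that contain the word (an interpretation, chosen because it keeps the result between 0 and 1), and the names `tokenize` and `linear_weight` are hypothetical:

```python
import re

documents = [
    "John gosta de assistir filmes. Mary também gosta de filmes.",
    "John também gosta de assistir jogos de futebol.",
]


def tokenize(text):
    # Assumption: a token is a run of word characters, case preserved.
    return re.findall(r"\w+", text)


def linear_weight(term, docs):
    """Compute linear(tj) = 1 - d(tj)/N over the given documents."""
    n_docs = len(docs)
    # d(tj): number of documents in which the term occurs at least once.
    d = sum(1 for doc in docs if term in tokenize(doc))
    return 1 - d / n_docs


print(linear_weight("filmes", documents))  # 0.5 (appears in 1 of 2 documents)
print(linear_weight("John", documents))    # 0.0 (appears in both documents)
```

Under this reading, words that show up everywhere (such as "John" or "de" here) are weighted down to 0, while words that are specific to one document get a higher weight.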

Conclusion

Put simply, bag-of-words is a way of representing text. It is commonly used in machine learning, sentiment analysis, chatbots, and topic modeling.

Source: Wikipedia
