How do I embed sentences for NLP in TensorFlow?

I need to turn a phrase bank that I created myself into vectors suitable for training a neural network in TensorFlow. I have the following structure:

I managed to separate the sentences into words:

However, I would like to know how to turn these phrases into an integer vector where each word is replaced by an index. Does anyone know how to do this?

1 answer


Use an instance of Keras's Tokenizer class.

# create a Tokenizer object:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()

# build the dictionary the tokenizer uses to convert words to integers:
tokenizer.fit_on_texts( YOUR LIST OF REFERENCE PHRASES GOES HERE )

# create the list of phrases as sequences of integers:
sequences = tokenizer.texts_to_sequences( YOUR LIST OF PHRASES TO CONVERT GOES HERE )

The method fit_on_texts() can be called multiple times with different arguments; each call adds any words not yet in the dictionary. To inspect the dictionary after fitting, check the word_index attribute.
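For instance, with a small illustrative corpus (the phrase list below is made up, not the question's actual data), word_index assigns lower integers to more frequent words:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# illustrative corpus; the real phrase bank comes from your own data
texts = ["the cat sat", "the dog sat on the mat"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# word_index maps each word to an integer; more frequent words get lower indices
print(tokenizer.word_index["the"])  # "the" is the most frequent word, so index 1

# texts_to_sequences replaces each word with its index
sequences = tokenizer.texts_to_sequences(["the cat sat"])
print(sequences)
```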

Depending on how you build your model, the Tokenizer class has other useful methods, such as texts_to_matrix() and sequences_to_texts().
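As a sketch (again with a made-up corpus): texts_to_matrix() produces one row per phrase over the whole vocabulary, for example a binary bag-of-words, while sequences_to_texts() reverses the integer encoding:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["the cat sat", "the dog sat"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# one row per text, one column per dictionary entry (column 0 is reserved)
matrix = tokenizer.texts_to_matrix(texts, mode="binary")
print(matrix.shape)  # (2, len(word_index) + 1)

# sequences_to_texts maps integer sequences back to whitespace-joined words
seqs = tokenizer.texts_to_sequences(texts)
print(tokenizer.sequences_to_texts(seqs))
```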

More details here: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

A small correction: what you want (turning a phrase into an integer vector) is done by a tokenizer, which maps each word to an integer ranging from 1 to the size of the dictionary (the number of words the tokenizer knows). An embedding is something else: it transforms each word into a vector of real numbers of reduced dimension (typically 50-300 dimensions).
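To illustrate the distinction, a minimal sketch (the vocabulary size and dimensions below are made up for the example): the tokenizer's integer output feeds a Keras Embedding layer, which looks up a dense real-valued vector for each index:

```python
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size = 100   # illustrative; in practice len(tokenizer.word_index) + 1
embedding_dim = 8  # illustrative; 50-300 is typical for real models

embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)

# a batch containing one tokenized phrase of three integer word indices
token_ids = np.array([[1, 3, 2]])
vectors = embedding(token_ids)

# each integer index becomes a dense vector of real numbers
print(vectors.shape)  # (1, 3, 8)
```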
