Algorithm to find common word sequence

Asked

Viewed 563 times

4

I’m looking for a logic where I can find in an exorbitant amount of data, phrases that are common. Phrase in this context is word sequence.

For example in these prayers:

  • I’m traveling to South Orlando
  • I really wanted to go to South Orlando
  • It’s a beautiful day here in South Orlando

The phrase, or sequence of words in common would be Orlando do Sul

researching I found this question in Stackoverflow EN I couldn’t understand the answer.

  • Is it an array of words, or are they all in the same variable? The words are scrambled or are groups of phrases?

  • @Adrianoluz at first is a string of words, or as I said they are all in the same variable in text form.

  • There is a delimiter between sentences, such as an endpoint or a dash?

  • @Adrianoluz theoretically should be considered but because it is random content users can not be based on this as the main pillar.

  • The answer talks about the use of dictionary, well it was what I understood, a way to assign an integer to the chosen word so it could make comparisons and also sort the array of words and store them to be used with other lists, however, I don’t know if that answer gives you the solution.

  • In the answer, it also provides links of algorithms in C that implements some techniques cited by the AR. Do you have any language in mind to implement this algorithm?

  • Hello! I really didn’t understand exactly what he explained, also I don’t know if it would solve, I think in PHP..

  • This is an interesting question. The answer you found in Stackoverflow EN has all the information you need to solve the problem. If you want a PHP solution it will require some work on your part. To help, here is a Java implementation. It is simple to convert this to PHP. http://algs4.cs.princeton.edu/63suffix/ http://algs4.cs.princeton.edu/63suffix/SuffixArray.java.html http://algs4.cs.princeton.edu/63suffix/LongestRepeatedSubstring.java.html

  • @Elaine, this article addresses something related to what you want, and is well didactic and shows how to analyze words and texts or phrases in Python.

Show 4 more comments

1 answer

6

You do not mention exactly what you did not understand in the SOEN response. But one thing the author there repeats quite a lot is "linear time". The idea of an algorithm capable of processing data in linear time is that the time it takes to process grows linearly with the amount of input data. Think ideally in a direct ratio: the more data, the longer it takes. It’s what’s expected for a sufficiently good algorithm. On the other hand, a bad algorithm would take exponential time to process the data. Thus, the more data, the much longer it would take to process - to the point of becoming unviable for an amount of data that doesn’t even need to be "exorbitant".

To better understand this concept, I suggest that other issue and also that other issue.

Well, you have a lot of data (the words or the characters to process), so you need to worry about the performance of your algorithm. The SOEN response suggests an approach based on some algorithms that I do not know, but from what I could understand the essential idea is to scan the data (in linear time) and assemble a suffix tree (stored in an array): this tree is a supposedly important (and famous) data structure in text analysis that contains all possible suffixes (endings) of a string. The example of the Wikipedia link is the word "BANANA", which can have the suffix "ANANA", or the suffix "NANA" and so on (the $ means the end of the word):

inserir a descrição da imagem aqui

Apparently, using structures like this you can also build the prefix tree (all possible prefixes for your text, considered as a single string) and then find in it the longest and most common prefixes.

The concepts involved do not seem to be so complex, but will require you to read and study them probably in English. The Wikipedia article may be insufficient, but perhaps you find more material looking for "suffix tree".

  • Thank you, you explained a lot, I just don’t understand how this form leads me to what I wish...

  • My intuition and experience tell me that this structure will really be very useful for you because it is a good way to organize the content of the text, maybe to allow to accumulate some count regarding repetitions. But, like I said, I don’t really know a lot about these algorithms, so I don’t have a way to prepare an answer that’s really very didactic. Unfortunately you will have to study on your own or expect someone who really knows the subject post a more palatable response. The tips I gave you are a good starting point for your studies. Good luck. :)

Browser other questions tagged

You are not signed in. Login or sign up in order to post.