Algorithm to find common word sequence

Question

Algorithm to find common word sequence

Asked 9 years, 2 months ago

Viewed 563 times

4

I’m looking for a logic where I can find in an exorbitant amount of data, phrases that are common. Phrase in this context is word sequence.

For example in these prayers:

I’m traveling to South Orlando
I really wanted to go to South Orlando
It’s a beautiful day here in South Orlando

The phrase, or sequence of words in common would be Orlando do Sul

researching I found this question in Stackoverflow EN I couldn’t understand the answer.

Is it an array of words, or are they all in the same variable? The words are scrambled or are groups of phrases?

– Adriano Luz

2016/05/07 at 15:15
@Adrianoluz at first is a string of words, or as I said they are all in the same variable in text form.

– Elaine

2016/05/07 at 15:17
There is a delimiter between sentences, such as an endpoint or a dash?

– Adriano Luz

2016/05/07 at 15:27
@Adrianoluz theoretically should be considered but because it is random content users can not be based on this as the main pillar.

– Elaine

2016/05/07 at 15:30
The answer talks about the use of dictionary, well it was what I understood, a way to assign an integer to the chosen word so it could make comparisons and also sort the array of words and store them to be used with other lists, however, I don’t know if that answer gives you the solution.

– gato

2016/05/07 at 23:29
In the answer, it also provides links of algorithms in C that implements some techniques cited by the AR. Do you have any language in mind to implement this algorithm?

– gato

2016/05/07 at 23:31
Hello! I really didn’t understand exactly what he explained, also I don’t know if it would solve, I think in PHP..

– Elaine

2016/05/07 at 23:40
This is an interesting question. The answer you found in Stackoverflow EN has all the information you need to solve the problem. If you want a PHP solution it will require some work on your part. To help, here is a Java implementation. It is simple to convert this to PHP. http://algs4.cs.princeton.edu/63suffix/ http://algs4.cs.princeton.edu/63suffix/SuffixArray.java.html http://algs4.cs.princeton.edu/63suffix/LongestRepeatedSubstring.java.html

– VRPF

2016/05/08 at 00:17
@Elaine, this article addresses something related to what you want, and is well didactic and shows how to analyze words and texts or phrases in Python.

– gato

2016/05/08 at 00:17

Show 4 more comments

1 answer

Browser other questions tagged algorithm logic

You are not signed in. Login or sign up in order to post.

by Luiz Vieira • **34,160** points · Answer 1 · 2016-05-08T02:46:00+00:00

You do not mention exactly what you did not understand in the SOEN response. But one thing the author there repeats quite a lot is "linear time". The idea of an algorithm capable of processing data in linear time is that the time it takes to process grows linearly with the amount of input data. Think ideally in a direct ratio: the more data, the longer it takes. It’s what’s expected for a sufficiently good algorithm. On the other hand, a bad algorithm would take exponential time to process the data. Thus, the more data, the much longer it would take to process - to the point of becoming unviable for an amount of data that doesn’t even need to be "exorbitant".

To better understand this concept, I suggest that other issue and also that other issue.

Well, you have a lot of data (the words or the characters to process), so you need to worry about the performance of your algorithm. The SOEN response suggests an approach based on some algorithms that I do not know, but from what I could understand the essential idea is to scan the data (in linear time) and assemble a suffix tree (stored in an array): this tree is a supposedly important (and famous) data structure in text analysis that contains all possible suffixes (endings) of a string. The example of the Wikipedia link is the word "BANANA", which can have the suffix "ANANA", or the suffix "NANA" and so on (the $ means the end of the word):

Apparently, using structures like this you can also build the prefix tree (all possible prefixes for your text, considered as a single string) and then find in it the longest and most common prefixes.

The concepts involved do not seem to be so complex, but will require you to read and study them probably in English. The Wikipedia article may be insufficient, but perhaps you find more material looking for "suffix tree".