How to count words from a string ignoring prepositions?

Viewed 1,152 times

8

Is there any service that can tell whether a given word is a preposition?

I want to build a word ranking from an RSS feed, but ignoring prepositions.

Ignoring words with fewer than N characters is a good start, but it may not be enough, because there are also long prepositions. Here are two lists:

Essential prepositions: a, before, after, until, with, against, since, in, between, to, per, without, under, over, behind.

Accidental prepositions: as (= as per), during, except, outside.

Do you know of any service that does this identification, or do you have any idea how to implement a reasonable method? It does not need to be 100% comprehensive, but it should cover a significant share of the words.

It can be in any language.

Thank you.

Below is an excerpt of the C# code I'm using in the prototype, which has proved inefficient:

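using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Returns the maxNumWords most frequent words longer than 3 characters that are not prepositions.
private static IEnumerable<IGrouping<string, string>> MostCommonWords(string str, int maxNumWords)
{
    var prepositions = new string[] {/*...*/};
    var mostCommonWords =
        Regex.Split(str.ToLower(), @"\W+")                          // split on runs of non-word characters
            .Where(s => s.Length > 3 && !prepositions.Contains(s))  // linear scan of the array for every word
            .GroupBy(s => s)
            .OrderByDescending(g => g.Count()).Take(maxNumWords);
    return mostCommonWords;
}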
  • 2

    Why so many negative votes??

  • "This question shows no research effort; it is unclear or not useful". He didn't research the subject, he just asked "Can you do it for me? Or is there something ready-made?".

  • 4

    No, I disagree. The question asks for a method to do something; he is not asking anyone to hand over the program. Besides, he did analyze the possibilities. Effort != code.

  • 2

    At no time did he ask "do it for me". He asks whether something ready-made already exists or whether there is a reasonable way to implement it, which is totally valid. The question is well written and objective.

2 answers

6


A slightly Unix-flavored version...

xmllint --xpath '//description'  'http://.../news.rss'  |
grep -Po '(*UTF8)(*UCP)\b[\w\d_][\w\d_\-.*#]*[\w\d_]\b|\w|\.\.\.|[,.:;()[\]?!]|\S' |
grep ... |
sort | uniq -c | sort -nr
  • line 1 - extracts the description elements from news.rss (adapt the XPath to your needs; for namespaces, see setns)
  • line 2 - tokenizer - one token per line (the notion of a word is more complicated than it seems)
  • line 3 - selects words with a minimum of 3 chars (remove this line if you are not interested)
  • line 4 - counts occurrences and sorts them in descending order

If needed, add something like

grep -wvf  stopwords.txt  |

as line 3.5 to remove the words contained in the stopwords.txt file.
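The pipeline with the extra stop-word step can be tried end to end on a local text, without the RSS download. What follows is only a minimal sketch under some assumptions: stopwords.txt holds one lowercase word per line, the function name rank_words is made up for illustration, and the tr-based tokenizer is a simplified stand-in for the grep -Po one above (it may not handle accented letters, since tr typically works byte by byte).

```shell
# rank_words STOPWORDS_FILE
# Reads text on stdin and prints "count word" lines, most frequent first.
# Hypothetical helper, not part of the original script.
rank_words() {
    tr '[:upper:]' '[:lower:]' |   # normalize case
    tr -cs '[:alpha:]' '\n' |      # simplified tokenizer: one word per line
    grep -E '.{3,}' |              # keep words with at least 3 characters
    grep -vxFf "$1" |              # drop lines that exactly match a stop-word
    sort | uniq -c | sort -nr      # count occurrences, highest count first
}
```

It could be called as rank_words stopwords.txt < article.txt, where article.txt is any plain-text file. Here grep -x with -F compares each token against whole stop-words as fixed strings, which is a slightly stricter choice than -w because every line already holds exactly one token.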

Edit 1. Footnotes: stop-words

@Pedreiro commented: "... one of the points I wanted to raise with this question was how to cover a good part of the stopwords, and I will take the opportunity to share this link I found: code.google.com/p/stop-words" (it contains a collection of stop-word lists for several languages)

Usually stop-words end up being

  1. "grammatical" words (e.g., prepositions, articles, pronouns, some adverbs, conjunctions); for these, a list like the one mentioned by @Pedreiro is useful,
  2. to which we add a few words that are too common in the context concerned (e.g., "Folha" and "Paulo" if the RSS feed is news from "Folha de S.Paulo"),
  3. and from which we remove words that are informative in our context (e.g., "visto", which in Portuguese means both "seen" and "visa", is extremely important in an RSS feed about travel bureaucracy: "a visa is required to enter Iran").

That is, stop-words = 1 + 2 - 3

Finally, in many cases (1) it is advisable not to remove stop-words at all, and (2) it makes sense to also consider stop-phrases (multi-word expressions).
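The formula stop-words = 1 + 2 - 3 maps directly onto set operations over word lists. A possible sketch, assuming three files with one lowercase word per line; the function name build_stopwords and the file names in the usage line are hypothetical:

```shell
# build_stopwords GRAMMATICAL CONTEXT INFORMATIVE
# Prints (1 + 2) - 3, one word per line. The three arguments are
# hypothetical input files holding one lowercase word per line.
build_stopwords() {
    sort -u "$1" "$2" |   # 1 + 2: union of grammatical and context-specific words
    grep -vxFf "$3"       # - 3: drop words that are informative in this context
}
```

For example, build_stopwords grammatical.txt context.txt informative.txt > stopwords.txt would produce a file that the grep -wvf step can use.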

  • 1

    This regexp is legendary!

  • @Isvaldofernandes, thank you. To be honest, the right-hand part, which deals with punctuation, can be dropped in this specific case.

  • Thanks for the script, @Jjoao. I accepted the answer because it also succinctly covered reading the RSS feed, which was my context, and I took the opportunity to learn about xmllint, which in this case I had to adjust to also pass the namespace via the setns command. Anyway, one of the points I wanted to raise with this question was how to cover a good part of the stopwords, so I will take the opportunity to share this link I found, https://code.google.com/p/stop-words/ , which has a nice collection of these words.

  • @Pedreiro, thanks for everything; I tried to include some details from your comment in the answer. Feel free to edit it!

1

  • Thanks for the articles. I didn't know some important terms on the subject, and they are now making my search easier. Item "3.2.2 Removal of Stopwords" in the UFPE article covers what I'm after; I'm now looking for more implementation details.

  • 3

    Could you include a summary of what the links say? As it stands, your answer would work better as a comment (see our [tour]). Thank you.
