Python NLTK method that returns a syntactic tree

4

I’m using the Floresta corpus from NLTK and I saw that it includes sentences whose parse (syntactic tree) has already been built. However, I would like a method that builds the parse for a new sentence in Portuguese.

For example, today I use:

floresta.parsed_sents()

and it returns a tree already built for every sentence in the existing corpus. I would like to pass in new sentences in Portuguese and have one or more Python functions return each sentence with a parse in the same form that the call above returns.

  • I recently wrote a post with an example of how to use SyntaxNet (from Google), trained on Portuguese, to extract a syntactic tree from a sentence and use that information with NLTK's structures: http://davidsbatista.net/blog/2017/03/25/syntaxnet/

3 answers

2

I don’t know about "in Portuguese" - or in any other natural language, such as English - but as I understand it, parsed_sents returns a list of sentences that have already been parsed, without specifying how that analysis was performed (automatically, or manually to serve as examples). To parse a new sentence, you need a grammar; you then build a parser from that grammar and call the parser's parse method. Example:

import nltk

grammar1 = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)

This is a simple grammar with few rules and a restricted vocabulary. It can be used like this:

>>> sent = "Mary saw Bob".split()
>>> rd_parser = nltk.RecursiveDescentParser(grammar1)
>>> for tree in rd_parser.parse(sent):
...      print(tree)
(S (NP Mary) (VP (V saw) (NP Bob)))

Source

The for loop is there because an ambiguous sentence can have two or more interpretations, and the parser returns one tree per interpretation. Another example (usage only; for the corresponding grammar, see the link above):

>>> pdp = nltk.ProjectiveDependencyParser(groucho_dep_grammar)
>>> sent = 'I shot an elephant in my pajamas'.split()
>>> trees = pdp.parse(sent)
>>> for tree in trees:
...     print(tree)
(shot I (elephant an (in (pajamas my))))
(shot I (elephant an) (in (pajamas my)))
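
The grammar used in that call is not shown here. As a sketch only, assuming groucho_dep_grammar is the small dependency grammar that the NLTK book uses for this sentence (the rules below are reproduced from memory, so treat them as an assumption), it could be defined like this:

```python
import nltk

# Assumed rules: each line maps a head word to the words that may depend on it.
groucho_dep_grammar = nltk.DependencyGrammar.fromstring("""
'shot' -> 'I' | 'elephant' | 'in'
'elephant' -> 'an' | 'in'
'in' -> 'pajamas'
'pajamas' -> 'my'
""")

pdp = nltk.ProjectiveDependencyParser(groucho_dep_grammar)
sent = 'I shot an elephant in my pajamas'.split()

# Collect every projective analysis the grammar allows for the sentence.
trees = list(pdp.parse(sent))
for tree in trees:
    print(tree)
```

In a dependency tree the node labels are the head words themselves, so the root of each tree here is the verb.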

That, then, is how the code is used. Whether there are good grammars for Portuguese that can be used with this code (i.e. in a format this library accepts), I can’t say - not least because building a broad-coverage grammar is a very hard problem.
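
To make the ambiguity point concrete with the first grammar, here is a minimal sketch that parses a sentence whose prepositional phrase can attach either to the verb or to the noun. It uses nltk.ChartParser rather than the recursive descent parser; that is a choice made for this illustration, not something the answer prescribes.

```python
import nltk

# Same toy grammar as grammar1 above, repeated so the snippet runs on its own.
grammar1 = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)

parser = nltk.ChartParser(grammar1)
sent = "the dog saw a man in the park".split()

# "in the park" can modify "saw" (VP -> V NP PP) or "a man" (NP -> Det N PP),
# so the parser yields one tree per reading.
trees = list(parser.parse(sent))
for tree in trees:
    print(tree)
```

Every word in the sentence has to appear in the grammar's vocabulary; otherwise the parser raises an error instead of returning trees.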

  • 1

    Thanks man!! But the Floresta is already a Portuguese corpus. The catch is that building the parse can't rely on a model trained for English. I thought there would be a ready-made model for Portuguese that I could use. I will try exploring the library you pointed to with the English corpus.

  • Maybe there is one, I don’t know... What I mean is that putting together a general grammar is very difficult, and I doubt that something like that exists even for English. Natural language processing, as far as I know, is usually done with statistical rather than formal methods. Either way, check whether the Floresta library ships grammars that are ready to use with the parse method; otherwise I’m afraid I have nothing better to suggest.

  • P.S. Please confirm whether the library you refer to is this one. From the English examples I’ve looked at, it does not seem to have anything that makes it easy to parse an arbitrary sentence, as I suspected. I may be mistaken, I hope...

  • Yes. They built the Floresta corpus and made the model with NLTK. I read their article, but it gave no examples and did not show how they did it. I’m still exploring the links on their site. I will email them with these questions and report back soon with what I find out. Thanks!

  • And it’s not a matter of taking a grammar into account; the point is that the way a syntactic tree is built differs between languages. Portuguese is similar to Spanish, and both differ from English. Therefore, a model that produces parses for English cannot be used for Portuguese. If there is any method in NLTK that trains a new parser from training data, that would be very useful to me.
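
One way to get a parser from training data, sketched here under stated assumptions: NLTK can induce a probabilistic grammar from the productions of already-parsed trees (nltk.induce_pcfg) and parse new sentences with it (ViterbiParser). The two hand-written trees below are a hypothetical stand-in for a real treebank; in practice the trees could come from floresta.parsed_sents(), but the Floresta labels and vocabulary would likely need cleaning first, and this thread does not confirm how well that works.

```python
from nltk import Nonterminal, Tree, induce_pcfg
from nltk.parse import ViterbiParser

# Hypothetical toy treebank (two Portuguese sentences), standing in for a corpus.
toy_treebank = [
    Tree.fromstring("(S (NP (Det o) (N gato)) (VP (V viu) (NP (Det a) (N menina))))"),
    Tree.fromstring("(S (NP (Det a) (N menina)) (VP (V viu) (NP (Det o) (N gato))))"),
]

# Collect every rewrite rule used in the trees; repeated rules count as evidence.
productions = []
for tree in toy_treebank:
    productions.extend(tree.productions())

# Estimate rule probabilities from the counts, with S as the start symbol.
grammar = induce_pcfg(Nonterminal("S"), productions)

# The Viterbi parser returns the single most probable tree for the sentence.
parser = ViterbiParser(grammar)
tokens = "o gato viu o gato".split()
best = next(parser.parse(tokens))
print(best)
```

The sentence parsed at the end does not occur in the toy treebank, but every word in it does. A word that never appeared in the training trees would still be rejected, which is the coverage limitation raised elsewhere in this thread.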

1

The problem with trees for Portuguese is that there is no tagger for it.

You can try matching your text against the Floresta corpus, but there is still no guarantee that it will cover all of your words.

You can also use nltk.CFG.fromstring and build your tree by hand, but if the grammar gets too complex you end up back at the tagger problem.
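
As a minimal sketch of that hand-written approach and of its coverage limit (the grammar and the names grammar_pt, covered and uncovered are made up for illustration):

```python
import nltk

# A tiny hand-written Portuguese grammar; every word must be listed explicitly.
grammar_pt = nltk.CFG.fromstring("""
  S -> NP VP
  NP -> Det N
  VP -> V NP
  Det -> "o" | "a"
  N -> "gato" | "menina"
  V -> "viu"
  """)

parser = nltk.ChartParser(grammar_pt)

# A sentence whose words are all in the grammar parses normally.
covered = "a menina viu o gato".split()
trees = list(parser.parse(covered))
for tree in trees:
    print(tree)

# "cachorro" is not in the grammar, so NLTK refuses to parse the sentence.
uncovered = "a menina viu o cachorro".split()
try:
    list(parser.parse(uncovered))
    coverage_ok = True
except ValueError as err:
    coverage_ok = False
    print("Not covered:", err)
```

Without a tagger to map unknown words to categories such as N or V, every new word means another hand-written rule.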

I don’t know how much you need this, but you may want to contribute to the development of a tagger for Portuguese.
