How to modify a token (spaCy - Python)?

The imported libraries:

    import spacy
    from spacy.matcher import Matcher

The following code is adapted from the accepted answer to https://stackoverflow.com/questions/62785916/spacy-replace-token:

    nlp=spacy.load("pt_core_news_md")
    doc=nlp("O João gosta da Maria.")

    matcher = Matcher(nlp.vocab)
    matcher.add("Maria", None, [{"LOWER": "Maria"}])

    def replace_word(orig_text, replacement):
      tok = nlp(orig_text)
      text = ''
      buffer_start = 0
      for _, match_start, _ in matcher(tok):
         if match_start > buffer_start:  
             text += tok[buffer_start: match_start].text + tok[match_start - 1].whitespace_
         text += replacement + tok[match_start].whitespace_  
         buffer_start = match_start + 1
      text += tok[buffer_start:].text
      return text

    print(replace_word("O João gosta da Maria.", "Ana"))

When I run this last line, the printed text is unchanged (it should print "O João gosta da Ana.", i.e. "João likes Ana"). Is this because the Matcher only works for English and not for "pt_core_news_md"?

P.S.: What I actually want is to modify a token according to its index in the text, rather than by a condition (being equal to a certain string).

1 answer

The code below works as expected:

    import spacy
    from spacy.matcher import Matcher


    nlp = spacy.load("pt_core_news_md")
    doc = nlp("O João gosta da Maria.")

    matcher = Matcher(nlp.vocab)
    # spaCy v2 signature; in spaCy v3 this is matcher.add("Maria", [[{"LOWER": "maria"}]])
    matcher.add("Maria", None, [{"LOWER": "maria"}])


    def replace_word(orig_text, replacement):
        tok = nlp(orig_text)
        text = ''
        buffer_start = 0
        for _, match_start, _ in matcher(tok):
            # Copy the untouched tokens that precede this match
            if match_start > buffer_start:
                text += tok[buffer_start: match_start].text + tok[match_start - 1].whitespace_
            # Emit the replacement, keeping the matched token's trailing whitespace
            text += replacement + tok[match_start].whitespace_
            buffer_start = match_start + 1
        # Copy whatever follows the last match
        text += tok[buffer_start:].text
        return text


    print(replace_word("O João gosta da Maria.", "Ana"))

The only change relative to your code is in the line below, from

    matcher.add("Maria", None, [{"LOWER": "Maria"}])

to

    matcher.add("Maria", None, [{"LOWER": "maria"}])

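Note: LOWER is compared against the lowercased form of each token, so the value in the pattern must itself be written entirely in lowercase letters. A value such as "Maria" can never match.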
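
To see this behaviour in isolation, here is a minimal sketch. It makes two assumptions that are not in the code above: it uses `spacy.blank("pt")` (tokenizer only, no model download) instead of `pt_core_news_md`, and it uses the list-of-patterns form of `Matcher.add`, which is the spaCy v3 signature. The helper `count_matches` is a hypothetical name used only for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes; that is enough for LOWER patterns.
nlp = spacy.blank("pt")
doc = nlp("O João gosta da Maria.")


def count_matches(value):
    """Return how many tokens match a single-token LOWER pattern."""
    matcher = Matcher(nlp.vocab)
    # List-of-patterns form of Matcher.add (the spaCy v3 signature).
    matcher.add("NAME", [[{"LOWER": value}]])
    return len(matcher(doc))


print([t.lower_ for t in doc])  # the strings LOWER is compared against
print(count_matches("Maria"))   # 0
print(count_matches("maria"))   # 1
```

The tokenizer is not the problem here: the Portuguese text is split into tokens either way, and only the casing of the pattern value decides whether there is a match.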

Note 2: The code also works with pt_core_news_sm and pt_core_news_lg.
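
Regarding the P.S. in the question (replacing a token by its index rather than by a string condition), the Matcher is not needed at all. The following is only a sketch of one possible approach: `replace_token_at` is a hypothetical helper name, and `spacy.blank("pt")` is assumed here so the example runs without a downloaded model; `spacy.load("pt_core_news_md")` can be swapped in.

```python
import spacy

nlp = spacy.blank("pt")


def replace_token_at(doc, index, replacement):
    """Return the text of doc with the token at `index` replaced."""
    if not 0 <= index < len(doc):
        raise IndexError(f"token index {index} out of range")
    pieces = []
    for token in doc:
        if token.i == index:
            # Keep the original token's trailing whitespace
            pieces.append(replacement + token.whitespace_)
        else:
            pieces.append(token.text_with_ws)
    return "".join(pieces)


doc = nlp("O João gosta da Maria.")
index = len(doc) - 2  # the token just before the final "."
print(replace_token_at(doc, index, "Ana"))  # O João gosta da Ana.
```

Because a `Doc` cannot have its token text edited in place, the function builds a new string; call `nlp` on the result again if a new `Doc` is needed.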

I hope it helps

  • Thank you very much! It really helped! Cheers, Fernando

  • Hey! Glad it helped. If you have time, read this post
