Count elements of a column

How can I count the number of occurrences of each value in the columns of a file?

File:

luz NC  luz
mas ADV más
blanquita   ADJ blanco
que CQUE    que
las ART el
que CQUE    que
traía   VLfin   traer
de  PREP    de
serie   NC  serie
mi  PPO mi|mío
coche   NC  coche

Script:

from collections import Counter

with open ("corpus_TreeTagger.txt", "r") as f:
    texte = f.read()
    colunas = texte.split("\n")

    def frequencia(colunas):
        for linhas in colunas:
            lexema = linhas.split('\t')[0]
            pos = linhas.split('\t')[1]
            lema = linhas.split('\t')[2]

        return Counter(lexema)
        return Counter(pos)
        return Counter(lema)

print(frequencia(colunas))

Error:

Traceback (most recent call last):
  File "FINALV2.py", line 72, in <module>
    print(frequencia(colunas))
  File "FINALV2.py", line 23, in frequencia
    pos = linhas.split('\t')[1]
IndexError: list index out of range

Could someone help me?

  • What kind of file is this? What separates the columns? Isn't there a character to separate them? Do you create the file yourself or receive it from another source? Is the original extension really .txt?

  • It's morphosyntactic tagging software. We give it a text and it does the analysis, writing an output file with three columns: the word, the morphological tag and its lemma.

  • Okay! Did you develop it? If not, is there no way to configure it to use a character to separate the columns? As it stands, at least visually, it is impossible to identify the columns; fixed-width columns would already help. See this text (which, although written for a different context, covers the subject) to understand what I'm talking about.

  • It's actually a column, a tab, another column, a tab and a column. I can print the entire second column, for example, like this: linhas.split('\t')[2]

  • See if my answer meets the goal.

  • Thank you @Sidon! It's called TreeTagger; it is a language tagger developed in Germany and widely used in linguistics. My goal is to compute statistics for a text, counting the lexemes, the tags and the lemmas. I'm also trying to write a simple parser. I imagine there are other ways to do it, but I'm a beginner. Thank you very much, I'll take a look at pandas :)

1 answer

[TL;DR]

Pandas

Now that I understand the file format (though I am not sure I fully understood the goal), I made a version based on pandas that counts the occurrences of each value in each column.

First, let's simulate the file. To make things easier, I added a header row that names the columns; this is easy to do in a production system.

import io
import pandas as pd

# Simulating a tab-separated txt file
s = '''
Palavra\tEtiqueta\tLema
luz\tNC\tluz
mas\tADV\tmás
blanquita\tADJ\tblanco
que\tCQUE\tque
las\tART\tel
que\tCQUE\tque
traía\tVLfin\ttraer
de\tPREP\tde
serie\tNC\tserie
mi\tPPO\tmi|mío
coche\tNC\tcoche
'''

Now let's read the file into a pandas DataFrame:

# reading the file into a DataFrame
df = pd.read_csv(io.StringIO(s), sep='\t')

Displaying the DataFrame:

df
Out[15]: 
      Palavra Etiqueta    Lema
0         luz       NC     luz
1         mas      ADV     más
2   blanquita      ADJ  blanco
3         que     CQUE     que
4         las      ART      el
5         que     CQUE     que
6       traía    VLfin   traer
7          de     PREP      de
8       serie       NC   serie
9          mi      PPO  mi|mío
10      coche       NC   coche
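
The header row above was added by hand. Real TreeTagger output has no header, so the column names can instead be supplied when reading. The following is only a sketch under assumptions: the short `raw` string stands in for the real file, the column names are the ones chosen above, and turning off quote handling and NA detection is my own precaution (corpus tokens can be quote characters or strings such as "NA"), not something the question asked for.

```python
import csv
import io

import pandas as pd

# Header-less sample standing in for real TreeTagger output
# (assumed layout: token, tag and lemma separated by tabs).
raw = "luz\tNC\tluz\nmas\tADV\tmás\nque\tCQUE\tque\nque\tCQUE\tque\n"

df_raw = pd.read_csv(
    io.StringIO(raw),        # a path such as "corpus_TreeTagger.txt" can be passed instead
    sep="\t",
    header=None,             # the data has no header row
    names=["Palavra", "Etiqueta", "Lema"],
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    na_filter=False,         # keep tokens such as "NA" as strings
)
print(df_raw)
```

With the names passed this way, the rest of the steps work on `df_raw` exactly as they do on `df`.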

Now let's group by the Palavra column and display how many times each word occurs in the table:

df.groupby('Palavra').count()

           Etiqueta  Lema
Palavra                  
blanquita         1     1
coche             1     1
de                1     1
las               1     1
luz               1     1
mas               1     1
mi                1     1
que               2     2
serie             1     1
traía             1     1
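
Note that `groupby(...).count()` reports, for each group, the number of non-null values in each of the other columns, which is why the same number shows up twice per row. If only the frequency of each value is wanted, `value_counts` on a single column gives it directly. A minimal sketch, rebuilding the same sample so that it runs on its own:

```python
import io

import pandas as pd

# Same sample as above, rebuilt here so the snippet is self-contained.
rows = [
    "Palavra\tEtiqueta\tLema",
    "luz\tNC\tluz",
    "mas\tADV\tmás",
    "blanquita\tADJ\tblanco",
    "que\tCQUE\tque",
    "las\tART\tel",
    "que\tCQUE\tque",
    "traía\tVLfin\ttraer",
    "de\tPREP\tde",
    "serie\tNC\tserie",
    "mi\tPPO\tmi|mío",
    "coche\tNC\tcoche",
]
df = pd.read_csv(io.StringIO("\n".join(rows)), sep="\t")

# One frequency table per column, most frequent values first.
for coluna in df.columns:
    print(df[coluna].value_counts())
    print()
```

Each result is a Series indexed by the distinct values of the column, so a single count can be looked up with, for example, `df["Etiqueta"].value_counts()["NC"]`.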

Grouping by the Etiqueta column and showing how many times each tag occurs in the table:

df.groupby('Etiqueta').count()

          Palavra  Lema
Etiqueta               
ADJ             1     1
ADV             1     1
ART             1     1
CQUE            2     2
NC              3     3
PPO             1     1
PREP            1     1
VLfin           1     1

Finally, grouping by the Lema column and showing how many times each lemma occurs in the table:

df.groupby('Lema').count()

        Palavra  Etiqueta
Lema                     
blanco        1         1
coche         1         1
de            1         1
el            1         1
luz           1         1
mi|mío        1         1
más           1         1
que           2         2
serie         1         1
traer         1         1
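
For comparison, the Counter-based script from the question can also be made to work without pandas. The IndexError most likely comes from a blank or malformed line (splitting the text on "\n" leaves an empty string at the end when the file ends with a newline), and the original function would in any case return only `Counter(lexema)`, which counts the characters of the last word. The sketch below keeps one Counter per column and skips lines that do not have exactly three fields. The sample string stands in for the contents of the real file, and the names `lexemas`, `etiquetas`, `lemas` and `campos` are my own.

```python
from collections import Counter


def frequencia(linhas):
    """Count tokens, tags and lemmas in tab-separated lines."""
    lexemas, etiquetas, lemas = Counter(), Counter(), Counter()
    for linha in linhas:
        campos = linha.split("\t")
        if len(campos) != 3:
            # Blank or malformed line: skip it instead of raising IndexError.
            continue
        lexema, pos, lema = campos
        lexemas[lexema] += 1
        etiquetas[pos] += 1
        lemas[lema] += 1
    return lexemas, etiquetas, lemas


# Sample text standing in for the contents of "corpus_TreeTagger.txt".
texto = "luz\tNC\tluz\nque\tCQUE\tque\nque\tCQUE\tque\ncoche\tNC\tcoche\n"

lexemas, etiquetas, lemas = frequencia(texto.split("\n"))
print(lexemas.most_common())
print(etiquetas.most_common())
print(lemas.most_common())
```

With the real file, `texto` would come from `open("corpus_TreeTagger.txt", encoding="utf-8").read()`; the encoding is an assumption about how the corpus was saved.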

Download or view rendered in jupyter notebook.
