Collect fraction of a text in a Pandas column [Python]

Question

Asked 4 years, 9 months ago

Good afternoon!

I’m having a hard time extracting the country names from a string (text) column. There is no common separator, and I’m not even sure where to start.

Basically, my idea was to start at http, walk backwards (to the left), and stop at the first comma, but I couldn’t get it to work. (Maybe there’s a smarter way to approach this.)

text
    7.5 magnitude #earthquake. 92 km from Sand Point, #AK, United States https://t.co/XjKbyhjl7v
    7.0 magnitude #earthquake. 14 km from Néon Karlovásion, North Aegean, #Greece https://t.co/Mam1KkK2z7
    7.4 magnitude #earthquake. 94 km from #SandPoint, AK, #UnitedStates https://t.co/gqdJzjfyVU
    7.4 magnitude #earthquake. 94 km from #SandPoint, AK, United States https://t.co/gqdJzjfyVU
    5.7 magnitude #earthquake. 295 km from Lospalos, #LA, East Timor https://t.co/rGvz9nC2Iv
    1.7 magnitude #earthquake. 4 km from Redwood Valley, CA, #UnitedStates https://t.co/lEnnEDqrLO
    4.2 magnitude #earthquake. 92 km from La Esperanza (El Zapotal), #Chiapas, Mexico https://t.co/6SUWsNjDd1
    5.5 magnitude #earthquake. 50 km from #Oxapampa, Pasco, Peru https://t.co/Z95OMBWLsw

The complete DataFrame is in the file below; the column in question is the last one: https://drive.google.com/file/d/1_Iz-c-iKuC2HnsMOlcwfugzCZ9r0-Wug/view?usp=sharing

Note: I had already done something simpler, extracting the leading number that gives the magnitude of each earthquake, which was easy using:

df['Magnitude'] = df['text'].str[:3].astype(float)
print(df.Magnitude)

The link to the file is not publicly accessible.

– Lucas

2020/11/01 at 20:42

@Lucas I shared it again; this time I tested it in an incognito tab and it opens, so I believe it will work.

– fellipeao

2020/11/01 at 20:46

1 answer

by Lucas • **3,858** points · Answer 1 · 2020-11-01T21:51:08+00:00

What you need is a regex of the kind:

r'(#*[A-Z][a-z]*[\s\-]*[A-Z]*[a-z]*)+\shttps'
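To see what this pattern captures, here is a small check against three of the sample lines from the question. It is only a sketch: the names `PATTERN`, `samples`, and `captured` are illustrative, and the last string is a made-up line (not from the dataset) added to show one quirk. Because the group is repeated with `+`, `re.findall` returns only its last repetition, so a hyphenated name can come back truncated.

```python
import re

PATTERN = r'(#*[A-Z][a-z]*[\s\-]*[A-Z]*[a-z]*)+\shttps'

# Three sample lines copied from the question.
samples = [
    "7.5 magnitude #earthquake. 92 km from Sand Point, #AK, United States https://t.co/XjKbyhjl7v",
    "7.0 magnitude #earthquake. 14 km from Néon Karlovásion, North Aegean, #Greece https://t.co/Mam1KkK2z7",
    "7.4 magnitude #earthquake. 94 km from #SandPoint, AK, #UnitedStates https://t.co/gqdJzjfyVU",
]

captured = [re.findall(PATTERN, s) for s in samples]
print(captured)  # [['United States'], ['#Greece'], ['#UnitedStates']]

# Hypothetical line (invented for illustration) with a hyphenated name:
# only the last repetition of the group is kept.
hyphenated = "6.0 magnitude #earthquake. Pacific-Antarctic Ridge https://example.invalid"
print(re.findall(PATTERN, hyphenated))  # ['Antarctic Ridge']
```

The leading `#` is still present at this stage, which is why the function removes it afterwards.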

Using it within a function:

import re
import numpy as np

def get_country(k):
    try:
        # findall returns one string per match (the last repetition of the group)
        result = re.findall(r'(#*[A-Z][a-z]*[\s\-]*[A-Z]*[a-z]*)+\shttps', k)[0]
        result = result.replace('#', '')  # drop the hashtag marker
    except (IndexError, TypeError):
        result = np.nan  # no match, or k is not a string
    return result

df['country'] = [get_country(k) for k in df.text]

print(df.country.unique())

Output:

['United States' 'Greece' 'UnitedStates' 'East Timor' 'Antarctic Ridge'
 'Mexico' 'Peru' 'Island region' 'Switzerland' 'Islands region'
 'Shetland Islands' 'Japan' 'Indonesia' 'Philippines' 'Atlantic Ridge'
 'Alaska' 'Fiji Islands' 'New Guinea' 'Chile Rise' 'India' 'Chile'
 'Oregon' 'DominicanRepublic' 'China' 'Jan Mayen' 'Iceland'
 'Mariana Islands' 'NewZealand' 'Futuna' 'Costa Rica' 'Canada'
 'Reykjanes Ridge' 'Vanuatu' 'Tonga' 'Venezuela' 'Argentina' 'Russia'
 'Greenland Sea' 'Africa' 'Taiwan' 'Guatemala' 'Panama' 'Kuril Islands'
 'Indian Ridge' 'Puerto Rico' 'Fiji region' 'Japan region'
 'Solomon Islands' 'Bolivia' 'ElSalvador' 'Timor Leste' 'Nicaragua'
 'New Zealand' 'Bangladesh' 'Fiji' 'Virgin Islands' 'El Salvador'
 'Afghanistan' 'PuertoRico' 'Kyrgyzstan' 'Iran' 'Turkey' 'Spain'
 'Tajikistan' 'Romania' 'Islands' 'Guinea' 'Honduras' 'Banda Sea' 'Guam'
 'Easter Island' 'Turkmenistan' 'Socotra region' 'Pakistan'
 'New Caledonia' 'Ecuador' 'Colombia' 'Kermadec Islands' 'Italy'
 'SolomonIslands' 'Carlsberg Ridge' 'Dominican Republic' 'Croatia' 'CA'
 'France' 'Northern Alaska' 'Loyalty Islands' 'California' 'VirginIslands'
 'Central Alaska' 'Southeastern Alaska' 'Southern Alaska'
 'Alaska Peninsula' 'Nevada']
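The list above contains the same place under two spellings, for example 'UnitedStates' and 'United States', because some hashtags are written without spaces. One possible way to merge them, sketched here under the assumption that a lowercase letter directly followed by an uppercase letter always marks a missing space, is a small helper (the name `normalize_name` is hypothetical):

```python
import re

def normalize_name(name):
    """Insert a space at each lowercase-to-uppercase boundary."""
    if not isinstance(name, str):
        return name  # leave NaN / None untouched
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', name)

print(normalize_name('UnitedStates'))  # United States
print(normalize_name('El Salvador'))   # El Salvador (unchanged)
```

It could then be applied with `df['country'] = df['country'].map(normalize_name)`. The assumption does not hold for every name (something like 'McDonald' would be split), so it is worth checking `df.country.unique()` again afterwards.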

That being said, there are a number of things you need to pay attention to. In some cases, it seems the text only contains the state, for example. Note also that I return NaN when the regex fails. To check how many rows were not captured, just run df.country.isna().sum()
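For comparison, the idea described in the question (start at http, go back to the nearest comma) can be sketched without a regex. This is an alternative to the approach above, not a replacement for it. The function name `country_after_last_comma` is hypothetical, and it assumes the place name is everything between the last comma and the URL.

```python
def country_after_last_comma(text):
    """Return the text between the last comma and the URL, or None if the shape is unexpected."""
    if not isinstance(text, str):
        return None
    head, sep, _ = text.partition(' http')  # keep what precedes the URL
    if not sep or ',' not in head:
        return None  # no URL, or no comma to anchor on
    return head.rsplit(',', 1)[1].replace('#', '').strip()

line = "7.5 magnitude #earthquake. 92 km from Sand Point, #AK, United States https://t.co/XjKbyhjl7v"
print(country_after_last_comma(line))  # United States
```

With the DataFrame from the question this could be applied as `df['country'] = df['text'].map(country_after_last_comma)`, and `df['country'].isna().sum()` would again count the rows that were not captured. Lines with no comma before the URL return None here, so they would need separate handling.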