Remove words after a specific character in Python

Question

Asked 4 years, 5 months ago

Viewed 64 times

Hello. I need to find a way to standardize the following categories in Python:

db.groupby(["EdLevel"])["EdLevel"].count() 

Master's degree                                                                       11141

Master's degree (M.A., M.S., M.Eng., MBA, etc.)                                       13112

Master's degree (MA, MS, M.Eng., MBA, etc.)                                           19569

Master's degree (MA, MS, M.Eng., MBA, etc.)                                          21396

The code should collapse all of these into "Master’s Degree".

I’m new to programming, and I’m totally lost. I thought of using replace, but there are dozens of different categories. They all follow the same pattern, though: "Educational level" + "(other degrees)". If I could remove everything from the "(" onward, I could shorten my code considerably.

Thanks for your help

If one of the answers below solved your problem and left no doubts, choose the one you liked most and mark it as correct/accepted by clicking the check mark next to it, which also marks your question as solved. If you still have questions or would like further clarification, feel free to comment.

– Lucas

2021/03/22 at 22:02

1 answer

by Lucas • **3,858** points · Answer 1 · 2021-03-20T14:00:07+00:00

One option is to use regex:

import re

degrees="Master's degree (M.A., M.S., M.Eng., MBA, etc.)"

print(re.findall(r'.+(?=\s\()', degrees))

Returns:

["Master's degree"]

The construct (?=...) is a positive lookahead. It checks that a given pattern follows at the current position without consuming it, so the match (in this case .+) captures what comes before and excludes the pattern itself.
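To make the "excluding the pattern" part concrete, here is a minimal sketch that runs the same expression with and without the lookahead. The variable names consumed and lookahead are only illustrative.

```python
import re

degrees = "Master's degree (M.A., M.S., M.Eng., MBA, etc.)"

# Without a lookahead, the whitespace and "(" are part of the match.
consumed = re.search(r'.+\s\(', degrees).group(0)

# With a lookahead, " (" must follow but is not included in the match.
lookahead = re.search(r'.+(?=\s\()', degrees).group(0)

print(repr(consumed))   # "Master's degree ("
print(repr(lookahead))  # "Master's degree"
```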

To apply this to a DataFrame, you can define a function and pass it to apply:

import re
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=['A', 'B', 'C', 'D'])
df['degree']=["Master's degree","Master's degree (M.A., M.S., M.Eng., MBA, etc.)","Master's degree (MA, MS, M.Eng., MBA, etc.)","Master's degree (MA, MS, M.Eng., MBA, etc.)" ]

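def get_degree(k):
    # Capture the text before the first whitespace + "(" pair, if there is one
    match = re.match(r'.+?(?=\s\()', k)
    # No " (" in the value: return it unchanged instead of raising IndexError
    return match.group(0) if match else k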

print(df.degree.apply(get_degree))

Returns:

0    Master's degree
1    Master's degree
2    Master's degree
3    Master's degree
Name: degree, dtype: object
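As an alternative that skips the helper function, pandas' vectorized string methods can strip everything from the "(" onward in one step. This is only a sketch, not the method used above: the db frame below is made-up sample data, and the EdLevel_clean column name and the "Bachelor's degree" row are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's EdLevel column.
db = pd.DataFrame({"EdLevel": [
    "Master's degree",
    "Master's degree (M.A., M.S., M.Eng., MBA, etc.)",
    "Master's degree (MA, MS, M.Eng., MBA, etc.)",
    "Master's degree (MA, MS, M.Eng., MBA, etc.)",
    "Bachelor's degree (B.A., B.S., B.Eng., etc.)",
]})

# Remove any whitespace before the first "(", the "(" itself, and the rest.
# Values without a "(" are left unchanged.
db["EdLevel_clean"] = db["EdLevel"].str.replace(r"\s*\(.*", "", regex=True)

# Count rows per standardized category, as in the question's groupby.
counts = db.groupby("EdLevel_clean")["EdLevel_clean"].count()
print(counts)
```

Because the pattern does not depend on the word "degree", the same call should also handle the other categories that follow the "level (details)" pattern.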

If every value you need ends in "degree", it is even simpler: just use a positive lookbehind, (?<=...). It checks that a given pattern occurs immediately before the current position, so the match (in this case .+) runs up to and includes that pattern:

import re
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=['A', 'B', 'C', 'D'])
df['degree']=["Master's degree","Master's degree (M.A., M.S., M.Eng., MBA, etc.)","Master's degree (MA, MS, M.Eng., MBA, etc.)","Master's degree (MA, MS, M.Eng., MBA, etc.)" ]

print(df.degree.str.extract(r'(.+(?<=degree))'))

Returns:

                 0
0  Master's degree
1  Master's degree
2  Master's degree
3  Master's degree
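For comparison, a small sketch of the same lookbehind on a single string, plus the expand=False option of str.extract, which returns a Series rather than a one-column DataFrame. The names text, single, s and cleaned are illustrative only.

```python
import re
import pandas as pd

text = "Master's degree (MA, MS, M.Eng., MBA, etc.)"

# (?<=degree) asserts that "degree" sits immediately before the current
# position, so the greedy .+ backtracks until the match ends after "degree".
single = re.search(r'.+(?<=degree)', text).group(0)
print(single)

s = pd.Series([text, "Master's degree"])

# expand=False returns a Series instead of a DataFrame with a column named 0.
cleaned = s.str.extract(r'(.+(?<=degree))', expand=False)
print(cleaned.tolist())
```

Note that values which do not contain "degree" at all come back as NaN from str.extract, so this variant only fits if every category really ends that way.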