1

In below example i have data of movies :

I want to split the title column values into 2 new column i.e one of the new column ( " movie title ") will take = toy story as row value and another new column ("year) will take =1995 as row value.

how to perform this operation on whole dataframe?

 title \ 0 Toy Story (1995) 1 Jumanji (1995) 2 Grumpier Old Men (1995) 3 Waiting to Exhale (1995) 4 Father of the Bride Part II (1995) 
0

5 Answers 5

1

Try to use str accessor with split using a regex that looks for a space and open paranthesis or close paranthesis:

df.title.str.split('\s\(|\)',expand=True) .drop(2, axis=1) .set_axis(['movies','year'], inplace=False, axis=1) 

Output:

 movies year 0 Toy Story 1995 1 Jumanji 1995 2 Grumpier Old Men 1995 3 Waiting to Exhale 1995 4 Father of the Bride Part II 1995 
Sign up to request clarification or add additional context in comments.

Comments

1

If you don't want a regex solution,

df.title.str.rstrip(')').str.split('(').apply(pd.Series) 0 1 0 Toy Story 1995 1 Jumanji 1995 2 Grumpier Old Men 1995 3 Waiting to Exhale 1995 4 Father of the Bride Part II 1995 

Comments

0

Using regex:

df[['movie_title', 'year']] = df.title.str.extract('(.*)\s\((\d+)', expand=True) 

Sample Data:

df = pd.DataFrame({'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)', 'Waiting to Exhale (1995)', 'Father of the Bride Part II (1995)', 'Hello (Goodbye) (1995)' ]}) df[['movie_title', 'year']] = df.title.str.extract('(.*)\s\((\d+)', expand=True) 

Output:

 title movie_title year 0 Toy Story (1995) Toy Story 1995 1 Jumanji (1995) Jumanji 1995 2 Grumpier Old Men (1995) Grumpier Old Men 1995 3 Waiting to Exhale (1995) Waiting to Exhale 1995 4 Father of the Bride Part II (1995) Father of the Bride Part II 1995 5 Hello (Goodbye) (1995) Hello (Goodbye) 1995 

The regular expression we use is: '(.*)\s\((\d+)'.

The first part of the expression is: (.*). The parenthesis indicate it's a capturing group, and what's inside the parenthesis indicates what we want to capture. In this case .* indicates that we greedily want to capture everything. After the capturing group, we have \s\( which is literally interpreted as a space folllwed by an open parenthesis, so ' ('. Because the capturing group before this is greedy, we will capture everything up until the final ' (' if there are multiple such matches. (for instance see how it captures Hello (Goodbye) (1995) properly.

Finally, we add a second capturing group with \d+ as the thing we want to capture, which captures all of the numbers after the final ' (' in your title, which for us will be the 4 digit year.

3 Comments

awesome thanks for the solution , if you don't mind could please explain me regex as i am new to python and also where i will get best practical documentation for regex.. thanks in advance
@VishalSuryavanshi see the edit for some explanation. Otherwise, Regex101 is a great place to learn and test regex.
Thanks buddy .. you are champ . Bravo
0

You can simply create a new dataframe (df_new) that splits the string in the title column of the old dataframe (df) and pass that as a list to the new dataframe constructor:

df_new = pd.DataFrame(df['title'].str.split(r' (?=\()').tolist(), columns=['movie title','year']) 

Yields:

 movie title year 0 Toy Story (1995) 1 Jumanji (1995) 2 Grumpier Old Men (1995) 3 Waiting to Exhale (1995) 4 Father of the Bride Part II (1995) 

Comments

0
df.apply(lambda x: x.str.split('('),axis=1) df2 = pd.DataFrame({'movie title':df['title'].apply(lambda x: x[0][:-1]), 'year':df['title'].apply(lambda x: x[1][:-1])}) print(df2) 

Result

 movie title year 0 Toy Story 1995 1 Jumanji 1995 2 Grumpier Old Men 1995 3 Waiting to Exhale 1995 4 Father of the Bride Part II 1995 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.