TL;DR version:
For the simple case of:
- I have a text column with a delimiter and I want two columns
The simplest solution is:
df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)
You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.
Notice how, in either case, the .tolist() method is not necessary. Neither is zip().
In detail:
Andy Hayden's solution is most excellent in demonstrating the power of the str.extract() method.
But for a simple split over a known separator (like, splitting by dashes, or splitting by whitespace), the .str.split() method is enough1. It operates on a column (Series) of strings, and returns a column (Series) of lists:
>>> import pandas as pd >>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']}) >>> df AB 0 A1-B1 1 A2-B2 >>> df['AB_split'] = df['AB'].str.split('-') >>> df AB AB_split 0 A1-B1 [A1, B1] 1 A2-B2 [A2, B2]
1: If you're unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.
But how do you go from:
- a column containing two-element lists
to:
- two columns, each containing the respective element of the lists?
Well, we need to take a closer look at the .str attribute of a column.
It's a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method in each element as efficient as possible:
>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]}) >>> upper_lower_df U 0 A 1 B 2 C >>> upper_lower_df["L"] = upper_lower_df["U"].str.lower() >>> upper_lower_df U L 0 A a 1 B b 2 C c
But it also has an "indexing" interface for getting each element of a string by its index:
>>> df['AB'].str[0] 0 A 1 A Name: AB, dtype: object >>> df['AB'].str[1] 0 1 1 2 Name: AB, dtype: object
Of course, this indexing interface of .str doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:
>>> df['AB'].str.split('-', 1).str[0] 0 A1 1 A2 Name: AB, dtype: object >>> df['AB'].str.split('-', 1).str[1] 0 B1 1 B2 Name: AB, dtype: object
Then, it's a simple matter of taking advantage of the Python tuple unpacking of iterables to do
>>> df['A'], df['B'] = df['AB'].str.split('-', n=1).str >>> df AB AB_split A B 0 A1-B1 [A1, B1] A1 B1 1 A2-B2 [A2, B2] A2 B2
Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:
>>> df['AB'].str.split('-', n=1, expand=True) 0 1 0 A1 B1 1 A2 B2
So, another way of accomplishing what we wanted is to do:
>>> df = df[['AB']] >>> df AB 0 A1-B1 1 A2-B2 >>> df.join(df['AB'].str.split('-', n=1, expand=True).rename(columns={0:'A', 1:'B'})) AB A B 0 A1-B1 A1 B1 1 A2-B2 A2 B2
The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']}) >>> df AB 0 A1-B1 1 A2-B2 2 A3-B3-C3 >>> df['A'], df['B'], df['C'] = df['AB'].str.split('-') Traceback (most recent call last): [...] ValueError: Length of values does not match length of index >>>
But expand=True handles it nicely by placing None in the columns for which there aren't enough "splits":
>>> df.join( ... df['AB'].str.split('-', expand=True).rename( ... columns={0:'A', 1:'B', 2:'C'} ... ) ... ) AB A B C 0 A1-B1 A1 B1 None 1 A2-B2 A2 B2 None 2 A3-B3-C3 A3 B3 C3
read_table()orread_fwf()