19

I am trying to concatenate multiple pandas DataFrame columns, separated by different tokens.

For example, my dataset looks like this:

    import pandas as pd

    dataframe = pd.DataFrame({
        'col_1': ['aaa', 'bbb', 'ccc', 'ddd'],
        'col_2': ['name_aaa', 'name_bbb', 'name_ccc', 'name_ddd'],
        'col_3': ['job_aaa', 'job_bbb', 'job_ccc', 'job_ddd'],
    })

I want to output something like this:

                           features
    0  aaa <0> name_aaa <1> job_aaa
    1  bbb <0> name_bbb <1> job_bbb
    2  ccc <0> name_ccc <1> job_ccc
    3  ddd <0> name_ddd <1> job_ddd

Explanation:

Concatenate the columns of each row, separating them with "<{}>" tokens, where {} is an increasing number.

What I've tried so far:

I don't want to modify the original DataFrame, so I created two new DataFrames:

    features_df = pd.DataFrame()
    final_df = pd.DataFrame()

    for iters in range(len(dataframe.columns)):
        features_df[dataframe.columns[iters]] = dataframe[dataframe.columns[iters]] + ' ' + "<{}>".format(iters)

    final_df['features'] = features_df[features_df.columns].agg(' '.join, axis=1)

There is an issue: it adds <2> at the end, but I want the output shown above. Also, this is not the pandas way to do this task. How can I make it more efficient?
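The trailing <2> appears because the loop appends a token after every column, including the last one. A minimal sketch of one possible fix, assuming every column holds strings: start from the first column and prepend a token to each remaining column, so nothing follows the last value. The names columns and features are illustrative only.

```python
import pandas as pd

dataframe = pd.DataFrame({
    'col_1': ['aaa', 'bbb', 'ccc', 'ddd'],
    'col_2': ['name_aaa', 'name_bbb', 'name_ccc', 'name_ddd'],
    'col_3': ['job_aaa', 'job_bbb', 'job_ccc', 'job_ddd'],
})

columns = list(dataframe.columns)

# Start with a copy of the first column so the original DataFrame is untouched.
features = dataframe[columns[0]].copy()

# For each remaining column, append " <i> " followed by that column's values.
# The token index i starts at 0, so the last column is never followed by a token.
for i, col in enumerate(columns[1:]):
    features = features + f' <{i}> ' + dataframe[col]

final_df = pd.DataFrame({'features': features})
print(final_df)
```

Each step here is a column-wise string concatenation, so no per-row Python function is called.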

4 Answers

8
    from itertools import chain

    dataframe['features'] = dataframe.apply(
        lambda x: ''.join([*chain.from_iterable((v, f' <{i}> ') for i, v in enumerate(x))][:-1]),
        axis=1,
    )
    print(dataframe)

Prints:

      col_1     col_2    col_3                      features
    0   aaa  name_aaa  job_aaa  aaa <0> name_aaa <1> job_aaa
    1   bbb  name_bbb  job_bbb  bbb <0> name_bbb <1> job_bbb
    2   ccc  name_ccc  job_ccc  ccc <0> name_ccc <1> job_ccc
    3   ddd  name_ddd  job_ddd  ddd <0> name_ddd <1> job_ddd

8

You can use df.agg to join the columns of the dataframe by passing the optional parameter axis=1. Use:

    df['features'] = df.agg(
        lambda s: r' <{}> '.join(s).format(*range(s.size)), axis=1)

Output:

    # print(df)
      col_1     col_2    col_3                      features
    0   aaa  name_aaa  job_aaa  aaa <0> name_aaa <1> job_aaa
    1   bbb  name_bbb  job_bbb  bbb <0> name_bbb <1> job_bbb
    2   ccc  name_ccc  job_ccc  ccc <0> name_ccc <1> job_ccc
    3   ddd  name_ddd  job_ddd  ddd <0> name_ddd <1> job_ddd
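To see why the join-then-format trick works, it may help to run the two steps on a plain list. This is only an illustrative sketch: the variable names and the join_with_tokens helper are hypothetical and not part of the answer. One caveat is that str.format interprets braces, so values that themselves contain { or } would break the one-liner; the helper shows one way to avoid format entirely.

```python
values = ['aaa', 'name_aaa', 'job_aaa']

# Step 1: join the values with a placeholder token between them.
template = ' <{}> '.join(values)
print(template)  # aaa <{}> name_aaa <{}> job_aaa

# Step 2: fill the placeholders with 0, 1, 2, ...
# str.format ignores surplus positional arguments, so passing one
# number per value (one more than there are placeholders) is harmless.
filled = template.format(*range(len(values)))
print(filled)  # aaa <0> name_aaa <1> job_aaa


def join_with_tokens(values):
    """Hypothetical brace-safe variant: build the tokens without str.format."""
    values = list(values)
    if not values:
        return ''
    parts = [values[0]]
    for i, value in enumerate(values[1:]):
        parts.append(f'<{i}>')
        parts.append(value)
    return ' '.join(parts)


print(join_with_tokens(['a{x}', 'b']))  # a{x} <0> b
```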

3 Comments

That's a clever solution.
@ShubhamSharma Since s is a Series, use s.size instead of len(s); it will be faster than len. Or use s.values.size. Nice answer, +1 ;) df.apply over axis 1 is not encouraged; I guess df.agg is the way.
Thanks @Ch3steR! I don't know whether there is any benefit to using s.size instead of len(s), but according to this post, len(s.index) and s.size are about the same in terms of speed. Thanks for the suggestion, by the way.
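The speed question in these comments can be checked locally. A rough sketch using the standard timeit module follows; the sample Series and the repetition count n are arbitrary assumptions, and the timings will vary with machine and pandas version, so no result is claimed here.

```python
import timeit

import pandas as pd

# A small Series similar to one row of the example data.
s = pd.Series(['aaa', 'name_aaa', 'job_aaa'])

n = 100_000  # number of repetitions, chosen arbitrarily

# Time both ways of getting the number of elements.
t_len = timeit.timeit(lambda: len(s), number=n)
t_size = timeit.timeit(lambda: s.size, number=n)

print(f'len(s): {t_len:.4f}s   s.size: {t_size:.4f}s')
```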
3
    def join_(value):
        vals = []
        for i, j in enumerate(value):
            # Append the "<i>" token to every value except the last one.
            vals.append(j + " <%d>" % i if i < len(value) - 1 else j)
        return " ".join(vals)

    # Setting axis=1 passes each row (all of its columns) to join_.
    dataframe['features'] = dataframe.apply(join_, axis=1)
    print(dataframe)

Output

      col_1     col_2    col_3                      features
    0   aaa  name_aaa  job_aaa  aaa <0> name_aaa <1> job_aaa
    1   bbb  name_bbb  job_bbb  bbb <0> name_bbb <1> job_bbb
    2   ccc  name_ccc  job_ccc  ccc <0> name_ccc <1> job_ccc
    3   ddd  name_ddd  job_ddd  ddd <0> name_ddd <1> job_ddd

3
    df['features'] = [
        " ".join(
            f"{entry}<{num}>" if ent[-1] != entry else entry
            for num, entry in enumerate(ent)
        )
        for ent in df.to_numpy()
    ]

Output:

      col_1     col_2    col_3                    features
    0   aaa  name_aaa  job_aaa  aaa<0> name_aaa<1> job_aaa
    1   bbb  name_bbb  job_bbb  bbb<0> name_bbb<1> job_bbb
    2   ccc  name_ccc  job_ccc  ccc<0> name_ccc<1> job_ccc
    3   ddd  name_ddd  job_ddd  ddd<0> name_ddd<1> job_ddd
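Two details of this list comprehension are worth noting. The token is attached without a space, so the result differs slightly from the format requested in the question. The last column is also detected by value (ent[-1] != entry), so a row in which an earlier column equals the last one would lose a token. A possible variant that compares positions instead is sketched below; join_row is a hypothetical helper name, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': ['aaa', 'bbb', 'ccc', 'ddd'],
    'col_2': ['name_aaa', 'name_bbb', 'name_ccc', 'name_ddd'],
    'col_3': ['job_aaa', 'job_bbb', 'job_ccc', 'job_ddd'],
})


def join_row(row):
    """Join one row, adding ' <i>' after every entry except the last."""
    last = len(row) - 1
    return ' '.join(
        f'{entry} <{num}>' if num != last else entry
        for num, entry in enumerate(row)
    )


# df.to_numpy() is evaluated before the new column is assigned,
# so only the three original columns are joined.
df['features'] = [join_row(row) for row in df.to_numpy()]
print(df)

# Duplicate values within a row keep their tokens.
print(join_row(['x', 'y', 'x']))  # x <0> y <1> x
```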
