Adding a new column in my existing dataframe in pandas

Question

Original Dataframe is

column_one
1
1
1
45
45
55
55
56

Expected Output

column-new
i_1
i_1
i_1
i_2
i_2
i_3
i_3
i_4

Based on column_one, I want to add a new column to my dataframe. Wherever consecutive values are the same, they should get 'i' with the same index. Thank you in advance.

yatu · Accepted Answer · 2019-02-11 10:56:00Z

You could use pd.factorize. From the docs:

Useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.

So it will encode each new value it encounters as an enumerated type. Afterwards you can simply add the 'i_' prefix to the new_col:

df['new_col'] = (df.col1.factorize()[0] + 1).astype(str)
df['new_col'] = 'i_' + df.new_col

Output

   col1 new_col
0     1     i_1
1     1     i_1
2     1     i_1
3    45     i_2
4    45     i_2
5    55     i_3
6    55     i_3
7    56     i_4
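One thing worth checking against the question's wording: factorize numbers distinct values, not consecutive runs, so a value that reappears later gets its earlier label back. Below is a minimal sketch of both readings. The helper names label_by_value and label_by_run are made up for illustration, and the run-based variant (comparing each row to the previous one with shift and taking a cumulative sum) is an alternative technique, not part of the answer above.

```python
import pandas as pd


def label_by_value(s):
    # Hypothetical helper: same idea as the factorize answer.
    # factorize() returns (codes, uniques); codes start at 0 and follow
    # the order in which each distinct value first appears.
    codes, _ = pd.factorize(s)
    return "i_" + pd.Series(codes + 1, index=s.index).astype(str)


def label_by_run(s):
    # Hypothetical helper: a new label starts whenever the value differs
    # from the previous row, so a repeated value later on gets a new label.
    run_id = s.ne(s.shift()).cumsum()
    return "i_" + run_id.astype(str)


df = pd.DataFrame({"col1": [1, 1, 1, 45, 45, 55, 55, 56]})
df["new_col"] = label_by_value(df["col1"])
df["run_col"] = label_by_run(df["col1"])
print(df)
```

For the data in the question the two columns are identical. They only differ when a value comes back after a gap, e.g. [1, 1, 45, 1].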

HadynB · Accepted Answer · 2019-02-11 13:22:33Z

Depending on how fast this needs to perform, you can look into using categoricals (dtype="category"), as they should be incredibly fast with large data sets.

If you load your data with a category dtype, pandas has already determined which values are unique.

df["col1"] = df["col1"].astype('category')

From here, you can prepend the leading "i_" to the category code, remembering to add 1 so that the numbering begins at 1 rather than 0:

df['newcol1'] = "i_" + (df["col1"].cat.codes + 1).astype(str)

Output

   col1 newcol1
0     1     i_1
1     1     i_1
2     1     i_1
3    45     i_2
4    45     i_2
5    55     i_3
6    55     i_3
7    56     i_4
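A caveat when swapping this in for factorize: astype('category') sorts the categories when the values are sortable, so cat.codes follow sorted order, while factorize (with its default sort=False) numbers values in order of first appearance. The output matches here only because the sample column is already ascending. A small sketch with made-up, unsorted data to show the difference:

```python
import pandas as pd

# Illustrative data (not from the question): the largest value comes first.
s = pd.Series([56, 1, 1, 45])

# Categories are sorted (1, 45, 56), so 56 gets the highest code.
cat_codes = s.astype("category").cat.codes + 1
cat_labels = ("i_" + cat_codes.astype(str)).tolist()

# factorize numbers values in order of first appearance, so 56 gets code 1.
fact_codes = pd.factorize(s)[0] + 1
fact_labels = ["i_" + str(c) for c in fact_codes]

print(cat_labels)
print(fact_labels)
```

If the labels need to follow the order of appearance, factorize is the safer of the two.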

Timings

Because the category codes are already stored on the column, reading them is far faster than calling factorize: for a column of 10,000,000 values between 0 and 1000, the category lookup takes effectively no time, whereas factorize has to compute the codes every time it is called.

Note that converting the column to a categorical has a one-off setup cost (also shown for completeness), so factorize is the better choice if you only need to do this once.

Categoricals: 0 ms
Factorize: 2092 ms
Categoricals Conversion: 3253 ms

Timings Code:

import time

import numpy as np
import pandas as pd

def timing(label, fn):
    # Run fn once and print the elapsed wall-clock time in milliseconds.
    t0 = time.time()
    fn()
    t1 = time.time()
    print('%s: %d ms' % (label, int((t1 - t0) * 1000)))

df = pd.DataFrame(np.random.randint(low=0, high=1000, size=(100000000, 1)), columns=["col1"])
# One-off conversion to a categorical (the setup overhead mentioned above).
df["col1"] = df["col1"].astype('category')

timing('Categoricals', lambda: (df.col1.cat.codes))
timing('Factorize', lambda: (df.col1.factorize()))
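The timing code above calls factorize on the column after it has been converted to a categorical, and it does not time the conversion itself. As a rough sketch of a variant that times all three steps against the raw integer column: the timed helper and the smaller row count (1,000,000) are assumptions made here to keep memory use modest, and the numbers will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def timed(fn):
    # Hypothetical helper: run fn once, return (result, elapsed milliseconds).
    t0 = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - t0) * 1000


n_rows = 1_000_000  # assumption: smaller than the run above
rng = np.random.default_rng(0)
raw = pd.Series(rng.integers(0, 1000, size=n_rows), name="col1")

# One-off setup cost of building the categorical.
as_cat, convert_ms = timed(lambda: raw.astype("category"))
# Reading codes that are already stored on the categorical.
cat_codes, codes_ms = timed(lambda: as_cat.cat.codes)
# Computing codes from scratch on the raw integers.
(fact_codes, _), factorize_ms = timed(lambda: raw.factorize())

print("Categoricals conversion: %.1f ms" % convert_ms)
print("Categoricals codes: %.1f ms" % codes_ms)
print("Factorize: %.1f ms" % factorize_ms)
```

Comparing the conversion time plus the code lookup against a single factorize call gives a fairer picture of the one-shot case.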
