Original DataFrame:

column_one
1
1
1
45
45
55
55
56

Expected output:

column-new
i_1
i_1
i_1
i_2
i_2
i_3
i_3
i_4

Based on column_one, I want to add a new column to my DataFrame. Wherever consecutive values are the same, the new column should hold 'i_' followed by the same index. Thank you in advance.

2 Answers

You could use pd.factorize. From the docs:

Useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.

So it assigns each distinct value an integer code, numbered from 0 in order of first appearance. Afterwards you can simply add 1 and attach the 'i_' prefix to get new_col:

df['new_col'] = (df.col1.factorize()[0] + 1).astype(str)
df['new_col'] = 'i_' + df.new_col

Output

   col1 new_col
0     1     i_1
1     1     i_1
2     1     i_1
3    45     i_2
4    45     i_2
5    55     i_3
6    55     i_3
7    56     i_4
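One caveat worth illustrating: factorize labels by distinct value, not by run. If a value can come back later in the column and should then get a fresh label, a shift/cumsum comparison is one possible alternative. The sketch below is only an illustration under that assumption; the data and the column names by_value and by_run are hypothetical and not from the question.

```python
import pandas as pd

# Hypothetical data: the value 1 comes back after the run of 45s.
df = pd.DataFrame({"col1": [1, 1, 45, 45, 1, 1]})

# factorize labels distinct values in order of first appearance,
# so the second run of 1s gets the same label as the first.
codes, uniques = pd.factorize(df["col1"])
df["by_value"] = "i_" + pd.Series(codes + 1, index=df.index).astype(str)

# A new run starts wherever a value differs from the one before it;
# the cumulative sum of those change points numbers the runs from 1.
run_id = (df["col1"] != df["col1"].shift()).cumsum()
df["by_run"] = "i_" + run_id.astype(str)

print(df)
```

For data like the question's, where equal values always sit next to each other, both columns come out the same.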

1 Comment

Thank you so much. I got the actual output.

Depending on how fast this needs to be, you could look into categoricals (dtype="category"), as they should be very fast with large data sets.

If you load your data with a category dtype, pandas will already have determined the unique values.

df["col1"] = df["col1"].astype('category') 

From here, you can prepend "i_" to the category code, adding 1 so that the numbering starts at 1 rather than 0:

df['newcol1'] = "i_" + (df["col1"].cat.codes + 1).astype(str) 

Output

   col1 newcol1
0     1     i_1
1     1     i_1
2     1     i_1
3    45     i_2
4    45     i_2
5    55     i_3
6    55     i_3
7    56     i_4
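The output matches the factorize version here because the column happens to be sorted. Category codes index into the sorted categories, while factorize numbers values in order of first appearance, so the two can differ on unsorted data. A small sketch with made-up values (not from the question) to show the difference:

```python
import pandas as pd

# Hypothetical unsorted data, for illustration only.
s = pd.Series([45, 1, 45, 56, 1])

# Category codes index into the sorted categories [1, 45, 56].
cat_codes = s.astype("category").cat.codes.tolist()

# factorize numbers values in order of first appearance: 45, 1, 56.
fact_codes = pd.factorize(s)[0].tolist()

print(cat_codes)   # [1, 0, 1, 2, 0]
print(fact_codes)  # [0, 1, 0, 2, 1]
```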

Timings

Timing the category lookup against the factorize function for a column of 10,000,000 values between 0 and 1000 shows the category approach to be far faster. This is because .cat.codes only reads codes that were already computed when the column was converted, whereas factorize computes them on every call.

Note that the conversion itself has an initial setup cost (also shown, for completeness), so the factorize function would be the better choice if you are only doing this once.

Categoricals: 0 ms
Factorize: 2092 ms
Categoricals Conversion: 3253 ms

Timings Code:

import time

import numpy as np
import pandas as pd


def timing(label, fn):
    # Run fn once and print the elapsed wall-clock time in milliseconds
    t0 = time.time()
    fn()
    t1 = time.time()
    print('%s: %d ms' % (label, int((t1 - t0) * 1000)))


df = pd.DataFrame(np.random.randint(low=0, high=1000, size=(100000000, 1)), columns=["col1"])
# One-off conversion to category (this step is not timed here)
df["col1"] = df["col1"].astype('category')

timing('Categoricals', lambda: (df.col1.cat.codes))
timing('Factorize', lambda: (df.col1.factorize()))
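The code above does not time the conversion step itself. A possible way to measure all three figures in one run is sketched below. It is an assumption about how such a measurement could be set up, not the exact script behind the numbers reported: the helper time_ms, the results dict, and the smaller row count are all hypothetical, and factorize is run on the raw integer column rather than on the converted one.

```python
import time

import numpy as np
import pandas as pd


def time_ms(fn):
    """Return the wall-clock time of one call to fn, in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000


# Assumed row count, kept small so the sketch runs quickly.
n_rows = 1_000_000
raw = pd.Series(np.random.randint(low=0, high=1000, size=n_rows), name="col1")

results = {}
# One-off cost of building the categorical.
results["Categoricals Conversion"] = time_ms(lambda: raw.astype("category"))

cat = raw.astype("category")
# Reading codes that are already stored in the categorical.
results["Categoricals"] = time_ms(lambda: cat.cat.codes)
# Computing codes from scratch on the raw integers.
results["Factorize"] = time_ms(lambda: raw.factorize())

for label, ms in results.items():
    print("%s: %.1f ms" % (label, ms))
```

Absolute numbers will vary by machine and pandas version, so only the relative ordering is worth reading into.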

1 Comment

Thank you so much. I got my expected output
