
Suppose I have a pandas DataFrame df.

I want to calculate the column-wise mean of the DataFrame.

This is easy:

df.apply(np.mean)   # or simply df.mean()

Then the column-wise range, max(col) - min(col). This is easy again:

df.apply(max) - df.apply(min) 

Now, for each element, I want to subtract its column's mean and divide by its column's range. I am not sure how to do that.

Any help/pointers are much appreciated.


5 Answers

In [92]: df
Out[92]:
           a         b          c         d
A  -0.488816  0.863769   4.325608 -4.721202
B -11.937097  2.993993 -12.916784 -1.086236
C  -5.569493  4.672679  -2.168464 -9.315900
D   8.892368  0.932785   4.535396  0.598124

In [93]: df_norm = (df - df.mean()) / (df.max() - df.min())

In [94]: df_norm
Out[94]:
          a         b         c         d
A  0.085789 -0.394348  0.337016 -0.109935
B -0.463830  0.164926 -0.650963  0.256714
C -0.158129  0.605652 -0.035090 -0.573389
D  0.536170 -0.376229  0.349037  0.426611

In [95]: df_norm.mean()
Out[95]:
a   -2.081668e-17
b    4.857226e-17
c    1.734723e-17
d   -1.040834e-17

In [96]: df_norm.max() - df_norm.min()
Out[96]:
a    1
b    1
c    1
d    1

9 Comments

Is there a way to do this if you want to normalize a subset? Say that rows A and B are part of a larger grouping factor that you want to normalize separately from C and D.
Select the subset and compute as before (there's also a sketch after these comments). See pandas.pydata.org/pandas-docs/stable/indexing.html for how to index and select data.
If you need your values to be > 0: df_norm = (df - df.min()) / (df.max() - df.min())
It should be df_norm = (df - df.min()) / (df.max() - df.min()), i.e. df.min() rather than df.mean() in the first brackets, to get values between 0 and 1.
If your DataFrame has strings in some columns, see this answer.
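A minimal sketch of the subset idea from the comments above, assuming rows A and B form one group and rows C and D another; the grp column and the groupby/transform approach are my own illustration, not from the answer:

df['grp'] = ['g1', 'g1', 'g2', 'g2']  # hypothetical grouping factor: rows A, B vs. rows C, D
# transform applies the normalization within each group, column by column
df_norm = df.groupby('grp').transform(lambda x: (x - x.mean()) / (x.max() - x.min()))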

If you don't mind importing the sklearn library, I would recommend the method discussed on this blog.

import pandas as pd
from sklearn import preprocessing

data = {'score': [234, 24, 14, 27, -74, 46, 73, -18, 59, 160]}
df = pd.DataFrame(data)
cols = df.columns  # take the columns from the DataFrame, not the dict

min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df)
df_normalized = pd.DataFrame(np_scaled, columns=cols)
df_normalized

4 Comments

The link to the blog post is dead. Do you have a working one?
The corresponding method to create unit-normal normalized data is called StandardScaler.
I found a similar solution in another place. The problem was that the np_scaled part was showing an error expecting a 2D array when the input is a 1D array, and it recommended using reshape(-1,1). Any idea how to solve this, as reshape is also not working?
You might get warnings depending on what versions of numpy & sklearn you work with, but in general this should work: np_scaled = min_max_scaler.fit_transform(df.score.astype(float).values.reshape(-1, 1))
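Pulling the fixes from these comments together, here is a short sketch; it assumes the single score column from the answer above, and the StandardScaler line is the zero-mean/unit-variance variant mentioned in the second comment:

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'score': [234, 24, 14, 27, -74, 46, 73, -18, 59, 160]})

# the scalers expect a 2-D array, so reshape the 1-D column first
values = df['score'].astype(float).values.reshape(-1, 1)

min_max_scaler = preprocessing.MinMaxScaler()
scaled = min_max_scaler.fit_transform(values)        # rescaled to [0, 1]

std_scaler = preprocessing.StandardScaler()
standardized = std_scaler.fit_transform(values)      # zero mean, unit variance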

You can use apply for this, and it's a bit neater:

import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randn(4, 4) * 4 + 3)

          0         1         2         3
0  9.497381  0.552974  0.887313 -1.291874
1  6.461631 -6.206155  9.979247 -0.044828
2  4.276156  2.002518  8.848432 -5.240563
3  1.710331  1.463783  7.535078 -1.399565

df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))

          0         1         2         3
0  0.515087  0.133967 -0.651699  0.135175
1  0.125241 -0.689446  0.348301  0.375188
2 -0.155414  0.310554  0.223925 -0.624812
3 -0.484913  0.244924  0.079473  0.114448

Also, it works nicely with groupby, if you select the relevant columns:

df['grp'] = ['A', 'A', 'B', 'B']

          0         1         2         3 grp
0  9.497381  0.552974  0.887313 -1.291874   A
1  6.461631 -6.206155  9.979247 -0.044828   A
2  4.276156  2.002518  8.848432 -5.240563   B
3  1.710331  1.463783  7.535078 -1.399565   B

df.groupby(['grp'])[[0, 1, 2, 3]].apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))

     0    1    2    3
0  0.5  0.5 -0.5 -0.5
1 -0.5 -0.5  0.5  0.5
2  0.5  0.5  0.5 -0.5
3 -0.5 -0.5 -0.5  0.5

Comments


Slightly modified from: Python Pandas Dataframe: Normalize data between 0.01 and 0.99?, but from some of the comments I thought it was relevant (sorry if it's considered a repost though...).

I wanted customized normalization, where the regular percentile of a datum or z-score was not adequate. Sometimes I knew what the feasible max and min of the population were, and therefore wanted to define them from something other than my sample, or a different midpoint, or whatever! This can often be useful for rescaling and normalizing data for neural nets, where you may want all inputs between 0 and 1, but some of your data may need to be scaled in a more customized way... because percentiles and stdevs assume your sample covers the population, but sometimes we know this isn't true. It was also very useful for me when visualizing data in heatmaps. So I built a custom function (with extra steps in the code here to make it as readable as possible):

from statistics import mean, median  # used by the 'avg' and 'median' center options

def NormData(s, low='min', center='mid', hi='max', insideout=False, shrinkfactor=0.):
    if low == 'min':
        low = min(s)
    elif low == 'abs':
        low = max(abs(min(s)), abs(max(s))) * -1.  # sign(min(s))
    if hi == 'max':
        hi = max(s)
    elif hi == 'abs':
        hi = max(abs(min(s)), abs(max(s))) * 1.  # sign(max(s))

    if center == 'mid':
        center = (max(s) + min(s)) / 2
    elif center == 'avg':
        center = mean(s)
    elif center == 'median':
        center = median(s)

    # shift everything so the chosen center sits at 0
    s2 = [x - center for x in s]
    hi = hi - center
    low = low - center
    center = 0.

    # map [low, center] -> [0, 0.5] and [center, hi] -> [0.5, 1], clipping outside
    r = []
    for x in s2:
        if x < low:
            r.append(0.)
        elif x > hi:
            r.append(1.)
        elif x >= center:
            r.append((x - center) / (hi - center) * 0.5 + 0.5)
        else:
            r.append((x - low) / (center - low) * 0.5 + 0.)

    if insideout:
        r = [(1. - abs(z - 0.5) * 2.) for z in r]

    # pull values toward 0.5 by the shrink factor, away from the 0/1 endpoints
    rr = [x - (x - 0.5) * shrinkfactor for x in r]
    return rr

This will take in a pandas Series, or even just a list, and normalize it to your specified low, center, and high points. There is also a shrink factor, to let you scale the data away from the endpoints 0 and 1 (I had to do this when combining colormaps in matplotlib: Single pcolormesh with more than one colormap using Matplotlib). So you can likely see how the code works, but basically: say you have values [-5, 2, 10] in a sample, but want to normalize based on a range of -7 to 7 (so anything above 7, like our "10", is effectively treated as a 7) with a midpoint of 1, but shrunk to fit a 256-color RGB colormap:

#In[1]
NormData([-5, 2, 10], low=-7, center=1, hi=7, shrinkfactor=2./256)
#Out[1]
[0.1279296875, 0.5826822916666667, 0.99609375]

It can also turn your data inside out... this may seem odd, but I found it useful for heatmapping. Say you want a darker color for values closer to 0 rather than to hi/low. You could heatmap based on normalized data with insideout=True:

#In[2]
NormData([-5, 2, 10], low=-7, center=1, hi=7, insideout=True, shrinkfactor=2./256)
#Out[2]
[0.251953125, 0.8307291666666666, 0.00390625]

So now "2" which is closest to the center, defined as "1" is the highest value.

Anyway, I thought my application was relevant if you're looking to rescale data in other ways that could have useful applications for you.

2 Comments

You can replace all the if/else statements with a dictionary of functions; it looks a bit cleaner then (see the sketch after these comments).
That's pretty neat, I'll keep that in mind next time, thanks!
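A minimal sketch of the dictionary-dispatch idea from the comment above; the keys mirror the center options in NormData, and the names are hypothetical:

from statistics import mean, median

# map each keyword option to a function of the sample
center_funcs = {
    'mid':    lambda s: (max(s) + min(s)) / 2,
    'avg':    mean,
    'median': median,
}
center = center_funcs['avg']([-5, 2, 10])  # picks the mean, ~2.33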

This is how you do it column-wise:

for col in df.columns:
    df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min()))

Comments
