489

I have a dataframe in pandas where each column has different value range. For example:

df:

    A     B   C
    1000  10  0.5
    765   5   0.35
    800   7   0.09

Any idea how I can normalize the columns of this dataframe where each value is between 0 and 1?

My desired output is:

    A      B    C
    1      1    1
    0.765  0.5  0.7
    0.8    0.7  0.18 (which is 0.09/0.5)
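The desired output above corresponds to dividing each column by its own maximum, which is different from min-max scaling (that would map each column's minimum to 0). A minimal sketch that reproduces it, assuming the frame is built from the values shown in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1000, 765, 800],
    "B": [10, 5, 7],
    "C": [0.5, 0.35, 0.09],
})

# df.max() returns one maximum per column; the division is aligned
# on the column labels, so each column is divided by its own max.
scaled = df / df.max()
print(scaled.round(3))
```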
9
  • 3
    there is an apply function, e.g. frame.apply(f, axis=1) where f is a function that does something with a row... Commented Oct 16, 2014 at 22:30
  • 2
    Normalization might not be the most appropriate wording, since scikit-learn documentation defines it as "the process of scaling individual samples to have unit norm" (i.e. row by row, if I get it correctly). Commented Mar 5, 2019 at 16:58
  • 3
    I do not get it, why min_max scaling is considered normalization! normal has got to have meaning in the sense of normal distribution with mean zero and variance 1. Commented Apr 21, 2019 at 2:21
  • 7
    If you are visiting this question in 2020 or later, look at answer by @Poudel, you get different answer of normalizing if you use pandas vs sklearn. Commented Jan 29, 2020 at 20:10
  • @Poudel is this due to the ddof argument? Commented Apr 4, 2020 at 20:26

25 Answers

834

one easy way by using Pandas: (here I want to use mean normalization)

normalized_df=(df-df.mean())/df.std() 

to use min-max normalization:

normalized_df=(df-df.min())/(df.max()-df.min()) 

Edit: To address some concerns, note that pandas automatically applies these operations column-wise in the code above.
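To illustrate the column-wise behavior, here is a small sketch on the question's values (the variable names are illustrative). df.mean(), df.std(), df.min() and df.max() each return one value per column, and the arithmetic is aligned on the column labels:

```python
import pandas as pd

df = pd.DataFrame({"A": [1000, 765, 800], "B": [10, 5, 7], "C": [0.5, 0.35, 0.09]})

# Mean normalization (z-score); pandas' std() uses ddof=1 by default.
zscore_df = (df - df.mean()) / df.std()

# Min-max normalization: each column's min maps to 0 and its max to 1.
minmax_df = (df - df.min()) / (df.max() - df.min())

print(zscore_df)
print(minmax_df.agg(["min", "max"]))
```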


12 Comments

Can it be somehow done with window function? What I mean by that is calculating max() and min() based on eg latest 10 observation.
if you want to save some column - do normalized_df['TARGET'] = df['TARGET']
Comparing this with MinMaxScaler(), which one would be faster in a case where features will be greater than 1000? And, uses less memory?
this is a good solution, but you need a lot of less-beautiful checks to avoid divide by zero errors
is there a built-in standard way of doing this per column without looping over all the columns?
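On the rolling-window question in the comments: one possible sketch uses DataFrame.rolling to take the min and max over the most recent observations. The window size of 3 and the sample data are assumptions for illustration (the comment mentions 10). Rows where the window's max equals its min, including the very first row, come out as NaN.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 3.0, 2.0, 6.0, 4.0]})

window = 3  # hypothetical window size, chosen small for the example
roll = df.rolling(window=window, min_periods=1)

# Scale each value against the min/max of the trailing window ending at that row.
rolling_scaled = (df - roll.min()) / (roll.max() - roll.min())
print(rolling_scaled)
```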
427

You can use the package sklearn and its associated preprocessing utilities to normalize the data.

    import pandas as pd
    from sklearn import preprocessing

    x = df.values  # returns a numpy array
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    df = pd.DataFrame(x_scaled)

For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.

9 Comments

i think this will get rid of the column names, which might be one of the reasons op is using dataframes in the first place.
This will normalize the rows and not the columns, unless you transpose it first. To do what the Q asks for: pd.DataFrame(min_max_scaler.fit_transform(df.T), columns=df.columns, index=df.index)
@pietz to keep column names, see this post. Basically replace the last line with , df=pandas.DataFrame(x_scaled, columns=df.columns)
@hobs This is not correct. Sandman's code normalizes column-wise and per-column. You get the wrong result if you transpose.
@petezurich It looks like Sandman or Praveen corrected their code. Unfortunately, it's not possible to correct comments ;)
94

Detailed Example of Normalization Methods

  • Pandas normalization (unbiased)
  • Sklearn normalization (biased)
  • Does biased-vs-unbiased affect Machine Learning?
  • Min-max scaling

References: Wikipedia: Unbiased Estimation of Standard Deviation

Example Data

    import pandas as pd

    df = pd.DataFrame({
        'A': [1, 2, 3],
        'B': [100, 300, 500],
        'C': list('abc')
    })
    print(df)

       A    B  C
    0  1  100  a
    1  2  300  b
    2  3  500  c

Normalization using pandas (Gives unbiased estimates)

When normalizing we simply subtract the mean and divide by standard deviation.

    df.iloc[:, 0:-1] = df.iloc[:, 0:-1].apply(lambda x: (x - x.mean()) / x.std(), axis=0)
    print(df)

         A    B  C
    0 -1.0 -1.0  a
    1  0.0  0.0  b
    2  1.0  1.0  c

Normalization using sklearn (Gives biased estimates, different from pandas)

If you do the same thing with sklearn you will get DIFFERENT output!

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    df = pd.DataFrame({
        'A': [1, 2, 3],
        'B': [100, 300, 500],
        'C': list('abc')
    })
    df.iloc[:, 0:-1] = scaler.fit_transform(df.iloc[:, 0:-1].to_numpy())
    print(df)

              A         B  C
    0 -1.224745 -1.224745  a
    1  0.000000  0.000000  b
    2  1.224745  1.224745  c

Do the biased estimates of sklearn make machine learning less powerful?

NO.

The official documentation of sklearn.preprocessing.scale states that using biased estimator is UNLIKELY to affect the performance of machine learning algorithms and we can safely use them.

From official documentation:

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.
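The difference comes down to the ddof argument. A short sketch, using only pandas, that reproduces both sets of numbers shown earlier by switching ddof (pandas defaults to ddof=1; per the documentation quoted above, scikit-learn uses the equivalent of ddof=0):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [100, 300, 500]})

# Sample standard deviation (ddof=1), the pandas default.
z_unbiased = (df - df.mean()) / df.std(ddof=1)

# Population standard deviation (ddof=0), matching numpy.std's default.
z_biased = (df - df.mean()) / df.std(ddof=0)

print(z_unbiased)  # -1.0, 0.0, 1.0 in each column
print(z_biased)    # about -1.224745, 0.0, 1.224745 in each column
```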

What about MinMax Scaling?

There is no standard deviation calculation in MinMax scaling, so the result is the same in both pandas and scikit-learn.

    import pandas as pd

    df = pd.DataFrame({
        'A': [1, 2, 3],
        'B': [100, 300, 500],
    })
    (df - df.min()) / (df.max() - df.min())

         A    B
    0  0.0  0.0
    1  0.5  0.5
    2  1.0  1.0

    # Using sklearn
    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    arr_scaled = scaler.fit_transform(df)
    print(arr_scaled)

    [[0.  0. ]
     [0.5 0.5]
     [1.  1. ]]

    df_scaled = pd.DataFrame(arr_scaled, columns=df.columns, index=df.index)
    print(df_scaled)

         A    B
    0  0.0  0.0
    1  0.5  0.5
    2  1.0  1.0

1 Comment

Note, however, that in the desired output the minimum is not mapped to zero.
80

Based on this post: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

You can do the following:

    def normalize(df):
        result = df.copy()
        for feature_name in df.columns:
            max_value = df[feature_name].max()
            min_value = df[feature_name].min()
            result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
        return result

You don't need to worry about whether your values are negative or positive, and the values should be nicely spread out between 0 and 1.

5 Comments

Be careful when min and max values are same, your denominator is 0 and you will get a NaN value.
@HrushikeshDhumal, No need to normalize then, Since all values would be equal.
@AppajiChintimi, this solution applies to entire data, if you haven't done sanity check you could run into trouble.
If you have numeric and non-numeric columns mixed, use for feature_name in df.select_dtypes(include=['int', 'float']).columns: to only normalize numeric columns
Note, however, that in the desired output the minimum is not mapped to zero.
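On the divide-by-zero concern raised in the comments: one possible guard, sketched below, replaces a zero range with NaN before dividing and then fills the result with 0. The helper name safe_minmax and the choice of 0 as the fill value are assumptions, not part of the answer. Note that the fill also turns any NaN already present in the data into 0.

```python
import numpy as np
import pandas as pd

def safe_minmax(df: pd.DataFrame) -> pd.DataFrame:
    """Min-max scale numeric columns; constant columns become 0 instead of NaN."""
    num = df.select_dtypes(include="number")
    # A zero range would cause 0/0; mark it as NaN and fill afterwards.
    value_range = (num.max() - num.min()).replace(0, np.nan)
    scaled = ((num - num.min()) / value_range).fillna(0.0)
    result = df.copy()
    result[num.columns] = scaled  # non-numeric columns are left untouched
    return result

df = pd.DataFrame({"x": [1, 2, 3], "const": [5, 5, 5], "label": list("abc")})
print(safe_minmax(df))
```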
66

Your problem is actually a simple transform acting on the columns:

    def f(s):
        return s / s.max()

    frame.apply(f, axis=0)

Or even more terse:

 frame.apply(lambda x: x/x.max(), axis=0) 

5 Comments

The lambda one is the best :-)
isn't this supposed to be axis=1 since the question is column wise normalization?
No, from the docs: axis [...] 0 or 'index': apply function to each column. The default is actually axis=0 so this one-liner can be written even shorter :-) Thanks @tschm.
This is only correct if the min is 0, which isn't something that you should really assume
My example was meant to illustrate how to apply functions on columns of dataframes. Obviously, as always, you need to pay attention to corner cases, e.g. here the max could be zero and result in an issue. Not sure I understand @QFSW.
43

If you like using the sklearn package, you can keep the column and index names by using pandas loc like so:

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    scaled_values = scaler.fit_transform(df)
    df.loc[:, :] = scaled_values

1 Comment

Note, however, that in the desired output the minimum is not mapped to zero.
38

Take care with this answer, as it ONLY works for data that ranges [0, n]. This does not work for any range of data.


Simple is Beautiful:

    df["A"] = df["A"] / df["A"].max()
    df["B"] = df["B"] / df["B"].max()
    df["C"] = df["C"] / df["C"].max()

5 Comments

Note, that OP asked for [0..1] range and this solution scales to [-1..1] range. Try this with the array [-10, 10].
@AlexanderSosnovshchenko not really. Basil Musa is assuming the OP's matrix is always non-negative, that's why he has given this solution. If some column has a negative entry then this code does NOT normalize to the [-1,1] range. Try it with the array [-5, 10]. The correct way to normalize to [0,1] with negative values was given by Cina's answer df["A"] = (df["A"]-df["A"].min()) / (df["A"].max()-df["A"].min())
Perhaps even simpler: df /= df.max() - assuming the goal is to normalise each and every column, individually.
This answer is wrong. The non-negative assumption can't be made here, as neither the OP nor future readers stated it. Moreover, even strictly positive data doesn't work here: [1, 10] will be normalized to [0.1, 1] instead of [0, 1].
Thanks @Gulzar, I'm the author of this answer and TBH I was surprised that it was upvoted 29 times.
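The disagreement in the comments is easy to check numerically. The sketch below compares dividing by the maximum with min-max scaling, using the sample values from the comments (the helper name min_max is illustrative):

```python
import pandas as pd

def min_max(s: pd.Series) -> pd.Series:
    """Map the minimum of s to 0 and the maximum to 1."""
    return (s - s.min()) / (s.max() - s.min())

s_pos = pd.Series([1, 10])
s_neg = pd.Series([-5, 10])

# Dividing by the max only reaches 0 when the data itself contains 0.
print((s_pos / s_pos.max()).tolist())  # [0.1, 1.0]
print((s_neg / s_neg.max()).tolist())  # [-0.5, 1.0]

# Min-max scaling maps the minimum to 0 and the maximum to 1 in both cases.
print(min_max(s_pos).tolist())         # [0.0, 1.0]
print(min_max(s_neg).tolist())         # [0.0, 1.0]
```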
35

You can create a list of columns that you want to normalize

    from sklearn import preprocessing

    min_max_scaler = preprocessing.MinMaxScaler()  # same scaler as in the sklearn answer above

    column_names_to_normalize = ['A', 'E', 'G', 'sadasdsd', 'lol']
    x = df[column_names_to_normalize].values
    x_scaled = min_max_scaler.fit_transform(x)
    df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index=df.index)
    df[column_names_to_normalize] = df_temp

Your pandas DataFrame is now normalized only in the columns you selected.

However, if you want the opposite, that is, to list the columns you DON'T want to normalize, you can build a list of all columns and remove the unwanted ones:

    column_names_to_not_normalize = ['B', 'J', 'K']
    column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize]

Comments

18

Normalize

You can use minmax_scale to transform each column to a scale from 0-1.

    from sklearn.preprocessing import minmax_scale

    df[:] = minmax_scale(df)

Standardize

You can use scale to center each column to the mean and scale to unit variance.

    from sklearn.preprocessing import scale

    df[:] = scale(df)

Column Subsets

Normalize single column

    from sklearn.preprocessing import minmax_scale

    df['a'] = minmax_scale(df['a'])

Normalize only numerical columns

    import numpy as np
    from sklearn.preprocessing import minmax_scale

    cols = df.select_dtypes(np.number).columns
    df[cols] = minmax_scale(df[cols])

Full Example

    # Prep
    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import minmax_scale

    # Sample data
    df = pd.DataFrame({'a': [0, 1, 2], 'b': [-10, -30, -50], 'c': ['x', 'y', 'z']})

    # MinMax normalize all numeric columns
    cols = df.select_dtypes(np.number).columns
    df[cols] = minmax_scale(df[cols])

    # Result
    print(df)
    #      a    b  c
    # 0  0.0  1.0  x
    # 1  0.5  0.5  y
    # 2  1.0  0.0  z

Notes:

In all examples scale can be used instead of minmax_scale. Keeps index, column names or non-numerical variables unchanged. Function is applied for each column.

Caution:

For machine learning, use minmax_scale or scale after train_test_split to avoid data leakage.

Info

More info on standardization and normalization:

5 Comments

Please include the standardisation as well to make it a comprehensive answer.
@HSRathore, thanks! Updated answer to include standardization.
Note, however, that in the desired output the minimum is not mapped to zero.
Note: using [:] in df[:] = scale(df) keeps the index/column names
Doesn't the test data need to be normalized with the same scaling factors as the training data? How do you apply the normalization to the test data using the scaling values from the training data?
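On the last comment: a common pattern is to fit a scaler object on the training split and reuse it on the test split, rather than calling minmax_scale on each split separately. A minimal sketch, assuming scikit-learn is installed; the sample frame and the split parameters are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"a": range(10), "b": [x * 10.0 for x in range(10)]})
train, test = train_test_split(df, test_size=0.3, random_state=0)

# Learn the per-column min and max from the training rows only.
scaler = MinMaxScaler().fit(train)

# Apply the same training-derived scaling to both splits,
# re-attaching column names and index to the returned arrays.
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(test_scaled)
```

Test values can fall outside [0, 1] when they lie beyond the range seen in the training data.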
15

I think that a better way to do that in pandas is just

df = df/df.max().astype(np.float64) 

Edit If in your data frame negative numbers are present you should use instead

df = df/df.loc[df.abs().idxmax()].astype(np.float64) 

3 Comments

In case all values of a column are zero this won't work
dividing the current value by the max will not give you a correct normalisation unless the min is 0.
I agree, but that is what the OT was asking for (see his example)
14

The solution given by Sandman and Praveen works very well. The only problem is that if you have categorical variables in other columns of your data frame, this method will need some adjustments.

My solution to this type of issue is the following:

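    import pandas as pd
    from sklearn import preprocessing

    # Put the numeric columns side by side (axis=1) rather than stacking them
    x = pd.concat([df.Numerical1, df.Numerical2, df.Numerical3], axis=1)
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    # Keep the original column names and index so the concat below aligns
    x_new = pd.DataFrame(x_scaled, columns=x.columns, index=df.index)
    df = pd.concat([df.Categoricals, x_new], axis=1)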

1 Comment

This answer is useful because most examples on the internet apply one scaler to all the columns, whereas this actually addresses the situation where one scaler, say the MinMaxScaler, should not apply to all columns.
12

You might want to normalize some columns and leave others unchanged, as in regression tasks where the data labels or categorical columns must stay as they are. So I suggest this pythonic way (it's a combination of the @shg and @Cina answers):

    features_to_normalize = ['A', 'B', 'C']  # could be ['A', 'B']
    df[features_to_normalize] = df[features_to_normalize].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

Comments

11

It is only simple mathematics. The answer should be as simple as below.

normed_df = (df - df.min()) / (df.max() - df.min()) 

2 Comments

Note, however, that in the desired output the minimum is not mapped to zero.
it will crash if the generated pandas dataframe has 1 record
11
df_normalized = df / df.max(axis=0) 

Comments

5

You can simply use the pandas.DataFrame.transform function in this way:

df.transform(lambda x: x/x.max()) 

2 Comments

This solution won't work if all values are negative. Consider [-1, -2, -3]. We divide by -1, and now we have [1,2,3].
To properly handle negative numbers: df.transform(lambda x: x / abs(x).max())
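A quick check of the two comments above, using the all-negative example values (the column name x is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [-1, -2, -3]})

# Dividing by max() flips the sign here, because the max is -1.
flipped = df.transform(lambda x: x / x.max())

# Dividing by the largest absolute value keeps the sign and
# bounds the values to [-1, 1].
bounded = df.transform(lambda x: x / x.abs().max())

print(flipped["x"].tolist())  # [1.0, 2.0, 3.0]
print(bounded["x"].tolist())
```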
4

This is how you do it column-wise using list comprehension:

[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns] 

Comments

3
    def normalize(x):
        try:
            x = x / np.linalg.norm(x, ord=1)
            return x
        except:
            raise

    data = pd.DataFrame.apply(data, normalize)

According to the pandas documentation, a DataFrame can apply an operation (function) to itself.

DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds) 

Applies function along input axis of DataFrame. Objects passed to functions are Series objects having index either the DataFrame’s index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.

You can apply a custom function to operate on the DataFrame.

1 Comment

It would be good to explain, why your code solves the OPs problem, so people can adapt the strategy rather than just copy your code. Please read How do I write a good answer?
3

The following function calculates the Z score:

    def standardization(dataset):
        """ Standardization of numeric fields, where all values will have mean of zero
        and standard deviation of one. (z-score)

        Args:
            dataset: A `Pandas.Dataframe`
        """
        dtypes = list(zip(dataset.dtypes.index, map(str, dataset.dtypes)))
        # Normalize numeric columns.
        for column, dtype in dtypes:
            if dtype == 'float32':
                dataset[column] -= dataset[column].mean()
                dataset[column] /= dataset[column].std()
        return dataset

Comments

3

New Scikit-Learn (Version>=1.2): Keeps DataFrame Column Names

In the new version of scikit-learn, it is now possible to keep the pandas column names intact even after the transform. Below is an example:

    >>> import pandas as pd
    >>> from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler
    >>> df = pd.DataFrame({'col1':[1000, 765, 800], 'col2':[10, 5, 7], 'col3':[0.5, 0.35, 0.09]}, )
    >>> df.head(3)
       col1  col2  col3
    0  1000    10  0.50
    1   765     5  0.35
    2   800     7  0.09
    >>> scaler = MaxAbsScaler().set_output(transform="pandas") #change here
    >>> scaler.fit(df)
    >>> df_scaled = scaler.transform(df)
    >>> df_scaled.head(3)
        col1  col2  col3
    0  1.000   1.0  1.00
    1  0.765   0.5  0.70
    2  0.800   0.7  0.18

I wrote a summary of the new updates here and you can also check the scikit-learn release highlights page.

Also, personally have never been a big fan of MaxAbsScaler, but I went with this one to answer op's question.

Hope this helps, cheers!!

Comments

0

You can do this in one line

DF_test = DF_test.sub(DF_test.mean(axis=0), axis=1)/DF_test.mean(axis=0) 

It takes the mean of each column, subtracts that mean from every row of the same column, and then divides by the mean. What we get in the end is the normalized data set.

Comments

0

Pandas does column wise normalization by default. Try the code below.

    X = pd.read_csv('.\\data.csv')
    X = (X - X.min()) / (X.max() - X.min())

The output values will be in range of 0 and 1.

Comments

0

Hey use the apply function with lambda which speeds up the process:

    def normalize(df_col):
        # Condition to exclude 'ID' and 'Class' feature
        if (str(df_col.name) != str('ID') and str(df_col.name) != str('Class')):
            max_value = df_col.max()
            min_value = df_col.min()
            # It avoids NaN and return 0 instead
            if max_value == min_value:
                return 0
            sub_value = max_value - min_value
            return np.divide(np.subtract(df_col, min_value), sub_value)
        else:
            return df_col

    df_normalize = df.apply(lambda x: normalize(x))

Comments

-1

To normalise a DataFrame column, using only native Python.

Different values influence processes, e.g. plot colours.

Between 0 and 1:

    min_val = min(list(df['col']))
    max_val = max(list(df['col']))
    # Divide by the range (max - min) so that the result spans 0 to 1
    df['col'] = [(x - min_val) / (max_val - min_val) for x in df['col']]

Between -1 and 1:

df['col'] = [float(i)/sum(df['col']) for i in df['col']] 

OR

df['col'] = [float(tp) / max(abs(df['col'])) for tp in df['col']] 

Comments

-3
df.normalize() 

This thread is over 9 years old by now.

I am not sure when pandas added this function.

It seems to work like a charm for me when doing quantitative analysis.

1 Comment

I'm finding no record of df.normalize() existing in Pandas. Are you sure you don't have a custom function somewhere in your codebase? pandas.pydata.org/pandas-docs/stable/reference/api/… for reference.
-6

If your data is positively skewed, the best way to normalize is to use the log transformation:

df = np.log10(df) 

Comments
