Normalize columns of a dataframe

Question

I have a dataframe in pandas where each column has a different value range. For example:

df:

       A   B     C
    1000  10  0.50
     765   5  0.35
     800   7  0.09

Any idea how I can normalize the columns of this dataframe where each value is between 0 and 1?

My desired output is:

        A    B     C
    1.000  1.0  1.00
    0.765  0.5  0.70
    0.800  0.7  0.18  (which is 0.09/0.5)

There is an apply function, e.g. frame.apply(f, axis=1), where f is a function that does something with a row...
– tschm, Commented Oct 16, 2014 at 22:30
Normalization might not be the most appropriate wording, since the scikit-learn documentation defines it as "the process of scaling individual samples to have unit norm" (i.e. row by row, if I get it correctly).
– Skippy le Grand Gourou, Commented Mar 5, 2019 at 16:58
I do not get why min-max scaling is considered normalization! "Normal" has got to have a meaning in the sense of a normal distribution with mean zero and variance 1.
– OverFlow Police, Commented Apr 21, 2019 at 2:21
If you are visiting this question in 2020 or later, look at the answer by @Poudel: you get a different normalization result if you use pandas vs sklearn.
– BhishanPoudel, Commented Jan 29, 2020 at 20:10

Cina · Accepted Answer · 2020-02-06 22:40:20Z

834

One easy way using pandas (here I use mean normalization, i.e. subtract the mean and divide by the standard deviation):

    normalized_df = (df - df.mean()) / df.std()

To use min-max normalization:

    normalized_df = (df - df.min()) / (df.max() - df.min())

Edit: To address some concerns: pandas automatically applies these operations column-wise in the code above.
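A minimal sketch of both formulas on the data from the question. Arithmetic between a DataFrame and the Series returned by df.mean(), df.std(), df.min() or df.max() aligns on column labels, which is why each column is scaled by its own statistics. The variable names are illustrative only.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "A": [1000, 765, 800],
    "B": [10, 5, 7],
    "C": [0.5, 0.35, 0.09],
})

# Standardization: each column ends up with mean 0 and sample std 1
standardized_df = (df - df.mean()) / df.std()

# Min-max scaling: each column's minimum maps to 0 and its maximum to 1
min_max_df = (df - df.min()) / (df.max() - df.min())

print(standardized_df)
print(min_max_df)
```

Neither result matches the desired output in the question exactly, because both formulas shift the data before scaling it.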

edited Feb 6, 2020 at 22:40

answered Jan 8, 2017 at 11:25

Cina


12 Comments

krakowi Over a year ago

Can it be somehow done with window function? What I mean by that is calculating max() and min() based on eg latest 10 observation.

Roman Filippov Over a year ago

if you want to save some column - do normalized_df['TARGET'] = df['TARGET']

SajidSalim Over a year ago

Comparing this with MinMaxScaler(), which one would be faster in a case where features will be greater than 1000? And, uses less memory?

Teddy Ward Over a year ago

this is a good solution, but you need a lot of less-beautiful checks to avoid divide by zero errors

Gulzar Over a year ago

is there a built-in standard way of doing this per column without looping over all the columns?
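For the rolling-window variant asked about in the comments, one possible sketch uses DataFrame.rolling to compute a moving minimum and maximum. The data and the window size of 3 are made up for illustration; the comment mentions 10 observations.

```python
import pandas as pd

# Hypothetical single-column data for illustration
df = pd.DataFrame({"A": [1.0, 3.0, 2.0, 5.0, 4.0]})

window = 3  # number of most recent observations to use (illustrative)
rolling_min = df.rolling(window).min()
rolling_max = df.rolling(window).max()

# Each row is scaled by the min and max of the window ending at that row.
# The first window - 1 rows are NaN because the window is incomplete,
# and a window whose values are all equal gives 0/0 = NaN.
rolling_scaled = (df - rolling_min) / (rolling_max - rolling_min)
print(rolling_scaled)
```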


Amir Imani · Accepted Answer · 2019-08-19 18:48:57Z

427

You can use the package sklearn and its associated preprocessing utilities to normalize the data.

    import pandas as pd
    from sklearn import preprocessing

    x = df.values  # returns a numpy array
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    df = pd.DataFrame(x_scaled)

For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.

edited Aug 19, 2019 at 18:48

Amir Imani


answered Oct 16, 2014 at 23:34

Sandman


9 Comments

pietz Over a year ago

i think this will get rid of the column names, which might be one of the reasons op is using dataframes in the first place.

hobs Over a year ago

This will normalize the rows and not the columns, unless you transpose it first. To do what the Q asks for: pd.DataFrame(min_max_scaler.fit_transform(df.T), columns=df.columns, index=df.index)

ijoseph Over a year ago

@pietz to keep column names, see this post. Basically, replace the last line with df = pandas.DataFrame(x_scaled, columns=df.columns)

petezurich Over a year ago

@hobs This is not correct. Sandman's code normalizes column-wise and per-column. You get the wrong result if you transpose.

hobs Over a year ago

@petezurich It looks like Sandman or Praveen corrected their code. Unfortunately, it's not possible to correct comments ;)
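The two points discussed in these comments (keeping the labels, and whether the scaling is per column) can be checked with a short sketch. It uses the question's data with made-up index labels, and compares MinMaxScaler against the plain pandas formula.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame(
    {"A": [1000, 765, 800], "B": [10, 5, 7], "C": [0.5, 0.35, 0.09]},
    index=["r1", "r2", "r3"],  # hypothetical index labels
)

# fit_transform returns a plain NumPy array, so pass the labels back in
scaled = MinMaxScaler().fit_transform(df)
df_scaled = pd.DataFrame(scaled, columns=df.columns, index=df.index)

# The same column-wise scaling written directly in pandas
df_pandas = (df - df.min()) / (df.max() - df.min())

print(df_scaled)
```

The two results agree up to floating-point rounding, so no transpose is needed to scale per column.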


BhishanPoudel · Accepted Answer · 2020-12-22 01:07:49Z

Detailed Example of Normalization Methods

Pandas normalization (unbiased)
Sklearn normalization (biased)
Does biased-vs-unbiased affect Machine Learning?
Min-max scaling

References: Wikipedia: Unbiased Estimation of Standard Deviation

Example Data

    import pandas as pd

    df = pd.DataFrame({
        'A': [1, 2, 3],
        'B': [100, 300, 500],
        'C': list('abc')
    })
    print(df)

       A    B  C
    0  1  100  a
    1  2  300  b
    2  3  500  c

Normalization using pandas (Gives unbiased estimates)

When normalizing we simply subtract the mean and divide by standard deviation.

    df.iloc[:, 0:-1] = df.iloc[:, 0:-1].apply(lambda x: (x - x.mean()) / x.std(), axis=0)
    print(df)

         A    B  C
    0 -1.0 -1.0  a
    1  0.0  0.0  b
    2  1.0  1.0  c

Normalization using sklearn (Gives biased estimates, different from pandas)

If you do the same thing with sklearn you will get DIFFERENT output!

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()

    df = pd.DataFrame({
        'A': [1, 2, 3],
        'B': [100, 300, 500],
        'C': list('abc')
    })
    df.iloc[:, 0:-1] = scaler.fit_transform(df.iloc[:, 0:-1].to_numpy())
    print(df)

              A         B  C
    0 -1.224745 -1.224745  a
    1  0.000000  0.000000  b
    2  1.224745  1.224745  c

Do the biased estimates of sklearn make machine learning less powerful?

NO.

The official documentation of sklearn.preprocessing.scale states that using a biased estimator is UNLIKELY to affect the performance of machine learning algorithms, so we can safely use it.

From official documentation:

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.
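The ddof difference described in the quote can be reproduced directly. This is a small sketch on the example data (numeric columns only): pandas defaults to ddof=1, and passing ddof=0 to df.std() should give the same values as StandardScaler.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"A": [1, 2, 3], "B": [100, 300, 500]})

# pandas default: sample standard deviation (ddof=1)
z_sample = (df - df.mean()) / df.std()

# Population standard deviation (ddof=0), the convention StandardScaler uses
z_population = (df - df.mean()) / df.std(ddof=0)

z_sklearn = pd.DataFrame(
    StandardScaler().fit_transform(df), columns=df.columns, index=df.index
)

print(z_sample)
print(z_population)
```

With only three rows the gap between the two conventions is large (1.0 vs about 1.2247); it shrinks as the number of rows grows.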

What about MinMax Scaling?

There is no standard deviation calculation in min-max scaling, so the result is the same in both pandas and scikit-learn.

    import pandas as pd

    df = pd.DataFrame({
        'A': [1, 2, 3],
        'B': [100, 300, 500],
    })
    (df - df.min()) / (df.max() - df.min())

         A    B
    0  0.0  0.0
    1  0.5  0.5
    2  1.0  1.0

    # Using sklearn
    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    arr_scaled = scaler.fit_transform(df)
    print(arr_scaled)

    [[0.  0. ]
     [0.5 0.5]
     [1.  1. ]]

    df_scaled = pd.DataFrame(arr_scaled, columns=df.columns, index=df.index)
    print(df_scaled)

         A    B
    0  0.0  0.0
    1  0.5  0.5
    2  1.0  1.0

Note, however, that in the desired output the minimum is not mapped to zero.

Community · Accepted Answer · 2017-04-13 12:44:17Z

80

Based on this post: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

You can do the following:

    def normalize(df):
        result = df.copy()
        for feature_name in df.columns:
            max_value = df[feature_name].max()
            min_value = df[feature_name].min()
            result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
        return result

You don't need to worry about whether your values are negative or positive, and the values will be spread between 0 and 1.

edited Apr 13, 2017 at 12:44

CommunityBot


answered Apr 15, 2015 at 13:25

Michael Aquilina


5 Comments

hru_d Over a year ago

Be careful when min and max values are same, your denominator is 0 and you will get a NaN value.

Appaji Chintimi Over a year ago

@HrushikeshDhumal, No need to normalize then, Since all values would be equal.

hru_d Over a year ago

@AppajiChintimi, this solution applies to entire data, if you haven't done sanity check you could run into trouble.

Caridorc Over a year ago

If you have numeric and non-numeric columns mixed, use for feature_name in df.select_dtypes(include=['int', 'float']).columns: to only normalize numeric columns

Robert Pollak Over a year ago

Note, however, that in the desired output the minimum is not mapped to zero.
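The division-by-zero concern raised in the comments above can be handled without a loop. The helper below is a hypothetical sketch (min_max_safe is not a pandas function), and mapping a constant column to 0 is one arbitrary choice; leaving it as NaN or dropping it are equally defensible.

```python
import pandas as pd


def min_max_safe(df: pd.DataFrame) -> pd.DataFrame:
    """Min-max scale each column; constant columns become 0 instead of NaN."""
    col_range = df.max() - df.min()
    # Replace a zero range with 1 so the division is defined; the numerator
    # is already 0 for every row of a constant column.
    col_range = col_range.replace(0, 1)
    return (df - df.min()) / col_range


# Hypothetical data with one constant column
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "const": [7.0, 7.0, 7.0]})
result = min_max_safe(df)
print(result)
```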

tschm · Accepted Answer · 2017-02-22 13:52:02Z

66

Your problem is actually a simple transform acting on the columns:

    def f(s):
        return s / s.max()

    frame.apply(f, axis=0)

Or even more terse:

    frame.apply(lambda x: x / x.max(), axis=0)

edited Feb 22, 2017 at 13:52

answered Oct 17, 2014 at 9:57

tschm


5 Comments

Abu Shoeb Over a year ago

The lambda one is the best :-)

Nguai al Over a year ago

isn't this supposed to be axis=1 since the question is column wise normalization?

gosuto Over a year ago

No, from the docs: axis [...] 0 or 'index': apply function to each column. The default is actually axis=0 so this one-liner can be written even shorter :-) Thanks @tschm.

QFSW Over a year ago

This is only correct if the min is 0, which isn't something that you should really assume

tschm Over a year ago

My example was meant to illustrate how to apply functions on columns of dataframes. Obviously, as always, you need to pay attention to corner cases, e.g. here the max could be zero and result in an issue. Not sure I understand @QFSW.
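As a quick check, dividing each column by its maximum reproduces the desired output from the question (1, 0.765, 0.8 and so on). The sketch below uses the question's data and relies on apply defaulting to axis=0, as noted in the comments.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "A": [1000, 765, 800],
    "B": [10, 5, 7],
    "C": [0.5, 0.35, 0.09],
})

# Divide every column by its own maximum (apply defaults to axis=0)
scaled = df.apply(lambda col: col / col.max())

# Equivalent vectorized form
scaled_vectorized = df / df.max()

print(scaled)
```

This matches the question only because every value is positive; it does not map the column minimum to 0.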

j sad · Accepted Answer · 2017-04-21 15:06:22Z

If you like using the sklearn package, you can keep the column and index names by using pandas loc like so:

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    scaled_values = scaler.fit_transform(df)
    df.loc[:, :] = scaled_values

Note, however, that in the desired output the minimum is not mapped to zero.

Gulzar · Accepted Answer · 2021-05-18 17:20:06Z

38

Take care with this answer, as it ONLY works for data that ranges [0, n]. This does not work for any range of data.

Simple is Beautiful:

    df["A"] = df["A"] / df["A"].max()
    df["B"] = df["B"] / df["B"].max()
    df["C"] = df["C"] / df["C"].max()

edited May 18, 2021 at 17:20

Gulzar


answered Feb 6, 2018 at 20:03

Basil Musa


5 Comments

Alexander Sosnovshchenko Over a year ago

Note, that OP asked for [0..1] range and this solution scales to [-1..1] range. Try this with the array [-10, 10].

Pepe Mandioca Over a year ago

@AlexanderSosnovshchenko not really. Basil Musa is assuming the OP's matrix is always non-negative, that's why he has given this solution. If some column has a negative entry then this code does NOT normalize to the [-1,1] range. Try it with the array [-5, 10]. The correct way to normalize to [0,1] with negative values was given by Cina's answer df["A"] = (df["A"]-df["A"].min()) / (df["A"].max()-df["A"].min())

n1k31t4 Over a year ago

Perhaps even simpler: df /= df.max() - assuming the goal is to normalise each and every column, individually.

Gulzar Over a year ago

This answer is wrong. The non-negative assumption can't be made here, as neither the OP nor future readers stated it. Moreover, even strictly positive data doesn't work here: [1, 10] will be normalized to [0.1, 1] instead of [0, 1].

Basil Musa Over a year ago

Thanks @Gulzar, I'm the author of this answer and TBH I was surprised that it was upvoted 29 times.
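The disagreement in these comments is easier to see side by side. The sketch below uses two made-up columns, one strictly positive and one with a negative value, and compares dividing by the maximum, dividing by the maximum absolute value, and min-max scaling.

```python
import pandas as pd

# Hypothetical columns chosen to show where the methods differ
df = pd.DataFrame({"positive": [1.0, 10.0], "mixed": [-10.0, 5.0]})

by_max = df / df.max()                             # maximum maps to 1
by_abs_max = df / df.abs().max()                   # values land in [-1, 1]
min_max = (df - df.min()) / (df.max() - df.min())  # values land in [0, 1]

print(by_max)
print(by_abs_max)
print(min_max)
```

Here by_max turns the mixed column into [-2, 1], by_abs_max gives [-1, 0.5], and only min_max maps both columns to [0, 1].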

raullalves · Accepted Answer · 2018-09-29 22:10:41Z

You can create a list of columns that you want to normalize

    from sklearn.preprocessing import MinMaxScaler

    min_max_scaler = MinMaxScaler()
    column_names_to_normalize = ['A', 'E', 'G', 'sadasdsd', 'lol']
    x = df[column_names_to_normalize].values
    x_scaled = min_max_scaler.fit_transform(x)
    df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index=df.index)
    df[column_names_to_normalize] = df_temp

Your pandas DataFrame is now normalized only in the columns you chose.

However, if you want the opposite, i.e. to list the columns that you DON'T want to normalize, you can simply build a list of all columns and remove the undesired ones:

    column_names_to_not_normalize = ['B', 'J', 'K']
    column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize]

Benjamin Ziepert · Accepted Answer · 2022-08-22 18:11:54Z

Normalize

You can use minmax_scale to transform each column to a scale from 0-1.

    from sklearn.preprocessing import minmax_scale

    df[:] = minmax_scale(df)

Standardize

You can use scale to center each column to the mean and scale to unit variance.

    from sklearn.preprocessing import scale

    df[:] = scale(df)

Column Subsets

Normalize single column

    from sklearn.preprocessing import minmax_scale

    df['a'] = minmax_scale(df['a'])

Normalize only numerical columns

    import numpy as np
    from sklearn.preprocessing import minmax_scale

    cols = df.select_dtypes(np.number).columns
    df[cols] = minmax_scale(df[cols])

Full Example

    # Prep
    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import minmax_scale

    # Sample data
    df = pd.DataFrame({'a': [0, 1, 2], 'b': [-10, -30, -50], 'c': ['x', 'y', 'z']})

    # MinMax normalize all numeric columns
    cols = df.select_dtypes(np.number).columns
    df[cols] = minmax_scale(df[cols])

    # Result
    print(df)

    #      a    b  c
    # 0  0.0  1.0  x
    # 1  0.5  0.5  y
    # 2  1.0  0.0  z

Notes:

In all examples scale can be used instead of minmax_scale.
Keeps index, column names or non-numerical variables unchanged.
Function is applied for each column.

Caution:

For machine learning, use minmax_scale or scale after train_test_split to avoid data leakage.

Info

More info on standardization and normalization:

Please include the standardisation as well to make it a comprehensive answer.
@HSRathore, thanks! Updated answer to include standardization.
Note, however, that in the desired output the minimum is not mapped to zero.
Note: using [:] in df[:] = scale(df) keeps the index/column names
Doesn't the test data need to be normalized with the same scaling factors as the training data? How do you apply the normalization to the test data using the scaling values from the training data?
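On the last question: one common way to reuse the training scaling on test data is to fit a MinMaxScaler on the training rows and call only transform on the test rows. This is a sketch on made-up data, not part of the original answer, which uses the stateless minmax_scale function.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data for illustration
df = pd.DataFrame({"a": range(10), "b": [x * 10.0 for x in range(10)]})

train, test = train_test_split(df, test_size=0.3, random_state=0)

scaler = MinMaxScaler()
# Learn each column's min and max from the training rows only
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=train.columns, index=train.index
)
# Reuse the training min and max for the test rows (no refitting)
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=test.columns, index=test.index
)

print(test_scaled)
```

Test values can fall slightly outside [0, 1] when the test set contains values beyond the training minimum or maximum.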

Daniele · Accepted Answer · 2014-10-24 17:52:22Z

15

I think that a better way to do that in pandas is just

df = df/df.max().astype(np.float64)

Edit If in your data frame negative numbers are present you should use instead

df = df/df.loc[df.abs().idxmax()].astype(np.float64)

edited Oct 24, 2014 at 17:52

answered Oct 17, 2014 at 13:58

Daniele


3 Comments

ahajib Over a year ago

In case all values of a column are zero this won't work

pietz Over a year ago

dividing the current value by the max will not give you a correct normalisation unless the min is 0.

Daniele Over a year ago

I agree, but that is what the OT was asking for (see his example)

Ozkan Serttas · Accepted Answer · 2017-11-26 21:33:42Z

The solution given by Sandman and Praveen works very well. The only problem is that if you have categorical variables in other columns of your data frame, this method will need some adjustments.

My solution to this type of issue is the following:

    import pandas as pd
    from sklearn import preprocessing

    # Put the numerical columns side by side (axis=1) rather than stacking them
    x = pd.concat([df.Numerical1, df.Numerical2, df.Numerical3], axis=1)
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    x_new = pd.DataFrame(x_scaled, columns=x.columns, index=df.index)
    df = pd.concat([df.Categoricals, x_new], axis=1)

This answer is useful because most examples on the internet apply one scaler to all the columns, whereas this actually addresses the situation where one scaler, say the MinMaxScaler, should not apply to all columns.

masouduut94 · Accepted Answer · 2019-04-25 20:33:02Z

You might want to normalize some columns and leave the others unchanged, as in regression tasks where the label or categorical columns should stay as they are. So I suggest this pythonic way (it's a combination of @shg's and @Cina's answers):

    features_to_normalize = ['A', 'B', 'C']  # could be ['A', 'B']
    df[features_to_normalize] = df[features_to_normalize].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

Yuan · Accepted Answer · 2019-08-08 10:36:12Z

11

It is only simple mathematics. The answer should be as simple as below.

normed_df = (df - df.min()) / (df.max() - df.min())

answered Aug 8, 2019 at 10:36

Yuan


2 Comments

Robert Pollak Over a year ago

Note, however, that in the desired output the minimum is not mapped to zero.

Pierrick Rambaud Apr 15 at 8:55

it will crash if the generated pandas dataframe has 1 record

Davoud Taghawi-Nejad · Accepted Answer · 2020-05-31 11:35:48Z

11

df_normalized = df / df.max(axis=0)

answered May 31, 2020 at 11:35

Davoud Taghawi-Nejad


antonjs · Accepted Answer · 2019-09-26 09:03:12Z

5

You can simply use the pandas.DataFrame.transform function in this way:

df.transform(lambda x: x/x.max())

answered Sep 26, 2019 at 9:03

antonjs


2 Comments

Dave Liu Over a year ago

This solution won't work if all values are negative. Consider [-1, -2, -3]. We divide by -1, and now we have [1,2,3].

nvd Over a year ago

To properly handle negative numbers: df.transform(lambda x: x / abs(x).max())

Chad · Accepted Answer · 2019-08-01 22:01:02Z

This is how you do it column-wise using list comprehension:

[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]

shg · Accepted Answer · 2018-04-14 02:49:47Z

    def normalize(x):
        try:
            x = x / np.linalg.norm(x, ord=1)
            return x
        except:
            raise

    data = pd.DataFrame.apply(data, normalize)

From the pandas documentation: a DataFrame can apply an operation (function) to itself.

DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

Applies function along input axis of DataFrame. Objects passed to functions are Series objects having index either the DataFrame’s index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.

You can apply a custom function to operate on the DataFrame.

It would be good to explain, why your code solves the OPs problem, so people can adapt the strategy rather than just copy your code. Please read How do I write a good answer?
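To illustrate what this answer's approach does: dividing a column by its L1 norm makes the absolute values of that column sum to 1. The sketch below uses made-up data and drops the try/except, which only re-raises.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
data = pd.DataFrame({"A": [1.0, 2.0, 1.0], "B": [-2.0, 2.0, 4.0]})


def normalize(x):
    # L1 norm of a column = sum of the absolute values in that column
    return x / np.linalg.norm(x, ord=1)


# apply defaults to axis=0, so the function receives one column at a time
normalized = data.apply(normalize)
print(normalized)
```

This is a different operation from min-max scaling: the largest value is not mapped to 1, and negative inputs stay negative.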

gogasca · Accepted Answer · 2019-01-19 01:11:12Z

The following function calculates the Z score:

    def standardization(dataset):
        """Standardization of numeric fields, where all values will have mean
        of zero and standard deviation of one (z-score).

        Args:
            dataset: A `Pandas.Dataframe`
        """
        dtypes = list(zip(dataset.dtypes.index, map(str, dataset.dtypes)))
        # Normalize numeric columns.
        for column, dtype in dtypes:
            if dtype == 'float32':
                dataset[column] -= dataset[column].mean()
                dataset[column] /= dataset[column].std()
        return dataset
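Note that the function above only touches columns whose dtype is exactly float32 and modifies the DataFrame in place. A possible variant, sketched here under the assumption that every numeric column should be standardized, selects columns with select_dtypes and returns a copy. The name standardize_numeric is hypothetical.

```python
import numpy as np
import pandas as pd


def standardize_numeric(dataset: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every numeric column converted to z-scores."""
    result = dataset.copy()
    numeric_cols = result.select_dtypes(include=np.number).columns
    # Subtract each column's mean and divide by its sample std (ddof=1)
    result[numeric_cols] = (
        result[numeric_cols] - result[numeric_cols].mean()
    ) / result[numeric_cols].std()
    return result


# Hypothetical data for illustration
df = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0], "label": list("abc")})
out = standardize_numeric(df)
print(out)
```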

Suvo · Accepted Answer · 2023-06-06 07:40:31Z

New Scikit-Learn (Version>=1.2): Keeps DataFrame Column Names

In the new version of scikit-learn, it is now actually possible to keep the pandas column names intact even after the transform, below is an example:

    >>> import pandas as pd
    >>> from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler
    >>> df = pd.DataFrame({'col1': [1000, 765, 800], 'col2': [10, 5, 7], 'col3': [0.5, 0.35, 0.09]})
    >>> df.head(3)
       col1  col2  col3
    0  1000    10  0.50
    1   765     5  0.35
    2   800     7  0.09
    >>> scaler = MaxAbsScaler().set_output(transform="pandas")  # change here
    >>> scaler.fit(df)
    >>> df_scaled = scaler.transform(df)
    >>> df_scaled.head(3)
        col1  col2  col3
    0  1.000   1.0  1.00
    1  0.765   0.5  0.70
    2  0.800   0.7  0.18

I wrote a summary of the new updates here and you can also check the scikit-learn release highlights page.

Also, personally have never been a big fan of MaxAbsScaler, but I went with this one to answer op's question.

Hope this helps, cheers!!

LOrD_ARaGOrN · Accepted Answer · 2019-04-12 06:39:33Z

You can do this in one line

DF_test = DF_test.sub(DF_test.mean(axis=0), axis=1)/DF_test.mean(axis=0)

It takes the mean of each column, subtracts that mean from every row of the same column, and then divides by the mean. What we get is the normalized data set.

ahajib · Accepted Answer · 2020-01-06 14:05:04Z

Pandas does column wise normalization by default. Try the code below.

    X = pd.read_csv('.\\data.csv')
    X = (X - X.min()) / (X.max() - X.min())

The output values will be in range of 0 and 1.

Rajdeep Borgohain · Accepted Answer · 2021-08-11 09:33:53Z

Hey use the apply function with lambda which speeds up the process:

    import numpy as np

    def normalize(df_col):
        # Condition to exclude 'ID' and 'Class' feature
        if str(df_col.name) != 'ID' and str(df_col.name) != 'Class':
            max_value = df_col.max()
            min_value = df_col.min()
            # It avoids NaN and returns 0 instead
            if max_value == min_value:
                return 0
            sub_value = max_value - min_value
            return np.divide(np.subtract(df_col, min_value), sub_value)
        else:
            return df_col

    df_normalize = df.apply(lambda x: normalize(x))

DanielBell99 · Accepted Answer · 2022-12-13 14:43:34Z

To normalise a DataFrame column, using only native Python.

Different values influence processes, e.g. plot colours.

Between 0 and 1:

    min_val = min(list(df['col']))
    max_val = max(list(df['col']))
    # Divide by the range (max - min) so that the maximum maps to 1
    df['col'] = [(x - min_val) / (max_val - min_val) for x in df['col']]

Between -1 to 1:

df['col'] = [float(i)/sum(df['col']) for i in df['col']]

OR

df['col'] = [float(tp) / max(abs(df['col'])) for tp in df['col']]

pepCoder · Accepted Answer · 2024-05-14 09:34:40Z

-3

df.normalize()

This thread is over 9 years old by now.

I am not sure when pandas added this function.

It seems to work like a charm for me for quantitative analysis.

answered May 14, 2024 at 9:34

pepCoder


1 Comment

Mark Puchala II Over a year ago

I'm finding no record of df.normalize() existing in Pandas. Are you sure you don't have a custom function somewhere in your codebase? pandas.pydata.org/pandas-docs/stable/reference/api/… for reference.

Adrian Mole · Accepted Answer · 2020-10-12 11:40:38Z

-6

If your data is positively skewed, the best way to normalize is to use the log transformation:

df = np.log10(df)

edited Oct 12, 2020 at 11:40

Adrian Mole


answered Oct 12, 2020 at 10:43

amit haldar

