
I have a pandas dataframe which has the term frequencies for a corpus, with the terms as rows and the years as columns, like so:

|       | term    |   2002 |   2003 |   2004 |   2005 |
|------:|:--------|-------:|-------:|-------:|-------:|
|  3708 | climate |      1 |     10 |      1 |     14 |
|  8518 | global  |     12 |     11 |      2 |     12 |
| 13276 | nuclear |     10 |      1 |      0 |      4 |

I would like to normalize the values for each word by dividing them by the total number of words for a given year -- some years contain twice as many texts, so I am trying to scale by year (like Google Books). I have looked at examples of how to scale a single column, a la Chris Albon, and I have seen examples here on SO for scaling all the columns, but every time I try to convert this dataframe to an array to scale it, things choke on the fact that the term column isn't numbers. (I tried setting the terms column as the index, but that didn't go well.) I can imagine a way to do this with a for loop, but almost every example of clean pandas code I read says not to use for loops because there's a pandas way of doing, well, everything.

What I would like is some way of saying:

for these columns [the years]: divide each row by the sum of all rows 

That's it.

  • Do you want this for a particular column or for multiple columns? Commented Jun 30, 2020 at 18:39
  • Does this answer your question? Normalize columns of pandas data frame Commented Jun 30, 2020 at 18:41
  • Thanks, @RajuBhaya. In fact, I looked at that answer, but it doesn't show a way to exclude the non-number column from the pre-processing, and, as you probably know, numpy arrays don't like text! (On my way to asking the question above I even tried that particular code example!) Commented Jun 30, 2020 at 20:01

2 Answers


Try:

```python
cols = ['2002', '2003', '2004', '2005']
df[cols] = df[cols] / df[cols].sum()
```

```
      term      2002      2003      2004      2005
0  climate  0.043478  0.454545  0.333333  0.466667
1   global  0.521739  0.500000  0.666667  0.400000
2  nuclear  0.434783  0.045455  0.000000  0.133333
```
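If you'd rather not hard-code the year columns, a small variant (a sketch, assuming the only non-numeric column is `term`) picks them up with `select_dtypes`:

```python
import pandas as pd

df = pd.DataFrame({
    'term': ['climate', 'global', 'nuclear'],
    '2002': [1, 12, 10],
    '2003': [10, 11, 1],
    '2004': [1, 2, 0],
    '2005': [14, 12, 4],
})

# Select every numeric column, leaving the 'term' column untouched
cols = df.select_dtypes('number').columns

# Divide each column by its yearly total (column-wise broadcasting)
df[cols] = df[cols] / df[cols].sum()
```

After this, each year column sums to 1, so values are directly comparable across years.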

1 Comment

Thank you both for great answers. This one got the check for being the simplest and the most pandas-like. (Or whatever the pandas equivalent of "pythonic" is.)

Try this:

```python
import pandas as pd

df = pd.DataFrame(
    columns=['term', '2002', '2003', '2004', '2005'],
    data=[['climate', 1, 10, 1, 14],
          ['global', 12, 11, 2, 12],
          ['nuclear', 10, 1, 0, 4],
          ])

# Normalize only the integer columns, then join the results
# back onto the original frame with a '_norm' suffix
normalized = df.select_dtypes('int').apply(lambda x: x / sum(x))
df = df.merge(
    right=normalized,
    left_index=True,
    right_index=True,
    suffixes=['', '_norm']
)
```

Returns

```
      term  2002  2003  2004  2005  2002_norm  2003_norm  2004_norm  2005_norm
0  climate     1    10     1    14   0.043478   0.454545   0.333333   0.466667
1   global    12    11     2    12   0.521739   0.500000   0.666667   0.400000
2  nuclear    10     1     0     4   0.434783   0.045455   0.000000   0.133333
```

1 Comment

Again, my thanks for this answer. While I gave the other answer the check for being the one I decided to use, this one's use of the lambda function is something I am going to remember.
