
I'm learning PySpark. I used it to read a CSV file into a DataFrame (with column names):

>>> example_df.show(n=5)
+-----+------+-------------+---------+---------+-------------+------------+
|   id| price|       street| locality|town_city|     district|     country|
+-----+------+-------------+---------+---------+-------------+------------+
|bbbb1|295000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
|aaaa2|450000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
|bbbb2|280000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
|aaaa3|425000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
|cccc1| 96000|CHANCERY LANE| HOLBEACH| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
+-----+------+-------------+---------+---------+-------------+------------+
only showing top 5 rows
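For reference, a minimal sketch of how a CSV like this can be loaded with column names (assuming a hypothetical file path prices.csv):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# header=True takes the column names from the first row of the CSV;
# inferSchema=True makes Spark guess the column types (e.g. price as an integer)
example_df = spark.read.csv('prices.csv', header=True, inferSchema=True)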

For rows with the same location (i.e. the same values for street, locality, town_city, district and country), I want to create a new location_id and group the corresponding rows together, something like this:

+-----------+-----+------+-------------+---------+---------+-------------+------------+
|location_id|   id| price|       street| locality|town_city|     district|     country|
+-----------+-----+------+-------------+---------+---------+-------------+------------+
|          0|aaaa2|450000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
|          0|aaaa3|425000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
|          1|bbbb1|295000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
|          1|bbbb2|280000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|
|          2|cccc1| 96000|CHANCERY LANE| HOLBEACH| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|
+-----------+-----+------+-------------+---------+---------+-------------+------------+
only showing top 5 rows
...

I know how to create the new column:

from pyspark.sql.functions import monotonically_increasing_id

df = df.withColumn("location_id", monotonically_increasing_id())

This just creates a new column with 0, 1, 2, etc., but how can I group the data for the same location and also give each location a unique location_id? Thanks.

1 Answer


If you just need an id for each location, you can use hash:

from pyspark.sql import functions as F

example_df.withColumn('location_id', F.hash('street', 'locality', 'town_city', 'district', 'country')) \
    .show()

Output:

+-----+------+-------------+---------+---------+-------------+------------+-----------+
|   id| price|       street| locality|town_city|     district|     country|location_id|
+-----+------+-------------+---------+---------+-------------+------------+-----------+
|bbbb1|295000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|  406601501|
|aaaa2|450000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|  -97363104|
|bbbb2|280000|   HAYES YARD|   INGHAM|  LINCOLN| WEST LINDSEY|LINCOLNSHIRE|  406601501|
|aaaa3|425000|   MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE|  -97363104|
|cccc1| 96000|CHANCERY LANE| HOLBEACH| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE| 1396052469|
+-----+------+-------------+---------+---------+-------------+------------+-----------+

Edit: please remember that the hash function might cause hash collisions.
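If you want small sequential ids (0, 1, 2, ...) like in the expected output and want to rule out collisions entirely, a collision-free alternative is to rank the distinct locations with dense_rank over a window. A minimal sketch, assuming the same example_df and column names as above:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

location_cols = ['street', 'locality', 'town_city', 'district', 'country']

# Identical location columns get identical ranks, so the rank can serve as a location_id.
# Caveat: a window with no partitionBy pulls all rows into a single partition,
# which is fine for small data but can be slow on very large datasets.
w = Window.orderBy(*location_cols)

result = (example_df
          .withColumn('location_id', F.dense_rank().over(w) - 1)  # start the ids at 0
          .select('location_id', 'id', 'price', *location_cols)   # location_id first
          .orderBy('location_id'))

result.show(n=5)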


4 Comments

Thanks, can I ask why some of the hash values are negative? And is there a way to move location_id to be the first column?
@Cecilia the hash function can return both negative and positive values. It is not an id generator (and might cause hash collisions, although that is not very probable if the dataset is not too large). To generate only positive values, please check out this answer. To reorder the columns, you can use .select('location_id', 'id', 'price', ...)
Thanks, the answer you mentioned for converting the hashes to positive values is in Scala; just wondering how I can achieve it in Python?
It seems like we can also use F.sha2: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.sha2.html but it doesn't seem to take multiple columns. I tried example_df.withColumn('location_id', F.sha2('street','locality','town_city','district','country')) but it gave me the error TypeError: sha2() takes 2 positional arguments but 5 were given (a Python sketch covering both points follows these comments).
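One possible way to address both of the last two comments in Python (a minimal sketch, assuming the same example_df and location columns as in the question): the hash value can be made non-negative by taking its absolute value, and since sha2 hashes a single column, the location columns can be concatenated with concat_ws first:

from pyspark.sql import functions as F

location_cols = ['street', 'locality', 'town_city', 'district', 'country']

result = (example_df
          # Non-negative integer id derived from hash (collisions are still possible);
          # casting to bigint first avoids overflow when taking abs of Integer.MIN_VALUE.
          .withColumn('location_id', F.abs(F.hash(*location_cols).cast('bigint')))
          # sha2 takes one column plus the bit length, so concatenate the columns into one string.
          .withColumn('location_sha', F.sha2(F.concat_ws('||', *location_cols), 256))
          # Put location_id first, as asked in the comments.
          .select('location_id', 'id', 'price', *location_cols, 'location_sha'))

result.show(n=5, truncate=False)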
