I'm learning Pyspark, I used it to read a csv file into a dataframe (with column names):
>>> example_df.show(n=5) +---------------------+------+-------------+---------+---------+-------------+------------+ | id| price| street| locality|town_city| district| country| +---------------------+------+-------------+---------+---------+-------------+------------+ | bbbb1|295000| HAYES YARD| INGHAM| LINCOLN| WEST LINDSEY|LINCOLNSHIRE| | aaaa2|450000| MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE| | bbbb2|280000| HAYES YARD| INGHAM| LINCOLN| WEST LINDSEY|LINCOLNSHIRE| | aaaa3|425000| MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE| | cccc1| 96000|CHANCERY LANE| HOLBEACH| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE| +---------------------+------+-------------+---------+---------+-------------+------------+ only showing top 5 rows For the same location (= same value for street|locality|town_city|district|country), I want to create a new location_id, and group the corresponding fields, something like this:
+-----------+---------------------+------+-------------+---------+---------+-------------+------------+ |location_id| id| price| street| locality|town_city| district| country| +-----------+---------------------+------+-------------+---------+---------+-------------+------------+ | 0| aaaa2|450000| MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE| | 0| aaaa3|425000| MARKET WAY|PINCHBECK| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE| | 1| bbbb1|295000| HAYES YARD| INGHAM| LINCOLN| WEST LINDSEY|LINCOLNSHIRE| | 1| bbbb2|280000| HAYES YARD| INGHAM| LINCOLN| WEST LINDSEY|LINCOLNSHIRE| | 2| cccc1| 96000|CHANCERY LANE| HOLBEACH| SPALDING|SOUTH HOLLAND|LINCOLNSHIRE| +-----------+---------------------+------+-------------+---------+---------+-------------+------------+ only showing top 5 rows ... I know how to create the new column:
from pyspark.sql.functions import monotonically_increasing_id df = df.WithColumn("location_id", monotonically_increasing_id()) This just create a new column with 0,1,2 etc, but how can I group the data for the same location and also give them unique location_id? Thanks.