PySpark: How to parse a string formatted as a dict and append some keys as new columns

Question

I am having trouble reading a CSV file that contains an unusual column.

Schema

    root
     |-- Id: integer (nullable = true)
     |-- Lon_tower: double (nullable = true)
     |-- Lat_tower: double (nullable = true)
     |-- Compagny: string (nullable = true)
     |-- Address_tower: string (nullable = true)
     |-- Assigned_band_1: string (nullable = true)
     |-- Assigned_band_2: string (nullable = true)
     |-- Assigned_band_3: string (nullable = true)
     |-- Assigned_band_4: string (nullable = true)
     |-- Assigned_band_5: string (nullable = true)
     |-- raw_geocode: string (nullable = true)

raw_geocode sample

[{'road': 'Calle el Topo', 'residential': 'Los Sauces', 'hamlet': 'El Cardal', 'village': 'Los Sauces', 'city': 'San Andrés y Sauces', 'county': 'Santa Cruz de Tenerife', 'archipelago': 'Canarias', 'postcode': '38720', 'country': 'España', 'country_code': 'es'}]

I would like to use the keys as column headers and fill the Spark DataFrame with the corresponding value, or NULL if the key doesn't exist for that row. I don't want all the keys, only those in a given list. I have already removed the [ and ' characters from the string.

An example to better understand:

    myList = ['road', 'tourism', 'country_code']

    |Id |...|raw_geocode                                                                                                 |
    |1  |...|{road: Calle el Topo, archipelago: Canarias, postcode: 38720, country_code: es}                             |
    |2  |...|{tourism: Mirador Montaña El Molino, road: Mirador Montaña El Molino, village: Barlovento, country_code: es}|

Desired result

    |Id |...|road          |tourism                  |country_code|
    |1  |...|Calle el Topo |NULL                     |es          |
    |2  |...|NULL          |Mirador Montaña El Molino|es          |

mck · Accepted Answer · 2021-04-07 13:47:28Z

You can use regexp_extract to extract the values desired:

    from pyspark.sql import functions as F

    myList = ['road', 'tourism', 'country_code']

    for i in myList:
        # Build the pattern for this key, e.g. 'road: ([^,}]+)'
        pattern = i + ': ([^,}]+)'
        df = df.withColumn(
            i,
            # regexp_extract returns '' when nothing matches;
            # when() without otherwise() turns that into null
            F.when(
                F.regexp_extract('raw_geocode', pattern, 1) != "",
                F.regexp_extract('raw_geocode', pattern, 1)
            )
        )

    df.show(truncate=False)

    +---+------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+------------+
    |Id |raw_geocode                                                                                                 |road                     |tourism                  |country_code|
    +---+------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+------------+
    |1  |{road: Calle el Topo, archipelago: Canarias, postcode: 38720, country_code: es}                             |Calle el Topo            |null                     |es          |
    |2  |{tourism: Mirador Montaña El Molino, road: Mirador Montaña El Molino, village: Barlovento, country_code: es}|Mirador Montaña El Molino|Mirador Montaña El Molino|es          |
    +---+------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+------------+
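To see what the pattern captures without a Spark session, here is a minimal sketch that applies it with Python's standard re module to the two sample strings from the question. The helper name extract_keys is hypothetical, and Spark evaluates Java regular expressions rather than Python ones, so treat this only as an illustration of the pattern itself.

```python
import re

def extract_keys(raw, keys):
    """Return {key: value or None} for each requested key (hypothetical helper)."""
    result = {}
    for key in keys:
        # Same pattern as in the answer: the key, ': ', then everything
        # up to the next ',' or '}'.
        match = re.search(key + r': ([^,}]+)', raw)
        # None stands in for the null that F.when(...) yields in Spark
        # when regexp_extract finds nothing and returns ''.
        result[key] = match.group(1) if match else None
    return result

row1 = "{road: Calle el Topo, archipelago: Canarias, postcode: 38720, country_code: es}"
row2 = ("{tourism: Mirador Montaña El Molino, road: Mirador Montaña El Molino, "
        "village: Barlovento, country_code: es}")

my_list = ['road', 'tourism', 'country_code']
parsed1 = extract_keys(row1, my_list)
parsed2 = extract_keys(row2, my_list)
print(parsed1)
print(parsed2)
```

Because the character class [^,}] stops at the first comma or closing brace, a value that itself contains a comma would be cut short; the sample data does not contain such values.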

Hi, thank you for your quick response, you made my day! Could you explain the `i+`?
It adds each column name from the list to the regex pattern dynamically, so for 'road' the pattern becomes 'road: ([^,}]+)'.
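The concatenation can be checked in isolation. The sketch below only builds the pattern strings in plain Python. The word-boundary variant at the end is an assumption of this example, not part of the answer: it shows one possible way to stop a key such as road from matching at the end of a longer key (the key railroad is made up for the demonstration).

```python
import re

my_list = ['road', 'tourism', 'country_code']

# i + ': ([^,}]+)' glues each key onto the fixed part of the pattern.
patterns = [i + ': ([^,}]+)' for i in my_list]
print(patterns)

# Caveat: a bare key also matches at the end of a longer key.
sample = "{railroad: Line 7, road: Calle el Topo}"  # 'railroad' is a made-up key
plain = re.search('road' + ': ([^,}]+)', sample).group(1)

# Assumed variant: \b requires a word boundary before the key, and
# re.escape guards against keys that contain regex metacharacters.
strict = re.search(r'\b' + re.escape('road') + r': ([^,}]+)', sample).group(1)
print(plain, '|', strict)
```

Whether the extra guard is worth it depends on the keys actually present in raw_geocode; for the three keys in the question the simple concatenation behaves the same on the sample rows.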
