I am having trouble reading a CSV file that contains an unusual column.

Schema

root
 |-- Id: integer (nullable = true)
 |-- Lon_tower: double (nullable = true)
 |-- Lat_tower: double (nullable = true)
 |-- Compagny: string (nullable = true)
 |-- Address_tower: string (nullable = true)
 |-- Assigned_band_1: string (nullable = true)
 |-- Assigned_band_2: string (nullable = true)
 |-- Assigned_band_3: string (nullable = true)
 |-- Assigned_band_4: string (nullable = true)
 |-- Assigned_band_5: string (nullable = true)
 |-- raw_geocode: string (nullable = true)

raw_geocode sample

[{'road': 'Calle el Topo', 'residential': 'Los Sauces', 'hamlet': 'El Cardal', 'village': 'Los Sauces', 'city': 'San Andrés y Sauces', 'county': 'Santa Cruz de Tenerife', 'archipelago': 'Canarias', 'postcode': '38720', 'country': 'España', 'country_code': 'es'}] 

I would like to use the keys as column headers and fill the Spark DataFrame with the corresponding value, or null if the key does not exist for that row. I do not want all the keys, only the ones in a given list. I have already removed the `[`, `'` and `]` characters from the column.

An example to better understand:

myList = ['road', 'tourism', 'country_code']

|Id |...|raw_geocode
|1  |...|{road: Calle el Topo, archipelago: Canarias, postcode: 38720, country_code: es}
|2  |...|{tourism: Mirador Montaña El Molino, road: Mirador Montaña El Molino, village: Barlovento, country_code: es}

Desired result

|Id |...|road                     |tourism                  |country_code|
|1  |...|Calle el Topo            |NULL                     |es          |
|2  |...|Mirador Montaña El Molino|Mirador Montaña El Molino|es          |

1 Answer

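You can use `regexp_extract` to pull out the desired values. It returns an empty string when the pattern does not match, so wrapping it in `F.when` without an `otherwise` turns missing keys into null: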

from pyspark.sql import functions as F

myList = ['road', 'tourism', 'country_code']

for i in myList:
    # Pattern: the key, then ": ", then everything up to the next "," or "}"
    pattern = i + ': ([^,}]+)'
    df = df.withColumn(
        i,
        # regexp_extract yields "" on no match; when() without otherwise() yields null
        F.when(
            F.regexp_extract('raw_geocode', pattern, 1) != "",
            F.regexp_extract('raw_geocode', pattern, 1)
        )
    )

df.show(truncate=False)

+---+------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+------------+
|Id |raw_geocode                                                                                                 |road                     |tourism                  |country_code|
+---+------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+------------+
|1  |{road: Calle el Topo, archipelago: Canarias, postcode: 38720, country_code: es}                             |Calle el Topo            |null                     |es          |
|2  |{tourism: Mirador Montaña El Molino, road: Mirador Montaña El Molino, village: Barlovento, country_code: es}|Mirador Montaña El Molino|Mirador Montaña El Molino|es          |
+---+------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+------------+
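
If you want to check what the pattern captures without starting Spark, the same logic can be reproduced with Python's `re` module. This is only a sketch under assumptions: `extract_key` is a hypothetical helper, not part of PySpark, and the strings are the two cleaned sample rows from the question. Spark evaluates patterns with Java's regex engine, but a pattern this simple should behave the same way in both.

```python
import re


def extract_key(raw, key):
    """Return the value stored under `key` in a cleaned raw_geocode string, or None.

    Hypothetical helper for illustration only; it mirrors the Spark pattern
    key + ': ([^,}]+)' and maps "no match" to None instead of null.
    """
    match = re.search(key + r': ([^,}]+)', raw)
    return match.group(1) if match else None


# The two sample rows from the question, keyed by Id.
rows = {
    1: "{road: Calle el Topo, archipelago: Canarias, postcode: 38720, country_code: es}",
    2: "{tourism: Mirador Montaña El Molino, road: Mirador Montaña El Molino, "
       "village: Barlovento, country_code: es}",
}
my_list = ['road', 'tourism', 'country_code']

# One dict per row, with a None entry for every key that is absent.
result = {
    row_id: {key: extract_key(raw, key) for key in my_list}
    for row_id, raw in rows.items()
}

for row_id, values in result.items():
    print(row_id, values)
```

Note that the capture group stops at the first `,` or `}`, so a value that itself contains a comma would be cut short by this pattern.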

2 Comments

Hi, thank you for your quick response, you made my day! Could you explain the `i+` part?
That prepends each column name from the list to the regex pattern, so the pattern is built dynamically on every loop iteration.
