I have a PySpark DataFrame with columns 'Country' and 'Continent'. In several records, value of 'Continent' is missing. However, there are records containing 'Continent' for the same country (Examples of "Italy" and "China" in the sample dataset below).
# Prepare Data data = [ (1, "Italy", "Europe"), (2, "Italy", "Europe"), (3, "Italy", None), (4, "Italy", None), (5, "China", "Asia"), (6, "China", None), (7, "Japan", "Asia"), (8, "France", None), ] # Create DataFrame columns = ["Code", "Country", "Continent"] df = spark.createDataFrame(data=data, schema=columns) df.show(truncate=False) For records whose 'Continent' is Null, I want to:
Look for any record of the same country whose 'Continent' is not Null
Get the value of 'Continent' and fill in the Null value
The expected output will be like:
| Code | Country | Continent |
|---|---|---|
| 1 | 'Italy' | 'Europe' |
| 2 | 'Italy' | 'Europe' |
| 3 | 'Italy' | 'Europe' |
| 4 | 'Italy' | 'Europe' |
| 5 | 'China' | 'Asia' |
| 6 | 'China' | 'Asia' |
| 7 | 'Japan' | 'Asia' |
| 8 | 'France' | Null |
Any idea how to do that?
