I'm quite new to PySpark and I'm dealing with a complex DataFrame. I'm stuck trying to aggregate the N rows of an array column into a new column after some filtering.
My DataFrame has the following schema:
root
 |-- struct1: struct (nullable = true)
 |    |-- array1: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- struct2: struct (nullable = true)
 |    |    |    |    |-- date: string (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- struct3: struct (nullable = true)
 |    |    |    |    |-- date: string (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- property: string (nullable = true)

What I want to achieve is the sum of all struct2.value entries whose property is Good. Note that array1 can contain multiple (N) elements.
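To make this reproducible, this is roughly how a sample DataFrame with that schema can be built (the data and session setup here are just illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two array1 elements, both with property "Good"; the values are
# strings, matching the schema above.
data = [
    ((
        [
            (("2020-01-01", "10"), ("2020-02-02", "15"), "Good"),
            (("2020-01-01", "20"), ("2020-02-02", "25"), "Good"),
        ],
    ),)
]
ddl = ("struct1 struct<array1: array<struct<"
       "struct2: struct<date: string, value: string>, "
       "struct3: struct<date: string, value: string>, "
       "property: string>>>")
df = spark.createDataFrame(data, ddl)
df.printSchema()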
Right now I have a small expression that gets the property of the first element, but I can't manage to pass it to a UDF in a way that iterates over all the elements:

from pyspark.sql.functions import col

df.withColumn("Sum", col('struct1.array1')[0]['property'])
Some steps that I have in mind (a rough sketch follows this list):
Filter the elements of the list, keeping those where property = Good
Return, from a UDF, the sum of the struct3.value fields (e.g. via a lambda)
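A minimal sketch of those two steps as a UDF, assuming (as in the goal stated above) that struct2.value is the field to sum and that the values can be cast to integers; the function name is just illustrative:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Illustrative UDF: receives the whole array1 and returns the sum of
# struct2.value over the elements whose property is "Good".
@F.udf(returnType=IntegerType())
def sum_good(array1):
    if array1 is None:
        return None
    return sum(int(e["struct2"]["value"])
               for e in array1 if e["property"] == "Good")

df = df.withColumn("Sum", sum_good(F.col("struct1.array1")))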
Desired output should be something like:
+-------------------------------------------------------------------------------------------+---+
|Struct1                                                                                    |Sum|
+-------------------------------------------------------------------------------------------+---+
|[[[[2020-01-01, 10], [2020-02-02, 15], Good], [[2020-01-01, 20], [2020-02-02, 25], Good]]] |20 |
+-------------------------------------------------------------------------------------------+---+

Any help will be appreciated.
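I also wonder whether this could be done without a UDF at all. If I understand Spark's built-in higher-order functions correctly (available since 2.4), a sketch would be:

from pyspark.sql import functions as F

# Sketch with built-in higher-order functions: keep only the "Good"
# elements, then fold their struct2.value fields into a sum.
df = df.withColumn(
    "Sum",
    F.expr(
        "aggregate("
        "  filter(struct1.array1, x -> x.property = 'Good'),"
        "  0,"
        "  (acc, x) -> acc + cast(x.struct2.value as int))"
    ),
)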