I'm doing some calculations in PySpark and need to match data from multiple dataframes on specific conditions. I'm new to PySpark and decided to ask for help.
My first dataframe contains general information about loans:
ID   ContractDate  MaturityDate  Bank
ID1  2024-06-01    2024-06-18    A
ID2  2024-06-05    2024-06-18    B
ID3  2024-06-10    2024-06-17    C
ID4  2024-06-15                  D
ID5  2024-08-01    2024-08-22    A
ID6  2024-08-08    2024-08-23    B
ID7  2024-08-20                  D

My second dataframe contains information on how payments were made.
For each loan I have one or more payments:
ID_loan  PaymentDate  PaymentSum
ID1      2024-06-02   10
ID1      2024-06-08   40
ID1      2024-06-10   50
ID2      2024-06-06   30
ID2      2024-06-07   90
ID2      2024-06-08   20
ID3      2024-06-11   20
ID3      2024-06-12   30
ID3      2024-06-13   50
ID5      2024-08-10   15
ID5      2024-08-13   35
ID5      2024-08-15   30
ID6      2024-08-15   20
ID6      2024-08-16   20
ID6      2024-08-20   70

My goal is to add a column 'PaymentSum' to the first dataframe: for each loan it should hold the sum of the payment that was made on the date closest to the 'ContractDate' of a loan issued by bank 'D'.
In other words, I need to get the following table:

ID   ContractDate  MaturityDate  Bank  PaymentSum
ID1  2024-06-01    2024-06-18    A     50
ID2  2024-06-05    2024-06-18    B     20
ID3  2024-06-10    2024-06-17    C     50
ID4  2024-06-15                  D
ID5  2024-08-01    2024-08-22    A     30
ID6  2024-08-08    2024-08-23    B     70
ID7  2024-08-20                  D

I understand that a plain join is not enough here. Any help is highly appreciated!
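For reference, here is a minimal sketch of one interpretation that reproduces the expected table on the sample data: for every payment, measure the gap in days to each bank-'D' ContractDate, then keep, per loan, the single payment with the smallest gap. The assumption that "closest" means the smallest absolute day difference to any bank-'D' contract date, and the tie-break on PaymentDate, are choices made for this sketch rather than requirements from the question.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data from the question
loans = spark.createDataFrame(
    [("ID1", "2024-06-01", "2024-06-18", "A"),
     ("ID2", "2024-06-05", "2024-06-18", "B"),
     ("ID3", "2024-06-10", "2024-06-17", "C"),
     ("ID4", "2024-06-15", None, "D"),
     ("ID5", "2024-08-01", "2024-08-22", "A"),
     ("ID6", "2024-08-08", "2024-08-23", "B"),
     ("ID7", "2024-08-20", None, "D")],
    ["ID", "ContractDate", "MaturityDate", "Bank"],
)

payments = spark.createDataFrame(
    [("ID1", "2024-06-02", 10), ("ID1", "2024-06-08", 40), ("ID1", "2024-06-10", 50),
     ("ID2", "2024-06-06", 30), ("ID2", "2024-06-07", 90), ("ID2", "2024-06-08", 20),
     ("ID3", "2024-06-11", 20), ("ID3", "2024-06-12", 30), ("ID3", "2024-06-13", 50),
     ("ID5", "2024-08-10", 15), ("ID5", "2024-08-13", 35), ("ID5", "2024-08-15", 30),
     ("ID6", "2024-08-15", 20), ("ID6", "2024-08-16", 20), ("ID6", "2024-08-20", 70)],
    ["ID_loan", "PaymentDate", "PaymentSum"],
)

# ContractDates of bank-D loans; there are few of them, so a cross join is cheap
d_dates = loans.filter(F.col("Bank") == "D").select(
    F.to_date("ContractDate").alias("DContractDate")
)

# For every payment, compute the day gap to every D contract date,
# then keep the single closest payment per loan (PaymentDate breaks ties)
w = Window.partitionBy("ID_loan").orderBy("gap", "PaymentDate")
closest = (
    payments.crossJoin(d_dates)
    .withColumn("gap", F.abs(F.datediff(F.to_date("PaymentDate"), F.col("DContractDate"))))
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .select("ID_loan", "PaymentSum")
)

# Bank-D loans have no payments of their own, so a left join leaves their PaymentSum null
result = loans.join(closest, loans.ID == closest.ID_loan, "left").drop("ID_loan")
result.orderBy("ID").show()

On Spark 3.3+ the window could likely be replaced by groupBy("ID_loan").agg(F.min_by("PaymentSum", "gap")), but row_number keeps the tie-break explicit.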
