I have a scenario with two tables: one table has a number of days per key, and the other has daily values. For each row in the values table, I need to sum that row's value together with the values of the following rows, taking as many rows as the days table specifies for that key.

Dataframes
```python
# dataframe1
df1 = spark.createDataFrame(
    [
        ('ll', 5),
        ('yy', 6)
    ],
    ('x', 'days')
)

# dataframe2
df = spark.createDataFrame(
    [
        ('ll', '2020-01-05', '1', '10'),
        ('ll', '2020-01-06', '1', '10'),
        ('ll', '2020-01-07', '1', '10'),
        ('ll', '2020-01-08', '1', '10'),
        ('ll', '2020-01-09', '1', '10'),
        ('ll', '2020-01-10', '1', '10'),
        ('ll', '2020-01-11', '1', '20'),
        ('ll', '2020-01-12', '1', '10'),
        ('ll', '2020-01-05', '2', '30'),
        ('ll', '2020-01-06', '2', '30'),
        ('ll', '2020-01-07', '2', '30'),
        ('ll', '2020-01-08', '2', '40'),
        ('ll', '2020-01-09', '2', '30'),
        ('ll', '2020-01-10', '2', '10'),
        ('ll', '2020-01-11', '2', '10'),
        ('ll', '2020-01-12', '2', '10'),
        ('yy', '2020-01-05', '1', '20'),
        ('yy', '2020-01-06', '1', '20'),
        ('yy', '2020-01-07', '1', '20'),
        ('yy', '2020-01-08', '1', '20'),
        ('yy', '2020-01-09', '1', '20'),
        ('yy', '2020-01-10', '1', '40'),
        ('yy', '2020-01-11', '1', '20'),
        ('yy', '2020-01-12', '1', '20'),
        ('yy', '2020-01-05', '2', '40'),
        ('yy', '2020-01-06', '2', '40'),
        ('yy', '2020-01-07', '2', '40'),
        ('yy', '2020-01-08', '2', '40'),
        ('yy', '2020-01-09', '2', '40'),
        ('yy', '2020-01-10', '2', '40'),
        ('yy', '2020-01-11', '2', '60'),
        ('yy', '2020-01-12', '2', '40')
    ],
    ('x', 'date', 'flag', 'value')
)

expected_dataframe = spark.createDataFrame(
    [
        ('ll', '2020-01-05', '1', '10', '50'),
        ('ll', '2020-01-06', '1', '10', '50'),
        ('ll', '2020-01-07', '1', '10', '60'),
        ('ll', '2020-01-08', '1', '10', '60'),
        ('ll', '2020-01-09', '1', '10', '50'),
        ('ll', '2020-01-10', '1', '10', '40'),
        ('ll', '2020-01-11', '1', '20', '30'),
        ('ll', '2020-01-12', '1', '10', '10'),
        ('ll', '2020-01-05', '2', '30', '170'),
        ('ll', '2020-01-06', '2', '30', '140'),
        ('ll', '2020-01-07', '2', '30', '120'),
        ('ll', '2020-01-08', '2', '40', '100'),
        ('ll', '2020-01-09', '2', '30', '60'),
        ('ll', '2020-01-10', '2', '10', '30'),
        ('ll', '2020-01-11', '2', '10', '20'),
        ('ll', '2020-01-12', '2', '10', '10'),
        ('yy', '2020-01-05', '1', '20', '140'),
        ('yy', '2020-01-06', '1', '20', '140'),
        ('yy', '2020-01-07', '1', '20', '140'),
        ('yy', '2020-01-08', '1', '20', '120'),
        ('yy', '2020-01-09', '1', '20', '100'),
        ('yy', '2020-01-10', '1', '40', '80'),
        ('yy', '2020-01-11', '1', '20', '40'),
        ('yy', '2020-01-12', '1', '20', '20'),
        ('yy', '2020-01-05', '2', '40', '240'),
        ('yy', '2020-01-06', '2', '40', '260'),
        ('yy', '2020-01-07', '2', '40', '260'),
        ('yy', '2020-01-08', '2', '40', '220'),
        ('yy', '2020-01-09', '2', '40', '180'),
        ('yy', '2020-01-10', '2', '40', '140'),
        ('yy', '2020-01-11', '2', '60', '100'),
        ('yy', '2020-01-12', '2', '40', '40')
    ],
    ('x', 'date', 'flag', 'value', 'result')
)
```

expected_results
```
+---+----------+----+-----+------+
|  x|      date|flag|value|result|
+---+----------+----+-----+------+
| ll|2020-01-05|   1|   10|    50|
| ll|2020-01-06|   1|   10|    50|
| ll|2020-01-07|   1|   10|    60|
| ll|2020-01-08|   1|   10|    60|
| ll|2020-01-09|   1|   10|    50|
| ll|2020-01-10|   1|   10|    40|
| ll|2020-01-11|   1|   20|    30|
| ll|2020-01-12|   1|   10|    10|
| ll|2020-01-05|   2|   30|   170|
| ll|2020-01-06|   2|   30|   140|
| ll|2020-01-07|   2|   30|   120|
| ll|2020-01-08|   2|   40|   100|
| ll|2020-01-09|   2|   30|    60|
| ll|2020-01-10|   2|   10|    30|
| ll|2020-01-11|   2|   10|    20|
| ll|2020-01-12|   2|   10|    10|
| yy|2020-01-05|   1|   20|   140|
| yy|2020-01-06|   1|   20|   140|
| yy|2020-01-07|   1|   20|   140|
| yy|2020-01-08|   1|   20|   120|
| yy|2020-01-09|   1|   20|   100|
| yy|2020-01-10|   1|   40|    80|
| yy|2020-01-11|   1|   20|    40|
| yy|2020-01-12|   1|   20|    20|
| yy|2020-01-05|   2|   40|   240|
| yy|2020-01-06|   2|   40|   260|
| yy|2020-01-07|   2|   40|   260|
| yy|2020-01-08|   2|   40|   220|
| yy|2020-01-09|   2|   40|   180|
| yy|2020-01-10|   2|   40|   140|
| yy|2020-01-11|   2|   60|   100|
| yy|2020-01-12|   2|   40|    40|
+---+----------+----+-----+------+
```

code
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import col, to_date

# Join the two tables so each row carries its `days` value,
# and parse the date string into a date column.
df_join = df.join(df1, ['x'], 'inner').withColumn('date', to_date(col('date'), 'yyyy-MM-dd'))

w1 = Window.partitionBy('x', 'flag').orderBy(col('date').desc())
```

So I need to sum the value column based on the days column, i.e. if the days column is 5, I need to sum the values of 5 rows (the current row and the rows that follow it).
I joined the two tables and tried to solve it with a window function, but it didn't work out, and I can't figure out how to proceed. Can anyone show me how to solve this?
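To state the requirement concretely, here is a plain-Python sketch of the computation I expect (the function name `rolling_forward_sum` is just for illustration, not part of my Spark code):

```python
# Within each (x, flag) group ordered by date, each row's result is the
# sum of its own value plus the values of the following `days - 1` rows
# (fewer rows remain near the end of the group, so the tail sums shrink).
def rolling_forward_sum(values, days):
    return [sum(values[i:i + days]) for i in range(len(values))]

# x='ll', flag='1', days=5 from the sample data above:
print(rolling_forward_sum([10, 10, 10, 10, 10, 10, 20, 10], 5))
# → [50, 50, 60, 60, 50, 40, 30, 10]
```

I suspect my window attempt fails because `rowsBetween` takes literal frame bounds, while `days` here varies per row.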