I'm trying to create a DataFrame from a Hive table, but I don't know the Spark API well. I need help optimizing the query in the method getLastSession so that the two Spark jobs it currently triggers become one:
```scala
import java.io.File

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.{col, max}

// onlyPartition and processName are defined elsewhere in my class
val pathTable = new File("/src/test/spark-warehouse/test_db.db/test_table").getAbsolutePath
val path = new Path(s"$pathTable${if (onlyPartition) s"/name_process=$processName" else ""}").toString
val df = spark.read.parquet(path)

def getLastSession: Dataset[Row] = {
  // Job 1: find the most recent time_write
  val lastTime = df.select(max(col("time_write"))).collect()(0)(0).toString
  // Job 2: find the id_session that wrote at that time
  val lastSession = df.select(col("id_session")).where(col("time_write") === lastTime).collect()(0)(0).toString

  val dfByLastSession = df.filter(col("id_session") === lastSession)
  dfByLastSession.show()
  /*
  +----------+----------------+------------------+-------+
  |id_session|      time_write|               key|  value|
  +----------+----------------+------------------+-------+
  |alskdfksjd|1639950466414000|schema2.table2.csv|Failure|
  +----------+----------------+------------------+-------+
  */
  dfByLastSession
}
```

P.S. My source table looks like this, for example:
| name_process | id_session | time_write | key | value |
|---|---|---|---|---|
| OtherClass | jsdfsadfsf | 43434883477 | schema0.table0.csv | Success |
| OtherClass | jksdfkjhka | 23212123323 | schema1.table1.csv | Success |
| OtherClass | alskdfksjd | 23343212234 | schema2.table2.csv | Failure |
| ExternalClass | sdfjkhsdfd | 34455453434 | schema3.table3.csv | Success |
The `id_session` I need is the one having the most recent `time_write`, correct?
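For example, would something like the following be a reasonable way to do it with a single collect? This is just a sketch (the method name `getLastSessionOneJob` is mine), assuming `df` is the DataFrame from the snippet above and that `time_write` sorts correctly as a numeric or timestamp column:

```scala
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.col

def getLastSessionOneJob: Dataset[Row] = {
  // Single collect: take the id_session from the row with the largest time_write
  val lastSession = df
    .orderBy(col("time_write").desc)
    .select(col("id_session"))
    .limit(1)
    .collect()(0)(0)
    .toString

  // Plain transformation; no further job runs until an action is called on the result
  df.filter(col("id_session") === lastSession)
}
```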