In Spark SQL, you can achieve similar functionality to SQL Temporary Tables using DataFrames or by creating temporary views. However, there are some differences to consider when working with temporary views in Spark, for example when you want to perform multiple operations.
Temporary Views:
- Temporary views in Spark are similar to SQL views, as you mentioned.
- Temporary views are commonly used to represent temporary data for the duration of a Spark session
- They don't hold data but provide a structured schema on top of an existing DataFrame or dataset.
- If you perform an operation like an INSERT, the view won't reflect the changes, and it will still return the original data.
DataFrames:
- DataFrames are a more suitable option if you need to perform multiple operations on the same data, including INSERTs.
- DataFrames are distributed collections of data that can be manipulated using Spark SQL operations.
- You can create a DataFrame from your data source and perform various transformations, including filters, joins, and inserts.
The below is the approach how you can perform the operations you mentioned using DataFrames in Spark SQL:
from pyspark.sql import SparkSession from pyspark.sql import Row spark = SparkSession.builder.appName("example").getOrCreate() dimEmployee = spark.createDataFrame([ Row(EmployeeID=1, FirstName="John", LastName="Doe"), Row(EmployeeID=2, FirstName="Jane", LastName="Smith"), Row(EmployeeID=3, FirstName="Bob", LastName="Johnson") ]) employee_data = spark.createDataFrame([ Row(EmployeeID=4, FirstName="Alice", LastName="Brown"), Row(EmployeeID=5, FirstName="Eva", LastName="Williams"), Row(EmployeeID=6, FirstName="Charlie", LastName="Davis") ]) filtered_data = employee_data.exceptAll(dimEmployee) display(filtered_data) dimEmployee = dimEmployee.union(filtered_data) display(dimEmployee) other_data = spark.createDataFrame([ Row(EmployeeID=4, Department="HR"), Row(EmployeeID=5, Department="IT"), Row(EmployeeID=7, Department="Sales") ]) joined_data = filtered_data.join(other_data, "EmployeeID", "left") display(joined_data)



In the above code Created two DataFrames dimEmployee and employee_data with sample employee data.
Filtering the data in employee_data that is not in dimEmployee Using the exceptAll method to filter out rows from employee_data that are present in dimEmployee. The filtered result is stored in the filtered_data DataFrame
union the filtered_data DataFrame with the dimEmployee DataFrame,adding the rows from filtered_data to dimEmployee.
A new DataFrame other_data with some department information for employees.
Join filtered_data with other_data based on the "EmployeeID" column using a left join.
The code performs a left join between filtered_data and other_data on the "EmployeeID" column and stores the result in the joined_data DataFrame.