How to create a Temporary Table in Spark-SQL (not a Temp View)

Question

We're migrating code from T-SQL to Spark SQL; it'd be helpful to have Temp Table equivalents.

An earlier question on this, is answered with Temporary Views.

%%SQL CREATE OR ALTER TEMP View MyTmpView AS...

But these are literally Views. Most of the time a Spark Temp View can take the place of a SQL Temp Table, but not always.

Example:

Create a Temp View, MyTmpView, of all Employees not in DimEmployee
INSERT all rows from MyTmpView, INTO DimEmployee
Now use MyTmpView for something else (a second INSERT, a JOIN, etc.). You can't - it's empty, since it's a View, which if ran now, would logically return nothing after that INSERT in step 2.

(Sidenote: Our environment is specifically Azure Synapse Spark, if it matters.)

Edit: I accepted an answer - but in reality, in Spark SQL, there are no equivalents to SQL Temp Tables. If one wants, one can go into PySpark dataframes which can provide similar functionality. For our use cases, it would make more sense to work around the limited situations, rather than move into dataframes.

The difference between a temp table and a real table in T-SQL is that a temp table is automatically deleted when the session ends. The closest thing in Spark might be to use a dataframe instead of a temp table (as it disappears when the session ends). But you might run into difficulties mixing tables and dataframes. IMHO temp tables are overused anyway. I suggest just creating a normal table and dropping it when done. There's no point complicating your migration to implement something that doesn't really exist when you can just use a normal table instead. — Nick.Mc
– Nick.Mc, Commented Sep 25, 2023 at 21:27

Dileep Raj Narayan Thumula · Accepted Answer · 2023-09-26 10:19:24Z

In Spark SQL, you can achieve similar functionality to SQL Temporary Tables using DataFrames or by creating temporary views. However, there are some differences to consider when working with temporary views in Spark, for example when you want to perform multiple operations.

Temporary Views:

Temporary views in Spark are similar to SQL views, as you mentioned.
Temporary views are commonly used to represent temporary data for the duration of a Spark session
They don't hold data but provide a structured schema on top of an existing DataFrame or dataset.
If you perform an operation like an INSERT, the view won't reflect the changes, and it will still return the original data.

DataFrames:

DataFrames are a more suitable option if you need to perform multiple operations on the same data, including INSERTs.
DataFrames are distributed collections of data that can be manipulated using Spark SQL operations.
You can create a DataFrame from your data source and perform various transformations, including filters, joins, and inserts.

The below is the approach how you can perform the operations you mentioned using DataFrames in Spark SQL:

from pyspark.sql import SparkSession from pyspark.sql import Row spark = SparkSession.builder.appName("example").getOrCreate() dimEmployee = spark.createDataFrame([ Row(EmployeeID=1, FirstName="John", LastName="Doe"), Row(EmployeeID=2, FirstName="Jane", LastName="Smith"), Row(EmployeeID=3, FirstName="Bob", LastName="Johnson") ]) employee_data = spark.createDataFrame([ Row(EmployeeID=4, FirstName="Alice", LastName="Brown"), Row(EmployeeID=5, FirstName="Eva", LastName="Williams"), Row(EmployeeID=6, FirstName="Charlie", LastName="Davis") ]) filtered_data = employee_data.exceptAll(dimEmployee) display(filtered_data) dimEmployee = dimEmployee.union(filtered_data) display(dimEmployee) other_data = spark.createDataFrame([ Row(EmployeeID=4, Department="HR"), Row(EmployeeID=5, Department="IT"), Row(EmployeeID=7, Department="Sales") ]) joined_data = filtered_data.join(other_data, "EmployeeID", "left") display(joined_data)

enter image description here

In the above code Created two DataFrames dimEmployee and employee_data with sample employee data.
Filtering the data in employee_data that is not in dimEmployee Using the exceptAll method to filter out rows from employee_data that are present in dimEmployee. The filtered result is stored in the filtered_data DataFrame
union the filtered_data DataFrame with the dimEmployee DataFrame,adding the rows from filtered_data to dimEmployee.
A new DataFrame other_data with some department information for employees.
Join filtered_data with other_data based on the "EmployeeID" column using a left join.
The code performs a left join between filtered_data and other_data on the "EmployeeID" column and stores the result in the joined_data DataFrame.

Collectives™ on Stack Overflow

How to create a Temporary Table in Spark-SQL (not a Temp View)

1 Answer 1

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Linked

Related