In PySpark, you can join DataFrames on multiple columns by passing a list of column names to the `on` parameter of the `join()` method. Here's how you can do it. First, create a SparkSession:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Multiple Columns Join Example") \
    .getOrCreate()
```

For demonstration purposes, let's create two sample DataFrames:
```python
from pyspark.sql import Row

# Sample data for DataFrame1
data1 = [
    Row(id=1, name="Alice", timestamp="2022-01-01"),
    Row(id=2, name="Bob", timestamp="2022-01-02"),
]
df1 = spark.createDataFrame(data1)

# Sample data for DataFrame2
data2 = [
    Row(id=1, name="Alice", timestamp="2022-01-01", value=100),
    Row(id=2, name="Bob", timestamp="2022-01-02", value=200),
]
df2 = spark.createDataFrame(data2)
```
To join `df1` and `df2` on both the `id` and `name` columns, you can use the following:
```python
joined_df = df1.join(df2, on=["id", "name"], how="inner")
joined_df.show()
```
The `how` parameter specifies the type of join to perform. The example above uses an `"inner"` join; you can replace `"inner"` with other join types such as `"left"`, `"right"`, or `"outer"`, depending on your requirements.
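For instance, a left join keeps every row from the left DataFrame and fills the right side's columns with nulls where no match exists. Here's a minimal sketch; the `Charlie` row is hypothetical, added only so there is an unmatched row to observe:

```python
# Hypothetical extra row in the left DataFrame with no match in df2
data1_extra = data1 + [Row(id=3, name="Charlie", timestamp="2022-01-03")]
df1_extra = spark.createDataFrame(data1_extra)

# Left join: all rows from df1_extra are kept; columns coming from df2
# (here, `value`) are null where no matching (id, name) pair exists.
left_df = df1_extra.join(df2, on=["id", "name"], how="left")
left_df.show()
```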
If you also want to join on the `timestamp` column, simply add it to the list. Since both DataFrames contain a `timestamp` column, including it in the join keys also keeps it from appearing twice in the result:
```python
joined_df = df1.join(df2, on=["id", "name", "timestamp"], how="inner")
joined_df.show()
```
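Note that passing a list of column names deduplicates the key columns in the output, whereas joining on an explicit boolean expression keeps both copies, which you then have to disambiguate or drop. A quick sketch of the difference:

```python
# Joining on a list of names: id, name, and timestamp each appear once.
df1.join(df2, on=["id", "name", "timestamp"], how="inner").printSchema()

# Joining on an expression keeps both copies of the key columns,
# so the result contains df1.id and df2.id, df1.name and df2.name, etc.
expr_df = df1.join(
    df2,
    (df1.id == df2.id) & (df1.name == df2.name),
    how="inner",
)
expr_df.printSchema()
```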
And that's how you join on multiple columns in PySpark!