
Since you're joining the indices of your array to your original DataFrame, one approach is to convert the array into a DataFrame, generate row_number() - 1 (which becomes your indices), and then join the two DataFrames together.

from pyspark.sql import Row

# Create original DataFrame `df`
df = spark.createDataFrame(
    [(0, "a", 13.0), (2, "B", -33.0), (1, "B", -63.0)],
    ("x1", "x2", "x3"))
df.createOrReplaceTempView("df")

# Create column "x4"
row = Row("x4")

# Take the array
arr = [10, 12, 13]

# Convert the array to an RDD, then create a DataFrame
rdd = sc.parallelize(arr)
df2 = rdd.map(row).toDF()
df2.createOrReplaceTempView("df2")

# Create indices via row number
df3 = spark.sql("SELECT (row_number() OVER (ORDER BY x4)) - 1 AS indices, * FROM df2")
df3.createOrReplaceTempView("df3")

Now that you have the two DataFrames, df and df3, you can run the SQL query below to join them.

SELECT a.x1, a.x2, a.x3, b.x4
FROM df a
JOIN df3 b
  ON b.indices = a.x1
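To make the intent of the row_number() - 1 step and the join concrete, the same indexing logic can be sketched in plain Python (no Spark required); the names here are illustrative, not part of any Spark API:

```python
# Rows of the original DataFrame: (x1, x2, x3)
rows = [(0, "a", 13.0), (2, "B", -33.0), (1, "B", -63.0)]

# The array to attach as column x4
arr = [10, 12, 13]

# Equivalent of row_number() - 1: map each array value to its zero-based index
indexed = {i: v for i, v in enumerate(arr)}  # {0: 10, 1: 12, 2: 13}

# Equivalent of the join condition b.indices = a.x1
joined = [(x1, x2, x3, indexed[x1]) for (x1, x2, x3) in rows]
print(joined)  # [(0, 'a', 13.0, 10), (2, 'B', -33.0, 13), (1, 'B', -63.0, 12)]
```

Note that, just as in the Spark version, the array value each row receives is determined by the row's x1 value, not by the row's position in the DataFrame.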

Note: there is also a good reference answer on adding columns to DataFrames.


Denny Lee
