1 vote
0 answers
20 views
How to optimize a special array_intersect in Hive SQL executed by the Spark engine?
buckets is a column of type array<string>. The logic is similar to array_intersect, except only the prefix of each string in buckets (before the first -) is compared. How can I optimize the ...
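The prefix comparison this question describes can be sketched in plain Python before worrying about the Spark execution plan (the function and argument names below are illustrative, not from the question's code):

```python
def prefix_intersect(buckets_a, buckets_b):
    """Intersect two string arrays, comparing only the prefix
    of each string before the first '-'."""
    # Build a set of prefixes from the second array for O(1) lookups.
    prefixes_b = {s.split("-", 1)[0] for s in buckets_b}
    # Keep elements of the first array whose prefix appears in the second.
    return [s for s in buckets_a if s.split("-", 1)[0] in prefixes_b]

result = prefix_intersect(["a-1", "b-2", "c-3"], ["a-9", "c-0"])
```

In Spark SQL itself, one plausible approach is to normalize first and then use the built-in: something like `array_intersect(transform(a, x -> split(x, '-')[0]), transform(b, x -> split(x, '-')[0]))`, assuming Spark's higher-order functions (2.4+) are available in the Hive-on-Spark setup.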
Advice
0 votes
6 replies
155 views
PySpark SQL: How to do a GROUP BY with a specific WHERE condition
So I am doing some SQL aggregation transformations on a dataset and there is a certain condition I would like to apply, but I'm not sure how. Here is a basic code block: le_test = spark.sql("""...
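The usual answer to this kind of question is conditional aggregation: keep the GROUP BY and move the WHERE-like condition into the aggregate itself. A plain-Python sketch of the pattern (the column names are invented for illustration):

```python
from collections import defaultdict

rows = [
    {"group": "a", "status": "ok", "amount": 10},
    {"group": "a", "status": "bad", "amount": 5},
    {"group": "b", "status": "ok", "amount": 7},
]

# GROUP BY "group", summing only the rows that satisfy the condition --
# the equivalent of SUM(CASE WHEN status = 'ok' THEN amount ELSE 0 END).
totals = defaultdict(int)
for row in rows:
    if row["status"] == "ok":
        totals[row["group"]] += row["amount"]
    else:
        totals[row["group"]] += 0  # keep the group even when the condition fails
```

In Spark SQL the same idea reads `SUM(CASE WHEN status = 'ok' THEN amount ELSE 0 END)`, and recent Spark versions (3.0+) should also accept the standard `SUM(amount) FILTER (WHERE status = 'ok')` form.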
3 votes
1 answer
107 views
How to collect multiple metrics with observe in PySpark without triggering multiple actions
I have a PySpark job that reads data from table a, performs some transformations and filters, and then writes the result to table b. Here’s a simplified version of the code: import pyspark.sql....
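PySpark's `DataFrame.observe` exists precisely to piggyback several metrics on a single action. The underlying idea, several accumulators filled in one pass over the data, can be sketched without Spark (names invented):

```python
# One pass over the data, several metrics accumulated together --
# the same idea DataFrame.observe uses to avoid one action per metric.
def collect_metrics(values):
    count = 0
    total = 0
    max_value = None
    for value in values:
        count += 1
        total += value
        max_value = value if max_value is None else max(max_value, value)
    return {"count": count, "sum": total, "max": max_value}
```

With PySpark itself, passing several aggregate columns to one `df.observe(...)` call collects them all during the single write action; the `Observation` helper for reading the results back is available from Spark 3.3 on.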
0 votes
0 answers
45 views
Spark: VSAM File read issue with special character
We have a scenario to read a VSAM file directly, along with a copybook to determine the column lengths; we are using the COBRIX library as part of the Spark read. However, we can see the same is not properly ...
0 votes
0 answers
62 views
Scala Spark: Why does a DataFrame.transform that calls another transform hang?
I have a Scala (v. 2.12.15) Spark (v. 3.5.1) job that works correctly and looks something like this: import org.apache.spark.sql.DataFrame ... val myDataFrame = myReadDataFunction(...) ....
1 vote
3 answers
98 views
How to pass an array of structs as a parameter to a UDF in Spark 4
Does anybody know what I am doing wrong? The following reduced code snippet works in spark-3.x but doesn't work in spark-4.x. In my use case I need to pass a complex data structure to a UDF (let's say ...
1 vote
0 answers
135 views
Conversion of a pyspark DataFrame with a Variant column to pandas fails with an error
When I try to convert a pyspark DataFrame with a VariantType column to a pandas DataFrame, the conversion fails with an error 'NoneType' object is not iterable. Am I doing it incorrectly? Sample code: ...
0 votes
0 answers
69 views
AWS Glue/Spark performance issue
I am new to AWS Glue and I am facing performance issues with the following code: spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN") # Define S3 path with wildcard to match ...