1 vote
0 answers
20 views
How to optimize a special array_intersect in Hive SQL executed by the Spark engine?
buckets is a column of type array<string>. The logic is similar to array_intersect, except only the prefix of each string in buckets (before the first -) is compared. How can I optimize the ...
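The prefix comparison this question describes can be sketched in plain Python before worrying about the Spark execution plan (the function and argument names below are illustrative, not from the question's code):

```python
def prefix_intersect(buckets_a, buckets_b):
    """Intersect two string arrays, comparing only the prefix
    of each string before the first '-'."""
    # Build a set of prefixes from the second array for O(1) lookups.
    prefixes_b = {s.split("-", 1)[0] for s in buckets_b}
    # Keep elements of the first array whose prefix appears in the second.
    return [s for s in buckets_a if s.split("-", 1)[0] in prefixes_b]

result = prefix_intersect(["a-1", "b-2", "c-3"], ["a-9", "c-0"])
```

In Spark SQL itself, one plausible approach is to normalize first and then use the built-in: something like `array_intersect(transform(a, x -> split(x, '-')[0]), transform(b, x -> split(x, '-')[0]))`, assuming Spark's higher-order functions (2.4+) are available in the Hive-on-Spark setup.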
Advice
0 votes
6 replies
155 views
PySpark SQL: How to do a GROUP BY with a specific WHERE condition
So I am doing some SQL aggregation transformations on a dataset and there is a certain condition I would like to apply, but I'm not sure how. Here is a basic code block: le_test = spark.sql("""...
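The usual answer to this kind of question is conditional aggregation: keep the GROUP BY and move the WHERE-like condition into the aggregate itself. A plain-Python sketch of the pattern (the column names are invented for illustration):

```python
from collections import defaultdict

rows = [
    {"group": "a", "status": "ok", "amount": 10},
    {"group": "a", "status": "bad", "amount": 5},
    {"group": "b", "status": "ok", "amount": 7},
]

# GROUP BY "group", summing only the rows that satisfy the condition --
# the equivalent of SUM(CASE WHEN status = 'ok' THEN amount ELSE 0 END).
totals = defaultdict(int)
for row in rows:
    if row["status"] == "ok":
        totals[row["group"]] += row["amount"]
    else:
        totals[row["group"]] += 0  # keep the group even when the condition fails
```

In Spark SQL the same idea reads `SUM(CASE WHEN status = 'ok' THEN amount ELSE 0 END)`, and recent Spark versions (3.0+) should also accept the standard `SUM(amount) FILTER (WHERE status = 'ok')` form.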
3 votes
1 answer
107 views
How to collect multiple metrics with observe in PySpark without triggering multiple actions
I have a PySpark job that reads data from table a, performs some transformations and filters, and then writes the result to table b. Here’s a simplified version of the code: import pyspark.sql....
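PySpark's `DataFrame.observe` exists precisely to piggyback several metrics on a single action. The underlying idea, several accumulators filled in one pass over the data, can be sketched without Spark (names invented):

```python
# One pass over the data, several metrics accumulated together --
# the same idea DataFrame.observe uses to avoid one action per metric.
def collect_metrics(values):
    count = 0
    total = 0
    max_value = None
    for value in values:
        count += 1
        total += value
        max_value = value if max_value is None else max(max_value, value)
    return {"count": count, "sum": total, "max": max_value}
```

With PySpark itself, passing several aggregate columns to one `df.observe(...)` call collects them all during the single write action; the `Observation` helper for reading the results back is available from Spark 3.3 on.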
0 votes
0 answers
45 views
Spark: VSAM File read issue with special character
We have a scenario to read a VSAM file directly, along with a copybook to determine the column lengths; we are using the COBRIX library as part of the Spark read. However, we can see the same is not properly ...
0 votes
0 answers
62 views
Scala Spark: Why does a DataFrame.transform that calls another transform hang?
I have a Scala (v. 2.12.15) Spark (v. 3.5.1) job that works correctly and looks something like this: import org.apache.spark.sql.DataFrame ... val myDataFrame = myReadDataFunction(...) ....
1 vote
3 answers
98 views
How to pass an array of structs as a parameter to a UDF in Spark 4
Does anybody know what I am doing wrong? The following reduced code snippet works in spark-3.x but doesn't work in spark-4.x. In my use case I need to pass a complex data structure to a UDF (let's say ...
1 vote
0 answers
135 views
Conversion of a pyspark DataFrame with a Variant column to pandas fails with an error
When I try to convert a pyspark DataFrame with a VariantType column to a pandas DataFrame, the conversion fails with an error 'NoneType' object is not iterable. Am I doing it incorrectly? Sample code: ...
0 votes
0 answers
69 views
AWS Glue/Spark performance issue
I am new to AWS Glue and I am facing performance issues with the following code: spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN") # Define S3 path with wildcard to match ...