All Questions
Tagged with spark-dataframe or apache-spark-sql
26,926 questions
1 vote
0 answers
35 views
How to optimize a special array_intersect in Hive SQL executed by the Spark engine?
buckets is a column of type array<string>. The logic is similar to array_intersect, except only the prefix of each string in buckets (before the first -) is compared. How can I optimize the ...
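A UDF-free way to express this (a sketch; the table t and the second array column buckets2 are hypothetical) is to normalize both arrays to their prefixes with the higher-order function transform and then apply the built-in array_intersect, both available in Spark SQL 2.4+:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# transform() rewrites each element to its prefix (text before the first '-'),
# then the built-in array_intersect compares the normalized arrays.
result = spark.sql("""
    SELECT array_intersect(
               transform(buckets,  x -> split(x, '-')[0]),
               transform(buckets2, x -> split(x, '-')[0])
           ) AS prefix_intersection
    FROM t
""")
result.show(truncate=False)
```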
Best practices
0 votes
5 replies
85 views
Pushing down filters to an RDBMS with Java Spark
I have been working as a Data Engineer and ran into this issue. I came across a use case where I have a view (let's call it inputView) which is created by reading data from some source. Now somewhere ...
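For reference, a minimal PySpark sketch of the pushdown behavior (connection details are placeholders; the same options exist in the Java API). Simple comparison filters applied on top of a JDBC-backed view are pushed by Catalyst into the SQL sent to the database:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

input_view = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")  # placeholder URL
    .option("dbtable", "public.orders")               # placeholder table
    .option("user", "user")
    .option("password", "password")
    .load()
)
input_view.createOrReplaceTempView("inputView")

filtered = spark.sql("SELECT * FROM inputView WHERE amount > 100")
filtered.explain()  # look for PushedFilters in the JDBC scan node
```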
Advice
0 votes
6 replies
159 views
PySpark SQL: How to do GROUP BY with a specific WHERE condition
So I am doing some SQL aggregation transformations of a dataset, and there is a certain condition I would like to apply, but I'm not sure how. Here is a basic code block: le_test = spark.sql("""...
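A common pattern here (the orders table and its columns are hypothetical) is conditional aggregation: the condition is applied per aggregate with CASE WHEN, or with the FILTER clause supported since Spark 3.0, instead of as a global WHERE:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

le_test = spark.sql("""
    SELECT
        customer_id,
        SUM(CASE WHEN status = 'complete' THEN amount ELSE 0 END) AS completed_amount,
        COUNT(*) FILTER (WHERE status = 'complete') AS completed_orders,
        COUNT(*) AS total_orders
    FROM orders
    GROUP BY customer_id
""")
le_test.show()
```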
0 votes
0 answers
84 views
How to Check if a Query Touches Data Files or just Uses Manifests and Metadata in Iceberg
I created a table as follows: CREATE TABLE IF NOT EXISTS raw_data.civ ( date timestamp, marketplace_id int, ... some more columns ) USING ICEBERG PARTITIONED BY ( marketplace_id, ...
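Two ways to inspect this, sketched below (the query and filter values are illustrative): the physical plan shows what was pushed into the Iceberg BatchScan, and Iceberg's metadata tables, which are served from manifests alone, list the candidate data files without reading any of them:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1) The physical plan shows the Iceberg BatchScan and which filters
#    were pushed down to it.
spark.sql(
    "SELECT count(*) FROM raw_data.civ WHERE marketplace_id = 1"
).explain(True)

# 2) Metadata tables are answered from manifests, so the candidate data
#    files can be listed without touching any of them.
spark.sql("""
    SELECT file_path, record_count
    FROM raw_data.civ.files
    WHERE `partition`.marketplace_id = 1
""").show(truncate=False)
```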
3 votes
1 answer
119 views
How to collect multiple metrics with observe in PySpark without triggering multiple actions
I have a PySpark job that reads data from table a, performs some transformations and filters, and then writes the result to table b. Here’s a simplified version of the code: import pyspark.sql....
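A sketch of the usual fix (table and column names are placeholders): attach all metrics to a single observe call via an Observation, so they are collected as a side effect of the one write action:

```python
from pyspark.sql import SparkSession, Observation
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

obs = Observation("job_metrics")
df = (
    spark.table("a")                      # source table from the question
    .filter(F.col("amount") > 0)          # placeholder transformation
    .observe(
        obs,
        F.count(F.lit(1)).alias("row_count"),
        F.sum("amount").alias("total_amount"),
        F.max("amount").alias("max_amount"),
    )
)
df.write.mode("append").saveAsTable("b")  # the one and only action
print(obs.get)  # all three metrics were gathered during the same pass
```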
0 votes
0 answers
78 views
Unexpected Write Behavior when using MERGE INTO/INSERT INTO Iceberg Spark Queries
I am observing different write behaviors when executing queries on an EMR Notebook (correct behavior) vs when using spark-submit to submit a Spark application to an EMR cluster (incorrect behavior). When I ...
0 votes
0 answers
47 views
Spark: VSAM file read issue with special characters
We have a scenario where we read a VSAM file directly, along with a copybook to determine the column lengths; we were using the Cobrix library as part of the Spark read. However, we could see that the same is not properly ...
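For context, a minimal sketch of a Cobrix read (paths are placeholders; option names are from the spark-cobol data source): the copybook drives column names and lengths, and VSAM data is typically EBCDIC-encoded:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("cobol")
    .option("copybook", "s3://bucket/schemas/record.cpy")  # placeholder path
    .option("encoding", "ebcdic")                          # mainframe default
    .load("s3://bucket/data/vsam_extract")                 # placeholder path
)
df.printSchema()
```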
0 votes
0 answers
39 views
How to link Spark event log stages to PySpark code or query?
I'm analyzing Spark event logs and have already retrieved the SparkListenerStageSubmitted and SparkListenerTaskEnd events to collect metrics such as spill, skew ratio, memory, and CPU usage. However, ...
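One hedged approach (step names below are hypothetical): tag each logical step with SparkContext.setJobDescription, which sets the spark.job.description property carried into the event log's job and stage properties, so stages can be traced back to a named step in the code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

sc.setJobDescription("step-1: load and filter orders")  # hypothetical step
clean = spark.table("orders").filter("amount > 0")
clean.write.mode("overwrite").saveAsTable("orders_clean")

sc.setJobDescription("step-2: aggregate by customer")   # hypothetical step
agg = spark.table("orders_clean").groupBy("customer_id").count()
agg.write.mode("overwrite").saveAsTable("orders_by_customer")
```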
0 votes
0 answers
62 views
Scala Spark: why does a DataFrame.transform that calls another transform hang?
I have a Scala (v2.12.15) Spark (v3.5.1) job that works correctly and looks something like this: import org.apache.spark.sql.DataFrame ... val myDataFrame = myReadDataFunction(...) ....
0 votes
0 answers
53 views
Spatial join without Apache Sedona
Currently I'm working with a specific version of Apache Spark (3.1.1) that I cannot upgrade. Because of that I can't use Apache Sedona, and version 1.3.1 is too slow. My problem is the following code, which ...
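A plain-Spark alternative, sketched under assumed column names (lat, lon) and placeholder tables: bucket both sides into a coarse grid so the spatial join becomes an equi-join on the cell, then refine with an exact distance predicate:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

cell = 0.01     # grid cell size in degrees; keep it >= the join radius
radius = 0.005  # join radius in degrees

def with_cell(df, suffix):
    # Assign each point to a coarse grid cell and suffix its coordinates,
    # so the join below is an equi-join on the cell ids.
    return df.select(
        F.floor(F.col("lon") / cell).alias("cell_x"),
        F.floor(F.col("lat") / cell).alias("cell_y"),
        F.col("lat").alias(f"lat_{suffix}"),
        F.col("lon").alias(f"lon_{suffix}"),
    )

a = with_cell(spark.table("points_a"), "a")  # placeholder tables with
b = with_cell(spark.table("points_b"), "b")  # lat/lon double columns

joined = (
    a.join(b, ["cell_x", "cell_y"])  # cheap hash join instead of a cross join
     .where(
         (F.col("lat_a") - F.col("lat_b")) ** 2
         + (F.col("lon_a") - F.col("lon_b")) ** 2
         < radius ** 2
     )
)
# Caveat: pairs that straddle a cell boundary are missed; a complete version
# also matches each point against the 8 neighboring cells.
```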
1 vote
3 answers
104 views
How to pass an array of structs as a parameter to a UDF in Spark 4
Does anybody know what I am doing wrong? The following is a reduced code snippet that works in Spark 3.x but doesn't work in Spark 4.x. In my use case I need to pass a complex data structure to a UDF (let's say ...
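For comparison, a minimal PySpark sketch of passing an array&lt;struct&gt; column into a UDF (the schema and data are illustrative, not the asker's):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Illustrative schema: each row carries an array of struct<a:int, b:string>.
df = spark.createDataFrame(
    [([(1, "x"), (2, "y")],)],
    "items array<struct<a:int, b:string>>",
)

@udf(IntegerType())
def sum_a(items):
    # Each array element arrives in Python as a Row; fields are read by name.
    return sum(item["a"] for item in items)

df.select(sum_a("items").alias("total")).show()
```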
0 votes
1 answer
123 views
Problem reading the _last_checkpoint file from the _delta_log directory of a Delta Lake table on S3
I am trying to read the _delta_log folder of a Delta Lake table via Spark to export some custom metrics. I have worked out how to get some metrics from history and description, but I have a problem ...
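Since _last_checkpoint is a single-line JSON file, it can be read directly; a sketch with a placeholder path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# On S3 this usually goes through the s3a:// filesystem.
last_checkpoint = spark.read.json(
    "s3a://bucket/path/to/table/_delta_log/_last_checkpoint"
)
last_checkpoint.show(truncate=False)  # version and size, plus optional fields
```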
1 vote
0 answers
137 views
Conversion of a PySpark DataFrame with a Variant column to pandas fails with an error
When I try to convert a PySpark DataFrame with a VariantType column to a pandas DataFrame, the conversion fails with the error 'NoneType' object is not iterable. Am I doing it incorrectly? Sample code: ...
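One workaround worth trying (an assumption, not a confirmed fix): serialize the variant to a JSON string on the Spark side before converting, assuming Spark 4.0's to_json support for variant input:

```python
from pyspark.sql import functions as F

# df is the DataFrame from the question; "v" stands in for the variant column.
pdf = df.withColumn("v", F.to_json("v")).toPandas()
```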
3 votes
0 answers
73 views
Cannot extend Spark UnaryExpression in Java
I am trying to write a custom decoder function in Java targeting Spark 4.0: public class MyDataToCatalyst extends UnaryExpression implements NonSQLExpression, ExpectsInputTypes, Serializable { //.....
0 votes
0 answers
69 views
AWS Glue/Spark performance issue
I am new to AWS Glue and I am facing performance issues with the following code: spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN") # Define S3 path with wildcard to match ...