
All Questions

1 vote
0 answers
35 views

buckets is a column of type array<string>. The logic is similar to array_intersect, except only the prefix of each string in buckets (before the first -) is compared. How can I optimize the ...
Dong Ye • 11
Best practices
0 votes
5 replies
85 views

I have been working as a Data Engineer and ran into this issue. I came across a use case where I have a view (let's name it inputView) which is created by reading data from some source. Now somewhere ...
Parth Sarthi Roy
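For context, a view like the one described is typically registered along these lines; the source format and path below are placeholders, and only the view name `inputView` comes from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder source; swap in whatever reader actually feeds the view.
df = spark.read.parquet("s3://some-bucket/input/")
df.createOrReplaceTempView("inputView")
```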
Advice
0 votes
6 replies
159 views

So I am doing some SQL aggregation transformations of a dataset, and there is a certain condition that I would like to apply, but I'm not sure how. Here is a basic code block: le_test = spark.sql("""...
BeaverFever
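A minimal sketch of conditional aggregation inside spark.sql is shown below; the table, columns, and condition are hypothetical, and only the variable name `le_test` comes from the excerpt.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input table for illustration.
spark.createDataFrame(
    [("A", "ok", 10), ("A", "bad", 5), ("B", "ok", 7)],
    ["group_col", "status", "amount"],
).createOrReplaceTempView("my_table")

le_test = spark.sql("""
    SELECT group_col,
           SUM(CASE WHEN status = 'ok' THEN amount ELSE 0 END) AS ok_amount,
           COUNT(*) AS total_rows
    FROM my_table
    GROUP BY group_col
""")
le_test.show()
```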
0 votes
0 answers
84 views

I created a table as follows: CREATE TABLE IF NOT EXISTS raw_data.civ ( date timestamp, marketplace_id int, ... some more columns ) USING ICEBERG PARTITIONED BY ( marketplace_id, ...
shiva • 2,781
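Reconstructed from the excerpt, the DDL looks roughly like the sketch below; the elided columns are left out, and the excerpt indicates more partition columns than are shown here.

```python
# Assumes a Spark session already configured with an Iceberg catalog.
spark.sql("""
    CREATE TABLE IF NOT EXISTS raw_data.civ (
        date timestamp,
        marketplace_id int
        -- further columns elided in the question
    )
    USING ICEBERG
    PARTITIONED BY (marketplace_id)  -- the question lists additional partition columns
""")
```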
3 votes
1 answer
119 views

I have a PySpark job that reads data from table a, performs some transformations and filters, and then writes the result to table b. Here’s a simplified version of the code: import pyspark.sql....
עומר אמזלג
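The read-transform-write shape described above is sketched below; only the table names `a` and `b` come from the question, while the filter and transformation are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.table("a")
result = (df.filter(F.col("some_col").isNotNull())       # hypothetical filter
            .withColumn("doubled", F.col("some_col") * 2))  # hypothetical transform
result.write.mode("overwrite").saveAsTable("b")
```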
0 votes
0 answers
78 views

I am observing different write behaviors when executing queries on an EMR Notebook (correct behavior) vs. when using spark-submit to submit a Spark application to an EMR cluster (incorrect behavior). When I ...
shiva • 2,781
0 votes
0 answers
47 views

We have a scenario where we read a VSAM file directly along with a copybook to determine the column lengths; we were using the Cobrix library as part of the Spark read. However, we see the same is not properly ...
Rocky1989 • 409
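For reference, reading mainframe data with a copybook via Cobrix (za.co.absa.cobrix) generally looks like the sketch below; both paths are placeholders, not taken from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cobrix parses the copybook to derive field names, lengths, and types.
df = (spark.read
          .format("cobol")
          .option("copybook", "/path/to/layout.cpy")   # placeholder copybook path
          .load("/path/to/vsam_export"))               # placeholder data path
df.printSchema()
```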
0 votes
0 answers
39 views

I'm analyzing Spark event logs and have already retrieved the SparkListenerStageSubmitted and SparkListenerTaskEnd events to collect metrics such as spill, skew ratio, memory, and CPU usage. However, ...
Carol C
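Spark event logs are newline-delimited JSON, so collecting the two listener events mentioned above can be done with a small parser like the sketch below; the log path is a placeholder.

```python
import json

stage_submitted, task_end = [], []
# Placeholder path to an uncompressed Spark event log file.
with open("/path/to/eventlog/application_1234_0001") as f:
    for line in f:
        event = json.loads(line)
        if event.get("Event") == "SparkListenerStageSubmitted":
            stage_submitted.append(event)
        elif event.get("Event") == "SparkListenerTaskEnd":
            task_end.append(event)

print(len(stage_submitted), len(task_end))
```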
0 votes
0 answers
62 views

I have a job in Scala (v. 2.12.15) / Spark (v. 3.5.1) that works correctly and looks something like this: import org.apache.spark.sql.DataFrame ... val myDataFrame = myReadDataFunction(...) ....
jd_sa • 1
0 votes
0 answers
53 views

I'm currently working with a specific version of Apache Spark (3.1.1) that I cannot upgrade. Because of that, I can't use Apache Sedona, and version 1.3.1 is too slow. My problem is the following code, which ...
matdlara • 149
1 vote
3 answers
104 views

Does anybody know what I am doing wrong? The following is a reduced code snippet that works in Spark 3.x but doesn't work in Spark 4.x. In my use case I need to pass a complex data structure to a UDF (let's say ...
Jiri Humpolicek
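For illustration, passing a struct (one kind of complex type) into a PySpark UDF usually looks like the sketch below; the schema and logic are hypothetical, not the questioner's code.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

@F.udf(returnType=IntegerType())
def sum_struct(s):
    # A struct column arrives inside the Python UDF as a Row object.
    return None if s is None else (s.a or 0) + (s.b or 0)

df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
df.select(sum_struct(F.struct("a", "b")).alias("total")).show()
```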
0 votes
1 answer
123 views

I am trying to read the _delta_log folder of a Delta Lake table via Spark to export some custom metrics. I have figured out how to get some metrics from history and description, but I have a problem ...
Melika Ghiasi
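A minimal sketch of reading the transaction log directly is shown below; the table path is a placeholder, and the comment lists the top-level action fields usually present in Delta commit files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each commit under _delta_log is a JSON file whose top-level fields typically
# include commitInfo, add, remove, metaData, and protocol.
log_df = spark.read.json("/path/to/delta_table/_delta_log/*.json")  # placeholder path
log_df.printSchema()
```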
1 vote
0 answers
137 views

When I try to convert a pyspark DataFrame with a VariantType column to a pandas DataFrame, the conversion fails with an error 'NoneType' object is not iterable. Am I doing it incorrectly? Sample code: ...
Ghislain Fourny
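A sketch of the kind of conversion described is below, assuming Spark 4.x where parse_json and VariantType are available; the data is made up and not the questioner's sample code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# parse_json (Spark 4.x) produces a VariantType column from JSON strings.
df = spark.createDataFrame([('{"a": 1}',), ('{"a": 2}',)], ["raw"])
vdf = df.select(F.parse_json("raw").alias("v"))
pdf = vdf.toPandas()   # the conversion step the question reports failing
print(pdf)
```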
3 votes
0 answers
73 views

I am trying to write a custom decoder function in Java targeting Spark 4.0: public class MyDataToCatalyst extends UnaryExpression implements NonSQLExpression, ExpectsInputTypes, Serializable { //.....
Carsten • 1,288
0 votes
0 answers
69 views

I am new to AWS Glue and I am facing performance issues with the following code: spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN") # Define S3 path with wildcard to match ...
Alberto • 15
