53 questions
0 votes
0 answers
111 views
Delta Table to Iceberg metadata migration is failing
I am trying to migrate Delta Tables to Iceberg using Scala-Spark while keeping the data intact, following this: https://iceberg.apache.org/docs/1.4.3/delta-lake-migration/ Here is the sample code (...
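A minimal sketch of the snapshot action from the linked guide's iceberg-delta-lake module; the table identifier and Delta location below are placeholders, and the exact builder methods should be checked against the 1.4.3 API.

    import org.apache.iceberg.delta.DeltaLakeToIcebergMigrationSparkIntegration

    // Snapshot leaves the original Delta data files in place and builds
    // Iceberg metadata over them. Identifier and path are placeholders.
    DeltaLakeToIcebergMigrationSparkIntegration
      .snapshotDeltaLakeTable(spark, "spark_catalog.db.iceberg_tbl", "/warehouse/delta/my_table")
      .execute()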
0 votes
1 answer
40 views
Caching processed data in Spark so it is not reprocessed when a failed task restarts
We have a use-case wherein we need to cache certain data that has been processed so that Spark does not reprocess the same data in the event of task failures. So say we have a thousand Foo objects for ...
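One hedged approach: cache() alone is recomputed when an executor is lost, so for data that must not be reprocessed, a reliable checkpoint is the usual tool. A minimal sketch; the checkpoint path and the expensive step are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()
    // The checkpoint directory must be reliable storage (HDFS/S3/GCS),
    // not executor-local disk, to survive task and executor failures.
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

    val processed = spark.range(1000).selectExpr("id", "id * 2 AS expensiveResult")
    // checkpoint() materializes the result and cuts the lineage, so a
    // restarted task rereads the checkpoint instead of reprocessing.
    val stable = processed.checkpoint()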
1 vote
1 answer
64 views
Ignoring codec because it collides with previously generated codec
I am trying to register a custom codec (for map) like below: val session: CqlSession = CassandraConnector.apply(spark.sparkContext).openSession() val codecRegistry: MutableCodecRegistry = session....
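For the warning itself: the driver generates and caches a codec the first time it resolves one for a type, and a later register() for the same mapping is ignored. A sketch of registering before first use, where myMapCodec stands in for the custom codec:

    import com.datastax.oss.driver.api.core.CqlSession
    import com.datastax.oss.driver.api.core.`type`.codec.registry.MutableCodecRegistry
    import com.datastax.spark.connector.cql.CassandraConnector

    val session: CqlSession = CassandraConnector.apply(spark.sparkContext).openSession()
    val codecRegistry: MutableCodecRegistry =
      session.getContext.getCodecRegistry.asInstanceOf[MutableCodecRegistry]
    // Register before any statement touches the map column; once the driver
    // has generated a codec for that type, this call logs the "collides"
    // warning and keeps the original codec.
    codecRegistry.register(myMapCodec) // myMapCodec: your TypeCodec implementation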
0 votes
1 answer
52 views
Is there a way to partition/group data so that the sum of column values in each group is under a limit?
I want to partition/group rows so that each group's total size is <= limit. For example, if I have:
+--------+----------+
|      id|      size|
+--------+----------+
|       1|         3|
|       2|         6|
...
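A running-sum sketch of one common answer, assuming the id/size columns above and a df bound to the question's data; note the caveat in the comment.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val limit = 10 // placeholder cap
    val w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    // Bucket rows by where their running total falls. This is approximate:
    // a group can exceed the cap when one row straddles a bucket boundary,
    // so an exact cap needs a sequential pass (e.g. mapPartitions after sorting).
    val grouped = df
      .withColumn("runningSum", sum("size").over(w))
      .withColumn("groupId", floor((col("runningSum") - 1) / limit))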
0 votes
1 answer
43 views
Scala Spark query optimization
I have two dataframes that have 300 columns and 1000 rows each. They have the same column names. The values are of mixed datatypes like Struct/List/Timestamp/String/etc. I am trying to compare the ...
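A hedged sketch of one way to compare wide, mixed-type frames: serialize each row to JSON and diff the hashes (df1/df2 stand in for the two dataframes).

    import org.apache.spark.sql.functions._

    // to_json flattens Struct/List/Timestamp values into one comparable
    // string per row; md5 keeps the shuffle narrow.
    val fp1 = df1.withColumn("fp", md5(to_json(struct(df1.columns.map(col): _*))))
    val fp2 = df2.withColumn("fp", md5(to_json(struct(df2.columns.map(col): _*))))
    val onlyInDf1 = fp1.select("fp").exceptAll(fp2.select("fp"))
    val onlyInDf2 = fp2.select("fp").exceptAll(fp1.select("fp"))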
2 votes
1 answer
1k views
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus can't be cast to org.apache.spark.sql.execution.datasources.FileStatusWithMetadata
Getting the following error while creating a Delta table using Scala-Spark. _delta_log is getting created at the warehouse, but it runs into this error after _delta_log creation: Exception in thread "...
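This cast failure is typically a Spark/Delta binary mismatch (FileStatusWithMetadata only exists in newer Spark releases), so a hedged first check is the dependency pairing; the versions below are illustrative, not prescriptive.

    // build.sbt -- spark-sql and delta-spark must come from a compatible
    // pair (see the Delta/Spark compatibility matrix); mixing, say, Spark
    // 3.4 jars with a Delta build targeting 3.5 produces this kind of
    // ClassCastException at runtime.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"   % "3.5.1",
      "io.delta"         %% "delta-spark" % "3.2.0"
    )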
1 vote
3 answers
1k views
How to set up and run Scala-Spark in IntelliJ?
I am trying to use IntelliJ to build Spark applications written in Scala. I get the following error when I execute the Scala program: Exception in thread "main" java.lang....
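A minimal build.sbt sketch for running Spark from IntelliJ; versions are illustrative. If Spark is marked provided for cluster packaging, enable "Include dependencies with 'Provided' scope" in the run configuration, or the JVM fails at launch with NoClassDefFoundError.

    // build.sbt -- minimal sketch; align the Scala minor version with the
    // one your Spark build was compiled against.
    ThisBuild / scalaVersion := "2.12.18"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "3.5.1",
      "org.apache.spark" %% "spark-sql"  % "3.5.1"
    )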
0 votes
0 answers
79 views
Scala Spark distributed run on Google Cloud Platform, but workers are not working
I'm a newbie to Scala Spark programming. I have to build a Recommendation System for movies in Scala Spark using Google Cloud Platform. The dataset is composed of (movie_id, user_id, rating) ...
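One common cause, offered only as a hedged guess: a master hardcoded to local[*] keeps all work on the driver, so the cluster workers sit idle. Sketch:

    import org.apache.spark.sql.SparkSession

    // Leave the master unset in code and let spark-submit / Dataproc supply
    // it (yarn); hardcoding .master("local[*]") runs everything on the driver.
    val spark = SparkSession.builder()
      .appName("movie-recommender") // placeholder name
      .getOrCreate()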
0 votes
1 answer
73 views
Explode nested list of objects into DataFrame in Spark
I have a dataframe that looks like this:
| Column                       |
|------------------------------|
| [{a: 2, b: 4}, {a: 2, b: 3}] |
|-------...
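A minimal sketch, assuming the column is an array of structs with fields a and b as shown:

    import org.apache.spark.sql.functions._

    // One row per array element; the struct fields then project to columns.
    val exploded = df
      .select(explode(col("Column")).as("item"))
      .select(col("item.a").as("a"), col("item.b").as("b"))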
-1 votes
3 answers
111 views
Access newly created column in withColumn
I have the following dataset:
+-----+
|value|
+-----+
|    1|
|    2|
|    3|
I want to create a new column newValue that takes the value of newValue from the previous row and does something with it. For ...
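For context, a column cannot read its own previous row inside one withColumn, and lag() only sees columns that already exist; when the "something" is associative, a running aggregate is the usual workaround. A sketch with a running sum standing in for the unspecified operation:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Running aggregate over an explicit ordering; truly recursive logic
    // (each row depending arbitrarily on the previous newValue) needs a
    // sequential fold outside pure SQL.
    val w = Window.orderBy("value").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    val out = df.withColumn("newValue", sum(col("value")).over(w))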
1 vote
2 answers
136 views
Spark Array column - Find max interval between two values
I have a Scala Spark dataframe with the schema:
root
 |-- passengerId: string (nullable = true)
 |-- travelHist: array (nullable = true)
 |    |-- element: integer (containsNull = true)
...
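Reading "max interval" as the largest gap between consecutive occurrences of a target value in travelHist (an assumption), a posexplode sketch:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val target = 1 // placeholder value to search for
    val w = Window.partitionBy("passengerId").orderBy("pos")
    val maxGap = df
      .select(col("passengerId"), posexplode(col("travelHist")).as(Seq("pos", "v")))
      .where(col("v") === target)
      .withColumn("gap", col("pos") - lag("pos", 1).over(w))
      .groupBy("passengerId")
      .agg(max("gap").as("maxInterval"))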
-1 votes
2 answers
127 views
Divide a column value into multiple rows by number of months based on start date & end date columns
I want to split the quantity value into multiple rows, one for each month between the start date and end date columns. Each row should carry the start and end date of its month. I also want ...
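A sketch built on sequence() + explode, assuming columns named startDate, endDate and quantity; the even split and month boundaries are illustrative choices.

    import org.apache.spark.sql.functions._

    val monthly = df
      // One entry per calendar month touched by the [startDate, endDate] span.
      .withColumn("months",
        expr("sequence(trunc(startDate, 'MM'), endDate, interval 1 month)"))
      .withColumn("monthlyQty", col("quantity") / size(col("months")))
      .withColumn("monthStart", explode(col("months")))
      .withColumn("monthEnd", least(last_day(col("monthStart")), col("endDate")))
      .drop("months")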
0 votes
1 answer
2k views
Scala - Create DataFrame with only 1 row from a List using a for comprehension
For some weird reason I need to get the column names of a dataframe and insert them as the first row (I cannot just import without a header). I tried using a for comprehension to create a dataframe that ...
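A sketch of the direct route, no for comprehension needed: build a one-row DataFrame from the column names and union it on top (everything cast to string so the schemas line up).

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val header = df.columns.toSeq
    val schema = StructType(header.map(StructField(_, StringType)))
    // Row.fromSeq turns the column names themselves into the single data row.
    val headerRow = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row.fromSeq(header))), schema)
    val withHeader = headerRow.union(df.select(header.map(c => col(c).cast(StringType)): _*))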
1 vote
1 answer
609 views
How to improve spark filter() performance on an array of struct?
I am working on a Spark project and have a performance issue that I am struggling with; any help would be appreciated. I have a column Collection that is an array of struct:
root
 |-- Collection: ...
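A hedged sketch: higher-order functions (Spark 3.0+) evaluate inside the engine and usually beat a deserializing UDF on arrays of structs; the field name status is a placeholder.

    import org.apache.spark.sql.functions._

    // filter() keeps matching elements without round-tripping the array
    // through Scala objects the way a UDF does.
    val kept = df.withColumn("matched",
      filter(col("Collection"), item => item.getField("status") === "ACTIVE"))
    // exists() is cheaper still when only a per-row boolean is needed.
    val flagged = df.withColumn("hasActive",
      exists(col("Collection"), item => item.getField("status") === "ACTIVE"))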
0 votes
1 answer
211 views
Spark: extract values from JSON struct
I have a Spark dataframe column (custHeader) in the format below, and I want to extract the value of the key phone into a separate column. I am trying to use the from_json function, but it is giving me a ...
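A sketch assuming custHeader is flat key/value JSON; a MapType schema spares enumerating every field, and get_json_object is the schema-free alternative.

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.{MapType, StringType}

    val parsed = df
      .withColumn("hdr", from_json(col("custHeader"), MapType(StringType, StringType)))
      .withColumn("phone", col("hdr").getItem("phone"))
    // Schema-free alternative for a single key:
    // df.withColumn("phone", get_json_object(col("custHeader"), "$.phone"))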