53 questions
0 votes
0 answers
111 views
Delta Table to Iceberg metadata migration is failing
I am trying to migrate Delta Tables to Iceberg using Scala-Spark while keeping the data intact, following this: https://iceberg.apache.org/docs/1.4.3/delta-lake-migration/ Here is the sample code (...
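A minimal sketch of the snapshot action from the linked guide's iceberg-delta-lake module; the table identifier and Delta location below are placeholders, and the exact builder methods should be checked against the 1.4.3 API.

    import org.apache.iceberg.delta.DeltaLakeToIcebergMigrationSparkIntegration

    // Snapshot leaves the original Delta data files in place and builds
    // Iceberg metadata over them. Identifier and path are placeholders.
    DeltaLakeToIcebergMigrationSparkIntegration
      .snapshotDeltaLakeTable(spark, "spark_catalog.db.iceberg_tbl", "/warehouse/delta/my_table")
      .execute()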
0 votes
1 answer
40 views
Caching processed data in Spark so it is not reprocessed when a failed task restarts
We have a use-case wherein we need to cache certain data that has been processed so that Spark does not reprocess the same data in the event of task failures. So say we have a thousand Foo objects for ...
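One hedged approach: cache() alone is recomputed when an executor is lost, so for data that must not be reprocessed, a reliable checkpoint is the usual tool. A minimal sketch; the checkpoint path and the expensive step are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()
    // The checkpoint directory must be reliable storage (HDFS/S3/GCS),
    // not executor-local disk, to survive task and executor failures.
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

    val processed = spark.range(1000).selectExpr("id", "id * 2 AS expensiveResult")
    // checkpoint() materializes the result and cuts the lineage, so a
    // restarted task rereads the checkpoint instead of reprocessing.
    val stable = processed.checkpoint()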
1 vote
1 answer
64 views
Ignoring codec because it collides with previously generated codec
I am trying to register a custom codec (for map) like below: val session: CqlSession = CassandraConnector.apply(spark.sparkContext).openSession() val codecRegistry: MutableCodecRegistry = session....
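For the warning itself: the driver generates and caches a codec the first time it resolves one for a type, and a later register() for the same mapping is ignored. A sketch of registering before first use, where myMapCodec stands in for the custom codec:

    import com.datastax.oss.driver.api.core.CqlSession
    import com.datastax.oss.driver.api.core.`type`.codec.registry.MutableCodecRegistry
    import com.datastax.spark.connector.cql.CassandraConnector

    val session: CqlSession = CassandraConnector.apply(spark.sparkContext).openSession()
    val codecRegistry: MutableCodecRegistry =
      session.getContext.getCodecRegistry.asInstanceOf[MutableCodecRegistry]
    // Register before any statement touches the map column; once the driver
    // has generated a codec for that type, this call logs the "collides"
    // warning and keeps the original codec.
    codecRegistry.register(myMapCodec) // myMapCodec: your TypeCodec implementation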
0 votes
1 answer
52 views
Is there a way to partition/group data so that the sum of column values in each group is under a limit?
I want to partition/group rows so that each group's total size is <= limit. For example, if I have:
+--------+----------+
|      id|      size|
+--------+----------+
|       1|         3|
|       2|         6|
...
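A running-sum sketch of one common answer, assuming the id/size columns above and a df bound to the question's data; note the caveat in the comment.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val limit = 10 // placeholder cap
    val w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    // Bucket rows by where their running total falls. This is approximate:
    // a group can exceed the cap when one row straddles a bucket boundary,
    // so an exact cap needs a sequential pass (e.g. mapPartitions after sorting).
    val grouped = df
      .withColumn("runningSum", sum("size").over(w))
      .withColumn("groupId", floor((col("runningSum") - 1) / limit))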
0 votes
1 answer
43 views
Scala Spark query optimization
I have two dataframes that have 300 columns and 1000 rows each. They have the same column names. The values are of mixed datatypes like Struct/List/Timestamp/String/etc. I am trying to compare the ...
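A hedged sketch of one way to compare wide, mixed-type frames: serialize each row to JSON and diff the hashes (df1/df2 stand in for the two dataframes).

    import org.apache.spark.sql.functions._

    // to_json flattens Struct/List/Timestamp values into one comparable
    // string per row; md5 keeps the shuffle narrow.
    val fp1 = df1.withColumn("fp", md5(to_json(struct(df1.columns.map(col): _*))))
    val fp2 = df2.withColumn("fp", md5(to_json(struct(df2.columns.map(col): _*))))
    val onlyInDf1 = fp1.select("fp").exceptAll(fp2.select("fp"))
    val onlyInDf2 = fp2.select("fp").exceptAll(fp1.select("fp"))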
2 votes
1 answer
1k views
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus can't be cast to org.apache.spark.sql.execution.datasources.FileStatusWithMetadata
Getting the following error while creating a Delta table using Scala-Spark. _delta_log is getting created at the warehouse, but it runs into this error after _delta_log creation: Exception in thread "...
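This cast failure is typically a Spark/Delta binary mismatch (FileStatusWithMetadata only exists in newer Spark releases), so a hedged first check is the dependency pairing; the versions below are illustrative, not prescriptive.

    // build.sbt -- spark-sql and delta-spark must come from a compatible
    // pair (see the Delta/Spark compatibility matrix); mixing, say, Spark
    // 3.4 jars with a Delta build targeting 3.5 produces this kind of
    // ClassCastException at runtime.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"   % "3.5.1",
      "io.delta"         %% "delta-spark" % "3.2.0"
    )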
1 vote
3 answers
1k views
How to set up and run Scala-Spark in IntelliJ?
I am trying to use IntelliJ to build Spark applications written in Scala. I get the following error when I execute the Scala program: Exception in thread "main" java.lang....
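A minimal build.sbt sketch for running Spark from IntelliJ; versions are illustrative. If Spark is marked provided for cluster packaging, enable "Include dependencies with 'Provided' scope" in the run configuration, or the JVM fails at launch with NoClassDefFoundError.

    // build.sbt -- minimal sketch; align the Scala minor version with the
    // one your Spark build was compiled against.
    ThisBuild / scalaVersion := "2.12.18"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "3.5.1",
      "org.apache.spark" %% "spark-sql"  % "3.5.1"
    )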
0 votes
0 answers
79 views
Scala Spark distributed run on Google Cloud Platform, but workers are not working
I'm a newbie to Scala Spark programming. I have to build a Recommendation System for movies in Scala Spark using Google Cloud Platform. The dataset is composed of (movie_id, user_id, rating) ...
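One common cause, offered only as a hedged guess: a master hardcoded to local[*] keeps all work on the driver, so the cluster workers sit idle. Sketch:

    import org.apache.spark.sql.SparkSession

    // Leave the master unset in code and let spark-submit / Dataproc supply
    // it (yarn); hardcoding .master("local[*]") runs everything on the driver.
    val spark = SparkSession.builder()
      .appName("movie-recommender") // placeholder name
      .getOrCreate()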
0 votes
1 answer
73 views
Explode nested list of objects into DataFrame in Spark
I have a dataframe that looks like this:
| Column                       |
|------------------------------|
| [{a: 2, b: 4}, {a: 2, b: 3}] |
|-------...
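A minimal sketch, assuming the column is an array of structs with fields a and b as shown:

    import org.apache.spark.sql.functions._

    // One row per array element; the struct fields then project to columns.
    val exploded = df
      .select(explode(col("Column")).as("item"))
      .select(col("item.a").as("a"), col("item.b").as("b"))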
-1 votes
3 answers
111 views
Access newly created column in withColumn
I have the following dataset:
+-----+
|value|
+-----+
|    1|
|    2|
|    3|
I want to create a new column newValue that takes the value of newValue from the previous row and does something with it. For ...
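For context, a column cannot read its own previous row inside one withColumn, and lag() only sees columns that already exist; when the "something" is associative, a running aggregate is the usual workaround. A sketch with a running sum standing in for the unspecified operation:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Running aggregate over an explicit ordering; truly recursive logic
    // (each row depending arbitrarily on the previous newValue) needs a
    // sequential fold outside pure SQL.
    val w = Window.orderBy("value").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    val out = df.withColumn("newValue", sum(col("value")).over(w))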
1 vote
2 answers
136 views
Spark Array column - Find max interval between two values
I have a Scala Spark dataframe with the schema:
root
 |-- passengerId: string (nullable = true)
 |-- travelHist: array (nullable = true)
 |    |-- element: integer (containsNull = true)
...
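Reading "max interval" as the largest gap between consecutive occurrences of a target value in travelHist (an assumption), a posexplode sketch:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val target = 1 // placeholder value to search for
    val w = Window.partitionBy("passengerId").orderBy("pos")
    val maxGap = df
      .select(col("passengerId"), posexplode(col("travelHist")).as(Seq("pos", "v")))
      .where(col("v") === target)
      .withColumn("gap", col("pos") - lag("pos", 1).over(w))
      .groupBy("passengerId")
      .agg(max("gap").as("maxInterval"))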
-1 votes
2 answers
127 views
Divide a column value into multiple rows by number of months based on start date & end date columns
I want to split the quantity value into multiple rows, one for each month between the start date and end date columns. Each row should carry the start and end date of its month. I also want ...
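A sketch built on sequence() + explode, assuming columns named startDate, endDate and quantity; the even split and month boundaries are illustrative choices.

    import org.apache.spark.sql.functions._

    val monthly = df
      // One entry per calendar month touched by the [startDate, endDate] span.
      .withColumn("months",
        expr("sequence(trunc(startDate, 'MM'), endDate, interval 1 month)"))
      .withColumn("monthlyQty", col("quantity") / size(col("months")))
      .withColumn("monthStart", explode(col("months")))
      .withColumn("monthEnd", least(last_day(col("monthStart")), col("endDate")))
      .drop("months")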
0 votes
1 answer
2k views
Scala - Create DataFrame with only 1 row from a List using a for comprehension
For some weird reason I need to get the column names of a dataframe and insert them as the first row (I cannot just import without a header). I tried using a for comprehension to create a dataframe that ...
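A sketch of the direct route, no for comprehension needed: build a one-row DataFrame from the column names and union it on top (everything cast to string so the schemas line up).

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val header = df.columns.toSeq
    val schema = StructType(header.map(StructField(_, StringType)))
    // Row.fromSeq turns the column names themselves into the single data row.
    val headerRow = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row.fromSeq(header))), schema)
    val withHeader = headerRow.union(df.select(header.map(c => col(c).cast(StringType)): _*))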
1 vote
1 answer
609 views
How to improve spark filter() performance on an array of struct?
I am working on a Spark project and have a performance issue that I am struggling with; any help would be appreciated. I have a column Collection that is an array of struct:
root
 |-- Collection: ...
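A hedged sketch: higher-order functions (Spark 3.0+) evaluate inside the engine and usually beat a deserializing UDF on arrays of structs; the field name status is a placeholder.

    import org.apache.spark.sql.functions._

    // filter() keeps matching elements without round-tripping the array
    // through Scala objects the way a UDF does.
    val kept = df.withColumn("matched",
      filter(col("Collection"), item => item.getField("status") === "ACTIVE"))
    // exists() is cheaper still when only a per-row boolean is needed.
    val flagged = df.withColumn("hasActive",
      exists(col("Collection"), item => item.getField("status") === "ACTIVE"))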
0 votes
1 answer
211 views
Spark: extract values from JSON struct
I have a Spark dataframe column (custHeader) in the format below, and I want to extract the value of the key phone into a separate column. I am trying to use the from_json function, but it is giving me a ...
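A sketch assuming custHeader is flat key/value JSON; a MapType schema spares enumerating every field, and get_json_object is the schema-free alternative.

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.{MapType, StringType}

    val parsed = df
      .withColumn("hdr", from_json(col("custHeader"), MapType(StringType, StringType)))
      .withColumn("phone", col("hdr").getItem("phone"))
    // Schema-free alternative for a single key:
    // df.withColumn("phone", get_json_object(col("custHeader"), "$.phone"))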