I have a function, `pipelineClean(corpus)`, that cleans a corpus of documents and returns a DataFrame with two columns:
- "id": Long
- "tokens": Array[String].
After that, the code below produces a DataFrame with the following columns:
- "term": String
- "postingList": Array[Struct[documentId: Long, count: Long]] (the first Long is the document id, the second is the term frequency within that document)
```scala
pipelineClean(corpus)
  .select($"id" as "documentId", explode($"tokens") as "term") // explode creates a new row for each element in the given array column
  .groupBy("term", "documentId").count() // group and count the number of rows per (term, documentId) group
  .where($"term" =!= "") // there seem to be some empty tokens, even though Tokenizer should have removed them
  .withColumn("posting", struct($"documentId", $"count")) // merge the two columns into a single {docId, termFreq} struct
  .select("term", "posting")
  .groupBy("term").agg(collect_list($"posting") as "postingList") // a second grouping, to collect the postings into a list
  .orderBy("term")
  .persist(StorageLevel.MEMORY_ONLY_SER)
```

My question is: would it be possible to make this code shorter and/or more efficient? For example, is it possible to do the grouping within a single `groupBy`?
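For context, here is a plain-Scala sketch (no Spark) of the same inverted-index construction, just to illustrate the output shape I am after; the input data is made up:

```scala
// Toy corpus: (documentId, tokens) pairs, analogous to pipelineClean's output.
val docs = Seq(
  (1L, Seq("apple", "banana", "apple")),
  (2L, Seq("banana"))
)

val postings: Map[String, Seq[(Long, Long)]] =
  docs
    .flatMap { case (docId, tokens) =>              // "explode": one (term, docId) row per token
      tokens.filter(_.nonEmpty).map(t => (t, docId))
    }
    .groupBy(identity).toSeq                        // first grouping: by (term, docId)
    .map { case ((term, docId), occs) =>            // term frequency = occurrences in that doc
      (term, (docId, occs.size.toLong))
    }
    .groupBy(_._1)                                  // second grouping: by term
    .map { case (term, xs) => term -> xs.map(_._2) } // collect the posting list
```

As in the Spark version, two groupings seem unavoidable here: one to count per-document term frequencies and one to collect the postings per term.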