
I have the following code that cleans a corpus of documents (pipelineClean(corpus)) and returns a DataFrame with two columns:

  • "id": Long
  • "tokens": Array[String].

After that, the code produces a DataFrame with the following columns:

  • "term": String
  • "postingList": List[Array[Long, Long]] (the first long is the documented the other the term frequency inside that document)

pipelineClean(corpus)
  .select($"id" as "documentId", explode($"tokens") as "term") // explode creates a new row for each element in the given array column
  .groupBy("term", "documentId").count // group and count rows per (term, document) pair, i.e. the term frequency
  .where($"term" =!= "") // seems like there are some tokens that are empty, even though Tokenizer should remove them
  .withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq}
  .select("term", "posting")
  .groupBy("term").agg(collect_list($"posting") as "postingList") // group again to collect the postings into a list
  .orderBy("term")
  .persist(StorageLevel.MEMORY_ONLY_SER)

My question is: would it be possible to make this code shorter and/or more efficient? For example, is it possible to do the grouping within a single groupBy?

1 Answer

It doesn't look like you can do much more than what you've got apart from skipping the withColumn call and using a straight select:

.select(col("term"), struct(col("documentId"), col("count")) as "posting") 

instead of

.withColumn("posting", struct($"documentId", $"count")) // merge columns as a single {docId, termFreq} .select("term", "posting") 