
I have multiple database rows per personId, with columns that may or may not have values. I'm using colors here because the data is text, not numeric, so it doesn't lend itself to the built-in aggregation functions. A simplified example is

```
PersonId  ColA   ColB    ColC
100       red
100              green
100                      gold
100              green
110              yellow
110       white
110
120
etc...
```

I want to be able to decide, in a function, which column's data to use per unique PersonId. A three-way join of the table against itself would be a good solution if the data didn't have multiple values (colors) per column. E.g. that join merges three of the rows into one but still produces multiple rows:

```
PersonId  ColA   ColB    ColC
100       red    green   gold
100              green
110       white  yellow
110
120
```

So the solution I'm looking for is something that will let me address all the values (colors) for a person in one place (a function), so the decision can be made across all their data. The real data of course has more columns, but the primary ones for this decision are these three. The data is read in Scala Spark as a DataFrame, and I'd prefer using the API to SQL. I don't know whether any of the exotic window or groupBy functions will help, or if it's going to come down to plain old iterate-and-accumulate. The technique used in "How to aggregate values into collection after groupBy?" might be applicable, but it's a bit of a leap.
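For a concrete picture of what "all the values in one place" could look like, here is a minimal sketch in plain Scala collections (no Spark); `Row`, `collectPairs`, and `decide` are hypothetical names, not part of any API. It groups rows by personId, flattens each person's non-null (column, color) pairs into one sequence, and hands that whole sequence to a single decision function. The same shape carries over to a DataFrame `groupBy`.

```scala
// Plain-Scala sketch (no Spark); Row, collectPairs, and decide are
// hypothetical names introduced for illustration.
case class Row(personId: Int, colA: Option[String], colB: Option[String], colC: Option[String])

val rows = Seq(
  Row(100, Some("red"), None, None),
  Row(100, None, Some("green"), None),
  Row(100, None, None, Some("gold")),
  Row(110, None, Some("yellow"), None),
  Row(110, Some("white"), None, None)
)

// Every (column, color) pair a person has, gathered in one place.
def collectPairs(rows: Seq[Row]): Map[Int, Seq[(String, String)]] =
  rows.groupBy(_.personId).map { case (pid, rs) =>
    pid -> rs.flatMap { r =>
      Seq("ColA" -> r.colA, "ColB" -> r.colB, "ColC" -> r.colC)
        .collect { case (name, Some(color)) => name -> color }
    }
  }

// One example decision, made while seeing all of a person's data at once:
// keep the first color observed per column.
def decide(pairs: Seq[(String, String)]): Map[String, String] =
  pairs.groupBy(_._1).map { case (col, colors) => col -> colors.head._2 }

val decided = collectPairs(rows).map { case (pid, pairs) => pid -> decide(pairs) }
```

Because `decide` sees all of a person's (column, color) pairs together, any cross-column rule (e.g. "prefer ColA unless ColB has a value") can be expressed in that one function.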

  • I don't have time for a full answer, but have you looked at lag/lead or other window operations? Commented Jan 20, 2017 at 16:32
  • Why, in this example, is the second line for 100 green and the first red green gold, instead of the first being red and the second green green gold? Can you explain more about how you want to combine them? Commented Jan 20, 2017 at 16:54
  • That was just showing what the 3-way join might produce. I initially thought the 3-way (self) join would just return 1 row - which it does unless each column has different values in other rows. The real issue is simply coping with the data spread across multiple rows and columns. Commented Jan 20, 2017 at 17:22
  • I have mainly worked with RDDs in Java Spark, and in that environment, you can first do a 'flatMap' to map this RDD into a paired RDD where you have PersonId as the key and one other column that represents the colors associated with the key. Then you may 'aggregateByKey' and will get what you need. If you think this approach may help, I can elaborate. Commented Jan 20, 2017 at 18:32
  • Please do, this sounds very promising - although I need to have three pieces of information the PersonId, color and which column the color was in. Which color is chosen in which column is significant. Commented Jan 22, 2017 at 11:07
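As a rough sketch of the flatMap-then-aggregateByKey idea from the comments, here is the same shape on plain Scala collections (an RDD version would use `rdd.flatMap` and `aggregateByKey` analogously). The pair carries the column name alongside the color, which covers the "which column was the color in" requirement:

```scala
// Plain-Scala sketch of the paired-RDD approach from the comments.
// Each input row becomes zero or more (personId, (columnName, color)) pairs,
// so the column a color came from travels with it.
val rows: Seq[(Int, Option[String], Option[String], Option[String])] = Seq(
  (100, Some("red"), None, None),
  (100, None, Some("green"), None),
  (110, Some("white"), Some("yellow"), None)
)

val pairs = rows.flatMap { case (pid, a, b, c) =>
  Seq("ColA" -> a, "ColB" -> b, "ColC" -> c)
    .collect { case (col, Some(color)) => pid -> (col, color) }
}

// Aggregate by key: all of a person's (column, color) evidence in one Seq.
val byPerson: Map[Int, Seq[(String, String)]] =
  pairs.groupBy(_._1).map { case (pid, ps) => pid -> ps.map(_._2) }
```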

1 Answer


Consider using a custom UDF for this:

```scala
import org.apache.spark.sql.functions._

val df = Seq(
  (100, "red", null, null),
  (100, null, "white", null),
  (100, null, null, "green"),
  (200, null, "red", null)
).toDF("PID", "A", "B", "C")

df.show()
```

```
+---+----+-----+-----+
|PID|   A|    B|    C|
+---+----+-----+-----+
|100| red| null| null|
|100|null|white| null|
|100|null| null|green|
|200|null|  red| null|
+---+----+-----+-----+
```

```scala
// Keep the first non-empty value collected for a column, or null if none.
val customUDF = udf((array: Seq[String]) => {
  val newts = array.filter(_.nonEmpty)
  if (newts.isEmpty) null else newts.head
})

df.groupBy($"PID")
  .agg(
    customUDF(collect_set($"A")).as("colA"),
    customUDF(collect_set($"B")).as("colB"),
    customUDF(collect_set($"C")).as("colC"))
  .show
```

```
+---+----+-----+-----+
|PID|colA| colB| colC|
+---+----+-----+-----+
|100| red|white|green|
|200|null|  red| null|
+---+----+-----+-----+
```
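The heart of that UDF can be isolated and sanity-checked without Spark. One caveat: `collect_set` returns its values in no guaranteed order, so when a person has several distinct values in a column, `head` picks an arbitrary one. A plain-Scala version of the pick logic (`pickFirst` is a hypothetical name):

```scala
// Plain-Scala version of the answer's UDF body: first non-null, non-empty
// value, or null when there is none.
val pickFirst: Seq[String] => String = values => {
  val nonEmpty = values.filter(v => v != null && v.nonEmpty)
  if (nonEmpty.isEmpty) null else nonEmpty.head
}
```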
