I have a DataFrame as shown below.

    +------+-------------+------+-----+
    |NUM_ID|         TIME|SIGNAL|VALUE|
    +------+-------------+------+-----+
    |XXXX01|1571634079547|  SIG1|78860|
    |XXXX01|1571634090000|  SIG1|25.73|
    |XXXX01|1571634042000|  SIG1|25.73|
    |XXXX01|1571634050000|  SIG1|25.73|
    |XXXX01|1571634050000|  SIG2|25.73|
    |XXXX01|1571634066000|  SIG2|25.73|
    |XXXX01|1571634074000|  SIG2|25.73|
    |XXXX01|1571634090000|  SIG3|25.73|
    |XXXX02|1571634088000|  SIG1|25.73|
    |XXXX02|1571634040000|  SIG1|25.73|
    |XXXX02|1571634048000|  SIG1|25.73|
    |XXXX02|1571634056000|  SIG1|25.73|
    |XXXX02|1571634088000|  SIG2|25.73|
    |XXXX02|1571634072000|  SIG2|25.73|
    |XXXX02|1571634080000|  SIG2|25.73|
    |XXXX02|1571634088000|  SIG3|25.73|
    |XXXX02|1571634094000|  SIG3|25.73|
    |XXXX02|1571634038000|  SIG3|25.73|
    |XXXX03|1571634046000|  SIG1|25.73|
    |XXXX03|1571634054000|  SIG1|25.73|
    |XXXX03|1571634062000|  SIG1|25.73|
    |XXXX03|1571634070000|  SIG1|25.73|
    |XXXX03|1571634078000|  SIG2|25.73|
    |XXXX03|1571634092000|  SIG2|25.73|
    |XXXX03|1571634036000|  SIG2|25.73|
    |XXXX03|1571634044000|  SIG3|25.73|
    |XXXX03|1571634052000|  SIG3|25.73|
    |XXXX03|1571634060000|  SIG3|25.73|
    +------+-------------+------+-----+

I want to turn each distinct SIGx value in the existing SIGNAL column into its own column, with the corresponding VALUE as the cell contents.

The output should be as shown below.

    +------+-------------+-----+-----+-----+
    |NUM_ID|         TIME| SIG1| SIG2| SIG3|
    +------+-------------+-----+-----+-----+
    |XXXX01|1571634079547|78860| null| null|
    |XXXX01|1571634090000|25.73| null|25.73|
    |XXXX01|1571634042000|25.73| null| null|
    |XXXX01|1571634050000|25.73|25.73| null|
    |XXXX01|1571634066000| null|25.73| null|
    |XXXX01|1571634074000| null|25.73| null|
    |XXXX02|1571634088000|25.73|25.73|25.73|
    |XXXX02|1571634040000|25.73| null| null|
    |XXXX02|1571634048000|25.73| null| null|
    |XXXX02|1571634056000|25.73| null| null|
    |XXXX02|1571634072000| null|25.73| null|
    |XXXX02|1571634080000| null|25.73| null|
    |XXXX02|1571634094000| null| null|25.73|
    |XXXX02|1571634038000| null| null|25.73|
    |      |             |     |     |     |
    +------+-------------+-----+-----+-----+

The VALUEs of different SIGx signals that share the same NUM_ID and TIME should end up in the same row.

Is there any way to achieve this? I tried the pivot function, but it does not work as expected when a pivoted column has multiple values.

Any leads appreciated. Thanks in advance!

1 Answer


You can group by "NUM_ID" and "TIME", pivot on "SIGNAL", and take the first "VALUE" in each group, as below.

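    import org.apache.spark.sql.functions.first

    // One row per (NUM_ID, TIME); one column per distinct SIGNAL value
    df.groupBy("NUM_ID", "TIME")
      .pivot("SIGNAL")
      .agg(first("VALUE"))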
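To make the result easier to check, here is a minimal sketch of the same approach run on a handful of rows copied from the question. It is written for spark-shell or a Scala script; the local master, the app name, and the names `df` and `pivoted` are assumptions made for illustration, and every column is kept as a string to match the schema posted in the comments.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

// In spark-shell a session already exists and getOrCreate() reuses it.
val spark = SparkSession.builder()
  .master("local[*]")               // assumption: local run, for illustration only
  .appName("pivot-signal-sketch")   // hypothetical app name
  .getOrCreate()
import spark.implicits._

// A few rows from the question; all columns are strings.
val df = Seq(
  ("XXXX01", "1571634079547", "SIG1", "78860"),
  ("XXXX01", "1571634090000", "SIG1", "25.73"),
  ("XXXX01", "1571634090000", "SIG3", "25.73"),
  ("XXXX01", "1571634050000", "SIG1", "25.73"),
  ("XXXX01", "1571634050000", "SIG2", "25.73"),
  ("XXXX02", "1571634088000", "SIG1", "25.73"),
  ("XXXX02", "1571634088000", "SIG2", "25.73"),
  ("XXXX02", "1571634088000", "SIG3", "25.73")
).toDF("NUM_ID", "TIME", "SIGNAL", "VALUE")

// One output row per (NUM_ID, TIME); one column per distinct SIGNAL.
// first() works on string columns, so no cast is needed.
val pivoted = df
  .groupBy("NUM_ID", "TIME")
  .pivot("SIGNAL")
  .agg(first("VALUE"))

pivoted.orderBy("NUM_ID", "TIME").show()
```

Signals that have no row for a given (NUM_ID, TIME) pair come out as null. If the set of signals is known up front, it can also be passed explicitly, for example `.pivot("SIGNAL", Seq("SIG1", "SIG2", "SIG3"))`, which spares Spark the extra pass it otherwise needs to discover the distinct values. Note that `first` keeps only one value when several rows share the same NUM_ID, TIME and SIGNAL, and which one it keeps is not guaranteed.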

Hope this helps!


5 Comments

I tried this but got an error: org.apache.spark.sql.AnalysisException: "VALUE" is not a numeric column. Aggregation function can only be applied on a numeric column.; at org.apache.spark.sql.RelationalGroupedDataset$$anonfun$3.apply(RelationalGroupedDataset.scala:103 The VALUE column is of string type. It holds both DOUBLE and BIGINT values, so casting it to one particular type is not possible either. - @Shankar Koirala
Can you provide the schema of the DataFrame?
    scala> DF.printSchema
    root
     |-- NUM_ID: string (nullable = true)
     |-- TIME: string (nullable = true)
     |-- SIGNAL: string (nullable = true)
     |-- VALUE: string (nullable = true)
I tried it without agg, as df.groupBy("NUM_ID", "TIME").pivot("SIGNAL"). But how can we see the data after the pivot? show does not work because it is not a member of RelationalGroupedDataset. - @Shankar Koirala
A pivot after groupBy must always be followed by an aggregation, such as .agg(); only then do you get a DataFrame back.
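The two points raised in these comments can be sketched together. The "is not a numeric column" message is the one Spark's name-based shortcuts on a grouped dataset (`sum`, `avg`, `mean`, `max`, `min` taking column names) raise for a string column; that the commenter called one of them is an assumption, since the failing code was not posted. The sample rows, the local master, and the names `grouped`, `shortcut` and `pivoted` are likewise made up for illustration.

```scala
import org.apache.spark.sql.{RelationalGroupedDataset, SparkSession}
import org.apache.spark.sql.functions.first
import scala.util.Try

// In spark-shell a session already exists and getOrCreate() reuses it.
val spark = SparkSession.builder()
  .master("local[*]")            // assumption: local run, for illustration only
  .appName("pivot-agg-sketch")   // hypothetical app name
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("XXXX01", "1571634079547", "SIG1", "78860"),
  ("XXXX01", "1571634050000", "SIG1", "25.73"),
  ("XXXX01", "1571634050000", "SIG2", "25.73")
).toDF("NUM_ID", "TIME", "SIGNAL", "VALUE")

// pivot() alone returns a RelationalGroupedDataset, which has no show():
// nothing can be displayed until an aggregation turns it into a DataFrame.
val grouped: RelationalGroupedDataset =
  df.groupBy("NUM_ID", "TIME").pivot("SIGNAL")

// Name-based shortcut on the string column VALUE: rejected during analysis.
val shortcut = Try(grouped.max("VALUE"))
println(s"max(\"VALUE\") failed: ${shortcut.isFailure}")

// Column-based aggregate: accepts strings, so mixed DOUBLE/BIGINT text
// is carried through unchanged and no cast is required.
val pivoted = grouped.agg(first("VALUE"))
pivoted.show()
```

Keeping VALUE as a string sidesteps the mixed-type problem. If numeric columns are needed later, each SIGx column could be cast on its own after the pivot, provided every signal carries a single type.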
