In my PySpark job, I have a huge DataFrame with more than 6,000 columns in the following format:
    id_     a1    a2     a3    a4     a5     ...  a6250
    u1827s  True  False  True  False  False  ...  False
    ...

where the majority of the columns a1, a2, a3, ..., a6250 are of binary (boolean) type. I need to group this data by all these columns and count the number of distinct ids for each combination, e.g.
    df = df.groupby(list_of_cols).agg(F.countDistinct("id_"))

where list_of_cols = [a1, a2, ..., a6250]. When running this PySpark job, I get a java.lang.StackOverflowError. I am aware that I can increase the stack size (as per https://rangareddy.github.io/SparkStackOverflow/), but I'd prefer a more elegant solution that would also produce a more convenient output.
I have two ideas for what to do before grouping:

1. Encode the combination of the a1, a2, ..., a6250 columns into a single binary string column, i.e. a binary number with 6250 bits where the bit at position k encodes the True/False value of column a_k. In the example above the value would be "10100...0" (a1 is True, a2 is False, a3 is True, a4 is False, a5 is False, ..., a6250 is False; e.g. [True, True, False, True] --> "1101"). Then use this new column in the group by.

2. Collect these values into a boolean array, i.e. have a single column like array(True, False, True, False, False, ..., False), and group by that.

Which way is better: to increase the stack size and deal with 6,000+ columns, to use a single binary column, or to use an array of binary values?
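For reference, the encoding in idea 1 can be sketched in plain Python (this only illustrates the True/False-to-bit mapping, not the Spark implementation; the helper name encode_bits is made up for this sketch):

```python
def encode_bits(values):
    """Encode a sequence of booleans as a bit string: True -> '1', False -> '0'."""
    return "".join("1" if v else "0" for v in values)

# The example from idea 1:
print(encode_bits([True, True, False, True]))  # -> 1101
```

Inside Spark, I assume the same column could be built with something like F.concat(*[F.col(c).cast("int").cast("string") for c in list_of_cols]) (untested), which would let the groupby operate on one column instead of 6,000+.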