
I'm new to PySpark.

I want to do some column transforms.

My dataframe:

    import pandas as pd

    df = pd.DataFrame([[10, 8, 9],
                       [ 3, 5, 4],
                       [ 1, 3, 9],
                       [ 1, 5, 3],
                       [ 2, 8, 10],
                       [ 8, 7, 9]], columns=list('ABC'))

df:

        A  B   C
    0  10  8   9
    1   3  5   4
    2   1  3   9
    3   1  5   3
    4   2  8  10
    5   8  7   9

In df, each row is a triangle; the columns 'ABC' hold the vertex indices of that triangle.

I want to get the dataframe of all the triangles' edges.

Under conditions:

  1. For each edge, the lesser vertex index always comes first.
  2. Remove duplicate edges.
  3. Edge [8, 9] and edge [9, 8] count as the same edge; only [8, 9] remains (lesser vertex index first).
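To make the three conditions concrete, here is a plain-Python sketch of the same normalization (variable names are mine): sort each edge so the lesser vertex comes first, then deduplicate with a set.

```python
# The six triangles from the question, one [A, B, C] vertex triple per row.
rows = [[10, 8, 9], [3, 5, 4], [1, 3, 9], [1, 5, 3], [2, 8, 10], [8, 7, 9]]

edges = set()
for a, b, c in rows:
    for u, v in ((a, b), (b, c), (c, a)):   # the three edges of each triangle
        edges.add((min(u, v), max(u, v)))   # lesser vertex index first

print(sorted(edges))
```

With the set handling deduplication, (8, 9) and (9, 8) collapse into the single edge (8, 9), which is exactly conditions 2 and 3.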

My desired dataframe edge_df:

    1   3
    1   5
    1   9
    2   8
    2  10
    3   4
    3   5
    3   9
    4   5
    7   8
    7   9
    8   9
    8  10
    9  10

I tried to union the column pairs 'AB', 'AC', 'BA', 'BC', 'CA', 'CB', then call distinct() and drop the rows whose lesser vertex index is in the right-hand column.
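For reference, the union-and-normalize idea described above can be sketched in pandas (the names `pairs` and `edge_df` are mine): stacking only the three distinct pairs and taking a row-wise min/max avoids building all six orderings.

```python
import pandas as pd

df = pd.DataFrame([[10, 8, 9], [3, 5, 4], [1, 3, 9],
                   [1, 5, 3], [2, 8, 10], [8, 7, 9]], columns=list('ABC'))

# Stack the three vertex pairs of every triangle into one two-column frame.
pairs = pd.concat(
    [df[[a, b]].set_axis(['u', 'v'], axis=1)
     for a, b in [('A', 'B'), ('B', 'C'), ('C', 'A')]],
    ignore_index=True)

# Lesser vertex first, then deduplicate.
edge_df = (pd.DataFrame({'u': pairs.min(axis=1), 'v': pairs.max(axis=1)})
           .drop_duplicates()
           .sort_values(['u', 'v'])
           .reset_index(drop=True))

print(edge_df)
```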

Is there any way more effective?

1 Answer


I think explode is a good fit here (note that df must be a Spark DataFrame, e.g. from spark.createDataFrame). The orderBy isn't strictly needed and adds extra work, but I included it to match the desired output.

    from pyspark.sql import functions as f

    df.select(f.explode(f.array(f.array_sort(f.array('A', 'B')),
                                f.array_sort(f.array('B', 'C')),
                                f.array_sort(f.array('C', 'A')))).alias('temp')) \
      .select(f.col('temp')[0].alias('a'), f.col('temp')[1].alias('b')) \
      .distinct().orderBy('a', 'b') \
      .show(truncate=False)

    +---+---+
    |a  |b  |
    +---+---+
    |1  |3  |
    |1  |5  |
    |1  |9  |
    |2  |8  |
    |2  |10 |
    |3  |4  |
    |3  |5  |
    |3  |9  |
    |4  |5  |
    |7  |8  |
    |7  |9  |
    |8  |9  |
    |8  |10 |
    |9  |10 |
    +---+---+