
I'm new to PySpark.

I want to do some column transforms.

My dataframe:

    import pandas as pd

    df = pd.DataFrame([[10, 8, 9],
                       [ 3, 5, 4],
                       [ 1, 3, 9],
                       [ 1, 5, 3],
                       [ 2, 8, 10],
                       [ 8, 7, 9]], columns=list('ABC'))

df:

        A  B   C
    0  10  8   9
    1   3  5   4
    2   1  3   9
    3   1  5   3
    4   2  8  10
    5   8  7   9

In df, each row is a triangle; the columns 'ABC' hold the vertex indices of that triangle.

I want to get the dataframe of all the triangles' edges.

Under conditions:

  1. For each edge, the lesser vertex index always comes first.
  2. Remove duplicate edges.
  3. Edge [8, 9] and edge [9, 8] count as the same edge; only [8, 9] remains (lesser vertex index first).
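To make the three conditions concrete, here is a plain-Python sketch of the same normalization (variable names are mine): sort each edge so the lesser vertex comes first, then deduplicate with a set.

```python
# The six triangles from the question, one [A, B, C] vertex triple per row.
rows = [[10, 8, 9], [3, 5, 4], [1, 3, 9], [1, 5, 3], [2, 8, 10], [8, 7, 9]]

edges = set()
for a, b, c in rows:
    for u, v in ((a, b), (b, c), (c, a)):   # the three edges of each triangle
        edges.add((min(u, v), max(u, v)))   # lesser vertex index first

print(sorted(edges))
```

With the set handling deduplication, (8, 9) and (9, 8) collapse into the single edge (8, 9), which is exactly conditions 2 and 3.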

My desired dataframe edge_df:

    1   3
    1   5
    1   9
    2   8
    2  10
    3   4
    3   5
    3   9
    4   5
    7   8
    7   9
    8   9
    8  10
    9  10

I tried to union the column pairs 'AB', 'AC', 'BA', 'BC', 'CA', 'CB', then call distinct() and drop the rows whose lesser vertex index is in the right-hand column.
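For reference, the union-and-normalize idea described above can be sketched in pandas (the names `pairs` and `edge_df` are mine): stacking only the three distinct pairs and taking a row-wise min/max avoids building all six orderings.

```python
import pandas as pd

df = pd.DataFrame([[10, 8, 9], [3, 5, 4], [1, 3, 9],
                   [1, 5, 3], [2, 8, 10], [8, 7, 9]], columns=list('ABC'))

# Stack the three vertex pairs of every triangle into one two-column frame.
pairs = pd.concat(
    [df[[a, b]].set_axis(['u', 'v'], axis=1)
     for a, b in [('A', 'B'), ('B', 'C'), ('C', 'A')]],
    ignore_index=True)

# Lesser vertex first, then deduplicate.
edge_df = (pd.DataFrame({'u': pairs.min(axis=1), 'v': pairs.max(axis=1)})
           .drop_duplicates()
           .sort_values(['u', 'v'])
           .reset_index(drop=True))

print(edge_df)
```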

Is there any way more effective?

1 Answer


I think explode is a good fit here (note that df must be a Spark DataFrame, e.g. from spark.createDataFrame). The orderBy isn't strictly needed and adds extra work, but I included it to match the desired output.

    from pyspark.sql import functions as f

    df.select(f.explode(f.array(f.array_sort(f.array('A', 'B')),
                                f.array_sort(f.array('B', 'C')),
                                f.array_sort(f.array('C', 'A')))).alias('temp')) \
      .select(f.col('temp')[0].alias('a'), f.col('temp')[1].alias('b')) \
      .distinct().orderBy('a', 'b') \
      .show(truncate=False)

    +---+---+
    |a  |b  |
    +---+---+
    |1  |3  |
    |1  |5  |
    |1  |9  |
    |2  |8  |
    |2  |10 |
    |3  |4  |
    |3  |5  |
    |3  |9  |
    |4  |5  |
    |7  |8  |
    |7  |9  |
    |8  |9  |
    |8  |10 |
    |9  |10 |
    +---+---+