I have the following DataFrame:

+-----------+----------+----------+
|  some_id  | one_col  | other_col|
+-----------+----------+----------+
|       xx1 |        11|       177|
|       xx2 |      1613|      2000|
|       xx4 |         0|     12473|
+-----------+----------+----------+

I need to add a new column computed from the first and second value columns. For example, for col1_value=1 and col2_value=10 it should hold the percentage that col1 is of col2, so col3_value = (1/10)*100 = 10%:

+-----------+----------+----------+--------------+
|  some_id  | one_col  | other_col|  percentage  |
+-----------+----------+----------+--------------+
|       xx1 |        11|       177|          6.2 |
|       xx3 |         1|        10|           10 |
|       xx2 |      1613|      2000|         80.6 |
|       xx4 |         0|     12473|            0 |
+-----------+----------+----------+--------------+

I know I would need to use a udf for this, but how do I directly add a new column value based on the outcome?

Some pseudo-code:

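import pyspark
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType  # needed for the UDF return type

df = load_my_df  # placeholder: however the DataFrame above gets loaded

def my_udf(val1, val2):
    return (val1 / val2) * 100

udf_percentage = udf(my_udf, FloatType())

df = df.withColumn('percentage', udf_percentage(# how?))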

Thank you!

df = df.withColumn('percentage', udf_percentage(df.one_col, df.other_col))
or
df = df.withColumn('percentage', udf_percentage(df['one_col'], df['other_col']))
Commented Apr 27, 2018 at 11:38

1 Answer

df.withColumn('percentage', udf_percentage("one_col", "other_col")) 

or

df.withColumn('percentage', udf_percentage(df["one_col"], df["other_col"])) 

or

df.withColumn('percentage', udf_percentage(df.one_col, df.other_col)) 

or

from pyspark.sql.functions import col
df.withColumn('percentage', udf_percentage(col("one_col"), col("other_col")))

but why not just:

df.withColumn('percentage', col("one_col") / col("other_col") * 100) 
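
One caveat if you do keep the UDF: a plain Python `val1 / val2` raises ZeroDivisionError on a zero denominator and TypeError on a null (None) input, whereas, as far as I know, the built-in column division returns null for a zero divisor under Spark's default (non-ANSI) settings. A minimal sketch of a guarded function follows; `safe_percentage` is a hypothetical name, not something from the question, and it runs without Spark so the logic can be checked on its own:

```python
def safe_percentage(val1, val2):
    """Return val1 as a percentage of val2, or None when that is undefined."""
    # Null inputs reach a Python UDF as None; a zero denominator has no percentage.
    if val1 is None or val2 is None or val2 == 0:
        return None
    # Convert explicitly so the result is always a Python float.
    return float(val1) / float(val2) * 100.0


print(safe_percentage(1, 10))      # 10.0
print(safe_percentage(0, 12473))   # 0.0
print(safe_percentage(5, 0))       # None
```

It should be possible to wrap it the same way as `my_udf`, i.e. `udf(safe_percentage, FloatType())`, with a returned None showing up as null in the new column.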

1 Comment

Slight typo in your math: you're multiplying the denominator by 100 instead of the numerator. Move the 100 * to the front of the second arg to get a percentage: 100 * col("one_col") / col("other_col")
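
On evaluation order: Python gives `/` and `*` the same precedence and groups them left to right, and Column objects reuse Python's operators, so `col("one_col") / col("other_col") * 100` is parsed as `(one_col / other_col) * 100`. Both spellings should therefore give the same percentage. A quick check with plain numbers from the first row of the question (no Spark needed; the variable names are only for illustration):

```python
import math

one_col, other_col = 11, 177  # row xx1 from the question

ratio_first = one_col / other_col * 100           # parsed as (11 / 177) * 100
scaled_first = 100 * one_col / other_col          # parsed as (100 * 11) / 177
denominator_scaled = one_col / (other_col * 100)  # needs explicit parentheses

print(math.isclose(ratio_first, scaled_first))  # True
print(round(ratio_first, 1))                    # 6.2
print(denominator_scaled < 0.001)               # True
```

Only the explicitly parenthesised form multiplies the denominator, so which of the first two to use is mostly a matter of readability.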
