I have the following DataFrame:
+-----------+----------+----------+
| some_id   | one_col  | other_col|
+-----------+----------+----------+
| xx1       |        11|       177|
| xx2       |      1613|      2000|
| xx4       |         0|     12473|
+-----------+----------+----------+

I need to add a new column based on a calculation over the first two columns: the percentage of one_col that is contained in other_col. For example, one_col=1 and other_col=10 should produce (1/10)*100 = 10%:
+-----------+----------+----------+--------------+
| some_id   | one_col  | other_col| percentage   |
+-----------+----------+----------+--------------+
| xx1       |        11|       177|          6.2 |
| xx3       |         1|        10|           10 |
| xx2       |      1613|      2000|         80.6 |
| xx4       |         0|     12473|            0 |
+-----------+----------+----------+--------------+

I know I would need to use a UDF for this, but how do I pass the column values into it so that the new column is filled with its output?
Some pseudo-code:
import pyspark
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType  # FloatType lives in types, not functions

df = load_my_df

def my_udf(val1, val2):
    return (val1 / val2) * 100

udf_percentage = udf(my_udf, FloatType())

df = df.withColumn('percentage', udf_percentage(# how?))

Thank you!
df = df.withColumn('percentage', udf_percentage(df.one_col, df.other_col))

or

df = df.withColumn('percentage', udf_percentage(df['one_col'], df['other_col']))
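That said, for simple arithmetic like this you don't strictly need a UDF: built-in column expressions are evaluated inside Spark's engine and avoid the Python serialization overhead a UDF incurs. A minimal sketch, assuming the same DataFrame and column names as above:

from pyspark.sql import functions as F

# Plain column arithmetic: Spark promotes the integer division to double.
# F.round(..., 1) matches the one-decimal output in the example table.
df = df.withColumn('percentage',
                   F.round(F.col('one_col') / F.col('other_col') * 100, 1))

One caveat: if other_col can ever be zero, the built-in division yields null for that row (under Spark's default, non-ANSI behaviour), whereas the Python UDF would raise a ZeroDivisionError, so guard val2 inside my_udf if that case can occur.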