I have the following DataFrame:
+-----------+----------+----------+
| some_id   | one_col  | other_col|
+-----------+----------+----------+
| xx1       |        11|       177|
| xx2       |      1613|      2000|
| xx4       |         0|     12473|
+-----------+----------+----------+

I need to add a new column based on a calculation over the first two columns: the percentage of one_col that is contained in other_col. For example, one_col=1 and other_col=10 should produce (1/10)*100 = 10%:
+-----------+----------+----------+--------------+
| some_id   | one_col  | other_col| percentage   |
+-----------+----------+----------+--------------+
| xx1       |        11|       177|          6.2 |
| xx3       |         1|        10|           10 |
| xx2       |      1613|      2000|         80.6 |
| xx4       |         0|     12473|            0 |
+-----------+----------+----------+--------------+

I know I would need to use a UDF for this, but how do I pass the column values into it so that the new column is filled with its output?
Some pseudo-code:
import pyspark
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType  # FloatType lives in types, not functions

df = load_my_df

def my_udf(val1, val2):
    return (val1 / val2) * 100

udf_percentage = udf(my_udf, FloatType())

df = df.withColumn('percentage', udf_percentage(# how?))

Thank you!
df = df.withColumn('percentage', udf_percentage(df.one_col, df.other_col))

or

df = df.withColumn('percentage', udf_percentage(df['one_col'], df['other_col']))
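That said, for simple arithmetic like this you don't strictly need a UDF: built-in column expressions are evaluated inside Spark's engine and avoid the Python serialization overhead a UDF incurs. A minimal sketch, assuming the same DataFrame and column names as above:

from pyspark.sql import functions as F

# Plain column arithmetic: Spark promotes the integer division to double.
# F.round(..., 1) matches the one-decimal output in the example table.
df = df.withColumn('percentage',
                   F.round(F.col('one_col') / F.col('other_col') * 100, 1))

One caveat: if other_col can ever be zero, the built-in division yields null for that row (under Spark's default, non-ANSI behaviour), whereas the Python UDF would raise a ZeroDivisionError, so guard val2 inside my_udf if that case can occur.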