Looking at the new spark DataFrame API, it is unclear whether it is possible to modify dataframe columns.
How would I go about changing a value in row x column y of a dataframe?
In pandas this would be:
df.ix[x,y] = new_value Edit: Consolidating what was said below, you can't modify the existing dataframe as it is immutable, but you can return a new dataframe with the desired modifications.
If you just want to replace a value in a column based on a condition, like np.where:
from pyspark.sql import functions as F update_func = (F.when(F.col('update_col') == replace_val, new_value) .otherwise(F.col('update_col'))) df = df.withColumn('new_column_name', update_func) If you want to perform some operation on a column and create a new column that is added to the dataframe:
import pyspark.sql.functions as F import pyspark.sql.types as T def my_func(col): do stuff to column here return transformed_value # if we assume that my_func returns a string my_udf = F.UserDefinedFunction(my_func, T.StringType()) df = df.withColumn('new_column_name', my_udf('update_col')) If you want the new column to have the same name as the old column, you could add the additional step:
df = df.drop('update_col').withColumnRenamed('new_column_name', 'update_col')