Updating a dataframe column in spark

Question

Looking at the new spark DataFrame API, it is unclear whether it is possible to modify dataframe columns.

How would I go about changing a value in row x column y of a dataframe?

In pandas this would be:

df.ix[x,y] = new_value

Edit: Consolidating what was said below, you can't modify the existing dataframe as it is immutable, but you can return a new dataframe with the desired modifications.

If you just want to replace a value in a column based on a condition, like np.where:

from pyspark.sql import functions as F

update_func = (F.when(F.col('update_col') == replace_val, new_value)
                .otherwise(F.col('update_col')))
df = df.withColumn('new_column_name', update_func)

If you want to perform some operation on a column and create a new column that is added to the dataframe:

import pyspark.sql.functions as F
import pyspark.sql.types as T

def my_func(col):
    # do stuff to column here
    return transformed_value

# if we assume that my_func returns a string
my_udf = F.udf(my_func, T.StringType())
df = df.withColumn('new_column_name', my_udf('update_col'))

If you want the new column to have the same name as the old column, you could add the additional step:

df = df.drop('update_col').withColumnRenamed('new_column_name', 'update_col')

If you want to access the DataFrame by index, you need to build an index first. See, e.g., stackoverflow.com/questions/26828815/…. Or add an index column with your own index.
– fanfabbb, Commented Mar 31, 2015 at 9:38

karlson · Accepted Answer · 2017-02-21 22:02:49Z

80

While you cannot modify a column as such, you may operate on a column and return a new DataFrame reflecting that change. For that you'd first create a UserDefinedFunction implementing the operation to apply and then selectively apply that function to the targeted column only. In Python:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

name = 'target_column'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())
new_df = old_df.select(*[
    udf(column).alias(name) if column == name else column
    for column in old_df.columns
])

new_df now has the same schema as old_df (assuming that old_df.target_column was of type StringType as well) but all values in column target_column will be new_value.

edited Feb 21, 2017 at 22:02

answered Mar 25, 2015 at 13:35

karlson


11 Comments

fanfabbb Over a year ago

This is an actual answer to the problem, thanks! Yet the Spark jobs don't finish for me; all executors get lost. Can you think of an alternative way? I use it with a slightly more complex UDF where I transform strings. Is there no pandas-like syntax such as new_df = old_df.col1.apply(lambda x: func(x))?

fanfabbb Over a year ago

there is also: new_df = old_df.withColumn('target_column', udf(df.name))

karlson Over a year ago

Yes, that should work fine. Keep in mind that UDFs can only take columns as parameters. If you want to pass other data into the function you have to partially apply it first.
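
A minimal sketch of what "partially apply it first" could look like, assuming functools.partial is used to bind the extra data. The names replace_if_in, replace_bad and with_replaced are hypothetical and only serve the illustration; they are not part of the original answer.

```python
from functools import partial


def replace_if_in(value, targets, replacement):
    """Return `replacement` when `value` is one of `targets`, else `value` unchanged."""
    return replacement if value in targets else value


# Bind the non-column arguments up front. The resulting callable takes a
# single value, which is the shape a UDF over one column expects.
replace_bad = partial(replace_if_in, targets={"n/a", "unknown"}, replacement="missing")


def with_replaced(df, column):
    """Return a new DataFrame in which `column` is rewritten by `replace_bad`."""
    # Imported here so the plain-Python part above can be tried without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    replace_udf = udf(replace_bad, StringType())
    return df.withColumn(column, replace_udf(df[column]))
```

Only the column is passed when the UDF is called; the set of targets and the replacement travel inside the partially applied function.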

karlson Over a year ago

@KatyaHandler If you just want to duplicate a column, one way to do so would be to simply select it twice: df.select([df[col], df[col].alias('same_column')]), where col is the name of the column you want to duplicate. With the latest Spark release, a lot of the stuff I've used UDFs for can be done with the functions defined in pyspark.sql.functions. UDF performance in Pyspark is really poor, so that might really be worth looking into: spark.apache.org/docs/latest/api/python/…

Namit Juneja Over a year ago

it is StringType not Stringtype in udf = UserDefinedFunction(lambda x: 'new_value', Stringtype())


Paul · Accepted Answer · 2015-12-21 22:23:26Z

Commonly when updating a column, we want to map an old value to a new value. Here's a way to do that in pyspark without UDFs:

# update df[update_col], mapping old_value --> new_value
from pyspark.sql import functions as F

df = df.withColumn(update_col,
    F.when(df[update_col] == old_value, new_value)
     .otherwise(df[update_col]))

How do I use this when update_col is a list, e.g. update_cols = ['col1', 'col2', 'col3']?
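
One possible way to handle the list case, sketched under the assumption that the same old_value to new_value mapping applies to every listed column. The helper name update_columns is hypothetical; it simply repeats the withColumn/when pattern from the answer once per column.

```python
from functools import reduce


def update_columns(df, update_cols, old_value, new_value):
    """Return a new DataFrame with old_value replaced by new_value in each listed column."""

    def update_one(acc_df, col_name):
        # Imported here so the helper can be defined without Spark installed.
        from pyspark.sql import functions as F

        return acc_df.withColumn(
            col_name,
            F.when(acc_df[col_name] == old_value, new_value)
             .otherwise(acc_df[col_name]))

    # Thread the DataFrame through one withColumn call per column name.
    return reduce(update_one, update_cols, df)
```

Usage would be df = update_columns(df, ['col1', 'col2', 'col3'], old_value, new_value). For a large number of columns, a single select that builds all the expressions at once may be preferable to many chained withColumn calls.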

Jacek Laskowski · Accepted Answer · 2016-02-24 21:56:18Z

16

DataFrames are based on RDDs. RDDs are immutable structures and do not allow updating elements in place. To change values, you will need to create a new DataFrame by transforming the original one, using either the SQL-like DSL or RDD operations like map.

A highly recommended slide deck: Introducing DataFrames in Spark for Large Scale Data Science.

edited Feb 24, 2016 at 21:56

Jacek Laskowski


answered Mar 17, 2015 at 21:51

maasg


2 Comments

Luke Over a year ago

What exactly is the dataframe abstraction adding then that couldn't already be done in the same amount of lines with a table?

maasg Over a year ago

"DataFrames introduce new simplified operators for filtering, aggregating, and projecting over large datasets. Internally, DataFrames leverage the Spark SQL logical optimizer to intelligently plan the physical execution of operations to work well on large datasets" - databricks.com/blog/2015/03/13/announcing-spark-1-3.html

Community · Accepted Answer · 2017-05-23 11:33:15Z

Just as maasg says, you can create a new DataFrame from the result of a map applied to the old DataFrame. An example for a given DataFrame df with two columns:

val newDf = sqlContext.createDataFrame(
  df.map(row => Row(row.getInt(0) + SOMETHING, applySomeDef(row.getAs[Double]("y")))),
  df.schema)

Note that if the types of the columns change, you need to give it a correct schema instead of df.schema. Check out the API of org.apache.spark.sql.Row for available methods: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html

[Update] Or using UDFs in Scala:

import org.apache.spark.sql.functions._

val toLong = udf[Long, String](_.toLong)
val modifiedDf = df.withColumn("modifiedColumnName", toLong(df("columnName"))).drop("columnName")

and if the column name needs to stay the same you can rename it back:

modifiedDf.withColumnRenamed("modifiedColumnName", "columnName")

DHEERAJ · Accepted Answer · 2020-05-26 15:59:15Z

Import col and when from pyspark.sql.functions, then write a new DataFrame in which the string values of the fifth column ("string a", "string b", "string c") are replaced by integers (0, 1, 2):

from pyspark.sql.functions import col, when

data_frame_temp = data_frame.withColumn(
    "col_5",
    when(col("col_5") == "string a", 0)
    .when(col("col_5") == "string b", 1)
    .otherwise(2))
