
I have a DataFrame defined as:

    import pandas as pd

    ndf = pd.DataFrame({'a': [False, False, True, True, False],
                        'b': [False, False, False, False, True]})
    ndf_s = sqlContext.createDataFrame(ndf)

I would like to add a new column named "action". If ndf['a'] is True, "action" should be "I am a"; if ndf['b'] is True, "action" should be "I am b". If both columns are True, it should be "I am a and b". Otherwise the value should be None. In other words, I would like to get a DataFrame like this:

    ndf_result = sqlContext.createDataFrame(pd.DataFrame({
        'a': [False, False, True, True, False],
        'b': [False, False, False, False, True],
        'action': [None, None, 'I am a', 'I am a', 'I am b']}))

  • Is there any chance both columns are True? Commented Sep 8, 2017 at 15:44
  • Possibly. In that case, set "action" to "I am a and b". Commented Sep 8, 2017 at 15:49

2 Answers


You can use when.otherwise:

    import pyspark.sql.functions as F

    ndf_s.withColumn("action",
        F.when(ndf_s["a"] & ndf_s["b"], "I am a and b")
         .otherwise(
            F.when(ndf_s["a"], "I am a")
             .otherwise(F.when(ndf_s["b"], "I am b"))
        )
    ).show()
    +-----+-----+------------+
    |    a|    b|      action|
    +-----+-----+------------+
    | true| true|I am a and b|
    |false|false|        null|
    | true|false|      I am a|
    | true|false|      I am a|
    |false| true|      I am b|
    +-----+-----+------------+
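The order of the branches matters: the first condition that is true decides the value, so the combined check has to come before the single-column checks. Here is a plain-Python sketch of that first-match behaviour. It is not Spark code, and `first_match`, `action_combined_first` and `action_combined_last` are hypothetical names used only for illustration:

```python
def first_match(branches, default=None):
    """Return the value of the first (condition, value) pair whose condition is true."""
    for condition, value in branches:
        if condition:
            return value
    # No branch matched: mirrors the null produced when no condition holds.
    return default


def action_combined_first(a, b):
    # Combined check first, as in the answer above.
    return first_match([(a and b, "I am a and b"), (a, "I am a"), (b, "I am b")])


def action_combined_last(a, b):
    # Combined check last: it can never win, because `a` alone already matches.
    return first_match([(a, "I am a"), (b, "I am b"), (a and b, "I am a and b")])


for a, b in [(False, False), (True, False), (False, True), (True, True)]:
    print(a, b, action_combined_first(a, b), action_combined_last(a, b))
```

With both inputs True, the second ordering falls into the "I am a" branch and the combined label is never reached.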

Another option with udf:

    import pyspark.sql.functions as F

    @F.udf
    def action(col_a, col_b):
        if col_a and col_b:
            return "I am a and b"
        elif col_a:
            return "I am a"
        elif col_b:
            return "I am b"

    ndf_s.withColumn("action", action(ndf_s["a"], ndf_s["b"])).show()
    +-----+-----+------------+
    |    a|    b|      action|
    +-----+-----+------------+
    | true| true|I am a and b|
    |false|false|        null|
    | true|false|      I am a|
    | true|false|      I am a|
    |false| true|      I am b|
    +-----+-----+------------+

2 Comments

Hi @Psidom, thank you for your nice solution! Is there any chance to use "udf" to achieve this result?
You can use a udf for this if the condition is complicated; I've updated the answer with a udf option.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    import pandas as pd

    ndf = pd.DataFrame({'a': [False, False, True, True, False],
                        'b': [False, False, False, False, True]})
    ndf_s = sqlContext.createDataFrame(ndf)

    def get_expected_string(a, b):
        if a and b:
            return "I am a and b"
        elif a:
            return "I am a"
        elif b:
            return "I am b"
        else:
            return None

    # wrap get_expected_string as a udf that returns a string column
    get_expected_string_udf = udf(get_expected_string, StringType())

    ndf_s = ndf_s.withColumn("action", get_expected_string_udf("a", "b"))
    ndf_s.show()
    +-----+-----+------------+
    |    a|    b|      action|
    +-----+-----+------------+
    | true| true|I am a and b|
    |false|false|        null|
    | true|false|      I am a|
    | true|false|      I am a|
    |false| true|      I am b|
    +-----+-----+------------+
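Since the function wrapped by the udf is ordinary Python, its branching can be checked without a Spark session. A minimal sketch, assuming the same `get_expected_string` logic as above; the `expected` table is written out here only for this check:

```python
def get_expected_string(a, b):
    """Return the "action" label for a pair of boolean flags."""
    if a and b:
        return "I am a and b"
    elif a:
        return "I am a"
    elif b:
        return "I am b"
    else:
        return None


# All four input combinations and the label each should produce.
expected = {
    (False, False): None,
    (True, False): "I am a",
    (False, True): "I am b",
    (True, True): "I am a and b",
}

for (a, b), label in expected.items():
    assert get_expected_string(a, b) == label
    print(a, b, get_expected_string(a, b))
```

This only exercises the Python function; it does not check how Spark passes null values into a udf.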
