PySpark UDF for multiple columns

Question

I have a DataFrame as follows:

import pandas as pd

ndf = pd.DataFrame({'a': [False, False, True, True, False],
                    'b': [False, False, False, False, True]})
ndf_s = sqlContext.createDataFrame(ndf)

I would like to add a new column named "action". If ndf['a'] is True, "action" should be "I am a"; if ndf['b'] is True, it should be "I am b"; if both columns are True, it should be "I am a and b". Otherwise the value should be None. In other words, I would like to get a DataFrame like this:

ndf_result = sqlContext.createDataFrame(pd.DataFrame({
    'a': [False, False, True, True, False],
    'b': [False, False, False, False, True],
    'action': [None, None, 'I am a', 'I am a', 'I am b'],
}))

Is there any chance both columns are True?

– akuiper, Commented Sep 8, 2017 at 15:44
Possible. In that case, set "action" to "I am a and b".

– KEXIN WANG, Commented Sep 8, 2017 at 15:49

akuiper · Accepted Answer · 2017-09-08 15:54:14Z

You can use when.otherwise:

import pyspark.sql.functions as F

ndf_s.withColumn("action",
    F.when(ndf_s["a"] & ndf_s["b"], "I am a and b").otherwise(
        F.when(ndf_s["a"], "I am a").otherwise(
            F.when(ndf_s["b"], "I am b")
        )
    )
).show()

+-----+-----+------------+
|    a|    b|      action|
+-----+-----+------------+
| true| true|I am a and b|
|false|false|        null|
| true|false|      I am a|
| true|false|      I am a|
|false| true|      I am b|
+-----+-----+------------+
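The order of the branches in the nested when/otherwise calls above matters: the first condition that holds decides the value, so the combined check on both columns has to come before the single-column checks. The following is only a plain-Python model of that first-match-wins behaviour, not Spark code; `first_match`, `both_first`, and `both_last` are hypothetical names introduced for illustration.

```python
# Plain-Python model of first-match-wins branching (not Spark API).
# Each rule is (predicate, label); the first predicate that holds decides
# the result, and None is returned when no rule matches.

def first_match(rules, a, b):
    for predicate, label in rules:
        if predicate(a, b):
            return label
    return None

# Combined check first, as in the answer above.
both_first = [
    (lambda a, b: a and b, "I am a and b"),
    (lambda a, b: a, "I am a"),
    (lambda a, b: b, "I am b"),
]

# Same rules, but with the combined check moved to the end.
both_last = both_first[1:] + both_first[:1]

for a, b in [(True, True), (True, False), (False, True), (False, False)]:
    print(a, b, first_match(both_first, a, b), first_match(both_last, a, b))
```

With the combined check last, the (True, True) row is labelled "I am a" because the single-column rule matches first. The same ordering concern applies if the nested calls are flattened into a chained F.when(...).when(...).when(...) expression.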

Another option with udf:

import pyspark.sql.functions as F

@F.udf
def action(col_a, col_b):
    if col_a and col_b:
        return "I am a and b"
    elif col_a:
        return "I am a"
    elif col_b:
        return "I am b"

ndf_s.withColumn("action", action(ndf_s["a"], ndf_s["b"])).show()

+-----+-----+------------+
|    a|    b|      action|
+-----+-----+------------+
| true| true|I am a and b|
|false|false|        null|
| true|false|      I am a|
| true|false|      I am a|
|false| true|      I am b|
+-----+-----+------------+

Hi @Psidom, thank you for your nice solution! Is there any chance to use a "udf" to achieve this result?
You can use a udf for this if the condition is complicated; I have updated the answer with a udf option.

PRASHANT KUMAR GUPTA · Accepted Answer · 2018-11-20 11:47:08Z

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import pandas as pd

ndf = pd.DataFrame({'a': [False, False, True, True, False],
                    'b': [False, False, False, False, True]})
ndf_s = sqlContext.createDataFrame(ndf)

def get_expected_string(a, b):
    if a and b:
        return "I am a and b"
    elif a:
        return "I am a"
    elif b:
        return "I am b"
    else:
        return None

# Wrap get_expected_string as a UDF that returns a string column
get_expected_string_udf = udf(get_expected_string, StringType())

ndf_s = ndf_s.withColumn("action", get_expected_string_udf("a", "b"))
ndf_s.show()

+-----+-----+------------+
|    a|    b|      action|
+-----+-----+------------+
| true| true|I am a and b|
|false|false|        null|
| true|false|      I am a|
| true|false|      I am a|
|false| true|      I am b|
+-----+-----+------------+
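Since the function wrapped by udf is ordinary Python, its branching can be checked without a Spark session. Below is a minimal sketch of such a check; `describe_flags` is a hypothetical stand-in with the same logic as the function in this answer, and the None cases assume that Spark passes null column values to a Python UDF as None.

```python
def describe_flags(a, b):
    """Hypothetical helper mirroring the branching used in the answers."""
    if a and b:
        return "I am a and b"
    elif a:
        return "I am a"
    elif b:
        return "I am b"
    return None

# Expected label for each input pair, including None (null) inputs,
# which are falsy and therefore fall through the branches.
cases = {
    (True, True): "I am a and b",
    (True, False): "I am a",
    (False, True): "I am b",
    (False, False): None,
    (None, True): "I am b",
    (None, None): None,
}

for (a, b), expected in cases.items():
    assert describe_flags(a, b) == expected, (a, b)

print("all cases passed")
```

Once the plain function behaves as expected, it can be wrapped with udf and a StringType return type in the same way as in the answer.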
