
I'm trying to use withColumn to null out bad dates in a DataFrame column, using a when() function to make the update. I have two conditions for "bad" dates: dates before January 1900, or dates in the future. My current code looks like this:

d = datetime.datetime.today()
df_out = df.withColumn(my_column, when(col(my_column) < '1900-01-01' | col(my_column) > '2019-12-09 17:01:37.774418', lit(None)).otherwise(col(my_column)))

I think my problem is that it doesn't like the or operator "|". From what I have seen on Google, "|" is what I should use. I have tried "or" as well. Can anyone advise on what I'm doing wrong here?

Here is the stack trace:

df_out = df.withColumn(c, when(col(c) < '1900-01-01' | col(c) > '2019-12-09 17:01:37.774418', lit(None)).otherwise(col(c)))
  File "C:\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\column.py", line 115, in _
    njc = getattr(self._jc, name)(jc)
  File "C:\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "C:\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o48.or. Trace:
py4j.Py4JException: Method or([class java.lang.String]) does not exist

2 Answers


It is a matter of operator precedence. The boolean operator or has lower precedence than the comparison operators, so

col(my_column) < 'X' or col(my_column) > 'Y' 

reads as

(col(my_column) < 'X') or (col(my_column) > 'Y') 

But the bitwise OR operator | has higher precedence than the comparison operators, so

col(my_column) < 'X' | col(my_column) > 'Y' 

actually reads as

col(my_column) < ('X' | col(my_column)) > 'Y' 

Even though | is redefined on the Column type to have the same effect as a logical OR, its precedence does not change, so you need to parenthesise each comparison yourself.
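This grouping happens at the Python level, before Spark ever sees the expression. A minimal sketch using the standard ast module (the names a, b, c, d are just placeholders) shows how the un-parenthesised form is parsed:

```python
import ast

# Parse the un-parenthesised expression the way Python itself does.
expr = ast.parse("a < b | c > d", mode="eval").body

# Because | binds tighter than < and >, the whole thing is one chained
# comparison a < (b | c) > d, with the BinOp (b | c) as the middle operand.
assert isinstance(expr, ast.Compare)
middle = expr.comparators[0]
assert isinstance(middle, ast.BinOp)
assert isinstance(middle.op, ast.BitOr)

print(ast.dump(expr))
```

With Columns, Python first tries to evaluate the inner 'X' | col(...) piece, which is exactly the string-OR operation the Py4JError in the question complains about.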

2

It's just a matter of operator precedence. The error is telling you that PySpark cannot apply OR to a string: more specifically, it is trying to compute '1900-01-01' | col(c) and does not know how to do it. You simply need to parenthesise each comparison:

df_out = df.withColumn(
    my_column,
    when(
        (col(my_column) < '1900-01-01') | (col(my_column) > '2019-12-09 17:01:37.774418'),
        lit(None)
    ).otherwise(col(my_column))
)

