27

I have a PySpark DataFrame with a column of strings. How can I check which rows in it are numeric? I could not find any function for this in PySpark's official documentation.

values = [('25q36',), ('75647',), ('13864',), ('8758K',), ('07645',)]
df = sqlContext.createDataFrame(values, ['ID'])
df.show()

+-----+
|   ID|
+-----+
|25q36|
|75647|
|13864|
|8758K|
|07645|
+-----+

In Python, there is a string method .isdigit() which returns True if the string contains only digits and False otherwise.
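For reference, this is the plain-Python behavior I mean:

"75647".isdigit()  # True
"25q36".isdigit()  # False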

Expected DataFrame:

+-----+-----+
|   ID|Value|
+-----+-----+
|25q36|False|
|75647| True|
|13864| True|
|8758K|False|
|07645| True|
+-----+-----+

I would like to avoid creating a UDF.

7 Answers

35

A simple cast would do the job:

from pyspark.sql import functions as F

my_df.select(
    "ID",
    F.col("ID").cast("int").isNotNull().alias("Value")
).show()

+-----+-----+
|   ID|Value|
+-----+-----+
|25q36|false|
|75647| true|
|13864| true|
|8758K|false|
|07645| true|
+-----+-----+

2 Comments

Thanks Steven. This definitely works. I thought there might be a built-in function as well. If I don't find one, I will accept this as the answer.
Pay attention! It's better to use long rather than int, because the int range is only -2,147,483,648 to +2,147,483,647.
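A minimal sketch of that suggestion, simply swapping the cast type (reusing the DataFrame and column names from the question):

from pyspark.sql import functions as F

# cast to long (64-bit) so IDs beyond the 32-bit int range are still recognized as numeric
df.select(
    "ID",
    F.col("ID").cast("long").isNotNull().alias("Value")
).show()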
7

Filtering with Regex

Indeed I enjoyed the creative solution provided by Steven, but here is a more flexible approach for this kind of situation:

df.filter(~df.ID.rlike(r'\D+')).show()

First, rlike(r'\D+') selects every row whose ID contains a non-digit character, and the ~ at the beginning of the filter then excludes those rows.
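If you need the boolean column from the question rather than a filter, the same regex idea can be adapted roughly like this (a sketch reusing the question's df):

from pyspark.sql import functions as F

# True when the ID contains no non-digit character
df.withColumn("Value", ~F.col("ID").rlike(r"\D")).show()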

1 Comment

Regexes are not easier. Although they provide a more flexible way to handle various patterns in strings, they are complex and hard to understand for someone who is not used to them. They may also be slower than a simple cast, especially on large datasets. And the OP wants a new column, not a filter.
4

I agree with @Steven's answer, but there is a slight modification, since I want the whole table to be filtered. Please find it below:

df2.filter(F.col("id").cast("int").isNotNull()).show() 

Also, there is no need to create a new column called Value.


An alternative solution, similar to the above, is:

display(df2.filter("CAST(id AS INT) IS NOT NULL"))

1 Comment

OP wants a new column, not a filter
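If the requested column is needed instead of a filter, the same SQL expression can be reused with expr, roughly like this (a sketch based on the answer's df2):

from pyspark.sql import functions as F

# adds the boolean column instead of dropping rows
df2.withColumn("Value", F.expr("CAST(id AS INT) IS NOT NULL")).show()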
1

If you want, you can also build a custom UDF for this purpose:

from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def is_digit(val):
    # None values are treated as non-numeric
    if val:
        return val.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())

df = df.withColumn('Value', F.when(is_digit_udf(F.col('ID')), F.lit(True)).otherwise(F.lit(False)))

2 Comments

OP specifically asked for no UDF.
UDFs should be a last resort when there is no solution using native PySpark functions; they are not performant.
1

The clearest way to search for non-numeric rows would be something like this:

from pyspark.sql import functions as F

df.select("col_a", F.regexp_replace(F.col("col_a"), "[^0-9]", "").alias("numeric")) \
  .filter(F.col("col_a") != F.col("numeric")) \
  .distinct() \
  .show()

1 Comment

OP wants a new column, not a filter
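A rough sketch of how this comparison could produce the requested column instead of a filter (using the question's df and ID column for illustration):

from pyspark.sql import functions as F

# True when stripping non-digits leaves the string unchanged, i.e. it was all digits
df.withColumn(
    "Value",
    F.col("ID") == F.regexp_replace(F.col("ID"), "[^0-9]", "")
).show()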
0
from pyspark.sql.functions import col, lit, when

df = spark.read.option("header", "true").csv("source_table.csv")
df = df.withColumn("is_valid", lit("true"))

# this will work
df.withColumn(
    "is_valid",
    when(col("age").cast("int").isNotNull(), col("is_valid")).otherwise("false"),
).show()

# if you want to use rlike, this will also work
pattern = "^[0-9]*$"
source_df = df.withColumn(
    "is_valid",
    when(col("age").rlike(pattern), col("is_valid")).otherwise("false"),
)

1 Comment

Very nice, because you avoided the use of a UDF.
-1

Try this; it is in Scala:

spark.udf.register("IsNumeric", (inpColumn: Int) => BigInt(inpColumn).isInstanceOf[BigInt])

spark.sql(s""" select "ABCD", IsNumeric(1234) as IsNumeric_1 """).show(false)

Comments
