27

I have a PySpark DataFrame with a column of strings. How can I check which rows in it are numeric? I could not find any function for this in PySpark's official documentation.

values = [('25q36',), ('75647',), ('13864',), ('8758K',), ('07645',)]
df = sqlContext.createDataFrame(values, ['ID'])
df.show()

+-----+
|   ID|
+-----+
|25q36|
|75647|
|13864|
|8758K|
|07645|
+-----+

In Python, there is a string method .isdigit() which returns True if the string contains only digits and False otherwise.
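For reference, this is the plain-Python behavior I mean:

"75647".isdigit()  # True
"25q36".isdigit()  # False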

Expected DataFrame:

+-----+-----+
|   ID|Value|
+-----+-----+
|25q36|False|
|75647| True|
|13864| True|
|8758K|False|
|07645| True|
+-----+-----+

I would like to avoid creating a UDF.

7 Answers

35

A simple cast would do the job:

from pyspark.sql import functions as F

my_df.select(
    "ID",
    F.col("ID").cast("int").isNotNull().alias("Value")
).show()

+-----+-----+
|   ID|Value|
+-----+-----+
|25q36|false|
|75647| true|
|13864| true|
|8758K|false|
|07645| true|
+-----+-----+

2 Comments

Thanks Steven. This definitely works. I thought there might be a built-in function as well. If I don't find one, I will accept this as the answer.
Pay attention! It's better to use long rather than int, because the int range is only -2,147,483,648 to +2,147,483,647.
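A minimal sketch of that suggestion, simply swapping the cast type (reusing the DataFrame and column names from the question):

from pyspark.sql import functions as F

# cast to long (64-bit) so IDs beyond the 32-bit int range are still recognized as numeric
df.select(
    "ID",
    F.col("ID").cast("long").isNotNull().alias("Value")
).show()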
7

Filtering with Regex

Indeed I enjoyed the creative solution provided by Steven, but here is a more flexible approach for this kind of situation:

df.filter(~df.ID.rlike(r'\D+')).show()

First, rlike(r'\D+') selects every row whose ID contains a non-digit character, and the ~ at the beginning of the filter then excludes those rows.
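If you need the boolean column from the question rather than a filter, the same regex idea can be adapted roughly like this (a sketch reusing the question's df):

from pyspark.sql import functions as F

# True when the ID contains no non-digit character
df.withColumn("Value", ~F.col("ID").rlike(r"\D")).show()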

1 Comment

Regexes are not easier. Although they provide a more flexible way to handle various patterns in strings, they are complex and hard to understand for someone who is not used to them. They may also be slower than a simple cast, especially on large datasets. And the OP wants a new column, not a filter.
4

I agree with @Steven's answer, but there is a slight modification, since I want the whole table to be filtered. Please find it below:

df2.filter(F.col("id").cast("int").isNotNull()).show() 

Also, there is no need to create a new column called Value.


An alternative solution, similar to the above, is:

display(df2.filter("CAST(id AS INT) IS NOT NULL"))

1 Comment

OP wants a new column, not a filter
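If the requested column is needed instead of a filter, the same SQL expression can be reused with expr, roughly like this (a sketch based on the answer's df2):

from pyspark.sql import functions as F

# adds the boolean column instead of dropping rows
df2.withColumn("Value", F.expr("CAST(id AS INT) IS NOT NULL")).show()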
1

If you want, you can also build a custom UDF for this purpose:

from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def is_digit(val):
    # None values are treated as non-numeric
    if val:
        return val.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())

df = df.withColumn('Value', F.when(is_digit_udf(F.col('ID')), F.lit(True)).otherwise(F.lit(False)))

2 Comments

OP specifically asked for no UDF.
UDFs should be a last resort when there is no solution using native PySpark functions; they are not performant.
1

The clearest way to search for non-numeric rows would be something like this:

from pyspark.sql import functions as F

df.select("col_a", F.regexp_replace(F.col("col_a"), "[^0-9]", "").alias("numeric")) \
  .filter(F.col("col_a") != F.col("numeric")) \
  .distinct() \
  .show()

1 Comment

OP wants a new column, not a filter
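A rough sketch of how this comparison could produce the requested column instead of a filter (using the question's df and ID column for illustration):

from pyspark.sql import functions as F

# True when stripping non-digits leaves the string unchanged, i.e. it was all digits
df.withColumn(
    "Value",
    F.col("ID") == F.regexp_replace(F.col("ID"), "[^0-9]", "")
).show()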
0
from pyspark.sql.functions import col, lit, when

df = spark.read.option("header", "true").csv("source_table.csv")
df = df.withColumn("is_valid", lit("true"))

# this will work
df.withColumn(
    "is_valid",
    when(col("age").cast("int").isNotNull(), col("is_valid")).otherwise("false"),
).show()

# if you want to use rlike, this will also work
pattern = "^[0-9]*$"
source_df = df.withColumn(
    "is_valid",
    when(col("age").rlike(pattern), col("is_valid")).otherwise("false"),
)

1 Comment

Very nice, because you avoided the use of a UDF.
-1

Try this; it is in Scala:

spark.udf.register("IsNumeric", (inpColumn: Int) => BigInt(inpColumn).isInstanceOf[BigInt])

spark.sql(s""" select "ABCD", IsNumeric(1234) as IsNumeric_1 """).show(false)

Comments
