
I have a PySpark dataframe with 3 columns. The DDL of the Hive table 'test1' defines every column as string, so df.printSchema() shows all string data types, as below:

>>> df = spark.sql("select * from default.test1")
>>> df.printSchema()
root
 |-- c1: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- c3: string (nullable = true)

+----------+--------------+-------------------+
|c1        |c2            |c3                 |
+----------+--------------+-------------------+
|April     |20132014      |4                  |
|May       |20132014      |5                  |
|June      |abcdefgh      |6                  |
+----------+--------------+-------------------+

Now I want to filter only those records where column 'c2' holds an integer value. Basically I need only the first 2 records, where 'c2' is an integer like '20132014', and want to exclude the rest.

1 Answer


In one line:

df.withColumn("c2", df["c2"].cast("integer")).na.drop(subset=["c2"]) 

If a value in c2 is not a valid integer, the cast returns NULL, and the row is then dropped by the na.drop step.
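
Applied to the sample data from the question, a minimal sketch of what this should produce (the DataFrame is recreated here with createDataFrame purely for illustration; column names follow the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question; all columns are strings.
df = spark.createDataFrame(
    [("April", "20132014", "4"), ("May", "20132014", "5"), ("June", "abcdefgh", "6")],
    ["c1", "c2", "c3"],
)

# Cast c2 to integer; non-numeric values become NULL and those rows are dropped.
result = df.withColumn("c2", df["c2"].cast("integer")).na.drop(subset=["c2"])
result.show()
# +-----+--------+---+
# |   c1|      c2| c3|
# +-----+--------+---+
# |April|20132014|  4|
# |  May|20132014|  5|
# +-----+--------+---+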

Without changing the type

valid = df.where(df["c2"].cast("integer").isNotNull())
invalid = df.where(df["c2"].cast("integer").isNull())
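
As a quick check (a sketch assuming the same sample df as above), valid keeps the two numeric rows and invalid keeps the remaining one, with c2 still a string in both:

valid.show()
# +-----+--------+---+
# |   c1|      c2| c3|
# +-----+--------+---+
# |April|20132014|  4|
# |  May|20132014|  5|
# +-----+--------+---+

invalid.show()
# +----+--------+---+
# |  c1|      c2| c3|
# +----+--------+---+
# |June|abcdefgh|  6|
# +----+--------+---+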