
I have a PySpark dataframe with 3 columns. The DDL of the Hive table 'test1' defines every column as string, so df.printSchema() shows all string data types, as below:

>>> df = spark.sql("select * from default.test1")
>>> df.printSchema()
root
 |-- c1: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- c3: string (nullable = true)

+----------+--------------+-------------------+
|c1        |c2            |c3                 |
+----------+--------------+-------------------+
|April     |20132014      |4                  |
|May       |20132014      |5                  |
|June      |abcdefgh      |6                  |
+----------+--------------+-------------------+

Now I want to filter only those records where column 'c2' holds an integer value. Basically I need only the first 2 records, where 'c2' is an integer like '20132014', and want to exclude the rest.

1 Answer


In one line:

df.withColumn("c2", df["c2"].cast("integer")).na.drop(subset=["c2"]) 

If a value in c2 is not a valid integer, the cast returns NULL, and the row is then dropped by the na.drop step.
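
Applied to the sample data from the question, a minimal sketch of what this should produce (the DataFrame is recreated here with createDataFrame purely for illustration; column names follow the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question; all columns are strings.
df = spark.createDataFrame(
    [("April", "20132014", "4"), ("May", "20132014", "5"), ("June", "abcdefgh", "6")],
    ["c1", "c2", "c3"],
)

# Cast c2 to integer; non-numeric values become NULL and those rows are dropped.
result = df.withColumn("c2", df["c2"].cast("integer")).na.drop(subset=["c2"])
result.show()
# +-----+--------+---+
# |   c1|      c2| c3|
# +-----+--------+---+
# |April|20132014|  4|
# |  May|20132014|  5|
# +-----+--------+---+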

Without changing the type

valid = df.where(df["c2"].cast("integer").isNotNull())
invalid = df.where(df["c2"].cast("integer").isNull())
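
As a quick check (a sketch assuming the same sample df as above), valid keeps the two numeric rows and invalid keeps the remaining one, with c2 still a string in both:

valid.show()
# +-----+--------+---+
# |   c1|      c2| c3|
# +-----+--------+---+
# |April|20132014|  4|
# |  May|20132014|  5|
# +-----+--------+---+

invalid.show()
# +----+--------+---+
# |  c1|      c2| c3|
# +----+--------+---+
# |June|abcdefgh|  6|
# +----+--------+---+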