
I have a dataframe which I read in using pyspark with:

df1 = spark.read.csv("/user/me/data/*").toPandas() 

Unfortunately, toPandas() leaves every column with dtype object, even numerical values. I need to merge this with another dataframe that I read in with df2 = pd.read_csv("file.csv"), so I need the types in df1 to be inferred exactly as pandas would have inferred them.

How can you infer types of an existing pandas dataframe?
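To illustrate the mismatch with toy data (the column name and values are made up, simulating what toPandas() can produce):

```python
import pandas as pd

# Simulate spark.read.csv(...).toPandas(): everything comes back as object
df1 = pd.DataFrame({"id": ["1", "2"]}, dtype=object)
# pd.read_csv would have inferred int64 for the same data
df2 = pd.DataFrame({"id": [1, 2]})

# Merging on columns with incompatible dtypes fails
try:
    df1.merge(df2, on="id")
except ValueError as e:
    print(e)  # pandas refuses to merge object keys with int64 keys
```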

1 Answer


If the two dataframes have the same column names, you can use pd.DataFrame.astype:

df1 = df1.astype(df2.dtypes) 
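A minimal sketch of this with made-up data (column names and values are illustrative):

```python
import pandas as pd

# df1 arrives with everything as object, as toPandas() leaves it
df1 = pd.DataFrame({"id": ["1", "2"], "price": ["3.5", "4.0"]}, dtype=object)
# df2 came from pd.read_csv, so its dtypes were inferred
df2 = pd.DataFrame({"id": [3, 4], "price": [1.5, 2.0]})

# astype accepts a column -> dtype mapping, and df2.dtypes is one
df1 = df1.astype(df2.dtypes)
print(df1.dtypes)  # id: int64, price: float64
```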

Otherwise, you need to construct a dictionary where the keys are the column names in df1 and the values are the corresponding dtypes. You can start with d = df2.dtypes.to_dict() to see what it should look like, then build a new dictionary altering the keys where needed.
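For example, if df1 came from a headerless CSV, Spark names the columns _c0, _c1, … (the target names here are illustrative):

```python
import pandas as pd

# df1 with Spark's default column names, all object dtype
df1 = pd.DataFrame({"_c0": ["1", "2"], "_c1": ["3.5", "4.0"]}, dtype=object)
df2 = pd.DataFrame({"id": [3], "price": [1.5]})

# Start from df2's dtypes, then re-key to df1's column names
d = df2.dtypes.to_dict()
d = {"_c0": d["id"], "_c1": d["price"]}

df1 = df1.astype(d)
print(df1.dtypes)
```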

Once you've constructed the dictionary d, use:

df1 = df1.astype(d) 

2 Comments

Thank you for this. Will this also convert '' into NaN if the type is float? This is what pd.read_csv does.
I don't think it will. You need to manually convert that column with pd.to_numeric(df['mycol'], errors='coerce')
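A quick sketch of that workaround (the column name and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"mycol": ["1.5", "", "2.0"]}, dtype=object)

# astype(float) would raise on the empty string;
# to_numeric with errors='coerce' turns it into NaN instead
df["mycol"] = pd.to_numeric(df["mycol"], errors="coerce")
print(df["mycol"])  # float64 column with NaN in the middle
```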
