
Since there is no out-of-the-box support for reading Excel files in Spark, I first read the Excel file into a pandas DataFrame and then tried to convert it into a Spark DataFrame, but I got the errors below (I am using Spark 1.5.1):

    import pandas as pd
    from pandas import ExcelFile
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *

    pdf = pd.read_excel('/home/testdata/test.xlsx')
    df = sqlContext.createDataFrame(pdf)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 406, in createDataFrame
        rdd, schema = self._createFromLocal(data, schema)
      File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 337, in _createFromLocal
        data = [schema.toInternal(row) for row in data]
      File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in toInternal
        return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
      File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in <genexpr>
        return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
      File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 435, in toInternal
        return self.dataType.toInternal(obj)
      File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 191, in toInternal
        else time.mktime(dt.timetuple()))
    AttributeError: 'datetime.time' object has no attribute 'timetuple'

Does anybody know how to fix it?


2 Answers


My best guess is that your problem comes from "incorrectly" parsed datetime data when you read the file with pandas.

The following code "just works":

    import pandas as pd
    from pandas import ExcelFile
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *

    pdf = pd.read_excel('test.xlsx', parse_dates=['Created on', 'Confirmation time'])
    sc = SparkContext()
    sqlContext = SQLContext(sc)
    sqlContext.createDataFrame(data=pdf).collect()

    [Row(Customer=1000935702, Country='TW', ...

Please note that you have one more datetime column, 'Confirmation date', which in your example consists only of NaT values and therefore converts without a problem for this short sample. Should the full dataset contain actual data in that column, you will have to take care of it as well.
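If the full dataset does contain values there, one hedged option is to coerce that column to datetime up front as well (a minimal sketch; `errors='coerce'` turns unparseable cells into NaT):

    import pandas as pd

    pdf = pd.read_excel('test.xlsx',
                        parse_dates=['Created on', 'Confirmation time'])

    # Force the extra column to datetime; bad cells become NaT
    # instead of breaking Spark's type inference later.
    pdf['Confirmation date'] = pd.to_datetime(pdf['Confirmation date'],
                                              errors='coerce')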


2 Comments

The previous error is gone, but now I get a type error. I wonder if I need to handle each column's type specifically? Thank you.

    >>> df = sqlContext.createDataFrame(pdf)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 406, in createDataFrame
        rdd, schema = self._createFromLocal(data, schema)
      File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 322, in _createFromLocal
      ...
    TypeError: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
@b4me You might consider accepting the solution to your earlier problem and posting the new one as a separate question.
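For readers who hit the same `TypeError: Can not merge type ...`: it typically means a pandas column holds mixed values (some rows look numeric, others look like strings), so Spark infers conflicting types for different rows. A minimal sketch of one workaround, coercing the ambiguous column to a single dtype before the conversion; the column name 'Amount' is hypothetical:

    import pandas as pd

    # Hypothetical mixed column: coerce it to one dtype before conversion.
    pdf['Amount'] = pd.to_numeric(pdf['Amount'], errors='coerce')
    # ...or force strings instead: pdf['Amount'] = pdf['Amount'].astype(str)

    df = sqlContext.createDataFrame(pdf)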

Explicitly defining the schema will fix the problem. Depending on your use case, you can derive the schema dynamically from the pandas dtypes, as shown in the snippet below:

    import pandas as pd
    from pyspark.sql.types import *

    # Map each pandas dtype to a Spark type; default to StringType.
    schema = StructType([
        StructField(name,
                    TimestampType() if pd.api.types.is_datetime64_dtype(col)
                    else DateType() if pd.api.types.is_datetime64_any_dtype(col)
                    else DoubleType() if pd.api.types.is_float_dtype(col)
                    else StringType(),
                    True)
        for name, col in zip(df.columns, df.dtypes)])

    sparkDf = spark.createDataFrame(df, schema)
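As a rough usage sketch (the toy DataFrame below is made up for illustration, and `spark` is assumed to be an existing SparkSession):

    import pandas as pd

    # Toy frame: one datetime, one float, and one string column.
    df = pd.DataFrame({
        'Created on': pd.to_datetime(['2015-01-01', '2015-01-02']),
        'Amount': [1.5, 2.5],
        'Country': ['TW', 'US'],
    })

    # Build `schema` with the comprehension above, then:
    sparkDf = spark.createDataFrame(df, schema)
    sparkDf.printSchema()
    # root
    #  |-- Created on: timestamp (nullable = true)
    #  |-- Amount: double (nullable = true)
    #  |-- Country: string (nullable = true)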

