Since Spark has no out-of-the-box support for reading Excel files, I first read the Excel file into a pandas DataFrame and then tried to convert it into a Spark DataFrame, but I got the error below (I am using Spark 1.5.1):
import pandas as pd
from pandas import ExcelFile
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

pdf = pd.read_excel('/home/testdata/test.xlsx')
df = sqlContext.createDataFrame(pdf)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 406, in createDataFrame
    rdd, schema = self._createFromLocal(data, schema)
  File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 337, in _createFromLocal
    data = [schema.toInternal(row) for row in data]
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in toInternal
    return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in <genexpr>
    return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 435, in toInternal
    return self.dataType.toInternal(obj)
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 191, in toInternal
    else time.mktime(dt.timetuple())
AttributeError: 'datetime.time' object has no attribute 'timetuple'

Does anybody know how to fix it?
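From the traceback, Spark's timestamp conversion calls timetuple() on the value, which datetime.time objects do not have, so any Excel cell that pandas parses as a bare time-of-day can trigger this error. One possible workaround (a sketch only; the DataFrame below is a hypothetical stand-in for the real test.xlsx, whose contents I don't have) is to cast any time-valued columns to strings before calling createDataFrame:

```python
import datetime
import pandas as pd

# Hypothetical stand-in for pd.read_excel('/home/testdata/test.xlsx');
# the real file may have different columns.
pdf = pd.DataFrame({
    "name": ["a", "b"],
    "start": [datetime.time(9, 30), datetime.time(17, 0)],
})

# Cast columns containing datetime.time values to strings so Spark
# can infer a plain StringType instead of choking on timetuple().
for col in pdf.columns:
    if pdf[col].map(lambda v: isinstance(v, datetime.time)).any():
        pdf[col] = pdf[col].astype(str)

# df = sqlContext.createDataFrame(pdf)  # should no longer hit toInternal()
print(pdf.dtypes)
```

If you need real timestamps on the Spark side, another option would be to combine the time with a date into a full datetime.datetime in pandas first, since datetime objects do have timetuple().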