
Since there is no out-of-the-box support for reading Excel files in Spark, I first read the Excel file into a pandas DataFrame and then tried to convert it into a Spark DataFrame, but I got the errors below (I am using Spark 1.5.1):

    import pandas as pd
    from pandas import ExcelFile
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *

    pdf = pd.read_excel('/home/testdata/test.xlsx')
    df = sqlContext.createDataFrame(pdf)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 406, in createDataFrame
        rdd, schema = self._createFromLocal(data, schema)
      File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 337, in _createFromLocal
        data = [schema.toInternal(row) for row in data]
      File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in toInternal
        return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
      File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in <genexpr>
        return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
      File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 435, in toInternal
        return self.dataType.toInternal(obj)
      File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 191, in toInternal
        else time.mktime(dt.timetuple()))
    AttributeError: 'datetime.time' object has no attribute 'timetuple'

Does anybody know how to fix it?


2 Answers


My best guess is that your problem comes from "incorrectly" parsed datetime data when you read the file with pandas.

The following code "just works":

    import pandas as pd
    from pandas import ExcelFile
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *

    pdf = pd.read_excel('test.xlsx', parse_dates=['Created on', 'Confirmation time'])
    sc = SparkContext()
    sqlContext = SQLContext(sc)
    sqlContext.createDataFrame(data=pdf).collect()

    [Row(Customer=1000935702, Country='TW', ...

Please note that you have one more datetime column, 'Confirmation date', which in your example consists only of NaT values and therefore converts without a problem for this short sample. Should the full dataset contain actual data in that column, you will have to take care of it as well.
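If the full dataset does contain values there, one hedged option is to coerce that column to datetime up front as well (a minimal sketch; `errors='coerce'` turns unparseable cells into NaT):

    import pandas as pd

    pdf = pd.read_excel('test.xlsx',
                        parse_dates=['Created on', 'Confirmation time'])

    # Force the extra column to datetime; bad cells become NaT
    # instead of breaking Spark's type inference later.
    pdf['Confirmation date'] = pd.to_datetime(pdf['Confirmation date'],
                                              errors='coerce')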


2 Comments

The previous error is gone, but now I get a type error. I wonder if I need to handle each column's type specifically? Thank you.

    >>> df = sqlContext.createDataFrame(pdf)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 406, in createDataFrame
        rdd, schema = self._createFromLocal(data, schema)
      File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 322, in _createFromLocal
      ...
    TypeError: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
@b4me You might consider accepting the solution to your earlier problem and posting the new one as a separate question.
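For readers who hit the same `TypeError: Can not merge type ...`: it typically means a pandas column holds mixed values (some rows look numeric, others look like strings), so Spark infers conflicting types for different rows. A minimal sketch of one workaround, coercing the ambiguous column to a single dtype before the conversion; the column name 'Amount' is hypothetical:

    import pandas as pd

    # Hypothetical mixed column: coerce it to one dtype before conversion.
    pdf['Amount'] = pd.to_numeric(pdf['Amount'], errors='coerce')
    # ...or force strings instead: pdf['Amount'] = pdf['Amount'].astype(str)

    df = sqlContext.createDataFrame(pdf)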

Explicitly defining the schema will fix the problem. Depending on your use case, you can derive the schema dynamically from the pandas dtypes, as shown in the snippet below:

    import pandas as pd
    from pyspark.sql.types import *

    # Map each pandas dtype to a Spark type; default to StringType.
    schema = StructType([
        StructField(name,
                    TimestampType() if pd.api.types.is_datetime64_dtype(col)
                    else DateType() if pd.api.types.is_datetime64_any_dtype(col)
                    else DoubleType() if pd.api.types.is_float_dtype(col)
                    else StringType(),
                    True)
        for name, col in zip(df.columns, df.dtypes)])

    sparkDf = spark.createDataFrame(df, schema)
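As a rough usage sketch (the toy DataFrame below is made up for illustration, and `spark` is assumed to be an existing SparkSession):

    import pandas as pd

    # Toy frame: one datetime, one float, and one string column.
    df = pd.DataFrame({
        'Created on': pd.to_datetime(['2015-01-01', '2015-01-02']),
        'Amount': [1.5, 2.5],
        'Country': ['TW', 'US'],
    })

    # Build `schema` with the comprehension above, then:
    sparkDf = spark.createDataFrame(df, schema)
    sparkDf.printSchema()
    # root
    #  |-- Created on: timestamp (nullable = true)
    #  |-- Amount: double (nullable = true)
    #  |-- Country: string (nullable = true)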

