I am writing data from Azure Databricks to Azure SQL using PySpark. The code works fine when there are no nulls, but when the dataframe contains nulls I get the following error:
```
databricks/spark/python/pyspark/sql/pandas/conversion.py:300: UserWarning: createDataFrame attempted Arrow optimization because
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Unable to convert the field Product. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
Context: Unsupported type in conversion from Arrow: null
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)

ValueError: Some of types cannot be determined after inferring
```

The dataframe must be written to SQL, including the nulls. How do I solve this?
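For context, the non-Arrow fallback fails because Spark cannot infer a type for a column that is entirely null. A minimal sketch that reproduces this kind of error (the frame and column names here are hypothetical, not from the actual data):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame: the Product column contains only nulls, so there is no
# non-null value from which Spark can infer a column type.
pdf = pd.DataFrame({"Id": [1, 2], "Product": [None, None]})

# With Arrow enabled this warns and falls back; the fallback then raises
# "ValueError: Some of types cannot be determined after inferring".
spark.createDataFrame(pdf)
```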
```python
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

def to_sql(df, table):
    # df is a pandas DataFrame; convert to a Spark DataFrame and write via JDBC
    finaldf = sqlContext.createDataFrame(df)
    finaldf.write.jdbc(url=url, table=table, mode="overwrite", properties=properties)

to_sql(data, f"TF_{table.upper()}")
```

EDIT:
Solved it by creating a function that maps pandas dtypes to SQL dtypes and outputs the columns and their dtypes as a single string.
```python
def convert_dtype(df):
    # Map pandas dtypes to SQL column types
    df_mssql = {'int64': 'bigint', 'object': 'varchar(200)', 'float64': 'float'}
    mydict = {}
    for col in df.columns:
        if str(df.dtypes[col]) in df_mssql:
            mydict[col] = df_mssql.get(str(df.dtypes[col]))
    # Build "col1 type1, col2 type2, ..." and drop the trailing comma
    l = " ".join([str(k[0] + " " + k[1] + ",") for k in list(mydict.items())])
    return l[:-1]
```

Passing this string to the createTableColumnTypes option solved this scenario.
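For illustration (the frame and column names here are hypothetical), a DataFrame with int64, object and float64 columns produces a string like this:

```python
import pandas as pd

# Hypothetical example frame
sample = pd.DataFrame({"Id": [1], "Product": ["widget"], "Price": [9.99]})
print(convert_dtype(sample))
# -> "Id bigint, Product varchar(200), Price float"
```

Note that a pandas column holding only Python None values typically has dtype object, so it still maps to varchar(200).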
```python
jdbcDF.write \
    .option("createTableColumnTypes", convert_dtype(df)) \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})
```
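Wired into the original to_sql helper, this would look roughly like the sketch below. This is only a sketch: it assumes the url and properties variables hold the Azure SQL JDBC connection settings from the original code, and the pandas-to-Spark conversion itself is unchanged.

```python
def to_sql(df, table):
    # df is a pandas DataFrame; url and properties are assumed to be the
    # Azure SQL JDBC connection settings defined elsewhere.
    finaldf = sqlContext.createDataFrame(df)
    finaldf.write \
        .option("createTableColumnTypes", convert_dtype(df)) \
        .jdbc(url=url, table=table, mode="overwrite", properties=properties)

to_sql(data, f"TF_{table.upper()}")
```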