I am reading a CSV file in PySpark as follows:

    df_raw = spark.read.option("header", "true").csv(csv_path)

However, the data file has quoted fields with embedded commas, and those commas should not be treated as field separators. How can I handle this in PySpark? I know pandas can handle this, but can Spark? I am using Spark 2.0.0.
Here is an example which works in Pandas but fails using Spark:
    In [1]: import pandas as pd

    In [2]: pdf = pd.read_csv('malformed_data.csv')

    In [3]: sdf=spark.read.format("org.apache.spark.csv").csv('malformed_data.csv',header=True)

    In [4]: pdf[['col12','col13','col14']]
    Out[4]:
                      col12                                             col13  \
    0  32 XIY "W" JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE
    1                   NaN                     OUTKAST#THROOTS~WUTANG#RUNDMC

       col14
    0   23.0
    1    0.0

    In [5]: sdf.select("col12","col13",'col14').show()
    +------------------+--------------------+--------------------+
    |             col12|               col13|               col14|
    +------------------+--------------------+--------------------+
    |  "32 XIY ""W"" JK|              RE LK"|SOMETHINGLIKEAPHE...|
    |              null|OUTKAST#THROOTS~W...|                 0.0|
    +------------------+--------------------+--------------------+

The contents of the file:
    col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
    80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W"" JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
    61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE
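To make the expected result concrete, here is a minimal sketch of the parse I am after. It uses Python's standard `csv` module rather than Spark, so it only illustrates the target behaviour and is not a Spark solution. The `row` string is the first data row copied from the file above; the variable names are just for illustration.

```python
import csv
import io

# First data row, copied verbatim from the file contents above.
row = (
    '80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,'
    '"32 XIY ""W"" JK, RE LK",'
    'SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,'
    '2012-25-19,432,2023-05-17,CODERED'
)

# csv.reader defaults to delimiter=',', quotechar='"' and doublequote=True,
# so a doubled quote inside a quoted field is read as one literal quote and
# a comma inside a quoted field is kept as data.
fields = next(csv.reader(io.StringIO(row)))

print(len(fields))   # number of columns parsed from the row
print(fields[11])    # col12, which Spark splits in two in the output above
```

Here col12 comes back as a single field with the embedded comma intact, which matches what pandas shows. Since the quotes inside the field are escaped by doubling, I suspect the reader's `quote` and `escape` options are the relevant knobs on the Spark side (for example `escape='"'`), but that is an assumption I have not confirmed on 2.0.0.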