
Below is the content of my csv file:

A1,B1,C1
A2,B2,C2,D1
A3,B3,C3,D2,E1
A4,B4,C4,D3
A5,B5,C5,,E2

So there are 5 columns, but only 3 values in the first row.

I read it using the following command:

val csvDF: DataFrame = spark.read
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .csv("file.csv")

And the following is what I get from csvDF.show():

+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
| A1| B1| C1|
| A2| B2| C2|
| A3| B3| C3|
| A4| B4| C4|
| A5| B5| C5|
+---+---+---+

How can I read all the data in all the columns?

  • Is it possible to add all 5 columns to each row? E.g. for row 1, instead of A1,B1,C1, have it be A1,B1,C1,, Commented Aug 10, 2017 at 6:34
  • That's only a workaround, and it will not work if the csv is managed by someone else. Commented Aug 10, 2017 at 6:37
  • Just specify the schema manually. Commented Aug 10, 2017 at 6:38
  • What if we don't know the schema? The content of the csv is not known in advance. Commented Aug 10, 2017 at 6:39
  • All content in the csv can be specified as StringType. Commented Aug 10, 2017 at 6:44

3 Answers


Basically your csv file isn't properly formatted, in the sense that it doesn't have an equal number of columns in each row, which is required if you want to read it with spark.read.csv. However, you can instead read it with spark.read.textFile and then parse each row yourself.

As I understand it, you do not know the number of columns beforehand, so you want your code to handle an arbitrary number of columns. To do this you need to establish the maximum number of columns in the data set, which requires two passes over the data.

For this particular problem, I would actually go with RDDs instead of DataFrames or Datasets, like this:

import org.apache.spark.sql.Row

val data = spark.read.textFile("file.csv").rdd
// Pair each line with its column count so the maximum can be found in one pass
val rdd = data.map(s => (s, s.split(",").length)).cache
val maxColumns = rdd.map(_._2).max()
// Pad each row with empty slots up to maxColumns
val x = rdd.map(row => {
  val rowData = row._1.split(",")
  val extraColumns = Array.ofDim[String](maxColumns - rowData.length)
  Row((rowData ++ extraColumns).toList: _*)
})
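
If you need a DataFrame at the end, here is a minimal sketch of how you could continue (this continuation is my assumption, not part of the original snippet): generate a StringType schema for maxColumns columns and apply it with createDataFrame.

// Assumed continuation: turn the RDD[Row] above into a DataFrame,
// treating every column as a nullable StringType
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(
  (0 until maxColumns).map(i => StructField(s"_c$i", StringType, nullable = true)))
val csvDF = spark.createDataFrame(x, schema)
csvDF.show()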

Hope that helps :)




You can read it as a dataset with only one column (for example by using another delimiter):

var df = spark.read.format("csv").option("delimiter", ";").load("test.csv")
df.show()

+--------------+
|           _c0|
+--------------+
|      A1,B1,C1|
|   A2,B2,C2,D1|
|A3,B3,C3,D2,E1|
|   A4,B4,C4,D3|
|  A5,B5,C5,,E2|
+--------------+

Then you can use this answer to split the column into five manually; this will add null values where the element does not exist:

import org.apache.spark.sql.functions.split
import spark.implicits._ // for the $"..." column syntax

var csvDF = df.withColumn("_tmp", split($"_c0", ",")).select(
  $"_tmp".getItem(0).as("col1"),
  $"_tmp".getItem(1).as("col2"),
  $"_tmp".getItem(2).as("col3"),
  $"_tmp".getItem(3).as("col4"),
  $"_tmp".getItem(4).as("col5")
)
csvDF.show()

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|  A1|  B1|  C1|null|null|
|  A2|  B2|  C2|  D1|null|
|  A3|  B3|  C3|  D2|  E1|
|  A4|  B4|  C4|  D3|null|
|  A5|  B5|  C5|    |  E2|
+----+----+----+----+----+
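
If the number of columns is not known in advance, a variant of the same idea (my sketch, not part of the original answer) first computes the largest split size and then generates the columns:

// Assumed variant: derive the column count from the data instead of hardcoding 5
import org.apache.spark.sql.functions.{split, size, max}
import spark.implicits._

val tmp = df.withColumn("_tmp", split($"_c0", ","))
val numCols = tmp.agg(max(size($"_tmp"))).head.getInt(0)
val csvDynamic = tmp.select(
  (0 until numCols).map(i => $"_tmp".getItem(i).as(s"col${i + 1}")): _*)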



If the column data types and the number of columns are known, then you can define a schema and apply it while reading the csv file as a DataFrame. Below I have defined all five columns as StringType:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("col1", StringType, true),
  StructField("col2", StringType, true),
  StructField("col3", StringType, true),
  StructField("col4", StringType, true),
  StructField("col5", StringType, true)))

val csvDF: DataFrame = sqlContext.read
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .schema(schema)
  .csv("file.csv")

You should get a DataFrame like this:

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A1  |B1  |C1  |null|null|
|A2  |B2  |C2  |D1  |null|
|A3  |B3  |C3  |D2  |E1  |
|A4  |B4  |C4  |D3  |null|
|A5  |B5  |C5  |null|E2  |
+----+----+----+----+----+
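
As a side note, if you are on Spark 2.3 or later (an assumption on my part; the answer above uses the older sqlContext entry point), the same schema can also be passed as a DDL string:

// Assumed alternative for Spark 2.3+: schema given as a DDL string
val csvDF2 = spark.read
  .option("header", "false")
  .option("delimiter", ",")
  .schema("col1 STRING, col2 STRING, col3 STRING, col4 STRING, col5 STRING")
  .csv("file.csv")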

