
Below is the content of my csv file:

A1,B1,C1
A2,B2,C2,D1
A3,B3,C3,D2,E1
A4,B4,C4,D3
A5,B5,C5,,E2

So there are 5 columns, but only 3 values in the first row.

I read it using the following command:

val csvDF: DataFrame = spark.read
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .csv("file.csv")

And the following is what I get from csvDF.show():

+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
| A1| B1| C1|
| A2| B2| C2|
| A3| B3| C3|
| A4| B4| C4|
| A5| B5| C5|
+---+---+---+

How can I read all the data in all the columns?

  • Is it possible to add all 5 columns to each row? E.g. for row 1, instead of A1,B1,C1, have it be A1,B1,C1,, Commented Aug 10, 2017 at 6:34
  • That's only a workaround, and it will not work if the csv is managed by someone else. Commented Aug 10, 2017 at 6:37
  • Just specify the schema manually. Commented Aug 10, 2017 at 6:38
  • What if we don't know the schema? The content of the csv is not known in advance. Commented Aug 10, 2017 at 6:39
  • All content in the csv can be specified as StringType. Commented Aug 10, 2017 at 6:44

3 Answers


Basically your csv file isn't properly formatted, in the sense that it doesn't have an equal number of columns in each row, which is required if you want to read it with spark.read.csv. However, you can instead read it with spark.read.textFile and then parse each row yourself.

As I understand it, you do not know the number of columns beforehand, so you want your code to handle an arbitrary number of columns. To do this you need to establish the maximum number of columns in the data set, which requires two passes over the data.

For this particular problem, I would actually go with RDDs instead of DataFrames or Datasets, like this:

import org.apache.spark.sql.Row

val data = spark.read.textFile("file.csv").rdd
// Pair each line with its column count so the maximum can be found in one pass
val rdd = data.map(s => (s, s.split(",").length)).cache
val maxColumns = rdd.map(_._2).max()
// Pad each row with empty slots up to maxColumns
val x = rdd.map(row => {
  val rowData = row._1.split(",")
  val extraColumns = Array.ofDim[String](maxColumns - rowData.length)
  Row((rowData ++ extraColumns).toList: _*)
})
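
If you need a DataFrame at the end, here is a minimal sketch of how you could continue (this continuation is my assumption, not part of the original snippet): generate a StringType schema for maxColumns columns and apply it with createDataFrame.

// Assumed continuation: turn the RDD[Row] above into a DataFrame,
// treating every column as a nullable StringType
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(
  (0 until maxColumns).map(i => StructField(s"_c$i", StringType, nullable = true)))
val csvDF = spark.createDataFrame(x, schema)
csvDF.show()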

Hope that helps :)




You can read it as a dataset with only one column (for example by using another delimiter):

var df = spark.read.format("csv").option("delimiter", ";").load("test.csv")
df.show()

+--------------+
|           _c0|
+--------------+
|      A1,B1,C1|
|   A2,B2,C2,D1|
|A3,B3,C3,D2,E1|
|   A4,B4,C4,D3|
|  A5,B5,C5,,E2|
+--------------+

Then you can use this answer to split the column into five manually; this will add null values where the element does not exist:

import org.apache.spark.sql.functions.split
import spark.implicits._ // for the $"..." column syntax

var csvDF = df.withColumn("_tmp", split($"_c0", ",")).select(
  $"_tmp".getItem(0).as("col1"),
  $"_tmp".getItem(1).as("col2"),
  $"_tmp".getItem(2).as("col3"),
  $"_tmp".getItem(3).as("col4"),
  $"_tmp".getItem(4).as("col5")
)
csvDF.show()

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|  A1|  B1|  C1|null|null|
|  A2|  B2|  C2|  D1|null|
|  A3|  B3|  C3|  D2|  E1|
|  A4|  B4|  C4|  D3|null|
|  A5|  B5|  C5|    |  E2|
+----+----+----+----+----+
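
If the number of columns is not known in advance, a variant of the same idea (my sketch, not part of the original answer) first computes the largest split size and then generates the columns:

// Assumed variant: derive the column count from the data instead of hardcoding 5
import org.apache.spark.sql.functions.{split, size, max}
import spark.implicits._

val tmp = df.withColumn("_tmp", split($"_c0", ","))
val numCols = tmp.agg(max(size($"_tmp"))).head.getInt(0)
val csvDynamic = tmp.select(
  (0 until numCols).map(i => $"_tmp".getItem(i).as(s"col${i + 1}")): _*)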



If the column data types and the number of columns are known, then you can define a schema and apply it while reading the csv file as a DataFrame. Below I have defined all five columns as StringType:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("col1", StringType, true),
  StructField("col2", StringType, true),
  StructField("col3", StringType, true),
  StructField("col4", StringType, true),
  StructField("col5", StringType, true)))

val csvDF: DataFrame = sqlContext.read
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .schema(schema)
  .csv("file.csv")

You should get a DataFrame like this:

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A1  |B1  |C1  |null|null|
|A2  |B2  |C2  |D1  |null|
|A3  |B3  |C3  |D2  |E1  |
|A4  |B4  |C4  |D3  |null|
|A5  |B5  |C5  |null|E2  |
+----+----+----+----+----+
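
As a side note, if you are on Spark 2.3 or later (an assumption on my part; the answer above uses the older sqlContext entry point), the same schema can also be passed as a DDL string:

// Assumed alternative for Spark 2.3+: schema given as a DDL string
val csvDF2 = spark.read
  .option("header", "false")
  .option("delimiter", ",")
  .schema("col1 STRING, col2 STRING, col3 STRING, col4 STRING, col5 STRING")
  .csv("file.csv")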

