  • Is it critical that they be "read in" as Nulls, or is it acceptable to read them into the dataframe (say, as strings) and then convert them to Nulls? Commented Nov 7, 2017 at 1:11
  • The most elegant solution would be to use a replaceAll and make your data uniform. Commented Nov 7, 2017 at 8:55
  • @combinatorist, I want to read the data against a schema and use it as a Dataset. Certain fields that are integers by default contain values like "N/A" or "-", all of which I want parsed as null so they can be read into the integer fields of my schema case class. So I'd prefer to do it when reading from the file into a Dataset itself. Commented Nov 7, 2017 at 17:14
  • @philantrovert I would do that as a last resort. Ideally, I want Spark to handle the whole thing rather than a regular in-memory replaceAll. Commented Nov 7, 2017 at 17:15
  • @VishnuPrathish, what if you read the field into a dataframe as a string, make the Null replacements there, cast the field to an int, and then convert that dataframe to a Dataset? Commented Nov 7, 2017 at 17:18
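
Pulling the comments together, here is a minimal Spark/Scala sketch of the read-as-string, replace, then cast approach. The file path, column names, placeholder tokens, and the `Record` case class are all assumptions for illustration, not from the thread:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Hypothetical target schema; Option[Int] lets the field hold nulls
case class Record(id: Int, amount: Option[Int])

val spark = SparkSession.builder()
  .appName("null-tokens")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Read every column as a string first (no inferSchema), so the
// placeholder tokens survive the initial load
val raw = spark.read
  .option("header", "true")
  .csv("data.csv")                     // assumed path

val nullTokens = Seq("N/A", "-", "")   // assumed placeholder values

// Replace the placeholder tokens with real nulls, then cast to int.
// The cast itself also yields null for any remaining unparseable strings.
val cleaned = raw
  .withColumn("amount",
    when(col("amount").isin(nullTokens: _*), null)
      .otherwise(col("amount"))
      .cast("int"))
  .withColumn("id", col("id").cast("int"))

// With nulls in place and types matching, conversion to a Dataset works
val ds = cleaned.as[Record]
```

Note that Spark's CSV reader does have a `nullValue` option, but it accepts a single token per read; when several distinct placeholders ("N/A", "-", etc.) must all become null, the string-then-replace approach sketched above covers them in one pass, and Spark still executes the whole pipeline lazily and distributed rather than as an in-memory replaceAll.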