  • Is it critical that they be "read in" as Nulls, or is it acceptable to read them into the dataframe (say, as strings) and then convert them to Nulls? Commented Nov 7, 2017 at 1:11
  • The most elegant solution would be to use a replaceAll and make your data uniform. Commented Nov 7, 2017 at 8:55
  • @combinatorist, I want to read the data against a schema and use it as a Dataset. Certain fields that are integers by default contain values like "N/A" or "-", all of which I want parsed as null so they can be read into the integer fields of my schema case class. So I'd prefer to do it when reading from the file into a Dataset itself. Commented Nov 7, 2017 at 17:14
  • @philantrovert I would do that as a last resort. Ideally, I want Spark to handle the whole thing rather than a regular in-memory replaceAll. Commented Nov 7, 2017 at 17:15
  • @VishnuPrathish, what if you read the field into a dataframe as a string, make the Null replacements there, cast the field to an int, and then convert that dataframe to a Dataset? Commented Nov 7, 2017 at 17:18
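
Pulling the comments together, here is a minimal Spark/Scala sketch of the read-as-string, replace, then cast approach. The file path, column names, placeholder tokens, and the `Record` case class are all assumptions for illustration, not from the thread:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Hypothetical target schema; Option[Int] lets the field hold nulls
case class Record(id: Int, amount: Option[Int])

val spark = SparkSession.builder()
  .appName("null-tokens")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Read every column as a string first (no inferSchema), so the
// placeholder tokens survive the initial load
val raw = spark.read
  .option("header", "true")
  .csv("data.csv")                     // assumed path

val nullTokens = Seq("N/A", "-", "")   // assumed placeholder values

// Replace the placeholder tokens with real nulls, then cast to int.
// The cast itself also yields null for any remaining unparseable strings.
val cleaned = raw
  .withColumn("amount",
    when(col("amount").isin(nullTokens: _*), null)
      .otherwise(col("amount"))
      .cast("int"))
  .withColumn("id", col("id").cast("int"))

// With nulls in place and types matching, conversion to a Dataset works
val ds = cleaned.as[Record]
```

Note that Spark's CSV reader does have a `nullValue` option, but it accepts a single token per read; when several distinct placeholders ("N/A", "-", etc.) must all become null, the string-then-replace approach sketched above covers them in one pass, and Spark still executes the whole pipeline lazily and distributed rather than as an in-memory replaceAll.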