Change delimiter in Apache Spark

Question

I'm an Apache Spark newbie, and I want to read an XML file and count the number of words per title. The XML file looks like this:

<title>first title</title>
<words>there are seven words in this example</words>
<title>second title</title>
<words>there are more words here, ten words to be precise</words>

I'm using Python to write the Spark job, but when I type

sc.textFile("file://...")

it automatically splits my file using the line break (\n) as its delimiter. I'd like each record to span several lines instead, ending where the next "<title>" begins.

The result I'd like to obtain would be something like:

first title: 7
second title: 10

How can I achieve this?

Thanks in advance

Could you check this: stackoverflow.com/questions/46408558/…
– Avishek Bhattacharya, Commented Sep 26, 2017 at 14:16

Zouzias · Accepted Answer · 2017-09-26 13:38:10Z


If you are working with XML files, I would suggest giving spark-xml a try.
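If you would rather stay with plain RDDs, here is a minimal sketch of a different approach (not spark-xml): ask Hadoop's TextInputFormat to split records on "<title>" instead of "\n", via the `textinputformat.record.delimiter` setting. It assumes the file looks like the sample in the question, with every <words> element directly following its <title>. The helper names `parse_record` and `count_words_per_title` are made up for illustration, and the path you pass in is your own.

```python
import re

# Matches one chunk produced by splitting the file on "<title>", e.g.
# "first title</title>\n<words>there are seven words ...</words>\n"
_RECORD = re.compile(r"(.*?)</title>\s*<words>(.*?)</words>", re.DOTALL)


def parse_record(record):
    """Return (title, word_count) for one chunk, or None if it does not match.

    Hypothetical helper: words are counted by splitting on whitespace.
    """
    match = _RECORD.match(record)
    if match is None:
        # e.g. the empty chunk that precedes the very first <title>
        return None
    title, words = match.groups()
    return title.strip(), len(words.split())


def count_words_per_title(sc, path):
    """Build an RDD of (title, word_count) pairs from the file at `path`.

    Hypothetical helper: `sc` is an existing SparkContext.
    """
    # Tell TextInputFormat to end each record at "<title>" rather than "\n".
    conf = {"textinputformat.record.delimiter": "<title>"}
    records = sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=conf,
    )
    # Each element is (byte offset, chunk text); only the text is needed.
    return (
        records.map(lambda kv: parse_record(kv[1]))
        .filter(lambda pair: pair is not None)
    )
```

You would then call something like count_words_per_title(sc, "file://...").collect() and print each pair. The parsing step is plain Python, so you can check it on a string without starting Spark.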

