I'm an Apache Spark newbie, and I want to read an XML file and count the number of words per title. The XML file looks like this:

<title>first title</title>
<words>there are seven words in this example</words>
<title>second title</title>
<words>there are more words here, ten words to be precise</words>

I'm using Python to write the Spark job, but when I type

sc.textFile("file://...") 

it automatically splits my file using the line break (\n) as its delimiter. I'd like each record to span several lines instead, running until it finds "<title>" again.

The result I'd like to obtain would be something like:

first title: 7
second title: 10

How can I achieve this?

Thanks in advance

1 Answer

I would suggest trying spark-xml if you work with XML files.
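If you would rather not add a dependency, a different technique (not spark-xml) is to keep plain text input but change the record delimiter through Hadoop's `textinputformat.record.delimiter` setting, so that each record starts at "<title>". Below is a minimal sketch, assuming the file always alternates `<title>` and `<words>` elements as in the question. The helper names `parse_record` and `title_word_counts` are made up for illustration, and I have not run this against your actual file.

```python
import re

# Matches what is left of one record after splitting on "<title>":
# the title text, its closing tag, then the <words> element
# (possibly spread over several lines).
RECORD_RE = re.compile(r"(.*?)</title>\s*<words>(.*?)</words>", re.DOTALL)


def parse_record(record):
    """Return (title, word_count) for one record, or None if it does not match."""
    match = RECORD_RE.search(record)
    if match is None:
        return None
    title, words = match.groups()
    # Words are counted by splitting on whitespace.
    return title.strip(), len(words.split())


def title_word_counts(sc, path):
    """Build an RDD of (title, word_count) pairs from the file at `path`.

    `sc` is an existing SparkContext (hypothetical helper, for illustration).
    """
    records = sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        # Split records on "<title>" instead of on "\n".
        conf={"textinputformat.record.delimiter": "<title>"},
    )
    # Each element is a (byte offset, record text) pair; only the text is needed.
    parsed = records.map(lambda kv: parse_record(kv[1]))
    # The chunk before the first "<title>" is empty and parses to None.
    return parsed.filter(lambda pair: pair is not None)
```

Calling `title_word_counts(sc, "file://...").collect()` with your own path should then give one (title, count) pair per title, which you can format as "title: count".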
