I'm an Apache Spark newbie, and I want to read an XML file and count the number of words per title. The XML file looks like this:

    <title>first title</title>
    <words>there are seven words in this example</words>
    <title>second title</title>
    <words>there are more words here, ten words to be precise</words>

I'm using Python to write the Spark job, but when I type

    sc.textFile("file://...")

it automatically splits my file using the line break (\n) as its delimiter. I'd like it to split across several lines instead, so that each record runs until it finds "<title>" again.

The result I'd like to obtain would be something like:

    first title: 7
    second title: 10

How can I achieve this?

Thanks in advance
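
For reference, here is a minimal sketch of one direction that might work. It assumes Hadoop's `textinputformat.record.delimiter` setting can be passed through `sc.newAPIHadoopFile` so that each record ends at the next `<title>`. The helper names `count_words` and `title_word_counts` are made up for illustration, and the Spark part has not been run against a real cluster. The parsing helper is plain Python, so it can be checked without Spark.

```python
import re

# Assumption: every record starts right after a "<title>" tag, so the
# remaining text looks like "first title</title>\n<words>...</words>\n".
RECORD_DELIMITER = "<title>"
RECORD_RE = re.compile(r"\s*(.*?)</title>\s*<words>(.*?)</words>", re.DOTALL)


def count_words(record):
    """Return (title, word_count) for one record, or None if it does not match."""
    match = RECORD_RE.match(record)
    if match is None:
        # For example, the empty chunk that precedes the first <title>.
        return None
    title, words = match.groups()
    return title.strip(), len(words.split())


def title_word_counts(sc, path):
    """Hypothetical helper: build an RDD of (title, word_count) pairs.

    `sc` is an existing SparkContext. The file is read with a custom
    record delimiter instead of the default line break.
    """
    records = sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf={"textinputformat.record.delimiter": RECORD_DELIMITER},
    )
    # Each element is (byte_offset, text); only the text is needed.
    return records.values().map(count_words).filter(lambda r: r is not None)


if __name__ == "__main__":
    # Local check of the parsing step, simulating the split without Spark.
    sample = (
        "<title>first title</title>\n"
        "<words>there are seven words in this example</words>\n"
        "<title>second title</title>\n"
        "<words>there are more words here, ten words to be precise</words>\n"
    )
    for chunk in sample.split(RECORD_DELIMITER):
        result = count_words(chunk)
        if result is not None:
            print("%s: %d" % result)
```

Words are counted by splitting on whitespace, so "here," counts as one word. With a real SparkContext, calling `title_word_counts(sc, "file://...").collect()` should give the list of pairs, assuming the delimiter setting behaves as described.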