0

I'd like to take a web-hosted xml podcast file and loop through, putting all the titles into a txt file matching the guid, ie abcd.mp3.txt (or abcd.txt) will contain This is the title

<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0"> <channel> <item> <title>This is the title</title> <enclosure url="http://www.example.com/abcd.mp3" length="402024" type="audio/mpeg"/> <guid>http://www.example.com/abcd.mp3</guid> 

I've SO'd the question and looked at xmlstarlet, xmlgrep, xmlsh. And then there's things like Osmosis which looks powerful but requires node and is lacking practical documentation. Ideally using as few external dependencies as possible (although Python 3.6 is installed).

After a morning at this I'm starting to wonder if I'm over-thinking/complicating things. Any pointers appreciated.

5
  • Is a Perl solution acceptable? We really need a proper sample of the XML. Commented Mar 30, 2017 at 12:20
  • You can use XSLT/Xpath to produce a text file as the output (stackoverflow.com/questions/5908668/…) Commented Mar 30, 2017 at 12:35
  • Thanks @Borodin. Perl would be fine - although regarding a "proper sample", I thought xml of a specific type (eg podcast-1.0.dtd) was fairly standard? Commented Mar 30, 2017 at 12:47
  • Thanks @Hobbes - I'm not entirely sure how I'd apply this to a command line scenario? I thought XSLT was a stylesheet that you applied to a web page to render an xml file in-browser? Commented Mar 30, 2017 at 12:49
  • XSLT is a language for transforming XML documents. The stylesheet you mention is just one of its applications. There are command-line XSLT processors, e.g. saxonica.com and MSXSL. Commented Mar 30, 2017 at 12:52

1 Answer 1

0

OK, after much messing with stylesheets, I stumbled upon BeautifulSoup.

And the answer is as simple as this (HT to this guide)

pip install bs4 pip install lxml 

and then

#! /usr/bin/env python3 from bs4 import BeautifulSoup import requests url = 'http://www.example.com/somepodcast.xml' content = requests.get(url).content soup = BeautifulSoup(content,'lxml') # choose lxml parser titles = soup.find_all('title') for title in titles: print(title) # or do whatever. 

Thanks for the other suggestions, but this works for me as far as not messing with xpaths, regex etc.

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.