Parsel is a BSD-licensed Python library to extract data from HTML, JSON, and XML documents.
It supports:
- CSS and XPath expressions for HTML and XML documents
- JMESPath expressions for JSON documents
- Regular expressions
Find the Parsel online documentation at https://parsel.readthedocs.org.
Example (open online demo):
>>> from parsel import Selector
>>> text = """
... <html>
...     <body>
...         <h1>Hello, Parsel!</h1>
...         <ul>
...             <li><a href="http://example.com">Link 1</a></li>
...             <li><a href="http://scrapy.org">Link 2</a></li>
...         </ul>
...         <script type="application/json">{"a": ["b", "c"]}</script>
...     </body>
... </html>"""
>>> selector = Selector(text=text)
>>> selector.css('h1::text').get()
'Hello, Parsel!'
>>> selector.xpath('//h1/text()').re(r'\w+')
['Hello', 'Parsel']
>>> for li in selector.css('ul > li'):
...     print(li.xpath('.//@href').get())
http://example.com
http://scrapy.org
>>> selector.css('script::text').jmespath("a").get()
'b'
>>> selector.css('script::text').jmespath("a").getall()
['b', 'c']