PicoFeed was originally developed for Miniflux, a minimalist, open source news reader.
However, the library can be used in any project. PicoFeed is tested against many different feeds and is simple to use.
- Simple and fast
- Feed parser for Atom 1.0 and RSS (0.91, 0.92, 1.0 and 2.0)
- Feed writer for Atom 1.0 and RSS 2.0
- Import/Export OPML subscriptions
- Content filter: HTML cleanup, removal of pixel trackers and ads
- Multiple HTTP client adapters: cURL or Stream Context
- Content grabber: download the full content from the original website
- License: Unlicense http://unlicense.org/
- PHP >= 5.3
- libxml >= 2.7
- XML PHP extensions: DOM and SimpleXML
- cURL or Stream Context (allow_url_fopen=On)
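
The requirements listed above can be checked at runtime before loading the library. The following is a minimal sketch, not part of PicoFeed: the function name `picofeed_missing_requirements()` is hypothetical, and only standard PHP functions are used.

```php
<?php
// Hypothetical helper (not part of PicoFeed): returns the list of unmet requirements.
function picofeed_missing_requirements()
{
    $missing = array();

    if (version_compare(PHP_VERSION, '5.3.0', '<')) {
        $missing[] = 'PHP >= 5.3';
    }

    // LIBXML_DOTTED_VERSION is only defined when libxml support is compiled in.
    if (!defined('LIBXML_DOTTED_VERSION')
        || version_compare(LIBXML_DOTTED_VERSION, '2.7.0', '<')) {
        $missing[] = 'libxml >= 2.7';
    }

    foreach (array('dom', 'SimpleXML') as $extension) {
        if (!extension_loaded($extension)) {
            $missing[] = 'PHP extension: ' . $extension;
        }
    }

    // At least one HTTP transport must be available.
    $hasCurl = extension_loaded('curl');
    $hasStream = (bool) ini_get('allow_url_fopen');

    if (!$hasCurl && !$hasStream) {
        $missing[] = 'cURL or allow_url_fopen=On';
    }

    return $missing;
}

$missing = picofeed_missing_requirements();

echo empty($missing)
    ? "All requirements met\n"
    : 'Missing: ' . implode(', ', $missing) . "\n";
```
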
Import an OPML file:

    require 'vendor/PicoFeed/Import.php';

    use PicoFeed\Import;

    $opml = file_get_contents('mySubscriptions.opml');
    $import = new Import($opml);
    $entries = $import->execute();

    print_r($entries);

Export subscriptions to OPML:

    require 'vendor/PicoFeed/Export.php';

    use PicoFeed\Export;

    $feeds = array(
        array(
            'title' => 'Site title',
            'description' => 'Optional description',
            'site_url' => 'http://petitcodeur.fr/',
            'site_feed' => 'http://petitcodeur.fr/feed.xml'
        )
    );

    $export = new Export($feeds);
    $opml = $export->execute();

    echo $opml; // XML content

Parse a feed:

    require 'vendor/PicoFeed/Reader.php';

    use PicoFeed\Reader;

    $reader = new Reader;

    // Try to discover the XML feed automatically
    $reader->download('http://petitcodeur.fr/');

    $parser = $reader->getParser();

    if ($parser !== false) {

        $feed = $parser->execute();

        echo $feed->title;
        echo $feed->url;
        print_r($feed->items);
    }

Use HTTP caching to skip feeds that have not changed:

    require 'vendor/PicoFeed/Reader.php';

    use PicoFeed\Reader;

    $reader = new Reader;

    // Get last modified infos from previous requests
    $lastModified = '...';
    $etag = '...';

    // Download directly the feed
    $resource = $reader->download('http://petitcodeur.fr/feed.xml', $lastModified, $etag);

    if ($resource->isModified()) {

        $parser = $reader->getParser();

        if ($parser !== false) {

            $feed = $parser->execute();

            echo $feed->title;
            echo $feed->url;
            print_r($feed->items);

            // Save cache infos for the next request
            $lastModified = $resource->getLastModified();
            $etag = $resource->getEtag();
        }
    }

Pass a custom timeout and user agent:

    $reader->download(
        'http://petitcodeur.fr/',
        'last modified date',
        'etag value',
        10,
        'My RSS reader user agent'
    );

To use an HTTP proxy, call the static method proxy() before everything else:

    PicoFeed\Client::proxy($hostname, $port);

If your proxy is protected with a login/password:

    PicoFeed\Client::proxy($hostname, $port, $username, $password);

Generate an RSS 2.0 feed:

    require_once 'lib/PicoFeed/Writers/Rss20.php';

    use PicoFeed\Writers\Rss20;

    $writer = new Rss20();
    $writer->title = 'My site';
    $writer->site_url = 'http://boo/';
    $writer->feed_url = 'http://boo/feed.atom';
    $writer->author = array(
        'name' => 'Me',
        'url' => 'http://me',
        'email' => 'me@here'
    );

    $writer->items[] = array(
        'title' => 'My article 1',
        'updated' => strtotime('-2 days'),
        'url' => 'http://foo/bar',
        'summary' => 'Super summary',
        'content' => '<p>content</p>'
    );

    $writer->items[] = array(
        'title' => 'My article 2',
        'updated' => strtotime('-1 day'),
        'url' => 'http://foo/bar2',
        'summary' => 'Super summary 2',
        'content' => '<p>content 2 © 2015</p>',
        'author' => array(
            'name' => 'Me too',
        )
    );

    $writer->items[] = array(
        'title' => 'My article 3',
        'url' => 'http://foo/bar3'
    );

    echo $writer->execute();

Generate an Atom feed:

    require_once 'lib/PicoFeed/Writers/Atom.php';

    use PicoFeed\Writers\Atom;

    $writer = new Atom();
    $writer->title = 'My site';
    $writer->site_url = 'http://boo/';
    $writer->feed_url = 'http://boo/feed.atom';
    $writer->author = array(
        'name' => 'Me',
        'url' => 'http://me',
        'email' => 'me@here'
    );

    $writer->items[] = array(
        'title' => 'My article 1',
        'updated' => strtotime('-2 days'),
        'url' => 'http://foo/bar',
        'summary' => 'Super summary',
        'content' => '<p>content</p>'
    );

    echo $writer->execute();

You can get all debug output by calling this code:

    print_r(PicoFeed\Logging::$messages);

You will get output like this:

    Array
    (
        [0] => Fetch URL: http://petitcodeur.fr/feed.xml
        [1] => Etag:
        [2] => Last-Modified:
        [3] => cURL total time: 0.711378
        [4] => cURL dns lookup time: 0.001064
        [5] => cURL connect time: 0.100733
        [6] => cURL speed download: 74825
        [7] => HTTP status code: 200
        [8] => HTTP headers: Set-Cookie => start=R2701971637; path=/; expires=Sat, 06-Jul-2013 05:16:33 GMT
        [9] => HTTP headers: Date => Sat, 06 Jul 2013 03:55:52 GMT
        [10] => HTTP headers: Content-Type => application/xml
        [11] => HTTP headers: Content-Length => 53229
        [12] => HTTP headers: Connection => close
        [13] => HTTP headers: Server => Apache
        [14] => HTTP headers: Last-Modified => Tue, 02 Jul 2013 03:26:02 GMT
        [15] => HTTP headers: ETag => "393e79c-cfed-4e07ee78b2680"
        [16] => HTTP headers: Accept-Ranges => bytes
    )

The filter variables are static arrays; you can extend the current array or replace it.
For example, to add a new entry to the iframe whitelist:

    Filter::$iframe_whitelist[] = 'http://www.kickstarter.com';

Or to replace the entire whitelist:

    Filter::$iframe_whitelist = array('http://www.kickstarter.com');

Available variables:

    // Allow only specified tags and attributes
    Filter::$whitelist_tags

    // Strip content of these tags
    Filter::$blacklist_tags

    // Allow only specified URI scheme
    Filter::$whitelist_scheme

    // List of attributes used for external resources: src and href
    Filter::$media_attributes

    // Blacklist of external resources
    Filter::$media_blacklist

    // Required attributes for tags, if the attribute is missing the tag is dropped
    Filter::$required_attributes

    // Add attribute to specified tags
    Filter::$add_attributes

    // Integer Attributes
    Filter::$integer_attributes

    // Iframe allowed source
    Filter::$iframe_whitelist

For more details, have a look at the Filter class.
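
To illustrate what extending a whitelist array means in practice, here is a small standalone sketch. It does not use PicoFeed: the function `is_iframe_allowed()` is hypothetical, the `$whitelist` array merely stands in for `Filter::$iframe_whitelist`, and the prefix comparison is an assumption about how entries might be matched, not a description of the Filter class.

```php
<?php
// Hypothetical helper, not PicoFeed's code.
// Assumption: a whitelist entry matches when the iframe source starts with it.
function is_iframe_allowed($src, array $whitelist)
{
    foreach ($whitelist as $prefix) {
        if (strpos($src, $prefix) === 0) {
            return true;
        }
    }

    return false;
}

// Stand-in for Filter::$iframe_whitelist (the initial entry is illustrative).
$whitelist = array('http://www.youtube.com');

// Extend the existing list instead of replacing it.
$whitelist[] = 'http://www.kickstarter.com';

var_dump(is_iframe_allowed('http://www.kickstarter.com/projects/1', $whitelist)); // bool(true)
var_dump(is_iframe_allowed('http://example.org/embed', $whitelist));              // bool(false)
```

Appending with `[]` keeps the default entries, while assigning a new array discards them, so replacing is only appropriate when the defaults should no longer apply.
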
The content grabber proceeds in this order:

- Try with rules first (XPath patterns) for the domain name (see PicoFeed\Rules\)
- Try to find the text content by using common attributes for class and id
- Finally, if nothing is found, the feed content is displayed
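
The last two steps of the list above could look roughly like the following sketch. It is not PicoFeed's implementation: the function name and the candidate attribute values (`content`, `article`, `post`, `entry`) are assumptions chosen for illustration, and only PHP's DOM extension is used.

```php
<?php
// Hypothetical sketch: guess the main content from common class/id values,
// and fall back to the feed content when nothing matches.
function find_content_by_common_attributes($html, $feedContent)
{
    // Assumed candidate values; the real list may differ.
    $candidates = array('content', 'article', 'post', 'entry');

    if ($html === '') {
        return $feedContent;
    }

    $dom = new DOMDocument();

    // Real-world HTML is rarely valid: collect parser errors instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    foreach ($candidates as $name) {
        $query = sprintf('//*[contains(@class, "%1$s") or contains(@id, "%1$s")]', $name);
        $nodes = $xpath->query($query);

        if ($nodes !== false && $nodes->length > 0) {
            // Serialize the first matching element, including its children.
            return $dom->saveXML($nodes->item(0));
        }
    }

    // Nothing matched: keep the content shipped in the feed.
    return $feedContent;
}

$html = '<html><body><div id="content"><p>Full article</p></div></body></html>';

echo find_content_by_common_attributes($html, 'Summary from the feed'), "\n";
```
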
The content downloader uses a fake user agent, currently Google Chrome on Mac OS X.
The best results are obtained with XPath rule files.
There is a PHP script inside PicoFeed to import Fivefilters rules, but I don't use it because most of these patterns are out of date.
Add a PHP file to the directory PicoFeed\Rules; the filename must be the domain name:
Example with the BBC website, www.bbc.co.uk.php:

    <?php
    return array(
        'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
        'body' => array(
            '//div[@class="story-body"]',
        ),
        'strip' => array(
            '//script',
            '//form',
            '//style',
            '//*[@class="story-date"]',
            '//*[@class="story-header"]',
            '//*[@class="story-related"]',
            '//*[contains(@class, "byline")]',
            '//*[contains(@class, "story-feature")]',
            '//*[@id="video-carousel-container"]',
            '//*[@id="also-related-links"]',
            '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
        )
    );

Currently, only body, strip and test_url are supported.
Don't forget to send a pull request or open a ticket to share your contribution with everybody.
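
When writing a rule, it can help to see how the `strip` and `body` patterns might interact. The sketch below is an assumption about the general approach, not PicoFeed's actual code: the function `apply_rule()` is hypothetical, it removes every node matched by `strip`, then returns the first node matched by `body`.

```php
<?php
// Hypothetical illustration of applying a rule array to an HTML page.
function apply_rule($html, array $rule)
{
    if ($html === '') {
        return '';
    }

    $dom = new DOMDocument();

    // Collect parser errors instead of emitting warnings on invalid HTML.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // 1. Remove every node matched by a "strip" pattern.
    $strip = isset($rule['strip']) ? $rule['strip'] : array();

    foreach ($strip as $pattern) {
        $nodes = $xpath->query($pattern);

        if ($nodes === false) {
            continue; // Invalid XPath expression
        }

        // Copy the result to an array so removals cannot disturb the iteration.
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // 2. Return the first node matched by a "body" pattern.
    $body = isset($rule['body']) ? $rule['body'] : array();

    foreach ($body as $pattern) {
        $nodes = $xpath->query($pattern);

        if ($nodes !== false && $nodes->length > 0) {
            return $dom->saveXML($nodes->item(0));
        }
    }

    return '';
}

$rule = array(
    'body' => array('//div[@class="story-body"]'),
    'strip' => array('//script', '//*[@class="story-date"]'),
);

$html = '<html><body><div class="story-body"><p>Text</p>'
      . '<script>track()</script><p class="story-date">Today</p></div></body></html>';

echo apply_rule($html, $rule), "\n";
```

Stripping before selecting the body keeps the returned fragment free of scripts, dates and related-link blocks, which is why the BBC example lists far more `strip` patterns than `body` patterns.
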
To enable the content grabber:

    require 'vendor/PicoFeed/Reader.php';

    use PicoFeed\Reader;

    $reader = new Reader;
    $reader->download('http://www.egscomics.com/rss.php');

    $parser = $reader->getParser();

    if ($parser !== false) {

        $parser->grabber = true; // <= Enable the content grabber
        $feed = $parser->execute();
        // ...
    }

When the content grabber is enabled, everything is slower because a new HTTP request is made for each item and the downloaded HTML is parsed with XML/XPath.
If you want to add new rules, just open a ticket and I will do it.
- *.blog.lemonde.fr
- *.blog.nytimes.com
- *.nytimes.com
- *.phoronix.com
- *.slate.com
- *.theguardian.com
- *.wikipedia.org
- *.wired.com
- *.wsj.com
- github.com
- golem.de
- ing.dk
- karriere.jobfinder.dk
- lifehacker.com
- lists.*
- medium.com
- pastebin.com
- plus.google.com
- rue89.com
- smallhousebliss.com
- spiegel.de
- techcrunch.com
- version2.dk
- www.bbc.co.uk
- www.businessweek.com
- www.cnn.com
- www.egscomics.com
- www.forbes.com
- www.lemonde.fr
- www.lepoint.fr
- www.npr.org
- www.numerama.com
- www.slate.fr
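
Some entries in the list above contain wildcards (`*.nytimes.com`, `lists.*`). How PicoFeed resolves a hostname to a rule file is not described here; the sketch below only shows one plausible way to do it, assuming shell-style wildcard matching with PHP's `fnmatch()`. The function name `find_rule_for_host()` is hypothetical.

```php
<?php
// Hypothetical lookup: return the first pattern matching the hostname, or null.
// Assumption: patterns use shell-style wildcards, as understood by fnmatch().
function find_rule_for_host($host, array $patterns)
{
    foreach ($patterns as $pattern) {
        if (fnmatch($pattern, $host)) {
            return $pattern;
        }
    }

    return null;
}

$patterns = array('*.nytimes.com', 'lists.*', 'github.com', 'www.bbc.co.uk');

var_dump(find_rule_for_host('www.nytimes.com', $patterns));   // string(13) "*.nytimes.com"
var_dump(find_rule_for_host('lists.example.org', $patterns)); // string(7) "lists.*"
var_dump(find_rule_for_host('example.org', $patterns));       // NULL
```

Under this assumption, `*.nytimes.com` does not match the bare host `nytimes.com`, because the pattern requires a dot before the domain.
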