Commit 9150b3e

committed
Version 2
1 parent 9ae6fed commit 9150b3e

File tree

6 files changed

+540
-275
lines changed

.gitattributes

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
* linguist-vendored
2+
*.php linguist-vendored=false

DOC.md

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
1+
# Class HTML_Scraper
2+
### Static Functions:
3+
-`new_from($source)`
4+
5+
Create a new HTML_Scraper object from the passed source.
6+
`$source` can be of type `DOMNodeList`, `DOMNode` or `string`.
7+
8+
**Returns:**
9+
| Type | Description |
10+
|------|-------------|
11+
| `array` | When `$source` is an instance of `DOMNodeList` then returns an `array` of `HTML_Scraper` objects. |
12+
| `HTML_Scraper` | When `$source` is an instance of `DOMNode` or a `string` |
13+
14+
15+
-`CSS_to_Xpath(string $path) : string`
16+
17+
Translates a CSS selector into an equivalent XPath expression.
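
To give a feel for what such a translation involves, here is a minimal sketch that uses only PHP's built-in `DOMDocument` and `DOMXPath`. The XPath expression is a hand-written equivalent of the CSS selector `div.fic-title h1`; it is an assumption for illustration, and the exact expression `CSS_to_Xpath()` produces may differ.

```php
<?php
// Sketch only: built-in DOM classes, not HTML_Scraper itself.
$doc = new DOMDocument();
$doc->loadHTML(
    '<div class="fic-title main"><h1>Title</h1></div><div class="other"><h1>No</h1></div>',
    LIBXML_NOERROR
);
$xpath = new DOMXPath($doc);

// CSS: div.fic-title h1
// Hand-written XPath equivalent: match "fic-title" as a whole class token,
// then select any h1 descendant.
$expr = "//div[contains(concat(' ', normalize-space(@class), ' '), ' fic-title ')]//h1";

$nodes = $xpath->query($expr);
foreach ($nodes as $node) {
    echo $node->textContent, PHP_EOL;
}
```

Padding the class attribute with spaces before calling `contains()` avoids false matches on class names that merely contain `fic-title` as a substring.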
18+
19+
### Functions:
20+
-`__toString() : string`
21+
22+
Magic function to convert `HTML_Scraper` into a `string` containing the HTML code of the loaded document.
23+
24+
25+
-`textContent() : string`
26+
27+
Get the *textContent* of the loaded HTML document.
28+
29+
30+
-`load_HTML_str(string $source, int $options = NULL) : bool`
31+
32+
Load HTML from a string.
33+
34+
- `$options`
35+
A bitwise OR of `LIBXML_*` constants. `LIBXML_NOERROR | LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED` is always applied (even when `$options` is `NULL`).
36+
37+
Returns `TRUE` on success and `FALSE` on failure.
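
The effect of the always-applied flags can be sketched with PHP's built-in `DOMDocument::loadHTML()`. This is an illustration under assumptions, not the class's actual implementation: how `HTML_Scraper` combines the flags internally is not shown in this document, and `LIBXML_NOWARNING` is just an example of a caller-supplied option.

```php
<?php
// Sketch only: built-in DOMDocument, not HTML_Scraper itself.
$html = '<div><p>Hello</p></div>';

// Flags the documentation says are always applied.
$always = LIBXML_NOERROR | LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED;

// Example caller-supplied flags (assumption); NULL is treated as "no extra flags".
$options = LIBXML_NOWARNING;

$doc = new DOMDocument();
$ok  = $doc->loadHTML($html, $always | ($options ?? 0));

// With NODEFDTD and NOIMPLIED, no doctype and no <html>/<body> wrapper are added.
$out = $doc->saveHTML();
var_dump($ok);
echo $out;
```

Because the implied wrapper elements are suppressed, fragments with a single root element round-trip without extra markup, which is convenient when a scraper is built from a sub-tree of another document.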
38+
39+
40+
-`load_HTML_file(string $filename, int $options = NULL, array $context = NULL) : bool`
41+
42+
Load HTML from a file.
43+
44+
- `$options`
45+
*see `$options` in `HTML_Scraper->load_HTML_str()`*
46+
47+
- `$context`
48+
*see `$context` in `stream_context_create()`*
49+
50+
Returns `TRUE` on success and `FALSE` on failure.
51+
52+
53+
-`xpath(string $expr, int ...$items)`
54+
55+
Get the `DOMNode`(s) that match the passed *XPath* path expression.
56+
57+
- `$items`
58+
Index (or indexes) of the `DOMNode` to return from the `DOMNodeList` matching the *XPath* path expression.
59+
It is 0-indexed (*i.e.* use `0` for the first node, `1` for the second node, and so on).
60+
Negative values count from the end of the list (*i.e.* use `-1` for the last node, `-2` for the second-to-last node, and so on).
61+
If an invalid index is used, `NULL` is returned (*e.g.* if only two nodes match the *XPath* path expression, then index `3` returns `NULL`).
62+
63+
**Returns:**
64+
| Type | Description |
65+
|------|-------------|
66+
| `NULL` | When no node matches the XPath path expression |
67+
| `DOMNodeList` | When no `...$items` are passed |
68+
| `DOMNode` | When only one `...$items` is passed |
69+
| `array` | When more than one `...$items` are passed. Array contains `DOMNode` or `NULL` |
70+
71+
Returns the `DOMNodeList` (or a `DOMNode` when a single `$items` index is specified) that matches the specified *XPath* path expression.
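
The index and return-type rules above can be sketched as a small standalone function. `select_items()` is a hypothetical helper written for this illustration using PHP's built-in DOM classes; it mirrors the documented behaviour but is not taken from the `HTML_Scraper` source.

```php
<?php
/**
 * Hypothetical helper (not part of HTML_Scraper) that applies the
 * documented $items rules to a DOMNodeList.
 */
function select_items(DOMNodeList $nodes, int ...$items)
{
    $count = $nodes->length;
    if ($count === 0) {
        return null;                 // no node matched the expression
    }
    if (count($items) === 0) {
        return $nodes;               // no index given: return the whole list
    }
    $picked = [];
    foreach ($items as $i) {
        $index = $i < 0 ? $count + $i : $i;   // -1 maps to the last node
        $picked[] = ($index >= 0 && $index < $count) ? $nodes->item($index) : null;
    }
    // One index yields a single node (or NULL); several yield an array.
    return count($picked) === 1 ? $picked[0] : $picked;
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>a</li><li>b</li><li>c</li></ul>', LIBXML_NOERROR);
$xpath = new DOMXPath($doc);
$list  = $xpath->query('//li');

$first = select_items($list, 0);       // first <li>
$last  = select_items($list, -1);      // last <li>
$mixed = select_items($list, 1, 5);    // second <li>, then NULL for the invalid index
$none  = select_items($xpath->query('//table'), 0);  // nothing matches

echo $first->textContent, ' ', $last->textContent, PHP_EOL;
```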
72+
73+
74+
-`querySelector(string $selector, int ...$items)`
75+
76+
Same as `HTML_Scraper->xpath()` except that it takes a CSS selector instead of an *XPath* path expression.
77+
78+
-`xpath_extract($mapper, string $expr, int ...$items)`
79+
80+
Find `DOMNode`(s) in the same way as in `HTML_Scraper->xpath()` then extract data from the `DOMNode`(s) as specified by the `$mapper`.
81+
82+
- `$mapper`
83+
It can be one of the `string` values listed below or a `function` that takes a `DOMNode` and returns the extracted value.
84+
| Mapper Value | Description |
85+
|---|---|
86+
| `'innerHTML'` | Maps `DOMNode` to its *innerHTML* |
87+
| `'outerHTML'` | Maps `DOMNode` to its *outerHTML* |
88+
| `'textContent'` | Maps `DOMNode` to its *textContent* |
89+
| `'textContentTrim'` | Maps `DOMNode` to its *textContent* with leading and trailing whitespace removed |
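
As a rough illustration of what the four string mappers yield, the sketch below computes each value with plain DOM calls. `outer_html()` and `inner_html()` are hypothetical helpers written for this example; the library's own implementation may differ.

```php
<?php
// Hypothetical helpers (not part of the library) built on DOMDocument::saveHTML().
function outer_html(DOMNode $node): string
{
    // Serialise the node itself, including its own tag.
    return $node->ownerDocument->saveHTML($node);
}

function inner_html(DOMNode $node): string
{
    // Serialise each child and concatenate, leaving out the node's own tag.
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<div id="box">  Hi <b>there</b>  </div>', LIBXML_NOERROR);
$div = $doc->getElementsByTagName('div')->item(0);

$outer   = outer_html($div);          // 'outerHTML'
$inner   = inner_html($div);          // 'innerHTML'
$text    = $div->textContent;         // 'textContent'
$trimmed = trim($div->textContent);   // 'textContentTrim'

echo $trimmed, PHP_EOL;
```

A custom mapper function fills the same role as these helpers: it receives the matched `DOMNode` and returns whatever value should be extracted from it.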
90+
91+
-`querySelector_extract($mapper, string $selector, int ...$items)`
92+
93+
Same as `HTML_Scraper->xpath_extract()` except that it takes a CSS selector instead of an *XPath* path expression.
94+
95+
---
96+
97+
# Class DOMNodeHelper
98+
99+
### Static Functions:
100+
101+
-`innerHTML(DOMNode &$node) : string`
102+
103+
Returns *innerHTML* of the passed `DOMNode`.
104+
105+
106+
-`outerHTML(DOMNode &$node) : string`
107+
108+
Returns *outerHTML* of the passed `DOMNode`.
109+
110+
111+
-`xpath(DOMNode &$node, string $expr, int ...$items)`
112+
113+
Similar to `HTML_Scraper->xpath()` except that it works on a `DOMNode` instead of the `HTML_Scraper`'s `DOMDocument`.
114+
115+
-`querySelector(DOMNode &$node, string $selector, int ...$items)`
116+
117+
Similar to `DOMNodeHelper::xpath()` except that it takes a CSS selector instead of an *XPath* path expression.
118+
119+
-`getChildNode(DOMNode &$node, int ...$indexes)`
120+
121+
Get one or more child nodes of the `DOMNode`.
122+
123+
- `$indexes`
124+
*See `$items` in `HTML_Scraper->xpath()`.*
125+
126+
**Returns:**
127+
128+
| Type | Description |
129+
|---|---|
130+
| `DOMNodeList` | When no `...$indexes` is passed |
131+
| `DOMNode` | When only one `...$indexes` is passed |
132+
| `array` | When more than one `...$indexes` is passed. Array contains `DOMNode` or `NULL` |
133+
134+
135+
-`getChildElements(DOMNode &$node, int ...$indexes) : array`
136+
137+
Same as `DOMNodeHelper::getChildNode()` except that it works on child **elements** instead of child **nodes**.
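
The difference between child nodes and child elements can be seen with PHP's built-in DOM classes. This is a minimal sketch under the assumption that "elements" means `DOMElement` children only; it does not call `DOMNodeHelper` itself.

```php
<?php
// Sketch only: plain DOM calls, not DOMNodeHelper.
$doc = new DOMDocument();
$doc->loadHTML('<p>Hello <b>big</b> world<!-- note --></p>', LIBXML_NOERROR);
$p = $doc->getElementsByTagName('p')->item(0);

// Child nodes include text nodes and comments as well as elements.
$nodeCount = $p->childNodes->length;

// Child elements are only the DOMElement children.
$elements = [];
foreach ($p->childNodes as $child) {
    if ($child instanceof DOMElement) {
        $elements[] = $child;
    }
}

echo "nodes: {$nodeCount}, elements: ", count($elements), PHP_EOL;
```

Here the paragraph has two text nodes, one `<b>` element, and one comment, so index `0` refers to the text `Hello ` among the child nodes but to `<b>` among the child elements.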
138+
139+
-`remove_self(DOMNode &$node)`
140+
141+
Removes the `DOMNode` from its parent `DOMDocument`.
142+
143+
-`filter_child_elements_xpath(DOMNode &$node, string ...$exprs)`
144+
145+
Removes the child elements of the passed `DOMNode` that match the passed *XPath* path expression(s).
146+
147+
-`filter_child_elements_querySelector(DOMNode &$node, string ...$selectors)`
148+
149+
Removes the child elements of the passed `DOMNode` that match the passed CSS selector(s).
150+
151+
-`filter_child_elements_index(DOMNode &$node, int ...$indexes)`
152+
153+
Removes the child elements of the passed `DOMNode` specified by the `...$indexes`.
154+
155+
- `$indexes`
156+
*See `$items` in `HTML_Scraper->xpath()`.*

README.md

Lines changed: 25 additions & 139 deletions
Original file line numberDiff line numberDiff line change
@@ -1,160 +1,46 @@
11
# HTML Scraper
2-
A PHP class to simplify data extraction from HTML.
2+
A set of PHP classes to simplify data extraction from HTML.
33

44
---
55

6-
>Base code for the *CSS_to_Xpath* method in *HTML_Scraper* was cloned from [https://github.com/zendframework/zend-dom](https://github.com/zendframework/zend-dom).
7-
>
6+
>Base code for the *CSS_to_Xpath* method in *HTML_Scraper* was cloned from [https://github.com/zendframework/zend-dom](https://github.com/zendframework/zend-dom).
87
>Zend Framework
9-
>: [http://framework.zend.com/](http://framework.zend.com/)
10-
>
8+
>: [http://framework.zend.com/](http://framework.zend.com/)
119
>Repository
12-
>: [http://github.com/zendframework/zf2](http://github.com/zendframework/zf2)
13-
>
14-
>Copyright (c) 2005-2015 Zend Technologies USA Inc. [http://www.zend.com](http://www.zend.com)
15-
>
10+
>: [http://github.com/zendframework/zf2](http://github.com/zendframework/zf2)
11+
>Copyright (c) 2005-2015 Zend Technologies USA Inc. [http://www.zend.com](http://www.zend.com)
1612
>License
1713
>: [https://framework.zend.com/license](https://framework.zend.com/license) New BSD License
1814
---
19-
## Static methods:
20-
---
21-
-`CSS_to_Xpath(string $selector) : string`
22-
23-
Translate *CSS* selector to *XPath* path query.
24-
25-
*Returns:*
26-
- `string` containing the equivalent *XPath* path query.
27-
---
28-
-`from($source [, bool $utf = TRUE])`
29-
30-
Create new `HTML_Scraper` object from various sources.
31-
32-
`$source` can be of type
33-
- `DOMNodeList`
34-
- `DOMNode`
35-
- `string` containing HTML
36-
37-
*Returns:*
38-
- `array` of `HTML_Scraper` objects when `$source instanceof DOMNodeList`
39-
- `HTML_Scraper` object when `$source instanceof DOMNode`
40-
- `HTML_Scraper` object when `$source` is `string`
41-
---
42-
-`outerHTML(DOMNode $node) : string`
43-
44-
Extract *outerHTML* from a `DOMNode`
4515

46-
*Returns:*
47-
- `string` containing *outerHTML* of the `DOMNode`
48-
---
49-
-`innerHTML(DOMNode $node) : string`
50-
51-
Extract *innerHTML* from a `DOMNode`
52-
53-
*Returns:*
54-
- `string` containing *innerHTML* of the `DOMNode`
55-
---
56-
## Methods:
57-
---
58-
-`__toString() : string`
59-
60-
***Magic*** method to convert `HTML_Scraper` object to HTML `string`.
61-
---
62-
-`from_querySelector(string $selector, int $item = NULL, bool $utf = TRUE)`
16+
For *basic* documentation, see `DOC.md`.
6317

64-
Create `HTML_Scraper` object (or `array` of objects) from `DOMNode` (or `DOMNodeList`) that matches the specified *CSS* selector.
65-
66-
Returns `NULL` when no match is found.
67-
---
68-
-`from_xpath(string $expr, int $item = NULL, bool $utf = TRUE)`
69-
70-
Create `HTML_Scraper` object (or `array` of objects) from `DOMNode` (or `DOMNodeList`) that matches the specified *XPath* path expression.
71-
72-
Returns `NULL` when no match is found.
73-
---
74-
-`getBody() : string`
75-
76-
Get *innerHTML* of `document.body`
77-
---
78-
-`getHead() : string`
79-
80-
Get *innerHTML* of `document.head`
81-
---
82-
-`load_HTML_file(string $filename, bool $utf = TRUE, resource $context = NULL) : bool`
83-
84-
Load *HTML* text from local or remote file.
85-
86-
Returns `TRUE` on success and `FALSE` on failure.
87-
---
88-
-`load_HTML_str(string $source, bool $utf = TRUE) : bool`
89-
90-
Load *HTML* text from `string`.
91-
92-
Returns `TRUE` on success and `FALSE` on failure.
93-
---
94-
-`querySelector(string $selector, int $item = NULL)`
95-
96-
Returns `DOMNodeList` (or `DOMNode` when `$item` index is specified) that matches the specified *CSS* selector.
97-
98-
`$item` is *0-indexed*.
99-
100-
Returns `NULL` when no match is found.
101-
---
102-
-`querySelector_innerHTML(string $expr, int $item = 0)`
103-
104-
Returns *innerHTML* of the `DOMNode` that matches the specified *CSS* selector.
105-
106-
Returns `NULL` when no match is found.
107-
---
108-
-`querySelector_outerHTML(string $expr, int $item = 0)`
109-
110-
Returns *outerHTML* of the `DOMNode` that matches the specified *CSS* selector.
111-
112-
Returns `NULL` when no match is found.
113-
---
114-
-`querySelector_textContent(string $expr, int $item = 0)`
115-
116-
Returns *textContent* of the `DOMNode` that matches the specified *CSS* selector.
117-
118-
Returns `NULL` when no match is found.
119-
---
120-
-`xpath(string $expr, int $item = NULL)`
121-
122-
Returns `DOMNodeList` (or `DOMNode` when `$item` index is specified) that matches the specified *XPath* path expression.
123-
124-
`$item` is *0-indexed*.
18+
### Example
19+
```php
20+
<?php
21+
require_once 'HTML_Scraper.php';
12522

126-
Returns `NULL` when no match is found.
127-
---
128-
-`xpath_innerHTML(string $expr, int $item = 0)`
23+
$doc = new HTML_Scraper;
12924

130-
Returns *innerHTML* of the `DOMNode` that matches the specified *XPath* path expression.
25+
if(!$doc->load_HTML_file('https://www.royalroad.com/fiction/10073/the-wandering-inn')) {
26+
echo 'Unable to load data';
27+
exit(1);
28+
}
13129

132-
Returns `NULL` when no match is found.
133-
---
134-
-`xpath_outerHTML(string $expr, int $item = 0)`
30+
$data = [];
13531

136-
Returns *outerHTML* of the `DOMNode` that matches the specified *XPath* path expression.
32+
$data['title'] = $doc->querySelector_extract('textContentTrim', 'div.fic-title h1[property="name"]', 0);
13733

138-
Returns `NULL` when no match is found.
139-
---
140-
-`xpath_textContent(string $expr, int $item = 0)`
34+
$data['url'] = $doc->xpath_extract(function($meta) {
35+
return $meta->getAttribute('content');
36+
}, '//meta[@property="og:url"]', 0);
14137

142-
Returns *textContent* of the `DOMNode` that matches the specified *XPath* path expression.
38+
$data['description'] = $doc->querySelector_extract(function(&$div) {
39+
return trim(DOMNodeHelper::innerHTML($div));
40+
}, 'div.description div[property="description"]', 0);
14341

144-
Returns `NULL` when no match is found.
145-
---
146-
## Example:
147-
```php
148-
<?php
149-
$doc = new HTML_Scraper;
150-
if($doc->load_HTML_file('sample_data_file.html') === TRUE) {
151-
$title = $doc->querySelector_textContent('.fic-title [property="name"]', 0);
152-
echo "Fiction name is {$title}.<br />",
153-
154-
$rows = $doc->querySelector('#chapters tbody tr');
155-
echo "There are ", count($rows), "chapters. <br />";
42+
$data['tags'] = $doc->querySelector_extract('textContentTrim', 'span.tags span[property="genre"]');
15643

157-
echo "First chapter is called", $doc->querySelector_textContent('#chapters tbody tr a', 0), "<br />";
158-
}
44+
var_dump($data);
15945
?>
16046
```
