PHP5: Screen scraping with DOM and XPath
This tutorial continues the previous tutorial on screen scraping Yahoo with PHP4.
This time we will try a different method, using DOM and XPath, which are only supported in PHP5.
First, some knowledge of XPath is required. More about XPath can be read at:
http://www.zvon.org/xxl/XPathTutorial/General/examples.html
There is also a small concern that XPath is a bit slower than pure DOM traversal; see "Speed: DOM traversal vs. XPath in PHP 5".
But I personally think that XPath is neater and easier.
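To make the comparison concrete, here is a minimal sketch that collects the same nodes twice, once by walking the DOM and once with a single XPath query. The sample markup and variable names are made up for illustration, and the sketch says nothing about which approach is faster.

```php
<?php
// Made-up sample markup, not taken from the scraped page.
$html = '<html><body><ul><li><b>one</b></li><li><b>two</b></li></ul></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Approach 1: pure DOM traversal. Visit every <li> and check its children by hand.
$viaDom = array();
foreach ($dom->getElementsByTagName('li') as $li) {
    foreach ($li->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE && $child->nodeName === 'b') {
            $viaDom[] = $child->nodeValue;
        }
    }
}

// Approach 2: XPath. One query string describes the same path.
$x = new DOMXPath($dom);
$viaXPath = array();
foreach ($x->query('/html/body/ul/li/b') as $b) {
    $viaXPath[] = $b->nodeValue;
}

echo implode(', ', $viaDom), "\n";
echo implode(', ', $viaXPath), "\n";
```

Both loops should collect the same two values. The XPath version is shorter because the whole path lives in one string.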
Let's start. First, we inspect the document structure using the Firebug extension for Firefox.
Start with a very easy case: grabbing the title "Top Movies":

Copy the XPath using Firebug to get this query:
/html/body/center/table[8]/tbody/tr/td[5]/table[4]/tbody/tr/td/font/b
- Firefox automatically fixes broken HTML structure and also adds a tbody tag. The markup that PHP loads does not contain it, so we need to remove this tag from the query.
- Only grab the first row of the table. Change .../tr/td/font/b into .../tr[1]/td/font/b
Now we have our first XPath query:
/html/body/center/table[8]/tr/td[5]/table[4]/tr[1]/td/font/b
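Removing tbody by hand is easy to get wrong, so here is a small sketch of a helper that does it. The function name stripTbody is hypothetical and not part of the original code. It also assumes the page source really has no tbody tags of its own; if it does, those steps must stay in the query.

```php
<?php
// Hypothetical helper: removes the tbody steps that Firefox adds
// to an XPath copied from Firebug.
function stripTbody($xpath)
{
    // Match "/tbody", optionally followed by a position such as "[1]",
    // but only when it is a complete path step.
    return preg_replace('#/tbody(\[\d+\])?(?=/|$)#', '', $xpath);
}

$firebug = '/html/body/center/table[8]/tbody/tr/td[5]/table[4]/tbody/tr/td/font/b';
echo stripTbody($firebug), "\n";
```

The row position (tr[1]) still has to be added by hand, because the helper only removes tbody steps.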
The next, harder case is to grab the contents.

The XPath query from Firebug is:
/html/body/center/table[8]/tbody/tr/td[5]/table[4]/tbody/tr[2]/td[2]/a/font/b
- Same problem here. Firefox automatically fixes broken HTML structure and also adds a tbody tag. Remove the tbody tag from the XPath query.
- Grab all rows of the table. Change .../tr[2]/td[2]/a/font/b into .../tr/td[2]/a/font/b
The final XPath query for the contents is:
/html/body/center/table[8]/tr/td[5]/table[4]/tr/td[2]/a/font/b
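The difference between tr[1] and a plain tr is easier to see on a tiny document. The sketch below uses a made-up table in place of the real page, so the paths are shorter than the Yahoo ones. The movie names are placeholders.

```php
<?php
// Made-up sample markup: one title row followed by two content rows.
$html = '<html><body><table>'
      . '<tr><td><font><b>Top Movies</b></font></td></tr>'
      . '<tr><td>1</td><td><a href="#"><font><b>Movie A</b></font></a></td></tr>'
      . '<tr><td>2</td><td><a href="#"><font><b>Movie B</b></font></a></td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$x = new DOMXPath($dom);

// tr[1] restricts the match to the first row only.
$titles = array();
foreach ($x->query('/html/body/table/tr[1]/td/font/b') as $node) {
    $titles[] = $node->nodeValue;
}

// A plain tr matches every row; td[2] skips rows without a second cell.
$movies = array();
foreach ($x->query('/html/body/table/tr/td[2]/a/font/b') as $node) {
    $movies[] = $node->nodeValue;
}

echo implode(', ', $titles), "\n";
echo implode(', ', $movies), "\n";
```

Note that the paths contain no tbody step: the parser behind loadHTML does not add one, which is why it had to be removed from the Firebug queries.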
The final step is to put both XPath queries into a few lines of code, and we're done:
<?php
include('Snoopy.class.php');

$snooper = new Snoopy();
if ($snooper->fetch('http://movies.yahoo.com/mv/boxoffice/')) {
    // Parse the fetched HTML into a DOM tree
    $dom = new DomDocument();
    $dom->loadHTML($snooper->results);

    $x = new DomXPath($dom);

    // Title. Firebug: /html/body/center/table[8]/tbody/tr/td[5]/table[4]/tbody/tr/td/font/b
    $titles = $x->query('/html/body/center/table[8]/tr/td[5]/table[4]/tr[1]/td/font/b');
    foreach ($titles as $node) {
        echo $node->nodeValue, "\n";
    }

    // Contents. Firebug: /html/body/center/table[8]/tbody/tr/td[5]/table[4]/tbody/tr[2]/td[2]/a/font/b
    $nodes = $x->query('/html/body/center/table[8]/tr/td[5]/table[4]/tr/td[2]/a/font/b');
    foreach ($nodes as $node) {
        echo $node->nodeValue, "\n";
    }
}
?>
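Because the page markup is broken, loadHTML() is likely to emit parser warnings. One way to deal with that, sketched here on a deliberately broken made-up snippet, is to let libxml collect the errors instead of printing them. The sample markup and variable names are assumptions for illustration.

```php
<?php
// Made-up broken markup: <font> and <b> are never closed.
$html = '<html><body><table><tr><td><font><b>Top Movies</td></tr></table></body></html>';

// Collect parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Keep the collected errors for inspection, then reset libxml's state.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The parser repairs the markup, so the query still finds the node.
$x = new DOMXPath($dom);
$nodes = $x->query('//b');
$text = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;

echo $text, "\n";
echo count($errors), " parser message(s) collected\n";
```

Checking $nodes->length before reading a node also helps when the page layout changes and a query no longer matches anything.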
Tags: php5, screen scraping, code example
on May 17th, 2008 at 12:07 am
Thanks a lot for this post; you really helped me out a lot!
on June 23rd, 2008 at 12:33 am
Nice tutorial. This is exactly what I was looking for. Thanks a lot.
on August 7th, 2008 at 8:35 am
Hey, this is exactly what i was trying to do as well. I missed the bit where Firefox fixes broken HTML. I hope my script will work now. Thanks for this.