PHP-ist: All about learning PHP, frameworks, tips and tricks

PHP4, PHP5, cakephp, codeigniter, code examples (scriptlets / snippets) and many easy and effective ways to learn

PHP4: How to Steal from Yahoo! (another Web Screen Scraper)

Posted in php4 by PHP-ist on February 11th, 2008

If you've been wondering how to scrape some information from your favorite website, read on. Web screen scraping techniques have been around since the dinosaur age. Here's an example of how to do it in PHP 4. Thanks to:

  1. Mozilla Firefox and Firebug, for spying and taking a bird's-eye view so we can design a strategy.
  2. The good old snoopy-php project, which works as our HTTP client, snipping and sucking in the whole page.
  3. And also phphtmlparser, which makes the dirty work of HTML parsing a lot easier, taking down the enemy element by element.

First, our target operation is to grab a list of currently hot box office movies from Yahoo! Movies.
Target located: http://movies.yahoo.com/mv/boxoffice/ ..... locked on!!!

We need to see the HTML layout of the page, using Firefox and Firebug.
[Screenshot: getting the HTML layout for screen scraping Yahoo! Movies]

Set your sights on the right side of the screenshot.
The area we want to steal starts with 'Top Movies' and ends with 'Top Cast/Crew'.
Here's the part of the code that marks those boundaries. You'll see it again later in the full listing.

PHP:
    ...
    incIf($step1, 0, $parser->iNodeValue == 'Top Movies');
    ...
    incIf($step1, 1, $parser->iNodeValue == 'Top Cast/Crew');
    ...
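
To see how these two calls fence off a region, here's a minimal sketch that runs the same incIf() helper over a plain array of strings instead of a live page. The $values array is made-up sample data standing in for the text nodes the parser would report, and the helper is copied from the full listing.

```php
<?php
// Same helper as in the full listing: advance the counter by one
// only when it sits at $current_step and the condition holds.
function incIf(&$step_counter, $current_step, $condition) {
    if (($step_counter == $current_step) && ($condition))
        $step_counter = $current_step + 1;
}

// Hypothetical stand-in for the node values seen while parsing.
$values = array('Home', 'Top Movies', 'Movie A', 'Movie B', 'Top Cast/Crew', 'Actor X');

$step1  = 0;         // 0 = before the region, 1 = inside it, 2 = past it
$inside = array();
foreach ($values as $value) {
    incIf($step1, 0, $value == 'Top Movies');
    incIf($step1, 1, $value == 'Top Cast/Crew');
    if ($step1 == 1)
        $inside[] = $value;   // only collected between the two markers
}

print_r($inside);
```

Note that the 'Top Movies' heading itself lands inside the region here. In the full script that doesn't matter, because the tag check on $step2 filters it out.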

Now look at the bottom part. It shows the full element tree, starting from the root html tag and ending at the b tag.
[Screenshot: element path shown in Firefox and Firebug for screen scraping Yahoo! Movies]
We decide that b < font < a < td should be enough to distinguish this element from the others.
So we use a variable $step2 for digging. If $step2 == 0 and the current element is td, we set $step2 to 1. If $step2 == 1 and the current element is a, we increment it. $step2 keeps digging through font until we reach the treasure box in the b tag. Finally, we print out what's inside the treasure box and go back up to the surface by setting $step2 to 0.
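
The digging can be tried offline too. Here's a rough sketch, assuming a hand-written list of nodes in place of the real parser output. The $nodes array and its 'element' / 'text' labels are invented for this illustration; incIf() and doIf() are the same helpers as in the full code, and the tag order follows the td, a, font, b checks used there.

```php
<?php
// Advance the counter to the next step when the condition matches.
function incIf(&$step_counter, $current_step, $condition) {
    if (($step_counter == $current_step) && ($condition))
        $step_counter = $current_step + 1;
}

// At the final step: reset the counter and report success.
function doIf(&$step_counter, $current_step, $condition) {
    if (($step_counter == $current_step) && ($condition)) {
        $step_counter = 0;
        return true;
    }
    return false;
}

// Hypothetical node stream: array(kind, value).
$nodes = array(
    array('element', 'TD'), array('element', 'A'),
    array('element', 'FONT'), array('element', 'B'),
    array('text', 'Movie A'),
    array('element', 'TD'), array('text', '$12.3M'),  // no a/font/b here
    array('element', 'TD'), array('element', 'A'),
    array('element', 'FONT'), array('element', 'B'),
    array('text', 'Movie B'),
);

$step2 = 0;
$found = array();
foreach ($nodes as $node) {
    list($kind, $value) = $node;
    $name = ($kind == 'element') ? $value : '';
    incIf($step2, 0, $name == 'TD');
    incIf($step2, 1, $name == 'A');
    incIf($step2, 2, $name == 'FONT');
    incIf($step2, 3, $name == 'B');
    if (doIf($step2, 4, $kind == 'text'))
        $found[] = $value;    // text reached through td, a, font, b
}

print_r($found);
```

The '$12.3M' text is skipped because $step2 is only at 1 when it shows up. Keep in mind the counter is forgiving: nodes that don't match simply leave it where it is, so this is a loose match and not a strict parent-child check.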

Now it's time for the full code, guys.
If you're too lazy to copy and paste it, just download the source code of this web scraping tutorial (PHP 4).

PHP:
    <?php
        // Let's hire both experts: Snoopy (HTTP client) and phphtmlparser
        include ('Snoopy.class.php');
        include ('htmlparser.inc');

        // Move and dig deeper: advance the counter to the next step,
        // but only if it is at $current_step and $condition holds
        function incIf(&$step_counter, $current_step, $condition) {
            if (($step_counter == $current_step) && ($condition))
                $step_counter = $current_step + 1;
        }

        // If it's deep enough, take it and leave:
        // reset the counter to 0 and tell the caller to grab the value
        function doIf(&$step_counter, $current_step, $condition) {
            if (($step_counter == $current_step) && ($condition)) {
                $step_counter = 0;
                return true;
            }
            return false;
        }

        // C'mon snoopy suck that page
        $snooper = new Snoopy();
        if ($snooper->fetch('http://movies.yahoo.com/mv/boxoffice/')) {
            // Pass the page to HtmlParser, and let him do his work
            $parser = new HtmlParser ($snooper->results);
            $step1 = 0; $step2 = 0;
            echo "TODAY's BOX OFFICE\r\n<br/>";
            while ($parser->parse()) {
                // $step1: 0 = before 'Top Movies', 1 = inside the area, 2 = past it
                incIf($step1, 0, $parser->iNodeValue == 'Top Movies');
                incIf($step1, 1, $parser->iNodeValue == 'Top Cast/Crew');
                if ($step1 == 1) {
                    // $step2 digs through td, a, font, b
                    incIf($step2, 0, $parser->iNodeName == 'TD');
                    incIf($step2, 1, $parser->iNodeName == 'A');
                    incIf($step2, 2, $parser->iNodeName == 'FONT');
                    incIf($step2, 3, $parser->iNodeName == 'B');
                    // The first text node after b is the movie title
                    if (doIf($step2, 4, $parser->iNodeType == NODE_TYPE_TEXT))
                        echo $parser->iNodeValue."\r\n<br/>";
                }
            }
        } else {
            // The fetch failed, so say why instead of printing nothing
            echo "Could not fetch the page: ".$snooper->error."\r\n<br/>";
        }
    ?>
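
For comparison only: on PHP 5 or later (not PHP 4), the same td, a, font, b path can be written as a single XPath query using the bundled DOM extension. This is a different technique from the step counters above, shown as a rough sketch. The $html string is made-up sample markup that only mimics the nesting seen in Firebug. It is not the real Yahoo! page.

```php
<?php
// Made-up sample markup with the same td > a > font > b nesting.
$html = '<table>'
      . '<tr><td><a href="#"><font><b>Movie A</b></font></a></td><td>$12.3M</td></tr>'
      . '<tr><td><a href="#"><font><b>Movie B</b></font></a></td><td>$9.8M</td></tr>'
      . '</table>';

$doc = new DOMDocument();
@$doc->loadHTML($html);       // @ hides warnings about sloppy markup

$xpath  = new DOMXPath($doc);
$titles = array();
// Strict parent-child match, unlike the forgiving step counters.
foreach ($xpath->query('//td/a/font/b') as $b) {
    $titles[] = trim($b->textContent);
}

print_r($titles);
```

Unlike $step2, the XPath version only matches when each tag is a direct child of the previous one, so it would need adjusting if the real page put extra wrappers in between.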

Tags: php4, screen scraping, code example
