Parse HTML With PHP And DOM

DO NOT USE REGEX TO PARSE HTML

Parsing HTML with PHP is not the difficult task that many think it to be. People often grab the nearest tool to pull out the bits between other bits, and PHP regular expressions look like a natural fit. However, regular expressions know nothing about document structure, can be slow on this kind of input, and often need to scan the same information many times to reach the end result. That repeated scanning is acceptable when matching small pieces of data, but when applied to large and/or complex documents it works against us.

The PHP DOM extension provides the tools required to parse XML data. HTML is not a subset of XML, but the extension is built on libxml2, which also includes a forgiving HTML parser, so HTML can be loaded into the same DOM classes with loadHTML(). The example data here is a weather table as supplied by the Australian Bureau of Meteorology. They have no RSS or other feeds, and this is the only way to get the daily records from the source.
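As a small illustration of why the HTML loader matters, the sketch below feeds the same fragment to both parsers. The fragment and the variable names are made up for this example, and it assumes the DOM and libxml extensions are enabled, as they are in most PHP builds.

```php
<?php

// Hypothetical fragment: acceptable HTML, but not well-formed XML,
// because the <li> elements are never closed.
$fragment = '<ul><li>Sydney<li>Melbourne</ul>';

// Collect parser messages internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// The strict XML parser is expected to reject the fragment.
$xml = new DOMDocument;
$xmlLoaded = $xml->loadXML($fragment);

// The HTML parser repairs the markup and builds a usable tree.
$doc = new DOMDocument;
$htmlLoaded = $doc->loadHTML($fragment);
$items = $doc->getElementsByTagName('li');

// Throw away the collected messages; they are not needed here.
libxml_clear_errors();

var_dump($xmlLoaded);   // did the XML parser accept the fragment?
var_dump($htmlLoaded);  // did the HTML parser accept the fragment?

// Print the text of each list item found by the HTML parser.
foreach ($items as $item) {
    echo trim($item->nodeValue), "\n";
}
```

Real-world pages are rarely well-formed, so the tolerant loader is usually the one to reach for when the input comes from someone else's site.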

If regex were used to parse this information out every time a request was made, the server would quickly fall over. By using DOM, the HTML can be parsed and the information locked within liberated for use in scripts, or inserted into a database for future use and/or reference.

In the example code the HTML is given as a string, but in production it was fetched with file_get_contents().
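For the production case, a fetch step might look something like the sketch below. The function name fetchHtml(), the timeout and the user agent string are assumptions made for illustration and are not part of the original script.

```php
<?php

/**
 * Fetch a page (or read a local file) as a string.
 * Returns null when the source cannot be read.
 * The name, timeout and user agent are illustrative choices.
 */
function fetchHtml(string $source, int $timeoutSeconds = 10): ?string
{
    // These options apply to http(s) sources only; other wrappers,
    // such as local files, simply ignore them.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'example-weather-reader/0.1',
        ],
    ]);

    // Suppress the warning and inspect the return value instead.
    $html = @file_get_contents($source, false, $context);

    return $html === false ? null : $html;
}
```

A caller would pass the address of the forecast page, check for null, and only then hand the string to the DOM code. Fetching over HTTP this way assumes allow_url_fopen is enabled in php.ini.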


<?php

$html = '
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" dir="ltr">
<head>
<title>PHPRO.ORG</title>
</head>
<body>
<h2>Forecast for Saturday</h2>
<!-- Issued at 0828 UTC Friday 23 May 2008 -->
<table border="0" summary="Capital Cities Precis Forecast">
   <tbody>
      <tr>
         <td><a href="/products/IDN10064.shtml" title="Link to Sydney forecast">Sydney</a></td>
         <td title="Maximum temperature in degrees Celsius" class="max alignright">19&deg;</td>
         <td>Fine. Mostly sunny.</td>
      </tr>

      <tr>
         <td><a href="/products/IDV10450.shtml" title="Link to Melbourne forecast">Melbourne</a></td>
         <td title="Maximum temperature in degrees Celsius" class="max alignright">16&deg;</td>
         <td>Fog then fine.</td>
      </tr>

      <tr>
         <td><a href="/products/IDQ10095.shtml" title="Link to Brisbane forecast">Brisbane</a></td>
         <td title="Maximum temperature in degrees Celsius" class="max alignright">24&deg;</td>
         <td>Mostly fine</td>
      </tr>

      <tr>
         <td><a href="/products/IDW12300.shtml" title="Link to Perth forecast">Perth</a></td>
         <td title="Maximum temperature in degrees Celsius" class="max alignright">21&deg;</td>
         <td>Few showers, increasing later.</td>
      </tr>

      <tr>
         <td><a href="/products/IDS10034.shtml" title="Link to Adelaide forecast">Adelaide</a></td>
         <td title="Maximum temperature in degrees Celsius" class="max alignright">20&deg;</td>
         <td>Fine. Mostly sunny.</td>
      </tr>

      <tr>
         <td><a href="/products/IDT65061.shtml" title="Link to Hobart forecast">Hobart</a></td>
         <td title="Maximum temperature in degrees Celsius" class="max alignright">13&deg;</td>
         <td>Mainly fine.</td>
      </tr>

      <tr>
         <td><a href="/products/IDN10035.shtml" title="Link to Canberra forecast">Canberra</a></td>
         <td title="Maximum temperature in degrees Celsius" class="max alignright">15&deg;</td>
         <td>Fine, mostly sunny.</td>
      </tr>

      <tr>
         <td><a href="/products/IDD10150.shtml" title="Link to Darwin forecast">Darwin</a></td>
         <td title="Maximum temperature in degrees Celsius" class="max alignright">32&deg;</td>
         <td>Fine and sunny.</td>
      </tr>

   </tbody>
</table>

</body>
</html>
';

/*** a new dom object ***/
$dom = new DOMDocument;

/*** discard white space (set before the document is loaded) ***/
$dom->preserveWhiteSpace = false;

/*** load the html into the object ***/
$dom->loadHTML($html);

/*** the table by its tag name ***/
$tables = $dom->getElementsByTagName('table');

/*** get all rows from the table ***/
$rows = $tables->item(0)->getElementsByTagName('tr');

/*** loop over the table rows ***/
foreach ($rows as $row)
{
    /*** get each column by tag name ***/
    $cols = $row->getElementsByTagName('td');

    /*** echo the values ***/
    echo $cols->item(0)->nodeValue.'<br />';
    echo $cols->item(1)->nodeValue.'<br />';
    echo $cols->item(2)->nodeValue;
    echo '<hr />';
}
?>

Demonstration

Sydney
19°
Fine. Mostly sunny.
Melbourne
16°
Fog then fine.
Brisbane
24°
Mostly fine
Perth
21°
Few showers, increasing later.
Adelaide
20°
Fine. Mostly sunny.
Hobart
13°
Mainly fine.
Canberra
15°
Fine, mostly sunny.
Darwin
32°
Fine and sunny.
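
If the values are headed for a database rather than the screen, it can help to collect them into an array first. The following is a minimal sketch of that idea using DOMXPath. The function name parseForecast() and the array keys are hypothetical, and the query assumes the table keeps the summary attribute shown in the example markup.

```php
<?php

/**
 * Turn the forecast table into an array of rows.
 * The function name and array keys are illustrative only.
 */
function parseForecast(string $html): array
{
    // loadHTML() does not accept an empty string, so bail out early.
    if ($html === '') {
        return [];
    }

    // Keep parser messages out of the output, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return [];
    }

    $xpath = new DOMXPath($dom);
    $forecast = [];

    // Every row of the table identified by its summary attribute.
    $rows = $xpath->query('//table[@summary="Capital Cities Precis Forecast"]//tr');

    foreach ($rows as $row) {
        $cells = $xpath->query('./td', $row);
        if ($cells->length < 3) {
            continue; // skip rows without the expected three cells
        }

        // The city cell wraps its text in a link to the full forecast.
        $link = $xpath->query('./a', $cells->item(0))->item(0);

        $forecast[] = [
            'city'     => trim($cells->item(0)->textContent),
            'link'     => $link instanceof DOMElement ? $link->getAttribute('href') : null,
            // The cast keeps the leading digits and drops the degree sign.
            'max'      => (int) $cells->item(1)->textContent,
            'forecast' => trim($cells->item(2)->textContent),
        ];
    }

    return $forecast;
}

// A one-row sample in the same shape as the example table.
$sample = '<table summary="Capital Cities Precis Forecast"><tr>'
    . '<td><a href="/products/IDN10064.shtml">Sydney</a></td>'
    . '<td class="max alignright">19&deg;</td>'
    . '<td>Fine. Mostly sunny.</td></tr></table>';

print_r(parseForecast($sample));
```

Each element of the returned array maps naturally onto one database row, and the length check means a change in the page layout produces an empty result rather than a fatal error.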