
Get Links With DOM

Do not use REGEX to parse HTML

Perhaps the biggest mistake people make when trying to get URLs or link text from a web page is trying to do it with regular expressions. The job can be done with regular expressions, but there is a high overhead in having the preg functions loop over the entire document many times. The correct, faster, and infinitely cooler way is to use DOM.

By using DOM in the getLinks function, it is simple to create an array containing all the links on a web page as keys and the link names as values. This array can then be looped over like any other array to build a list, or manipulated in any way desired.

Note that error suppression is used when loading the HTML. This suppresses warnings about invalid HTML entities that are not defined in the DOCTYPE. Of course, in a production environment, error display would be disabled and error reporting set to none.


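<?php

    function getLinks($link)
    {
        /*** return array ***/
        $ret = array();

        /*** fetch the page, return no links if it cannot be read ***/
        $html = @file_get_contents($link);
        if ($html === false || $html === '')
        {
            return $ret;
        }

        /*** a new dom object ***/
        $dom = new DOMDocument;

        /*** remove silly white space (set before loading, as it only applies at parse time) ***/
        $dom->preserveWhiteSpace = false;

        /*** load the HTML (suppress errors) ***/
        @$dom->loadHTML($html);

        /*** get the links from the HTML ***/
        $links = $dom->getElementsByTagName('a');

        /*** loop over the links ***/
        foreach ($links as $tag)
        {
            /*** an anchor with no child nodes gets an empty name ***/
            $first = $tag->childNodes->item(0);
            $ret[$tag->getAttribute('href')] = $first ? $first->nodeValue : '';
        }

        return $ret;
    }
?>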
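
As an alternative to the @ operator used in getLinks, libxml's own error handling can be switched on so that parse problems are collected rather than thrown away. The following is only a minimal sketch, assuming the HTML is already held in a string. The helper name loadHtmlQuietly and the sample markup are hypothetical and are not part of the original function.

```php
<?php
/**
 * Hypothetical helper (not part of the original tutorial): load HTML
 * while collecting libxml parse errors instead of silencing them with @.
 * Returns a two-element array: the DOMDocument and the list of errors.
 */
function loadHtmlQuietly(string $html): array
{
    // Collect parse errors internally rather than emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Fetch whatever libxml recorded, then empty its error buffer
    // and restore the previous error-handling mode.
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$dom, $errors];
}

// Sample markup invented for this example.
$html = '<html><body><p>Read the <a href="/docs.php">manual</a></body></html>';
[$dom, $errors] = loadHtmlQuietly($html);

echo $dom->getElementsByTagName('a')->length, " link(s) found\n";
echo count($errors), " parse problem(s) recorded\n";
```

Each entry in the returned error list is a LibXMLError object, so its message and line properties can be logged if the markup problems are of interest.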

A similar approach would be to use XPath, which achieves the same results. Either way, using the DOM is going to prove far more efficient than REGEX.
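
For comparison, here is a rough sketch of what an XPath version might look like. It is an illustration only: the function name getLinksXPath and the sample markup are hypothetical, it accepts an HTML string instead of a URL so that it runs without network access, and it uses the full text content of each anchor rather than the first child node as getLinks does.

```php
<?php
/**
 * Hypothetical XPath variant of getLinks(); takes an HTML string
 * rather than a URL so the sketch runs without network access.
 */
function getLinksXPath(string $html): array
{
    $ret = array();

    $dom = new DOMDocument();
    // Suppress warnings about imperfect markup, as the tutorial does.
    @$dom->loadHTML($html);

    $xpath = new DOMXPath($dom);

    // Select only anchors that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $tag) {
        // Assumption: use the anchor's whole text, trimmed of white space.
        $ret[$tag->getAttribute('href')] = trim($tag->textContent);
    }

    return $ret;
}

// Sample markup invented for this example.
$html = '<p><a href="/docs.php">manual</a> <a name="top">anchor</a> '
      . '<a href="/FAQ.php">faq</a></p>';

foreach (getLinksXPath($html) as $href => $text) {
    echo $href, ' - ', $text, "\n";
}
```

Because the query is //a[@href], the named anchor without an href is skipped, so only /docs.php and /FAQ.php end up in the array.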

Example Usage


<?php
    /*** a link to search ***/
    $link = "http://php.net";

    /*** get the links ***/
    $urls = getLinks($link);

    /*** check for results ***/
    if(sizeof($urls) > 0)
    {
        foreach($urls as $key=>$value)
        {
            echo $key . ' - ' . $value . '<br />';
        }
    }
    else
    {
        echo "No links found at $link";
    }
?>

Demonstration

/ -
/downloads.php - downloads
/docs.php - manual
/FAQ.php - faq
/support.php - getting help
/mailing-lists.php - mailing lists
/license - licenses
http://wiki.php.net/ - wiki
http://bugs.php.net/ - reporting bugs
/sites.php - php.net sites
/links.php - links section
/conferences/ - conferences
/my.php - my php.net
/tut.php - introductory tutorial
/usage.php - Netcraft Survey
/thanks.php - Thanks To
http://www.easydns.com/?V=698570efeb62a6e2 - easyDNS
http://www.directi.com/ - Directi
http://promote.pair.com/direct.pl?php.net - pair Networks
http://www.servercentral.net/ - Server Central
http://www.hostedsolutions.com/ - Hosted Solutions
http://www.spry.com/ - Spry VPS Hosting
http://ez.no/ - eZ Systems
http://www.hit.no/english - HiT
http://www.osuosl.org - OSU Open Source Lab
http://www.yahoo.com/ - Yahoo! Inc.
http://www.binarysec.com/ - BinarySEC
http://www.nexcess.net/ - NEXCESS.NET
http://www.rackspace.com/ - Rackspace
http://www.eukhost.com/ - EUKhost
http://www.apache.org/ - Apache
http://www.mysql.com/ - MySQL
http://www.postgresql.org/ - PostgreSQL
http://www.zend.com/ - Zend Technologies
http://www.linuxfund.org/ - LinuxFund.org
http://www.ostg.com/ - OSTG
/feed.atom - Atom
/downloads.php#v5 - Current PHP 5.2 Stable:
http://qa.php.net/rc.php - Release Candidates
http://qa.php.net/ - 5.3.2RC1 (22 Dec 2009)
/submit-event.php - [add]
/cal.php?id=3947 - PHP Overview
/cal.php?id=3790 - PHP UK Conference 2010
/cal.php?id=1385 - Hamburg
/cal.php?id=1523 - Dallas PHP/MySQL Users Group
/cal.php?id=1670 - Dallas PHP Users Group (DPUG)
/cal.php?id=1652 - Austin PHP Meetup
/cal.php?id=1665 - OKC PHP Meetup
/cal.php?id=3643 - Oklahoma City PHP User Group
/cal.php?id=1395 - Wash DC PHP Developers Group
/cal.php?id=3684 - PHP User Group Stuttgart
/cal.php?id=3918 - Denver - FRPUG
/cal.php?id=1848 - Meeting usergroup Dortmund
/cal.php?id=1946 - PHP Usergroup Frankfurt/Main
/cal.php?id=3483 - Edinburgh PHP Users Group
/cal.php?id=1732 - PHP User Group Nanaimo, BC/CA
/cal.php?id=2580 - PEA meeting from phpchina
/cal.php?id=3722 - Nagpur PHP Meetup
/cal.php?id=3760 - Los Angeles PHP Developers Group
/cal.php?id=1738 - Madison PHP User's Group
/cal.php?id=2246 - PHP Brisbane Meetup Group
/cal.php?id=3708 - Nashville Enterprise LAMP UG
/cal.php?id=3761 - Chattanooga PHP Developers
/cal.php?id=1545 - Miami PHP User Group
/cal.php?id=1546 - Broward Php Usergroup
/cal.php?id=1847 - Nashville PHP Users Group
/cal.php?id=2208 - Chicago PHP User Group Brunch
/cal.php?id=3925 - Baltimore PHP User Group
/cal.php?id=1704 - TriPUG
/cal.php?id=1719 - OINK-PUG (Cincinnati, Ohio)
/cal.php?id=1820 - Utah PHP Users Group Meeting
/cal.php?id=3844 - NorfolkPHP
/cal.php?id=1131 - Kansas City
/cal.php?id=1346 - Miami Linux Users Group
/cal.php?id=1671 - Twin Cities PHP
/cal.php?id=2449 - Los Angeles LAMPsig
/cal.php?id=1099 - Long Island PHP Users Group
/cal.php?id=409 - New York
/cal.php?id=384 - AzPHP
/cal.php?id=2527 - Malaysia PHP Meetup
/cal.php?id=2600 - PHP Usergroup Karlsruhe
/cal.php?id=2660 - PHPUG Wuerzburg
/cal.php?id=3075 - DCPHP Beverage Subgroup
/cal.php?id=3653 - Brisbane PHP User Group
/cal.php?id=2500 - Irish PHP Users Group meeting
/cal.php?id=3917 - Colorado Springs - FRPUG
/cal.php?id=1316 - Arabic PHP Group Meeting
/cal.php?id=1708 - Malaysia PHP User Group Meet Up
/cal.php?id=2499 - Sandy PHP Group
/cal.php?id=2629 - Sacramento PHP Group
/cal.php?id=2662 - Miami Linux Meetup
/cal.php?id=3422 - PHP RIO Meetup
/cal.php?id=3827 - ZEND: PHP I on-line class
/cal.php?id=3828 - ZEND: PHP II on-line class
/cal.php?id=3829 - ZEND:Test Prep PHP5 Cert on-line
/cal.php?id=3869 - Zend Framework: Class FRENCH
/cal.php?id=3870 - Test Prep:PHP 5 Cert (French)
/cal.php?id=3872 - PHPI: Foundations (French)
/cal.php?id=3911 - PHP Training Philippines
/cal.php?id=3913 - ZEND:(GERMAN) Security in PHP
/cal.php?id=3385 - UK Object Orientation Workshop
/cal.php?id=3952 - Zend PHP II: Higher Structures
/cal.php?id=3386 - UK Smarty Templating Workshop
/cal.php?id=3812 - Object Oriented programming &PHP
/cal.php?id=1200 - PHP & MySQL Training / Gießen
/cal.php?id=2589 - PHP Intro Course South Africa
/cal.php?id=1389 - Cursos de PHP en Bilbao
/cal.php?id=3830 - ZEND:QuickStart-Exp Programmers
/cal.php?id=3838 - ZEND:Zend Framework online class
/cal.php?id=2408 - Chennai PHP Training
/cal.php?id=2421 - Basic PHP Course
/cal.php?id=3878 - Curso - PHP Zend Certified
/cal.php?id=3914 - ZEND:(GERMAN) Zend Studio
/cal.php?id=3915 - ZEND:GERMAN-Framework:Grundlagen
/cal.php?id=231 - UK PHP Training
/cal.php?id=3871 - Building Security w/PHP App (FR)
/cal.php?id=1137 - PHP Brasil - Training
http://www.php.net/conferences/index.php#id2010-01-16-1 - ConFoo Web Techno Conference
http://www.php.net/conferences/index.php#id2009-12-09-1 - PHP UK Conference 2010
http://www.php.net/archive/2009.php#id2009-12-17-1 - PHP 5.2.12 Released!
/releases/5_2_12.php - release announcement
/ChangeLog-5.php#5.2.12 - ChangeLog
http://www.php.net/archive/2009.php#id2009-11-19-1 - PHP 5.3.1 Released!
http://www.php.net/releases/5_3_1.php - release announcement
http://www.php.net/ChangeLog-5.php#5.3.1 - ChangeLog
http://www.php.net/archive/2009.php#id2009-09-17-1 - PHP 5.2.11 Released!
/releases/5_2_11.php - release announcement
/ChangeLog-5.php#5.2.11 - ChangeLog
http://www.php.net/archive/2009.php#id2009-07-30-1 - PHP TestFest 2009 Winners
http://www.flickr.com/search/?w=all&q=elephpants&m=tags - elePHPhants
http://www.flickr.com/search/?w=all&q=testfest+mug&m=tags - TestFest mugs
http://testfest.php.net/repos/testfest/ - 887 tests
http://wiki.php.net/qa/testfest - 2009 PHP TestFest
http://www.php.net/archive/2009.php#id2009-07-16-1 - Subversion Migration Complete
http://svn.php.net - svn.php.net
http://php.net/svn.php - php.net/svn.php
http://wiki.php.net/vcs/svnfaq - wiki.php.net/vcs/svnfaq
http://github.com/php - github mirror
http://wiki.php.net/vcs/svnfaq#git - wiki.php.net/vcs/svnfaq#git
/archive/index.php - News Archive
/source.php?url=/index.php - show source
/credits.php - credits
/stats/ - stats
/sitemap.php - sitemap
/contact.php - contact
/contact.php#ads - advertising
/mirrors.php - mirror sites
/copyright.php - Copyright © 2001-2009 The PHP Group
/mirror.php - This mirror
http://developer.yahoo.com/ - Yahoo! Inc.