
Get Links With DOM

Do not use REGEX to parse HTML

Perhaps the biggest mistake people make when trying to get URLs or link text from a web page is trying to do it with regular expressions. The job can be done with regular expressions; however, there is a high overhead in having preg loop over the entire document many times. The correct, faster, and infinitely cooler way is to use DOM.

By using DOM in the getLinks function, it is simple to create an array containing all the links on a web page as keys and the link names as values. This array can then be looped over like any other array to create a list, or manipulated in any way desired.

Note that error suppression is used when loading the HTML. This is to suppress warnings about invalid HTML entities that are not defined in the DOCTYPE. Of course, in a production environment, error display would be disabled and error reporting set to none.


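<?php
function getLinks($link)
{
    /*** return array ***/
    $ret = array();

    /*** get the HTML; stop early if the page could not be fetched ***/
    $html = @file_get_contents($link);
    if ($html === false || $html === '')
    {
        return $ret;
    }

    /*** a new dom object ***/
    $dom = new DOMDocument;

    /*** remove silly white space (a parse-time option, so set it before loading) ***/
    $dom->preserveWhiteSpace = false;

    /*** load the HTML (suppress errors) ***/
    @$dom->loadHTML($html);

    /*** get the links from the HTML ***/
    $links = $dom->getElementsByTagName('a');

    /*** loop over the links ***/
    foreach ($links as $tag)
    {
        /*** the first child may be missing, e.g. an empty <a href="..."></a> ***/
        $first = $tag->childNodes->item(0);
        $ret[$tag->getAttribute('href')] = $first ? $first->nodeValue : '';
    }

    return $ret;
}
?>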
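
As an alternative to the @ operator mentioned above, libxml can be told to collect parser messages internally so they can be inspected rather than thrown away. The following is only a sketch of that idea: the function name getLinksFromHtml and its $errors parameter are made up for illustration, and it takes an HTML string instead of a URL so that it can be tried without network access.

```php
<?php
/*** hypothetical variant of getLinks(): works on an HTML string and
     collects libxml messages instead of silencing them with @ ***/
function getLinksFromHtml($html, &$errors = array())
{
    $ret = array();

    /*** collect libxml errors internally instead of emitting warnings ***/
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    if ($html !== '')
    {
        $dom->loadHTML($html);
    }

    /*** keep the parser messages for inspection, then reset the buffer ***/
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $tag)
    {
        $first = $tag->childNodes->item(0);
        $ret[$tag->getAttribute('href')] = $first ? $first->nodeValue : '';
    }

    return $ret;
}

/*** sample markup, invented for this example ***/
$html = '<html><body><p><a href="/docs.php">manual</a> &bogus; <a href="/FAQ.php">faq</a></p></body></html>';

$links = getLinksFromHtml($html, $errors);
print_r($links);
echo count($errors), " parser message(s) collected\n";
```

How many messages end up in $errors depends on the libxml version PHP was built against, so the count should be treated as informational only.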

A similar approach is to use XPath, which would achieve the same results. Either way, using the DOM is going to prove far more efficient than REGEX.
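
A rough sketch of what the XPath variant might look like is shown here. The function name getLinksXPath is hypothetical, and the behaviour differs slightly from getLinks: the query //a[@href] skips anchors that have no href attribute, and textContent is used so that text nested inside tags such as <b> is still picked up. It takes an HTML string so it can be run without fetching a page.

```php
<?php
/*** hypothetical XPath-based counterpart to getLinks() ***/
function getLinksXPath($html)
{
    $ret = array();

    /*** nothing to parse ***/
    if ($html === '')
    {
        return $ret;
    }

    $dom = new DOMDocument;

    /*** suppress warnings about imperfect HTML, as in getLinks ***/
    @$dom->loadHTML($html);

    $xpath = new DOMXPath($dom);

    /*** select only the anchors that actually carry an href attribute ***/
    foreach ($xpath->query('//a[@href]') as $tag)
    {
        $ret[$tag->getAttribute('href')] = trim($tag->textContent);
    }

    return $ret;
}

/*** sample markup, invented for this example ***/
$html = '<html><body><a name="top">top</a>'
      . '<a href="/downloads.php">downloads</a>'
      . '<a href="http://wiki.php.net/"><b>wiki</b></a></body></html>';

$xpathLinks = getLinksXPath($html);
print_r($xpathLinks);
```

The main attraction of XPath is that the selection rule lives in one expression, so narrowing it (for example, to links inside a particular element) means changing the query rather than adding conditions to the loop.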

Example Usage


<?php
/*** a link to search ***/
$link = "http://php.net";

/*** get the links ***/
$urls = getLinks($link);

/*** check for results ***/
if (sizeof($urls) > 0)
{
    foreach ($urls as $key => $value)
    {
        echo $key . ' - ' . $value . '<br />';
    }
}
else
{
    echo "No links found at $link";
}
?>
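
Since the array maps each URL to its link text, it can also be turned into an HTML list, as mentioned earlier. The helper below is a minimal sketch of that, assuming the output is meant for a UTF-8 page; linksToList is a made-up name, and the sample array simply reuses a few entries of the kind getLinks returns. Both the URL and the text are escaped because they come from someone else's page.

```php
<?php
/*** hypothetical helper: turn a url => text array into an HTML list ***/
function linksToList(array $urls)
{
    $items = '';

    foreach ($urls as $href => $text)
    {
        /*** escape both values so scraped content cannot break the markup ***/
        $items .= '<li><a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">'
                . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . "</a></li>\n";
    }

    return "<ul>\n" . $items . "</ul>\n";
}

/*** a few entries in the shape getLinks() returns ***/
$urls = array(
    '/downloads.php'   => 'downloads',
    '/cal.php?id=1523' => 'Dallas PHP/MySQL Users Group',
    '/cal.php?id=2702' => 'PHP & AJAX -Construindo Websites',
);

$list = linksToList($urls);
echo $list;
```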

Demonstration

/ -
/downloads.php - downloads
/docs.php - manual
/FAQ.php - faq
/support.php - getting help
/mailing-lists.php - mailing lists
/license - licenses
http://wiki.php.net/ - wiki
http://bugs.php.net/ - reporting bugs
/sites.php - php.net sites
/links.php - links section
/conferences/ - conferences
/my.php - my php.net
/tut.php - introductory tutorial
/usage.php - Netcraft Survey
/thanks.php - Thanks To
http://www.easydns.com/?V=698570efeb62a6e2 - easyDNS
http://www.directi.com/ - Directi
http://promote.pair.com/direct.pl?php.net - pair Networks
http://www.servercentral.net/ - Server Central
http://www.hostedsolutions.com/ - Hosted Solutions
http://www.spry.com/ - Spry VPS Hosting
http://ez.no/ - eZ Systems
http://www.hit.no/english - HiT
http://www.osuosl.org - OSU Open Source Lab
http://www.yahoo.com/ - Yahoo! Inc.
http://www.binarysec.com/ - BinarySEC
http://www.nexcess.net/ - NEXCESS.NET
http://www.rackspace.com/ - Rackspace
http://www.eukhost.com/ - EUKhost
http://www.apache.org/ - Apache
http://www.mysql.com/ - MySQL
http://www.postgresql.org/ - PostgreSQL
http://www.zend.com/ - Zend Technologies
http://www.linuxfund.org/ - LinuxFund.org
http://www.ostg.com/ - OSTG
/feed.atom - Atom
/downloads.php#v5 - Current PHP 5.2 Stable:
/submit-event.php - [add]
/cal.php?id=3649 - PHP'n Rio 09
/cal.php?id=3513 - CakeFest Berlin 2009
/cal.php?id=3533 - Official CakePHP conference
/cal.php?id=153 - Köln/Bonn
/cal.php?id=2663 - Iran PHP developer's meetup
/cal.php?id=1923 - PHP meeting online in China
/cal.php?id=2540 - meeting de LAMPistas en La Paz
/cal.php?id=1745 - SW Florida Linux Users Group
/cal.php?id=1860 - PDXPHP monthly meeting
/cal.php?id=3294 - PHPNW: PHP North West user group
/cal.php?id=1395 - Wash DC PHP Developers Group
/cal.php?id=2503 - Stuttgart
/cal.php?id=1848 - Meeting usergroup Dortmund
/cal.php?id=1946 - PHP Usergroup Frankfurt/Main
/cal.php?id=3483 - Edinburgh PHP Users Group
/cal.php?id=3648 - Moscow PHP Templating Workshop
/cal.php?id=1732 - PHP User Group Nanaimo, BC/CA
/cal.php?id=2580 - PEA meeting from phpchina
/cal.php?id=3652 - PHPMS - Workshop PHP Extremo
/cal.php?id=1385 - Hamburg
/cal.php?id=1523 - Dallas PHP/MySQL Users Group
/cal.php?id=1670 - Dallas PHP Users Group (DPUG)
/cal.php?id=1652 - Austin PHP Meetup
/cal.php?id=1665 - OKC PHP Meetup
/cal.php?id=3643 - Oklahoma City PHP User Group
/cal.php?id=3663 - DrupalMad
/cal.php?id=1545 - Miami PHP User Group
/cal.php?id=1546 - Broward Php Usergroup
/cal.php?id=1847 - Nashville PHP Users Group
/cal.php?id=2208 - Chicago PHP User Group Brunch
/cal.php?id=1704 - TriPUG
/cal.php?id=1719 - OINK-PUG (Cincinnati, Ohio)
/cal.php?id=1820 - Utah PHP Users Group Meeting
/cal.php?id=1131 - Kansas City
/cal.php?id=1346 - Miami Linux Users Group
/cal.php?id=1671 - Twin Cities PHP
/cal.php?id=2449 - Los Angeles LAMPsig
/cal.php?id=3664 - Melbourne PHP User Group
/cal.php?id=1738 - Madison PHP User's Group
/cal.php?id=2246 - PHP Brisbane Meetup Group
/cal.php?id=2629 - Sacramento PHP Group
/cal.php?id=2662 - Miami Linux Meetup
/cal.php?id=3422 - PHP RIO Meetup
/cal.php?id=1099 - Long Island PHP Users Group
/cal.php?id=409 - New York
/cal.php?id=384 - AzPHP
/cal.php?id=2527 - Malaysia PHP Meetup
/cal.php?id=2600 - PHP Usergroup Karlsruhe
/cal.php?id=2660 - PHPUG Wuerzburg
/cal.php?id=3075 - DCPHP Beverage Subgroup
/cal.php?id=3653 - Brisbane PHP User Group
/cal.php?id=2500 - Irish PHP Users Group meeting
/cal.php?id=1316 - Arabic PHP Group Meeting
/cal.php?id=1708 - Malaysia PHP User Group Meet Up
/cal.php?id=2499 - Sandy PHP Group
/cal.php?id=3520 - Extreme PHP
/cal.php?id=2702 - PHP & AJAX -Construindo Websites
/cal.php?id=3560 - Core and Advanced PHP Workshop
/cal.php?id=2023 - Ahmedabad PHP Group Training
/cal.php?id=338 - MySQL Spain
/cal.php?id=456 - Curso PHP Madrid
/cal.php?id=641 - PHP E-Learning/Germany
/cal.php?id=998 - Curso on-line ActionScript / PHP
/cal.php?id=1198 - PHP & MySQL Training in Kassel
/cal.php?id=1360 - PHP & MySQL com Dreamweaver MX
/cal.php?id=1981 - Curso on-line de PHP
/cal.php?id=2051 - PHP & MYSQL-Construindo WebSites
/cal.php?id=3053 - PHP Training Heilbronn
/cal.php?id=3377 - PHP Programming
/cal.php?id=841 - Curso on-line de PHP-MySQL
/cal.php?id=1490 - PHP Class at CalTek
/cal.php?id=3385 - UK Object Orientation Workshop
/cal.php?id=3386 - UK Smarty Templating Workshop
/cal.php?id=1466 - PHP para Expertos Curso on-line
/cal.php?id=1583 - Curso PHP y MySQL
/cal.php?id=2408 - Chennai PHP Training
/cal.php?id=3638 - Learn Basic PHP In One Night!
/cal.php?id=1200 - PHP & MySQL Training / Gießen
/cal.php?id=2589 - PHP Intro Course South Africa
/cal.php?id=3591 - Curso - Framework CakePHP
/cal.php?id=1389 - Cursos de PHP en Bilbao
/cal.php?id=1137 - PHP Brasil - Training
/cal.php?id=2421 - Basic PHP Course
/cal.php?id=231 - UK PHP Training
http://www.php.net/conferences/index.php#id2009-06-03-1 - CodeWorks Conference
http://www.php.net/conferences/index.php#id2009-05-29-1 - Forum PHP Paris 2009
http://www.php.net/archive/2009.php#id2009-06-30-1 - PHP 5.3.0 Released!
http://php.net/downloads.php#v5.3.0 - PHP 5.3.0
http://php.net/namespaces - namespaces
http://php.net/lsb - late static binding
http://php.net/closures - closures
http://php.net/gc_enable - garbage collection
http://php.net/phar - ext/phar
http://php.net/intl - ext/intl
http://php.net/fileinfo - ext/fileinfo
http://php.net/migration53 - migration guide
http://php.net/releases/5_3_0.php - release announcement
http://php.net/ChangeLog-5.php - ChangeLog
http://www.php.net/archive/2009.php#id2009-06-19-1 - PHP 5.3.0RC4 Release Announcements
http://qa.php.net/ - qa.php.net
http://cvs.php.net/viewvc.cgi/php-src/UPGRADING?revision=PHP_5_3 - 5.3 upgrade guide
http://www.php.net/archive/2009.php#id2009-06-18-1 - PHP 5.2.10 Released!
/releases/5_2_10.php - release announcement
/ChangeLog-5.php#5.2.10 - ChangeLog
http://www.php.net/archive/2009.php#id2009-06-12-1 - PHP 5.2.10RC2 and PHP 5.3.0RC3 Release Announcements
http://wiki.php.net/doc/scratchpad/upgrade/53 - 5.3 upgrade guide
http://www.php.net/archive/2009.php#id2009-05-09-1 - TestFest 2009
http://qa.php.net/testfest.php - QA TestFest page
/archive/index.php - News Archive
/source.php?url=/index.php - show source
/credits.php - credits
/stats/ - stats
/sitemap.php - sitemap
/contact.php - contact
/contact.php#ads - advertising
/mirrors.php - mirror sites
/copyright.php - Copyright © 2001-2009 The PHP Group
/mirror.php - This mirror
http://developer.yahoo.com/ - Yahoo! Inc.