
Get Links With DOM

Do not use REGEX to parse HTML

Perhaps the biggest mistake people make when trying to get URLs or link text from a web page is trying to do it using regular expressions. The job can be done with regular expressions; however, there is a high overhead in having preg loop over the entire document many times. The correct, faster, and infinitely cooler way is to use DOM.

By using DOM in the getLinks function, it is simple to create an array containing all the links on a web page as keys and the link names as values. This array can then be looped over like any other array to build a list, or manipulated in any way desired.

Note that error suppression is used when loading the HTML. This is to suppress warnings about invalid HTML entities that are not defined in the DOCTYPE. Of course, in a production environment, error display would be disabled and error reporting set to none.


<?php
function getLinks($link)
{
    /*** return array ***/
    $ret = array();

    /*** a new dom object ***/
    $dom = new DOMDocument;

    /*** get the HTML (suppress errors) ***/
    @$dom->loadHTML(file_get_contents($link));

    /*** remove silly white space ***/
    $dom->preserveWhiteSpace = false;

    /*** get the links from the HTML ***/
    $links = $dom->getElementsByTagName('a');

    /*** loop over the links ***/
    foreach ($links as $tag)
    {
        /*** the first child node holds the link name, if there is one ***/
        $child = $tag->childNodes->item(0);
        $ret[$tag->getAttribute('href')] = $child ? $child->nodeValue : '';
    }

    return $ret;
}
?>
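
As an alternative to the @ operator, libxml can be told to collect parse warnings internally instead of emitting them. The following is only a minimal sketch: the helper name loadHtmlQuietly is hypothetical and not part of this tutorial, and it takes an HTML string rather than a URL so that it runs without network access.

```php
<?php
/*** hypothetical helper: load HTML without emitting parser warnings ***/
function loadHtmlQuietly(string $html): DOMDocument
{
    $dom = new DOMDocument;

    /*** collect libxml errors internally, remembering the old setting ***/
    $previous = libxml_use_internal_errors(true);

    $dom->loadHTML($html);

    /*** discard the collected errors and restore the old setting ***/
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

/*** the undefined entity below would normally raise a warning ***/
$dom = loadHtmlQuietly('<p>Broken &madeup; markup <a href="/docs.php">manual</a></p>');
echo $dom->getElementsByTagName('a')->length, "\n";
```

Unlike @, this approach silences only the libxml parser messages, so unrelated errors raised on the same line remain visible.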

The same results can be achieved with XPath. Either way, using the DOM is going to prove far more efficient than REGEX.
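
A rough sketch of what an XPath variant might look like is shown here. The function name getLinksXPath is hypothetical, it accepts an HTML string instead of a URL so it can be run offline, and it uses textContent for the link name, which differs slightly from taking the first child node as getLinks does.

```php
<?php
/*** hypothetical XPath variant of getLinks, working on an HTML string ***/
function getLinksXPath(string $html): array
{
    /*** return array ***/
    $ret = array();

    /*** a new dom object, loaded with errors suppressed ***/
    $dom = new DOMDocument;
    @$dom->loadHTML($html);

    /*** an xpath object bound to the document ***/
    $xpath = new DOMXPath($dom);

    /*** select only the anchors that carry an href attribute ***/
    foreach ($xpath->query('//a[@href]') as $tag)
    {
        $ret[$tag->getAttribute('href')] = trim($tag->textContent);
    }

    return $ret;
}

$html = '<html><body>'
      . '<a href="/docs.php">manual</a>'
      . '<a name="top">no href here</a>'
      . '<a href="/FAQ.php"><b>faq</b></a>'
      . '</body></html>';

print_r(getLinksXPath($html));
```

The query //a[@href] skips named anchors that have no href, which getElementsByTagName('a') would otherwise return with an empty key.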

Example Usage


<?php
/*** a link to search ***/
$link = "http://php.net";

/*** get the links ***/
$urls = getLinks($link);

/*** check for results ***/
if(sizeof($urls) > 0)
{
    foreach($urls as $key=>$value)
    {
        echo $key . ' - ' . $value . '<br />';
    }
}
else
{
    echo "No links found at $link";
}
?>

Demonstration

/ -
/downloads.php - downloads page
/docs.php - manual
/FAQ.php - faq
/support.php - getting help
/mailing-lists.php - mailing lists
/license - licenses
http://wiki.php.net/ - wiki
http://bugs.php.net/ - reporting bugs
/sites.php - php.net sites
/links.php - links section
/conferences/ - conferences
/my.php - my php.net
/tut.php - introductory tutorial
/usage.php - Netcraft Survey
/thanks.php - Thanks To
http://www.easydns.com/?V=698570efeb62a6e2 - easyDNS
http://www.directi.com/ - Directi
http://promote.pair.com/direct.pl?php.net - pair Networks
http://www.servercentral.net/ - Server Central
http://www.hostedsolutions.com/ - Hosted Solutions
http://www.spry.com/ - Spry VPS Hosting
http://ez.no/ - eZ Systems
http://www.hit.no/english - HiT
http://www.osuosl.org - OSU Open Source Lab
http://www.yahoo.com/ - Yahoo! Inc.
http://www.binarysec.com/ - BinarySEC
http://www.nexcess.net/ - NEXCESS.NET
http://www.rackspace.com/ - Rackspace
http://www.eukhost.com/ - EUKhost
http://www.apache.org/ - Apache
http://www.mysql.com/ - MySQL
http://www.postgresql.org/ - PostgreSQL
http://www.zend.com/ - Zend Technologies
http://www.linuxfund.org/ - LinuxFund.org
http://www.ostg.com/ - OSTG
/feed.atom - Atom
/downloads.php#v5 - Current PHP 5.2 Stable:
/submit-event.php - [add]
/cal.php?id=1848 - Meeting usergroup Dortmund
/cal.php?id=1946 - PHP Usergroup Frankfurt/Main
/cal.php?id=3483 - Edinburgh PHP Users Group
/cal.php?id=1732 - PHP User Group Nanaimo, BC/CA
/cal.php?id=2580 - PEA meeting from phpchina
/cal.php?id=3722 - Nagpur PHP Meetup
/cal.php?id=3760 - Los Angeles PHP Developers Group
/cal.php?id=1738 - Madison PHP User's Group
/cal.php?id=2246 - PHP Brisbane Meetup Group
/cal.php?id=3708 - Nashville Enterprise LAMP UG
/cal.php?id=3761 - Chattanooga PHP Developers
/cal.php?id=1545 - Miami PHP User Group
/cal.php?id=1546 - Broward Php Usergroup
/cal.php?id=1847 - Nashville PHP Users Group
/cal.php?id=2208 - Chicago PHP User Group Brunch
/cal.php?id=3925 - Baltimore PHP User Group
/cal.php?id=1704 - TriPUG
/cal.php?id=1719 - OINK-PUG (Cincinnati, Ohio)
/cal.php?id=1820 - Utah PHP Users Group Meeting
/cal.php?id=3844 - NorfolkPHP
/cal.php?id=1131 - Kansas City
/cal.php?id=1346 - Miami Linux Users Group
/cal.php?id=1671 - Twin Cities PHP
/cal.php?id=2449 - Los Angeles LAMPsig
/cal.php?id=409 - New York
/cal.php?id=384 - AzPHP
/cal.php?id=3075 - DCPHP Beverage Subgroup
/cal.php?id=3653 - Brisbane PHP User Group
/cal.php?id=3917 - Colorado Springs - FRPUG
/cal.php?id=1316 - Arabic PHP Group Meeting
/cal.php?id=1708 - Malaysia PHP User Group Meet Up
/cal.php?id=2499 - Sandy PHP Group
/cal.php?id=2629 - Sacramento PHP Group
/cal.php?id=2662 - Miami Linux Meetup
/cal.php?id=3422 - PHP RIO Meetup
/cal.php?id=1099 - Long Island PHP Users Group
/cal.php?id=2527 - Malaysia PHP Meetup
/cal.php?id=2600 - PHP Usergroup Karlsruhe
/cal.php?id=2660 - PHPUG Wuerzburg
/cal.php?id=2500 - Irish PHP Users Group meeting
/cal.php?id=109 - SDPHP (San Diego, CA)
/cal.php?id=272 - Hannover
/cal.php?id=561 - Meetup Day
/cal.php?id=1005 - Omaha PHP Users Group Meetup
/cal.php?id=1304 - PHP London
/cal.php?id=1624 - The Houston PHP Users Group
/cal.php?id=1632 - Boston PHP Meetup
/cal.php?id=1706 - Atlanta PHP User Group
/cal.php?id=1795 - Manchester UK - PHP Group
/cal.php?id=1918 - Sydney PHP Group meetings
/cal.php?id=2017 - PHP UG Meetup Auckland
/cal.php?id=2418 - Seattle PHP Meetup Group
/cal.php?id=2734 - The Copenhagen PHP Meetup Group
/cal.php?id=2932 - SF PHP Meetup
/cal.php?id=3416 - Knoxville Python & PHP UG
/cal.php?id=3861 - Minnesota PHP User Group
/cal.php?id=153 - Köln/Bonn
/cal.php?id=2663 - Iran PHP developer's meetup
/cal.php?id=1923 - PHP meeting online in China
/cal.php?id=2540 - meeting de LAMPistas en La Paz
/cal.php?id=1745 - SW Florida Linux Users Group
/cal.php?id=1860 - PDXPHP monthly meeting
/cal.php?id=2301 - Jacksonville User Group
/cal.php?id=2814 - Berlin PHP Usergroup Meeting
/cal.php?id=3294 - PHPNW: PHP North West user group
/cal.php?id=2352 - Meeting PHP Usergroup OWL
/cal.php?id=2682 - BostonPHP
/cal.php?id=3793 - Pittsburgh PHP Meetup Group
/cal.php?id=1385 - Hamburg
/cal.php?id=1523 - Dallas PHP/MySQL Users Group
/cal.php?id=1670 - Dallas PHP Users Group (DPUG)
/cal.php?id=1652 - Austin PHP Meetup
/cal.php?id=1665 - OKC PHP Meetup
/cal.php?id=3643 - Oklahoma City PHP User Group
/cal.php?id=3980 - Buffalo PHP Meetup
/cal.php?id=1395 - Wash DC PHP Developers Group
/cal.php?id=3684 - PHP User Group Stuttgart
/cal.php?id=3918 - Denver - FRPUG
/cal.php?id=1516 - Curso de PHP Avanzado en Bilbao
/cal.php?id=3880 - Curso PHP avanzado
/cal.php?id=2702 - PHP & AJAX -Construindo Websites
/cal.php?id=3560 - Core and Advanced PHP Workshop
/cal.php?id=2023 - Ahmedabad PHP Group Training
/cal.php?id=338 - MySQL Spain
/cal.php?id=456 - Curso PHP Madrid
/cal.php?id=641 - PHP E-Learning/Germany
/cal.php?id=998 - Curso on-line ActionScript / PHP
/cal.php?id=1198 - PHP & MySQL Training in Kassel
/cal.php?id=1360 - PHP & MySQL com Dreamweaver MX
/cal.php?id=1981 - Curso on-line de PHP
/cal.php?id=2051 - PHP & MYSQL-Construindo WebSites
/cal.php?id=3053 - PHP Training Heilbronn
/cal.php?id=3927 - ZEND: On-line PHPI: Foundations
/cal.php?id=3928 - ZEND: On-line PHPII
/cal.php?id=3931 - ZEND: Framework Fundamentals
/cal.php?id=841 - Curso on-line de PHP-MySQL
/cal.php?id=1490 - PHP Class at CalTek
/cal.php?id=3969 - Linux Apache MySQL PHP/Ottawa
/cal.php?id=3996 - Разработка на PHP5
/cal.php?id=2144 - PHP Training - Chennai - India
/cal.php?id=3703 - Zend Certification
/cal.php?id=3386 - UK Smarty Templating Workshop
/cal.php?id=1466 - PHP para Expertos Curso on-line
/cal.php?id=1583 - Curso PHP y MySQL
/cal.php?id=3929 - ZEND: On-line Test Prep PHP5
/cal.php?id=3930 - ZEND: PHP for Exp Programmers
/cal.php?id=3385 - UK Object Orientation Workshop
/cal.php?id=2408 - Chennai PHP Training
/cal.php?id=1200 - PHP & MySQL Training / Gießen
/cal.php?id=2589 - PHP Intro Course South Africa
/cal.php?id=3991 - PHP. Основы создани
/cal.php?id=1389 - Cursos de PHP en Bilbao
/cal.php?id=3933 - Zend: On-line Server Course
/cal.php?id=1137 - PHP Brasil - Training
/cal.php?id=4008 - Intermediate PHP, Weekend Course
/cal.php?id=2421 - Basic PHP Course
/cal.php?id=231 - UK PHP Training
http://www.php.net/conferences/index.php#id2010-02-19-1 - Dutch PHP Conference
http://www.php.net/archive/2010.php#id2010-03-04-1 - PHP 5.3.2 Release Announcement
http://php.net/migration53 - here
/ChangeLog-5.php#5.3.2 - ChangeLog
http://windows.php.net/download/ - windows.php.net/download/
http://www.php.net/archive/2010.php#id2010-02-25-1 - PHP 5.2.13 Released!
/releases/5_2_13.php - release announcement
/ChangeLog-5.php#5.2.13 - ChangeLog
http://www.php.net/archive/2009.php#id2009-12-17-1 - PHP 5.2.12 Released!
/releases/5_2_12.php - release announcement
/ChangeLog-5.php#5.2.12 - ChangeLog
http://www.php.net/archive/2009.php#id2009-11-19-1 - PHP 5.3.1 Released!
http://www.php.net/releases/5_3_1.php - release announcement
http://www.php.net/ChangeLog-5.php#5.3.1 - ChangeLog
http://www.php.net/archive/2009.php#id2009-09-17-1 - PHP 5.2.11 Released!
/releases/5_2_11.php - release announcement
/ChangeLog-5.php#5.2.11 - ChangeLog
/archive/index.php - News Archive
/source.php?url=/index.php - show source
/credits.php - credits
/stats/ - stats
/sitemap.php - sitemap
/contact.php - contact
/contact.php#ads - advertising
/mirrors.php - mirror sites
/copyright.php - Copyright © 2001-2009 The PHP Group
/mirror.php - This mirror
http://developer.yahoo.com/ - Yahoo! Inc.