Get Links With DOM

Do not use REGEX to parse HTML

Perhaps the biggest mistake people make when trying to get URLs or link text from a web page is trying to do it with regular expressions. The job can be done with regular expressions, but there is a high overhead in having preg loop over the entire document many times. The correct, faster, and infinitely cooler way is to use DOM.

By using DOM in the getLinks function, it is simple to create an array containing all the links on a web page as keys and the link names as values. This array can then be looped over like any other array to build a list, or manipulated in any way desired.

Note that error suppression is used when loading the HTML. This suppresses warnings about invalid HTML, such as entities that are not defined in the DOCTYPE. In a production environment, of course, the display of errors would be disabled and error reporting set to none.
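As an alternative to the @ operator, libxml can be told to collect parse errors internally rather than emit warnings. The following is only a minimal sketch of that idea; the helper name loadHtmlQuietly() and the sample markup are made up for illustration and are not part of the getLinks function.

```php
<?php

/*** hypothetical helper: load HTML without using the @ operator ***/
function loadHtmlQuietly($html)
{
    /*** collect libxml errors internally instead of emitting warnings ***/
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    /*** the collected errors can be inspected, then cleared ***/
    $errors = libxml_get_errors();
    libxml_clear_errors();

    /*** restore the previous setting ***/
    libxml_use_internal_errors($previous);

    return array($dom, $errors);
}

/*** sample markup with a deliberately undefined entity ***/
list($dom, $errors) = loadHtmlQuietly('<p>Fish &bogus; chips <a href="/x">x</a></p>');

echo count($errors) . " parse problem(s) collected\n";
echo $dom->getElementsByTagName('a')->length . " link(s) found\n";
```

The advantage over @ is that the parse problems are still available for logging if they are ever needed, while nothing is written to the page.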


<?php

function getLinks($link)
{
    /*** return array ***/
    $ret = array();

    /*** a new dom object ***/
    $dom = new DOMDocument;

    /*** remove silly white space (set before loading, as it is a parse-time option) ***/
    $dom->preserveWhiteSpace = false;

    /*** get the HTML (suppress errors) ***/
    @$dom->loadHTML(file_get_contents($link));

    /*** get the links from the HTML ***/
    $links = $dom->getElementsByTagName('a');

    /*** loop over the links ***/
    foreach ($links as $tag)
    {
        /*** the first child node holds the link name; an empty anchor has none ***/
        $child = $tag->childNodes->item(0);
        $ret[$tag->getAttribute('href')] = $child ? $child->nodeValue : '';
    }

    return $ret;
}
?>

A similar approach would be to use XPath, which achieves the same results. Either way, using the DOM is going to prove far more efficient than REGEX.
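The XPath approach might look something like the sketch below. The function name getLinksWithXPath() is hypothetical, and it takes an HTML string rather than a URL so that it can be tried without a network connection. Note two small differences from getLinks, both assumptions of this sketch: the query only selects anchors that have an href attribute, and the link name is taken from the full text content of the anchor rather than from its first child node.

```php
<?php

/*** hypothetical helper: collect href => link name pairs using XPath ***/
function getLinksWithXPath($html)
{
    /*** return array ***/
    $ret = array();

    /*** a new dom object ***/
    $dom = new DOMDocument;

    /*** get the HTML (suppress errors) ***/
    @$dom->loadHTML($html);

    /*** an xpath object for querying the document ***/
    $xpath = new DOMXPath($dom);

    /*** select only the anchors that carry an href attribute ***/
    foreach ($xpath->query('//a[@href]') as $tag)
    {
        $ret[$tag->getAttribute('href')] = trim($tag->textContent);
    }

    return $ret;
}

/*** sample markup: one real link and one named anchor without an href ***/
$html = '<html><body><a href="/docs.php">manual</a> <a name="top">top</a></body></html>';

foreach (getLinksWithXPath($html) as $key => $value)
{
    echo $key . ' - ' . $value . "\n";
}
```

Because the filtering happens in the XPath expression, further conditions (for example, only links inside a particular element) can be added to the query without changing the loop.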

Example Usage


<?php

/*** a link to search ***/
$link = "http://php.net";

/*** get the links ***/
$urls = getLinks($link);

/*** check for results ***/
if (sizeof($urls) > 0)
{
    foreach ($urls as $key => $value)
    {
        echo $key . ' - ' . $value . '<br />';
    }
}
else
{
    echo "No links found at $link";
}
?>

Demonstration

/ -
/downloads.php - downloads
/docs.php - manual
/FAQ.php - faq
/support.php - getting help
/mailing-lists.php - mailing lists
http://wiki.php.net/ - wiki
http://bugs.php.net/ - reporting bugs
/sites.php - php.net sites
/links.php - links section
/conferences/ - conferences
/my.php - my php.net
/tut.php - introductory tutorial
/usage.php - Netcraft Survey
/thanks.php - Thanks To
http://www.easydns.com/?V=698570efeb62a6e2 - easyDNS
http://www.directi.com/ - Directi
http://promote.pair.com/direct.pl?php.net - pair Networks
http://www.servercentral.net/ - Server Central
http://www.hostedsolutions.com/ - Hosted Solutions
http://www.spry.com/ - Spry VPS Hosting
http://ez.no/ - eZ Systems
http://www.hit.no/english - HiT
http://www.osuosl.org - OSU Open Source Lab
http://www.yahoo.com/ - Yahoo! Inc.
http://www.binarysec.com/ - BinarySEC
http://www.nexcess.net/ - NEXCESS.NET
http://www.rackspace.com/ - Rackspace
http://www.eukhost.com/ - EUKhost
http://www.apache.org/ - Apache
http://www.mysql.com/ - MySQL
http://www.postgresql.org/ - PostgreSQL
http://www.zend.com/ - Zend Technologies
http://www.linuxfund.org/ - LinuxFund.org
http://www.ostg.com/ - OSTG
/feed.atom - Atom
/downloads.php#v5 - Current PHP 5 Stable:
/downloads.php#v4 - Historical PHP 4 Stable:
http://qa.php.net/rc.php - Release Candidates
http://qa.php.net/ - Current PHP 5 RC:
/submit-event.php - [add]
/cal.php?id=3220 - PHPNW Conference
/cal.php?id=3311 - First Especializa meeting
/cal.php?id=3299 - PHP World Kongress in Munich
/cal.php?id=3302 - OpenSource ContributorConference
/cal.php?id=3210 - PHP Conference Brazil '08
/cal.php?id=1704 - TriPUG
/cal.php?id=1719 - OINK-PUG (Cincinnati, Ohio)
/cal.php?id=1820 - Utah PHP Users Group Meeting
/cal.php?id=3295 - Comunidad Argentina de PHP
/cal.php?id=2629 - Sacramento PHP Group
/cal.php?id=1099 - Long Island PHP Users Group
/cal.php?id=409 - New York
/cal.php?id=384 - AzPHP
/cal.php?id=2527 - Malaysia PHP Meetup
/cal.php?id=2600 - PHP Usergroup Karlsruhe
/cal.php?id=2660 - PHPUG Wuerzburg
/cal.php?id=3075 - DCPHP Beverage Subgroup
/cal.php?id=2500 - Irish PHP Users Group meeting
/cal.php?id=1316 - Arabic PHP Group Meeting
/cal.php?id=1708 - Malaysia PHP User Group Meet Up
/cal.php?id=2499 - Sandy PHP Group
/cal.php?id=2662 - Miami Linux Meetup
/cal.php?id=3309 - PHP Paraiba Meeting
/cal.php?id=3298 - PHP Forum in Paris, France
/cal.php?id=3303 - PHP Forum 2008 in Paris, France
/cal.php?id=1745 - SW Florida Linux Users Group
/cal.php?id=1860 - PDXPHP monthly meeting
/cal.php?id=3294 - PHPNW: PHP North West user group
/cal.php?id=2352 - Meeting PHP Usergroup OWL
/cal.php?id=2495 - PHP Meetup Columbia MD
/cal.php?id=2682 - BostonPHP
/cal.php?id=2814 - Berlin PHP Usergroup Meeting
/cal.php?id=109 - SDPHP (San Diego, CA)
/cal.php?id=272 - Hannover
/cal.php?id=561 - Meetup Day
/cal.php?id=1005 - Omaha PHP Users Group Meetup
/cal.php?id=1304 - PHP London
/cal.php?id=1624 - The Houston PHP Users Group
/cal.php?id=1632 - Boston PHP Meetup
/cal.php?id=1706 - Atlanta PHP User Group
/cal.php?id=1795 - Manchester UK - PHP Group
/cal.php?id=1797 - EdPUG - Edinburgh PHP User Group
/cal.php?id=1918 - Sydney PHP Group meetings
/cal.php?id=2017 - PHP UG Meetup Auckland
/cal.php?id=2195 - Cape Town PHP Users Group
/cal.php?id=2301 - Jacksonville User Group
/cal.php?id=2418 - Seattle PHP Meetup Group
/cal.php?id=2734 - The Copenhagen PHP Meetup Group
/cal.php?id=2932 - SF PHP Meetup
/cal.php?id=153 - Köln/Bonn
/cal.php?id=2663 - Iran PHP developer's meetup
/cal.php?id=1923 - PHP meeting online in China
/cal.php?id=2540 - meeting de LAMPistas en La Paz
/cal.php?id=1385 - Hamburg
/cal.php?id=1523 - Dallas PHP/MySQL Users Group
/cal.php?id=1670 - Dallas PHP Users Group (DPUG)
/cal.php?id=1652 - Austin PHP Meetup
/cal.php?id=1665 - OKC PHP Meetup
/cal.php?id=1395 - Wash DC PHP Developers Group
/cal.php?id=2503 - Stuttgart
/cal.php?id=1848 - Meeting usergroup Dortmund
/cal.php?id=1946 - PHP Usergroup Frankfurt/Main
/cal.php?id=2670 - Melbourne PHP User Group
/cal.php?id=1732 - PHP User Group Nanaimo, BC/CA
/cal.php?id=2580 - PEA meeting from phpchina
/cal.php?id=1738 - Madison PHP User's Group
/cal.php?id=2246 - PHP Brisbane Meetup Group
/cal.php?id=1545 - Miami PHP User Group
/cal.php?id=1546 - Broward Php Usergroup
/cal.php?id=1847 - Nashville PHP Users Group
/cal.php?id=2208 - Chicago PHP User Group Brunch
/cal.php?id=1131 - Kansas City
/cal.php?id=1346 - Miami Linux Users Group
/cal.php?id=1671 - Twin Cities PHP
/cal.php?id=2449 - Los Angeles LAMPsig
/cal.php?id=338 - MySQL Spain
/cal.php?id=456 - Curso PHP Madrid
/cal.php?id=641 - PHP E-Learning/Germany
/cal.php?id=998 - Curso on-line ActionScript / PHP
/cal.php?id=1198 - PHP & MySQL Training in Kassel
/cal.php?id=1360 - PHP & MySQL com Dreamweaver MX
/cal.php?id=1981 - Curso on-line de PHP
/cal.php?id=2051 - PHP & MYSQL-Construindo WebSites
/cal.php?id=2831 - Developing Websites with PHP
/cal.php?id=3053 - PHP Training Heilbronn
/cal.php?id=3233 - PHP/MySQL training San Francisco
/cal.php?id=3246 - PHP Grundlagen in Giessen
/cal.php?id=3285 - Zend PHP I - Online course
/cal.php?id=3286 - Zend PHP II - Online course
/cal.php?id=3304 - Introduzione a PHP 5
/cal.php?id=841 - Curso on-line de PHP-MySQL
/cal.php?id=1490 - PHP Class at CalTek
/cal.php?id=3275 - Introduction to PHP
/cal.php?id=2144 - PHP Training - Chennai - India
/cal.php?id=3276 - PHP 5 Programming
/cal.php?id=1516 - Curso de PHP Avanzado en Bilbao
/cal.php?id=3192 - PHP - Object Orientation
/cal.php?id=2702 - PHP & AJAX -Construindo Websites
/cal.php?id=2023 - Ahmedabad PHP Group Training
/cal.php?id=1466 - PHP para Expertos Curso on-line
/cal.php?id=1583 - Curso PHP y MySQL
/cal.php?id=2977 - Formation maitrise PHP a Paris
/cal.php?id=3168 - PHP Boot Camp (Raleigh, NC)
/cal.php?id=3237 - Securing PHP Web Apps (Anaheim)
/cal.php?id=3289 - Zend Framework - Online Course
/cal.php?id=3301 - PHP Training Philippines
/cal.php?id=3273 - Quality Assurance in PHP Project
/cal.php?id=3277 - Security of PHP applications
/cal.php?id=3278 - Performance of web applications
/cal.php?id=3186 - Design Patterns
/cal.php?id=1200 - PHP & MySQL Training / Gießen
/cal.php?id=2986 - Formation PHP Expert certifie
/cal.php?id=1389 - Cursos de PHP en Bilbao
/cal.php?id=2880 - PHP & MySQL Seminar
/cal.php?id=2408 - Chennai PHP Training
/cal.php?id=2589 - PHP Intro Course South Africa
/cal.php?id=231 - UK PHP Training
/cal.php?id=1137 - PHP Brasil - Training
/cal.php?id=2421 - Basic PHP Course
http://www.php.net/conferences/index.php#id2008-11-15-1 - Forum PHP Paris 2008
http://www.php.net/conferences/index.php#id2008-11-04-2 - PHPNW08 - November 22nd - Manchester, UK
http://www.php.net/conferences/index.php#id2008-11-04-1 - PHP UK Conference 2009
http://www.php.net/archive/2008.php#id2008-08-07-1 - PHP 4.4.9 released!
/ChangeLog-4.php#4.4.9 - ChangeLog
http://www.php.net/archive/2008.php#id2008-08-01-1 - PHP 5.3 alpha1 released!
http://downloads.php.net/johannes/ - first alpha release
http://downloads.php.net/pierre/ - Windows binaries
http://snaps.php.net - snaps.php.net
http://php.net/docs.php - official documentation
http://wiki.php.net/doc/scratchpad/upgrade/53 - several major features
http://php.net/php5news - NEWS
mailto:php-qa@lists.php.net - QA mailinglist
http://bugs.php.net - bug tracker
http://php.net/language.namespaces - Namespaces
http://php.net/oop5.late-static-bindings - Late static binding
http://php.net/language.oop5.overloading - __callStatic
http://wiki.php.net/rfc/closures - Lambda functions and closures
http://php.net/book.intl - intl
http://php.net/book.phar - phar
http://php.net/book.fileinfo - fileinfo
http://php.net/book.sqlite3 - sqlite3
http://forge.mysql.com/wiki/PHP_MYSQLND - MySQLnd
http://wiki.php.net/internals/windows/releasenotes - details
http://php.net/language.types.string#language.types.string.syntax.nowdoc - NOWDOC
http://wiki.php.net/todo/php53 - release plan
http://www.php.net/archive/2008.php#id2008-07-30-1 - TestFest 2008 wrap-up
http://qa.php.net/testfest.php - TestFest 2008
http://testfest.php.net - TestFest submission site
http://gcov.php.net/ - test coverage
http://qa.php.net/write-test.php - phpt
http://www.deshong.net/?p=76 - blog
http://flickr.com/groups/elephpants/pool/ - elePHPant
http://www.nexen.net - Nexen
http://www.php.net/archive/2008.php#id2008-07-29-1 - Manual restructure and license change
/manual - manual
/pdo.prepared-statements - per-extension chapters
/haru.examples - usage examples
/class.xmlreader - improved documentation
/oop5/ - object oriented
/funcref - function reference
/reserved.variables - predefined variables
/context - context options and parameters
/reserved.exceptions - predefined exceptions
/contact - we would really appreciate feedback
/namespaces - namespaces
/lsb - late static bindings
/intl - internationalization functions
/ini.sections - INI sections
/phar - Phar
http://creativecommons.org/licenses/by/3.0/ - CreativeCommons Attribution license
http://www.php.net/archive/2008.php#id2008-04-22-1 - Google Summer of Code: php.net students
http://code.google.com/soc/ - Google Summer of Code
http://code.google.com/soc/2008/php/appinfo.html?csaid=73D5F5E282F9163F - Zend LLVM Extension
http://code.google.com/soc/2008/php/appinfo.html?csaid=12A8D27646C9771A - PHP Optimizer
http://code.google.com/soc/2008/php/appinfo.html?csaid=3D5258783F22F62C - PhD (PHP Docbook) Project
http://code.google.com/soc/2008/php/appinfo.html?csaid=93F63E6C761134FB - Replace auto* with CMake
http://code.google.com/soc/2008/php/appinfo.html?csaid=F74E5E31D92F95D0 - gsoc:2008 - XDebug
http://code.google.com/soc/2008/php/appinfo.html?csaid=435245F847240738 - Rewrite the run-tests.php script
http://code.google.com/soc/2008/php/appinfo.html?csaid=837287100B93044F - PHP Bindings for Cairo
http://code.google.com/soc/2008/php/appinfo.html?csaid=25AE6211DDEC86FD - Algorithm Optimizations
http://code.google.com/soc/2008/php/appinfo.html?csaid=5A442E6A7568434D - PECL, Website Improvements
http://code.google.com/soc/2008/php/appinfo.html?csaid=AD4803BA9A70BCB3 - Implement Unicode into PHP 6
/archive/index.php - News Archive
/source.php?url=/index.php - show source
/credits.php - credits
/stats/ - stats
/sitemap.php - sitemap
/contact.php - contact
/contact.php#ads - advertising
/mirrors.php - mirror sites
/copyright.php - Copyright © 2001-2008 The PHP Group
/mirror.php - This mirror
http://developer.yahoo.com/ - Yahoo! Inc.