Get Links With DOM
Do not use REGEX to parse HTML
Perhaps the biggest mistake people make when trying to get URLs or link text from a web page is trying to do it with regular expressions. The job can be done with regular expressions; however, there is a high overhead in having preg loop over the entire document many times. The correct way, and the faster and infinitely cooler way, is to use DOM.
By using DOM in the getLinks function, it is simple to create an array containing all the links on a web page as keys and the link names as values. This array can then be looped over like any other array to create a list, or manipulated in any way desired.
Note that error suppression is used when loading the HTML. This suppresses warnings about malformed markup, such as HTML entities that are not defined in the DOCTYPE. Of course, in a production environment, the display of errors would be disabled and error reporting set to none.
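<?php
function getLinks($link)
{
    /*** return array ***/
    $ret = array();
    /*** fetch the page, return the empty array if it cannot be read ***/
    $html = @file_get_contents($link);
    if ($html === false || $html === '')
    {
        return $ret;
    }
    /*** a new dom object ***/
    $dom = new DOMDocument;
    /*** remove silly white space (must be set before loading) ***/
    $dom->preserveWhiteSpace = false;
    /*** load the HTML (suppress errors) ***/
    @$dom->loadHTML($html);
    /*** get the links from the HTML ***/
    $links = $dom->getElementsByTagName('a');
    /*** loop over the links ***/
    foreach ($links as $tag)
    {
        /*** an empty anchor has no first child, so guard against null ***/
        $first = $tag->childNodes->item(0);
        $ret[$tag->getAttribute('href')] = $first ? $first->nodeValue : '';
    }
    return $ret;
}
?>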
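As an alternative to the @ operator used above, libxml can be told to collect parse errors internally so they can be inspected or discarded. The following is only a minimal sketch under stated assumptions: the $html string is made-up sample markup rather than a fetched page, and the variable names are illustrative.

```php
<?php
/*** hypothetical sample markup containing an undefined entity ***/
$html = '<html><body><a href="/docs.php">manual</a> &bogus; '
      . '<a href="/FAQ.php">faq</a></body></html>';

/*** collect libxml errors internally instead of emitting warnings ***/
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

/*** fetch any collected errors for logging, then clear the buffer ***/
$errors = libxml_get_errors();
libxml_clear_errors();

/*** restore the previous error handling mode ***/
libxml_use_internal_errors($previous);

/*** build the href => link text array as before ***/
$found = array();
foreach ($dom->getElementsByTagName('a') as $tag)
{
    $found[$tag->getAttribute('href')] = trim($tag->textContent);
}

echo count($errors) . " parse message(s) collected\n";
print_r($found);
```

Unlike @, this approach keeps the parser messages available, so they can be written to a log if that is useful.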
A similar approach is to use XPath, which achieves the same results. Either way, using the DOM is going to prove far more efficient than REGEX.
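The XPath approach mentioned above might look something like the following. This is a sketch, not a drop-in replacement for getLinks: the $html string is made-up sample markup, and the query //a[@href] is one possible choice that selects only anchors that actually carry an href attribute.

```php
<?php
/*** hypothetical sample markup, the middle anchor has no href ***/
$html = '<html><body><a href="/docs.php">manual</a>'
      . '<a name="top">anchor only</a>'
      . '<a href="/FAQ.php">faq</a></body></html>';

/*** load the HTML, keeping parser warnings out of the output ***/
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

/*** query for anchors that have an href attribute ***/
$xpath = new DOMXPath($dom);
$xpathLinks = array();
foreach ($xpath->query('//a[@href]') as $tag)
{
    $xpathLinks[$tag->getAttribute('href')] = trim($tag->textContent);
}

print_r($xpathLinks);
```

Because the query filters on the attribute, named anchors without a URL are skipped rather than stored under an empty key.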
Example Usage
<?php
/*** a link to search ***/
$link = "http://php.net";
/*** get the links ***/
$urls = getLinks($link);
/*** check for results ***/
if (count($urls) > 0)
{
    foreach ($urls as $key => $value)
    {
        /*** escape the values so scraped text cannot break the page ***/
        echo htmlspecialchars($key) . ' - ' . htmlspecialchars($value) . '<br />';
    }
}
else
{
    echo "No links found at $link";
}
?>
Demonstration
/ -
/downloads.php - downloads page
/docs.php - manual
/FAQ.php - faq
/support.php - getting help
/mailing-lists.php - mailing lists
/license - licenses
http://wiki.php.net/ - wiki
http://bugs.php.net/ - reporting bugs
/sites.php - php.net sites
/links.php - links section
/conferences/ - conferences
/my.php - my php.net
/tut.php - introductory tutorial
/usage.php - Netcraft Survey
/thanks.php - Thanks To
http://www.easydns.com/?V=698570efeb62a6e2 - easyDNS
http://www.directi.com/ - Directi
http://promote.pair.com/direct.pl?php.net - pair Networks
http://www.servercentral.net/ - Server Central
http://www.hostedsolutions.com/ - Hosted Solutions
http://www.spry.com/ - Spry VPS Hosting
http://ez.no/ - eZ Systems
http://www.hit.no/english - HiT
http://www.osuosl.org - OSU Open Source Lab
http://www.yahoo.com/ - Yahoo! Inc.
http://www.binarysec.com/ - BinarySEC
http://www.nexcess.net/ - NEXCESS.NET
http://www.rackspace.com/ - Rackspace
http://www.eukhost.com/ - EUKhost
http://www.apache.org/ - Apache
http://www.mysql.com/ - MySQL
http://www.postgresql.org/ - PostgreSQL
http://www.zend.com/ - Zend Technologies
http://www.linuxfund.org/ - LinuxFund.org
http://www.ostg.com/ - OSTG
/feed.atom - Atom
/downloads.php#v5 - Current PHP 5.2 Stable:
/submit-event.php - [add]
/cal.php?id=1316 - Arabic PHP Group Meeting
/cal.php?id=1708 - Malaysia PHP User Group Meet Up
/cal.php?id=2499 - Sandy PHP Group
/cal.php?id=2662 - Miami Linux Meetup
/cal.php?id=3422 - PHP RIO Meetup
/cal.php?id=4019 - PHP User Group Hong Kong
/cal.php?id=1923 - PHP meeting online in China
/cal.php?id=2540 - meeting de LAMPistas en La Paz
/cal.php?id=1745 - SW Florida Linux Users Group
/cal.php?id=1860 - PDXPHP monthly meeting
/cal.php?id=2301 - Jacksonville User Group
/cal.php?id=2814 - Berlin PHP Usergroup Meeting
/cal.php?id=3294 - PHPNW: PHP North West user group
/cal.php?id=2352 - Meeting PHP Usergroup OWL
/cal.php?id=2682 - BostonPHP
/cal.php?id=3793 - Pittsburgh PHP Meetup Group
/cal.php?id=109 - SDPHP (San Diego, CA)
/cal.php?id=272 - Hannover
/cal.php?id=561 - Meetup Day
/cal.php?id=1005 - Omaha PHP Users Group Meetup
/cal.php?id=1304 - PHP London
/cal.php?id=1624 - The Houston PHP Users Group
/cal.php?id=1632 - Boston PHP Meetup
/cal.php?id=1706 - Atlanta PHP User Group
/cal.php?id=1795 - Manchester UK - PHP Group
/cal.php?id=1918 - Sydney PHP Group meetings
/cal.php?id=2017 - PHP UG Meetup Auckland
/cal.php?id=2418 - Seattle PHP Meetup Group
/cal.php?id=2734 - The Copenhagen PHP Meetup Group
/cal.php?id=2932 - SF PHP Meetup
/cal.php?id=3416 - Knoxville Python & PHP UG
/cal.php?id=3861 - Minnesota PHP User Group
/cal.php?id=4014 - OrlandoPHP User Group
/cal.php?id=4147 - PHP Cardiff Meetup
/cal.php?id=153 - Köln/Bonn
/cal.php?id=2663 - Iran PHP developer's meetup
/cal.php?id=3760 - Los Angeles PHP Developers Group
/cal.php?id=1385 - Hamburg
/cal.php?id=1523 - Dallas PHP/MySQL Users Group
/cal.php?id=1670 - Dallas PHP Users Group (DPUG)
/cal.php?id=1652 - Austin PHP Meetup
/cal.php?id=1665 - OKC PHP Meetup
/cal.php?id=1847 - Nashville PHP User Group
/cal.php?id=3643 - Oklahoma City PHP User Group
/cal.php?id=3980 - Buffalo PHP Meetup
/cal.php?id=1395 - Wash DC PHP Developers Group
/cal.php?id=3684 - PHP User Group Stuttgart
/cal.php?id=3918 - Denver - FRPUG
/cal.php?id=1848 - Meeting usergroup Dortmund
/cal.php?id=1946 - PHP Usergroup Frankfurt/Main
/cal.php?id=3483 - Edinburgh PHP Users Group
/cal.php?id=1732 - PHP User Group Nanaimo, BC/CA
/cal.php?id=2580 - PEA meeting from phpchina
/cal.php?id=3722 - Nagpur PHP Meetup
/cal.php?id=1738 - Madison PHP User's Group
/cal.php?id=2246 - PHP Brisbane Meetup Group
/cal.php?id=3708 - Nashville Enterprise LAMP UG
/cal.php?id=3761 - Chattanooga PHP Developers
/cal.php?id=1545 - Miami PHP User Group
/cal.php?id=1546 - Broward Php Usergroup
/cal.php?id=2208 - Chicago PHP User Group Brunch
/cal.php?id=3925 - Baltimore PHP User Group
/cal.php?id=1704 - TriPUG
/cal.php?id=1719 - OINK-PUG (Cincinnati, Ohio)
/cal.php?id=1820 - Utah PHP Users Group Meeting
/cal.php?id=3844 - NorfolkPHP
/cal.php?id=1131 - Kansas City
/cal.php?id=1346 - Miami Linux Users Group
/cal.php?id=1671 - Twin Cities PHP
/cal.php?id=2449 - Los Angeles LAMPsig
/cal.php?id=409 - New York
/cal.php?id=384 - AzPHP
/cal.php?id=3075 - DCPHP Beverage Subgroup
/cal.php?id=3653 - Brisbane PHP User Group
/cal.php?id=2500 - Irish PHP Users Group meeting
/cal.php?id=3917 - Colorado Springs - FRPUG
/cal.php?id=4216 - Meeting ColombiaPHP User Group
/cal.php?id=2629 - Sacramento PHP Group
/cal.php?id=1099 - Long Island PHP Users Group
/cal.php?id=2527 - Malaysia PHP Meetup
/cal.php?id=2600 - PHP Usergroup Karlsruhe
/cal.php?id=2660 - PHPUG Wuerzburg
/cal.php?id=2023 - Ahmedabad PHP Group Training
/cal.php?id=338 - MySQL Spain
/cal.php?id=456 - Curso PHP Madrid
/cal.php?id=641 - PHP E-Learning/Germany
/cal.php?id=998 - Curso on-line ActionScript / PHP
/cal.php?id=1198 - PHP & MySQL Training in Kassel
/cal.php?id=1360 - PHP & MySQL com Dreamweaver MX
/cal.php?id=1981 - Curso on-line de PHP
/cal.php?id=2051 - PHP & MYSQL-Construindo WebSites
/cal.php?id=3053 - PHP Training Heilbronn
/cal.php?id=3990 - PHP. Основы создани
/cal.php?id=4032 - ZEND:PHP I:Foundations online
/cal.php?id=4033 - ZEND:PHPII Higher Structures
/cal.php?id=4035 - ZEND: Studio on-line class
/cal.php?id=4208 - PHP Essentials
/cal.php?id=4210 - PHP & MySQL : Advanced Training
/cal.php?id=841 - Curso on-line de PHP-MySQL
/cal.php?id=1490 - PHP Class at CalTek
/cal.php?id=2144 - PHP Training - Chennai - India
/cal.php?id=3703 - Zend Certification
/cal.php?id=4215 - PHP training Hyderabad
/cal.php?id=1516 - Curso de PHP Avanzado en Bilbao
/cal.php?id=2702 - PHP & AJAX -Construindo Websites
/cal.php?id=3560 - Core and Advanced PHP Workshop
/cal.php?id=3800 - Learning to Program in PHP
/cal.php?id=1466 - PHP para Expertos Curso on-line
/cal.php?id=1583 - Curso PHP y MySQL
/cal.php?id=3805 - PHP Programming
/cal.php?id=4076 - Advanced PHP Training
/cal.php?id=4164 - LAMP Training in Montreal
/cal.php?id=4197 - PHP Training Philippines
/cal.php?id=3385 - UK Object Orientation Workshop
/cal.php?id=3386 - UK Smarty Templating Workshop
/cal.php?id=3814 - Object Oriented programming &PHP
/cal.php?id=1200 - PHP & MySQL Training / Gießen
/cal.php?id=4034 - ZEND:Test Prep: PHP 5 cert
/cal.php?id=4037 - ZEND: Framework Fundamentals
/cal.php?id=1389 - Cursos de PHP en Bilbao
/cal.php?id=2408 - Chennai PHP Training
/cal.php?id=2589 - PHP Intro Course South Africa
/cal.php?id=4124 - PHP Fortgeschrittene Seminar
/cal.php?id=4165 - LAMP Training in Quebec City
/cal.php?id=231 - UK PHP Training
/cal.php?id=4036 - ZEND: PHP Security
/cal.php?id=3810 - PHP Techniques
/cal.php?id=1137 - PHP Brasil - Training
/cal.php?id=4220 - PHP Training
/cal.php?id=2421 - Basic PHP Course
/cal.php?id=4190 - GERMAN: Zend Studio on-line
/cal.php?id=4198 - Zend Framework Philippines
http://www.php.net/archive/2010.php#id2010-07-22-2 - PHP 5.3.3 Released!
/migration53 - http://php.net/migration53
/ChangeLog-5.php#5.3.3 - ChangeLog
http://www.php.net/archive/2010.php#id2010-07-22-1 - PHP 5.2.14 Released!
/ChangeLog-5.php#5.2.14 - http://www.php.net/ChangeLog-5.php#5.2.14
http://www.php.net/archive/2010.php#id2010-06-23-1 - TestFest 2010
http://wiki.php.net/qa/testfest-2010 - TestFest 2010
http://www.php.net/archive/2010.php#id2010-03-04-1 - PHP 5.3.2 Released!
http://php.net/migration53 - here
/ChangeLog-5.php#5.3.2 - ChangeLog
http://windows.php.net/download/ - windows.php.net/download/
http://www.php.net/archive/2010.php#id2010-02-25-1 - PHP 5.2.13 Released!
/releases/5_2_13.php - release announcement
/ChangeLog-5.php#5.2.13 - ChangeLog
/archive/index.php - News Archive
/source.php?url=/index.php - show source
/credits.php - credits
/stats/ - stats
/sitemap.php - sitemap
/contact.php - contact
/contact.php#ads - advertising
/mirrors.php - mirror sites
/copyright.php - Copyright © 2001-2010 The PHP Group
/mirror.php - This mirror
http://developer.yahoo.com/ - Yahoo! Inc.