Get Links With DOM
Do not use REGEX to parse HTML
Perhaps the biggest mistake people make when trying to get URLs or link text from a web page is doing it with regular expressions. The job can be done with regular expressions, but there is a high overhead in having preg loop over the entire document many times. The correct way, the faster way, and the infinitely cooler way is to use the DOM.
By using the DOM in the getLinks function, it is simple to create an array containing all the links on a web page as keys and the link text as values. This array can then be looped over like any other array to build a list, or manipulated in any way desired.
Note that error suppression is used when loading the HTML. This suppresses warnings about malformed markup and about HTML entities that are not defined in the DOCTYPE. In a production environment, of course, the display of errors would be disabled in any case.
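<?php
function getLinks($link)
{
    /*** return array ***/
    $ret = array();
    /*** a new dom object ***/
    $dom = new DOMDocument;
    /*** remove silly white space (set before loading) ***/
    $dom->preserveWhiteSpace = false;
    /*** fetch the page, return no links if it cannot be read ***/
    $html = @file_get_contents($link);
    if ($html === false || $html === '')
    {
        return $ret;
    }
    /*** load the HTML (suppress errors) ***/
    @$dom->loadHTML($html);
    /*** get the links from the HTML ***/
    $links = $dom->getElementsByTagName('a');
    /*** loop over the links ***/
    foreach ($links as $tag)
    {
        /*** skip anchors that have no href attribute ***/
        if (!$tag->hasAttribute('href'))
        {
            continue;
        }
        /*** textContent is safe even when the anchor has no child nodes ***/
        $ret[$tag->getAttribute('href')] = trim($tag->textContent);
    }
    return $ret;
}
?>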
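The error suppression described above can also be done without the `@` operator. The following is a minimal sketch, assuming the libxml error functions that ship with PHP are available; `loadHtmlQuietly` is a hypothetical helper name and is not part of the original function. It collects the parser's complaints so they can be inspected or logged rather than discarded.

```php
<?php
/*** hypothetical helper, shown for illustration only ***/
function loadHtmlQuietly($html)
{
    /*** collect libxml errors internally instead of emitting warnings ***/
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    /*** keep the errors for inspection, then empty the buffer ***/
    $errors = libxml_get_errors();
    libxml_clear_errors();
    /*** restore whatever setting was in place before ***/
    libxml_use_internal_errors($previous);
    return array($dom, $errors);
}

/*** the unclosed <b> and the stray </i> are deliberate ***/
list($dom, $errors) = loadHtmlQuietly('<p><a href="/docs.php">manual</a><b></i></p>');
echo $dom->getElementsByTagName('a')->length . " link(s)\n";
foreach ($errors as $error)
{
    echo trim($error->message) . ' on line ' . $error->line . "\n";
}
?>
```

Whether this is preferable to `@` depends on the situation: it costs a few more lines, but nothing is silently thrown away.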
A similar approach is to use XPath, which achieves the same results. Either way, using the DOM is going to prove far more efficient than REGEX.
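As a rough sketch of the XPath approach, the function below selects only the anchors that carry an href attribute. The name `getLinksXPath` is hypothetical, and the function takes an HTML string rather than a URL so that the example runs without network access.

```php
<?php
/*** hypothetical XPath variant, taking an HTML string ***/
function getLinksXPath($html)
{
    $ret = array();
    $dom = new DOMDocument;
    /*** suppress warnings about invalid HTML ***/
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    /*** select only anchors that carry an href attribute ***/
    foreach ($xpath->query('//a[@href]') as $tag)
    {
        $ret[$tag->getAttribute('href')] = trim($tag->textContent);
    }
    return $ret;
}

/*** the middle anchor has no href and should be ignored ***/
$html = '<p><a href="/docs.php">manual</a> <a name="top">top</a> <a href="/FAQ.php">faq</a></p>';
foreach (getLinksXPath($html) as $href => $text)
{
    echo $href . ' - ' . $text . "\n";
}
?>
```

The XPath expression does the filtering that would otherwise need a check inside the loop, and it can be narrowed further, for example to the links inside one particular element.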
Example Usage
<?php
/*** a link to search ***/
$link = "http://php.net";
/*** get the links ***/
$urls = getLinks($link);
/*** check for results ***/
if (count($urls) > 0)
{
    foreach ($urls as $key => $value)
    {
        /*** escape the values, they come from a remote page ***/
        echo htmlspecialchars($key) . ' - ' . htmlspecialchars($value) . '<br />';
    }
}
else
{
    echo "No links found at $link";
}
?>
Demonstration
/ -
/downloads.php - downloads page
/docs.php - manual
/FAQ.php - faq
/support.php - getting help
/mailing-lists.php - mailing lists
/license - licenses
https://wiki.php.net/ - wiki
https://bugs.php.net/ - PHP bug tracker
/sites.php - php.net sites
/links.php - links section
/conferences/ - conferences
/my.php - my php.net
/tut.php - introductory tutorial
/usage.php - Netcraft Survey
/thanks.php - Thanks To
http://www.easydns.com/?V=698570efeb62a6e2 - easyDNS
http://www.directi.com/ - Directi
http://promote.pair.com/direct.pl?php.net - pair Networks
http://www.servercentral.net/ - Server Central
http://www.hostedsolutions.com/ - Hosted Solutions
http://www.spry.com/ - Spry VPS Hosting
http://www.osuosl.org - OSU Open Source Lab
http://www.yahoo.com/ - Yahoo! Inc.
http://www.nexcess.net/ - NEXCESS.NET
http://www.rackspace.com/ - Rackspace
http://www.eukhost.com/ - EUKhost
http://www.sohosted.nl/webhosting/ - SoHosted Webhosting
http://www.redpill-linpro.com - Redpill Linpro
http://www.facebook.com - Facebook
http://krystal.co.uk - Krystal.co.uk
http://servergrove.com/ - ServerGrove
http://www.bauer-kirch.de/ - Bauer + Kirch GmbH
http://www.apache.org/ - Apache
http://www.mysql.com/ - MySQL
http://www.postgresql.org/ - PostgreSQL
http://www.zend.com/ - Zend Technologies
http://www.linuxfund.org/ - LinuxFund.org
http://ostg.com/ - OSTG
/feed.atom - Atom
/downloads.php#v5 - Current PHP 5.3 Stable:
http://qa.php.net/rc.php - Release Candidates
http://qa.php.net/ - 5.4.0RC7 (2 February 2012)
/submit-event.php - [add]
/cal.php?id=5104 - Web development conference
/cal.php?id=5227 - PHP UK Conference 2012
/cal.php?id=5062 - ConFoo 2012
/cal.php?id=1923 - PHP meeting online in China
/cal.php?id=2540 - meeting de LAMPistas en La Paz
/cal.php?id=4720 - PHP Online User Group
/cal.php?id=1745 - SW Florida Linux Users Group
/cal.php?id=1860 - PDXPHP monthly meeting
/cal.php?id=2301 - Jacksonville User Group
/cal.php?id=2814 - Berlin PHP Usergroup Meeting
/cal.php?id=3294 - PHPNW: PHP North West user group
/cal.php?id=1395 - Wash DC PHP Developers Group
/cal.php?id=3684 - PHP User Group Stuttgart
/cal.php?id=4512 - South FL PUG- Miami
/cal.php?id=4751 - PHP South West User Group
/cal.php?id=5017 - PHPSW, UK
/cal.php?id=5212 - DC PHP Developer's Community
/cal.php?id=1848 - Meeting usergroup Dortmund
/cal.php?id=1946 - PHP Usergroup Frankfurt/Main
/cal.php?id=4636 - Metro Jersey PHP Usergroup
/cal.php?id=1732 - PHP User Group Nanaimo, BC/CA
/cal.php?id=2580 - PEA meeting from phpchina
/cal.php?id=3722 - Nagpur PHP Meetup
/cal.php?id=4258 - Nezahualcoyotl PHP Ramptors
/cal.php?id=3760 - Los Angeles PHP Developers Group
/cal.php?id=4308 - Queen City (Charlotte) PHP
/cal.php?id=1385 - Hamburg
/cal.php?id=1523 - Dallas PHP/MySQL Users Group
/cal.php?id=1670 - Dallas PHP Users Group (DPUG)
/cal.php?id=1652 - Austin PHP Meetup
/cal.php?id=1665 - OKC PHP Meetup
/cal.php?id=1847 - Nashville PHP User Group
/cal.php?id=3643 - Oklahoma City PHP User Group
/cal.php?id=3980 - Buffalo PHP Meetup
/cal.php?id=4222 - South Florida PHP Users Group
/cal.php?id=4511 - South Florida PUG - Lauderdale
/cal.php?id=1545 - Miami PHP User Group
/cal.php?id=1546 - Broward Php Usergroup
/cal.php?id=2208 - Chicago PHP User Group Brunch
/cal.php?id=3925 - Baltimore PHP User Group
/cal.php?id=1704 - TriPUG
/cal.php?id=1719 - OINK-PUG (Cincinnati, Ohio)
/cal.php?id=1820 - Utah PHP Users Group Meeting
/cal.php?id=4507 - Denver - FRPUG
/cal.php?id=5092 - B/CS PHP User Group
/cal.php?id=1131 - Kansas City
/cal.php?id=1346 - Miami Linux Users Group
/cal.php?id=1671 - Twin Cities PHP
/cal.php?id=2449 - Los Angeles LAMPsig
/cal.php?id=2246 - PHP Brisbane Meetup Group
/cal.php?id=3708 - Nashville Enterprise LAMP UG
/cal.php?id=3761 - Chattanooga PHP Developers
/cal.php?id=4725 - PHP North-East User Group
/cal.php?id=5222 - NWO-PUG User Group Meeting
/cal.php?id=5135 - Edinburgh PHP Users Group
/cal.php?id=1316 - Arabic PHP Group Meeting
/cal.php?id=1708 - Malaysia PHP User Group Meet Up
/cal.php?id=2499 - Sandy PHP Group
/cal.php?id=4256 - Memphis PHP
/cal.php?id=5052 - PHP Usergroup D/DU/KR
/cal.php?id=2662 - Miami Linux Meetup
/cal.php?id=3422 - PHP RIO Meetup
/cal.php?id=4019 - PHP User Group Hong Kong
/cal.php?id=1099 - Long Island PHP Users Group
/cal.php?id=4648 - Tampa Bay Florida PHP
/cal.php?id=4767 - Winnipeg PHP
/cal.php?id=409 - New York
/cal.php?id=384 - AzPHP
/cal.php?id=2527 - Malaysia PHP Meetup
/cal.php?id=2600 - PHP Usergroup Karlsruhe
/cal.php?id=2660 - PHPUG Wuerzburg
/cal.php?id=3075 - DCPHP Beverage Subgroup
/cal.php?id=3653 - Brisbane PHP User Group
/cal.php?id=4626 - PHP User Group Roma
/cal.php?id=2500 - Irish PHP Users Group meeting
/cal.php?id=4922 - Guelph PHP Users Group
/cal.php?id=2702 - PHP & AJAX -Construindo Websites
/cal.php?id=3560 - Core and Advanced PHP Workshop
/cal.php?id=2023 - Ahmedabad PHP Group Training
/cal.php?id=4230 - php training
/cal.php?id=338 - MySQL Spain
/cal.php?id=456 - Curso PHP Madrid
/cal.php?id=641 - PHP E-Learning/Germany
/cal.php?id=998 - Curso on-line ActionScript / PHP
/cal.php?id=1198 - PHP & MySQL Training in Kassel
/cal.php?id=1360 - PHP & MySQL com Dreamweaver MX
/cal.php?id=1981 - Curso on-line de PHP
/cal.php?id=2051 - PHP & MYSQL-Construindo WebSites
/cal.php?id=3053 - PHP Training Heilbronn
/cal.php?id=4929 - Schulung PHP, Scripting language
/cal.php?id=5072 - ZEND: PHPI: Foundations On-line
/cal.php?id=5073 - ZEND: PHPII: Higher Structures
/cal.php?id=5075 - ZEND: PHP for OO/Procedural Prog
/cal.php?id=5078 - ZEND: Framework: Advanced
/cal.php?id=5081 - ZEND: PHP I Foundations for IBMi
/cal.php?id=841 - Curso on-line de PHP-MySQL
/cal.php?id=1490 - PHP Class at CalTek
/cal.php?id=5138 - Zend Framework Philippines
/cal.php?id=3385 - UK Object Orientation Workshop
/cal.php?id=3386 - UK Smarty Templating Workshop
/cal.php?id=5173 - Développement orienté objet/ph
/cal.php?id=5225 - Unit Testing Zend Framework Apps
/cal.php?id=1466 - PHP para Expertos Curso on-line
/cal.php?id=1583 - Curso PHP y MySQL
/cal.php?id=5076 - ZEND: Studio On-line
/cal.php?id=5077 - ZEND: Framework: Fundamentals
/cal.php?id=5080 - Zend: Server On-line
/cal.php?id=5217 - PHP and XML-Seminar
/cal.php?id=5254 - Formation PHP Niveau 1 Bordeaux
/cal.php?id=5204 - PHP105 - Le Framework Zend
/cal.php?id=2408 - Chennai PHP Training
/cal.php?id=1200 - PHP & MySQL Training / Gießen
/cal.php?id=2589 - PHP Intro Course South Africa
/cal.php?id=5205 - PHP109 - ORM Doctrine
/cal.php?id=5253 - Formation PHP Niveau 2 Bordeaux
/cal.php?id=1389 - Cursos de PHP en Bilbao
/cal.php?id=5074 - ZEND: Test Prep: PHP 5.3 Cert
/cal.php?id=1137 - PHP Brasil - Training
/cal.php?id=5226 - Git for Subversion Users
/cal.php?id=4220 - PHP Training
/cal.php?id=2421 - Basic PHP Course
/cal.php?id=4935 - Schulung PHP dynamic websites
/cal.php?id=4939 - Schulung Advanced PHP 5
/cal.php?id=231 - UK PHP Training
/cal.php?id=5079 - ZEND: PHP Security On-line
http://www.php.net/conferences/index.php#id2012-01-20-1 - ConFoo 2012
http://www.php.net/conferences/index.php#id2011-12-23-1 - Dutch PHP Conference 2012
http://www.php.net/conferences/index.php#id2011-11-17-1 - Italian phpDay 2012
http://www.php.net/archive/2012.php#id2012-02-02-1 - PHP 5.3.10 Released!
http://windows.php.net/download/ - windows.php.net/download/
http://www.php.net/archive/2012.php#id2012-01-24-1 - PHP 5.4.0 RC6 released
http://qa.php.net - release candidate
http://windows.php.net/qa/ - Windows QA site
mailto:php-qa@lists.php.net - QA mailing list
https://svn.php.net/repository/php/php-src/tags/php_5_4_0RC6/NEWS - NEWS
http://www.php.net/archive/2012.php#id2012-01-11-1 - PHP 5.3.9 Released!
/ChangeLog-5.php#5.3.9 - ChangeLog
http://www.php.net/archive/2012.php#id2012-01-07-2 - PHP 5.4.0 RC5 released
https://svn.php.net/repository/php/php-src/tags/php_5_4_0RC5/NEWS - NEWS
http://www.php.net/archive/2011.php#id2011-12-25-1 - PHP 5.4.0 RC4 released
https://svn.php.net/repository/php/php-src/tags/php_5_4_0RC4/NEWS - NEWS
/archive/index.php - News Archive
/source.php?url=/index.php - show source
/credits.php - credits
/stats/ - stats
/sitemap.php - sitemap
/contact.php - contact
/contact.php#ads - advertising
/mirrors.php - mirror sites
/copyright.php - Copyright © 2001-2012 The PHP Group
/mirror.php - This mirror
http://developer.yahoo.com/ - Yahoo! Inc.
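Because the result is an ordinary array, it can be manipulated like any other. As one illustration, the sketch below keeps only the links with an http or https scheme and sorts them by link text. The sample array is hand-written in the shape getLinks returns, using a few entries from the listing above, so the example does not depend on fetching a page.

```php
<?php
/*** sample data in the shape getLinks() returns: href => link text ***/
$urls = array(
    '/docs.php'             => 'manual',
    'https://wiki.php.net/' => 'wiki',
    '/FAQ.php'              => 'faq',
    'https://bugs.php.net/' => 'PHP bug tracker',
);

/*** keep only the links whose href has an http or https scheme ***/
$external = array();
foreach ($urls as $href => $text)
{
    $scheme = parse_url($href, PHP_URL_SCHEME);
    if ($scheme === 'http' || $scheme === 'https')
    {
        $external[$href] = $text;
    }
}

/*** sort by link text, keeping the href keys ***/
asort($external);
foreach ($external as $href => $text)
{
    echo $href . ' - ' . $text . "\n";
}
?>
```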