PHPRO.ORG

Get Links With DOM


Do not use REGEX to parse HTML

Perhaps the biggest mistake people make when trying to get URLs or link text from a web page is trying to do it with regular expressions. The job can be done with regular expressions; however, there is a high overhead in having preg loop over the entire document many times. The correct, faster, and infinitely cooler way is to use DOM.

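By using DOM in the getLinks function, it is simple to create an array containing all the links on a web page as keys, and the link names as values. This array can then be looped over like any other array to create a list, or manipulated in any way desired. Because the URLs are used as array keys, a URL that appears more than once on the page is stored only once, with the last link name found.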

Note that error suppression is used when loading the HTML. This is to suppress warnings about invalid HTML, such as entities that are not defined in the DOCTYPE. Of course, in a production environment, the display of errors would be disabled and error reporting set to none.


<?php
function getLinks($link)
{
    /*** return array ***/
    $ret = array();

    /*** a new dom object ***/
    $dom = new DOMDocument;

    /*** remove silly white space (set before loading so it can take effect) ***/
    $dom->preserveWhiteSpace = false;

    /*** get the HTML (suppress errors) ***/
    @$dom->loadHTML(file_get_contents($link));

    /*** get the links from the HTML ***/
    $links = $dom->getElementsByTagName('a');

    /*** loop over the links ***/
    foreach ($links as $tag)
    {
        /*** the first child node holds the link name, an empty anchor has none ***/
        $first = $tag->childNodes->item(0);
        $ret[$tag->getAttribute('href')] = $first ? $first->nodeValue : '';
    }

    return $ret;
}
?>
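
As an alternative to the @ operator described in the note on error suppression, libxml can be told to collect parse errors internally so they can be inspected or logged rather than silently discarded. The following is only a minimal sketch of that idea: loadHtmlQuietly() is a hypothetical helper name, and the malformed sample markup is made up for illustration.

```php
<?php
/**
 * Sketch: capture libxml parse errors instead of hiding them with @.
 * loadHtmlQuietly() is a hypothetical helper, not part of the original code.
 */
function loadHtmlQuietly($html)
{
    /*** collect parse errors internally and remember the old setting ***/
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    /*** the collected errors are available for logging if wanted ***/
    $errors = libxml_get_errors();

    /*** empty the error buffer and restore the old setting ***/
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return array($dom, $errors);
}

/*** made-up sample markup with an unknown tag and unclosed elements ***/
list($dom, $errors) = loadHtmlQuietly('<p>Unclosed <b>markup <a href="/x">link</a></p><foo>');

echo count($errors) . " parse messages collected\n";
echo $dom->getElementsByTagName('a')->length . " link(s) found\n";
?>
```

The document is still parsed and the link is still found; the difference is that the parser messages end up in $errors instead of being thrown away.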

A similar approach is to use XPath, which achieves the same results. Either way, using the DOM is going to prove far more efficient than REGEX.
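
The XPath approach mentioned above might look something like the sketch below. It is an illustration rather than a drop-in replacement: getLinksXPath() is a hypothetical name, it takes an HTML string instead of a URL so that it can be run without network access, and it uses textContent, so the link name is the combined text of the whole anchor rather than only its first child node.

```php
<?php
/**
 * Sketch: collect links with DOMXPath instead of getElementsByTagName().
 * getLinksXPath() and the sample markup below are illustrative assumptions.
 */
function getLinksXPath($html)
{
    /*** return array ***/
    $ret = array();

    $dom = new DOMDocument;

    /*** suppress warnings about imperfect markup, as in getLinks() ***/
    @$dom->loadHTML($html);

    $xpath = new DOMXPath($dom);

    /*** select only the anchors that carry an href attribute ***/
    foreach ($xpath->query('//a[@href]') as $tag)
    {
        /*** textContent is the text of the anchor and all its children ***/
        $ret[$tag->getAttribute('href')] = trim($tag->textContent);
    }

    return $ret;
}

/*** made-up sample markup: two real links and one named anchor ***/
$html = '<html><body><p><a href="/docs.php">Documentation</a> '
      . '<a name="top">No href</a> '
      . '<a href="/support"><b>Get</b> Help</a></p></body></html>';

foreach (getLinksXPath($html) as $href => $text)
{
    echo $href . ' - ' . $text . "\n";
}
?>
```

With the sample markup, this prints "/docs.php - Documentation" and "/support - Get Help". The anchor without an href is skipped by the XPath expression itself, so no extra check is needed in the loop.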

Example Usage


<?php
/*** a link to search ***/
$link = "http://php.net";

/*** get the links ***/
$urls = getLinks($link);

/*** check for results ***/
if (sizeof($urls) > 0)
{
    foreach ($urls as $key => $value)
    {
        echo $key . ' - ' . $value . '<br />';
    }
}
else
{
    echo "No links found at $link";
}
?>

Demonstration

/ -
/downloads - Downloads
/docs.php - Documentation
/get-involved - Get Involved
/support - Help
/manual/en/getting-started.php - Getting Started
/manual/en/introduction.php - Introduction
/manual/en/tutorial.php - A simple tutorial
/manual/en/langref.php - Language Reference
/manual/en/language.basic-syntax.php - Basic syntax
/manual/en/language.types.php - Types
/manual/en/language.variables.php - Variables
/manual/en/language.constants.php - Constants
/manual/en/language.expressions.php - Expressions
/manual/en/language.operators.php - Operators
/manual/en/language.control-structures.php - Control Structures
/manual/en/language.functions.php - Functions
/manual/en/language.oop5.php - Classes and Objects
/manual/en/language.namespaces.php - Namespaces
/manual/en/language.exceptions.php - Exceptions
/manual/en/language.generators.php - Generators
/manual/en/language.references.php - References Explained
/manual/en/reserved.variables.php - Predefined Variables
/manual/en/reserved.exceptions.php - Predefined Exceptions
/manual/en/reserved.interfaces.php - Predefined Interfaces and Classes
/manual/en/context.php - Context options and parameters
/manual/en/wrappers.php - Supported Protocols and Wrappers
/manual/en/security.php - Security
/manual/en/security.intro.php - Introduction
/manual/en/security.general.php - General considerations
/manual/en/security.cgi-bin.php - Installed as CGI binary
/manual/en/security.apache.php - Installed as an Apache module
/manual/en/security.filesystem.php - Filesystem Security
/manual/en/security.database.php - Database Security
/manual/en/security.errors.php - Error Reporting
/manual/en/security.globals.php - Using Register Globals
/manual/en/security.variables.php - User Submitted Data
/manual/en/security.magicquotes.php - Magic Quotes
/manual/en/security.hiding.php - Hiding PHP
/manual/en/security.current.php - Keeping Current
/manual/en/features.php - Features
/manual/en/features.http-auth.php - HTTP authentication with PHP
/manual/en/features.cookies.php - Cookies
/manual/en/features.sessions.php - Sessions
/manual/en/features.xforms.php - Dealing with XForms
/manual/en/features.file-upload.php - Handling file uploads
/manual/en/features.remote-files.php - Using remote files
/manual/en/features.connection-handling.php - Connection handling
/manual/en/features.persistent-connections.php - Persistent Database Connections
/manual/en/features.safe-mode.php - Safe Mode
/manual/en/features.commandline.php - Command line usage
/manual/en/features.gc.php - Garbage Collection
/manual/en/features.dtrace.php - DTrace Dynamic Tracing
/manual/en/funcref.php - Function Reference
/manual/en/refs.basic.php.php - Affecting PHP's Behaviour
/manual/en/refs.utilspec.audio.php - Audio Formats Manipulation
/manual/en/refs.remote.auth.php - Authentication Services
/manual/en/refs.utilspec.cmdline.php - Command Line Specific Extensions
/manual/en/refs.compression.php - Compression and Archive Extensions
/manual/en/refs.creditcard.php - Credit Card Processing
/manual/en/refs.crypto.php - Cryptography Extensions
/manual/en/refs.database.php - Database Extensions
/manual/en/refs.calendar.php - Date and Time Related Extensions
/manual/en/refs.fileprocess.file.php - File System Related Extensions
/manual/en/refs.international.php - Human Language and Character Encoding Support
/manual/en/refs.utilspec.image.php - Image Processing and Generation
/manual/en/refs.remote.mail.php - Mail Related Extensions
/manual/en/refs.math.php - Mathematical Extensions
/manual/en/refs.utilspec.nontext.php - Non-Text MIME Output
/manual/en/refs.fileprocess.process.php - Process Control Extensions
/manual/en/refs.basic.other.php - Other Basic Extensions
/manual/en/refs.remote.other.php - Other Services
/manual/en/refs.search.php - Search Engine Extensions
/manual/en/refs.utilspec.server.php - Server Specific Extensions
/manual/en/refs.basic.session.php - Session Extensions
/manual/en/refs.basic.text.php - Text Processing
/manual/en/refs.basic.vartype.php - Variable and Type Related Extensions
/manual/en/refs.webservice.php - Web Services
/manual/en/refs.utilspec.windows.php - Windows Only Extensions
/manual/en/refs.xml.php - XML Manipulation
/downloads.php#v5.6.3 - 5.6.3
/ChangeLog-5.php#5.6.3 - Release Notes
/downloads.php#v5.5.19 - 5.5.19
/ChangeLog-5.php#5.5.19 - Release Notes
/downloads.php#v5.4.35 - 5.4.35
/ChangeLog-5.php#5.4.35 - Release Notes
http://php.net/archive/2014.php#id2014-11-13-3 - PHP 5.4.35 Released
http://www.php.net/downloads.php - downloads page
http://windows.php.net/download/ - windows.php.net/download/
http://www.php.net/ChangeLog-5.php#5.4.35 - ChangeLog
http://php.net/archive/2014.php#id2014-11-13-2 - PHP 5.6.3 is available
http://www.php.net/ChangeLog-5.php#5.6.3 - ChangeLog
http://php.net/archive/2014.php#id2014-11-13-1 - PHP 5.5.19 is available
http://www.php.net/ChangeLog-5.php#5.5.19 - ChangeLog
http://php.net/archive/2014.php#id2014-10-16-3 - PHP 5.6.2 is available
http://www.php.net/ChangeLog-5.php#5.6.2 - ChangeLog
http://php.net/archive/2014.php#id2014-10-16-2 - PHP 5.4.34 Released
http://www.php.net/ChangeLog-5.php#5.4.34 - ChangeLog
http://php.net/archive/2014.php#id2014-10-16-1 - PHP 5.5.18 is available
http://www.php.net/ChangeLog-5.php#5.5.18 - ChangeLog
http://php.net/archive/2013.php#id2013-11-20-1 - Our modern web theme goes live!
http://bugs.php.net - bugs.php.net
http://php.net/archive/2013.php#id2013-10-24-2 - A further update on php.net
https://twitter.com/official_php -
http://php.net/archive/2013.php#id2013-10-24-1 - A quick update on the status of php.net
/archive/ - Older News Entries
/migration56 - Upgrading to PHP 5.6
/conferences - Upcoming conferences
http://php.net/conferences/index.php#id2014-10-27-1 - PHP Unconference Europe 2015
http://php.net/conferences/index.php#id2014-10-18-1 - SunshinePHP Developer Conference 2015
http://php.net/conferences/index.php#id2014-09-26-1 - PHP Australia Conference 2015
/cal.php - User Group Events
/thanks.php - Special Thanks
/copyright.php - Copyright © 2001-2014 The PHP Group
/my.php - My PHP.net
/contact.php - Contact
/sites.php - Other PHP.net sites
/mirrors.php - Mirror sites
/privacy.php - Privacy policy
javascript:; -