How To Parse A Remote URL WebPage To Extract Desired Content Using PHP?

This Article Went Live On July 6th, 2020
Last Updated on July 6th, 2020

The DOMDocument class in PHP is a very handy one that can be used for a number of tasks, such as parsing XML, parsing HTML, and creating XML. It is documented in the PHP manual. In this tutorial, we are going to see how to use this class to parse HTML content. The need to parse HTML arises when you are, for example, writing scrapers or similar data extraction scripts.

What Is DOMDocument, And When Is It Used?

The DOMDocument is a class built into PHP that helps developers navigate an HTML document tree and provides methods to help interact with the document. If you ever need to parse HTML content or manipulate HTML content using PHP, DOMDocument can help you quickly and easily access nodes.
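
As a minimal sketch of what accessing and changing nodes looks like, the following parses a small hard-coded HTML string rather than a fetched page. The markup and the variable names ($html, $dom, $paragraph) are illustrative assumptions and are not part of the tutorial's script.

```php
<?php
// Hypothetical inline HTML, used here instead of a fetched page.
$html = '<html><head><title>Demo Page</title></head>'
      . '<body><p id="intro">Hello <b>world</b></p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Look up nodes by tag name and read their text content.
$title = $dom->getElementsByTagName('title')->item(0)->nodeValue;
$paragraph = $dom->getElementsByTagName('p')->item(0);

echo $title, PHP_EOL;                          // Demo Page
echo $paragraph->getAttribute('id'), PHP_EOL;  // intro
echo $paragraph->textContent, PHP_EOL;         // Hello world

// Manipulate the node, then serialize just that node back to HTML.
$paragraph->setAttribute('class', 'lead');
echo $dom->saveHTML($paragraph), PHP_EOL;
```

Note that textContent returns the text of the element and all of its descendants, which is why the bold word is included.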

Are you looking for a PHP script that parses a remote web page and extracts the content you need? This tutorial provides a code snippet that does exactly that.

There are many code snippets available online and on other blogs and websites, but not all of them are well optimized. The snippet below is ready to use: copy it, paste it where you need it, and adapt it to your own blog or website.

How To Parse A Remote URL WebPage To Extract Desired Content Using PHP?

The snippet below fetches the page with cURL and stores it in $webPageContent, a plain string variable. It then instantiates the DOMDocument class and loads that string into it.

<?php
$webPageURL = "https://www.google.com";

/****************************************************************************/
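// Grab The WebPage Content
/****************************************************************************/
$ch = curl_init();
$timeout = 5; // Connection timeout in seconds
$userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
curl_setopt($ch, CURLOPT_URL, $webPageURL);
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
// Keep the response headers out of the output; otherwise they are
// prepended to the HTML and end up inside the parsed document.
curl_setopt($ch, CURLOPT_HEADER, false);
$webPageContent = curl_exec($ch);
curl_close($ch);

// curl_exec() returns false on failure, so stop before parsing.
if ($webPageContent === false) {
	exit('Could not fetch ' . $webPageURL);
}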

/****************************************************************************/
// Parse The HTML Content
/****************************************************************************/
//Instantiate The DOMDocument Class
$htmlDom = new DOMDocument;
$htmlDom->validateOnParse = true;

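//Parse the HTML of the page using DOMDocument::loadHTML in UTF-8 encoding.
//The @ suppresses the warnings libxml raises on malformed real-world HTML.
//Note: the 'HTML-ENTITIES' conversion target is deprecated as of PHP 8.2.
@$htmlDom->loadHTML(mb_convert_encoding($webPageContent, 'HTML-ENTITIES', 'UTF-8'));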
?>
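
The snippet above silences parser warnings with the @ operator. As an alternative sketch, libxml can be told to collect those warnings so they can be inspected or logged. The malformed markup and the variable names below ($brokenHtml, $parseErrors) are assumptions made for illustration, and which messages are reported depends on the installed libxml version.

```php
<?php
// Hypothetical malformed markup standing in for a fetched page.
$brokenHtml = '<html><body><p>Unclosed paragraph<div>Stray</span></div></body></html>';

// Ask libxml to queue parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($brokenHtml);

// Collect whatever the parser complained about, then empty the queue
// and restore the previous error-handling mode.
$parseErrors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($parseErrors as $error) {
	// Each entry is a LibXMLError object with line and message properties.
	printf("Line %d: %s\n", $error->line, trim($error->message));
}
```

The document is still parsed and usable after this; the only difference from the @ approach is that the warnings remain available for debugging.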

Extract Links, HTML Tags, Images, And More From `DOMDocument`:

Here are a few short, self-explanatory examples: extracting all links with their attributes, extracting HTML tags such as the title and headings, and extracting images.

/****************************************************************************/
// Extract All Links On The Web Page
/****************************************************************************/
//Extract the links from the HTML.
$links = $htmlDom->getElementsByTagName('a');

//Array that will contain our extracted links.
$extractedLinks = array();

//Loop through the DOMNodeList.
//We can do this because the DOMNodeList object is traversable.
foreach($links as $link){
	//Get the link text.
	$linkText = $link->nodeValue;

	//Get the link in the href attribute.
	$linkHref = $link->getAttribute('href');

	//Get the value of the rel attribute.
	$linkRel = $link->getAttribute('rel');

	//If the link is empty, skip it and don't
	//add it to our $extractedLinks array
	if(strlen(trim($linkHref)) == 0){
		continue;
	}

	//Skip it if it is a same-page anchor (fragment) link.
	if($linkHref[0] == '#'){
		continue;
	}

	//Add the link to our $extractedLinks array.
	$extractedLinks[] = array(
		'text' => $linkText,
		'href' => $linkHref,
		'rel' => $linkRel
	);
	
}

/****************************************************************************/
// Extract <title> Tag
/****************************************************************************/
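//Default to an empty string so the variable is defined even without a <title>.
$extractedTitleTagText = '';
$titleTags = $htmlDom->getElementsByTagName("title");
if($titleTags->length > 0){
	//Use the first <title>; any later ones are not the page title.
	$extractedTitleTagText = $titleTags->item(0)->nodeValue;
}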

/****************************************************************************/
// Extract All H1 Heading Tags (Repeat With 'h2' To 'h6' For Other Levels)
/****************************************************************************/
$tags = $htmlDom->getElementsByTagName('h1');
$extractedH1 = array();
foreach($tags as $tag){
	$tagsText = $tag->nodeValue;
	if(strlen(trim($tagsText)) == 0){
		continue;
	}
	$extractedH1[] = array(
		'tag' => "H1",
		'text' => $tagsText
	);
}

/****************************************************************************/
// Extract All Image Alt, Title, And Src Attributes
/****************************************************************************/
$tags = $htmlDom->getElementsByTagName('img');
$extractedIMG = array();
//Counters for images whose alt / title attribute is missing or empty.
$extractedIMGaltCOUNT = 0;
$extractedIMGtitleCOUNT = 0;
foreach($tags as $tag){
	$tagsTextALT = $tag->getAttribute('alt');
	$tagsTextTITLE = $tag->getAttribute('title');
	$tagsTextSRC = $tag->getAttribute('src');
	if (strlen(trim($tagsTextALT)) == 0) {
		$extractedIMGaltCOUNT++;
	}
	if (strlen(trim($tagsTextTITLE)) == 0) {
		$extractedIMGtitleCOUNT++;
	}
	$extractedIMG[] = array(
		'alt' => $tagsTextALT,
		'title' => $tagsTextTITLE,
		'src' => $tagsTextSRC
	);
}
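
The heading example above handles one tag name at a time. As a hedged sketch, all six heading levels can be collected in a single pass with DOMXPath, skipping empty headings in the same way. The inline HTML and the $extractedHeadings name are assumptions used for illustration; in the script above you would pass $htmlDom to DOMXPath instead.

```php
<?php
// Hypothetical inline HTML standing in for the parsed page.
$html = '<html><body><h1>Main</h1><p>Text</p><h2>Sub A</h2>'
      . '<h3> </h3><h2>Sub B</h2></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// One XPath query selects every heading level, in document order.
$xpath = new DOMXPath($dom);
$headingNodes = $xpath->query('//h1 | //h2 | //h3 | //h4 | //h5 | //h6');

$extractedHeadings = array();
foreach ($headingNodes as $node) {
	$text = trim($node->nodeValue);
	if ($text === '') {
		continue; // Skip headings that contain no visible text.
	}
	$extractedHeadings[] = array(
		'tag'  => strtoupper($node->nodeName),
		'text' => $text
	);
}

print_r($extractedHeadings);
```

Because the results come back in document order, the array also preserves the outline of the page, which a separate loop per heading level would not.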

Final Words:

Make sure the code is placed correctly in your document. The rest is in your hands if you want to customize it or play with it. That’s all we have. If you have any problem with this code, feel free to contact us with a full explanation of the problem, and we will reply as time allows. Don’t forget to share this with your friends so they can benefit from it too.
