How to create sitemap.xml for your website using PHP?

We all know that sitemap.xml is the file search engines love, because it helps them understand your website's structure. It makes it easier for a search engine to crawl the pages on your website. This post will help you learn how to create sitemap.xml using PHP.

You can certainly create sitemap.xml manually, but if you have ever wondered how the file can be generated with PHP, this post is for you.

How does the code to create sitemap.xml using PHP work?

We will build the simplest form of website crawler. The code will read the homepage and create the sitemap.xml file from the page links found in its navigation.

To keep the code easy to read and understand, I have limited the logic to the home page: the sitemap is prepared from the page links found in its navigation. You can easily extend the code to do more.

What are we doing inside?

We have filters in place to exclude links to other domains, non-page URLs, and repeated page URLs while preparing the sitemap.xml file.

The code uses the DOMDocument class to parse the page and extract its links. Additionally, we will send a Content-Disposition header so the browser prompts the user to download the file once it has been created.

How to create sitemap.xml using PHP code:

The code is simple and easy to understand. I will explain each part in detail.

Logical flow of the code:

Setting headers.
Getting content from file/URL.
Converting content into DOM document.
Finding link nodes from DOM object.
Iterating link nodes, applying filters and extracting links.
Creating an XML file.
Sample HTML file to test the code.

Setting headers:

header('Content-type: text/xml');
header('Content-Disposition: attachment;filename=myfile.xml');

The Content-type header helps the browser understand the output returned by the requested URL. The Content-Disposition header lets us prompt the user to download the XML file our code creates.
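As a small sketch of the same idea, the two header lines can be built by a helper that quotes the file name. The helper name downloadHeaders and the file name sitemap.xml are assumptions for illustration, not part of the original code.

```php
<?php
// Hypothetical helper: builds the two header lines for a downloadable XML file.
function downloadHeaders(string $filename): array
{
    // Keep only the base name and drop double quotes, so a path or a quote
    // in the name cannot break the quoted filename value.
    $safeName = str_replace('"', '', basename($filename));

    return [
        'Content-Type: text/xml; charset=UTF-8',
        'Content-Disposition: attachment; filename="' . $safeName . '"',
    ];
}

$headers = downloadHeaders('sitemap.xml');

// Headers can only be sent before any output has started.
if (!headers_sent()) {
    foreach ($headers as $line) {
        header($line);
    }
}
```

Quoting the file name is a defensive choice; for a simple name such as myfile.xml the unquoted form used above behaves the same way.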

Getting content from file/URL:

$html = file_get_contents('other.html');

We use the file_get_contents function to load the content of a file or URI. The function reads the content and returns it as a string. On failure, it returns FALSE.

The function is similar to file(), except that it returns the content as a single string rather than an array of lines, and you can pass an offset and a maximum length (maxlen) to read only a specific part of the content. It uses memory mapping techniques for better performance if your OS supports them.
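Because the function returns FALSE on failure, it is worth checking the result before parsing it. The following is a minimal sketch of that check; the helper name readSource and the temporary file are assumptions made so the example runs on its own.

```php
<?php
// Hypothetical helper: returns the content as a string, or null when reading fails.
function readSource(string $pathOrUrl): ?string
{
    // @ hides the warning PHP raises for a missing file; we check the result instead.
    $content = @file_get_contents($pathOrUrl);

    // Use a strict comparison: an empty file returns "", which is not a failure.
    return $content === false ? null : $content;
}

// Demo with a temporary file so the sketch runs without other.html.
$tmpFile = tempnam(sys_get_temp_dir(), 'sitemap_demo_');
file_put_contents($tmpFile, '<a href="http://demo.com/about">About</a>');

$html = readSource($tmpFile);
$missing = readSource($tmpFile . '.does-not-exist');

// The optional offset and length arguments read only part of the content.
$firstNine = file_get_contents($tmpFile, false, null, 0, 9);

unlink($tmpFile);

var_dump($html !== null, $missing, $firstNine);
```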

Converting content into DOM document:

$dom = new DOMDocument;
@$dom->loadHTML($html);

The Document Object Model (DOM) is a programming API for HTML and XML documents. It represents the logical structure of a document.

Programmers use the DOM to easily access, create, change, or delete the content of a document. If you are familiar with jQuery, it uses the DOM to traverse elements.

We will create an instance of the DOMDocument class to create the DOM representation of the content.

We have the string representation of the page in the $html variable that we created in the earlier step. We use the loadHTML function of the DOMDocument class to load HTML from that string.

You may have noticed the @ in front of $dom. In PHP it is the error control operator, which suppresses error messages; here it hides the warnings loadHTML raises for imperfect markup. Check the error control operators in PHP for more info.
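If you would rather inspect the parser messages than hide them with @, one alternative is libxml_use_internal_errors. This is a sketch of that alternative, not the approach used in this tutorial's code; the sample markup and the $messages array are assumptions for illustration.

```php
<?php
// Sample markup with a stray </p>, which the HTML parser typically reports as an error.
$html = '<html><body><div><a href="http://demo.com/">Home</a></p></div></body></html>';

// Ask libxml to collect parser messages instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Copy the collected messages, then clear the buffer and restore the old setting.
$messages = [];
foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError object with message and line properties.
    $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
}
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo count($messages) . " parser message(s) collected\n";
```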

Finding link nodes from DOM object:

$links = $dom->getElementsByTagName('a');

$dom is the instance of the DOMDocument class. It offers various methods for extracting data based on HTML tags.

As we need the page links from the home page, we are interested in the <a href="">...</a> tag. We will use the getElementsByTagName function of DOMDocument. It searches the DOM for all elements with the given tag name (it can be any tag, such as 'a', 'p', or 'div').

The $links variable will hold all the link elements on the page. We will filter the data and extract the correct page links in the next section.
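To see what $links contains, here is a short sketch that lists the text and href of every link in a small inline page. The sample markup and the $found array are assumptions for illustration.

```php
<?php
$html = '<ul>
  <li><a href="http://demo.com/">Home</a></li>
  <li><a href="http://demo.com/about">About</a></li>
  <li><a href="#">Top</a></li>
</ul>';

$dom = new DOMDocument();
@$dom->loadHTML($html);

// getElementsByTagName returns a DOMNodeList, which can be counted and iterated.
$links = $dom->getElementsByTagName('a');

$found = [];
foreach ($links as $link) {
    // Each item is a DOMElement; textContent holds the visible link text.
    $found[] = $link->textContent . ' => ' . $link->getAttribute('href');
}

print_r($found);
```

At this stage nothing has been filtered, so the "#" link is still in the list.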

Iterating link nodes, applying filters, and extracting links:

In the previous step we created a $links variable and stored all the link nodes from the document into it. Let’s extract the needed page links from it.

We will iterate through each link node and check it against four conditions:

The link should not be null
The link should not point to #
The link should point to a URL of our specified domain
The link should be unique to reduce redundancy

To achieve this, we first need to store our domain name so that each link can be checked against it. We will use the following code for it.

$domain = "demo.com";

We will store each unique link we find in an array called $urlIndexed, like below.

$urlIndexed = array();

It helps us keep redundant URLs out of our final link collection.

We are preparing the sitemap.xml file, so we will store the XML markup in the $xmlData variable. Let's declare $xmlData and initialize it with the root node.

$xmlData = '<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';

Now, let’s iterate over the links collection.

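foreach ($links as $link) {

    // Read the target of the link (the href attribute).
    $linkPointTo = $link->getAttribute('href');

    // Keep the link only if it is not empty, is not "#", contains our domain, and is new.
    // Note the strict !== FALSE: strpos returns 0 when the match is at the start of the string.
    if (!empty($linkPointTo) && $linkPointTo != "#" && strpos($linkPointTo, $domain) !== FALSE && !in_array($linkPointTo, $urlIndexed)) {

        // Escape characters such as & so the URL is valid inside XML.
        $loc = htmlspecialchars($linkPointTo, ENT_QUOTES | ENT_XML1, 'UTF-8');

        $xmlData .= "<url><loc>{$loc}</loc><changefreq>weekly</changefreq><priority>0.8</priority></url>";

        $urlIndexed[] = $linkPointTo;

    }

}

$xmlData .= '</urlset>';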

We iterate over each link node stored in $links and use the getAttribute method of the link element (a DOMElement) to read its href value. Next, we check the link against the four conditions mentioned earlier.

Once a link passes the checks, we append it to the XML data variable $xmlData and add it to our URL collection array $urlIndexed.
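A strpos check matches the domain anywhere in the link, so a link such as mailto:info@demo.com or http://other.com/?ref=demo.com would also pass. If that matters for your site, one stricter option is to compare the host part returned by parse_url. The sketch below is that alternative, not the tutorial's own filter; the helper name isOwnPageLink and the candidate list are assumptions. Like the tutorial's check, it skips relative links.

```php
<?php
// Hypothetical helper: true when $href is an absolute URL on $domain (or its www. form).
function isOwnPageLink(string $href, string $domain): bool
{
    $href = trim($href);
    if ($href === '' || $href[0] === '#') {
        return false; // empty links and in-page anchors are not pages
    }

    // parse_url returns null for a missing host and false for a malformed URL.
    $host = parse_url($href, PHP_URL_HOST);
    if (!is_string($host)) {
        return false;
    }

    $host = strtolower($host);
    return $host === $domain || $host === 'www.' . $domain;
}

$domain = 'demo.com';
$candidates = [
    'http://demo.com/about',
    'http://demo.com/about',            // duplicate
    '#',
    '',
    'mailto:info@demo.com',             // contains the domain but is not a page
    'http://other.com/?ref=demo.com',   // foreign host
    'http://www.demo.com/contact',
];

$urlIndexed = [];
foreach ($candidates as $href) {
    if (isOwnPageLink($href, $domain) && !in_array($href, $urlIndexed, true)) {
        $urlIndexed[] = $href;
    }
}

print_r($urlIndexed);
```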

Creating an XML file:

$xml = new SimpleXMLElement($xmlData);

print($xml->asXML());

We create the XML document using an instance of the SimpleXMLElement class, which represents an element in an XML document. We need to pass it the XML data variable.

Finally, we use the asXML function of SimpleXMLElement to get the well-formed XML string and output it with the print function.
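The string returned by asXML is not indented. If you prefer a sitemap that is easier to read, one option is to load the same string into DOMDocument with formatOutput enabled. This is a sketch of that alternative rather than part of the tutorial's code; the $urls list is an assumption for illustration.

```php
<?php
$urls = ['http://demo.com/', 'http://demo.com/search?q=php&page=2'];

$xmlData = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';
foreach ($urls as $url) {
    // Escape &, <, > and quotes so the URL is valid inside an XML element.
    $loc = htmlspecialchars($url, ENT_QUOTES | ENT_XML1, 'UTF-8');
    $xmlData .= "<url><loc>{$loc}</loc></url>";
}
$xmlData .= '</urlset>';

// Re-load the string into DOMDocument to get indented output.
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->loadXML($xmlData);

$pretty = $doc->saveXML();
echo $pretty;
```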

Sample HTML file to test the code:

To make things clear, I have created a demo HTML file. The file is a collection of links. The code will create a demo sitemap.xml file from it. The sitemap file is valid XML markup, which you can check with an XML validator. You can then upload it to your hosting and submit it to Google Webmaster Tools.
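If you do not have a demo file at hand, the whole flow can be tried against an inline page. The sketch below strings the steps of this tutorial together in one function; the function name buildSitemap and the sample markup are assumptions for illustration, and the headers are left out so that it also runs from the command line.

```php
<?php
// Hypothetical function that strings the tutorial's steps together.
function buildSitemap(string $html, string $domain): string
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from imperfect markup

    $urlIndexed = [];
    $xmlData = '<?xml version="1.0" encoding="UTF-8"?>'
        . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';

    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');

        // The four checks: not empty, not "#", contains our domain, not seen yet.
        if (!empty($href) && $href != '#'
            && strpos($href, $domain) !== false
            && !in_array($href, $urlIndexed)) {
            // Escape characters such as & before placing the URL inside XML.
            $loc = htmlspecialchars($href, ENT_QUOTES | ENT_XML1, 'UTF-8');
            $xmlData .= "<url><loc>{$loc}</loc><changefreq>weekly</changefreq>"
                . '<priority>0.8</priority></url>';
            $urlIndexed[] = $href;
        }
    }

    $xmlData .= '</urlset>';

    // SimpleXMLElement throws an Exception if the string is not well-formed XML.
    return (new SimpleXMLElement($xmlData))->asXML();
}

// Stand-in for the demo page: a small navigation block.
$sampleHtml = '<nav>
  <a href="http://demo.com/">Home</a>
  <a href="http://demo.com/about">About</a>
  <a href="http://demo.com/about">About (footer)</a>
  <a href="#">Back to top</a>
  <a href="http://example.org/">Partner site</a>
</nav>';

$sitemap = buildSitemap($sampleHtml, 'demo.com');
echo $sitemap;
```

With this sample, only the Home and About links should end up in the sitemap: the duplicate, the "#" link, and the link to another domain are filtered out.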

Conclusion:

The code is really simple and uses native PHP classes and functions. Anyone with elementary PHP development experience should be able to follow this tutorial. The code shows the most basic form of page crawling. You can modify it to crawl multiple pages and create sitemap.xml for larger sites with deeply linked pages.
