Crawling
After the intake and implementation, we can start crawling.
Crawling means systematically visiting a website with the aim of discovering all available information. For convenience we will refer to this information as “pages” here, but it could also include PDF documents, metadata accompanying videos, images, and so on.
Crawling is done in two ways: organic crawling and sitemap crawling. These methods can be used separately or in combination with each other.
Organic crawling
Organic crawling means that we start on a particular page – often the home page of a website – and search for links to other pages on the website from there.
Suppose we start on the page https://pandosearch.com/. Here we will probably find something like a main menu with links to “Blog”, “Contact” and other pages. The addresses these links point to are called URLs (Uniform Resource Locators).
Pandosearch recognises these URLs and will then visit those pages as well. On the homepage, for example, we could find links to the blog and the contact page.
We visit these pages and find links to yet more pages there. Pandosearch continues until it no longer encounters any URLs it has not already seen.
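In essence this is a breadth-first traversal of the site's link graph. Below is a minimal sketch of the idea in Python; the HTML link extraction, the single-domain check and the error handling are simplifying assumptions for illustration, not Pandosearch's actual implementation.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url):
    """Visit pages breadth-first until no unseen URLs remain."""
    seen = {start_url}
    queue = deque([start_url])
    domain = urlparse(start_url).netloc
    while queue:
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to load
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # resolve relative links; drop #fragments so anchors
            # on the same page do not count as new URLs
            absolute, _ = urldefrag(urljoin(url, href))
            # stay on the same website and never revisit a URL
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```

The `seen` set is what guarantees termination: every URL is queued at most once, so the crawl stops as soon as the last already-discovered page has been visited.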
During this process, we also apply inclusion and exclusion rules where necessary. For example:
- Ignore all URLs starting with “/archive”
- Ignore all URLs ending in “.pdf”
- Only include URLs starting with “/products”
These rules can vary from very simple to very complex, depending on what is needed to achieve the desired search results.
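Such rules boil down to simple predicates over URLs. As a sketch, here are the three example rules above combined into one filter; treating them as a single rule set is purely illustrative, and the example URLs are made up.

```python
from urllib.parse import urlparse

def is_allowed(url):
    """Apply the example inclusion and exclusion rules to one URL."""
    path = urlparse(url).path
    if path.startswith("/archive"):
        return False                      # ignore everything under /archive
    if path.endswith(".pdf"):
        return False                      # ignore PDF documents
    return path.startswith("/products")   # only include /products pages

urls = [
    "https://example.com/products/widget",
    "https://example.com/archive/2019",
    "https://example.com/products/manual.pdf",
]
print([u for u in urls if is_allowed(u)])
# -> ['https://example.com/products/widget']
```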
The end result of organic crawling is that an entire website is mapped as a network of web pages that link to each other.
Sitemaps
In addition to organic crawling, we also look at sitemaps. A sitemap is simply a list of web pages (in the form of URLs) that can be found on a website. This list may cover all pages of the site, but not necessarily:
- Sometimes a sitemap is deliberately a limited selection of pages the customer considers relevant for the search engine.
- Sometimes a sitemap contains “hidden” pages that would not be found through organic crawling.
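Sitemaps commonly follow the sitemaps.org XML format. The sketch below reads such a sitemap with Python's standard library; the sitemap content itself is a made-up example.

```python
import xml.etree.ElementTree as ET

# A made-up sitemap in the standard sitemaps.org format.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog</loc></url>
  <url><loc>https://example.com/contact</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(urls)
# -> ['https://example.com/', 'https://example.com/blog',
#     'https://example.com/contact']
```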
Creating and serving a sitemap requires some technical knowledge, which is why we have written dedicated technical documentation aimed at software developers.
Depending on the situation, and in consultation with the customer, we can choose to look only at a sitemap and skip organic crawling. We can also deliberately ignore sitemaps even when they are present.
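Conceptually, this choice is a per-site crawl strategy. A purely hypothetical sketch of how it could be represented follows; the names are illustrative and do not reflect Pandosearch's actual configuration.

```python
from enum import Enum

class CrawlStrategy(Enum):
    ORGANIC_ONLY = "organic"   # follow links, ignore any sitemap
    SITEMAP_ONLY = "sitemap"   # read the sitemap, do not follow links
    COMBINED = "combined"      # use both sources of URLs

# Hypothetical per-site setting, agreed with the customer.
strategy = CrawlStrategy.COMBINED
```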
Once we have collected all the information, we move on to the next step in the process: indexing.