Crawling and indexing
In addition to customer-specific configuration, Pandosearch also supports a number of web standards for crawling and indexing websites. This gives you a certain degree of control over what information should or should not appear in your search results. This article describes these standards and how we deal with them.
When we refer to a “robot” or “bot” in this article, we mean an automated program that searches through a website to gather information from it. Pandosearch works with robots. Google, Bing and other online search engines also use robots to collect the information on which they base their search results.
You can instruct robots on three levels:
- Entire site: a robots.txt file
- Individual pages: HTML tags and HTTP headers
- Specific links within a page: link attributes
The rest of this article looks at these three levels in more detail.
Entire site: robots.txt
A robots.txt file is a recommendation to bots about how a website should be crawled and accessed. This file is placed directly in the root folder of your website. For pandosearch.com, robots.txt is located here:
https://www.pandosearch.com/robots.txt
User-agent and Disallow
Instructions in robots.txt are basically a combination of User-agent and Disallow rules. For example:
User-agent: *
Disallow: /cgi-bin/
- User-agent: * means that this rule applies to all bots
- Disallow: /cgi-bin/ means that bots should not look at anything under /cgi-bin/
Another example:
User-agent: Googlebot
Disallow: /
- User-agent: Googlebot means that this rule applies specifically to the bot that identifies itself as “Googlebot”
- Disallow: / means that this bot should ignore the whole website (and therefore should not index it)
A robots.txt file may contain several User-agent sections.
Pandosearch identifies itself as “Pandosearch-Webcrawler” or “Enrise-Webcrawler”, depending on the configuration. You can use this to provide specific instructions with a User-agent rule combined with one or more Disallow rules.
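For example, a robots.txt that gives Pandosearch its own instructions alongside a general rule might look like this (the paths shown are placeholders for your own site structure):

User-agent: Pandosearch-Webcrawler
Disallow: /internal/

User-agent: *
Disallow: /cgi-bin/

Under RFC 9309, a bot follows the most specific User-agent section that matches its name, so with this configuration Pandosearch skips everything under /internal/, while all other bots are asked to skip /cgi-bin/.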
It is important to note that robots.txt is only a recommendation: bots are not obliged to follow it. Pandosearch follows it by default, but we can disable this on request.
More information can be found in the proposed web standard RFC 9309.
Sitemap
In addition to Disallow rules, robots.txt can also contain one or more Sitemap rules:
Sitemap: https://www.pandosearch.com/sitemap.xml
This rule points to the location of a sitemap or sitemap index. Pandosearch uses these sitemaps as input for discovering pages. See Crawling for more information on how this works.
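As a sketch, a minimal sitemap in the sitemaps.org format looks like this (the URL is just an example):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.pandosearch.com/</loc>
  </url>
</urlset>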
Individual pages: HTML tags and HTTP headers
A robots.txt file is mostly used to exclude whole sections of your website. It is also possible to specify how bots may handle individual pages. This can be done in a number of ways:
Robots meta tag
For individual HTML pages, you provide instructions with a robots meta tag. Two instructions are relevant for Pandosearch: noindex and nofollow.
noindex
<meta name="robots" content="noindex">
noindex is used to instruct robots not to index the page.
nofollow
<meta name="robots" content="nofollow">
nofollow is used to instruct robots not to follow links that appear on the page.
Combine
It is also possible to combine both:
<meta name="robots" content="noindex,nofollow">
In the same way as for robots.txt, Pandosearch follows these instructions by default unless configured differently.
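To illustrate placement: like other meta tags, the robots meta tag belongs in the <head> of the page. A minimal sketch (the page title is made up):

<!DOCTYPE html>
<html>
<head>
  <meta name="robots" content="noindex,nofollow">
  <title>Internal page</title>
</head>
<body>
  ...
</body>
</html>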
Canonical tag
Aside from excluding a page entirely, you can also let robots know which URL is the correct one. Why is this important? The same information can often be found through several URLs. A familiar example is with and without “www”:

https://pandosearch.com
https://www.pandosearch.com

Strictly speaking these are two different pages, but they contain the same information. If you show both in your search results, this will be confusing to end users.
To avoid this, you can use the “canonical” tag in your HTML page. Like the robots meta tag, this appears in the <head> part of your HTML. For example:
<link rel="canonical" href="https://www.pandosearch.com">
This indicates that the URL preceded by “www.” is the preferred variant for indexing and for display in search results.
Pandosearch follows these instructions by default. In consultation, we can switch this off if necessary.
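As a sketch, the page that is reachable without “www” would carry the tag pointing to the “www” variant (the /about path here is hypothetical):

<!-- Served on https://pandosearch.com/about -->
<head>
  <link rel="canonical" href="https://www.pandosearch.com/about">
</head>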
HTTP headers
The HTML tags above cannot be used for content other than HTML (e.g. PDF files). For such content there is another option: an X-Robots-Tag HTTP header that tells robots what they may and may not do. Note that this requires some technical knowledge about web servers.
You can also use noindex, nofollow or a combination here. In practice, noindex is particularly useful, because content other than HTML usually does not contain links to other pages:
X-Robots-Tag: noindex
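How you send this header depends on your web server. As a sketch, assuming an Apache server with mod_headers enabled, you could set it for all PDF files like this:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>

On nginx, the equivalent is an add_header directive in a location block that matches the same files.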
Specific links: link attributes
Meta tags and HTTP headers specify for an entire page what should happen with the links on that page. Sometimes you need more precise control: you want some links on a page to be followed and others not.
One example could be a forum on your website. You want to exclude links that users include in their posts, but you do want links to the pages where these individual posts are located to be indexed.
To achieve this, you can add rel="nofollow" to individual links. For example:
<a href="https://pandosearch.com/" rel="nofollow">Pandosearch</a>
This instructs Pandosearch not to follow this link.
Side note: by default, Pandosearch only follows links that point within the same domain, so for Pandosearch it is not strictly necessary to add rel="nofollow" to links to external websites. Even so, it can still make sense to do so for something like the forum in the example, because you have no control over what people include in their posts. Using rel="nofollow" also prevents user-posted links to internal pages from unintentionally appearing in the search results.
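Returning to the forum example, a post page might mix both kinds of links. A sketch (the URLs are made up):

<!-- Internal link to another post: followed and indexed as normal -->
<a href="/forum/posts/42">Related post</a>

<!-- Link submitted by a user inside a post: not followed -->
<a href="https://example.com/" rel="nofollow">shared by a user</a>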
Standard or not?
Although the ideas behind robots.txt date back to 1994, strictly speaking most of the instructions above have never been defined in an official standard (robots.txt itself was only recently formalized in the proposed standard RFC 9309 mentioned above). This means that different bots can interpret the same information differently and/or support different variations.
Practice shows, however, that the elements from this article are widely supported. They offer good options to control behaviour, both for Pandosearch and other robots.
Read more
Would you like to know more about robot instructions? The pages below are good starting points:
- https://en.wikipedia.org/wiki/Robots_exclusion_standard
- https://www.sitemaps.org/protocol.html#submit_robots
- https://en.wikipedia.org/wiki/Nofollow
- https://en.wikipedia.org/wiki/Noindex
- https://developers.google.com/search/docs/advanced/robots/create-robots-txt
- https://www.rfc-editor.org/rfc/rfc9309.html