Crawling and indexing

In addition to customer-specific configuration, Pandosearch also supports a number of web standards for crawling and indexing websites. This gives you a certain degree of control over what information should or should not appear in your search results. This article describes these standards and how we deal with them.

When we refer to a “robot” or “bot” in this article, we mean an automated program that searches through a website to gather information from it. Pandosearch works with robots. Google, Bing and other online search engines also use robots to collect the information on which they base their search results.

You can instruct robots on three levels:

  • Entire site: a robots.txt file
  • Individual pages: HTML tags and HTTP headers
  • Specific links within a page: link attributes

The rest of this article looks at these three levels in more detail.

Entire site: robots.txt

A robots.txt file is a recommendation to bots about how a website should be crawled and accessed. This file is placed directly in the root folder of your website. For pandosearch.com, robots.txt is located here:

https://www.pandosearch.com/robots.txt

User-agent and Disallow

Instructions in robots.txt are basically a combination of User-agent and Disallow rules. For example:

User-agent: *
Disallow: /cgi-bin/

  • User-agent: * means that this rule applies to all bots
  • Disallow: /cgi-bin/ means that bots should not look at anything under /cgi-bin/

Another example:

User-agent: Googlebot
Disallow: /

  • User-agent: Googlebot means that this rule applies specifically to the bot that identifies itself as “Googlebot”
  • Disallow: / means that this bot should ignore the whole website (and therefore should not index it)

A robots.txt file may contain several User-agent sections.

Pandosearch identifies itself as “Pandosearch-Webcrawler” or “Enrise-Webcrawler”, depending on the configuration. You can use this to provide specific instructions with a User-agent rule combined with one or more Disallow rules.
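To see how several User-agent sections interact, the sketch below feeds a hypothetical robots.txt to Python's standard urllib.robotparser module. The /internal/ path, the "SomeOtherBot" name and the example.com URLs are assumptions made for illustration. This parser is not the one Pandosearch itself uses, so edge cases may be handled differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one section for all bots, one for Pandosearch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: Pandosearch-Webcrawler
Disallow: /internal/
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Pandosearch-Webcrawler", "SomeOtherBot"):
        for path in ("/cgi-bin/form", "/internal/report", "/products"):
            url = "https://www.example.com" + path
            print(agent, path, is_allowed(agent, url))
```

In this sketch, a bot that matches a specific section uses only that section: Pandosearch-Webcrawler is kept out of /internal/ but not out of /cgi-bin/. If both rules should apply to it, repeat the Disallow rule in the specific section.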

It is important to note that robots.txt is only a recommendation: bots are not obliged to follow it. Pandosearch follows robots.txt by default, but we can disable this on request.

More information can be found in the proposed web standard RFC 9309.

Sitemap

In addition to Disallow rules, robots.txt can also contain one or more Sitemap rules:

Sitemap: https://www.pandosearch.com/sitemap.xml

A Sitemap rule points to the location of a sitemap or sitemap index. Pandosearch uses these sitemaps as a source of pages to crawl. See Crawling for more information on how this works.
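As a rough illustration of how a crawler might read Sitemap rules, the sketch below again uses Python's standard urllib.robotparser module (its site_maps() method needs Python 3.8 or later). The helper name sitemap_urls is hypothetical, and this is not a description of how Pandosearch reads the file internally.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt with one Disallow rule and one Sitemap rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

Sitemap: https://www.pandosearch.com/sitemap.xml
"""


def sitemap_urls(robots_txt):
    """Return all Sitemap URLs listed in robots_txt (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file has no Sitemap rules.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(sitemap_urls(ROBOTS_TXT))
```

Sitemap rules are read independently of the User-agent sections, so they can appear anywhere in the file.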

Individual pages: HTML tags and HTTP headers

A robots.txt file is mostly used to exclude whole sections of your website. It is also possible to specify how bots may use specific pages. This can be done in a number of ways:

Robots meta tag

For individual HTML pages, you provide instructions with a robots meta tag. Two instructions are relevant for Pandosearch: noindex and nofollow.

noindex

<meta name="robots" content="noindex">

noindex is used to instruct robots not to index the page.

nofollow

<meta name="robots" content="nofollow">

nofollow is used to instruct robots not to follow links that appear on the page.

Combine

It is also possible to combine both:

<meta name="robots" content="noindex,nofollow">

In the same way as for robots.txt, Pandosearch follows these instructions by default unless configured differently.
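The sketch below shows one way a robot could read these instructions from a page, using Python's standard html.parser module. The class and function names are hypothetical, and the code only illustrates the idea; it does not describe Pandosearch's actual implementation.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # The content attribute holds a comma-separated list of directives.
        for directive in (attributes.get("content") or "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<head><meta name="robots" content="noindex,nofollow"></head>'
    directives = robots_directives(page)
    print("index page:", "noindex" not in directives)
    print("follow links:", "nofollow" not in directives)
```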

Canonical tag

Aside from excluding a page entirely, you can also let robots know which URL is the preferred one. Why is this important? The same information can often be reached through several URLs. A familiar example is the same address with and without “www”, such as https://pandosearch.com and https://www.pandosearch.com.

Strictly speaking these are two different pages, but they contain the same information. If you show both in your search results, this will be confusing to end users.

To avoid this, you can use the “canonical” tag in your HTML page. Like the robots meta tag, this appears in the <head> part of your HTML. For example:

<link rel="canonical" href="https://www.pandosearch.com">

This indicates that the URL preceded by “www.” is the preferred variant for indexing and displaying search results.

Pandosearch follows these instructions by default. We can switch this off on request if necessary.
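As an illustration of what following a canonical tag can mean in practice, the sketch below resolves the preferred URL for a page with Python's standard library. The names CanonicalParser and preferred_url are hypothetical, and falling back to the page's own URL when no tag is present is an assumption of this sketch, not documented Pandosearch behaviour.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalParser(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "canonical" in rel_tokens and href:
            self.canonical = href


def preferred_url(page_url, html):
    """Return the canonical URL for a page, or page_url if none is given."""
    parser = CanonicalParser()
    parser.feed(html)
    if parser.canonical is None:
        return page_url
    # Resolve relative canonical links against the page's own URL.
    return urljoin(page_url, parser.canonical)


if __name__ == "__main__":
    page = '<head><link rel="canonical" href="https://www.pandosearch.com"></head>'
    print(preferred_url("https://pandosearch.com", page))
```

Two pages that resolve to the same preferred URL can then be treated as one search result.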

HTTP headers

The HTML tags above cannot be used for content other than HTML (e.g. PDF files). For such content, you can use an X-Robots-Tag HTTP header to specify what robots may and may not do. Note: setting this header requires some technical knowledge about web servers.

You can also use noindex, nofollow or a combination here. In practice, noindex is particularly useful because content other than HTML usually does not contain links to other pages:

X-Robots-Tag: noindex
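How you set this header depends on your web server. Purely as an illustration, the sketch below uses Python's built-in http.server module to add the header to every PDF response. The rule "all PDF files get noindex" and the names x_robots_tag_for and NoIndexPdfHandler are assumptions for this example; in production you would normally configure this in your actual web server or application framework.

```python
from http.server import SimpleHTTPRequestHandler


def x_robots_tag_for(path):
    """Return the X-Robots-Tag value for a request path, or None."""
    # Ignore any query string or fragment before checking the extension.
    clean_path = path.split("?", 1)[0].split("#", 1)[0]
    if clean_path.lower().endswith(".pdf"):
        return "noindex"
    return None


class NoIndexPdfHandler(SimpleHTTPRequestHandler):
    """Static file handler that marks PDF responses as noindex."""

    def end_headers(self):
        value = x_robots_tag_for(self.path)
        if value is not None:
            self.send_header("X-Robots-Tag", value)
        super().end_headers()
```

To try it locally, pass NoIndexPdfHandler to http.server.HTTPServer and request a PDF file; the response headers should then include X-Robots-Tag: noindex.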

Specific links: link attributes

Meta tags and HTTP headers can be used to specify for an entire page what should happen with the links on that page. Sometimes you want to be able to specify what should happen to links on a page more precisely: you want some links to be followed and others not.

One example is a forum on your website. You want robots to ignore the links that users include in their posts, but you do want the links to the pages containing the individual posts to be followed, so that those pages are indexed.

To achieve this, you can add rel="nofollow" to individual links. For example:

<a href="https://pandosearch.com/" rel="nofollow">Pandosearch</a>

This instructs Pandosearch not to follow this link.

Side note: By default, Pandosearch only follows links that stay within the same domain. This means that, as far as Pandosearch is concerned, you do not need to add rel="nofollow" to links that point to external websites. Even so, it can still make sense to do this for something like the forum in the example, because you have no control over what people include in their posts. Using rel="nofollow" prevents links to internal pages posted by users from unintentionally appearing in the search results.
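The two rules above (skip rel="nofollow" links, stay within the same domain) can be sketched as a small link filter in Python. The names LinkCollector and followable_links are hypothetical, and comparing hostnames exactly is an assumption about what "same domain" means; Pandosearch's own rules may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect links a crawler would follow from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # Skip links explicitly marked with rel="nofollow".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            return
        # Keep only links on the same host as the page itself.
        target = urljoin(self.base_url, href)
        if urlparse(target).hostname == urlparse(self.base_url).hostname:
            self.links.append(target)


def followable_links(base_url, html):
    """Return absolute URLs of the links worth following in html."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    page = (
        '<a href="/forum/post/1">Post</a>'
        '<a href="/admin" rel="nofollow">Posted by a user</a>'
        '<a href="https://other.example/">External</a>'
    )
    print(followable_links("https://www.example.com/forum/", page))
```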

Standard or not?

Although the ideas behind robots.txt date back to 1994, strictly speaking no official standards have been defined for the instructions above (one is in the making at the time of writing). This means that different bots can interpret the same information differently and/or support different variations.

Practice shows, however, that the elements from this article are widely supported. They offer good options to control behaviour, both for Pandosearch and other robots.

Read more

Would you like to know more about robot instructions? The pages below are good starting points: