Indexing

After crawling, the information discovered needs to be indexed.

Indexing means translating raw information into a form that is suitable for searching. This is done in four steps: categorising, cleaning up, selecting, and analysing text. There are also additional customisation options, which we mention briefly at the end of this article.

The exact indexing process depends in part on the type of information (a web page differs from a PDF document, for example) and on the specific wishes of our customers. In this article we assume that we have a web page with HTML source code, which we translate into information that can be found using a search bar.

Categorising

Pandosearch was created to break down large amounts of information into smaller pieces and categorise them.

In search terminology, we call these categories facets. A facet is both a filter and a grouping.

For example:

The pandosearch.com website contains general pages and news articles. For the search function, we would like to know which pages are "general" and which are "news". This makes it possible to:

Search only in "news", for example in a search bar on the News page. This prevents people from getting irrelevant search results if they only want to search within News.
In general search results, show how many results are in "general" and how many are in "news". This gives visitors an indication of where to find information. They can also click either of these to filter the results further.

Facets make this (and much more!) possible.
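
To make the "filter and grouping" idea concrete, here is a minimal sketch in Python. It is not how Pandosearch is implemented; the result list, the "category" field name and both helper functions are assumptions made up for this illustration.

```python
from collections import Counter

# Hypothetical search results; each one carries a "category" facet value.
results = [
    {"url": "https://pandosearch.com/", "category": "general"},
    {"url": "https://pandosearch.com/news/example-article", "category": "news"},
    {"url": "https://pandosearch.com/about", "category": "general"},
]


def facet_counts(hits, facet):
    """Grouping: count how many hits fall under each facet value."""
    return Counter(hit[facet] for hit in hits)


def filter_by_facet(hits, facet, value):
    """Filter: keep only the hits that have the given facet value."""
    return [hit for hit in hits if hit[facet] == value]


counts = facet_counts(results, "category")
news_only = filter_by_facet(results, "category", "news")
```

Here `counts` holds the number of results per category, which is what a results page could show next to each facet, and `news_only` is what a search bar on the News page would work with.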

But how does Pandosearch determine whether something is a "general" or "news" page? This can be done in a number of ways:

The URL (the web address) can be useful for this. For example, all news articles could be located under "pandosearch.com/news/". Pandosearch then marks every page whose URL begins with this prefix as "news" and the rest as "general".
Another way is to look at so-called meta tags in the HTML source code of the page. During implementation, we can agree with the website builders of pandosearch.com on a specific meta tag that tells us whether a page is "news" or not.
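
The two approaches above could look roughly like this in Python. This is a sketch under assumptions, not Pandosearch's actual logic: the "/news/" path prefix follows the example above, while the meta tag name "search-category" and the function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def category_from_url(url):
    """Mark pages under /news/ as "news" and everything else as "general"."""
    path = urlparse(url).path
    return "news" if path.startswith("/news/") else "general"


class MetaTagReader(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.meta[attributes["name"]] = attributes["content"]


def category_from_meta(html, tag_name="search-category", default="general"):
    """Read the agreed meta tag; the tag name is an assumption."""
    reader = MetaTagReader()
    reader.feed(html)
    reader.close()
    return reader.meta.get(tag_name, default)
```

Note that `category_from_url` expects a full URL including the scheme (for example "https://"), because `urlparse` only separates host and path reliably when the scheme is present.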

There are many other ways to determine facets. Which one fits best depends greatly on customer-specific requirements, so we discuss this during intake and implementation.

Cleaning up

You don’t think much about it when you visit a website, but every web page you view contains all kinds of information in the source code that you can’t see. This includes formatting information (such as text colour and text format) and programming code (scripts) that reacts to movements on the screen with your mouse or finger. This is very useful for you as a visitor, but not something that is relevant to a search engine.

That is why Pandosearch cleans up a lot of this kind of information when indexing an HTML page.
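
As an illustration of this kind of clean-up, the sketch below uses Python's standard `html.parser` module to drop tags, styling and scripts and keep only the visible text. The class and function names are made up for this example, and the real clean-up rules may well differ.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps visible text and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep non-empty text that is not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def clean_html(html):
    """Return the visible text of an HTML page as one string."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Attributes such as inline styles disappear automatically, because only text between tags is collected.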

Selecting

Once we have done an initial clean-up, there is often still information left over that we do not want to use. One example is the main menu, which contains all kinds of words that are relevant not to the current page but to the pages the menu links to. The footer of a web page also often contains text that is the same on every page, and is therefore not useful for a search function.

To deal with this, we explicitly select the relevant page content. How we do this depends very much on the content management system (CMS) behind a website. That is why we often do this in consultation with technical people at our customers, such as website administrators and/or software developers.
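
A simple form of selection is to keep only the text inside one agreed container element. The sketch below assumes the CMS wraps the page content in a `<main>` element; that is an assumption for this example only, since the right selector differs per website, and the names used are hypothetical.

```python
from html.parser import HTMLParser


class ContainerText(HTMLParser):
    """Collects text that appears inside the chosen container element."""

    def __init__(self, container="main"):
        super().__init__()
        self.container = container
        self.depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.container:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.container and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside the container (menu, footer) is ignored.
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


def select_content(html, container="main"):
    """Return only the text found inside the container element."""
    parser = ContainerText(container)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

With this approach, menu and footer text never reaches the index, so a search for a word that only appears in the menu does not match every page of the site.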

Text analysis

After cleaning up and selecting, a piece of text remains that we want to make findable. To do this, we use logic that analyses the text. Depending on the language and the nature of the text, we split the text into words, account for word conjugations, make sure special characters (é, ü, à, etc.) are handled properly, and build a list of autocomplete suggestions.
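
The sketch below illustrates three of these steps in Python: splitting text into words, handling special characters, and offering autocomplete suggestions. It is a simplified illustration with hypothetical function names, not the analysis Pandosearch actually runs. Handling of word conjugations is left out because it depends heavily on the language.

```python
import re
import unicodedata


def fold_accents(text):
    """Map characters such as é, ü and à to e, u and a."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Lowercase the text, fold accents and split it into words."""
    return re.findall(r"\w+", fold_accents(text.lower()))


def suggest(vocabulary, prefix, limit=5):
    """Return indexed words that start with what the visitor has typed."""
    folded = fold_accents(prefix.lower())
    return sorted(word for word in vocabulary if word.startswith(folded))[:limit]


words = tokenize("Café crème für alle")
vocabulary = set(words)
```

Because both the indexed text and the typed prefix go through the same folding, a visitor who types "cafe" and one who types "café" get the same suggestions.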

Additional text analyses

In addition to the examples above, we also apply specific text analysis when necessary. Because each Pandosearch implementation can differ, covering every variant is beyond the scope of this general documentation. During intake and implementation, we always look at what is needed for a specific situation.

Customisation options

In addition to the aforementioned customisation in the text analysis, customisation is also possible in other phases of the indexing process. Examples are indexing information other than HTML (XML, PDFs, JSON, etc.) or very customer-specific settings for cleaning up and analysing web pages.

We are happy to advise on this, and we always decide together with our customers whether the added value that customisation can offer outweighs the (extra) investment.

Once the indexing is done, Pandosearch can start to serve all the information through a search function.