Search States Business – The People's Trust Guernsey

The States of Guernsey’s website (www.gov.gg) has poor search facilities. Whether this is by incompetence or intent is unclear.

Openness, accountability and good corporate governance demand easy and fast access to information. This is particularly so with States Business: the meetings, Hansards, proposals and so on.

A solution to this problem would be to scrape the content carefully and index it properly with a full-text search.

Here is a ‘beta’ version.

Abuse it and be blocked 😉

I made a tool to scrape thousands of public documents from the government server, restructure them, extract metadata and text, and index them properly with a full-text search tool. You can find the source code on GitHub: https://github.com/JBDAC/govgg-scraper

Technical:

Almost all search engines crawl web servers to extract text to index.

Some sites try to manage web crawler and scraper activity, but they typically do so at the expense of search placement. The most prevalent mechanism is the robots.txt file, placed in the site's root directory, which tells crawlers which paths to avoid. The meta robots tag in the HTML head section of an individual page can set indexing and link-following preferences, such as noindex or nofollow. For non-HTML content, the X-Robots-Tag HTTP header performs a similar role, and the rel='nofollow' attribute on a hyperlink asks crawlers not to follow that specific link. None of these requires that the crawler obey it: they are guidelines only.

More restrictive methods include password-protecting certain areas of the site and disabling directory listings. Unlike the guidelines above, these are enforced by the server: a crawler cannot fetch a page without the credentials, and it cannot discover files that nothing lists or links to.
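
As a rough illustration of how a well-behaved crawler consults robots.txt, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content, the user agent name and the example.org URLs are all made up for the example; none of them comes from gov.gg or from the scraper linked above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only (not taken from gov.gg).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.org/private/a.pdf",
                "https://example.org/public/a.pdf"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "example-crawler", url))
```

Nothing stops a crawler from skipping this check altogether, which is why robots.txt is only a guideline.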

In any event, Gov.gg does not use any of these, and nor should it: it is a public service website whose content is all in the public domain. Its problem is the abysmal search. You can check the downloaded files for such directives with:

grep -RilE "nofollow|noindex|robots\.txt" ./
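
The grep lists any file that contains one of those strings anywhere, including ordinary prose that merely mentions them. A stricter check, sketched below in Python with the standard html.parser module, reports only real meta robots tags. The class and function names are made up for this example and are not part of the scraper. Note also that downloaded files cannot reveal an X-Robots-Tag header, since that is sent in the HTTP response rather than stored in the document.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lower-cases tag and attribute names.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.append(directive)


def robots_directives(html: str) -> list:
    """Return the robots directives declared in an HTML document, if any."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(robots_directives(page))
```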

So we’re free to crawl! Having grabbed the files, we can index them with Recoll, a powerful open-source full-text search tool that indexes and retrieves text from many document formats, so users can quickly locate specific content within a collection of documents. To let other people search the data, we put a web front end on top of Recoll.
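
For readers unfamiliar with full-text indexing, the core idea is an inverted index: a map from each word to the documents that contain it. The toy Python sketch below shows only that idea. It is not how Recoll is implemented (Recoll builds on the Xapian library and handles stemming, phrases, many file formats and far more), and the document names and texts are invented.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Split text into lower-case alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict) -> dict:
    """Map each term to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in tokenize(text):
            index[term].add(name)
    return index


def search(index: dict, query: str) -> set:
    """Return the documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    # Hypothetical documents, for illustration only.
    docs = {
        "hansard-a.txt": "Debate on the harbour development proposals",
        "proposal-b.txt": "Proposals for the waste strategy",
    }
    index = build_index(docs)
    print(sorted(search(index, "harbour proposals")))
```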

This is the tool that is running on our server: https://github.com/koniu/recoll-webui

You can enter complex searches, use date ranges, and drill into the document tree with the folder option (though that is typically date-oriented, too).
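
The same filters can also be written out in Recoll's query language, which has a date: clause for ISO 8601 dates and intervals and a dir: clause for restricting results to a folder. The Python helper below is a hypothetical sketch that assembles such a query string. It assumes the web front end passes the query through to Recoll unchanged, and the search terms and folder name are invented.

```python
from datetime import date
from typing import Optional


def build_query(terms: str,
                after: Optional[date] = None,
                before: Optional[date] = None,
                folder: Optional[str] = None) -> str:
    """Assemble a Recoll query string from search terms and optional filters."""
    parts = [terms]
    if after or before:
        # Recoll date intervals are written start/end; either side may be empty.
        start = after.isoformat() if after else ""
        end = before.isoformat() if before else ""
        parts.append(f"date:{start}/{end}")
    if folder:
        # A folder name containing spaces would need quoting; not handled here.
        parts.append(f"dir:{folder}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query('"waste strategy"', date(2020, 1, 1), date(2020, 12, 31)))
    print(build_query("harbour", folder="hansard"))
```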

It should be more or less up to date. If not, drop us a message via the contacts page.