
How to control the crawler (Crawl Control)

There are two ways to instruct our crawlers on how to crawl and index your web page content.

You can create a robots.txt file with instructions on which folders should not be crawled, and at the page level you can use a meta robots tag in the HTML to specify what should or should not be indexed.

Robots.txt

What is robots.txt? 

The robots exclusion standard, also known as the robots exclusion protocol, or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. A robots.txt file instructs web crawlers which pages and files of a website can and cannot be processed or crawled. This is used mainly to avoid overloading a website with requests. 

Like all respectable search engine crawlers, our crawlers respect standard robots.txt directives: before crawling your website, our crawler always checks the robots.txt file first.

Examples of how to create a robots.txt file

The robots.txt file has to be placed at the root of your website. For example, if your website is www.mydomain.com, you can create a plain text file at www.mydomain.com/robots.txt. If you don’t know how to access your website root or lack the permissions to do so, contact your web hosting service provider.

You must name the file robots.txt. The name is case sensitive, so use only lowercase letters and don’t write Robots.TXT.

A robots.txt file consists of one or more rules. Each rule blocks or allows access for a given crawler to a specified file path in that website.

Here is an example of a robots.txt file with two rules:

User-agent: Dataprovider.com
Disallow: /nodataprovider.com/
User-agent: *
Disallow: /

In the first rule, our user agent, Dataprovider.com, is not allowed to crawl the folder https://example.com/nodataprovider.com/ or any of its subdirectories.

In the second rule, the '*' in the User-agent field stands for all other web robots and crawlers; they are not allowed to crawl any part of the website.
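
Conversely, if you want to give every crawler full access to your website, you can use a rule with an empty Disallow value; an empty Disallow line means nothing is blocked:

User-agent: *
Disallow: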

Robots.txt Creator

You can find more information and a detailed explanation of how to create a robots.txt file at www.robotstxt.org. A quick web search will also turn up many robots.txt generators and plugins that can do the work for you.

The World Wide Web Consortium (W3C), whose mission is to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure the long-term growth of the Web, has added notes and instructions on robots to the HTML specification, which you can find at: https://www.w3.org/TR/html4/appendix/notes.html#h-B.4

Meta name (Meta robots tag) 

What is a meta robots tag? 

Meta tags provide information about a web page in the HTML of a document. Meta tags are not displayed on the page itself but can be read by search engines and web crawlers. 

Meta robots tags are pieces of code that provide web crawlers with instructions on how to crawl or index web page content: for example, whether or not to index a given web page, whether or not to follow links to another page, and so on. Whereas a robots.txt file instructs web crawlers about the entire website, meta robots tags specify page-level settings: they can block a web crawler from crawling a specific page on the website.

Examples of how to create a meta robots tag

You can add a meta robots tag in the HTML of a particular page on your website. You have to place the meta robots tag in the <head> section of the page.

Here are two examples of meta robots tags:

<meta name="robots" content="noindex, nofollow">

This tag disallows any crawler from indexing the content on the respective page and prevents it from following any links on the page.

<meta name="dataprovider.com" content="noindex">

This tag disallows Dataprovider’s crawler from indexing the content on the respective page.

Here is an overview of the most common meta robots tag commands:

  • Index: Tells a search engine to index the page.
  • Follow: Tells a crawler to follow all the links on the page.
  • Noindex: Tells a search engine not to index the page.
  • Nofollow: Tells a crawler not to follow any of the links on the page.
  • Nosnippet: Tells a search engine not to show a snippet (meta description) of the page in search engine results.
  • Noimageindex: Tells a crawler not to index any images on the page.
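
These commands can be combined in the content attribute of a single tag. As an illustrative sketch, here is a hypothetical page head that allows indexing but tells crawlers not to follow links or show a snippet:

<!DOCTYPE html>
<html>
  <head>
    <title>Example page</title>
    <!-- allow indexing, but do not follow links or show a snippet -->
    <meta name="robots" content="index, nofollow, nosnippet">
  </head>
  <body>
    ...
  </body>
</html>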

A detailed explanation of the protocol, with links, can be found at: https://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.2

HTTP redirect

What is an HTTP redirect?

An HTTP redirect is a way to forward visitors and search engines from one URL to another. Redirects are used in various situations, such as moving content to a new URL, deleting pages, changing domain names, or merging websites, to name a few.

Redirects can be temporary, for example during site maintenance or downtime, or permanent, preserving existing links after a website’s URLs change. Dataprovider’s crawler follows the instructions given by the HTTP redirect. Keep in mind that temporary redirects (302 and 307) will not update the index because of the temporary nature of the redirect.

How can I set up an HTTP redirect in Apache/IIS?

The Apache HTTP Server is the most widely used open-source web server software. You can find more information on how to set up an HTTP redirect in Apache at: https://httpd.apache.org/docs/2.4/rewrite/remapping.html
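
As a minimal sketch, assuming mod_alias and mod_rewrite are enabled and using placeholder example.com URLs, a permanent redirect can be configured in an .htaccess file like this:

# Redirect a single page permanently (301) with mod_alias
Redirect permanent /old-page.html https://www.example.com/new-page.html

# Redirect an entire site to a new domain with mod_rewrite
RewriteEngine on
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]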

Internet Information Services (IIS) is web server software created by Microsoft that supports HTTP, HTTP/2, HTTPS, FTP, FTPS, SMTP, and NNTP. You can find more information on how to set up an HTTP redirect in IIS at: https://docs.microsoft.com/en-us/iis/configuration/system.webserver/httpredirect/
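
As a comparable sketch for IIS, a site-wide permanent redirect can be declared in the site’s web.config file (the destination URL is a placeholder):

<configuration>
  <system.webServer>
    <httpRedirect enabled="true"
                  destination="https://www.example.com"
                  httpResponseStatus="Permanent" />
  </system.webServer>
</configuration>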

Explanation of the Hypertext Transfer Protocol 

The Hypertext Transfer Protocol (HTTP) is the underlying application protocol for distributed, collaborative, hypermedia information systems. It defines how messages are formatted and transmitted and what actions web servers and browsers should take in response to various commands. HTTP has been in use by the World Wide Web global information initiative since 1990. 
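
On the wire, a redirect is nothing more than a status code plus a Location header in the server’s response. A simplified exchange (with placeholder URLs) looks like this:

GET /old-page.html HTTP/1.1
Host: www.example.com

HTTP/1.1 301 Moved Permanently
Location: https://www.example.com/new-page.html

The client then repeats its request against the URL given in the Location header.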

How can I block Dataprovider’s crawler?

As previously mentioned, Dataprovider’s crawler strictly follows the robots.txt file on your website, so you can fully control it if you want to. If for some reason you want to prevent our crawler from visiting your website, place the following lines in the robots.txt file on your server:

User-agent: Dataprovider.com
Disallow: /

Please note that it may take some time before our crawler processes the changes in your robots.txt file. 

Please note that if your robots.txt file contains errors and our crawler cannot recognize your directives, it will continue crawling your website the way it did before.

How can I unsubscribe my website?

If you want to ensure that Dataprovider’s crawler doesn’t access your website at all, i.e. it doesn’t even reach your robots.txt file, contact us at support@dataprovider.com and we will add the URLs that you don’t want us to scan to our blacklist.