Once you have uploaded a file to our Know Your Customer product, processing begins.
Dataprovider indexes domains with ‘spiders’ at regular intervals. A spider is a piece of software that runs on servers in datacenters. You can recognise our spider in the access logs of your website because it identifies itself with the following user agent:
“Mozilla/5.0 (compatible; Lipperhey-Kaus-Australis/5.0; +https://www.lipperhey.com/en/about/)”
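If you want to check whether the spider visited your site, a minimal sketch (the sample log line and helper name are illustrative assumptions; adjust the check to your own log format):

```python
# Sketch: flag access-log lines produced by the Dataprovider spider.
# The sample line below mimics a combined-log-format entry; the
# helper name is illustrative, not an official tool.

SPIDER_MARKER = "Lipperhey-Kaus-Australis"

def is_spider_visit(log_line: str) -> bool:
    """Return True if the log line carries the spider's user agent."""
    return SPIDER_MARKER in log_line

sample = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" '
          '200 512 "-" "Mozilla/5.0 (compatible; Lipperhey-Kaus-Australis/5.0; '
          '+https://www.lipperhey.com/en/about/)"')
```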
The spiders try to be as efficient as possible, downloading only 10 to 20 pages per website.
The response a spider gets from a website can differ from the response you get as a human. This is because you use a real browser and don’t visit the website from a datacenter. Always remember that the data is captured at a different time than when you visit the website. Websites change, and they react differently to browsers operated by humans than to spiders from datacenters.
‘Access denied’ means that the spider can’t access the website. This can occur when the DNS is not configured, the server is unavailable, or access is not allowed. In most cases there is no website (the DNS is not configured), but sometimes there is. In that case the hosting provider or the CMS of the website doesn’t allow the spider to visit the website. Not everybody likes to have spiders on their website, so they block us or mistake our visit for a DDoS attack.
Redirects and Access Denied are very strong fields; there is no room for interpretation for a spider. For us, a placeholder is a website with only one page that contains phrases like “under construction” or “this domain is for sale”. Placeholders can vary per hosting company because there is no real definition for them. It is not as exact a method as Redirects and Access Denied, but it works. Some clients look more at the content of a website to see whether a domain is used. A good heuristic, for example, is: websites with only one indexed page and fewer than 200 words. These are typical low-content websites, and that complements looking for default placeholders.
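The heuristic above can be sketched in a few lines (the phrase list and helper name are illustrative assumptions, not Dataprovider's actual definition):

```python
# Sketch of the low-content heuristic described above: one indexed page
# and fewer than 200 words, or a typical placeholder phrase. The phrase
# list is illustrative and varies per hosting company.

PLACEHOLDER_PHRASES = ("under construction", "this domain is for sale")

def looks_like_placeholder(indexed_pages: int, page_text: str) -> bool:
    """Heuristically decide whether a website is a placeholder."""
    low_content = indexed_pages == 1 and len(page_text.split()) < 200
    has_phrase = any(p in page_text.lower() for p in PLACEHOLDER_PHRASES)
    return low_content or has_phrase
```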
To download a page from a website, the spider requests a URL. A status code is a three-digit number that specifies the result of a requested URL. For example: if a page was found, the spider receives status code 200; if a page was not found, it receives status code 404. Dataprovider has a variable called 'Status Codes'. This variable contains a list of all the status codes the spider received while indexing the website. Because a website can have multiple pages, it can also have multiple status codes. Officially there are five classes of response. The first digit of the status code specifies the class. The classes are:
- 1xx Informational
- 2xx Success
- 3xx Redirection
- 4xx Client Error
- 5xx Server Error
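Because the class is simply the first digit, it can be derived mechanically. A minimal sketch (the helper name is an assumption, not a Dataprovider API):

```python
# Sketch: map a three-digit status code to its official response class
# via its first digit.

STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def status_class(code: int) -> str:
    """Return the official class for a three-digit status code."""
    return STATUS_CLASSES.get(code // 100, "Unknown")
```

For example, `status_class(404)` yields "Client Error".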
These classes are limited and only give insight into the response of the server or the website. Because Dataprovider also offers a private crawl, we want to give insight into the response of the spider as well. This is why we added an extra (unofficial) 9xx class. When you upload a list of domains into Dataprovider for a custom crawl, our spider will index these domains. All the domains are checked and de-duplicated, and for each domain we will have a response. The field 'response' is a variable that is only available in our private section (custom crawl). The response of a domain can be:
- Available (we received a valid response with status code 1xx or 2xx)
- Host not found (there is no IP configured in the DNS for this domain or the IP is not responding)
- Redirect (we received a server side redirect with status code 3xx)
- Access Denied (the spider could not access the website and received status code 4xx, 5xx or 9xx)
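The mapping above can be sketched as a small function (the function name is an assumption; note that ‘Host not found’ cannot be derived from a status code, because in that case the spider never received a response at all):

```python
# Sketch of the response mapping listed above. 'Host not found' is not
# covered here: it means no response (and thus no status code) was
# received from the domain.

def response_for(status_code: int) -> str:
    """Map a received status code to the 'response' value for a domain."""
    family = status_code // 100  # first digit of the code
    if family in (1, 2):
        return "Available"
    if family == 3:
        return "Redirect"
    if family in (4, 5, 9):
        return "Access Denied"
    return "Unknown"
```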
‘Access denied’ is a broad term. These errors can be caused by the server on which the website is hosted (5xx), by the website itself (4xx), or by the communication between the spider and the server (9xx). A complete overview of the official status codes (1xx, 2xx, 3xx, 4xx and 5xx) can be found on this website. The (unofficial) 9xx class was added by Dataprovider and shows why the spider could not get any results from the domain. These status codes are:
- Websumit and Webpost range
- 901 The remote name could not be resolved.
- 902 Unable to connect to the remote server (IP in A record does not exist).
- 903 An unexpected error occurred when receiving a response.
- 904 An unexpected error occurred when sending a request (no TLS support).
- 907 The remote server returned an error (404 Not found).
- 908 The connection was closed unexpectedly.
- 909 Could not establish a trusted relationship for the SSL/TLS secure channel.
- 910 The request was aborted (could not create SSL/TLS secure channel).
- 911 The server committed a protocol violation.
- 914 The operation has timed out (homepage of website).
- Regulator range
- 940 Hostname has more than 2,500 redirecting subdomains (infinite loop).
- 941 Redirect target URL contains syntax error.
- Crawler common range
- 950 Website is disallowed by robots.txt.
- 952 Tried to connect to the website five times and gave up to prevent hammering.
- 953 Root document could not be parsed.
- 954 Domain is blacklisted by Dataprovider.
- 955 Root document available but timed out (header status code 200, long wait).
- 956 Main thread of the crawler timed out; the crawler froze and reached the time limit.
- 957 Website data was sent by the crawler but not received by the regulator (old 997).
- 960 Aborted by the regulator (waited for 20 minutes).
- Crawler exception range
- 990 IO exception in GET response.
- 991 URI format exception in GET.
- 992 Argument out of range exception.
- 993 Maximum page size has been reached.
- 994 Object disposed exception.
- 995 Undocumented exception.
- 996 Website redirects to an empty result page.
- 997 Website data was sent by the crawler but not received by the regulator.
- 999 There is no A record (IP address) configured for the given hostname.
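For reference, the list above can be turned into a simple lookup table, for example when annotating crawl reports. A minimal sketch (only a few codes are shown, and the helper name is an assumption):

```python
# Sketch: look up the description of an unofficial 9xx code. Only a
# subset of the codes listed above is included here.

STATUS_9XX = {
    901: "The remote name could not be resolved.",
    914: "The operation has timed out (homepage of website).",
    950: "Website is disallowed by robots.txt.",
    999: "There is no A record (IP address) configured for the given hostname.",
}

def describe_9xx(code: int) -> str:
    """Return the description of a 9xx code, or a fallback string."""
    return STATUS_9XX.get(code, "Unknown 9xx code")
```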