Processing KYC files

Once you have uploaded a file to our Know Your Customer product, processing begins.

Spiders

Dataprovider indexes domains with ‘spiders’ at certain times. A spider is a piece of software that runs on servers in datacenters. You can recognise our spider in the access logs of your website because it identifies itself with the following user agent:

“Mozilla/5.0 (compatible; Lipperhey-Kaus-Australis/5.0; +https://www.lipperhey.com/en/about/)”
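
As an illustration, the spider's visits could be counted in an access log with a short script like the one below. This is only a sketch: it assumes your server writes the user agent into each log line (as the common "combined" log format does), and the function names are made up for this example. It matches on the product token rather than the full string, on the assumption that the version number may change.

```python
# Product token taken from the user agent shown above.
SPIDER_TOKEN = "Lipperhey-Kaus-Australis"


def is_dataprovider_spider(log_line: str) -> bool:
    """Return True if the log line mentions the spider's product token."""
    return SPIDER_TOKEN.lower() in log_line.lower()


def count_spider_hits(log_lines) -> int:
    """Count the log lines that were produced by the spider."""
    return sum(1 for line in log_lines if is_dataprovider_spider(line))


if __name__ == "__main__":
    # Hypothetical log lines, for demonstration only.
    sample = [
        '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 '
        '"-" "Mozilla/5.0 (compatible; Lipperhey-Kaus-Australis/5.0; '
        '+https://www.lipperhey.com/en/about/)"',
        '198.51.100.7 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 200 2326 '
        '"-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    ]
    print(count_spider_hits(sample))
```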

The spiders try to be as efficient as possible by downloading only 10 to 20 pages per website.

The response a spider gets from a website can differ from the response you get as a human. This is because you use a real browser and don’t visit the website from a datacenter. Also remember that the data was captured at a different time than your own visit. Websites change, and they react differently to browsers operated by humans than to spiders from datacenters.

Redirects

Your browser is a different kind of software than a spider. Your browser can execute commands that it receives from a website. For example, JavaScript can make your browser redirect to another page or even another domain, and a META refresh tag can make your browser load another page or even another domain. A spider can’t do any of this and only follows server-side redirects.
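
The difference between the three kinds of redirect can be sketched in code. The example below is not how the spider itself is implemented; it is a rough heuristic, with hypothetical names and regular expressions, that labels a response as a server-side redirect (a 3xx status with a Location header), a META refresh, or a JavaScript redirect. Going by the description above, only the first kind is followed by a spider.

```python
import re

# Heuristic patterns, assumed for this illustration only.
META_REFRESH = re.compile(r'<meta[^>]+http-equiv\s*=\s*["\']?refresh', re.IGNORECASE)
JS_REDIRECT = re.compile(
    r'(window\.|document\.)?location(\.href)?\s*=|location\.(replace|assign)\s*\(',
    re.IGNORECASE,
)


def redirect_kind(status: int, headers: dict, body: str) -> str:
    """Label a response as 'server-side', 'meta-refresh', 'javascript' or 'none'."""
    has_location = any(name.lower() == "location" for name in headers)
    if 300 <= status < 400 and has_location:
        return "server-side"   # visible to a spider
    if META_REFRESH.search(body):
        return "meta-refresh"  # only acted on by a browser
    if JS_REDIRECT.search(body):
        return "javascript"    # only acted on by a browser
    return "none"


if __name__ == "__main__":
    print(redirect_kind(301, {"Location": "https://example.com/"}, ""))
    print(redirect_kind(200, {}, '<meta http-equiv="refresh" content="0; url=https://example.org/">'))
    print(redirect_kind(200, {}, '<script>window.location = "https://example.net/";</script>'))
```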

Access Denied

Access denied means that the spider can’t access the website. This can occur when the DNS is not configured, the server is unavailable or access is not allowed. In most cases there is no website (the DNS is not configured), but sometimes there is one. In that case the hosting provider or CMS of the website doesn’t allow the spider to visit it. Not everybody likes having spiders on their website, so some block us or mistake our visit for a DDoS attack.

Some hosting companies (such as GoDaddy) redirect their access denied pages. In this case the spider gets an access denied page that also contains a piece of JavaScript. The JavaScript redirects a browser to another location but doesn’t redirect the spider.

Placeholders

Redirects and Access Denied are very strong fields: there is no room for interpretation for a spider. Placeholders are less clear-cut. For us a placeholder is a website with only one page that contains words like “under construction” or “this domain is for sale”. Placeholders can vary per hosting company because there is no real definition of one. The method is not as exact as with Redirects and Access Denied, but it works. Some clients look more at the content of a website to see if a domain is in use. A good method, for example, is to select websites with only one indexed page and fewer than 200 words. These are typical low-content websites, and filtering on them makes it easier to find default placeholders.
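
The two checks described above can be sketched as follows. This is a simplified illustration, not Dataprovider's actual detection: the function names are hypothetical, the phrase list contains only the two examples given in the text, and each page is assumed to be available as plain text.

```python
from typing import Sequence

# Only the two example phrases from the text; a real list would be longer.
PLACEHOLDER_PHRASES = ("under construction", "this domain is for sale")


def looks_like_placeholder(pages: Sequence[str]) -> bool:
    """One page only, and that page contains a typical placeholder phrase."""
    if len(pages) != 1:
        return False
    text = pages[0].lower()
    return any(phrase in text for phrase in PLACEHOLDER_PHRASES)


def is_low_content(pages: Sequence[str], max_words: int = 200) -> bool:
    """One indexed page with fewer than max_words words."""
    return len(pages) == 1 and len(pages[0].split()) < max_words


if __name__ == "__main__":
    site = ["This website is under construction. Please come back soon."]
    print(looks_like_placeholder(site), is_low_content(site))
```

The second check does not depend on any phrase list, which is why it also catches placeholders whose wording differs per hosting company.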

Status Codes

To download a page from a website the spider requests a URL. A status code is a number of three digits that specifies the result of a requested URL. For example, if a page was found the spider receives status code 200, and if a page was not found it receives status code 404. Dataprovider has a variable called 'Status Codes'. This variable contains a list of all the status codes the spider received while indexing the website. Because a website can have multiple pages, it can also have multiple status codes. Officially there are five classes of response, and the first digit of the status code specifies the class. The classes are:

  • 1xx Informational
  • 2xx Success
  • 3xx Redirection
  • 4xx Client Error
  • 5xx Server Error
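
A minimal sketch of how the first digit maps to a class, applied to a list of codes such as the 'Status Codes' variable might contain. The function names and the sample list are assumptions made for this example.

```python
# The five official classes, keyed by the first digit of the status code.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code: int) -> str:
    """Return the official class name for a three-digit status code."""
    if not 100 <= code <= 999:
        raise ValueError(f"not a three-digit status code: {code}")
    return STATUS_CLASSES.get(code // 100, "Unknown")


def classes_seen(status_codes) -> list:
    """Return the distinct classes in a list of codes, in order of first appearance."""
    seen = []
    for code in status_codes:
        name = status_class(code)
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    # Hypothetical list of codes collected for one website.
    print(classes_seen([200, 200, 301, 404, 200]))
```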

These classes are limited and only give insight into the response of the server or the website. Because Dataprovider also offers a private crawl, we want to give insight into the response of the spider as well. This is why we added an extra (unofficial) 9xx class. When you upload a list of domains into Dataprovider for a custom crawl, our spider will index these domains. All the domains are checked and de-duplicated, and for each domain we will have a response. The field 'response' is a variable that is only available in our private section (custom crawl). The response of a domain can be:

  • Available (we received a valid response with status code 1xx or 2xx)
  • Host not found (there is no IP configured in the DNS for this domain or the IP is not responding)
  • Redirect (we received a server side redirect with status code 3xx)
  • Access Denied (the spider could not access the website and received status code 4xx, 5xx or 9xx)
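
The mapping in this list could be expressed roughly as below. This is a sketch under assumptions: the text does not say exactly how 'Host not found' is decided, so the example uses a hypothetical `host_found` flag that the caller supplies, and the function name is also made up.

```python
def response_for(status: int, host_found: bool = True) -> str:
    """Map a status code to one of the four 'response' values listed above.

    host_found is a hypothetical flag: False means no IP is configured in
    the DNS for the domain, or the IP is not responding.
    """
    if not host_found:
        return "Host not found"
    first_digit = status // 100
    if first_digit in (1, 2):
        return "Available"
    if first_digit == 3:
        return "Redirect"
    if first_digit in (4, 5, 9):
        return "Access Denied"
    raise ValueError(f"unexpected status code: {status}")


if __name__ == "__main__":
    for code in (200, 301, 403, 500, 950):
        print(code, response_for(code))
    print(response_for(0, host_found=False))
```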

Access denied is a broad term. These errors can be caused by the server on which the website is hosted (5xx), by the website itself (4xx) or by the communication between the spider and the server (9xx). A complete overview of the official status codes (1xx, 2xx, 3xx, 4xx and 5xx) can be found on this website. The (unofficial) 9xx class is added by Dataprovider and shows why the spider could not get any results from the domain. These status codes are:

  • Websumit and Webpost range
    • 901  The remote name could not be resolved.
    • 902  Unable to connect to the remote server (IP in A record does not exist).
    • 903  An unexpected error occurred when receiving a response.
    • 904  An unexpected error occurred when sending a request (no TLS support).
    • 907  The remote server returned an error (404 Not found).
    • 908  The connection was closed unexpectedly
    • 909  Could not establish a trusted relationship for the SSL/TLS secure channel
    • 910  The request was aborted (Could not create SSL/TLS secure channel)
    • 911  The server committed a protocol violation
    • 914  The operation has timed out (homepage of website)
  • Regulator range
    • 940  Hostname has more than 2,500 redirecting subdomains (infinite loop).
    • 941  Redirect target URL contains syntax error.
  • Crawler common range
    • 950  Website is disallowed by robots.txt.
    • 952  Tried to connect to the website five times and gave up to prevent hammering.
    • 953  Root document could not be parsed.
    • 954  Domain is blacklisted by Dataprovider
    • 955  Root document available but timed out (header status code 200, long wait)
    • 956  Main thread crawler timed out, crawler froze and reached time limit
    • 957  Website data was sent by crawler but not received by regulator (old 997)
    • 960  Aborted by regulator (waited for 20 minutes)
  • Crawler exception range
    • 990  IO exception in GET response.
    • 991  URI format exception in GET.
    • 992  Argument out of range exception.
    • 993  Maximum page size has been reached.
    • 994  Object disposed exception.
    • 995  Undocumented exception.
    • 996  Website redirects to an empty result page.
    • 997  Website data was sent by crawler but not received by regulator
    • 999  There is no A record (IP address) configured for the given hostname.
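
When working with exported data, it can help to look up which range a 9xx code belongs to. The sketch below simply groups the codes exactly as they are listed above; the dictionary and function names are made up for this example, and codes that are not in the list return None rather than being guessed.

```python
# Codes grouped exactly as in the list above; nothing is inferred for gaps.
SPIDER_CODE_RANGES = {
    "Websumit and Webpost": {901, 902, 903, 904, 907, 908, 909, 910, 911, 914},
    "Regulator": {940, 941},
    "Crawler common": {950, 952, 953, 954, 955, 956, 957, 960},
    "Crawler exception": {990, 991, 992, 993, 994, 995, 996, 997, 999},
}


def spider_code_range(code: int):
    """Return the range name for a listed 9xx code, or None if it is not listed."""
    for range_name, codes in SPIDER_CODE_RANGES.items():
        if code in codes:
            return range_name
    return None


if __name__ == "__main__":
    for code in (901, 940, 950, 999, 404):
        print(code, spider_code_range(code))
```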