Deep Web, Dark Internet and Darknets


Introduction

Deep Web, Dark Internet and Darknet are three concepts that might seem similar and that are sometimes used as synonyms. In reality, they represent three very different worlds.

Terms like these may sound fascinating, perhaps because they concern subjects that are not usually talked about on the Web. More precisely, when users surf the Web they tend to assume that all the information available around the world is also available there. There are, however, areas of the World Wide Web that cannot be easily accessed by the public.

In the following sections, these three concepts are explained so that the differences between them become clear.

The Deep Web

The Deep Web is simply that part of the Internet which is not indexed by search engines.
Before beginning, we have to consider a preliminary distinction between the Surface Web and the Deep Web. [1]

Surface Web is the term used to identify the portion of the World Wide Web that is indexed by conventional search engines: in other words, it is what you can find by using general web search engines.

On the contrary, the Deep Web is defined as the portion of the World Wide Web that is not accessible through a search performed with general search engines, and it is much bigger than the Surface Web (http://www.internettutorials.net/deepweb.asp).

Beyond the trillion pages a search engine such as Google knows, there is a vast world of hidden data. This content may include:

  • the content of databases, which is accessible only by query;
  • files such as multimedia ones, images, software;
  • the content on web sites protected by passwords or other kinds of restrictions;
  • the content of "full text" articles and books;
  • the content of social networks;
  • financial information;
  • medical research.

Nowadays, we have to consider also other kinds of content, such as:

  • blog postings;
  • bookmarks and citations in bookmarking sites;
  • flight schedules.

Search engines and the Deep Web

Search engines rely on crawlers, or spiders, that wander the Web following the trails of hyperlinks that tie the Web together: in other words, spiders index the addresses of the pages they discover.
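The crawling process described above is essentially a breadth-first traversal of hyperlinks. The following is an illustrative sketch only: the page names and link structure of the tiny in-memory "web" are invented for the example, standing in for real HTTP requests. It also shows why an unlinked page stays invisible:

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A toy "web": page address -> HTML content (invented for illustration).
WEB = {
    "index.html": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "a.html": '<a href="b.html">B</a>',
    "b.html": '<a href="index.html">home</a>',
    "orphan.html": "never linked from anywhere, so never discovered",
}

def crawl(start):
    """Breadth-first crawl: index every address reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        parser = LinkExtractor()
        parser.feed(WEB.get(page, ""))
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

print(sorted(crawl("index.html")))  # orphan.html is never reached
```

Note that `orphan.html` exists in the toy web but is never indexed, exactly the linkage problem discussed below.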

The drawback of this indiscriminate crawling approach has been addressed, in a search engine like Google, by the so-called "popularity of pages": in other words, the most popular pages, those requested most frequently, get priority both for crawling and for displaying results.

In the Deep Web, spiders that find a page do not know what to do with it: they can record these pages, but they are not able to index their content. The most frequent reasons are technical barriers (database-driven content, for instance) or decisions taken by the owners of websites (the requirement to register with a password to access the site, for instance), which make it impossible for spiders to do their work. (http://websearch.about.com/od/invisibleweb/a/invisible_web.htm)

Another important reason concerns linking: if a web document is not linked from any other, it will never be discovered.

Commercial search engines have developed methods that make it possible to navigate the Deep Web and find resources on specific web servers. The Sitemap Protocol is one of them, and it allows search engines to crawl sites in a more intelligent way (http://en.wikipedia.org/wiki/Deep_Web).
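A sitemap is simply an XML file, placed on the web server, that lists the URLs the site owner wants crawled, including pages a spider could not reach by following links. A minimal illustrative fragment (the URL and dates are invented for the example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/reports/2010.html</loc>
    <lastmod>2010-06-01</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
```

In this way a database-driven page with no inbound links can still be announced to the crawler.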

Searching the Deep Web

Most users tend to use search engines in a very elementary way, and perhaps for this reason navigation of the Deep Web is limited to a small share of Web users. Despite this, the Deep Web is considered to be the "paradigm for the next generation Internet" (2005, Deep Web FAQ, par. 35). In fact, good use of the Deep Web can drastically reduce the time needed for a search and return high-quality information: consider only the fact that, if the Surface Web contains 1% of the information available on the Web, the Deep Web contains the remaining 99%.
Even if Google can be considered one of the best existing search engines, it is not able, or rather its crawlers are not able, to rank documents contained in the Deep Web: they do not know how to search Deep Web collections, as said before.
But the most serious problem with Google is that it does not always rank pages in the way in which you would rank documents. Google ranks pages according to an assumption about their popularity, but the most popular pages are not necessarily the most relevant for the user.
When a user is looking for resources related to science, technology, business and so on, it is therefore better to search the Deep Web instead of the Surface Web.

With the evolution of the Web and with the portals described below, the Deep Web will become easier to use and to surf.

In order to search Deep Web content, it is necessary to use specific portals and federated search engines. Federated search allows the simultaneous searching of multiple resources and databases. A federated search engine works in this way (http://en.wikipedia.org/wiki/Federated_search):

  • A query request is distributed to the participating search engines;
  • The query is transformed and delivered to each database and resource;
  • Results acquired from the above sources are merged;
  • The merged results are presented to the user.
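The four steps above can be sketched in a few lines. This is a toy illustration only: the two "databases", their contents, and the relevance scores are invented stand-ins for real remote sources:

```python
# Toy stand-ins for two remote databases: term -> relevance score
# (all names and scores invented for illustration).
MEDICAL_DB = {"aspirin": 0.9, "malaria": 0.7}
NEWS_DB = {"aspirin": 0.4, "election": 0.8}

def search_source(db, query):
    """Steps 1-2: deliver the (trivially transformed) query to one source."""
    q = query.strip().lower()          # query transformation
    score = db.get(q)
    return [(q, score)] if score is not None else []

def federated_search(query, sources):
    """Steps 3-4: merge results from all sources and rank them."""
    merged = []
    for db in sources:
        merged.extend(search_source(db, query))
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

print(federated_search("Aspirin", [MEDICAL_DB, NEWS_DB]))
```

In a real federated engine the sources are queried concurrently over the network, which is exactly where the time saving discussed below comes from.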

One of the most positive aspects of federated search is time saving: because searches are conducted simultaneously on different databases, a search can be performed much more rapidly.
An example of a portal is CompletePlanet (http://www.completeplanet.com/). Drawing on thousands of databases that contain Deep Web content, CompletePlanet offers the possibility to navigate these Deep Web databases.
The linked image [2] shows the home page of the portal, with the list of all the available dynamic searchable databases. It is in fact possible to go to various topic areas (medicine, art & design, science, politics, and so on) and find content that is not displayed by conventional search engines [3].
Other Deep Web search engines are (http://websearch.about.com/od/invisibleweb/tp/deep-web-search-engines.htm):

  • Clusty, a meta search engine able to combine results from different sources and return the best possible result;
  • SurfWax, which makes it possible to obtain results from different search engines at the same time and to create personalized sets of sources;
  • InternetArchive, which gives access to specific searchable topics such as live music, audio and printed materials;
  • Scirus, which is dedicated only to scientific material;
  • USA.gov, for access to information and databases from the U.S. government.

Search engines may find deep data, but their coverage sometimes returns only the less relevant content. In addition, searching for this kind of data requires a certain ability in navigating the Web. A possible technique for finding Deep Web resources without using specific portals like the ones mentioned above is, for instance, this:

  • using a search engine that covers both Surface and Deep Web resources, it is possible to search databases by adding specific search terms:
    • on Google and InfoMine: what you are looking for (database OR repository OR archive)
    • on Teoma: what you are looking for (resources OR meta site OR portal OR pathfinder) (http://techdeepweb.com/4.html).

Deep Web Technologies

Deep Web explorers usually search in one of two ways:

  • They harvest documents. In this first case, we have to consider that harvesting has both positive and negative points: it is an ideal technique when there are adequate infrastructures to make content available to users, but it is not the best one if the search interface is not intuitive and does not permit the user to retrieve a large number of documents in an easy way. The Open Archives Initiative (http://www.openarchives.org/) maintains a harvesting protocol and develops standards to facilitate the dissemination of content. Through it, it is possible to gather multiple collections and aggregate them into a single, centralized one.

  • They search collections on the fly. Deep Web Technologies' Explorit (http://www.deepwebtech.com/index.html) uses this real-time search approach, which also has its weak and positive points. The most notable positive point is that most Deep Web collections lend themselves to real-time searching even when they do not lend themselves to harvesting: since no harvesting protocol has to be implemented, the owner of a document does not have to do anything to allow that document to be searched. Under this approach, the client uses plain HTTP and the web form that initiates the query to enter the database: the content is processed and the results are displayed to the user. The weak point concerns the remote collection, and in particular the ongoing demands placed on it. Another possible negative aspect, in contrast with conventional search engines, which rank results in an ordered way, is the inability to rank documents effectively.
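The OAI-PMH harvesting protocol mentioned above works over plain HTTP GET requests carrying a `verb` parameter. A minimal sketch of building such a request is shown below; the repository base URL and the set name are hypothetical, invented for the example:

```python
from urllib.parse import urlencode

def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL, which asks a repository
    to hand over the metadata records of a collection for harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec       # harvest only one named collection
    return base_url + "?" + urlencode(params)

# Hypothetical repository endpoint, for illustration only.
url = oai_list_records_url("http://repository.example.org/oai")
print(url)
```

The harvester then fetches this URL, parses the XML response, and repeats the request with a resumption token until the whole collection has been copied into the central index.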

How big is it?

It seems that the Deep Web is 500 times bigger than the Surface Web, containing 7,500 terabytes of data and 550 billion documents (Bergman, Michael K., "The Deep Web: Surfacing Hidden Value", The Journal of Electronic Publishing 7, August 2001).

Why is the Deep Web important?

Usually we trust only what we find through search engines such as Google or Yahoo!. But sometimes it is not so easy to find there what we are really looking for and what we specifically need, especially if we are researching obscure and complicated topics.
If we compare the Web to a vast library, the problem becomes easier to understand.
In a vast library, we would not expect to find what we are looking for simply by walking to the first bookshelf: we have to go deeper in our research, searching a number of shelves and so on, especially if we are looking for a rare document or book.
It is the same with the Web: sometimes search engines are not able to help us with our search because they do not present all the information about a topic, while the Deep Web will.

To give some real examples, consider the CIA publication "The Chemical and Biological Warfare Threat", which is a public document but cannot be accessed through a search engine like Google, or the documents of the U.S. Federal Government, and thus millions of documents related to law, economics and so on. These data, although visible to everyone, are stored in databases, and for this reason cannot be reached by using conventional search engines. These holdings can be penetrated, on the other hand, by Deep Web search engines: they are able to resolve this situation, in this case caused by bureaucratic inertia, by displaying those documents.
While traditional search engines work well on HTML pages, the Deep Web gathers information contained in databases, and so it remains unseen by a lot of users (http://www.salon.com/technology/feature/2004/03/09/deep_web/index.html).

The Deep Web is a container of useful content on the Internet that is usually of high quality, very specific in nature and well managed.
Consider, for instance, the PubMed database. PubMed, which is managed by the U.S. National Library of Medicine, contains 20 million citations from sources such as MEDLINE, authoritative journals and online books, and it provides access to websites and other additional resources. The content is authored by professional writers and held in professional sources: in fact, the National Center for Biotechnology Information and the National Library of Medicine spend time and resources to make the content available and to manage it. In particular, if we enter the section called "All Databases" (see the linked image [4]), we are able to see all the resources made available by PubMed; by typing what you are looking for in the search box, you receive back all the resources dedicated to that specific topic (http://www.ncbi.nlm.nih.gov/pubmed).

So, why is the Deep Web, and especially its content, so important? Because in fields such as business and science time is precious and the marketplace is competitive, and because it is important for a student who has to write a thesis to find authoritative, high-quality material.

Darknets

The term Darknet refers to a collection of networks and technologies used to share digital content in which the participants are anonymous.
Darknets, unlike lightnets (such as LimeWire, BitTorrent and Usenet, where resources are available to everyone and linked via publicly available URIs), are anonymous and unpoliced. First, we have to consider that a darknet is not a physically separate network but a layer that works on top of the existing one: darknets lead a sort of "parallel life" with respect to the net as we know it.
Example of darknets are:

  • peer-to-peer (P2P) file sharing;
  • CD/DVD copying;
  • password sharing on email or newsgroups.
Darknets are distribution networks through which users can distribute items (programs, songs, movies, books, articles and so on) in a form that permits copying and further distribution, provided the items are interesting and connected to high-bandwidth channels (http://msl1.mit.edu/ESD10/docs/darknet5.pdf).

To be effective, a darknet needs some infrastructural and technological requirements, namely:

  • Input facilities, to put material on the darknet;
  • A network that permits the transmission of copies;
  • Devices to consume them;
  • A database to enable users to find these items;
  • A repository in which to store items.

The evolution of the Darknet [5]

In the course of its evolution, the Darknet has come under attack: legal actions are the most frequent but, even if the police can access darknets, there is little any police agency in the world can do about them; on the other hand, darknets also suffer virus and spamming attacks, which over time have become one of their most dangerous sources of damage.

In the 1990s, the act of copying was organized around groups of friends and was mostly limited to music and computer programs transferred through sneakernets [6], i.e. the physical transport of storage media (http://en.wikipedia.org/wiki/Sneakernet). The weak points of sneakernets were their slowness and the lack of an effective search engine.

From 1998, thanks to technological advances and the increased power of computers, a new form of darknet developed: the Internet began to permit a new kind of connection with centralized services. Central Internet servers with collections of MP3 files, and more generally centralized storage and search, were among the innovations of the new darknet.
Centralized networks have the weakness of not being robust against attack, and this aspect in particular led to innovations in peer-to-peer networking and file sharing.
In chronological order, the principal systems to consider are:

  • Napster, in which the most critical components remained centralized;
  • Gnutella, that uses a distributed database and does not depend on centralized services.

A system like Gnutella shows two main weak points that need to be solved:

  • Free riding, a typical behavior in Gnutella whereby users download items without sharing items in return. This can be considered a sort of "social dilemma" in which users have to decide whether to contribute to the network or simply maximize their own personal good by free riding. In this case, the main risk is that of a unidirectional system that reproduces the vulnerabilities of centralized systems;
  • Lack of anonymity, in the sense that users of Gnutella are not anonymous, unlike those of, for instance, Freenet.

These two aspects make such darknets assailable, vulnerable especially with regard to item storage and search infrastructure, by exploiting the lack of anonymity.

Subsequent developments have been made possible by other technologies regarding injection, distribution, storage and so on, and by the incorporation of cryptography (http://msl1.mit.edu/ESD10/docs/darknet5.pdf).

The example of Freenet

Freenet (http://freenetproject.org/index.html) is free software through which a user can publish and obtain information on the Internet in total freedom. Freenet moves beyond the two limits described above: the network is decentralized (as said before, without decentralization it would be vulnerable to attack) and publishers and consumers are anonymous.
Communications are encrypted (encryption being the method by which information is made unreadable) and are routed through other nodes to make it difficult to determine who is requesting information.
Users contribute to the network by giving part of their hard drive (the data store) for storing files and, unlike in other peer-to-peer networks, users do not have control over what is stored in their data store.
In order to increase the "security" of Freenet, a user has to manually add the people he or she trusts to the connections list. Nodes restrict their connections to nodes they trust, so each participant communicates only with trusted peers and reveals his identity only to peers he already trusts.
These developments, which are recent with respect to previous versions of Freenet, have made Freenet one of the best examples of a darknet: in particular, the fact that a user can be connected only to users he or she knows drastically reduces the vulnerability of the network. This respect for the privacy of publishers and consumers is very important because limiting connections to trusted nodes, instead of allowing direct connections between arbitrary nodes, avoids vulnerability problems. In this way, it is difficult for governments to block it.

How does a darknet like Freenet work

  • Inserting and requesting data: the first assumption is that each piece of data is associated with a key; the second is that each node has a routing table (a data structure, in the form of a table, that lists the routes to particular network destinations). The construction of the routing tables needs only to be consistent (http://freenetproject.org/papers/freenet-0.7.5-paper.pdf).
    • Search: to search for a document that matches a key, one establishes a route for that key. Each node checks whether the searched document is present: if the document is found, the route terminates and the document is returned to the previous node in the route; if it is not found, the caches of the neighbors are searched next.
    • Insert: this process is similar to the search. A route is established and, if no document matching the key is found, the new document is proxied up the route; if a matching document is found, it is returned down the path.

During both processes, nodes store data in their caches. When proxying and caching documents, a node also tests their authenticity.

  • Encryption: the encryption of data follows so-called "symmetric encryption": the same key is used for encryption and decryption.
  • Routing in the network: the first thing to assume is that the structure of the nodes and the links between them is fixed and cannot be changed for routing purposes. The developers treat the network as a small world in which each node has an identity to be routed on. By giving each node an identity and keeping a routing table at each node, queries can be routed to reach every node, so that the best possible route is always used. The routing performed foresees that, at each step, "the desirability of neighbors is ordered by the proximity of their identities to the route key" (http://freenetproject.org/papers/freenet-0.7.5-paper.pdf).
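The greedy, proximity-based routing just described can be sketched in a few lines. This is an illustrative toy under simplifying assumptions: the node identities, the tiny topology, and the circular distance metric are invented for the example and are not taken from the Freenet implementation:

```python
# Node identities and keys are numbers on a circle [0, 1); all values
# below are invented for illustration.

def distance(a, b):
    """Circular distance between two identities/keys on [0, 1)."""
    d = abs(a - b)
    return min(d, 1 - d)

# Toy network: node identity -> identities of its trusted neighbours.
NETWORK = {
    0.1: [0.3, 0.9],
    0.3: [0.1, 0.5],
    0.5: [0.3, 0.7],
    0.7: [0.5, 0.9],
    0.9: [0.7, 0.1],
}

def route(start, key, max_hops=10):
    """Greedily forward a query towards the neighbour whose identity
    is closest to the route key, recording the path taken."""
    current, path = start, [start]
    for _ in range(max_hops):
        best = min(NETWORK[current], key=lambda n: distance(n, key))
        if distance(best, key) >= distance(current, key):
            break                    # no neighbour is closer: stop here
        current = best
        path.append(current)
    return path

print(route(0.1, 0.6))  # the query moves step by step towards the key
```

Each hop moves strictly closer to the key, so in a well-connected small-world topology the query converges on the node responsible for the document.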

Frost

Frost is a Freenet 0.7 client that provides messaging like that of newsgroups, private encrypted messages, the possibility to download and upload files, and a sharing system (http://jtcfrost.sourceforge.net/).

Other darknets

Waste (http://waste.sourceforge.net/index.php) is a software and protocol that allows communication among a small group of users (on the order of 10-50 nodes) in total security, thanks to anonymity and encryption.
It provides different services:

  • Instant messaging;
  • Group chat;
  • File sharing;
  • File transfer.


Tor (http://www.torproject.org/index.html.en) is a network that allows users to improve their privacy while using a network, and it provides different applications that permit information to be shared over public networks with a respect for privacy that other networks do not grant.
Through Tor it is also possible to publish websites without revealing their location.
The idea behind Tor is that of a hard-to-follow route instead of a direct one: data packets take a random pathway through several relays, making it difficult to understand where the data come from. To create this private pathway, the user's client builds a circuit of encrypted connections through relays on the network, and the complete path remains "secret". The same circuit is used for about ten minutes; after this time, requests are given a new one.
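The layered ("onion") encryption behind such a circuit can be illustrated with a toy sketch. Here a simple XOR with a per-relay key stands in for real cryptography, and the three-relay circuit and its key names are invented for the example; Tor itself uses real ciphers and negotiated session keys:

```python
# Toy sketch of onion routing: the client wraps the payload in one layer
# of encryption per relay; each relay then peels exactly one layer.

def xor(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: XOR the data with a repeating key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# Hypothetical circuit of three relays, each sharing a key with the client.
CIRCUIT = [b"key-entry", b"key-middle", b"key-exit"]

def wrap(payload: bytes) -> bytes:
    """Client side: encrypt for the exit relay first, entry relay last."""
    for key in reversed(CIRCUIT):
        payload = xor(payload, key)
    return payload

def unwrap(onion: bytes) -> bytes:
    """Network side: each relay in turn removes the layer meant for it."""
    for key in CIRCUIT:
        onion = xor(onion, key)      # one relay peels one layer
    return onion

message = b"hello onionland"
assert unwrap(wrap(message)) == message
```

No single relay ever sees both the sender and the plaintext destination: the entry relay knows the client but sees only an encrypted blob, while the exit relay sees the payload but not who sent it.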
Onionland is the collection of all the anonymous websites that exploit Tor network.

I2P (http://www.i2p2.de/index.html) is another anonymous network for communicating securely and also for publishing websites anonymously. Data are wrapped in many layers of encryption, and the network is distributed and dynamic.
Unlike other networks, I2P does not provide anonymity only for the originator of the communication: it allows anonymous communication with respect to both sender and receiver.

Dark Internet

The term Dark Internet, also called dark address space, refers to the parts of the network on the Internet that cannot be reached (http://compass.com.pa/index.php?option=com_content&view=article&id=466&Itemid=4429).
The Dark Internet should not be confused with the Deep Web or with darknets: as said before, a darknet is a network used to share digital content among anonymous participants, while the Deep Web consists of hard-to-find websites and information that are beyond the reach of search engines. Dark Internet is a term used to identify those spaces of the Internet that cannot be accessed by conventional means.

Causes

There are different reasons recognized as causes of the birth of the Dark Internet. Sometimes the problems are related to the allocation of resources in the chaotic course of the Internet's growth. One cause of dark addresses is associated with the military network known as MILNET. MILNET was the part of ARPANET assigned to the unclassified traffic of the U.S. Department of Defense. These networks are sometimes as old as ARPANET itself, and for this reason they have not always been incorporated into the current Internet architecture. So military sites sometimes fall into shadow because they occupy neglected addresses. (http://compass.com.pa/index.php?option=com_content&view=article&id=466&Itemid=442)
But the U.S. military system is not the only victim of the phenomenon of dark address space: a study conducted by researchers at Arbor Networks (http://www.arbornetworks.com/dmdocuments/dark_address_space.pdf) demonstrated that 5% of the Internet's routable addresses have no global connectivity. This fact contradicts the idea that the Internet is a completely interconnected graph: in fact, commercial strategies, peering disputes, network failures and router misconfigurations cause the "partitioning" of the Internet, thus permitting the creation of this "dark address space".
To be more precise, the Internet is a system of interconnected computer networks and each computer has its own address: routers know where to send data precisely because they know the addresses of the computers. Sometimes, however, these addresses disappear and become unreachable. This happens mainly for two reasons:

  • either because routers are not configured in the right way;
  • or because hackers are attacking the address.

Attacks

It sometimes happens that hackers hijack routers for illegal purposes, causing abnormal functioning of the Internet. In particular, DoS (Denial of Service, http://en.wikipedia.org/wiki/Denial-of-service_attack) attacks have caused various problems on the Internet and are usually carried out by "botnets": many different computers, infected by viruses and outside the control of their owners, access a single Internet system and take part in the attack. Usually these attacks are launched from unreachable IP addresses, so those responsible cannot be found.
U.S. CERT (http://www.cert.org/) uses a program called "Einstein" to monitor dark address space on the Internet, gaining insight into techniques and attacks: this program, in fact, provides information about dark address activity.

The example of Estonia

The cyberattack on Estonia is one of the best examples of a cyber attack: it swamped a great number of websites of Estonian organizations, including banks, ministries, newspapers and the army. This "cyber-warfare" became famous because it was the first attack that involved the entire security system of a nation.

The example of WikiLeaks

The WikiLeaks domain, wikileaks.org, redirects to a mirror site (a copy of a website hosted at a remote location), mirror.wikileaks.info, with the IP address 92.241.190.202, an address that is a refuge for criminals and fraudsters, with a long list of examples tied to criminality. The IP address is in fact registered to a well-known blackhat-hosting provider, a company called Heihachi Ltd, which breaks into networks in order to create and launch viruses and attacks, and whose IP space sees only spamming, phishing and other cybercriminal activities.

The example of Russian Business Network

Another important case to consider is that of the Russian Business Network, a cybercrime organization specialized, as an Internet service provider, in many different illegal activities such as child pornography, phishing, spam and malware distribution (http://en.wikipedia.org/wiki/Russian_Business_Network).
The Russian Business Network can be considered a dark zone of the Internet: it has an illegal identity, and its figures are anonymous, as are the associated websites, which have anonymous addresses and dead e-mail contacts.
This organization has created a nebulous network to hide its illegal activities and to prevent the authorities from stopping them. Unfortunately, the Russian Business Network has become one of the favored dark zones in which to host cybercrime behaviors and activities, thanks to its guarantee of "bulletproof" hosting (http://www.bizeul.org/files/RBN_study.pdf).

In 2007, the Russian Business Network started to catch the unwanted attention of bloggers and the U.S. media: for this reason it tried to drop out of view, but the computer security company Trend Micro saw some activities reappear and suspected that these could be attempts to relocate the RBN's services (http://www.theage.com.au/news/security/the-hunt-for-russias-web-crims/2007/12/12/1197135470386.html).

Terrorism

In 2008, Georgia declared its intention to break away from Russia and its government. The Russian Business Network did not appreciate this, illegally entered the Georgian government website, and replaced photos of the President with others in which he looked like Hitler: the attack ended with the knocking out of Georgia's Internet and most of its electronic communications (http://www.computerworld.com/s/article/9112201/Cyberattacks_knock_out_Georgia_s_Internet_presence).
