If we envision the internet as an ocean, the Surface Web is no more than the waves on top. The Deep Web, by contrast, is the information that lies far below the surface and cannot be reached by conventional search engines. The Deep Web contains a number of Darknets, which likewise cannot be indexed by conventional search engines and can only be accessed with special browsers, such as the Tor Browser, or through dedicated proxy servers. The Tor network (Tor is short for “The Onion Router”) is by far the most popular Darknet and is associated with activities that are considered illegal under many jurisdictions.
In a study conducted by The Intelliag group in 2015, more than 1,000 websites on Tor were sampled and analyzed; the study concluded that 68% of the activities taking place on these websites were illegal. Another study, which analyzed 5,000 onion domains in 2016, concluded that the most common uses of Tor’s hidden services are buying and selling drugs, distributing and downloading pornography (child pornography as well as other genres), selling counterfeit documents and money and, to a lesser extent, buying and selling weapons.
A recently published paper presented a novel dataset of active Tor domains, named “Darknet Usage Text Addresses” (DUTA), which sorts the activities observed on the Tor network during the study’s sampling period into 26 categories covering both legal and illegal activities.
Dataset Construction Procedure:
DUTA was constructed using a customized web crawler that fetches onion domain pages through a Tor socket on port 80, i.e. over HTTP only. The crawler consists of 70 worker threads that run in parallel to fetch the HTML code of onion domain pages. For each onion domain, a worker thread goes down to the second level of the site to collect as much data as possible, rather than collecting only the content of the index page, as was done in previous works. The crawler also searched for onion domain links on a number of popular Darknet services such as onion.city.
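The crawling scheme described above can be sketched roughly as follows. This is only an illustration, not the researchers' code: the names `crawl_domain` and `crawl_all` are hypothetical, “second level” is assumed to mean the index page plus the same-domain pages it links to, and `fetch` is a placeholder for whatever function downloads a page (in a real setup it would have to route requests through Tor, which is not shown here).

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def same_domain_links(base_url, html):
    """Return absolute links found in `html` that stay on the host of `base_url`."""
    parser = LinkParser()
    parser.feed(html)
    host = urlsplit(base_url).netloc
    found = []
    for href in parser.links:
        absolute = urljoin(base_url, href)
        if urlsplit(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found


def crawl_domain(index_url, fetch):
    """Fetch the index page (level 1) and the same-domain pages it links to (level 2)."""
    pages = {}
    index_html = fetch(index_url)
    if index_html is None:  # assumption: fetch returns None for dead domains
        return pages
    pages[index_url] = index_html
    for url in same_domain_links(index_url, index_html):
        if url not in pages:
            html = fetch(url)
            if html is not None:
                pages[url] = html
    return pages


def crawl_all(index_urls, fetch, workers=70):
    """Crawl many domains in parallel, one domain per worker task."""
    index_urls = list(index_urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda url: crawl_domain(url, fetch), index_urls)
        return dict(zip(index_urls, results))
```

Passing `fetch` in as a parameter keeps the depth-limited logic separate from the network layer, so the sketch can be exercised against an in-memory dictionary of pages before it is pointed at anything live.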
The researchers managed to reach more than 250,000 onion domain addresses, yet only about 7,000 were alive; the remainder were either down or not responding. Afterwards, the HTML content of all pages fetched from each onion domain was coalesced into a single HTML file per domain. In total, 7,931 onion domains were surveyed using the customized crawler over two complete months between May and July 2016.
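The coalescing step can be illustrated with a small sketch. The article does not say how the pages were merged, so plain concatenation in crawl order is an assumption here, and the function name `coalesce_by_domain` is hypothetical.

```python
from urllib.parse import urlsplit


def coalesce_by_domain(pages):
    """Merge a {url: html} mapping into {onion_host: combined_html}.

    Pages are joined in insertion order with a newline between them
    (an assumed merge strategy, chosen for simplicity).
    """
    grouped = {}
    for url, html in pages.items():
        host = urlsplit(url).hostname
        grouped.setdefault(host, []).append(html)
    return {host: "\n".join(parts) for host, parts in grouped.items()}
```

With one combined document per domain, each onion domain becomes a single sample that can be read and labeled as a whole.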
Categorizing Activities on the Tor Network:
As shown in the table below, the researchers classified the activities taking place on the Tor network into 26 categories, collectively referred to as the DUTA classes. They labeled main classes and subclasses for both legal and illegal activities. For instance, they divided “Counterfeit Personal Identification” into three subclasses: “Passport”, “Identity Card” and “Driving License”.
As “Counterfeit” is a rather wide class of activities, it was divided into three main subclasses: “Counterfeit Personal Identification”, which refers to forged government-issued personal ID documents; “Counterfeit Money”, which refers to fake currencies; and “Counterfeit Credit Cards”, which refers to fake cards, cloned marketplace cards such as eBay and Amazon cards, and hacked payment processor accounts, e.g. PayPal and Skrill.
The “Services” class covers all legitimate services, whether provided by individuals or businesses. The “Down” class covers domains that returned errors, e.g. database errors, when the crawler attempted to access them.
The “Empty” class includes websites with very little text (fewer than 5 words), websites with images only, websites with unreadable text and ransomware pages such as Cryptolocker pages.
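The word-count part of the “Empty” rule is simple enough to express as a check. The labeling in the study was done manually, so the sketch below is only an illustration of that one criterion; the names `visible_word_count` and `looks_empty` are hypothetical, and the other criteria (unreadable text, ransomware pages) are not covered.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_word_count(html):
    """Count whitespace-separated words in the text of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())


def looks_empty(html, min_words=5):
    """True when the page has fewer than `min_words` words of text."""
    return visible_word_count(html) < min_words
```

A page consisting only of images yields a word count of zero, so it falls under the same check without special handling.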
The “Locked” class includes onion domains that require solving a CAPTCHA or entering log-in credentials before they can be accessed. The “People” class covers pages that contain works, personal information or projects. If a page fit into more than one class, it was labeled on the basis of its main content. For instance, the “Forum” label was assigned to multi-topic forums; if an entire forum was centered on a single topic, it was labeled with that topic instead, e.g. a forum about hacking was put under the “Hacking” class rather than “Forum”.
The “Marketplace” class was divided into “Black” and “White” subclasses: the “Black” subclass included sites that sold drugs, counterfeit services and weapons, while the “White” subclass included legal sites that sold items such as clothes and mobile phones.
While labeling DUTA manually, the researchers discovered several forums on the Tor network whose pages all related to a single class. For example, one forum centered on child pornography contained around 800 pages of text; it was split so that each forum page became a single sample, and these samples were then added to the dataset.