How Semrush Created the Strongest Backlink Tool in History (2021)

About a year ago, Semrush began studying backlinks in depth and set itself a goal: to build the largest, fastest-updating, and highest-quality backlink database on the market for its customers, better than every well-known competitor. To this end:

  • The whole team worked for 18 months
  • 30,000 expert engineering hours
  • A complete migration to a new architecture (preserving all 43.8 trillion known historical links)
  • 500+ servers
  • 16,128 CPU cores
  • 245 TB of RAM for computing
  • 13.9 PB of space for storing the link database
  • 16,722 cups of coffee

The figures below show the Semrush backlink analysis database, which is still growing:

  • Backlinks: 43 trillion
  • Referring domains: 1.6 billion
  • Number of URLs crawled per day: 25 billion
Semrush backlink analysis database

How the Semrush Backlink Database Works

First, a URL queue is generated to determine which pages will be crawled, and in what order.

Second, the crawlers check these pages. Whenever a crawler recognizes a hyperlink from one of these pages to another page on the Internet, it saves that information.

Then, all of this data is stored temporarily and later dumped into the Semrush database, where every Semrush user can see it in their interface.

In the new architecture, the temporary storage step has essentially been removed, the number of crawlers has been tripled, and a set of filters has been placed in front of the queue, making the whole process faster and more efficient. The two processes are compared below:

Before

  1. Queue
  2. New links added to the queue
  3. Crawlers
  4. Temporary storage
  5. Storage
  6. Data delivered to Semrush users

Now

  1. Initial filtering
  2. Prioritization
  3. Queue
  4. New links added to the queue
  5. 3x the number of crawlers (compared to before)
  6. Additional signals taken into account (robots.txt, sitemaps, RSS, etc.)
  7. Backlinks added immediately
  8. 4x the server storage (compared to before)
  9. Semrush users see these backlinks quickly
Semrush backlink database (new & old) working principle comparison
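
To make the contrast concrete, here is a minimal structural sketch in Python of the two flows. It assumes the key differences are pre-queue filtering plus immediate writes instead of a periodic bulk dump; the helper functions (crawl, filter_urls, prioritize) are hypothetical stand-ins, not Semrush code.

```python
from collections import deque

def crawl(url):
    """Hypothetical fetch-and-parse stand-in: returns (new_urls, backlinks)."""
    return [], [f"link found on {url}"]

def filter_urls(urls):
    """Stand-in for the initial filters: drop duplicate/junk URLs."""
    return list(dict.fromkeys(urls))

def prioritize(urls):
    """Stand-in for priority ordering (sketched in the Queue section below)."""
    return urls

STORAGE = []  # the database Semrush users query

def old_pipeline(seeds):
    queue, temp = deque(seeds), []
    while queue:
        new_urls, backlinks = crawl(queue.popleft())
        queue.extend(new_urls)
        temp.extend(backlinks)     # links sit here, invisible to users,
    STORAGE.extend(temp)           # until a periodic bulk dump finishes

def new_pipeline(seeds):
    queue = deque(prioritize(filter_urls(seeds)))
    while queue:
        new_urls, backlinks = crawl(queue.popleft())
        queue.extend(prioritize(filter_urls(new_urls)))
        STORAGE.extend(backlinks)  # written immediately; no temporary stage

new_pipeline(["https://example.com"])
print(STORAGE)  # ['link found on https://example.com']
```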

Queue

There are endless pages on the Internet; many need to be crawled, but not all of them.

Some are crawled frequently, while others do not need to be crawled at all. A queue therefore determines the order in which URLs are submitted for crawling.

The problem: if too many similar or irrelevant URLs are crawled, spam increases on one hand, while on the other the number of unique referring domains discovered decreases.

Measures:

Optimize the crawl queue: filters were added to prioritize unique content and higher-authority websites and to keep out link farms.
As a result, more unique content pages are crawled and reports contain fewer duplicate links.

Some highlights (a rough sketch of this prioritization logic follows the list):

  • To protect the queue from link networks, check whether a large number of domains come from the same IP address.
    If too many domains share one IP, lower their priority; this leaves room to crawl domains from more distinct IPs.
  • Check whether there are too many URLs from the same domain.
    If there are, they will not all be crawled on the same day.
  • To ensure new pages are crawled as quickly as possible, any URL that has never been crawled before gets a higher priority.
  • Each page has its own hash code, which helps prioritize the crawling of unique content.
  • Consider how frequently new links appear on the source page.
  • Consider the authority scores of pages and domains.
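
As a rough illustration of how signals like these might combine into a single crawl priority, here is a minimal Python sketch. All weights, thresholds, and field names are assumptions made for illustration; Semrush has not published its actual scoring.

```python
from dataclasses import dataclass

@dataclass
class CandidateUrl:
    url: str
    domain: str
    ip: str
    content_hash: str          # hash of page content, if crawled before
    seen_before: bool          # has this exact URL ever been crawled?
    domain_authority: float    # assumed 0-100 authority score
    new_link_frequency: float  # how often new links appear on the source page

def priority(c, domains_per_ip, urls_per_domain, known_hashes):
    score = 0.0
    if not c.seen_before:
        score += 50                # never-crawled URLs surface quickly
    if domains_per_ip.get(c.ip, 0) > 100:        # assumed link-farm threshold
        score -= 40                # demote IPs hosting suspiciously many domains
    if urls_per_domain.get(c.domain, 0) > 1000:  # assumed per-day domain cap
        score -= 20                # spread one domain's URLs over several days
    if c.content_hash and c.content_hash not in known_hashes:
        score += 30                # unseen content hash suggests unique content
    score += c.domain_authority * 0.5
    score += c.new_link_frequency * 10
    return score

c = CandidateUrl("https://blog.example.com/post", "blog.example.com",
                 "203.0.113.7", "", False, 62.0, 0.3)
print(priority(c, {}, {}, set()))  # new URL on a decent domain -> high score
```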

Queue improvements:

  • More than 10 different factors filter out unnecessary links.
  • A new quality-control algorithm surfaces more unique, high-quality pages.


Crawler

Crawlers follow internal and external links across the Internet, searching for new pages that contain links. A page can only be discovered if at least one incoming link points to it.

Measures:

  • The number of crawlers has tripled (from 10 to 30).
  • Pages are no longer re-crawled as separate URLs when their URL parameters do not affect page content (&sessionid, UTM tags, etc.); see the normalization sketch after this list.
  • The instructions in each site's robots.txt file are read and followed more frequently.
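
The second measure amounts to URL normalization: collapsing variants that differ only in session or tracking parameters. Below is a minimal sketch using Python's standard urllib.parse; the parameter blocklist is an illustrative assumption, not Semrush's actual list.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Assumed blocklist of parameters that do not change page content.
IGNORED_PARAMS = {
    "sessionid", "phpsessid", "sid",
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
}

def normalize_url(url):
    """Strip session/tracking parameters so duplicate URLs collapse into one."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in IGNORED_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

# Both variants normalize to the same canonical URL, so only one is crawled.
assert normalize_url("https://example.com/page?utm_source=x&id=7") == \
       normalize_url("https://example.com/page?id=7")
```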

Crawler improvements:

  • More crawlers (now 30)
  • Cleaner data, with no junk or duplicate links
  • Better at finding the most relevant content
  • A crawl rate of 25 billion pages per day
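
For the robots.txt measure mentioned above, Python's standard urllib.robotparser shows what reading and honoring those directives looks like. The rules below are an inline example (parsed without a network fetch), and SemrushBot is the name Semrush publishes for its crawler.

```python
from urllib.robotparser import RobotFileParser

# Parse an example robots.txt inline so the sketch runs without network access.
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines())

print(rp.can_fetch("SemrushBot", "https://example.com/public/page"))   # True
print(rp.can_fetch("SemrushBot", "https://example.com/private/page"))  # False
print(rp.crawl_delay("SemrushBot"))  # 2 seconds between requests
```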

Storage

Storage is where Semrush keeps all of the link data for its free and paid users to view: a single place where every discovered link can be seen.

Because crawling is a continuous process, the database has to be updated whenever new data arrives. This update used to take about 2-3 weeks, and users would normally experience delays in the tool while it ran, so the problem to solve was speed.

Measures:

To improve this, Semrush rewrote the architecture from the ground up. To meet the storage requirements, the number of servers was quadrupled, and its engineers spent more than 30,000 man-hours on the work. The backlink database is now free of the old update bottleneck.

Storage improvements:

  • 500+ servers
  • 287 TB of RAM
  • 16,128 CPU cores
  • 30 PB of total storage space
  • Lightning-fast filtering and reporting
  • Instant updates, with no more temporary storage

Sign up for a free Semrush account to gain full access to the backlink analysis features.
