Web Crawler System Design

June 12, 2023

Table of Contents:

  • Functional Requirements
  • Non-Functional Requirements
  • Capacity Estimation
  • Web Crawler High-Level System Design Components

Functional Requirements

  1. System should download and validate pages
  2. System should be able to generate a mirror site
  3. System should identify copyright infringements
  4. System should restrict crawling per robots.txt
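For the robots.txt requirement, a minimal sketch using Python's standard-library `urllib.robotparser` (the user-agent name `MyCrawler` and the sample rules are assumptions for illustration):

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content; in a real crawler this would be fetched
# from https://<host>/robots.txt before crawling that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

def make_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from an in-memory string."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

rp = make_parser(ROBOTS_TXT)

# Check whether a URL may be crawled before fetching it.
print(rp.can_fetch("MyCrawler", "https://example.com/private/secret.html"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/public/index.html"))    # True
print(rp.crawl_delay("MyCrawler"))  # 1 second between requests to this host
```

The crawler would call `can_fetch` on every candidate URL and honor `crawl_delay` per host to stay polite.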

Non-Functional Requirements

  1. System should be highly reliable and scalable
  2. System should be highly available with eventual consistency

Capacity Estimation

Throughput
  • Crawl volume: 10M pages/day -> ~115 pages/sec
Storage
  • 100 KB/page x 10M pages/day -> ~1 TB/day
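A quick sanity check of the estimates above, in Python (the 100 KB average page size is taken from the storage line):

```python
# Throughput: 10M pages/day spread over 86,400 seconds.
pages_per_day = 10_000_000
seconds_per_day = 24 * 60 * 60          # 86,400
pages_per_sec = pages_per_day / seconds_per_day  # ~115.7 pages/sec

# Storage: 100 KB per page, 10M pages per day.
avg_page_size_kb = 100
storage_per_day_tb = pages_per_day * avg_page_size_kb / 1e9  # KB -> TB

print(f"{pages_per_sec:.0f} pages/sec")        # 116 pages/sec
print(f"{storage_per_day_tb:.1f} TB/day")      # 1.0 TB/day
```

Note that 10M pages x 100 KB is 1e9 KB, i.e. about 1 TB/day, which dominates the design of the storage layer.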
Web Crawler High-Level System Design Components