Web Crawler System Design
Table of Contents:
Functional Requirements
- System should download and validate pages (fetch sketch after this list)
- System should be able to generate a mirror of crawled sites (mirror sketch below)
- System should identify copyright infringements, e.g. copied/duplicate content (fingerprint sketch below)
- System should restrict crawling per /robots.txt
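
A minimal sketch of the fetch path, using only the Python standard library: check /robots.txt, download the page, and apply basic validation (status code and content type). The user-agent string and validation rules here are assumptions for illustration; a production crawler would add politeness delays, retries, and robots.txt caching.

```python
from urllib import robotparser, request
from urllib.parse import urlparse

USER_AGENT = "MyCrawler/1.0"  # hypothetical crawler identity

def allowed_by_robots(url: str) -> bool:
    """Check the site's /robots.txt before fetching the page."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # downloads and parses robots.txt
    return rp.can_fetch(USER_AGENT, url)

def fetch_and_validate(url: str) -> bytes | None:
    """Download a page and apply basic validation (status code, content type)."""
    if not allowed_by_robots(url):
        return None
    req = request.Request(url, headers={"User-Agent": USER_AGENT})
    with request.urlopen(req, timeout=10) as resp:
        if resp.status != 200:
            return None
        if "text/html" not in resp.headers.get("Content-Type", ""):
            return None
        return resp.read()

if __name__ == "__main__":
    page = fetch_and_validate("https://example.com/")
    print("fetched" if page else "skipped")
```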
 
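Mirroring and infringement detection can share the same write path: store each page under a directory tree that mirrors its URL, and record a content fingerprint so identical (possibly copied) content is flagged. The sketch below is a single-process illustration with assumed names (MIRROR_ROOT, seen_fingerprints); exact hashing only catches byte-identical copies, so near-duplicate techniques such as SimHash would be needed in practice.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse

MIRROR_ROOT = Path("mirror")            # hypothetical local mirror directory
seen_fingerprints: dict[str, str] = {}  # fingerprint -> first URL seen with it

def mirror_path(url: str) -> Path:
    """Map a URL to a file path under the mirror root (host + path)."""
    parts = urlparse(url)
    rel = parts.path.lstrip("/")
    if not rel or rel.endswith("/"):
        rel += "index.html"
    return MIRROR_ROOT / parts.netloc / rel

def save_and_fingerprint(url: str, content: bytes) -> str | None:
    """Write the page into the mirror tree and flag duplicate content.

    Returns the URL of an earlier page with identical content, if any,
    which serves as a crude signal for copied/infringing material.
    """
    path = mirror_path(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)

    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_fingerprints:
        return seen_fingerprints[digest]  # possible duplicate/infringement
    seen_fingerprints[digest] = url
    return None
```
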
Non-Functional Requirements
- System should be highly reliable and scalable
- System should be highly available, accepting eventual consistency
 
Capacity Estimation
Throughput
- Crawl rate: 10M pages/day ÷ 86,400 sec/day ≈ 116 pages/sec (~100 pages/sec)
 
Storage
- 100 KB/page × 10M pages/day = 10^9 KB ≈ 1 TB/day
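
A quick back-of-envelope check of the numbers above (inputs assumed from the estimates: 10M pages/day, 100 KB average page size); the yearly figure is an extrapolation, not a stated requirement.

```python
# Assumed inputs from the estimates above.
PAGES_PER_DAY = 10_000_000   # 10M pages/day
AVG_PAGE_KB = 100            # 100 KB per page

pages_per_sec = PAGES_PER_DAY / 86_400                # 86,400 seconds per day
daily_storage_tb = PAGES_PER_DAY * AVG_PAGE_KB / 1e9  # 1 TB = 10^9 KB (decimal units)

print(f"{pages_per_sec:.0f} pages/sec")               # ~116 pages/sec
print(f"{daily_storage_tb:.1f} TB/day")               # 1.0 TB/day
print(f"{daily_storage_tb * 365:.0f} TB/year")        # ~365 TB/year before compression
```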