System Design Notes | Yuchen's Blog

A simple note for system design.

From System Design Interview by Alex Xu

K-V Store
Unique ID Generator
URL Shortener
- API Design
  - Redirecting
  - Shortening
- Possible process
Web Crawler
Notification System
News Feed System
- requirements
- high-level design
CHAT SYSTEM
SEARCH AUTOCOMPLETE SYSTEM
DESIGN YOUTUBE / video sharing platform
DESIGN GOOGLE DRIVE

K-V Store

Design scopes

The size of a key-value pair is small: less than 10 KB.
Ability to store big data.
High availability: The system responds quickly, even during failures.
High scalability: The system can be scaled to support large data set.
Automatic scaling: The addition/deletion of servers should be automatic based on traffic.
Tunable consistency.
Low latency.

CAP Theorem

Check the detailed explanation if not familiar with CAP theorem.

CP (consistency and partition tolerance) systems: a CP key-value store supports consistency and partition tolerance while sacrificing availability.
AP (availability and partition tolerance) systems: an AP key-value store supports availability and partition tolerance while sacrificing consistency.
CA (consistency and availability) systems: a CA key-value store supports consistency and availability while sacrificing partition tolerance. Since network failure is unavoidable, a distributed system must tolerate network partition. Thus, a CA system cannot exist in real-world applications.

System Components

Data partition: split the data into smaller partitions and store them in multiple servers
- Distribute data across multiple servers evenly
- Minimize data movement when nodes are added or removed
- Using consistent hashing to distribute data
  - Automatic scaling: servers could be added and removed automatically depending on the load.
  - Heterogeneity: the number of virtual nodes for a server is proportional to the server capacity. For example, servers with higher capacity are assigned with more virtual nodes.
Data replication: data must be replicated asynchronously over N servers, where N is a configurable parameter
- after a key is mapped to a position on the hash ring, walk clockwise from that position and choose the first N servers on the ring to store data copies.
- With virtual nodes, the first N nodes on the ring may be owned by fewer than N physical servers. To avoid this issue, we only choose unique servers while performing the clockwise walk logic.
Consistency: Using Quorum consensus to guarantee consistency
- Definitions:
  - R: read quorum of size R
  - W: write quorum of size W
  - N: the total number of nodes
  - If R = 1 and W = N, the system is optimized for a fast read.
  - If W = 1 and R = N, the system is optimized for fast write.
  - If W + R > N, strong consistency is guaranteed (Usually N = 3, W = R = 2).
- consistency model:
  - Strong consistency
  - Weak consistency: subsequent read operations may not see the most updated value
  - Eventual consistency: this is a specific form of weak consistency. Given enough time, all updates are propagated, and all replicas are consistent. Adopted by Dynamo and Cassandra
Inconsistency resolution: Versioning and vector clocks
- Versioning: treating each data modification as a new immutable version of data
- Vector clocks: a [server, version] pair associated with a data item. It can be used to check
  - represented by D([S1, v1], [S2, v2], …, [Sn, vn]), where D is a data item, v1 is a version counter, and s1 is a server number, etc.
  - If data item D is written to server Si
    - Increment vi if [Si, vi] exists
    - Otherwise, create a new entry [Si, 1]
  - set a threshold for the length, if it exceeds the limit, the oldest pairs are removed
Handling failures
- Failure detection: gossip protocol
  - Each node maintains a node membership list, which contains member IDs and heartbeat counters.
  - Each node periodically increments its heartbeat counter
  - Each node periodically sends heartbeats to a set of random nodes, which in turn propagate to another set of nodes
  - Once nodes receive heartbeats, membership list is updated to the latest info
  - If the heartbeat has not increased for more than predefined periods, the member is considered as offline
- temporary failure handling
  - replica set: a set of nodes that store the same data. Any strict quorum standards should be performed on the replica set.
  - sloppy quorum: Instead of enforcing the quorum requirement, the system chooses the first W healthy servers for writes and first R healthy servers for reads on the hash ring. Offline servers are ignored.
  - hinted handoff: If a server is unavailable due to network or server failures, another server will process requests temporarily. When the down server is up, changes will be pushed back to achieve data consistency.
- permanent failure handling
  - anti-entropy protocol: Merkle tree
  - check the detailed structure of Merkle tree if needed
  - Merkle tree is used for inconsistency detection and minimizing the amount of data transferred
System architecture diagram
- Clients communicate with the key-value store through simple APIs: get(key) and put(key, value).
- A coordinator is a node that acts as a proxy between the client and the key-value store.
- Nodes are distributed on a ring using consistent hashing.
- The system is completely decentralized so adding and moving nodes can be automatic.
- Data is replicated at multiple nodes.
- There is no single point of failure as every node has the same set of responsibilities.
Write path
- write request is persisted on a commit log file
- Data is saved in the memory cache
- When the memory cache is full or reaches a predefined threshold, data is flushed to SSTable on disk
Read path

Unique ID Generator

Requirements

IDs must be unique.
IDs are numerical values only.
IDs fit into 64-bit.
IDs are ordered by date.
Ability to generate over 10,000 unique IDs per second.

Probable solutions

Multi-master replication

master

uses the databases’ auto_increment feature
increase it by k, where k is the number of database servers in use
Hard to scale with multiple data centers
IDs do not go up with time across multiple servers.
It does not scale well when a server is added or removed.

UUID

let each web server contain an UUID generator
Pros:
- Generating UUID is simple. No coordination between servers is needed so there will not be any synchronization issues.
- The system is easy to scale because each web server is responsible for generating IDs they consume. ID generator can easily scale with web servers.
Cons:
- IDs are 128 bits long, but our requirement is 64 bits.
- IDs do not go up with time.
- IDs could be non-numeric.

Ticket server

ticket_server

Single point of failure.

Snowflake

snowflake

Sign bit: 1 bit. It will always be 0. This is reserved for future uses. It can potentially be used to distinguish between signed and unsigned numbers.
Timestamp: 41 bits. Milliseconds since the epoch or custom epoch. We use Twitter snowflake default epoch 1288834974657, equivalent to Nov 04, 2010, 01:42:54 UTC.
Datacenter ID: 5 bits, which gives us 2 ^ 5 = 32 datacenters.
Machine ID: 5 bits, which gives us 2 ^ 5 = 32 machines per datacenter.
Sequence number: 12 bits. For every ID generated on that machine/process, the sequence number is incremented by 1. The number is reset to 0 every millisecond.

Furhter problems

Clock synchronization: Network Time Protocol (NTP)
Section length tuning
High availability

URL Shortener

API Design

URL shortening: POST
URL redirection: GET

Redirecting

changes the short URL to the long URL with 301 redirect

shorten

301 vs 302
- 301: permanent redirect, subsequent requests for the same URL will not be sent to the URL shortening service.
- 302: temporary redirect, subsequent requests for the same URL will be sent to the URL shortening service first.
- For server load considerations, 301 is preferred.
- For analytics, 302 is preferred.

Shortening

Using Hashing

Each longURL must be hashed to one hashValue.
Each hashValue can be mapped back to the longURL.

Using Database

memory resources are limited and expensive
relational database

Handle collisions

most existing Hash functions produce too-long hash values
shorten hash results might have collisions
How to fix:
frequently check the database for collisions can be expensive
- use bloom filters

Use Base62 conversion

check the detailed explanation if not familiar with Base62 conversion

comparing

Possible process

Assuming the input longURL is: https://en.wikipedia.org/wiki/Systems_design
Unique ID generator returns ID: 2009215674938.
Convert the ID to shortURL using the base 62 conversion. ID (2009215674938) is converted to “zn9edcu”.
Save ID, shortURL, and longURL to the database

Web Crawler

Possible Requirements

used for search engine indexing
1 billion pages per month
HTML only
ignore duplicate pages
store page content up to 5 years

High-level design

web_crawler

Pick Seed URLs

use seed URLs as a starting point for the crawl process
based on locality
based on topics

URL Frontier

component that stores URLs to be downloaded is called the URL Frontier: FIFO

HTML Downloader

download web pages

DNS resolver

URL must be translated into an IP address before downloading

Content Parser

parse the content of the web page, seperate component from crawling process

Content Seen Detector

Using Hash values to compare the content of the web page

Content Storage

disk + memory

URL Extractor

URL filter

URL Seen Detector

Bloom filter / Hash table

Way of Traverse

BFS with FIFO
- Most links from the same web page are linked back to the same host. If download web pages in parallel, servers will be flooded with requests. This is considered as “impolite”.
- Standard BFS does not take the priority of a URL into consideration. Might need to prioritize URLs according to their page ranks, web traffic, update frequency, etc.

URL Frontier

A URL frontier is a data structure that stores URLs to be downloaded.

ensure politeness, URL prioritization, and freshness

Politeness: avoid sending too many requests to the same host
- A delay can be added between two download tasks
- maintain a mapping from website hostnames to download (worker) threads
- Queue router: It ensures that each queue (b1, b2, … bn) only contains URLs from the same host.
- Mapping table: It maps each host to a queue.
- FIFO queues b1, b2 to bn: Each queue contains URLs from the same host.
- Queue selector: Each worker thread is mapped to a FIFO queue, and it only downloads URLs from that queue. The queue selection logic is done by the Queue selector.
- Worker thread 1 to N. A worker thread downloads web pages one by one from the same host. A delay can be added between two download tasks.
Priority:
- PageRank, website traffic, update frequency
- Prioritizer: It takes URLs as input and computes the priorities
- Queue selector: Randomly choose a queue with a bias towards queues with higher priority
- Queues with high priority are selected with higher probability
Freshness:
- Recrawl based on web pages’ update history.
- Prioritize URLs and recrawl important pages first and more frequently.

HTML Downloader

Robots Exclusion Protocol

robots.txt: standard used by websites to communicate with crawlers
Before attempting to crawl a web site, a crawler should check its corresponding robots.txt first and follow its rules.

Performance optimization

Distributed crawl: crawl jobs are distributed into multiple servers, URL space is partitioned into smaller pieces
Cache DNS Resolver
- Once a request to DNS is carried out by a crawler thread, other threads are blocked until the first request is completed.
- Maintaining our DNS cache to avoid calling DNS frequently is an effective technique for speed optimization.
- Our DNS cache keeps the domain name to IP address mapping and is updated periodically by cron jobs.
Locality
- Distribute crawl servers geographically.
- When crawl servers are closer to website hosts, crawlers experience faster download time.
Short timeout
- a maximal wait time is specified

Extensibility

extension

PNG Downloader module is plugged-in to download PNG files.
Web Monitor module is added to monitor the web and prevent copyright and trademark infringements.

Notification System

Main Components

Different types of notifications
Contact info gathering flow
Notification sending/receiving flow

iOS push notification

provider
- Device token: a unique identifier for the device
- Payload: a JSON dictionary that contains a notification’s payload.
APNs
iOS device

Android push notification

Instead of using APNs, Firebase Cloud Messaging (FCM) is commonly used to send push notifications to android devices.

SMS message

Third-party SMS services:

Twilio
Nexmo

Email

Third-party email services:

SendGrid
Amazon SES

Contact info gathering flow

contact_gater

for relational database, use user table, user_id as primary key, and device table, user_id as foreign key
a user can have multiple devices

Notification sending flow

notification better_notification

service: a micro-service, a cron job, or a distributed system that triggers notification sending events
Notification system
- provides APIs
- basic validation of contact info
- send notifications to message queues
- Query the database or cache to fetch data needed to render a notification
Third-party services
- extensibility matters
- service availability problems
Cache & Database: User info, device info, notification templates, Settings
Message queues: decouple & buffer
Worker: simple logic, send notifications

Reliability

notification should be never-lost
- database persistence: workers can hold notification log database
- retry mechanism
exactly-once: can’t ensure exactly-once delivery
- dedupe mechanism: use Event ID to ensure idempotency

Addtional considerations

template reusing
notification settings
- give users fine-grained control over notification settings
- use setting table to store user settings
rate limiting
retry mechanism
security
- appKey
- appSecret
monitor queue
- total number of queued notifications should be controlled
Event tracking
- “bury the point”

final_design

News Feed System

requirements

user can publish a post and see her friends’ posts on the news feed page
feed is sorted mainly by reverse chronological order
10 million DAU
contain media files, including both images and videos

high-level design

Feed publishing: a user publishes a post, corresponding data is written into cache and database. A post is populated to her friends’ news feed.
Newsfeed building: assume the news feed is built by aggregating friends’ posts in reverse chronological order.

Feed publishing

feed

Post service: persist post in the database and cache.
Fanout service: push new content to friends’ news feed. Newsfeed data is stored in the cache for fast retrieval.
Notification service: inform friends that new content is available and send out push notifications.

feed_final

fanout service
- fanout on write (also called push model): A new post is delivered to friends’ cache immediately after it is published
  - Pros:
    - news feed is generated in real-time and can be pushed to friends immediately
    - Fetching news feed is fast
  - Cons:
    - hotkey problem: a user has a large number of friends, fetching the friend list and generating news feeds for all of them are slow and time consuming
    - For inactive users or those rarely log in, pre-computing news feeds waste computing resources
- fanout on read (also called pull model): The news feed is generated during read time. This is an on-demand model.
  - Pros:
    - For inactive users or those who rarely log in, fanout on read works better because it will not waste computing resources on them
    - Data is not pushed to friends so there is no hotkey problem
  - Cons:
    - Fetching the news feed is slow as the news feed is not pre-computed
- Pratical Solution: Hybrid model
  - push model for the majority of users
  - pull model for celebrity users or those with a large number of friends
  - Consistent hashing is a useful technique to mitigate the hotkey problem as it helps to distribute requests/data more evenly

Newsfeed building

newsfeed

Newsfeed service: news feed service fetches news feed from the cache.
Newsfeed cache: store news feed IDs needed to render the news feed.

retrieval

Newsfeed cache

like <post_id, user_id> mapping table

CHAT SYSTEM

Requirements

support both 1 on 1 and group chat
support 50 million daily active users (DAU)
group member limit: maximum of 100 people
only supports text messages
text length should be less than 100,000 characters long
chat history persistence
Multiple device support

High-level design

Use basic client-server architecture:

Server
- receives messages
- relays messages to the intended recipients
- store if not online

choice of network protocols

Sender side

Uses HTTP to initially send messages to the server.

use keep-alive header
- maintain a persistent connection
- reduces the number of TCP handshakes

Receiver side

Polling: Client periodically asks the server if there are any new messages.
- costly
- consume server resources and bandwidth
Long polling: a client holds the connection open until there are actually new messages available or a timeout threshold has been reached
- Sender and receiver may not connect to the same chat server. HTTP based servers are usually stateless. If you use round robin for load balancing, the server that receives the message might not have a long-polling connection with the client who receives the message.
- A server has no good way to tell if a client is disconnected.
- It is inefficient. If a user does not chat much, long polling still makes periodic connections after timeouts
- WebSocket: bi-directional and persistent, port 80 or 443
  - starts its life as a HTTP connection and could be “upgraded” via some well-defined handshake to a WebSocket connection

General Design

Note: everything else does not have to be WebSocket (sign up, login, user profile, etc)

general_design_for_chat

service discovery: give the client a list of DNS host names of chat servers that the client could connect to
chat service: only stateful service
- stateful because each client maintains a persistent network connection to a chat server

adjusted version without single-server problem:

adjusted_version

chat servers: message sending/receiving
presence servers: online/offline status
API servers: everything stateless
- user login, signup, change profile, etc
KV store: chat history, for scenarios like offline user comes online

Storage

Relational or NOSQL database?

check data types
- generic data: user profile, setting, user friends list
  - relational database
- chat history data:
  - enormous
  - only recent data is frequently accessed
  - users might use features require random access (search, view your mentions, jump to specific messages)
  - Read Write Ratio (R/W) is 1:1 for 1 on 1 chat
  - Use KV store for chat history
    - easy horizontal scaling
    - very low latency to access data
    - Relational databases do not handle long tail of data well
    - Most existing services use KV store for chat history

Data Model

message

group_message

explanation about message ID:

IDs must be unique
IDs should be sortable by time, meaning new rows have higher IDs than old ones

approach:

global ID generation: snowflake
local sequence number generator: because maintaining message sequence within one-on-one channel or a group channel is sufficient

Deep Dive

Service discovery

recommend the best chat server for a client

Apache Zookeeper
- use znodes to store chat server IP addresses
- each chat server registers its IP address to Zookeeper
- check more details about zookeeper in Common Sense for Distributed System

explanation_server_discovery

Message flows

message_flow

Message synchronization across multiple devices

each device keeps track of the last message ID it has received

fetch messages with IDs greater than the last message ID from KV store through chat server

small group chat flow

small_group small_group_r

message from User A is copied to each group member’s message sync queue
- simplifies message sync flow as each client only needs to check its own inbox
- group number is small, storing a copy in each recipient’s inbox is not too expensive
- for groups with a lot of users, storing a message copy for each member is not acceptable
Each recipient has an inbox (message sync queue) which contains messages from different senders.

Online presence

presence server: managing online status and communicating with clients through WebSocket

log_in log_out

User login
- After a WebSocket connection is built between the client and the real-time service
- save online status and timestamp
User logout
User disconnection
- Updating online status on every disconnect/reconnect would make the presence indicator change too often because of the unstable network connection
- use heartbeat instead: from client to server
- alike low-pass filter
Online status fanout
- publish-subscribe model
- The communication between clients and servers is through real-time WebSocket

SEARCH AUTOCOMPLETE SYSTEM

search autocomplete system / design top k / design top k most searched queries

Basic Requirements

matching only supported at the beginning of the query
5 suggestions determined by popularity
no spell check or correction
suppose all queries are lowercase alphabetic characters
10 million DAU
Fast response time
relevant results
Scalable: handle high traffic

High-level design

Data gethering: gathers user input queries and aggregates them in real-time ( not pratical for large data set )
Query Service: Given a search query or prefix, return 5 most frequently searched terms

Deep Dive

Trie / prefix tree

check the detailed explanation if not familiar with Trie

leetcode: 208 Implement Trie (Prefix Tree)

a trie example with frequency count in leaf nodes:

time complecity for get top k most frequent queries: O(p) + O(c) + O(clogc)

O(p): find the prefix
O(c): c is the number of children of the prefix node, Traverse the subtree from the prefix node to get all valid children
O(c * log(c)): sort the children and get top k

trie

Step 1: Find the prefix node “tr”.
Step 2: Traverse the subtree to get all valid children. In this case, nodes [tree: 10], [true: 35], [try: 29] are valid.
Step 3: Sort the children and get top 2. [true: 35] and [try: 29] are the top 2 queries with prefix “tr”.

problem: worst-scenario need to traverse the whole tree

optimization:

Limit the max length of a prefix
cache top search queries at each node
- each node keeps track of the top k most frequent queries
- example:
- time complexity: O(1)

Data gathering

realtime data gathering is not pratical

Users may enter billions of queries per day. Updating the trie on every query significantly slows down the query service.
Top suggestions may not change much once the trie is built.

data_gathering

Analytics Logs: stores raw data about search queries, append-only, not indexed
Aggregators: aggregate search queries on the demand of real-time
Aggregated data: queries with frequency count and timestamp
Workers: build the trie data structure and store it in Trie DB asynchronously
Trie Cache: distributed cache system that keeps trie in memory for fast read, takes weekly snapshots of Trie DB
Trie DB: persistent storage for trie
- Document store: store the serialized snapshots of the trie periodically into database like MongoDB
- KV store: transform the trie into a hash table form:

Query Service

query

use ajax to send request that won’t refresh the page
use browser cache to store autocomplete suggestions (with cache-control header: max-age, private)
data sampling: only sample a small percentage of queries

Trie Operation

Update:
- Update the trie weekly. Replace the old trie with the new one.
- Update individual trie node directly. only do this when trie is relatively small. also don’t forget to update all the ancestors
Delete:
- add a filter layer in front of the Trie Cache to filter out unwanted suggestions.
- Unwanted suggestions are removed physically from the database asynchronically so the correct data set will be used to build trie in the next update cycle.

Scale the storage

Sharding problem: uneven distribution of words

use a shard map manager to manage the shards: use a lookup database for storing trie

shard

Further questions:

multiple languages
- use unicode characters
different search queries across regions
- build different tries and store in CDNs for response time
real-time search queries
- Reduce the working data set by sharding
- Change the ranking model and assign more weight to recent search queries
- Stream processing data

Basic Requirements

users can upload videos and watch videos
5 million DAU
average daily spent time: 30 minutes
international user support
accepts most of video resolutions and formats
encryption required
maximum video size: 1 GB
utilize existing cloud service
Ability to upload videos fast
Smooth video streaming
Ability to change video quality
Low infrastructure cost
High availability, scalability, and reliability requirements
Clients supported: mobile apps, web browser, and smart TV

High-level design

video uploading flow
- Metadata DB: Video metadata are stored in Metadata DB.
- Metadata cache: For better performance, video metadata and user objects are cached
- Original storage: A blob storage system is used to store original videos.
- Transcoding servers: converting a video format to other formats (MPEG, HLS, etc), which provide the best video streams possible.
- Transcoded storage: It is a blob storage that stores transcoded video files
- CDN: Videos are cached in CDN.
- Completion queue: MQ that stores information about video transcoding completion events
1. flow for upload actual video
  - API servers inform the client that the video is successfully uploaded and is ready for streaming
2. update metadata
Video streaming flow
- streaming protocols
  - MPEG–DASH: Dynamic Adaptive Streaming over HTTP, Moving Picture Experts Group
  - Apple HLS. HLS stands for “HTTP Live Streaming”
  - Microsoft Smooth Streaming.
  - Adobe HTTP Dynamic Streaming (HDS).
- Videos are streamed from CDN directly

Design deep dive

Video transcoding

container: like a basket that contains the video file, audio, and metadata.
codec: an algorithm that compresses and decompresses video and audio data, like H.264, VP9, etc.
DAG model:
Video transcoding architecture

video_trans_arch

preprocessor:
- Video splitting: Video stream is split or further split into smaller Group of Pictures (GOP) alignment. GOP is a group/chunk of frames arranged in a specific order. Each chunk is an independently playable unit, usually a few seconds in length.
- Some old mobile devices or browsers might not support video splitting. Preprocessor should ensure GOP alignment for old clients.
- DAG generation, processor generates DAG based on configuration files client programmers write.
- cache data: preprocessor is a cache for segmented videos, preprocessor stores GOPs and metadata in temporary storage. If video encoding fails, the system could use persisted data for retry operations.
DAG Scheduler:
- DAG scheduler splits a DAG graph into stages of tasks and puts them in the task queue in the resource manager
resource manager：
- Task queue: a priority queue contains tasks to be executed.
- Worker queue: a priority queue that contains worker utilization info.
- Running queue: It contains info about the currently running tasks and workers running the tasks
- Task scheduler: It picks the optimal task/worker, and instructs the chosen task worker to execute the job
- The resource manager works as follows:
  - The task scheduler gets the highest priority task from the task queue.
  - The task scheduler gets the optimal task worker to run the task from the worker queue.
  - The task scheduler instructs the chosen task worker to run the task.
  - The task scheduler binds the task/worker info and puts it in the running queue.
  - The task scheduler removes the job from the running queue once the job is done.
Task workers
- Task workers run the tasks which are defined in the DAG
Temporary Storage
- Data in temporary storage is freed up once the corresponding video processing is complete.

Optimization

Speed optimization:

parallelize video uploading

multiple_upload

split a video into smaller chunks by GOP alignment
allows fast resumable uploads when the previous upload failed
speed up the upload process by uploading multiple chunks in parallel

place upload centers close to users

setting up multiple upload centers across the globe

parallelism everywhere
- make the system more loosely coupled: introduce message queues
- parallelization for each step in the whole process

Safety optimization

use pre-signed URLs (this name is from AWS S3)
- pre-signed URL gives the access permission to the object identified in the URL
protect videos copyright
- DRM (Digital Rights Management): Apple FairPlay, Google Widevine, Microsoft PlayReady
- AES encryption: encrypt a video and configure an authorization policy. The encrypted video will be decrypted upon playback.
- Visual watermarking

Cost-saving optimization

Only serve the most popular videos from CDN and other videos from our high capacity storage video servers
For less popular content, we may not need to store many encoded video versions. Short videos can be encoded on-demand.
Some videos are popular only in certain regions. There is no need to distribute these videos to other regions.
Build your own CDN like Netflix and partner with Internet Service Providers (ISPs).

Error handling

Recoverable error: retry the operation a few times, If the task continues to fail and the system believes it is not recoverable, it returns a proper error code to the client.
Non-recoverable error: stops the running tasks associated with the video and returns the proper error code to the client

DESIGN GOOGLE DRIVE

Basic Requirements

Add files. The easiest way to add a file is to drag and drop a file into Google drive.
Download files.
Sync files across multiple devices. When a file is added to one device, it is automatically synced to other devices.
See file revisions.
Share files with your friends, family, and coworkers
Send a notification when a file is edited, deleted, or shared with you.
Reliability. Reliability is extremely important for a storage system. Data loss is unacceptable.
Fast sync speed. If file sync takes too much time, users will become impatient and abandon the product.
Bandwidth usage. If a product takes a lot of unnecessary network bandwidth, users will be unhappy, especially when they are on a mobile data plan.
Scalability. The system should be able to handle high volumes of traffic.
High availability. Users should still be able to use the system when some servers are offline, slowed down, or have unexpected network errors.

Basic Design

basic_structure

file storage: AWS S3, files are replicated in two separate geographical regions

Sync Conflict

Two users edit the same file at the same time
Solution:
- the first version that gets processed wins
- the version that gets processed later receives a conflict
- system presents both copies of the same file: user 2’s local copy and the latest version from the server. User 2 has the option to merge both files or override one version with the other.

High-level design

high_level

Detailed Design

Block servers

block_server

A file is split into smaller blocks.
Each block is compressed using compression algorithms.
To ensure security, each block is encrypted before it is sent to cloud storage.
Blocks are uploaded to the cloud storage.

Block servers are used to minimize the amount of network traffic being transmitted, especially in following situations:

Delta sync. When a file is modified, only modified blocks are synced instead of the whole file using a sync algorithm.
Compression. Applying compression on blocks can significantly reduce the data size. Like gzip and bzip2 are used to compress text files.

Consistency requirements (metadata db)

Memory caches adopt an eventual consistency model by default, which means different replicas might have different data.

To achieve strong consistency, we must ensure the following:

Data in cache replicas and the master is consistent.
Invalidate caches on database write to ensure cache and database hold the same value.

Achieving strong consistency in a relational database is easy because it maintains the ACID (Atomicity, Consistency, Isolation, Durability) properties. However, NoSQL databases do not support ACID properties by default. ACID properties must be programmatically incorporated in synchronization logic. In our design, we choose relational databases because the ACID is natively supported.

Possible Metadata DB schema

metadataDB

upload flow

upload_flow

edit flow is alike

Download flow

Download flow is triggered when a file is added or edited elsewhere.

Two ways a client can know a file is added or edited by another client:

Notification system: If client is online
If client A is offline while a file is changed by another client, data will be saved to the cache. When the offline client is online again, it pulls the latest changes.

download_flow

Notification service

purpose: maintain file consistency, any mutation of a file performed locally needs to be informed to other clients to reduce conflicts

Long polling: better choice
- Communication for notification service is not bi-directional
- WebSocket is suited for real-time bi-directional communication such as a chat app. For Google Drive, notifications are sent infrequently with no burst of data
WebSocket

Save storage space

De-duplicate data blocks. Eliminating redundant blocks at the account level is an easy way to save space. Two blocks are identical if they have the same hash value.
Adopt an intelligent data backup strategy
- Set a limit for maximum file versions: If the limit is reached, the oldest version will be replaced with the new version.
- Keep valuable versions only: To avoid unnecessary copies, we could limit the number of saved versions. We give more weight to recent versions.
Moving infrequently used data to cold storage. (which is cheaper, like AWS Glacier)

Failure handling

Load balancer failure: If a load balancer fails, the secondary would become active and pick up the traffic. Load balancers usually monitor each other using a heartbeat, a periodic signal sent between load balancers. A load balancer is considered as failed if it has not sent a heartbeat for some time.
Block server failure: If a block server fails, other servers pick up unfinished or pending jobs.
Cloud storage failure: S3 buckets are replicated multiple times in different regions. If files are not available in one region, they can be fetched from different regions.
API server failure: It is a stateless service. If an API server fails, the traffic is redirected to other API servers by a load balancer.
Metadata cache failure: Metadata cache servers are replicated multiple times. If one node goes down, you can still access other nodes to fetch data. We will bring up a new cache server to replace the failed one.
Metadata DB failure.
- Master down: If the master is down, promote one of the slaves to act as a new master and bring up a new slave node.
- Slave down: If a slave is down, you can use another slave for read operations and bring another database server to replace the failed one.
Offline backup queue failure: Queues are replicated multiple times. If one queue fails, consumers of the queue may need to re-subscribe to the backup queue.

Tags: System Deign

K-V Store

Design scopes

CAP Theorem

System Components

Unique ID Generator

Requirements

Probable solutions

Multi-master replication

UUID

Ticket server

Snowflake

Furhter problems

URL Shortener

API Design

Redirecting

Shortening

Possible process

Web Crawler

Possible Requirements

High-level design

Pick Seed URLs

URL Frontier

HTML Downloader

DNS resolver

Content Parser

Content Seen Detector

Content Storage

URL Extractor

URL filter

URL Seen Detector

Way of Traverse

URL Frontier

HTML Downloader

Robots Exclusion Protocol

Performance optimization

Extensibility

Notification System

Main Components

iOS push notification

Android push notification

SMS message

Email

Contact info gathering flow

Notification sending flow

Reliability

Addtional considerations

News Feed System

requirements

high-level design

Feed publishing

Newsfeed building

Newsfeed cache

CHAT SYSTEM

Requirements

High-level design

choice of network protocols

General Design

Storage

Data Model

Deep Dive

Service discovery

Message flows

Message synchronization across multiple devices

small group chat flow

Online presence

SEARCH AUTOCOMPLETE SYSTEM

Basic Requirements

High-level design

Deep Dive

Trie / prefix tree

Data gathering

Query Service

Trie Operation

Scale the storage

Further questions:

DESIGN YOUTUBE / video sharing platform

Basic Requirements

High-level design

Design deep dive

Video transcoding