
Below are the problems that sharding solves:

1 Storage Limitations

A single database cannot store unlimited data.

  • DB is reaching TBs/PBs of storage
  • Indexes become huge and slow
  • Adding more disk/SSD is not enough

Sharding fixes it:

Split data across multiple DBs → each DB stores a small subset.

2 High Read/Write Throughput

A single server can handle only limited QPS (queries per second).

  • Millions of users hitting the same DB.
  • Writes create lock contention
  • Reads overwhelm CPU and RAM

Sharding fixes it:

Each shard handles only a fraction of total traffic.

10 shards ≈ 10x write throughput
10 shards ≈ 10x read throughput
(assuming the shard key spreads traffic evenly across shards)

3 Large Index and Slow Queries

When the DB grows too big:

  • Index scans slow down
  • Query latency increases
  • Cache hit ratio decreases

Sharding fixes it:

Each shard's index is small -> queries are faster.

4 Hotspot / Hot Key Problem

Some keys become extremely active (e.g., popular users, trending items).

  • One DB node gets all traffic for that user.
  • Other DBs are idle – unbalanced load

Sharding distributes load evenly:

With a good shard key -> no single node becomes overloaded.

5 Vertical Scaling Limit (Hardware Limit)

You cannot upgrade indefinitely:

  • CPU
  • RAM
  • Storage

Eventually you hit the ceiling.

Sharding fixes it:

Scale horizontally by adding more nodes without upgrading hardware.

6 Latency Optimization (Geographical Distribution)

Global users -> global latency.

  • Users far away from DB region get slow responses.

Sharding fixes it:

Users are routed to the shard closest to them (geo-sharding).

7 Fault Isolation

If everything is stored in one DB:

  • One failure -> full outage.

Sharding improves reliability:

If shard 3 fails:

  • Only 1/N of users affected.
  • Rest of the system works fine.

8 Cost Optimization

A large monolithic database needs:

  • bigger machines
  • high-performance SSDs
  • enterprise licenses

Sharding allows:

  • Smaller, cheaper machines
  • Independent scaling
  • Pay only for capacity needed

9 Backup & Restore Complexity

Backing up a huge monolithic DB:

  • takes hours
  • impacts performance
  • restoring takes even longer

With sharding:

  • Each shard is smaller -> quicker to back up
  • Can back up shards independently
  • Restore only the affected shard, not the whole database

10 Operational Maintenance

Maintenance on a monolithic DB affects the entire system.

Sharding helps:

  • Rolling upgrades
  • Per-shard maintenance
  • Per-shard failover
  • Zero-downtime schema changes

What is Sharding?

Sharding = Horizontal Partitioning

Instead of storing:

All users -> one DB

We split:

Shard 1 -> Users 1-1M
Shard 2 -> Users 1M-2M
Shard 3 -> Users 2M-3M

Each shard is a separate, independent database (its own CPU, RAM, and disk).

This gives you:

  • more capacity
  • smaller indexes
  • a smaller working set
  • faster queries

Example – Users Table

| user_id | name | email |
|---------|------|-------|
| 1       | A    |       |
| 2       | B    |       |
| 3       | C    |       |
| 4       | D    |       |

If we use user_id % 2:

  • Shard 1 gets rows: user 1, user 3
  • Shard 2 gets rows: user 2, user 4
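
As a minimal sketch in Python (plain dicts stand in for the two database instances; this is illustrative, not a real driver), the user_id % 2 routing looks like this:

# Minimal sketch of modulo-based row routing (illustrative only; the
# "shards" are plain dicts standing in for two database instances).
shards = {0: [], 1: []}  # shard index -> rows stored on that shard

def route(user_id: int) -> int:
    """Pick a shard index using user_id % 2."""
    return user_id % 2

for user_id, name in [(1, "A"), (2, "B"), (3, "C"), (4, "D")]:
    shards[route(user_id)].append({"user_id": user_id, "name": name})

print(shards[1])  # users 1 and 3 (the "Shard 1" above)
print(shards[0])  # users 2 and 4 (the "Shard 2" above)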

But in real systems, sharding is done at the database level:

Shard 1:

DB instance 1

Contains:

  • users table (rows for shard 1)
  • orders table (rows for shard 1)
  • posts table (rows for shard 1)

Shard 2:

DB instance 2

Contains:

  • users table (rows for shard 2)
  • orders table (rows for shard 2)
  • posts table (rows for shard 2)

So:

  • same schema (tables)
  • Different rows in each shard

How Data Is Distributed (Sharding Strategies)

There are 5 commonly used sharding methods:

1 Range-Based Sharding

Split data into ranges based on the shard key.

Example (user_id):

Shard 1 → 1 to 1,000,000  
Shard 2 → 1,000,001 to 2,000,000  
Shard 3 → 2,000,001 to 3,000,000  
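
A minimal range-router sketch in Python, assuming the boundaries from the example above (bisect keeps the lookup logarithmic in the number of ranges):

import bisect

# Upper bound of each range, in ascending order (from the example above).
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SHARDS = ["shard_1", "shard_2", "shard_3"]

def shard_for(user_id: int) -> str:
    """Return the shard whose range contains user_id."""
    i = bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id)
    if i == len(SHARDS):
        raise ValueError("user_id is outside every configured range")
    return SHARDS[i]

print(shard_for(42))         # shard_1
print(shard_for(1_500_000))  # shard_2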

Pros:

  • Very easy to implement
  • Great for range queries
  • Easy to understand
  • Efficient for time-series data

Cons:

  • Hotspot problem (sequential IDs all go to the latest shard)
  • Uneven load distribution
  • Need manual splitting when full

2 Hash-Based Sharding

Apply a hash on the shard key.

We pick a shard key – often:

  • user_id
  • email
  • device_id
  • order_id

Then we compute:

shard_number = hash(shard_key) % total_shards

Example:

shard_id = hash(user_id) % 8
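
A sketch of this routing in Python. Python's built-in hash() is randomized per process, so the sketch uses MD5 as a stable hash; that choice is an assumption of the example, not part of the formula:

import hashlib

TOTAL_SHARDS = 8  # matches the example above

def shard_for(shard_key: str) -> int:
    """shard_number = hash(shard_key) % total_shards, with a stable hash."""
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % TOTAL_SHARDS

print(shard_for("user:12345"))  # always the same shard id in [0, 7]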

Pros:

  • Uniform distribution
  • No hotspots
  • Simple static routing logic

Cons:

  • Adding/removing a shard -> massive reshuffling

    If we add more shards:
    old: hash(key) % 4
    new: hash(key) % 5

    This requires a massive migration: when the modulus changes, almost every key maps to a different shard.

    Fix: Use consistent hashing.

  • Hard to scale dynamically

3 Directory-Based Sharding

Maintains a lookup table (metadata server) that stores:

user_id → shard_id
customer_id → shard_id
tenant_id → shard_id

The router fetches this mapping.
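
A sketch of directory-based routing. The directory is a plain dict here and the tenant/shard names are made up; in a real system the mapping would live in a highly available metadata store:

# Sketch: the lookup table maps a tenant to its shard.
directory = {
    "tenant_acme": "shard_1",
    "tenant_globex": "shard_2",
}

def shard_for(tenant_id: str) -> str:
    """Route a request by consulting the directory."""
    return directory[tenant_id]

def reshard(tenant_id: str, new_shard: str) -> None:
    """Resharding is just a mapping update (done after the data is copied over)."""
    directory[tenant_id] = new_shard

print(shard_for("tenant_acme"))   # shard_1
reshard("tenant_acme", "shard_3")
print(shard_for("tenant_acme"))   # shard_3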

Pros:

  • Very flexible
  • Easy resharding (just update mapping)
  • Shard key can be changed without changing DB logic
  • Ideal for multi-tenant SaaS

Cons:

  • Directory server must be highly available
  • Adds a network hop
  • Extra complexity

4 Geo-Sharding (Location-Based)

Users are placed based on geography.

Example:

Shard 1 → Asia  
Shard 2 → Europe  
Shard 3 → US East  
Shard 4 → US West  
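
At its simplest, the geo-router is a static mapping from region to shard (region and shard names below are illustrative):

# Sketch: route a user to the shard for their region.
REGION_TO_SHARD = {
    "asia": "shard_asia",
    "europe": "shard_europe",
    "us_east": "shard_us_east",
    "us_west": "shard_us_west",
}

def shard_for(region: str) -> str:
    # Fall back to a default home region if the region is unknown.
    return REGION_TO_SHARD.get(region, "shard_us_east")

print(shard_for("europe"))  # shard_europe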

Pros:

  • Low latency for global users
  • Complies with data residency laws (GDPR, etc.)
  • Traffic is naturally spread across regions

Cons:

  • Cross-region operations are expensive
  • Hard if users travel globally
  • Different shards = different consistency rules

5 Consistent Hashing

It is a smart sharding technique used to avoid massive data movement when scaling a distributed system (adding/removing servers).

It is used in:

  • Databases
  • Caches
  • Load Balancers
  • Distributed File Systems

Why Do We Need Consistent Hashing? (The rehashing problem)

If we have n database servers, a common way to balance the load is to use the following hash method:

serverIndex = hash(key) % n, where n is the size of the server pool.

Let us use an example to illustrate how it works.

Formula:

shard = hash(user_id) % 3

Let's say we insert these users:

| user_id | hash(user_id) | hash % 3       | Goes to |
|---------|---------------|----------------|---------|
| 101     | 754322        | 754322 % 3 = 2 | Shard2  |
| 202     | 123876        | 123876 % 3 = 0 | Shard0  |
| 303     | 987321        | 987321 % 3 = 0 | Shard0  |
| 404     | 453221        | 453221 % 3 = 2 | Shard2  |

So data is placed like this:

Shard0 → user 202, 303
Shard1 → (no users in this tiny example)
Shard2 → user 101, 404

This is normal modulo-based sharding.

What happens when you need to scale and add one more shard?

Say you add Shard3, so now n=4

Your formula becomes:

shard = hash(user_id) % 4

Now re-evaluate the keys:

| user_id | Old shard (mod 3) | New shard (mod 4) | Status   |
| ------- | ----------------- | ----------------- | -------- |
| 101     | 2                 | 754322 % 4 = 2    | STAYS ✔️ |
| 202     | 0                 | 123876 % 4 = 0    | STAYS ✔️ |
| 303     | 0                 | 987321 % 4 = 1    | MOVED ❌  |
| 404     | 2                 | 453221 % 4 = 1    | MOVED ❌  |

Out of 4 users, 2 moved (50%)
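
You can reproduce this at scale with a short script (a sketch; MD5 stands in for a stable hash): count how many keys land on a different shard when the modulus changes from 3 to 4.

import hashlib

def stable_hash(key: str) -> int:
    """Stable hash (Python's built-in hash() is randomized per process)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"user_{i}" for i in range(100_000)]
moved = sum(1 for k in keys if stable_hash(k) % 3 != stable_hash(k) % 4)
print(f"{moved / len(keys):.0%} of keys change shards")  # roughly 75%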

In real systems with millions of users:

  • 50-80% of the data gets reshuffled
  • Cache warmup is lost
  • Databases overloaded
  • Massive downtime for migration

In big clusters this becomes unmanageable.

This is the main drawback of hash-based sharding.

Why Does this Happen?

Because modulo depends on total number of shards:

key_location = hash(key) % NUMBER_OF_SHARDS

If NUMBER_OF_SHARDS changes -> every key's location changes.

This is why normal hashing does NOT scale horizontally.

Imagine a Hash Ring

Instead of using % n, we hash keys onto a circular space (0° to 360°).

       (0°)
        |
 270° -- -- 90°
        |
       180°

Now servers (shards) are also hashed onto this ring.

Suppose we have 3 servers:

Server A → 40°
Server B → 120°
Server C → 300°

Placing Keys on the Ring

Example keys:

| Key      | Hash | Point on ring |
|----------|------|---------------|
| user_101 | 50°  | 50°           |
| user_202 | 130° | 130°          |
| user_303 | 260° | 260°          |
| user_404 | 310° | 310°          |

Rule for Choosing Shard

Move clockwise and pick the next server you hit.

Example:

user_101 → 50° → next server clockwise = 120° (Server B)
user_202 → 130° → next server clockwise = 300° (Server C)
user_303 → 260° → next server clockwise = 300° (Server C)
user_404 → 310° → wrap → next server clockwise = 40° (Server A)

Final mapping:

Server A → user_404  
Server B → user_101  
Server C → user_202, user_303  
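
A minimal consistent-hash ring in Python (a sketch: the degree positions become points in a large integer hash space, and bisect finds the next server clockwise; the actual key-to-server mapping will differ from the hand-picked degrees above):

import bisect
import hashlib

def ring_position(value: str) -> int:
    """Map any string to a point on the ring (a large integer instead of degrees)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        # The ring, walked clockwise: sorted (position, server) pairs.
        self._points = sorted((ring_position(s), s) for s in servers)

    def add(self, server: str) -> None:
        bisect.insort(self._points, (ring_position(server), server))

    def remove(self, server: str) -> None:
        self._points = [(p, s) for p, s in self._points if s != server]

    def lookup(self, key: str) -> str:
        """Move clockwise from the key's position and pick the first server."""
        positions = [p for p, _ in self._points]
        i = bisect.bisect_right(positions, ring_position(key))
        return self._points[i % len(self._points)][1]  # wrap around at the end

ring = HashRing(["server_a", "server_b", "server_c"])
for user in ["user_101", "user_202", "user_303", "user_404"]:
    print(user, "->", ring.lookup(user))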

This works fine – but the magic happens next.

What Happens When You Add a New Server?

Suppose you add Server D at 200°.

Old: A (40°), B (120°), C (300°)
New: A (40°), B (120°), D (200°), C (300°)

Which keys move?

Only the keys between B (120°) and D (200°), which previously belonged to C.

That means:

  • Only a small range of keys moves: user_202
  • All other users stay untouched

This is the superpower:

Adding a server only relocates keys in the server's segment, not the entire dataset.

Compared to modulo hashing, where most keys move, consistent hashing relocates only about 1/N of the keys (the new server's share).
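
A quick way to verify this is to build the ring with and without the new server and count how many keys move. This sketch gives each server a few virtual points (see the Virtual Nodes section further down) so each server's share stays close to 1/N; all names are illustrative.

import bisect
import hashlib

def pos(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(servers, vnodes=50):
    # A few virtual points per server keep each share close to 1/N.
    return sorted((pos(f"{s}#{i}"), s) for s in servers for i in range(vnodes))

def lookup(ring, positions, key):
    i = bisect.bisect_right(positions, pos(key))
    return ring[i % len(ring)][1]

keys = [f"user_{i}" for i in range(100_000)]
old = build_ring(["server_a", "server_b", "server_c"])
new = build_ring(["server_a", "server_b", "server_c", "server_d"])
old_pos = [p for p, _ in old]
new_pos = [p for p, _ in new]

moved = sum(1 for k in keys
            if lookup(old, old_pos, k) != lookup(new, new_pos, k))
print(f"{moved / len(keys):.0%} of keys moved")  # roughly the new server's share (~1/4), vs ~75% with modulo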

Removing a Server

If Server B (120°) goes down:

Only the keys assigned to B will move -> to the next server clockwise (D in this case).

Everything else remains where it was.

Again -> minimal movement.

Key Distribution Problem & Virtual Nodes (Vnodes)

If you place servers randomly, distribution may be uneven:

  • Server A may own 40% of the ring
  • Server B only 10%, etc.

Solution: Virtual Nodes

Each server is represented multiple times on the ring:

Server A: 100 virtual points
Server B: 100 virtual points
Server C: 100 virtual points

This gives:

  • Near-uniform distribution
  • Smooth scaling
  • Fine-grained rebalancing
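
A sketch of the same idea: each server is inserted at 100 points (as in the example above), and a quick count shows how evenly keys spread. Names and counts are illustrative.

import bisect
import hashlib
from collections import Counter

def ring_position(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

VNODES = 100  # virtual points per server, as in the example above
servers = ["server_a", "server_b", "server_c"]

# Each server appears VNODES times on the ring under distinct labels.
ring = sorted(
    (ring_position(f"{server}#vnode{i}"), server)
    for server in servers
    for i in range(VNODES)
)
positions = [p for p, _ in ring]

def lookup(key: str) -> str:
    i = bisect.bisect_right(positions, ring_position(key))
    return ring[i % len(ring)][1]

# Spread 30,000 keys and check the balance per server.
counts = Counter(lookup(f"user_{i}") for i in range(30_000))
print(counts)  # roughly 10,000 keys each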