How Twitter generates millions of unique ids per second?
A note on system design of generating unique ids per second.
Unique ID Generator in a Distributed System.
Unique id generator is a must do system design question. The season being unique id is required in almost every system nowadays.
Scope of the problem
- What is the length of the unique id? 64 bits
- Is the unique id sortable? Yes
- Is the unique id generated for a distributed system? Yes
- How many unique ids will be generated per-second? 10K
- How many regions or clusters are the in the system? 6
Real Use Cases
For every new registration an unique id is required for the account. For every photo an unique id is required. For every tweet an unique id is required.
HLD of different systems
Multi master replication
For a multi-master replication setup each server has to generate an unique id and increment it by the number of replication present.
Let’s say we have 2 Database Servers
Server 2 starts its counter from 0000000 + N
Here N is the number of servers in the system.
- Easy to implement.
- Each server is independent of each other.
- Cost optimised
There are number of problems in this approach. As we are discussing about distributed systems here the number of servers added and removed from the entire stack is dynamic and that causes some serious problems to this architecture.
If we increase or decrease server counts then we will have to update the counters of every server.
The solution is not scalable as when the DB servers are added or removed from the system dynamically, the entire system has to be updated with the new number. If not done correctly then there can be collision of IDs.
Next widely accepted way is to generate UUID using libraries. Below is the HLD of the system.
Each DB Server is generating its own UUID and it has been found that the chance of getting same UUID is once in a billion.
From Wikipedia:💡 Only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. Or, to put it another way, the probability of one duplicate would be about 50% if every person on earth owned 600 million UUIDs.
So this is pretty much solves the problem but there are some drawbacks in this approach as well:
- The length of the string is 128bits while we need 64 bits
- Ids are not built with UTC timestamp its based on CPU Clock Time - thus very hard to sort
- The generated ids can be alphanumeric thus might take longer time to sort
Lets look at a basic HLD of a ticket server
The ticket server is responsible for generating tokens for all the servers that are part of the entire design. Whenever a request will come no matter from which zone the same ticket server will generate the ticket. This approach was widely popularised by Flickr and till date it is widely used across many products in the industry.
If your scale is < 1M users this should be an approach we can consider, but as the system scales this will require some more upgrades.
1. Scales very well with n number of servers
2. Can generate 64 bit Numeric ID since code is in developer control.
3. Very easy of implement using caches and hash tables.
1. Single point of failure.
2. If made a distributed system then can increase cost of the project.
3. Distributed database problem will then kick in. Problems like data synchronization will happen.
4. This will become a full blown project and if this fails all the dependent Systems will fail.
The idea is fairly simple and solves almost all our problems.
To think about this problem we have to divide and conquer the problem and think about what available entities we have.
- Data Centres
- Machines in Data Centres
How can we use these values to generate a sortable 64Bit Numeric ID for our system? We want to join all of these together and create value that can be identified sorted and scaled.
Here we can divide the 64 bits into sub parts
| Sign bit 1 bit | Timestamp 41 bits | Data centre ID 5 bits | Machine ID 5 bits | Sequence Number 12 bits | | --- | --- | --- | --- | --- | | 0 | 1643943869 | 0 - 32 | 0 - 32 | 0 - 999 |
Datacenter ID This is pretty straight forward, We can generate and assign 2 ^ 5 =32 data centre IDs Generally there's one data-centre per region like US-East, US-Central, IN-Mumbai etc.
Machine ID: This is also same we can generate and put 32 machines per data centre.
This number is generated at the runtime.
The total combinations that can be generated is 2 ^ 12 = 4096 combinations per millisecond.
A combination can be generated in multiple ways but in a real system the counter will reset to 0 every second.
What does that mean? If there is more than one id is requested to be created then every millisecond the counter will increase by 1 and by the end of the second the counter will again be reset to 0.
Twitter Snowflake approach is pretty simple and we need to generate a new id by using SIGN BIT, TIMESTAMP, DATA CENTRE ID and sequence number.
Additional Resources for Reference.
We have not considered on how to co-ordinate the timestamp between all the servers. To achieve that precision we have to consider something called as NTP
NTP : Network Time Protocol for generating timestamp.
Clock Synchronisation is also required to generate correct logs across multiple data centres. This we can see in practice at Kibana setup by the cloud infra team.
Learn more about it here.
*System Design Primer - Great resource to prepare for system design interviews.*
Did you find this article valuable?
Support Sagar by becoming a sponsor. Any amount is appreciated!