Understanding Redis replication in AWS ElastiCache
Redis is one of the most loved data stores according to the Stack Overflow Developer Survey. Amazon Web Services offers managed Redis through ElastiCache. It has many features that help us build great systems, and one of them is replication. How does it work?
What is Redis and AWS ElastiCache
Redis is a fast in-memory NoSQL data store that is most often used as a cache. Unlike traditional disk-based databases, Redis keeps the whole dataset in memory rather than reading it from disk. This allows Redis to offer far lower latency and higher throughput than traditional databases. Using Redis in your system can decrease latency, simplify the system, and reduce operating cost. It is highly scalable and performs well under load.
AWS ElastiCache is a cloud service that offers managed Redis (it also offers Memcached, but that option is less popular). Amazon takes over the operational aspects: it provisions servers, manages the cache cluster, and handles recovery after failures.
Why replication
All the benefits that Redis and AWS ElastiCache bring to the table come at a price. The nature of Redis creates its main drawback: Redis stores data only in RAM. This means that in case of a failure (power outage, internal software error, hardware failure), all data in Redis is lost. Redis and AWS use several techniques to mitigate this issue and to make Redis and ElastiCache operationally more reliable. One of the most effective is replication. With replication, data is stored in multiple copies on (almost) independent Redis instances. If one of them fails, the data is still available on another. How does it work?
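As a rough mental model, the idea can be sketched in a few lines of Python. This is a toy simulation, not how Redis or ElastiCache is implemented: the class names `ToyNode` and `ToyReplicationGroup` are hypothetical, copying is assumed to be asynchronous (modelled as a backlog that is drained later), and "promote the first replica" is a deliberate simplification of what happens after a failure.

```python
from collections import deque


class ToyNode:
    """A stand-in for one Redis instance: just an in-memory dict."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class ToyReplicationGroup:
    """One writable master plus read-only replicas (illustrative only)."""

    def __init__(self, replica_count):
        self.master = ToyNode("master")
        self.replicas = [ToyNode(f"replica-{i}") for i in range(replica_count)]
        # Writes applied on the master but not yet copied to the replicas.
        self.pending = deque()

    def write(self, key, value):
        # The master applies the write immediately; replicas receive it later.
        self.master.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Drain the backlog: copy every pending write to every replica.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value

    def fail_master(self):
        # Simulate losing the master: its memory and the un-replicated backlog
        # vanish, and (by assumption) the first replica takes over.
        self.pending.clear()
        self.master = self.replicas.pop(0)


group = ToyReplicationGroup(replica_count=2)
group.write("user:1", "alice")
group.replicate()               # "user:1" now exists on every replica
group.write("user:2", "bob")    # applied on the master, not yet replicated
group.fail_master()

survived = group.master.data.get("user:1")   # replicated before the failure
lost = group.master.data.get("user:2")       # never reached a replica
print(survived, lost)  # alice None
```

The sketch shows both sides of the trade-off: data that reached a replica survives the failure, while a write that was still waiting to be copied can be lost.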
ElastiCache replication
Let’s look at how ElastiCache replicates data in Redis. ElastiCache can run in several different configurations:
- No Replication. The simplest setup: just one Redis process running on a single server instance. It holds all the data and serves all read and write requests. If it goes down, all data is lost.
- Replication in a single shard. In this mode, data in Redis is replicated across several instances. These are independent Redis processes that run on different EC2 instances (always of the same type) grouped together. This group is called a shard. One node in the shard is the master node and the others are replicas. There is always 1 master and up to 5 replicas. The master node serves all write requests, and data is copied asynchronously to the replicas (which makes reads from replicas eventually consistent). Replicas are read-only. This replication group (single shard) holds all Redis data, with a full copy on every node, so you can have up to 6 copies of your data (1 master + 5 replicas). This setup is more robust because there is no single point of failure. It also improves performance for read-intensive applications: reads can be spread across the replicas, so read capacity scales horizontally.
- Multiple Shard Replication. In this most complex configuration, which AWS calls “cluster mode enabled”, an ElastiCache cluster consists of several shards (up to 90) with the data split between them. Each shard has its own master and its own set of replicas. The cluster has a total of 16384 hash slots that are distributed across the shards. For example, if you have two shards, slots 0-8191 (16384/2 slots) are stored on the first shard and slots 8192-16383 on the second. For each write, Redis calculates a hash of the key; the hash maps to a single slot, and that slot determines which shard stores the data (similar to buckets in a hash table). Running ElastiCache in “cluster mode” with multiple shards lets you scale both read and write performance horizontally and spreads the data across more nodes, reducing the probability of a complete data loss.
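The slot mapping can be sketched in Python. The Redis Cluster specification defines the slot as CRC16 of the key modulo 16384; the helper names below (`key_slot`, `even_slot_ranges`, `shard_for_key`) are made up for this illustration, and the even split of slots is an assumption that matches the two-shard example above rather than a guarantee about how any real cluster is laid out. The sketch also ignores hash tags (the `{...}` key syntax), which change which part of the key is hashed.

```python
import binascii
from typing import List, Tuple

TOTAL_SLOTS = 16384


def key_slot(key: str) -> int:
    """Map a key to a hash slot: CRC16(key) mod 16384."""
    # binascii.crc_hqx with an initial value of 0 is the CRC16 (XMODEM)
    # variant, the one named in the Redis Cluster specification.
    crc = binascii.crc_hqx(key.encode("utf-8"), 0)
    return crc % TOTAL_SLOTS


def even_slot_ranges(shard_count: int) -> List[Tuple[int, int]]:
    """Split slots 0..16383 into contiguous, near-equal inclusive ranges."""
    base, extra = divmod(TOTAL_SLOTS, shard_count)
    ranges = []
    start = 0
    for shard in range(shard_count):
        # The first `extra` shards take one additional slot each.
        size = base + (1 if shard < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def shard_for_key(key: str, ranges: List[Tuple[int, int]]) -> int:
    """Return the index of the shard whose slot range contains the key."""
    slot = key_slot(key)
    for index, (first, last) in enumerate(ranges):
        if first <= slot <= last:
            return index
    raise ValueError(f"slot {slot} is not covered by any range")


ranges = even_slot_ranges(2)
print(ranges)                       # [(0, 8191), (8192, 16383)]
print(key_slot("foo"))              # 12182
print(shard_for_key("foo", ranges)) # 1, i.e. the second shard
```

Because the slot depends only on the key, every client computes the same slot for the same key and can route the request to the shard that owns it.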
Summary
Replication in Redis comes at a cost. Running multiple cache instances increases cost in proportion to the number of nodes. It also adds operational complexity, much of which can be offloaded to a managed Redis solution such as ElastiCache.
On the other hand, running a cache with replication brings valuable benefits. It can drastically improve read performance. Most importantly, it adds a level of robustness to the system: with replication, the failure of a node is no longer a catastrophic event.
But what is the actual mechanism in ElastiCache that handles failures? How does it recover data from replicas exactly? I will dive into this topic in my next article.