Every month, Netcore’s few thousand customers schedule SMS campaigns that add up to more than 2 billion SMSes sent from our platform. That volume puts serious load on our hardware, so our engineering teams are obsessed with keeping the infra up, running, and healthy.
These billions of SMSes pass through various middleware components, and the Redis in-memory database is one of the most important. Although Redis is quite light, its single-threaded processing meant that a single instance throttled under our heavy data load. Taking snapshots of such large datasets made Redis non-responsive, and turned all efforts of debugging other processes on the same system futile.
In short, we were struggling to get the right performance out of the Redis in-memory database: a single instance was under heavy load from both incoming and outgoing operations. Redis, being a single-threaded process, uses only one core, so our multi-core CPU machines were nowhere near full utilisation. Hence the sub-optimal performance.
The Redis in-memory database is very lightweight, with a footprint of only about 1 MB, and the snapshots taken at desired intervals should be equally light on disk. We researched and experimented with different approaches before finally arriving at the following:
1. We segregated the critical data that had to be persisted to disk from the non-critical data. We then started with two high-end machines in High Availability (HA) mode, each running one instance of the Redis in-memory database.
2. Our next observation concerned BGSAVE – the Redis command that takes a snapshot in the background at desired intervals. There was nothing wrong with the command or its usage. However, the snapshots are cumulative in nature, so each one grew bigger than the last; under very high loads like ours, the backup size was nearing 50 GB, slowing the whole system down. So, we excluded all non-critical events from the backup, making the action lean.
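A minimal sketch of how this split might look in `redis.conf`, with one instance persisting critical data and one skipping disk persistence entirely. The file names, ports, and snapshot thresholds here are illustrative assumptions, not our production values:

```conf
# critical-instance.conf (hypothetical): persists critical data only
port 6379
save 900 1            # snapshot if at least 1 key changed in 15 minutes
save 300 100          # snapshot if at least 100 keys changed in 5 minutes
dbfilename critical.rdb

# noncritical-instance.conf (hypothetical): no disk persistence at all
port 6380
save ""               # empty value disables RDB snapshots
appendonly no         # AOF persistence also stays off
```

With non-critical events kept out of the persisted instance, BGSAVE only ever has to write the lean critical dataset.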
Eventually, these findings helped us in many ways.
We reduced the backup size from 50 GB to less than 200 MB. It was a huge saving considering the mammoth traffic generated by campaigns sent through Netcore’s platform.
Writing to and reading from disk became much smoother, which was a major plus.
The lighter snapshots brought snapshot-writing time down from hours to minutes. This saved time and gave room for running other queries and operations. Another big plus.
Additionally, we explored running multiple Redis instances on the machines under HA, with each instance allocated a single operation. Events that previously ran through the same Redis instance and port were distributed across different Redis instances on different ports of the same server. This ensured individual CPU attention for each instance. One more plus.
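One way to picture the one-operation-per-instance topology is as a simple routing table from operation type to port. The operation names and port numbers below are hypothetical, not our actual production layout:

```python
# Hypothetical sketch: each operation type gets its own Redis instance,
# identified by a dedicated port on the same server.
OPERATION_PORTS = {
    "incoming_sms": 6379,
    "outgoing_sms": 6380,
    "delivery_reports": 6381,
}

def port_for(operation: str) -> int:
    """Return the port of the Redis instance dedicated to an operation."""
    try:
        return OPERATION_PORTS[operation]
    except KeyError:
        raise ValueError(f"no Redis instance allocated for {operation!r}")
```

Because each port maps to a separate single-threaded process, the operating system can schedule each one on its own core.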
Overall, the distributed mechanism gave us a much better-performing system. We recorded an overall performance boost of almost 3.5x.
Because Redis is very lightweight and single-threaded in nature, we can run as many instances of Redis as there are CPUs available. Scaling at will, as traffic increases, became easy.
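The scale-with-the-cores idea can be sketched as sizing the instance pool from the CPU count and sharding keys across it. The base port and the crc32 sharding scheme are assumptions for illustration, not what we run in production:

```python
import os
import zlib

# Hypothetical sketch: one single-threaded Redis instance per CPU core,
# with keys sharded across them by a stable hash.
BASE_PORT = 6379
NUM_INSTANCES = os.cpu_count() or 1  # one instance per available core

def instance_port(key: str) -> int:
    """Map a key to the port of one of the per-core Redis instances."""
    # crc32 is deterministic across processes, unlike Python's built-in hash()
    return BASE_PORT + zlib.crc32(key.encode()) % NUM_INSTANCES
```

Adding capacity then amounts to provisioning cores and raising the instance count, rather than re-architecting anything.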
Overall, operations that used to be cumbersome on the Redis in-memory database aren’t cumbersome anymore.
Therefore, we here at Netcore bank on Redis the same way as our several thousand customers rely on us.
Must-know fact: Redis is the most popular key-value database, according to the monthly ranking by DB-Engines.com.
Fun, right? Let us know what you think, in the comments below!