Redis TTL is no Free Lunch

Do you know how Redis cleans up memory? If not, keep reading. I now know far more about how Redis cleans up memory than I ever wanted to. Recently, I built a system with a variable load that gets hammered during the fall foliage season, and we ran into a little memory issue with my favorite database. We had to keep scaling it up, and it was getting expensive.

TTL and a Red Bird

The REDBird system uses Redis and writes and reads a lot of time series data. Truth be told, the system creates a lot of keys and leaves them for Redis to clean up, kind of like that brother who doesn't clean up after himself in the kitchen and thinks mom will do it for him. There is only one problem: we write and expire keys faster than Redis can clean them up in its default configuration.

Keys in Redis are expired in two ways: passively and actively. When a client tries to read an expired key, the key is deleted on the spot; that is the passive way. The active way has Redis sample 20 keys at random and test each for an expired Time To Live (TTL). If it finds that more than 25% of the sample are expired, it runs the cycle again.
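
To see the passive path in action, here is a minimal sketch using the redis-py client; the key name and TTL are made up for illustration:

```python
import time

import redis

r = redis.Redis(host="localhost", port=6379)

r.set("leaf:42", "maple", ex=1)  # write a key with a 1-second TTL
time.sleep(2)                    # let the TTL lapse

# The read finds an expired key, deletes it, and returns None:
# that is passive expiration at work.
assert r.get("leaf:42") is None
```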

This is where things get a little murky for me. The Redis documentation says that the cleanup function runs 10 times a second, but that it stops if it does not find more than 25% of the sampled keys expired. I have some questions. Stop for how long? And what does it mean to start again when it finds more than 25% expired?

hz and Frequency

The answer to most of my questions was inside the redis.conf file. In the file, under the hz configuration setting, we get this little gem of a comment.

Redis calls an internal function to perform many background tasks, like closing connections of clients in timeout, purging expired keys that are never requested, and so forth.

Wow, and when we keep reading, we get this bonus.

By default “hz” is set to 10. Raising the value will use more CPU when Redis is idle, but at the same time will make Redis more responsive when there are many keys expiring at the same time, and timeouts may be handled with more precision.

There it is: the magical number 10. Now I know what you are thinking: adjust the "hz" configuration value and REDBird will fly again. The problem is, Amazon ElastiCache does not give you access to this configuration setting. I wonder why? Not really, I know why. Moving "hz" to 100 or 500 would reduce the total number of Redis instances that Amazon can deploy on the same hardware with an "hz" of 10. OK, that's fair, but it sucks. Moving on.
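
On a Redis instance you do control, the setting can be inspected and changed at runtime. A quick redis-py sketch with hypothetical values (and again, ElastiCache blocks this):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

print(r.config_get("hz"))  # {'hz': '10'} by default
r.config_set("hz", 100)    # more active-expiry cycles per second, more idle CPU
```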

Fly REDBird Fly

So how did we fix this? We took some inspiration from this and another blog post and started asking Redis to look up random keys in a fire-and-forget pipeline batch. We started out with batches of 50K every second and eventually found that a 10K batch every couple of seconds gave us a good balance of memory cleanup and system load. Today REDBird is flying even during the fall, and we were able to reduce the size of our cluster, saving more money for the chipmunks' and lumberjacks' retirement fund.
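
Our sweeper boiled down to something like the sketch below, assuming the redis-py client; the batch size, interval, and key handling are illustrative rather than our exact production code:

```python
import time

import redis

r = redis.Redis(host="localhost", port=6379)

def sweep(batch_size=10_000):
    # Ask Redis for a batch of random keys in one pipelined round trip.
    pipe = r.pipeline(transaction=False)
    for _ in range(batch_size):
        pipe.randomkey()
    candidates = {key for key in pipe.execute() if key}

    # Touch every candidate and ignore the replies. If a key's TTL has
    # lapsed, the read itself triggers passive expiration and frees memory.
    pipe = r.pipeline(transaction=False)
    for key in candidates:
        pipe.exists(key)
    pipe.execute()

while True:
    sweep()
    time.sleep(2)  # a 10K batch every couple of seconds balanced cleanup and load
```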

The Hyper Logger

The other day, while working at Chipmunks & Lumberjacks, I was asked to create a computer system to count the distinct set of leaf shapes, sizes, and colors of every tree in the forest. I know what you are thinking: why would anyone want to know this? For C&L it's really important; they use it when scoring whether a tree can be harvested.

The naive implementation of this system would be to add all of the leaves into a database and run distinct queries. That might work for a while, but what happens when the total number of trees and their associated leaves climbs into the billions or trillions? Will it still perform? The short answer is no; the longer answer is: absolutely not, are you kidding me!

Probabilistic Data Structures

So, what is a geek working for C&L to do? Enter the probabilistic data structure HyperLogLog. This sleek and beautiful data structure will tell you the distinct count of items added to it with a typical error rate of about 2%. Come on, 2% is not that bad, especially when you can count cardinalities beyond 10^9 using roughly 1.5 KB of memory. HyperLogLog's ancestor, the Flajolet–Martin algorithm, might be the best thing to come out of the '80s besides Nintendo and the Sony Walkman.

HyperLogLog data structures have an Add, a Count, and a Merge function. Add inserts a new value into the structure. Count tells you the distinct number of items in the structure. Merge takes two HyperLogLog structures and returns a new combined structure holding the distinct set of both.
Add and Count are the meat and potatoes of HyperLogLog, but don't discount Merge. Some very interesting features can be built with it. For C&L we needed the 1-, 7-, and 30-day counts for our tree leaves. To save some space, we used one structure per day and merged the days to get the week and month values. Neat, right?
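
In Redis these three operations map to the PFADD, PFCOUNT, and PFMERGE commands. A minimal sketch with redis-py, using made-up key names and leaf encodings:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Add: record each observed leaf, one HyperLogLog key per day.
r.pfadd("leaves:2019-10-01", "maple:large:red", "oak:small:green")
r.pfadd("leaves:2019-10-02", "maple:large:red", "birch:small:yellow")

# Count: distinct leaves seen on a single day.
print(r.pfcount("leaves:2019-10-01"))

# Merge: fold seven daily structures into one weekly structure,
# then count the distinct set across the whole week.
days = [f"leaves:2019-10-{day:02d}" for day in range(1, 8)]
r.pfmerge("leaves:2019-W40", *days)
print(r.pfcount("leaves:2019-W40"))
```

PFCOUNT can also take multiple keys and merge them on the fly, but materializing the weekly key keeps repeated reads cheap.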

Benefits

There are some significant side benefits of using this type of data structure over the typical row in a database. Three of them are GDPR compliance, easy replay and size.

For GDPR, if you are just exchanging the data structures, no recognizable information is stored inside them: the Add command hashes the original value and only uses that hash to update the structure's registers, so the raw value itself is never written.
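
To make that concrete, here is a toy, from-scratch sketch of what an Add roughly does internally. Redis's real implementation differs (different hash function, 6-bit registers, sparse and dense encodings), so treat this purely as illustration:

```python
import hashlib

NUM_REGISTERS = 2 ** 14          # 16384 registers, matching Redis's precision
registers = [0] * NUM_REGISTERS

def hll_add(value: str) -> None:
    # Only the hash of the value ever touches the structure, so the
    # original value cannot be recovered from the registers.
    h = int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")
    idx = h & (NUM_REGISTERS - 1)        # low 14 bits pick a register
    rest = h >> 14                       # remaining 50 bits feed the estimator
    rank = 50 - rest.bit_length() + 1    # leading zeros in those 50 bits, plus 1
    registers[idx] = max(registers[idx], rank)

hll_add("maple:large:red")  # the string itself is never stored
```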

Replay is all about recovering from a distributed system failure: inserting the same value into a HyperLogLog structure twice has no effect on the count. It's basically a no-op, so you can safely replay events.
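
You can see the no-op behavior directly: PFADD returns 1 when the write changed the structure and 0 when it did not (hypothetical key and value, again via redis-py):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

r.pfadd("leaves:replay-demo", "maple:large:red")  # returns 1: structure updated
r.pfadd("leaves:replay-demo", "maple:large:red")  # returns 0: the replay is a no-op
```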

You have to admit that storing a bazillion items in 1.5 kilobytes is pretty neat. The small size makes sharing and transmitting the structure between data centers, or between servers in the same center, relatively easy.

A Red Bird

The new system, named REDBird, is built and deployed to multiple forests around the world, and C&L is very happy with it. If you have a cardinality challenge at work, you might want to learn more about probabilistic data structures and HyperLogLog.