Recently I ran into an issue where one of the APIs on our online production servers was very slow due to a redis scan.
From the following graph generated by NewRelic, we found that the AVERAGE response time was 2930 ms, of which the redis scan took 2680 ms, i.e. 2680/2930 = 91.4% of the total time.
Why does the redis scan cost this much? In our local environment, we had never noticed such a huge performance issue.
1. Locate the culprit
1.1. Phenomenon
After that, I dug into the source code and found the following snippet in our Django application. It is the only code in this API that accesses redis, so it must be the culprit.

```python
from django.core.cache import cache
```
From the above code, we can see that redis is used as a cache server here.
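The snippet above is truncated to the import; judging from the shell commands used later in this post, the cache-invalidation call in the view looked something like the following. This is a hypothetical reconstruction, and the key pattern is only illustrative.

```python
from django.core.cache import cache

# Hypothetical reconstruction of the API's cache invalidation: delete every cached
# entry whose key matches the pattern. With django_redis, this walks the keyspace
# using SCAN under the hood.
cache.delete_pattern("view_cache:dispatch:123456789:category_slug:*")
```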
So what does delete_pattern actually do? I traced the code and eventually found the following delete_pattern definition in django_redis.client.default.DefaultClient#delete_pattern.
```python
def delete_pattern(self, pattern, version=None, prefix=None, client=None):
    ...
```
The method iterates with client.scan_iter(pattern), which repeatedly issues SCAN with the next cursor, and deletes each matching key it returns.
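For reference, omitting django_redis's key prefixing and error wrapping, the body boils down to roughly the loop below: scan_iter is called with no count argument, so each underlying SCAN uses redis's default COUNT of 10, and every matching key is deleted one by one.

```python
def delete_pattern_default(client, pattern):
    """Rough sketch of DefaultClient.delete_pattern (prefixing and error
    wrapping omitted); `client` is a redis-py client."""
    count = 0
    for key in client.scan_iter(pattern):  # SCAN pages with redis's default COUNT (10)
        client.delete(key)                 # one DEL round trip per matching key
        count += 1
    return count
```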
1.2. Redis scan
From the official site of redis SCAN, we can combine MATCH and COUNT in the SCAN command:

```
redis 127.0.0.1:6379> scan 0 MATCH *11*
1) "288"
2) 1) "key:911"
redis 127.0.0.1:6379> scan 288 MATCH *11*
1) "224"
2) (empty list or set)
redis 127.0.0.1:6379> scan 224 MATCH *11*
1) "80"
2) (empty list or set)
redis 127.0.0.1:6379> scan 80 MATCH *11*
1) "176"
2) (empty list or set)
redis 127.0.0.1:6379> scan 176 MATCH *11* COUNT 1000
1) "0"
2)  1) "key:611"
    2) "key:711"
    3) "key:118"
    4) "key:117"
    5) "key:311"
    6) "key:112"
    7) "key:111"
    8) "key:110"
    9) "key:113"
   10) "key:211"
   11) "key:411"
   12) "key:115"
   13) "key:116"
   14) "key:114"
   15) "key:119"
   16) "key:811"
   17) "key:511"
   18) "key:11"
redis 127.0.0.1:6379>
```
When the first line of the response is 0, there is no more data to scan; otherwise, that number is the next cursor to use in the following scan.
It is important to note that the MATCH filter is applied after elements are retrieved from the collection, just before the data is returned to the client. In other words, redis first retrieves a batch of elements and only then filters them against the pattern.
Does this mean there could be many round trips between the client and the server, and that this is where the huge latency of the redis scan comes from? If so, all we need is the extra COUNT parameter.
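To make the round-trip cost concrete, here is a small redis-py sketch of the cursor loop that scan_iter performs under the hood (the host, db, and match pattern are placeholders). Every SCAN command is a separate network round trip, so a full walk over the keyspace needs roughly keys/COUNT round trips.

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=10)  # placeholder connection details

cursor, matched = 0, []
while True:
    # One SCAN command = one client-server round trip.
    cursor, keys = r.scan(cursor, match="view_cache:*", count=10)
    matched.extend(keys)
    if cursor == 0:  # cursor 0 means the iteration is complete
        break
```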
2. Validation in local environment
2.1. Number of keys
How many keys are there in our online production redis server?

```
127.0.0.1:6379[10]> INFO
# more output is omitted here
db10:keys=6986,expires=6986,avg_ttl=40118300
```

From the above output, we can see there are nearly 7000 keys on our server.
2.2. Client-Server latency
I simply used the ping command to measure the latency, and it's about 3.5 ms. For confidentiality, sensitive information is redacted.

```
[xxx@xxxx.xxx.xxx.xxx ~]$ ping xxxx.com
PING xxxx.com (xx.xx.xx.xx) 56(84) bytes of data.
64 bytes from xxxx.com (xx.xx.xx.xx): icmp_seq=1 ttl=58 time=3.58 ms
64 bytes from xxxx.com (xx.xx.xx.xx): icmp_seq=2 ttl=58 time=3.46 ms

--- xxxx.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 3.461/3.498/3.585/0.050 ms
```
2.3. Simulate the Network Latency Locally
Following this post, I successfully simulated the network latency.

```
(venv) vagrant@localhost:~ $ sudo tc qdisc add dev lo root netem delay 2ms
(venv) vagrant@localhost:~ $ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=4.34 ms

--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 4.348/4.348/4.348/0.000 ms
```

So I slowed the lo NIC down to a round-trip time of about 4.3 ms (the 2 ms netem delay is applied in both directions on loopback).
Other useful tc qdisc commands:

basic syntax: tc qdisc add/del dev lo/eth0/eth1 root netem [delay 2ms]

```
# add rule to lo NIC
tc qdisc add dev lo root netem delay 2ms
# add rule to eth0 NIC
tc qdisc add dev eth0 root netem delay 2ms
# show current rules and statistics
tc -s qdisc
# delete rule
tc qdisc del dev eth0 root netem
```
2.4. Measure the time of cleaning the cache before and after setting the delay
Since this is a Django application, I can use the Django shell to execute the cache-cleaning statement, which in turn operates on the local redis server.

```
(venv) vagrant@localhost:~ $ time python manage.py shell -c 'from django.core.cache import cache; cache.delete_pattern("view_cache:dispatch:123456789:category_slug:all")'

real    0m8.404s
user    0m0.944s
sys     0m0.696s
```
What would the timing be if we didn't set the delay option?

```
(venv) vagrant@localhost:~ $ time python manage.py shell -c 'from django.core.cache import cache; cache.delete("view_cache:dispatch:123456789:category_slug:all")'

real    0m4.366s
user    0m0.800s
sys     0m0.648s
```
From the above test, we can see that the simulated latency makes a huge difference: as much as 4 seconds! And this is with a network latency of only 4.3 ms. With a network latency of around 3.5 ms, the total cache-cleaning time could easily reach 3 seconds, which is very close to the phenomenon observed above.
Now we can draw a conclusion: network latency has a huge impact on redis scan performance. SCAN calls redis many times back and forth, and these round trips add up to a large total scan time.
How do we solve the problem? Use the COUNT parameter!
3. Solve the problem
3.1. Estimating the deletion time
From the definition of the delete_pattern method, we can see it just calls scan_iter on redis with the default COUNT, which is 10. Let's make an estimate (see the sketch after this list):
- Redis SCAN walks the keyspace sequentially with COUNT = 10, so it takes about 6986/10 ≈ 700 SCAN calls.
- Each SCAN call costs at least one ping round trip, which is 3.5 ms.
- In total, 3.5 * 700 = 2450 ms, which is close to the 2680 ms of the phenomenon above. Interesting.
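A quick back-of-the-envelope check of that estimate in Python, using the key count from INFO and the measured ping time:

```python
keys = 6986        # from INFO: db10:keys=6986
scan_count = 10    # redis SCAN's default COUNT, used when scan_iter gets no count
rtt_ms = 3.5       # measured ping round-trip time in milliseconds

scan_calls = keys / scan_count        # ≈ 699 SCAN round trips
estimated_ms = scan_calls * rtt_ms    # ≈ 2445 ms, close to the observed 2680 ms
print(f"~{scan_calls:.0f} calls, ~{estimated_ms:.0f} ms")
```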
The estimate above makes us more confident in our guess: using COUNT must be the solution!
But the delete_pattern method provided by the default client doesn't use the COUNT parameter, so what I can do is create a new redis client that inherits from the default client and overrides delete_pattern to pass COUNT.
Here is my code:

```python
import socket

# Compatibility with redis-py 2.10.x+
from redis.exceptions import ConnectionError
from redis.exceptions import ResponseError

from django_redis.exceptions import ConnectionInterrupted, CompressorError
from django_redis.client import DefaultClient

try:
    from redis.exceptions import TimeoutError, ResponseError
    _main_exceptions = (TimeoutError, ResponseError, ConnectionError, socket.timeout)
except ImportError:
    _main_exceptions = (ConnectionError, socket.timeout)


class CacheClient(DefaultClient):

    def delete_pattern(self, pattern, itersize=None, version=None, prefix=None, client=None):
        """
        Remove all keys matching pattern.
        """
        if client is None:
            client = self.get_client(write=True)

        pattern = self.make_key(pattern, version=version, prefix=prefix)
        try:
            count = 0
            for key in client.scan_iter(pattern, count=itersize):  # pass COUNT via itersize
                client.delete(key)
                count += 1
            return count
        except _main_exceptions as e:
            raise ConnectionInterrupted(connection=client, parent=e)
```
When I call this method, I make sure to pass a very large itersize, say 10,000, to eliminate the many back-and-forth round trips.
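For instance, a call site in the application could pass the larger page size like this (the key pattern below is illustrative, mirroring the shell tests above):

```python
from django.core.cache import cache

# With the custom CacheClient installed, one SCAN page of 10,000 keys covers the
# whole ~7,000-key database, so the pattern delete needs only a handful of round trips.
cache.delete_pattern("view_cache:dispatch:123456789:category_slug:*", itersize=10000)
```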
3.2. Enable the customized redis client class
Of course, we need to enable this client in our settings/local.py:

```python
CACHES = {
    'default': {
        'BACKEND': 'django_redis.cache.RedisCache',
        # redis server
        'LOCATION':
            os.environ.get('NOJ_REDIS_CACHE_URL', 'redis://localhost:6379/10'),
        'OPTIONS': {
            'PARSER_CLASS': 'redis.connection.HiredisParser',
            'CLIENT_CLASS': 'common.cache_client.CacheClient'  # enable the customized redis client class
        }
    }
}
```
3.3. Test after fixing
I passed 10000 as the itersize argument to my delete_pattern call.

```
(venv) vagrant@localhost:~ $ time python manage.py shell -c 'from django.core.cache import cache; cache.delete_pattern("view_cache:dispatch:123456789:category_slug:all",10000)'

real    0m4.021s
user    0m0.784s
sys     0m0.612s
```
The time is now very close to the time measured before the network latency was added, which means delete_pattern is no longer a performance issue, since the scan is now a single round trip.
Cheers, bro!
4. Conclusion
- Don't be afraid of the library source code; just dig into it.
- Use the tc command to simulate network latency.
- MATCH in Redis SCAN filters on top of the already-retrieved data.
- The default COUNT of Redis SCAN is 10.
- Use python manage.py shell to run ad-hoc tests.
- Use ping to measure network latency.
- Use INFO to get redis statistics.
References