Recently I ran into an issue where one of the APIs on our online production servers was very slow due to a redis scan.
From the following graph generated by NewRelic, we found that the AVERAGE response time was 2930 ms, of which the redis scan took 2680 ms, i.e. 2680/2930 = 91.4% of the total time.
Why does the redis scan cost this much? In our local environment, we had never noticed such a huge performance issue.
1. Locate the culprit
1.1. Phenomenon
After that, I dug into the source code and found the following snippet in our Django application. It is the only code in this API that accesses redis, so it must be the culprit.

```python
from django.core.cache import cache
```
From the above code, we can see that redis is used as a cache server here.
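The snippet above is truncated to the import; judging from the shell commands used later in this post, the cache-invalidation call in the view looked something like the following. This is a hypothetical reconstruction, and the key pattern is only illustrative.

```python
from django.core.cache import cache

# Hypothetical reconstruction of the API's cache invalidation: delete every cached
# entry whose key matches the pattern. With django_redis, this walks the keyspace
# using SCAN under the hood.
cache.delete_pattern("view_cache:dispatch:123456789:category_slug:*")
```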
So what does delete_pattern actually do? I traced the code and eventually found the following delete_pattern definition in django_redis.client.default.DefaultClient#delete_pattern.
```python
def delete_pattern(self, pattern, version=None, prefix=None, client=None):
    ...
```
The method iterates with client.scan_iter(pattern), which repeatedly issues SCAN with the next cursor, and deletes each matching key it returns.
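For reference, omitting django_redis's key prefixing and error wrapping, the body boils down to roughly the loop below: scan_iter is called with no count argument, so each underlying SCAN uses redis's default COUNT of 10, and every matching key is deleted one by one.

```python
def delete_pattern_default(client, pattern):
    """Rough sketch of DefaultClient.delete_pattern (prefixing and error
    wrapping omitted); `client` is a redis-py client."""
    count = 0
    for key in client.scan_iter(pattern):  # SCAN pages with redis's default COUNT (10)
        client.delete(key)                 # one DEL round trip per matching key
        count += 1
    return count
```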
1.2. Redis scan
From the official site of redis SCAN, we can combine MATCH and COUNT in the SCAN command:

```
redis 127.0.0.1:6379> scan 0 MATCH *11*
1) "288"
2) 1) "key:911"
redis 127.0.0.1:6379> scan 288 MATCH *11*
1) "224"
2) (empty list or set)
redis 127.0.0.1:6379> scan 224 MATCH *11*
1) "80"
2) (empty list or set)
redis 127.0.0.1:6379> scan 80 MATCH *11*
1) "176"
2) (empty list or set)
redis 127.0.0.1:6379> scan 176 MATCH *11* COUNT 1000
1) "0"
2)  1) "key:611"
    2) "key:711"
    3) "key:118"
    4) "key:117"
    5) "key:311"
    6) "key:112"
    7) "key:111"
    8) "key:110"
    9) "key:113"
   10) "key:211"
   11) "key:411"
   12) "key:115"
   13) "key:116"
   14) "key:114"
   15) "key:119"
   16) "key:811"
   17) "key:511"
   18) "key:11"
redis 127.0.0.1:6379>
```
When the first line of the response is 0, there is no more data to scan; otherwise, that number is the next cursor to use in the following scan.
It is important to note that the MATCH filter is applied after elements are retrieved from the collection, just before the data is returned to the client. In other words, redis first retrieves a batch of elements and only then filters them against the pattern.
Does this mean there could be many round trips between the client and the server, and that this is where the huge latency of the redis scan comes from? If so, all we need is the extra COUNT parameter.
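To make the round-trip cost concrete, here is a small redis-py sketch of the cursor loop that scan_iter performs under the hood (the host, db, and match pattern are placeholders). Every SCAN command is a separate network round trip, so a full walk over the keyspace needs roughly keys/COUNT round trips.

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=10)  # placeholder connection details

cursor, matched = 0, []
while True:
    # One SCAN command = one client-server round trip.
    cursor, keys = r.scan(cursor, match="view_cache:*", count=10)
    matched.extend(keys)
    if cursor == 0:  # cursor 0 means the iteration is complete
        break
```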
2. Validation in local environment
2.1. Number of keys
How many keys are there in our online production redis server?

```
127.0.0.1:6379[10]> INFO
# more output is omitted here
db10:keys=6986,expires=6986,avg_ttl=40118300
```

From the above output, we can see there are nearly 7000 keys on our server.
2.2. Client-Server latency
I simply used the ping command to measure the latency, and it's about 3.5 ms. For confidentiality, sensitive information is redacted.

```
[xxx@xxxx.xxx.xxx.xxx ~]$ ping xxxx.com
PING xxxx.com (xx.xx.xx.xx) 56(84) bytes of data.
64 bytes from xxxx.com (xx.xx.xx.xx): icmp_seq=1 ttl=58 time=3.58 ms
64 bytes from xxxx.com (xx.xx.xx.xx): icmp_seq=2 ttl=58 time=3.46 ms

--- xxxx.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 3.461/3.498/3.585/0.050 ms
```
2.3. Simulate the Network Latency Locally
Following this post, I successfully simulated the network latency.

```
(venv) vagrant@localhost:~ $ sudo tc qdisc add dev lo root netem delay 2ms
(venv) vagrant@localhost:~ $ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=4.34 ms

--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 4.348/4.348/4.348/0.000 ms
```

So I slowed the lo NIC down to a round-trip time of about 4.3 ms (the 2 ms netem delay is applied in both directions on loopback).
Other useful tc qdisc commands:

basic syntax: tc qdisc add/del dev lo/eth0/eth1 root netem [delay 2ms]

```
# add rule to lo NIC
tc qdisc add dev lo root netem delay 2ms
# add rule to eth0 NIC
tc qdisc add dev eth0 root netem delay 2ms
# show current rules and statistics
tc -s qdisc
# delete rule
tc qdisc del dev eth0 root netem
```
2.4. Measure the time of cleaning the cache before and after setting the delay
Since this is a Django application, I can use the Django shell to execute the cache-cleaning statement, which in turn operates on the local redis server.

```
(venv) vagrant@localhost:~ $ time python manage.py shell -c 'from django.core.cache import cache; cache.delete_pattern("view_cache:dispatch:123456789:category_slug:all")'

real    0m8.404s
user    0m0.944s
sys     0m0.696s
```
What would the timing be if we didn't set the delay option?

```
(venv) vagrant@localhost:~ $ time python manage.py shell -c 'from django.core.cache import cache; cache.delete("view_cache:dispatch:123456789:category_slug:all")'

real    0m4.366s
user    0m0.800s
sys     0m0.648s
```
From the above test, we can see that the simulated latency makes a huge difference: as much as 4 seconds! And this is with a network latency of only 4.3 ms. With a network latency of around 3.5 ms, the total cache-cleaning time could easily reach 3 seconds, which is very close to the phenomenon observed above.
Now we can draw a conclusion: network latency has a huge impact on redis scan performance. SCAN calls redis many times back and forth, and these round trips add up to a large total scan time.
How do we solve the problem? Use the COUNT parameter!
3. Solve the problem
3.1. Estimating the deletion time
From the definition of the delete_pattern method, we can see it just calls scan_iter on redis with the default COUNT, which is 10. Let's make an estimate (see the sketch after this list):
- Redis SCAN walks the keyspace sequentially with COUNT = 10, so it takes about 6986/10 ≈ 700 SCAN calls.
- Each SCAN call costs at least one ping round trip, which is 3.5 ms.
- In total, 3.5 * 700 = 2450 ms, which is close to the 2680 ms of the phenomenon above. Interesting.
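A quick back-of-the-envelope check of that estimate in Python, using the key count from INFO and the measured ping time:

```python
keys = 6986        # from INFO: db10:keys=6986
scan_count = 10    # redis SCAN's default COUNT, used when scan_iter gets no count
rtt_ms = 3.5       # measured ping round-trip time in milliseconds

scan_calls = keys / scan_count        # ≈ 699 SCAN round trips
estimated_ms = scan_calls * rtt_ms    # ≈ 2445 ms, close to the observed 2680 ms
print(f"~{scan_calls:.0f} calls, ~{estimated_ms:.0f} ms")
```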
The estimate above makes us more confident in our guess: using COUNT must be the solution!
But the delete_pattern method provided by the default client doesn't use the COUNT parameter, so what I can do is create a new redis client that inherits from the default client and overrides delete_pattern to pass COUNT.
Here is my code:

```python
import socket

# Compatibility with redis-py 2.10.x+
from redis.exceptions import ConnectionError
from redis.exceptions import ResponseError

from django_redis.exceptions import ConnectionInterrupted, CompressorError
from django_redis.client import DefaultClient

try:
    from redis.exceptions import TimeoutError, ResponseError
    _main_exceptions = (TimeoutError, ResponseError, ConnectionError, socket.timeout)
except ImportError:
    _main_exceptions = (ConnectionError, socket.timeout)


class CacheClient(DefaultClient):

    def delete_pattern(self, pattern, itersize=None, version=None, prefix=None, client=None):
        """
        Remove all keys matching pattern.
        """
        if client is None:
            client = self.get_client(write=True)

        pattern = self.make_key(pattern, version=version, prefix=prefix)
        try:
            count = 0
            for key in client.scan_iter(pattern, count=itersize):  # pass COUNT via itersize
                client.delete(key)
                count += 1
            return count
        except _main_exceptions as e:
            raise ConnectionInterrupted(connection=client, parent=e)
```
When I call this method, I make sure to pass a very large itersize, say 10,000, to eliminate the many back-and-forth round trips.
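For instance, a call site in the application could pass the larger page size like this (the key pattern below is illustrative, mirroring the shell tests above):

```python
from django.core.cache import cache

# With the custom CacheClient installed, one SCAN page of 10,000 keys covers the
# whole ~7,000-key database, so the pattern delete needs only a handful of round trips.
cache.delete_pattern("view_cache:dispatch:123456789:category_slug:*", itersize=10000)
```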
3.2. Enable the customized redis client class
Of course, we need to enable this client in our settings/local.py:

```python
CACHES = {
    'default': {
        'BACKEND': 'django_redis.cache.RedisCache',
        # redis server
        'LOCATION':
            os.environ.get('NOJ_REDIS_CACHE_URL', 'redis://localhost:6379/10'),
        'OPTIONS': {
            'PARSER_CLASS': 'redis.connection.HiredisParser',
            'CLIENT_CLASS': 'common.cache_client.CacheClient'  # enable the customized redis client class
        }
    }
}
```
3.3. Test after fixing
I passed 10000 as the itersize argument to my delete_pattern call.

```
(venv) vagrant@localhost:~ $ time python manage.py shell -c 'from django.core.cache import cache; cache.delete_pattern("view_cache:dispatch:123456789:category_slug:all",10000)'

real    0m4.021s
user    0m0.784s
sys     0m0.612s
```
The time is now very close to the time measured before the network latency was added, which means delete_pattern is no longer a performance issue, since the scan is now a single round trip.
Cheers, bro!
4. Conclusion
- Don't be afraid of the library source code; just dig into it.
- Use the tc command to simulate network latency.
- MATCH in Redis SCAN filters on top of the already-retrieved data.
- The default COUNT of Redis SCAN is 10.
- Use python manage.py shell to run ad-hoc tests.
- Use ping to measure network latency.
- Use INFO to get redis statistics.
References