Thursday, January 2, 2020

How to debug Redis at high scale

Background of our architecture

We run a very simple architecture: hundreds of auto-scaled Java web boxes, with Redis Cluster as the primary data source and Jedis as the Redis client. A single serving call has to hit many different Redis instances, and the clients of our service call us with a 20 ms timeout.

Problem we were facing

At very high scale, whenever the CPU of a Redis node went beyond 60%, the whole cluster spiked to 100% CPU and IOPS dropped. Because of this we had to over-provision the Redis boxes.

Things we tried to debug the issue

  • Enabled debug logs in Redis

    • The debug logs hinted that things were going wrong but didn't give the full picture.
      • Observation from the debug logs: hundreds of new connections were being created very frequently.

  • Ran strace on the Redis server with full text and timestamps

sudo strace -s 1000 -tt -T -p <PID>

    • This gave us very good information about what Redis does internally. 
      • Redis makes its system calls in batches. The three major system calls are:
        • Accept
        • Read 
        • Write
      • Because of this batched accepting, reading, and writing, requests were timing out. Each timeout made Jedis drop the TCP connection and then create a new one.
      • So whenever there was a momentary spike in calls, Jedis created new connections, which led to timeouts on existing connections, logging of broken connections, and so on. The system ended up in a cycle of creating and destroying connections.
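
To see the batching in numbers rather than by eye, the strace output can be tallied per system call. Below is a minimal sketch of how such a trace could be summarised. It assumes the trace was saved to a file (for example with strace's -o option) in the -tt -T format used above, where each line starts with a timestamp and ends with the call duration in angle brackets. The class name StraceSummary and its methods are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StraceSummary {
    // Matches lines such as:
    // 10:15:01.123456 read(8, "...", 16384) = 42 <0.000015>
    private static final Pattern LINE = Pattern.compile(
        "^\\d{2}:\\d{2}:\\d{2}\\.\\d+\\s+(\\w+)\\(.*<(\\d+\\.\\d+)>\\s*$");

    /** Call count and total time spent, per system call name. */
    static final class Stat {
        int calls;
        long totalMicros;
    }

    /** Tallies calls and durations; lines that do not look like a syscall are skipped. */
    static Map<String, Stat> summarize(List<String> lines) {
        Map<String, Stat> stats = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue;
            }
            Stat s = stats.computeIfAbsent(m.group(1), k -> new Stat());
            s.calls++;
            // strace -T reports seconds; convert to microseconds.
            s.totalMicros += Math.round(Double.parseDouble(m.group(2)) * 1_000_000);
        }
        return stats;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StraceSummary.java <strace-output-file>");
            return;
        }
        summarize(Files.readAllLines(Paths.get(args[0]))).forEach((name, s) ->
            System.out.printf("%-12s calls=%d total=%dus%n", name, s.calls, s.totalMicros));
    }
}
```

A high accept count relative to read and write in such a summary would be consistent with the connection churn described above.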

  • Ran tcpdump on the client with full text and timestamps

sudo tcpdump -tttt -A -i any port <Redis Port>

    • This further clarified what was happening at the TCP level.
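
One way to quantify the connection churn in a capture is to count client SYN packets per second, since every new TCP connection starts with one. The sketch below assumes the tcpdump output was redirected to a file and uses the -tttt timestamp format from the command above (date and time at the start of each packet line, with the TCP flags printed as "Flags [S]," for a SYN). SynCounter is a hypothetical helper, not part of tcpdump or Jedis.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SynCounter {
    // Captures the timestamp truncated to the second on lines carrying a pure SYN.
    // "Flags [S.]" (SYN-ACK, sent by the server) is deliberately not matched.
    private static final Pattern SYN = Pattern.compile(
        "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})\\.\\d+ .*Flags \\[S\\],");

    /** Returns the number of SYN packets seen in each one-second bucket. */
    static Map<String, Integer> synsPerSecond(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = SYN.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SynCounter.java <tcpdump-output-file>");
            return;
        }
        synsPerSecond(Files.readAllLines(Paths.get(args[0])))
            .forEach((second, n) -> System.out.println(second + "  new connections: " + n));
    }
}
```

Seconds with an unusually high count would mark the moments when the client was opening connections in bulk instead of reusing pooled ones.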

Our solution was to increase the number of idle connections, so that a sudden spike in calls could be handled without creating new connections.
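
Why a larger idle pool helps can be illustrated with a toy model. The sketch below is not Jedis code and the numbers are invented. It only models a pool that reuses idle connections, opens a new one when none is idle, and closes returned connections once the idle limit is reached. In Jedis the corresponding limits are set on the pool configuration (for example setMinIdle, setMaxIdle and setMaxTotal on JedisPoolConfig); the right values depend on the workload and are not given here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class IdlePoolModel {
    private final int maxIdle;
    private final Deque<Integer> idle = new ArrayDeque<>();
    private int nextId = 0;
    int created = 0;    // each creation stands for a TCP handshake plus an accept on the server
    int destroyed = 0;

    IdlePoolModel(int minIdle, int maxIdle) {
        this.maxIdle = maxIdle;
        for (int i = 0; i < minIdle; i++) {
            idle.push(open());  // pre-warm the pool
        }
    }

    private int open() {
        created++;
        return nextId++;
    }

    /** Reuses an idle connection if there is one, otherwise opens a new one. */
    int borrow() {
        return idle.isEmpty() ? open() : idle.pop();
    }

    /** Keeps the connection if the idle limit allows, otherwise closes it. */
    void release(int conn) {
        if (idle.size() < maxIdle) {
            idle.push(conn);
        } else {
            destroyed++;
        }
    }

    /** Borrows the given number of connections at once, then returns them all. */
    void burst(int concurrent) {
        List<Integer> inUse = new ArrayList<>();
        for (int i = 0; i < concurrent; i++) {
            inUse.add(borrow());
        }
        for (int conn : inUse) {
            release(conn);
        }
    }

    /** Connections opened while serving the spikes, excluding pre-warming. */
    static int createdDuringSpikes(int minIdle, int maxIdle, int spikeSize, int spikes) {
        IdlePoolModel pool = new IdlePoolModel(minIdle, maxIdle);
        int warm = pool.created;
        for (int i = 0; i < spikes; i++) {
            pool.burst(spikeSize);
        }
        return pool.created - warm;
    }

    public static void main(String[] args) {
        System.out.println("small idle pool: " + createdDuringSpikes(10, 10, 50, 3)); // prints 120
        System.out.println("large idle pool: " + createdDuringSpikes(50, 50, 50, 3)); // prints 0
    }
}
```

With only 10 idle connections, every spike of 50 concurrent calls opens 40 connections and closes them again afterwards, which is the create/destroy cycle described above. With 50 idle connections the same spikes are served entirely from the pool.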