Software Engineering and Machine Learning enthusiast: January 2020

Background of our architecture

We run a very simple architecture: hundreds of auto-scaled Java web boxes, with Redis Cluster as the primary data source and Jedis as the Redis client. A single serving call has to hit many different Redis instances, and the clients of our service call us with a 20 ms timeout.

Problem we were facing

At very high scale, whenever the CPU of a Redis box went beyond 60%, the whole cluster spiked to 100% CPU and IOPS dropped. Because of this we had to over-provision the Redis boxes.

Things we tried to debug the issue

Enabled debug logs in Redis

The debug logs hinted that things were going wrong but didn't give the full picture.

Observation from the debug logs: hundreds of new connections were being created very frequently.

Ran strace on Redis server with full text and timestamps

`sudo strace -s 1000 -tt -T -p <PID>`

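This gave us very good information about what Redis does internally. Here `-s 1000` prints up to 1000 characters of each string argument, `-tt` adds timestamps with microseconds, and `-T` shows the time spent in each system call.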

Redis makes its system calls in batches. The three major system calls are:

Accept
Read
Write

Because of this batched accepting, reading, and writing, requests were timing out, which made Jedis drop the TCP connection and then create a new one.
So whenever there was a momentary spike in calls, Jedis created new connections, which in turn caused existing connections to time out, more connection-break logging, and so on. The system ended up in a cycle of creating and destroying connections.
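As a rough illustration of how the connection churn in an strace capture could be quantified, here is a minimal Java sketch that counts successful `accept` calls per second. The class name `StraceAcceptCounter` and the sample lines are hypothetical, and the pattern assumes the line layout that `strace -tt -T` typically prints, so adjust it if your output looks different.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StraceAcceptCounter {
    // Assumed line shape: 12:00:01.000123 accept(4, {...}, [16]) = 7 <0.000015>
    // The timestamp comes from -tt; accept4 is tolerated in case the build uses it.
    private static final Pattern ACCEPT_LINE =
            Pattern.compile("^(\\d{2}:\\d{2}:\\d{2})\\.\\d+ accept4?\\(.*\\) = (-?\\d+)");

    /** Returns the number of successful accept calls per wall-clock second, ordered by time. */
    static Map<String, Integer> acceptsPerSecond(List<String> straceLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : straceLines) {
            Matcher m = ACCEPT_LINE.matcher(line);
            // A negative return value (for example -1 EAGAIN) means nothing was accepted.
            if (m.find() && Integer.parseInt(m.group(2)) >= 0) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not taken from our capture.
        List<String> sample = List.of(
                "12:00:01.000123 accept(4, {sa_family=AF_INET, sin_port=htons(50001), sin_addr=inet_addr(\"10.0.0.5\")}, [16]) = 7 <0.000015>",
                "12:00:01.000170 accept(4, 0x7ffd1c2b3a40, [128]) = -1 EAGAIN (Resource temporarily unavailable) <0.000008>",
                "12:00:02.000300 accept(4, {sa_family=AF_INET, sin_port=htons(50003), sin_addr=inet_addr(\"10.0.0.6\")}, [16]) = 9 <0.000011>");
        acceptsPerSecond(sample)
                .forEach((second, n) -> System.out.println(second + " accepts=" + n));
    }
}
```

A sudden jump in the per-second count is the kind of signal that would line up with the reconnect cycle described above.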

Ran tcpdump on client with full text and timestamp

`sudo tcpdump -tttt -A -i any port <Redis Port>`

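This made it clear what was happening at the TCP level. Here `-tttt` prints a full date and time for every packet, `-A` prints the packet payload as ASCII, and `-i any` captures on all interfaces.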
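To make the TCP-level picture a bit more concrete, here is a small Java sketch that summarizes a tcpdump capture by counting SYN, FIN, and RST packets. The class name `TcpFlagSummary` and the sample lines are hypothetical, and the pattern assumes the usual `Flags [...]` field that tcpdump prints for TCP packets after the `-tttt` timestamp.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TcpFlagSummary {
    // -tttt prefixes each packet line with "YYYY-MM-DD HH:MM:SS.micros";
    // the TCP flags are assumed to appear later on the line as "Flags [..]".
    private static final Pattern PACKET_LINE = Pattern.compile(
            "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d+ .*Flags \\[([^\\]]+)\\]");

    /** Returns {syn, fin, rst}: bare SYN packets, FIN packets, and RST packets seen. */
    static int[] summarize(List<String> tcpdumpLines) {
        int syn = 0, fin = 0, rst = 0;
        for (String line : tcpdumpLines) {
            Matcher m = PACKET_LINE.matcher(line);
            if (!m.find()) {
                continue; // payload lines printed by -A carry no timestamp prefix
            }
            String flags = m.group(1);
            if (flags.equals("S")) syn++;    // bare SYN: a client opening a new connection
            if (flags.contains("F")) fin++;  // FIN from either side
            if (flags.contains("R")) rst++;  // RST from either side
        }
        return new int[] {syn, fin, rst};
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not taken from our capture.
        List<String> sample = List.of(
                "2020-01-02 12:00:01.000100 IP 10.0.0.5.50001 > 10.0.0.9.6379: Flags [S], seq 1000, win 29200, length 0",
                "2020-01-02 12:00:01.000300 IP 10.0.0.9.6379 > 10.0.0.5.50001: Flags [S.], seq 2000, ack 1001, win 28960, length 0",
                "2020-01-02 12:00:01.030000 IP 10.0.0.5.50001 > 10.0.0.9.6379: Flags [F.], seq 1015, ack 2001, win 229, length 0");
        int[] s = summarize(sample);
        System.out.println("SYN=" + s[0] + " FIN=" + s[1] + " RST=" + s[2]);
    }
}
```

On a client that keeps tearing down and re-opening connections, the SYN count would be expected to stay high relative to the number of requests, rather than settling near zero once the pool is warm.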


Thursday, January 2, 2020

How to debug Redis at high scale

Our solution was to increase the number of idle connections, so that a sudden spike in calls can be served without creating new connections.
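The effect of keeping more idle connections can be sketched with a toy pool in plain Java. This is not Jedis code: `WarmPoolSketch`, `Conn`, and the numbers are hypothetical, and the model is single-threaded for simplicity. In Jedis the corresponding knobs would presumably be the pool settings on `JedisPoolConfig` such as `setMinIdle`, `setMaxIdle`, and `setMaxTotal`; the values that fit a given workload are not something this sketch can tell you.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class WarmPoolSketch {
    /** Stand-in for a real connection; creating one is the expensive step. */
    static final class Conn { }

    private final Deque<Conn> idle = new ArrayDeque<>();
    private int created = 0;

    /** Pre-creates minIdle connections so they are ready before traffic arrives. */
    WarmPoolSketch(int minIdle) {
        for (int i = 0; i < minIdle; i++) {
            idle.push(newConn());
        }
    }

    private Conn newConn() {
        created++; // with a real client this would be a TCP connect, i.e. an accept on the server
        return new Conn();
    }

    /** Hands out an idle connection if one exists, otherwise opens a new one. */
    Conn borrow() {
        Conn c = idle.poll();
        return c != null ? c : newConn();
    }

    void giveBack(Conn c) {
        idle.push(c);
    }

    int createdCount() {
        return created;
    }

    /** Returns how many connections a burst of concurrent calls forces the pool to open. */
    static int connectionsOpenedBySpike(int minIdle, int concurrentCalls) {
        WarmPoolSketch pool = new WarmPoolSketch(minIdle);
        int before = pool.createdCount();
        List<Conn> inFlight = new ArrayList<>();
        for (int i = 0; i < concurrentCalls; i++) {
            inFlight.add(pool.borrow());
        }
        for (Conn c : inFlight) {
            pool.giveBack(c);
        }
        return pool.createdCount() - before;
    }

    public static void main(String[] args) {
        System.out.println(connectionsOpenedBySpike(8, 50));  // prints 42
        System.out.println(connectionsOpenedBySpike(64, 50)); // prints 0
    }
}
```

With only 8 warm connections, a burst of 50 concurrent calls opens 42 new ones during the spike; with 64 warm connections the same burst opens none, which is the behaviour we were after.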
