SentinelManagedConnection delays new master detection and risks data loss

Summary

When using a SentinelManagedConnection with a retry mechanism, the connection does not detect a new master even after retrying and SENTINEL_DOWN_AFTER_MILLISECONDS has passed. Instead, it remains stuck inside the super().connect() retry loop, continuing to attempt connections to the old master. Only after exhausting all retry attempts does it finally invoke _connect_retry, which allows get_master_address to retrieve the updated master information.
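For orientation, a rough sketch of the call chain involved (method names as in the redis-py snippets quoted under "Possible solutions"; details may vary between versions):

    # SentinelManagedConnection.connect()
    #   -> self.retry.call_with_retry(self._connect_retry, ...)           # outer retry loop
    #        -> self._connect_retry()
    #             -> address = self.connection_pool.get_master_address()  # master looked up HERE
    #             -> self.connect_to(address)
    #                  -> super().connect()                               # AbstractConnection.connect()
    #                       -> self.retry.call_with_retry(self._connect, ...)  # inner retry loop:
    #                          # keeps reconnecting to the SAME (old) address until retries run out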


Expected Behavior

When the master goes down and SENTINEL_DOWN_AFTER_MILLISECONDS elapses, the connection should, at the next opportunity, try to identify a new master and redirect traffic accordingly, minimizing downtime and avoiding writes to outdated instances.

Actual Behavior

The connection persists in the super().connect() (from AbstractConnection) retry loop, attempting to reconnect to the old master. Even after a new master has been elected by Redis, it does not detect this change until all retry attempts to the old master have failed.
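A quick way to observe this (a debugging sketch only; `Connection._connect` is a private redis-py method and may change between versions) is to log the target of every low-level connection attempt:

    import redis.connection

    _original_connect = redis.connection.Connection._connect

    def traced_connect(self):
        # Print the address of every low-level TCP connection attempt
        # (note: this also logs connections to the sentinels themselves).
        print(f"connect attempt to {self.host}:{self.port}")
        return _original_connect(self)

    redis.connection.Connection._connect = traced_connect

During the failover window this keeps printing the old master's address for every retry; the new master's address only shows up once the inner retry budget has been exhausted.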

Dangerous Side Effects

Beyond the unexpected delay, this behavior can cause data integrity issues: a write can be acknowledged as successful even though it was sent to the old, now-demoted master, and it is silently lost after the failover (see the reproduction below, where pipeline.execute() returns [1] but nothing is stored).


Reproducing the behavior

Reproduction is somewhat sensitive to timing: how long the failover takes and how quickly the old master rejoins as a replica versus the backoff time between retries. Here is a draft:

  1. Create a Redis Sentinel cluster using something like the following docker-compose.yml:

    services:
      # Redis Node #1 (initial master) + Sentinel
      redis-node-1:
        container_name: redis-node-1
        image: bitnami/redis:7.2.4
        environment:
          - ALLOW_EMPTY_PASSWORD=yes
          - REDIS_AOF_ENABLED=no
        ports:
          - 6380:6379
        networks:
          - nw

      redis-node-1-sentinel:
        container_name: redis-node-1-sentinel
        image: bitnami/redis-sentinel:7.2.4
        depends_on:
          - redis-node-1
        environment:
          - REDIS_MASTER_HOST=redis-node-1
          - REDIS_SENTINEL_MASTER_NAME=mymaster
          - REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS=5000
          - REDIS_SENTINEL_FAILOVER_TIMEOUT=10000
          - REDIS_SENTINEL_QUORUM=2
        ports:
          - 36380:26379
        networks:
          - nw

      # Redis Node #2 + Sentinel
      redis-node-2:
        container_name: redis-node-2
        image: bitnami/redis:7.2.4
        environment:
          - ALLOW_EMPTY_PASSWORD=yes
          - REDIS_REPLICATION_MODE=slave
          - REDIS_MASTER_HOST=redis-node-1
          - REDIS_AOF_ENABLED=no
        ports:
          - 6381:6379
        networks:
          - nw

      redis-node-2-sentinel:
        container_name: redis-node-2-sentinel
        image: bitnami/redis-sentinel:7.2.4
        depends_on:
          - redis-node-2
        environment:
          - REDIS_MASTER_HOST=redis-node-1
          - REDIS_SENTINEL_MASTER_NAME=mymaster
          - REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS=5000
          - REDIS_SENTINEL_FAILOVER_TIMEOUT=10000
          - REDIS_SENTINEL_QUORUM=2
        ports:
          - 36381:26379
        networks:
          - nw

      # Redis Node #3 + Sentinel
      redis-node-3:
        container_name: redis-node-3
        image: bitnami/redis:7.2.4
        environment:
          - ALLOW_EMPTY_PASSWORD=yes
          - REDIS_REPLICATION_MODE=slave
          - REDIS_MASTER_HOST=redis-node-1
          - REDIS_AOF_ENABLED=no
        ports:
          - 6382:6379
        networks:
          - nw

      redis-node-3-sentinel:
        container_name: redis-node-3-sentinel
        image: bitnami/redis-sentinel:7.2.4
        depends_on:
          - redis-node-3
        environment:
          - REDIS_MASTER_HOST=redis-node-1
          - REDIS_SENTINEL_MASTER_NAME=mymaster
          - REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS=5000
          - REDIS_SENTINEL_FAILOVER_TIMEOUT=10000
          - REDIS_SENTINEL_QUORUM=2
        ports:
          - 36382:26379
        networks:
          - nw

    networks:
      nw:
        driver: bridge
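With the file saved as docker-compose.yml, the cluster can be brought up with `docker compose up -d`. Before running the script, wait until the sentinels report a master, for example with something like `redis-cli -p 36380 SENTINEL get-master-addr-by-name mymaster`.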

  2. Run this Python script:

    import time
    from threading import Thread

    import docker

    from redis.backoff import ExponentialBackoff
    from redis.retry import Retry
    from redis.sentinel import Sentinel


    def restart_container(delay, container):
        time.sleep(delay)
        container.restart()


    def problems():
        # Find the container that runs the initial master (redis-node-1).
        for el in docker.from_env().containers.list(all=True):
            if "redis-node-1" == el.name:
                container_to_be_stopped = el

        sentinel = Sentinel(
            [("localhost", 36380), ("localhost", 36381)],
            socket_timeout=1,
            socket_connect_timeout=1,
            health_check_interval=5,
            retry=Retry(ExponentialBackoff(10, 0.5), 5),
        )
        redis = sentinel.master_for("mymaster")

        pipeline = redis.pipeline()
        pipeline.hset("myhash", "name", "john doe")

        print("Shutdown master")
        container_to_be_stopped.stop()

        # Restart the old master (it rejoins as a replica) a few seconds later.
        stop_thread = Thread(target=restart_container, args=(5, container_to_be_stopped))
        stop_thread.start()
        resp = pipeline.execute()

        stop_thread.join()

        print(resp)
        # You can also call hgetall("myhash") to verify that nothing was written,
        # despite resp being [1] - i.e. it claims one field-value pair was written.


    problems()
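To make the lost write from the comment above visible, the client can be queried once the failover has settled (a sketch, reusing the `redis` client from the script):

    # The pipeline reported success ([1]), yet the hash is typically empty
    # because the write was sent to the old, demoted master.
    print(redis.hgetall("myhash"))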

Comments:

The quoted lines below are an excerpt from Sentinel.discover_master() in redis/sentinel.py, i.e. the code that would return the newly promoted master as soon as it is asked again:

        Returns a pair (address, port) or raises MasterNotFoundError if no
        master is found.
        """
        collected_errors = list()
        for sentinel_no, sentinel in enumerate(self.sentinels):
            try:
                masters = sentinel.sentinel_masters()
            except (ConnectionError, TimeoutError) as e:
                collected_errors.append(f"{sentinel} - {e!r}")
                continue
            state = masters.get(service_name)
            if state and self.check_master_state(state, service_name):
                # Put this sentinel at the top of the list
                self.sentinels[0], self.sentinels[sentinel_no] = (
                    sentinel,
                    self.sentinels[0],
                )

Possible solutions

The issue stems from how SentinelManagedConnection's retry-connect logic nests a second retry loop inside AbstractConnection.connect():

    # SentinelManagedConnection (redis/sentinel.py)
    def connect_to(self, address):
        self.host, self.port = address
        super().connect()
        if self.connection_pool.check_connection:
            self.send_command("PING")
            if str_if_bytes(self.read_response()) != "PONG":
                raise ConnectionError("PING failed")

    def _connect_retry(self):
        if self._sock:
            return  # already connected
        if self.connection_pool.is_master:
            self.connect_to(self.connection_pool.get_master_address())
        else:
            for slave in self.connection_pool.rotate_slaves():
                try:
                    return self.connect_to(slave)
                except ConnectionError:
                    continue
            raise SlaveNotFoundError  # Never be here

    def connect(self):
        return self.retry.call_with_retry(self._connect_retry, lambda error: None)

    # AbstractConnection (redis/connection.py) - this is the super().connect() used above
    def connect(self):
        "Connects to the Redis server if not already connected"
        if self._sock:
            return
        try:
            sock = self.retry.call_with_retry(
                lambda: self._connect(), lambda error: self.disconnect(error)
            )
        except socket.timeout:
            raise TimeoutError("Timeout connecting to server")
        except OSError as e:
            raise ConnectionError(self._error_message(e))

In my opinion, SentinelManagedConnection should handle the retries itself, so it can switch to the newly discovered master address without waiting for all retries against the old master to fail. For example, modifying connect_to:

def connect_to(self, address):
    self.host, self.port = address
    super().connect(retry=False)  # <- changed here
    if self.connection_pool.check_connection:
        self.send_command("PING")
        if str_if_bytes(self.read_response()) != "PONG":
            raise ConnectionError("PING failed")

and updating AbstractConnection.connect() to accept a retry=True parameter:

def connect(self, retry=True):
    "Connects to the Redis server if not already connected"
    if self._sock:
        return

    try:
        if retry:
            sock = self.retry.call_with_retry(
                lambda: self._connect(), lambda error: self.disconnect(error)
            )
        else:
            sock = self._connect()
   ...

However, having flags flying around (😉) might not be very pretty, so a better approach might be to stop calling the superclass implementation and instead let SentinelManagedConnection maintain its own version of AbstractConnection.connect().
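A rough sketch of that alternative, using a hypothetical helper named _connect_once (not a tested patch; the real AbstractConnection.connect() also runs connect callbacks and error cleanup, which are omitted here):

    import socket

    from redis.connection import Connection
    from redis.exceptions import ConnectionError, TimeoutError
    from redis.utils import str_if_bytes


    class SentinelManagedConnection(Connection):
        def _connect_once(self):
            # Hypothetical helper: a single connection attempt to the currently
            # configured address - AbstractConnection.connect() without the
            # inner retry loop.
            if self._sock:
                return
            try:
                sock = self._connect()
            except socket.timeout:
                raise TimeoutError("Timeout connecting to server")
            except OSError as e:
                raise ConnectionError(self._error_message(e))
            self._sock = sock
            self.on_connect()  # post-connect handshake, as in AbstractConnection.connect()

        def connect_to(self, address):
            self.host, self.port = address
            self._connect_once()  # instead of super().connect()
            if self.connection_pool.check_connection:
                self.send_command("PING")
                if str_if_bytes(self.read_response()) != "PONG":
                    raise ConnectionError("PING failed")

With this, the outer retry.call_with_retry(self._connect_retry, ...) in SentinelManagedConnection.connect() stays the only retry loop, and every retry goes through get_master_address() again.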

Additional Comments

This might also happen in the async version since it follows a similar pattern.