Recovering from a failure

A "failure" is a situation in which the master becomes unavailable because of a hardware problem, network issues, or a software bug.

In a master-replica set with manual failover, if the master becomes unavailable, the replicas log error messages stating that the connection is lost:

2023-12-04 13:19:04.724 [16755] main/110/applier/replicator@127.0.0.1:3301 I> can't read row
2023-12-04 13:19:04.724 [16755] main/110/applier/replicator@127.0.0.1:3301 coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 19, aka 127.0.0.1:55932, peer of 127.0.0.1:3301: Broken pipe
2023-12-04 13:19:04.724 [16755] main/110/applier/replicator@127.0.0.1:3301 I> will retry every 1.00 second
2023-12-04 13:19:04.724 [16755] relay/127.0.0.1:55940/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 23, aka 127.0.0.1:3302, peer of 127.0.0.1:55940: Broken pipe
2023-12-04 13:19:04.724 [16755] relay/127.0.0.1:55940/101/main I> exiting the relay loop
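The applier errors in the excerpt above can be picked out of a log programmatically. The following is a minimal Python sketch, not part of Tarantool: the function name `find_lost_upstreams` is hypothetical, and the regular expression is derived only from the sample lines shown here, so other versions or log formats may need a different pattern.

```python
import re

# Assumption: the pattern mirrors the sample log lines above and nothing else.
APPLIER_ERROR = re.compile(
    r"^(?P<ts>\S+ \S+) \[\d+\] main/\d+/applier/(?P<peer>\S+) .*E> SocketError: "
)


def find_lost_upstreams(log_lines):
    """Return (timestamp, peer) pairs for applier socket errors."""
    hits = []
    for line in log_lines:
        match = APPLIER_ERROR.match(line.strip())
        if match:
            hits.append((match.group("ts"), match.group("peer")))
    return hits


# Three lines copied from the excerpt above: only the second one is an
# applier socket error, the third belongs to the relay.
sample_log = [
    "2023-12-04 13:19:04.724 [16755] main/110/applier/replicator@127.0.0.1:3301 I> can't read row",
    "2023-12-04 13:19:04.724 [16755] main/110/applier/replicator@127.0.0.1:3301 coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 19, aka 127.0.0.1:55932, peer of 127.0.0.1:3301: Broken pipe",
    "2023-12-04 13:19:04.724 [16755] relay/127.0.0.1:55940/101/main I> exiting the relay loop",
]

lost = find_lost_upstreams(sample_log)
print(lost)
```

Matching on the applier fiber name rather than on the relay lines keeps the result focused on the upstream the replica was reading from, which in this excerpt is the master at 127.0.0.1:3301.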

In a master-replica set with automated failover, the log also includes Raft messages that show the election of a new master:

2023-12-04 13:16:56.340 [16615] main/111/applier/replicator@127.0.0.1:3302 I> can't read row
2023-12-04 13:16:56.340 [16615] main/111/applier/replicator@127.0.0.1:3302 coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 24, aka 127.0.0.1:55687, peer of 127.0.0.1:3302: Broken pipe
2023-12-04 13:16:56.340 [16615] main/111/applier/replicator@127.0.0.1:3302 I> will retry every 1.00 second
2023-12-04 13:16:56.340 [16615] relay/127.0.0.1:55695/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 25, aka 127.0.0.1:3301, peer of 127.0.0.1:55695: Broken pipe
2023-12-04 13:16:56.340 [16615] relay/127.0.0.1:55695/101/main I> exiting the relay loop
2023-12-04 13:16:59.690 [16615] main/112/applier/replicator@127.0.0.1:3303 I> RAFT: message {term: 3, vote: 2, state: candidate, vclock: {1: 9}} from 2
2023-12-04 13:16:59.690 [16615] main/112/applier/replicator@127.0.0.1:3303 I> RAFT: received a newer term from 2
2023-12-04 13:16:59.690 [16615] main/112/applier/replicator@127.0.0.1:3303 I> RAFT: bump term to 3, follow
2023-12-04 13:16:59.690 [16615] main/112/applier/replicator@127.0.0.1:3303 I> RAFT: vote for 2, follow
2023-12-04 13:16:59.691 [16615] main/119/raft_worker I> RAFT: persisted state {term: 3}
2023-12-04 13:16:59.691 [16615] main/119/raft_worker I> RAFT: persisted state {term: 3, vote: 2}
2023-12-04 13:16:59.691 [16615] main/112/applier/replicator@127.0.0.1:3303 I> RAFT: message {term: 3, vote: 2, leader: 2, state: leader} from 2
2023-12-04 13:16:59.691 [16615] main/112/applier/replicator@127.0.0.1:3303 I> RAFT: vote request is skipped - this is a notification about a vote for a third node, not a request
2023-12-04 13:16:59.691 [16615] main/112/applier/replicator@127.0.0.1:3303 I> RAFT: leader is 2, follow
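To see which instance won the election, it is enough to follow the `RAFT: leader is N` messages in the excerpt above. The sketch below is an illustration in Python under that assumption: `last_seen_leader` is a hypothetical helper, and the pattern is taken from the sample lines only.

```python
import re

# Assumption: the message text matches the sample log lines above.
LEADER_LINE = re.compile(r"I> RAFT: leader is (?P<id>\d+)")


def last_seen_leader(log_lines):
    """Return the instance id from the most recent 'leader is N' message, or None."""
    leader = None
    for line in log_lines:
        match = LEADER_LINE.search(line)
        if match:
            leader = int(match.group("id"))
    return leader


# Two lines copied from the excerpt above.
raft_log = [
    "2023-12-04 13:16:59.690 [16615] main/112/applier/replicator@127.0.0.1:3303 I> RAFT: vote for 2, follow",
    "2023-12-04 13:16:59.691 [16615] main/112/applier/replicator@127.0.0.1:3303 I> RAFT: leader is 2, follow",
]

leader_id = last_seen_leader(raft_log)
print(leader_id)
```

The function keeps only the last match because a log can contain several elections, and the most recent message reflects the leader this instance currently follows.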

Executing box.info.replication on a replica reports the master's upstream status as disconnected:

auto_leader:instance001> box.info.replication
---
- 1:
    id: 1
    uuid: 4cfa6e3c-625e-b027-00a7-29b2f2182f23
    lsn: 32
    upstream:
      peer: replicator@127.0.0.1:3302
      lag: 0.00032305717468262
      status: disconnected
      idle: 48.352504000002
      message: 'connect, called on fd 20, aka 127.0.0.1:62575: Connection refused'
      system_message: Connection refused
    name: instance002
    downstream:
      status: stopped
      message: 'unexpected EOF when reading from socket, called on fd 32, aka 127.0.0.1:3301,
        peer of 127.0.0.1:62204: Broken pipe'
      system_message: Broken pipe
  2:
    id: 2
    uuid: 9bb111c2-3ff5-36a7-00f4-2b9a573ea660
    lsn: 1
    name: instance001
  3:
    id: 3
    uuid: 9a3a1b9b-8a18-baf6-00b3-a6e5e11fd8b6
    lsn: 0
    upstream:
      status: follow
      idle: 0.18620999999985
      peer: replicator@127.0.0.1:3303
      lag: 0.00012516975402832
    name: instance003
    downstream:
      status: follow
      idle: 0.19718099999955
      vclock: {2: 1, 1: 32}
      lag: 0.00051403045654297
...
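The same check can be automated once the output above is available as a data structure. The sketch below assumes the result of box.info.replication has already been loaded into a Python dictionary with the same shape as the YAML shown (how it is fetched is out of scope here), and `disconnected_upstreams` is a hypothetical helper name.

```python
def disconnected_upstreams(replication):
    """Return (name, message) pairs for peers whose upstream is disconnected."""
    problems = []
    for entry in replication.values():
        upstream = entry.get("upstream")
        if upstream is None:
            # The instance's own entry has no upstream section.
            continue
        if upstream.get("status") == "disconnected":
            problems.append((entry.get("name"), upstream.get("message")))
    return problems


# Trimmed copy of the output above; fields not needed for the check are omitted.
replication_info = {
    1: {
        "id": 1,
        "name": "instance002",
        "upstream": {
            "peer": "replicator@127.0.0.1:3302",
            "status": "disconnected",
            "message": "connect, called on fd 20, aka 127.0.0.1:62575: Connection refused",
        },
    },
    2: {"id": 2, "name": "instance001"},
    3: {
        "id": 3,
        "name": "instance003",
        "upstream": {"peer": "replicator@127.0.0.1:3303", "status": "follow"},
    },
}

failed = disconnected_upstreams(replication_info)
print(failed)
```

Only the `disconnected` status is flagged because it is the one this section describes; a production check would likely need to treat other non-healthy statuses as well.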

To learn how to perform manual failover in a master-replica set, see the Performing manual failover section.

In a master-replica configuration with automated failover, a new master should be elected automatically.
