Рейтинг@Mail.ru

2. Internals

2. Internals

2.1. Tarantool’s binary protocol

Tarantool’s binary protocol is a binary request/response protocol.

2.1.1. Notation in diagrams

0    X
+----+
|    | - X bytes
+----+
 TYPE - type of MsgPack value (if it is a MsgPack object)

+====+
|    | - Variable size MsgPack object
+====+
 TYPE - type of MsgPack value

+~~~~+
|    | - Variable size MsgPack Array/Map
+~~~~+
 TYPE - type of MsgPack value

MsgPack data types:

  • MP_INT - Integer
  • MP_MAP - Map
  • MP_ARR - Array
  • MP_STRING - String
  • MP_FIXSTR - Fixed size string
  • MP_OBJECT - Any MsgPack object
  • MP_BIN - MsgPack binary format

2.1.2. Greeting packet

TARANTOOL'S GREETING:

0                                     63
+--------------------------------------+
|                                      |
| Tarantool Greeting (server version)  |
|               64 bytes               |
+---------------------+----------------+
|                     |                |
| BASE64 encoded SALT |      NULL      |
|      44 bytes       |                |
+---------------------+----------------+
64                  107              127

The server instance begins the dialogue by sending a fixed-size (128-byte) text greeting to the client. The greeting always contains two 64-byte lines of ASCII text, each line ending with a newline character (‘\n’). The first line contains the instance version and protocol type. The second line contains up to 44 bytes of base64-encoded random string, to use in the authentication packet, and ends with up to 23 spaces.

2.1.3. Unified packet structure

Once a greeting is read, the protocol becomes pure request/response and features a complete access to Tarantool functionality, including:

  • request multiplexing, e.g. ability to asynchronously issue multiple requests via the same connection
  • response format that supports zero-copy writes

For data structuring and encoding, the protocol uses msgpack data format, see http://msgpack.org

The Tarantool protocol mandates use of a few integer constants serving as keys in maps used in the protocol. These constants are defined in src/box/iproto_constants.h

We list them here too:

-- user keys
<code>          ::= 0x00
<sync>          ::= 0x01
<schema_id>     ::= 0x05
<space_id>      ::= 0x10
<index_id>      ::= 0x11
<limit>         ::= 0x12
<offset>        ::= 0x13
<iterator>      ::= 0x14
<key>           ::= 0x20
<tuple>         ::= 0x21
<function_name> ::= 0x22
<username>      ::= 0x23
<expression>    ::= 0x27
<ops>           ::= 0x28
<data>          ::= 0x30
<error>         ::= 0x31
-- -- Value for <code> key in request can be:
-- User command codes
<select>  ::= 0x01
<insert>  ::= 0x02
<replace> ::= 0x03
<update>  ::= 0x04
<delete>  ::= 0x05
<call_16> ::= 0x06
<auth>    ::= 0x07
<eval>    ::= 0x08
<upsert>  ::= 0x09
<call>    ::= 0x0a
-- Admin command codes
<ping>    ::= 0x40

-- -- Value for <code> key in response can be:
<OK>      ::= 0x00
<ERROR>   ::= 0x8XXX

Both <header> and <body> are msgpack maps:

Request/Response:

0        5
+--------+ +============+ +===================================+
| BODY + | |            | |                                   |
| HEADER | |   HEADER   | |               BODY                |
|  SIZE  | |            | |                                   |
+--------+ +============+ +===================================+
  MP_INT       MP_MAP                     MP_MAP
UNIFIED HEADER:

+================+================+=====================+
|                |                |                     |
|   0x00: CODE   |   0x01: SYNC   |    0x05: SCHEMA_ID  |
| MP_INT: MP_INT | MP_INT: MP_INT |  MP_INT: MP_INT     |
|                |                |                     |
+================+================+=====================+
                          MP_MAP

They only differ in the allowed set of keys and values. The key defines the type of value that follows. If a body has no keys, the entire msgpack map for the body may be missing. Such is the case, for example, for a <ping> request. schema_id may be absent in the request’s header, meaning that there will be no version checking, but it must be present in the response. If schema_id is sent in the header, then it will be checked.

2.1.4. Authentication

When a client connects to the server instance, the instance responds with a 128-byte text greeting message. Part of the greeting is base-64 encoded session salt - a random string which can be used for authentication. The length of decoded salt (44 bytes) exceeds the amount necessary to sign the authentication message (first 20 bytes). An excess is reserved for future authentication schemas.

PREPARE SCRAMBLE:

    LEN(ENCODED_SALT) = 44;
    LEN(SCRAMBLE)     = 20;

prepare 'chap-sha1' scramble:

    salt = base64_decode(encoded_salt);
    step_1 = sha1(password);
    step_2 = sha1(step_1);
    step_3 = sha1(salt, step_2);
    scramble = xor(step_1, step_3);
    return scramble;

AUTHORIZATION BODY: CODE = 0x07

+==================+====================================+
|                  |        +-------------+-----------+ |
|  (KEY)           | (TUPLE)|  len == 9   | len == 20 | |
|   0x23:USERNAME  |   0x21:| "chap-sha1" |  SCRAMBLE | |
| MP_INT:MP_STRING | MP_INT:|  MP_STRING  |  MP_BIN   | |
|                  |        +-------------+-----------+ |
|                  |                   MP_ARRAY         |
+==================+====================================+
                        MP_MAP

<key> holds the user name. <tuple> must be an array of 2 fields: authentication mechanism (“chap-sha1” is the only supported mechanism right now) and password, encrypted according to the specified mechanism. Authentication in Tarantool is optional, if no authentication is performed, session user is ‘guest’. The instance responds to authentication packet with a standard response with 0 tuples.

2.1.5. Requests

  • SELECT: CODE - 0x01 Find tuples matching the search pattern
SELECT BODY:

+==================+==================+==================+
|                  |                  |                  |
|   0x10: SPACE_ID |   0x11: INDEX_ID |   0x12: LIMIT    |
| MP_INT: MP_INT   | MP_INT: MP_INT   | MP_INT: MP_INT   |
|                  |                  |                  |
+==================+==================+==================+
|                  |                  |                  |
|   0x13: OFFSET   |   0x14: ITERATOR |   0x20: KEY      |
| MP_INT: MP_INT   | MP_INT: MP_INT   | MP_INT: MP_ARRAY |
|                  |                  |                  |
+==================+==================+==================+
                          MP_MAP
  • INSERT: CODE - 0x02 Inserts tuple into the space, if no tuple with same unique keys exists. Otherwise throw duplicate key error.
  • REPLACE: CODE - 0x03 Insert a tuple into the space or replace an existing one.
INSERT/REPLACE BODY:

+==================+==================+
|                  |                  |
|   0x10: SPACE_ID |   0x21: TUPLE    |
| MP_INT: MP_INT   | MP_INT: MP_ARRAY |
|                  |                  |
+==================+==================+
                 MP_MAP
  • UPDATE: CODE - 0x04 Update a tuple
UPDATE BODY:

+==================+=======================+
|                  |                       |
|   0x10: SPACE_ID |   0x11: INDEX_ID      |
| MP_INT: MP_INT   | MP_INT: MP_INT        |
|                  |                       |
+==================+=======================+
|                  |          +~~~~~~~~~~+ |
|                  |          |          | |
|                  | (TUPLE)  |    OP    | |
|   0x20: KEY      |    0x21: |          | |
| MP_INT: MP_ARRAY |  MP_INT: +~~~~~~~~~~+ |
|                  |            MP_ARRAY   |
+==================+=======================+
                 MP_MAP
OP:
    Works only for integer fields:
    * Addition    OP = '+' . space[key][field_no] += argument
    * Subtraction OP = '-' . space[key][field_no] -= argument
    * Bitwise AND OP = '&' . space[key][field_no] &= argument
    * Bitwise XOR OP = '^' . space[key][field_no] ^= argument
    * Bitwise OR  OP = '|' . space[key][field_no] |= argument
    Works on any fields:
    * Delete      OP = '#'
      delete <argument> fields starting
      from <field_no> in the space[<key>]

0           2
+-----------+==========+==========+
|           |          |          |
|    OP     | FIELD_NO | ARGUMENT |
| MP_FIXSTR |  MP_INT  |  MP_INT  |
|           |          |          |
+-----------+==========+==========+
              MP_ARRAY
    * Insert      OP = '!'
      insert <argument> before <field_no>
    * Assign      OP = '='
      assign <argument> to field <field_no>.
      will extend the tuple if <field_no> == <max_field_no> + 1

0           2
+-----------+==========+===========+
|           |          |           |
|    OP     | FIELD_NO | ARGUMENT  |
| MP_FIXSTR |  MP_INT  | MP_OBJECT |
|           |          |           |
+-----------+==========+===========+
              MP_ARRAY

    Works on string fields:
    * Splice      OP = ':'
      take the string from space[key][field_no] and
      substitute <offset> bytes from <position> with <argument>
0           2
+-----------+==========+==========+========+==========+
|           |          |          |        |          |
|    ':'    | FIELD_NO | POSITION | OFFSET | ARGUMENT |
| MP_FIXSTR |  MP_INT  |  MP_INT  | MP_INT |  MP_STR  |
|           |          |          |        |          |
+-----------+==========+==========+========+==========+
                         MP_ARRAY

It is an error to specify an argument of a type that differs from the expected type.

  • DELETE: CODE - 0x05 Delete a tuple
DELETE BODY:

+==================+==================+==================+
|                  |                  |                  |
|   0x10: SPACE_ID |   0x11: INDEX_ID |   0x20: KEY      |
| MP_INT: MP_INT   | MP_INT: MP_INT   | MP_INT: MP_ARRAY |
|                  |                  |                  |
+==================+==================+==================+
                          MP_MAP
  • CALL_16: CODE - 0x06 Call a stored function, returning an array of tuples. This is deprecated; CALL (0x0a) is recommended instead.
CALL_16 BODY:

+=======================+==================+
|                       |                  |
|   0x22: FUNCTION_NAME |   0x21: TUPLE    |
| MP_INT: MP_STRING     | MP_INT: MP_ARRAY |
|                       |                  |
+=======================+==================+
                    MP_MAP
  • EVAL: CODE - 0x08 Evaulate Lua expression
EVAL BODY:

+=======================+==================+
|                       |                  |
|   0x27: EXPRESSION    |   0x21: TUPLE    |
| MP_INT: MP_STRING     | MP_INT: MP_ARRAY |
|                       |                  |
+=======================+==================+
                    MP_MAP
  • UPSERT: CODE - 0x09 Update tuple if it would be found elsewhere try to insert tuple. Always use primary index for key.
UPSERT BODY:

+==================+==================+==========================+
|                  |                  |             +~~~~~~~~~~+ |
|                  |                  |             |          | |
|   0x10: SPACE_ID |   0x21: TUPLE    |       (OPS) |    OP    | |
| MP_INT: MP_INT   | MP_INT: MP_ARRAY |       0x28: |          | |
|                  |                  |     MP_INT: +~~~~~~~~~~+ |
|                  |                  |               MP_ARRAY   |
+==================+==================+==========================+
                                MP_MAP

Operations structure same as for UPDATE operation.
   0           2
+-----------+==========+==========+
|           |          |          |
|    OP     | FIELD_NO | ARGUMENT |
| MP_FIXSTR |  MP_INT  |  MP_INT  |
|           |          |          |
+-----------+==========+==========+
              MP_ARRAY

Supported operations:

'+' - add a value to a numeric field. If the filed is not numeric, it's
      changed to 0 first. If the field does not exist, the operation is
      skipped. There is no error in case of overflow either, the value
      simply wraps around in C style. The range of the integer is MsgPack:
      from -2^63 to 2^64-1
'-' - same as the previous, but subtract a value
'=' - assign a field to a value. The field must exist, if it does not exist,
      the operation is skipped.
'!' - insert a field. It's only possible to insert a field if this create no
      nil "gaps" between fields. E.g. it's possible to add a field between
      existing fields or as the last field of the tuple.
'#' - delete a field. If the field does not exist, the operation is skipped.
      It's not possible to change with update operations a part of the primary
      key (this is validated before performing upsert).
  • CALL: CODE - 0x0a Similar to CALL_16, but – like EVAL, CALL returns a list of values, unconverted
CALL BODY:

+=======================+==================+
|                       |                  |
|   0x22: FUNCTION_NAME |   0x21: TUPLE    |
| MP_INT: MP_STRING     | MP_INT: MP_ARRAY |
|                       |                  |
+=======================+==================+
                    MP_MAP

2.1.6. Response packet structure

We will show whole packets here:

OK:    LEN + HEADER + BODY

0      5                                          OPTIONAL
+------++================+================++===================+
|      ||                |                ||                   |
| BODY ||   0x00: 0x00   |   0x01: SYNC   ||   0x30: DATA      |
|HEADER|| MP_INT: MP_INT | MP_INT: MP_INT || MP_INT: MP_OBJECT |
| SIZE ||                |                ||                   |
+------++================+================++===================+
 MP_INT                MP_MAP                      MP_MAP

Set of tuples in the response <data> expects a msgpack array of tuples as value EVAL command returns arbitrary MP_ARRAY with arbitrary MsgPack values.

ERROR: LEN + HEADER + BODY

0      5
+------++================+================++===================+
|      ||                |                ||                   |
| BODY ||   0x00: 0x8XXX |   0x01: SYNC   ||   0x31: ERROR     |
|HEADER|| MP_INT: MP_INT | MP_INT: MP_INT || MP_INT: MP_STRING |
| SIZE ||                |                ||                   |
+------++================+================++===================+
 MP_INT                MP_MAP                      MP_MAP

Where 0xXXX is ERRCODE.

An error message is present in the response only if there is an error; <error> expects as value a msgpack string.

Convenience macros which define hexadecimal constants for return codes can be found in src/box/errcode.h

2.1.7. Replication packet structure

-- replication keys
<server_id>     ::= 0x02
<lsn>           ::= 0x03
<timestamp>     ::= 0x04
<server_uuid>   ::= 0x24
<cluster_uuid>  ::= 0x25
<vclock>        ::= 0x26
-- replication codes
<join>      ::= 0x41
<subscribe> ::= 0x42
JOIN:

In the beginning you must send JOIN
                         HEADER                          BODY
+================+================+===================++-------+
|                |                |    SERVER_UUID    ||       |
|   0x00: 0x41   |   0x01: SYNC   |   0x24: UUID      || EMPTY |
| MP_INT: MP_INT | MP_INT: MP_INT | MP_INT: MP_STRING ||       |
|                |                |                   ||       |
+================+================+===================++-------+
               MP_MAP                                   MP_MAP

Then instance, which we connect to, will send last SNAP file by, simply,
creating a number of INSERTs (with additional LSN and ServerID)
(don't reply). Then it'll send a vclock's MP_MAP and close a socket.

+================+================++============================+
|                |                ||        +~~~~~~~~~~~~~~~~~+ |
|                |                ||        |                 | |
|   0x00: 0x00   |   0x01: SYNC   ||   0x26:| SRV_ID: SRV_LSN | |
| MP_INT: MP_INT | MP_INT: MP_INT || MP_INT:| MP_INT: MP_INT  | |
|                |                ||        +~~~~~~~~~~~~~~~~~+ |
|                |                ||               MP_MAP       |
+================+================++============================+
               MP_MAP                      MP_MAP

SUBSCRIBE:

Then you must send SUBSCRIBE:

                              HEADER
+===================+===================+
|                   |                   |
|     0x00: 0x41    |    0x01: SYNC     |
|   MP_INT: MP_INT  |  MP_INT: MP_INT   |
|                   |                   |
+===================+===================+
|    SERVER_UUID    |    CLUSTER_UUID   |
|   0x24: UUID      |   0x25: UUID      |
| MP_INT: MP_STRING | MP_INT: MP_STRING |
|                   |                   |
+===================+===================+
                 MP_MAP

      BODY
+================+
|                |
|   0x26: VCLOCK |
| MP_INT: MP_INT |
|                |
+================+
      MP_MAP

Then you must process every query that'll came through other masters.
Every request between masters will have Additional LSN and SERVER_ID.

2.1.8. XLOG / SNAP

XLOG and SNAP files have nearly the same format. The header looks like:

<type>\n                  SNAP\n or XLOG\n
<version>\n               currently 0.13\n
Server: <server_uuid>\n   where UUID is a 36-byte string
VClock: <vclock_map>\n    e.g. {1: 0}\n
\n

After the file header come the data tuples. Tuples begin with a row marker 0xd5ba0bab and the last tuple may be followed by an EOF marker 0xd510aded. Thus, between the file header and the EOF marker, there may be data tuples that have this form:

0            3 4                                         17
+-------------+========+============+===========+=========+
|             |        |            |           |         |
| 0xd5ba0bab  | LENGTH | CRC32 PREV | CRC32 CUR | PADDING |
|             |        |            |           |         |
+-------------+========+============+===========+=========+
  MP_FIXEXT2    MP_INT     MP_INT       MP_INT      ---

+============+ +===================================+
|            | |                                   |
|   HEADER   | |                BODY               |
|            | |                                   |
+============+ +===================================+
    MP_MAP                     MP_MAP

See the example in the following section.

2.2. Data persistence and the WAL file format

To maintain data persistence, Tarantool writes each data change request (insert, update, delete, replace, upsert) into a write-ahead log (WAL) file in the wal_dir directory. A new WAL file is created for every rows_per_wal records. Each data change request gets assigned a continuously growing 64-bit log sequence number. The name of the WAL file is based on the log sequence number of the first record in the file, plus an extension .xlog.

Apart from a log sequence number and the data change request (formatted as in Tarantool’s binary protocol), each WAL record contains a header, some metadata, and then the data formatted according to msgpack rules. For example, this is what the WAL file looks like after the first INSERT request (“s:insert({1})”) for the sandbox database created in our “Getting started” exercises. On the left are the hexadecimal bytes that you would see with:

$ hexdump 00000000000000000000.xlog

and on the right are comments.

Hex dump of WAL file       Comment
--------------------       -------
58 4c 4f 47 0a             "XLOG\n"
30 2e 31 33 0a             "0.13\n" = version
53 65 72 76 65 72 3a 20    "Server: "
38 62 66 32 32 33 65 30 2d [Server UUID]\n
36 39 31 34 2d 34 62 35 35
2d 39 34 64 32 2d 64 32 62
36 64 30 39 62 30 31 39 36
0a
56 43 6c 6f 63 6b 3a 20    "Vclock: "
7b 7d                      "{}" = vclock value, initially blank
...                        (not shown = tuples for system spaces)
d5 ba 0b ab                Magic row marker always = 0xab0bbad5
19                         Length, not including length of header, = 25 bytes
00                           Record header: previous crc32
ce 8c 3e d6 70               Record header: current crc32
a7 cc 73 7f 00 00 66 39      Record header: padding
84                         msgpack code meaning "Map of 4 elements" follows
00 02                         element#1: tag=request type, value=0x02=IPROTO_INSERT
02 01                         element#2: tag=server id, value=0x01
03 04                         element#3: tag=lsn, value=0x04
04 cb 41 d4 e2 2f 62 fd d5 d4 element#4: tag=timestamp, value=an 8-byte "Float64"
82                         msgpack code meaning "map of 2 elements" follows
10 cd 02 00                   element#1: tag=space id, value=512, big byte first
21 91 01                      element#2: tag=tuple, value=1-element fixed array={1}

Tarantool processes requests atomically: a change is either accepted and recorded in the WAL, or discarded completely. Let’s clarify how this happens, using the REPLACE request as an example:

  1. The server instance attempts to locate the original tuple by primary key. If found, a reference to the tuple is retained for later use.
  2. The new tuple is validated. If for example it does not contain an indexed field, or it has an indexed field whose type does not match the type according to the index definition, the change is aborted.
  3. The new tuple replaces the old tuple in all existing indexes.
  4. A message is sent to WAL writer running in a separate thread, requesting that the change be recorded in the WAL. The instance switches to work on the next request until the write is acknowledged.
  5. On success, a confirmation is sent to the client. On failure, a rollback procedure is begun. During the rollback procedure, the transaction processor rolls back all changes to the database which occurred after the first failed change, from latest to oldest, up to the first failed change. All rolled back requests are aborted with ER_WAL_IO error. No new change is applied while rollback is in progress. When the rollback procedure is finished, the server restarts the processing pipeline.

One advantage of the described algorithm is that complete request pipelining is achieved, even for requests on the same value of the primary key. As a result, database performance doesn’t degrade even if all requests refer to the same key in the same space.

The transaction processor thread communicates with the WAL writer thread using asynchronous (yet reliable) messaging; the transaction processor thread, not being blocked on WAL tasks, continues to handle requests quickly even at high volumes of disk I/O. A response to a request is sent as soon as it is ready, even if there were earlier incomplete requests on the same connection. In particular, SELECT performance, even for SELECTs running on a connection packed with UPDATEs and DELETEs, remains unaffected by disk load.

The WAL writer employs a number of durability modes, as defined in configuration variable wal_mode. It is possible to turn the write-ahead log completely off, by setting wal_mode to none. Even without the write-ahead log it’s still possible to take a persistent copy of the entire data set with the box.snapshot() request.

An .xlog file always contains changes based on the primary key. Even if the client requested an update or delete using a secondary key, the record in the .xlog file will contain the primary key.

2.3. The snapshot file format

The format of a snapshot .snap file is nearly the same as the format of a WAL .xlog file. However, the snapshot header differs: it contains the instance’s global unique identifier and the snapshot file’s position in history, relative to earlier snapshot files. Also, the content differs: an .xlog file may contain records for any data-change requests (inserts, updates, upserts, and deletes), a .snap file may only contain records of inserts to memtx spaces.

Primarily, the .snap file’s records are ordered by space id. Therefore the records of system spaces – such as _schema, _space, _index, _func, _priv and _cluster – will be at the start of the .snap file, before the records of any spaces that were created by users.

Secondarily, the .snap file’s records are ordered by primary key within space id.

2.4. The recovery process

The recovery process begins when box.cfg{} happens for the first time after the Tarantool server instance starts.

The recovery process must recover the databases as of the moment when the instance was last shut down. For this it may use the latest snapshot file and any WAL files that were written after the snapshot. One complicating factor is that Tarantool has two engines – the memtx data must be reconstructed entirely from the snapshot and the WAL files, while the vinyl data will be on disk but might require updating around the time of a checkpoint. (When a snapshot happens, Tarantool tells the vinyl engine to make a checkpoint, and the snapshot operation is rolled back if anything goes wrong, so vinyl’s checkpoint is at least as fresh as the snapshot file.)

Step 1
Read the configuration parameters in the box.cfg{} request. Parameters which affect recovery may include work_dir, wal_dir, memtx_dir, vinyl_dir and force_recovery.
Step 2

Find the latest snapshot file. Use its data to reconstruct the in-memory databases. Instruct the vinyl engine to recover to the latest checkpoint.

There are actually two variations of the reconstruction procedure for memtx databases, depending on whether the recovery process is “default”.

If the recovery process is default (force_recovery is false), memtx can read data in the snapshot with all indexes disabled. First, all tuples are read into memory. Then, primary keys are built in bulk, taking advantage of the fact that the data is already sorted by primary key within each space.

If the recovery process is non-default (force_recovery is true), Tarantool performs additional checking. Indexes are enabled at the start, and tuples are added one by one. This means that any unique-key constraint violations will be caught, and any duplicates will be skipped. Normally there will be no constraint violations or duplicates, so these checks are only made if an error has occurred.

Step 3
Find the WAL file that was made at the time of, or after, the snapshot file. Read its log entries until the log-entry LSN is greater than the LSN of the snapshot, or greater than the LSN of the vinyl checkpoint. This is the recovery process’s “start position”; it matches the current state of the engines.
Step 4
Redo the log entries, from the start position to the end of the WAL. The engine skips a redo instruction if it is older than the engine’s checkpoint.
Step 5
For the memtx engine, re-create all secondary indexes.

2.5. Server startup with replication

In addition to the recovery process described above, the server must take additional steps and precautions if replication is enabled.

Once again the startup procedure is initiated by the box.cfg{} request. One of the box.cfg parameters may be replication that specifies replication source(-s). We will refer to this replica, which is starting up due to box.cfg, as the “local” replica to distinguish it from the other replicas in a replica set, which we will refer to as “distant” replicas.

If there is no snapshot .snap file and the ``replication`` parameter is empty:
then the local replica assumes it is an unreplicated “standalone” instance, or is the first replica of a new replica set. It will generate new UUIDs for itself and for the replica set. The replica UUID is stored in the _cluster space; the replica set UUID is stored in the _schema space. Since a snapshot contains all the data in all the spaces, that means the local replica’s snapshot will contain the replica UUID and the replica set UUID. Therefore, when the local replica restarts on later occasions, it will be able to recover these UUIDs when it reads the .snap file.

If there is no snapshot .snap file and the ``replication`` parameter is not empty and the ``_cluster`` space contains no other replica UUIDs:
then the local replica assumes it is not a standalone instance, but is not yet part of a replica set. It must now join the replica set. It will send its replica UUID to the first distant replica which is listed in replication and which will act as a master. This is called the “join request”. When a distant replica receives a join request, it will send back:

  1. the distant replica’s replica set UUID,
  2. the contents of the distant replica’s .snap file.
    When the local replica receives this information, it puts the replica set UUID in its _schema space, puts the distant replica’s UUID and connection information in its _cluster space, and makes a snapshot containing all the data sent by the distant replica. Then, if the local replica has data in its WAL .xlog files, it sends that data to the distant replica. The distant replica will receive this and update its own copy of the data, and add the local replica’s UUID to its _cluster space.

If there is no snapshot .snap file and the ``replication`` parameter is not empty and the ``_cluster`` space contains other replica UUIDs:
then the local replica assumes it is not a standalone instance, and is already part of a replica set. It will send its replica UUID and replica set UUID to all the distant replicas which are listed in replication. This is called the “on-connect handshake”. When a distant replica receives an on-connect handshake:

  1. the distant replica compares its own copy of the replica set UUID to the one in the on-connect handshake. If there is no match, then the handshake fails and the local replica will display an error.
  2. the distant replica looks for a record of the connecting instance in its _cluster space. If there is none, then the handshake fails.
    Otherwise the handshake is successful. The distant replica will read any new information from its own .snap and .xlog files, and send the new requests to the local replica.

In the end ... the local replica knows what replica set it belongs to, the distant replica knows that the local replica is a member of the replica set, and both replicas have the same database contents.

If there is a snapshot file and replication source is not empty:
first the local replica goes through the recovery process described in the previous section, using its own .snap and .xlog files. Then it sends a “subscribe” request to all the other replicas of the replica set. The subscribe request contains the server vector clock. The vector clock has a collection of pairs ‘server id, lsn’ for every replica in the _cluster system space. Each distant replica, upon receiving a subscribe request, will read its .xlog files’ requests and send them to the local replica if (lsn of .xlog file request) is greater than (lsn of the vector clock in the subscribe request). After all the other replicas of the replica set have responded to the local replica’s subscribe request, the replica startup is complete.

The following temporary limitations apply for version 1.7:

  • The URIs in the replication parameter should all be in the same order on all replicas. This is not mandatory but is an aid to consistency.
  • The replicas of a replica set should be started up at slightly different times. This is not mandatory but prevents a situation where each replica is waiting for the other replica to be ready.
  • The maximum number of entries in the _cluster space is 32. Tuples for out-of-date replicas are not automatically re-used, so if this 32-replica limit is reached, users may have to reorganize the _cluster space manually.