Troubleshooting guide

For this guide, you need to install the Tarantool stat module:

$ sudo yum install tarantool-stat
$ # -- OR --
$ sudo apt-get install tarantool-stat

Problem: INSERT/UPDATE requests result in an ER_MEMORY_ISSUE error

Possible reasons

  • Lack of RAM (the arena_used_ratio and quota_used_ratio parameters in the box.slab.info() report are getting close to 100%).

    To check these parameters, say:

    $ # attaching to a Tarantool instance
    $ tarantoolctl enter <instance_name>
    $ # -- OR --
    $ tarantoolctl connect <URI>
    
    -- requesting arena_used_ratio value
    tarantool> require('stat').stat()['slab.arena_used_ratio']
    
    -- requesting quota_used_ratio value
    tarantool> require('stat').stat()['slab.quota_used_ratio']
    

Solution

Try any of the following measures:

  • In Tarantool’s instance file, increase the value of box.cfg{memtx_memory} (if memory resources are available); a sketch of such a change is given after this list.

    Tarantool needs to be restarted to change this parameter. The Tarantool server will be unavailable while restarting from .xlog files, unless you restart it using hot standby mode. In the latter case, nearly 100% server availability is guaranteed.

  • Clean up the database.

  • Check the indicators of memory fragmentation:

    -- requesting quota_used_ratio value
    tarantool> require('stat').stat()['slab.quota_used_ratio']
    
    -- requesting items_used_ratio value
    tarantool> require('stat').stat()['slab.items_used_ratio']
    

    In case of heavy memory fragmentation (quota_used_ratio is getting close to 100%, items_used_ratio is about 50%), we recommend restarting Tarantool in the hot standby mode.
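
As an illustration of the first measure above, here is a minimal sketch of an instance file change (the file path and the 2 GB figure are only assumptions; size the arena to your actual data set):

-- hypothetical instance file, e.g. /etc/tarantool/instances.available/my_app.lua
box.cfg{
    memtx_memory = 2 * 1024 * 1024 * 1024,  -- raise the memtx arena to 2 GB
    -- ... the rest of your existing configuration options ...
}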

Problem: Tarantool generates too heavy a CPU load

Possible reasons

The transaction processor thread consumes over 60% CPU.
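
One way to confirm this is to look at per-thread CPU usage (a sketch, assuming a single Tarantool instance is running on the host; otherwise substitute the PID of the instance in question):

$ # per-thread CPU view of the Tarantool process
$ top -H -p $(pidof tarantool)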

Solution

Attach to the Tarantool instance with the tarantoolctl utility, analyze the query statistics with box.stat(), and spot the CPU consumption leader. The following commands can help:

$ # attaching to a Tarantool instance
$ tarantoolctl enter <instance_name>
$ # -- OR --
$ tarantoolctl connect <URI>
-- checking the RPS of calling stored procedures
tarantool> require('stat').stat()['stat.op.call.rps']

The critical RPS value is 75 000; for a rich Lua application (a Lua module of 200+ lines), it drops to 10 000–20 000.

-- checking RPS per query type
tarantool> require('stat').stat()['stat.op.<query_type>.rps']

The critical RPS value for SELECT/INSERT/UPDATE/DELETE requests is 100 000.

If the load is mostly generated by SELECT requests, we recommend adding a slave server and letting it process part of the queries.

If the load is mostly generated by INSERT/UPDATE/DELETE requests, we recommend sharding the database.

Problem: Query processing times out

Possible reasons

Note

All the reasons discussed here can be identified by messages in Tarantool’s log file that start with the words 'Too long...'.
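
For example, to spot such messages (the log file path is an assumption; substitute the log of your instance):

$ grep -i "too long" /var/log/tarantool/<instance_name>.log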

  1. Both fast and slow queries are processed within a single connection, so the readahead buffer is cluttered with slow queries.

    Solution

    Try either of the following measures:

    • Increase the readahead buffer size (box.cfg{readahead} parameter).

      This parameter can be changed on the fly, so you don’t need to restart Tarantool. Attach to the Tarantool instance with the tarantoolctl utility and call box.cfg{} with a new readahead value:

      $ # attaching to a Tarantool instance
      $ tarantoolctl enter <instance_name>
      $ # -- OR --
      $ tarantoolctl connect <URI>
      
      -- changing the readahead value
      tarantool> box.cfg{readahead = 10 * 1024 * 1024}
      

      Example: Given 1000 RPS, 1 Kbyte of query size, and 10 seconds of maximal query processing time, the minimal readahead buffer size must be 10 Mbytes.

    • At the business-logic level, split the processing of fast and slow queries between different connections.

  2. Slow disks.

    Solution

    Check disk performance (use the iostat, iotop, or strace utilities to check the iowait parameter) and try to put .xlog files and snapshot files on different physical disks (i.e. use different locations for wal_dir and memtx_dir), as shown in the sketch below.
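
    A sketch of both steps (the device view and the directory paths below are only assumptions for illustration):

    $ # watching per-device utilization; a consistently high %iowait or %util points at a disk bottleneck
    $ iostat -x 1

    -- in the instance file: keep write-ahead logs and snapshots on different physical disks
    box.cfg{
        wal_dir   = '/mnt/disk1/tarantool/wal',
        memtx_dir = '/mnt/disk2/tarantool/snapshots',
    }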

Problem: Replication “lag” and “idle” contain negative values

This is about the box.info.replication.(upstream.)lag and box.info.replication.(upstream.)idle values in the box.info.replication section.

Possible reasons

The operating system clocks on the hosts are not synchronized, or the NTP server is faulty.

Solution

Check NTP server settings.
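
For example, depending on the distribution, either of the following shows whether the clock is synchronized and which time servers are in use:

$ timedatectl status
$ # -- OR --
$ ntpq -p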

If you find no problems with the NTP server, then no action is required. Lag is calculated using the operating system clocks of two different machines; if they get out of sync, the remote master’s clock can appear to be consistently behind the local instance’s clock.

Problem: Replication statistics differ on replicas within a replica set

This is about a replica set that consists of one master and several replicas. In a replica set of this type, values in the box.info.replication section, such as box.info.replication.lsn, come from the master and must be the same on all replicas within the replica set. The problem is that they differ.
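
To compare the values, attach to each replica in turn and inspect the replication section (a sketch):

$ # attaching to a Tarantool instance
$ tarantoolctl enter <instance_name>

-- the reported values must match across all replicas of the same master
tarantool> box.info.replication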

Possible reasons

Replication is broken.

Solution

Restart replication.
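
One way to do this from the console is the same sequence as shown for the master-master case below (a sketch; run it on the affected replica):

-- saving the current replication configuration, resetting it, and restoring it
tarantool> original_value = box.cfg.replication
tarantool> box.cfg{replication = {}}
tarantool> box.cfg{replication = original_value}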

Problem: Master-master replication is stopped

This is about box.info.replication(.upstream).status = stopped.

Possible reasons

In a master-master replica set of two Tarantool instances, one of the masters has tried to perform an action already performed by the other server, for example, to re-insert a tuple with the same unique key. This would cause an error message like 'Duplicate key exists in unique index 'primary' in space <space_name>'.

Solution

Restart replication with the following commands (at each master instance):

$ # attaching to a Tarantool instance
$ tarantoolctl enter <instance_name>
$ # -- OR --
$ tarantoolctl connect <URI>
-- restarting replication
tarantool> original_value = box.cfg.replication
tarantool> box.cfg{replication={}}
tarantool> box.cfg{replication=original_value}

We also recommend using text primary keys or setting up master-slave replication.

Problem: Tarantool works much slower than before

Possible reasons

Inefficient memory usage (RAM is cluttered with a huge number of unused objects).

Solution

Call the Lua function collectgarbage('count') and measure its execution time with the Tarantool functions clock.bench() or clock.proc().

Example of calculating memory usage statistics:

$ # attaching to a Tarantool instance
$ tarantoolctl enter <instance_name>
$ # -- OR --
$ tarantoolctl connect <URI>
-- loading Tarantool's "clock" module with time-related routines
tarantool> clock = require('clock')
-- starting the timer (global variables are used here because local variables
-- do not survive between separate lines in the interactive console)
tarantool> b = clock.proc()
-- launching garbage collection
tarantool> c = collectgarbage('count')
-- stopping the timer after garbage collection is completed
tarantool> return c, clock.proc() - b

If the returned clock.proc() value is greater than 0.001, this may be an indicator of inefficient memory usage (no active measures are required, but we recommend optimizing your Tarantool application code).

If the value is greater than 0.01, your application definitely needs thorough code analysis aimed at optimizing memory usage.