Saturday, December 31, 2016

Zabbix with Galera/Percona Hourly Spikes. SOLVED!

The relief I have in solving this problem is indescribable. Utterly indescribable!
This had plagued us for months, and issues like this put everyone on the team on edge.

First: Which SPIKES do I speak of?
Open Zabbix server's "Zabbix internal process busy %" graph.
For us, this graph had crazy spikes every hour when the slowdown happened.

What causes the SPIKES?
This is the Zabbix process that converts history to trend data every hour.

Our Setup
Zabbix 2.4
Percona (with MySQL) 5.6 DB - Geo-Distributed/WAN Cluster across 2 sites
Current DB size: around 200 GB

Enter the SPIKES
Every hour, on the hour, we get a serious slowdown. Zabbix UI freezes. Graphs start going gray.
We looked at the resources: IO was high, which in turn DROVE CPU high. Everything else looked fine. We moved one DB node to SSD storage. Resource usage calmed down, BUT the spike was still present.

Zabbix server logs show "duplicate entry" errors
We looked at Zabbix logs and saw some "Duplicate Entry" errors.
When I researched this, it indicated database corruption, which in our case was due to two Zabbix server processes running against the same DB. Someone had accidentally started the 2nd (failover) server. We stopped the 2nd server and cleaned up the DB: dropped all history and trend data.

The spikes went away, but we were not sure whether that was due to the now-smaller DB size or really the corruption.

We waited and let the DB grow, and lo and behold, Spikes came back!

Now we had at least eliminated the DB corruption issue.

We started by tweaking Zabbix knobs first; nothing worked.

Replication Test
Finally, we started looking at the Galera replication. The way we determined this was "replication" related was by a simple test: shut down all DB nodes except for the one Zabbix connects to.
Once we did this, NO SPIKES!
Test 2: Bring up a DB node in same Site - NO SPIKES!
Test 3: Bring up a DB node in remote Site - SPIKES!

Galera replication was not handling our inter-site WAN link too well.
Let's start looking at replication and flow control options.

Let's revisit Philip's presentation

All of this was already set.
Further research yields

gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0
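As a sketch of how we applied settings like these: Galera provider options go into wsrep_provider_options in my.cnf. A temp file stands in for the real config here so this is safe to run; the [mysqld] section placement and quoting follow the usual Percona layout.

```shell
# Sketch: applying the flow-control settings from the post. A temp file stands in
# for my.cnf; the [mysqld] section and quoting follow the usual Percona layout.
cnf=$(mktemp)
cat > "$cnf" <<'EOF'
[mysqld]
wsrep_provider_options = "gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0"
EOF
grep -c 'fc_limit=500' "$cnf"
```

Note that gcs.fc_limit can also be changed at runtime with SET GLOBAL wsrep_provider_options='gcs.fc_limit=500'; which avoids a restart while testing.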

We set the nodes to use these FC settings. No help. SUPER FRUSTRATION!
We asked in the Galera forums; no help (or they were too busy or on holidays!).

So back to reading about Galera parameters.
I did a bunch of reading and found the status variables below. They indicate how many transactions are waiting to be processed in the receive queue. BINGO!

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_recv%';
| Variable_name              | Value      |
| wsrep_local_recv_queue     | 1721       |
| wsrep_local_recv_queue_max | 1721       |
| wsrep_local_recv_queue_min | 0          |
| wsrep_local_recv_queue_avg | 169.347046 |
4 rows in set (0.00 sec)

Our fc_limit value was too low! We were sometimes queueing over 3000 transactions!
So obviously the next thing I did was set fc_limit to a number higher than that peak, and DONE!

To check that a node has caught up, wsrep_local_recv_queue should fall back to 0.
In our case, that happens within a few seconds.
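To make the sizing concrete, here is a small sketch of the arithmetic: take the observed peak of wsrep_local_recv_queue_max and pick a gcs.fc_limit above it. The 2x headroom factor is my own rule of thumb, not a Galera recommendation, and the sample line is captured output.

```shell
# Sketch of the sizing arithmetic: read the peak receive-queue depth and pick a
# gcs.fc_limit above it. The sample line is captured output; in practice feed it
# from: mysql -N -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue_max'"
sample="wsrep_local_recv_queue_max 1721"
peak=$(printf '%s\n' "$sample" | awk '{print $2}')
fc_limit=$(( peak * 2 ))   # 2x headroom is a rule of thumb, not a Galera default
echo "suggested gcs.fc_limit >= $fc_limit"
```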

This was a great NEW YEAR'S PRESENT for me.
And if any of you are using Percona/Galera with any application this information might help!

Wednesday, October 12, 2016

Pacemaker - pcs cluster auth does not work on CentOS 6.x

Recently we were trying to fix an issue with our pacemaker/cman cluster on CentOS 6.7.
Regardless of everything we tried, pcs cluster auth was not working.

Started digging and found PAM blocking the auth in /var/log/secure.
Upon investigation we started looking at this file


And commented out this line
auth            required onerr=fail item=group sense=allow file=/etc/

This fixed the problem, but since our PAM configs are pushed via Puppet, the change was being overwritten on every Puppet run, so I kept looking and found this

# in this file add haclient
vim /etc/

Check auth like this
pcs cluster auth nodeA nodeB -u hacluster

The assumption is that you already have a password set for the hacluster user.
If not, set it as root:
passwd hacluster
Then restart the pcsd service.

Sunday, May 22, 2016

Zabbix proxy force configuration update

From Zabbix proxy shell run:

zabbix_proxy -R config_cache_reload

WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor

The reason, as we found, was one of the following:

Improper permissions on /var/lib/mysql
The sstuser account doesn't have the proper privileges
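For the first cause, here is a sketch of the kind of datadir check involved. A temp directory stands in for /var/lib/mysql so this is safe to run; the grants shown in the comment are the commonly documented Percona XtraBackup SST privileges, but check your PXC version's docs.

```shell
# Stand-in check for the SST donor error: the datadir should be owned by mysql
# and restricted (mode 700). A temp dir substitutes for /var/lib/mysql here.
datadir=$(mktemp -d)
chmod 700 "$datadir"
stat -c 'mode=%a' "$datadir"
# On the real host:
#   chown -R mysql:mysql /var/lib/mysql
# And for the sstuser (commonly documented XtraBackup SST grants):
#   GRANT RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
```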

Zabbix history and trends cleanup

Shut down the Zabbix server and frontend connections to the DB

If using MySQL, make sure innodb_stats_auto_recalc is 1; otherwise you will have to ANALYZE all these tables afterwards. The default value is 1 (ON).

CREATE TABLE history_new LIKE history;
CREATE TABLE history_log_new LIKE history_log;
CREATE TABLE history_str_new LIKE history_str;
CREATE TABLE history_text_new LIKE history_text;
CREATE TABLE history_uint_new LIKE history_uint;
CREATE TABLE trends_new LIKE trends;
CREATE TABLE trends_uint_new LIKE trends_uint;

ALTER TABLE trends RENAME trends_old;
ALTER TABLE trends_new RENAME trends;
ALTER TABLE trends_uint RENAME trends_uint_old;
ALTER TABLE trends_uint_new RENAME trends_uint;
ALTER TABLE history RENAME history_old;
ALTER TABLE history_new RENAME history;
ALTER TABLE history_log RENAME history_log_old;
ALTER TABLE history_log_new RENAME history_log;
ALTER TABLE history_str RENAME history_str_old;
ALTER TABLE history_str_new RENAME history_str;
ALTER TABLE history_text RENAME history_text_old;
ALTER TABLE history_text_new RENAME history_text;
ALTER TABLE history_uint RENAME history_uint_old;
ALTER TABLE history_uint_new RENAME history_uint;

DROP TABLE trends_old;
DROP TABLE trends_uint_old;
DROP TABLE history_old;
DROP TABLE history_log_old;
DROP TABLE history_str_old;
DROP TABLE history_text_old;
DROP TABLE history_uint_old;

delete from events;

That's all.

Duplicate Entry error in Zabbix server logs

We were seeing the following error

22199:20150613:133805.639 [Z3005] query failed: [1062] Duplicate entry '1743313' for key 'PRIMARY' [insert into events (eventid,source,object,objectid,clock,ns,value) values (1743313,3,0,55456,1460569085,540384532,0);]

The main reason for this error is that two Zabbix servers were connected to the same DB server. We thought this might work as an HA setup, but it doesn't, even though the second Zabbix server is doing nothing.

You might also have to run this query

delete from events;

Be careful, as it deletes all events.
We did not care about this in a load test, so we were OK with it.

Thursday, August 13, 2015

ERROR listener failed: zbx_tcp_listen() fatal error: unable to serve on any address [[-]:10051]

A load-testing scenario with an error similar to the post below, but a different reason.

Zabbix runs as part of Pacemaker cluster.

There were 2 reasons:

1. MySQL did not allow enough connections
Set this parameter in my.cnf: max_connections = 512

2. The Zabbix service is controlled by Pacemaker, so managing it by hand requires maintenance mode. I saw articles on how to do it with the crm command, but that is obsolete. So here is how with pcs:

pcs property set maintenance-mode=true
pcs property set maintenance-mode=false

Once it is in maintenance mode, it can be controlled independently of Pacemaker.

Thursday, July 30, 2015

mysql-proxy not running via puppet service resource type

Had an issue with service resource for mysql-proxy.
When executing the manifest, the service would not start. If I started it manually, the service ran. If I used an exec resource, the service ran.

It seems the init script is not LSB compliant.

Modify it as follows to get it working (hasstatus => false tells Puppet the init script has no working status action, so the custom status command is used instead):
service { 'mysql-proxy':
  ensure     => running,
  status     => 'ps afx | grep -i mysql-proxy | grep -v grep',
  hasstatus  => false,
  hasrestart => true,
}
Wednesday, July 15, 2015

zabbix-server does not start - zbx_tcp_listen() fatal error

Zabbix 2.4.x
CentOS 6.6

Although I have rarely seen something like this, my deployment of Zabbix via Puppet caused a very strange issue where the Zabbix service would not start.

Error in logs
listener failed: zbx_tcp_listen() fatal error: unable to serve on any address [[-]:10051]

Service status and when trying to start
[root@abc-zabserver-b zabbix]# service zabbix-server status
zabbix_server is stopped
[root@abc-zabserver-b zabbix]# service zabbix-server start
Starting Zabbix server:                                    [  OK  ]
[root@abc-zabserver-b zabbix]# service zabbix-server status
zabbix_server is stopped

Process is running (sometimes shows multiple processes running)
[root@abc-zabserver-b zabbix]# ps afx | grep -i zabbix
 3852 pts/1    S+     0:00  |       \_ grep -i zabbix
 2150 ?        S      0:00 zabbix_server -c /etc/zabbix/zabbix_server.conf

But service is still stopped
[root@abc-zabserver-b zabbix]# service zabbix-server status
zabbix_server is stopped

If I kill the process(es) then Zabbix service comes up fine.
When I was deploying Zabbix with Puppet I was using
ensure => installed
instead of
ensure => '2.4.1-5.el6' (or some other specific version)

Basically my Zabbix config file was still from an older version, and it didn't play well with the newer Zabbix that became available through the repos.

I ended up refreshing the config file.

Monday, July 13, 2015

Decrease timeout for Zabbix OK blinker

version: Zabbix Server 2.4

The default time for the OK and status-change trigger blinking is 30 minutes, which means the OK keeps blinking on the screen for that long.

There are two ways to change this:

Go to Administration > General > Trigger displaying options (drop down on right)
Change values of following as desired

  • Display OK triggers for
  • On status change triggers blink for

The other way is to directly change it in DB (MySQL in this case):

mysql -u zabbix -p'<PASSWORD>' -e 'UPDATE config SET `ok_period`=60, `blink_period`=60' zabbixserverdb

(Note there is no space between -p and the password; with a space, mysql prompts for a password and treats the argument as a database name.)

This is more helpful for automation.