Saturday, December 31, 2016

Zabbix with Galera/Percona Hourly Spikes. SOLVED!

The relief I have in solving this problem is indescribable. Utterly indescribable!
This had plagued us for months, and in such issues everyone in the team goes on the edge.

First: Which SPIKES do I speak of?
Open Zabbix server's "Zabbix internal process busy %" graph.
For us, this graph had crazy spikes every hour when the slowdown happened.

Why are the SPIKES caused?
This is Zabbix's process which converts history to trend data every hour.

Our Setup
Zabbix 2.4
Percona (with MySQL) 5.6 DB - Geo-Distributed/WAN Cluster across 2 sites
Current DB size: around 200 GB

Enter the SPIKES
Every hour, on the hour, we get a serious slowdown. Zabbix UI freezes. Graphs start going gray.
We looked at the resources: IO was high which CAUSED CPU to be high. Everything else looked fine. We moved one DB node to SSD storage. Things calmed down with resources BUT spike was still present.

Zabbix server logs show "duplicate entry" errors
We looked at Zabbix logs and saw some "Duplicate Entry" errors.
When I researched this, it indicated database corruption, which, in our case, was due to two Zabbix server processes running connected to same DB. Someone had accidentally started the 2nd (failover) server. We stopped the 2nd server and cleaned up the DB; dropped all history and trend data.

Spikes went away, but we were not sure whether it was due to DB size or really it was the corruption.

We waited and let the DB grow, and lo and behold, Spikes came back!

Now we had at least eliminated the DB corruption issue.

We started by tweaking Zabbix knobs first, nothing worked.

Replication Test
Finally, we started looking at the Galera replication. The way we determined this was "replication" related was by a simple test: shut down all DB nodes except for the one Zabbix connects to.
Once we did this, NO SPIKES!
Test 2: Bring up a DB node in same Site - NO SPIKES!
Test 3: Bring up a DB node in remote Site - SPIKES!

Galera replication is not handling our intra-site network too well.
Let's start looking at replication and flow control options.

Let's revisit Philip's presentation

All of this was already set.
Further research yields

gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0

We set the nodes to use these FC settings. No help. SUPER FRUSTRATION!
Ask in Galera forums, no help (or they are too busy or on holidays!)

So back to reading on Galera parameters.
I do a bunch of reading and find these. This indicates how many transactions are waiting to be processed in the queue. BINGO!

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_recv%';
| Variable_name              | Value      |
| wsrep_local_recv_queue     | 1721       |
| wsrep_local_recv_queue_max | 1721       |
| wsrep_local_recv_queue_min | 0          |
| wsrep_local_recv_queue_avg | 169.347046 |
4 rows in set (0.00 sec)

Our fc_limit value was too low! We were sometimes passing 3000 transactions!
So obviously the next thing I did was to set fc_limit to a number higher than this, and DONE!

To check how node has caught up, the wsrep_local_recv_queue should fall back to 0.
In our case, that happens within few seconds.

This was a great NEW YEAR'S PRESENT for me.
And if any of you are using Percona/Galera with any application this information might help!

No comments:

Post a Comment