Wednesday, October 12, 2016

Pacemaker - pcs cluster auth does not work on CentOS 6.x

Recently we were trying to fix an issue with our Pacemaker/CMAN cluster on CentOS 6.7.
No matter what we tried, pcs cluster auth would not work.

I started digging and found in /var/log/secure that PAM was blocking the authentication.
That led me to this file


And commented out this line
auth            required onerr=fail item=group sense=allow file=/etc/

This fixed the problem, but our PAM configs are pushed via Puppet, so the change was overwritten on every Puppet run. I kept looking and found this

# in this file add haclient
vim /etc/

Check auth like this
pcs cluster auth nodeA nodeB -u hacluster

This assumes the hacluster user already has a password set.
If not, set it as root:
passwd hacluster
Then restart the pcsd service.
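
Put in order, the steps above might look like the sketch below. The helper names run and auth_nodes are made up for illustration, and so is the RUN switch: by default the script only prints the commands, so you can review them before running anything. It assumes the password for hacluster is already set.

```shell
#!/bin/sh
# Dry-run sketch: prints each command unless RUN=1 is set.
# "run" and "auth_nodes" are hypothetical helper names, not part of pcs.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

auth_nodes() {
    # Assumes hacluster already has a password (passwd hacluster, as root).
    run service pcsd restart || return 1
    run pcs cluster auth "$@" -u hacluster
}
```

Calling auth_nodes nodeA nodeB prints the two commands; with RUN=1 set in the environment it executes them instead.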

Sunday, May 22, 2016

Zabbix proxy force configuration update

From Zabbix proxy shell run:

zabbix_proxy -R config_cache_reload

WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor

We found this error was caused by one of the following:

Improper permissions on /var/lib/mysql
The sstuser account does not have the required privileges
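
A quick way to check the first cause is to look at who owns the data directory. The sketch below is only an illustration: check_owner is a made-up helper, it relies on GNU stat (as shipped on CentOS), and the expected owner "mysql" is an assumption about a typical install, not something this post verified.

```shell
#!/bin/sh
# check_owner DIR USER: succeed if DIR is owned by USER.
# Hypothetical helper; uses GNU stat's -c option.
check_owner() {
    owner=$(stat -c '%U' "$1") || return 2
    if [ "$owner" = "$2" ]; then
        echo "OK: $1 is owned by $2"
    else
        echo "MISMATCH: $1 is owned by $owner, expected $2"
        return 1
    fi
}
```

On the donor you would call it as check_owner /var/lib/mysql mysql. For the second cause, SHOW GRANTS FOR the sstuser account in the mysql client lists what it is actually allowed to do.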

Zabbix history and trends cleanup

Stop the Zabbix server and close the frontend's connections to the DB.

If using MySQL, make sure innodb_stats_auto_recalc is enabled (it is by default); otherwise you will have to run ANALYZE TABLE on all of these tables.

CREATE TABLE history_new LIKE history;
CREATE TABLE history_log_new LIKE history_log;
CREATE TABLE history_str_new LIKE history_str;
CREATE TABLE history_text_new LIKE history_text;
CREATE TABLE history_uint_new LIKE history_uint;
CREATE TABLE trends_new LIKE trends;
CREATE TABLE trends_uint_new LIKE trends_uint;

ALTER TABLE trends RENAME trends_old;
ALTER TABLE trends_new RENAME trends;
ALTER TABLE trends_uint RENAME trends_uint_old;
ALTER TABLE trends_uint_new RENAME trends_uint;
ALTER TABLE history RENAME history_old;
ALTER TABLE history_new RENAME history;
ALTER TABLE history_log RENAME history_log_old;
ALTER TABLE history_log_new RENAME history_log;
ALTER TABLE history_str RENAME history_str_old;
ALTER TABLE history_str_new RENAME history_str;
ALTER TABLE history_text RENAME history_text_old;
ALTER TABLE history_text_new RENAME history_text;
ALTER TABLE history_uint RENAME history_uint_old;
ALTER TABLE history_uint_new RENAME history_uint;

DROP TABLE trends_old;
DROP TABLE trends_uint_old;
DROP TABLE history_old;
DROP TABLE history_log_old;
DROP TABLE history_str_old;
DROP TABLE history_text_old;
DROP TABLE history_uint_old;

delete from events;

That's all.
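
The statements above follow one pattern per table, so they can also be generated instead of typed. This is a minimal sketch: gen_cleanup_sql is a made-up function name, and it only prints SQL, it does not connect to the database.

```shell
#!/bin/sh
# Print the CREATE / RENAME / DROP statements for each history and trends table.
# gen_cleanup_sql is a hypothetical helper name.
TABLES="history history_log history_str history_text history_uint trends trends_uint"

gen_cleanup_sql() {
    # 1. Empty copies with the same structure
    for t in $TABLES; do
        printf 'CREATE TABLE %s_new LIKE %s;\n' "$t" "$t"
    done
    # 2. Swap the empty copies into place
    for t in $TABLES; do
        printf 'ALTER TABLE %s RENAME %s_old;\n' "$t" "$t"
        printf 'ALTER TABLE %s_new RENAME %s;\n' "$t" "$t"
    done
    # 3. Drop the old data
    for t in $TABLES; do
        printf 'DROP TABLE %s_old;\n' "$t"
    done
}
```

Review the output first, then pipe it into the mysql client for your Zabbix database. The delete from events statement is left out on purpose, so it stays a deliberate manual step.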

Duplicate Entry error in Zabbix server logs

We were seeing the following error

22199:20150613:133805.639 [Z3005] query failed: [1062] Duplicate entry '1743313' for key 'PRIMARY' [insert into events (eventid,source,object,objectid,clock,ns,value) values (1743313,3,0,55456,1460569085,540384532,0);

The main reason for this error was that two Zabbix servers were connected to the same DB server. We thought this might work in an HA scenario, but it does not, even when the second Zabbix server is doing nothing.

You might also have to run this query

delete from events;

Be careful: this deletes all events.
We were running a load test, so losing them was acceptable.

Thursday, August 13, 2015

ERROR listener failed: zbx_tcp_listen() fatal error: unable to serve on any address [[-]:10051]

This came up during load testing. The error is similar to the one in the post below, but the cause was different.

Zabbix runs as part of Pacemaker cluster.

There were 2 reasons:

1. MySQL did not allow enough connections
Set this parameter in my.cnf: max_connections = 512

2. The Zabbix service is controlled by Pacemaker, so working on it requires maintenance mode. The articles I found do this with the crm command, but that is obsolete, so here is how to do it with pcs

pcs property set maintenance-mode=true
pcs property set maintenance-mode=false

Once the cluster is in maintenance mode, the service can be controlled independently of Pacemaker.
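
To make sure maintenance mode is always switched off again, the two pcs commands can wrap whatever you need to run. This is a sketch only: with_maintenance and the PCS override variable are made up for illustration (the override exists so the wrapper can be tried without a cluster).

```shell
#!/bin/sh
# with_maintenance CMD...: run CMD while the cluster is in maintenance mode.
# Hypothetical helper; set PCS to another command to try it without pcs.
with_maintenance() {
    "${PCS:-pcs}" property set maintenance-mode=true || return 1
    rc=0
    "$@" || rc=$?
    # Always hand control back to Pacemaker, even if CMD failed.
    "${PCS:-pcs}" property set maintenance-mode=false
    return "$rc"
}
```

For example: with_maintenance service zabbix-server restart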

Thursday, July 30, 2015

mysql-proxy not running via puppet service resource type

I had an issue with the service resource for mysql-proxy.
When the manifest was applied, the service would not start. Started manually, the service runs. Started with an exec resource, the service runs.

It seems the init script is not LSB compliant, so Puppet cannot trust its status result.

Modify the resource as follows to get it working
service { 'mysql-proxy':
  ensure     => running,
  # The init script's status is unreliable, so check the process list instead
  status     => 'ps afx | grep -i mysql-proxy | grep -v grep',
  hasstatus  => false,
  hasrestart => true,
}

Wednesday, July 15, 2015

zabbix-server does not start - zbx_tcp_listen() fatal error

Zabbix 2.4.x
CentOS 6.6

I have rarely seen anything like this, but my deployment of Zabbix via Puppet caused a very strange issue where the Zabbix service would not start.

Error in logs
listener failed: zbx_tcp_listen() fatal error: unable to serve on any address [[-]:10051]

Service status and when trying to start
[root@abc-zabserver-b zabbix]# service zabbix-server status
zabbix_server is stopped
[root@abc-zabserver-b zabbix]# service zabbix-server start
Starting Zabbix server:                                    [  OK  ]
[root@abc-zabserver-b zabbix]# service zabbix-server status
zabbix_server is stopped

Process is running (sometimes shows multiple processes running)
[root@abc-zabserver-b zabbix]# ps afx | grep -i zabbix
 3852 pts/1    S+     0:00  |       \_ grep -i zabbix
 2150 ?        S      0:00 zabbix_server -c /etc/zabbix/zabbix_server.conf

But service is still stopped
[root@abc-zabserver-b zabbix]# service zabbix-server status
zabbix_server is stopped

If I kill the process(es), the Zabbix service comes up fine.
When deploying Zabbix with Puppet I was using
ensure => installed
instead of
ensure => '2.4.1-5.el6' (or some other specific version)

Basically, my Zabbix config file was still from an older version, and it did not work well with the newer Zabbix package that had become available in the repos.

I ended up refreshing the config file.

Monday, July 13, 2015

Decrease timeout for Zabbix OK blinker

version: Zabbix Server 2.4

By default, OK triggers are displayed and status changes blink for 30 minutes, which means OK keeps blinking on the screen for that long.

There are two ways to change this:

Go to Administration > General > Trigger displaying options (drop down on right)
Change values of following as desired

  • Display OK triggers for
  • On status change triggers blink for

The other way is to change it directly in the DB (MySQL in this case). Note that there must be no space between -p and the password:

mysql -u zabbix -p'<PASSWORD>' -e 'UPDATE config SET `ok_period`=60, `blink_period`=60' zabbixserverdb

This is more helpful for automation.
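
For automation, it can help to build the statement from a single value and reject anything that is not a number. A minimal sketch follows; set_blink_sql is a made-up helper name, and it assumes both columns take a plain number of seconds, as the value 60 above suggests.

```shell
#!/bin/sh
# set_blink_sql SECONDS: print the UPDATE statement for both periods.
# Hypothetical helper; assumes ok_period and blink_period are stored in seconds.
set_blink_sql() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: set_blink_sql SECONDS" >&2
            return 1
            ;;
    esac
    printf 'UPDATE config SET ok_period=%s, blink_period=%s;\n' "$1" "$1"
}
```

The output can be piped into the same mysql command as above in place of the -e option.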

Thursday, July 9, 2015

MySQL server has gone away - Galera with HAProxy

Whew! Been a long time. I've been working on some interesting projects and will update with more info soon. Recently, I created a MySQL Galera cluster with HAProxy load balancing.

I saw these errors
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
[Z3005] query failed: [2006] MySQL server has gone away [select m.maintenanceid,m.maintenance_type,m.active_since,tp.timeperiod_type,tp.every,tp.month,tp.dayofweek,,tp.start_time,tp.period,tp.start_date from maintenances m,maintenances_windows mw,timeperiods tp where m.maintenanceid=mw.maintenanceid and mw.timeperiodid=tp.timeperiodid and m.active_since<=1436204460 and m.active_till>1436204460]

Tweak the following

On Galera nodes, add to the [mysqld] section of /etc/my.cnf
wait_timeout = 28000
max_allowed_packet = 64M

In HAProxy config /etc/haproxy/haproxy.cfg increase timeout values
timeout connect  10s
timeout client   2m
timeout server   2m