Re: Broken graphs - sometimes
Christian Rößner <c <at> roessner-network-solutions.com>
2015-02-24 17:47:28 GMT
I just don’t know if I should start my own thread; I do not want to hijack this one. I hope it is okay if I
answer with my settings details here?
> Am 23.02.2015 um 21:44 schrieb James Wells <jwells <at> dragonheim.net>:
> Everything you are describing indicates that you are overloading the DB. PHP plays no part in this, it is
all between the Zabbix server and the DB.
> So some basic questions I have not seen asked nor answered;
> • What DB are you running?
PostgreSQL 9.3 with partitioning. I have daily tables that are created by a cron job:
@daily echo "select add_partition('history', 'day');" | psql -U zabbix -d zabbix >/dev/null
@daily echo "select add_partition('history_uint', 'day');" | psql -U zabbix -d zabbix >/dev/null
@daily echo "select add_partition('history_log', 'day');" | psql -U zabbix -d zabbix >/dev/null
@monthly echo "select add_partition('trends', 'month');" | psql -U zabbix -d zabbix >/dev/null
@monthly echo "select add_partition('trends_uint', 'month');" | psql -U zabbix -d zabbix >/dev/null
It’s basically the version found in the Zabbix wiki, but it cannot run into deadlocks, because the tables are
created ahead of time by the cron jobs rather than on demand.
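For reference, the five crontab entries above can be collapsed into one helper. `run_sql` is a hypothetical name; it only builds the SQL string that the original jobs pipe into psql and does not touch the database itself:

```shell
# Hypothetical helper: builds the SQL string the crontab entries above send
# to psql. It performs no database work on its own.
run_sql() { printf "select add_partition('%s', '%s');\n" "$1" "$2"; }

# The three @daily jobs become:
for tbl in history history_uint history_log; do
    run_sql "$tbl" day      # | psql -U zabbix -d zabbix >/dev/null
done
# And the two @monthly jobs:
for tbl in trends trends_uint; do
    run_sql "$tbl" month    # | psql -U zabbix -d zabbix >/dev/null
done
```

Factoring the SQL generation out also makes it easy to log psql failures instead of discarding everything with `>/dev/null`.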
> • When you look at the Zabbix Status widget on the dashboard what does it show for 'Required server
performance, new values per second‘?
Required server performance, new values per second: 14.94
> • What are the values you have configured, in your Zabbix server config, for the following;
> • CacheSize
> • CacheUpdateFrequency
> • HistoryCacheSize
> • HistoryTextCacheSize
> • SenderFrequency
> • TrendCacheSize
> • UnavailableDelay
> • UnreachableDelay
I have not configured any of these. If you have suggestions, I would be very happy to hear them.
pcregrep -v '^\s*#' zabbix_server.conf | grep -ve ^$
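Since none of these are set, the compiled-in defaults apply. As an illustration only (these are starting points to experiment with, not measured recommendations for this setup), a zabbix_server.conf fragment might look like:

```ini
# Illustrative values only -- watch the cache-usage internal items/graphs
# and grow whichever cache trips the "more than 75% used" warnings.
CacheSize=64M
HistoryCacheSize=64M
HistoryTextCacheSize=32M
TrendCacheSize=16M
```

The server has to be restarted for these to take effect.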
> Once those questions are answered, you can really start to look into what is causing the gaps. As
someone stated above, this is most often caused by a bug in Zabbix's implementation of SNMP bulk get...
From there, the next most common is a poorly tuned DB, which is surprisingly easy to do. The next most likely
is badly configured items.
I think the DB should be OK. At least it is running on an HP server with 24 GB RAM (three channels active,
two NUMA nodes), SAS RAID 10 with perfect stripe alignment, and ext4 optimized for this. So at least the
hardware is fine. I am not sure what to tune in Postgres itself. There are also 8 VMs running on this same
host (low usage). For example, the current load of the server: load average: 0.20, 0.26, 0.33
2 x Intel(R) Xeon(R) CPU L5520 @ 2.27GHz
The nodes that I showed a picture of are directly connected with gigabit. The snmpd.local.conf for the SNMP
agents running on the Debian 7 nodes looks like this:
interface eth0 6 1000000000
interface eth1 6 1000000000
interface eth2 6 1000000000
interface eth3 6 1000000000
interface eth4 6 1000000000
interface eth5 6 1000000000
interface bond0 161 3000000000
interface bond1 161 3000000000
interface bond1.100 135 1000000000
interface bond1.102 135 1000000000
interface bond1.104 135 1000000000
interface bond1.105 135 1000000000
interface bond1.106 135 1000000000
interface bond1.107 135 1000000000
interface bond1.108 135 1000000000
interface bond1.109 135 1000000000
interface bond1.200 135 1000000000
interface ifb0 6 1000000000
extend phone_in /usr/local/bin/phone-class-in.sh
extend phone_out /usr/local/bin/phone-class-out.sh
The Zabbix host configuration uses SNMPv2, so there should not be any counter overflow issue.
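For context on why the SNMP version matters: with the 32-bit octet counters of SNMPv1, a saturated gigabit link wraps the counter roughly every half minute, far shorter than a typical polling interval, while the 64-bit Counter64 types available in SNMPv2c do not wrap in practice. A quick back-of-the-envelope check:

```shell
# Seconds until a 32-bit octet counter wraps on a fully loaded 1 Gb/s link:
# 2^32 bytes divided by (10^9 bits/s / 8 bits per byte).
wrap=$(( (1 << 32) / (1000000000 / 8) ))
echo "${wrap}s"   # -> 34s
```

So with SNMPv2c and 64-bit counters, wrap-induced spikes can indeed be ruled out here.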
Nothing in the logs.
I double-checked this in the zabbix_server.log. Normally I would expect errors that lead to a timeout, such as:
12456:20150203:231526.093 SNMP agent item "tc.classid.1.110" on host "node0.localdomain" failed:
first network error, wait for 15 seconds
These occur very, very rarely, every few weeks (probably when rebooting a node for a new kernel).
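To put a number on how rare such events are, one can count the "network error" lines per day in the server log. The pipeline below runs on a sample line taken from the excerpt above; for real use, replace the printf with `cat /var/log/zabbix/zabbix_server.log` (path is an assumption):

```shell
# Count "network error" lines per day. The zabbix_server.log prefix is
# PID:YYYYMMDD:HHMMSS.mmm, so the second ":"-separated field is the date.
line='12456:20150203:231526.093 SNMP agent item "tc.classid.1.110" on host "node0.localdomain" failed: first network error, wait for 15 seconds'
per_day=$(printf '%s\n' "$line" \
    | grep 'network error' \
    | awk -F: '{print $2}' \
    | sort | uniq -c)
echo "$per_day"
```

If the daily counts correlate with the days the graphs break, the gaps are network-side; if not, the cause is more likely in the server or DB.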
> If you are using SNMP, you can test the first one by disabling all nodes but one SNMP node that is showing the
issue. If it is the SNMP bulk get issue, this node will continue to have the issue, even if it is the only node
you are monitoring.
> As for the DB, there are too many possible things to look at there without knowing what DB you are running.
> For the items, simply look at the items that are showing gaps... Are they the same ones all the time for all
nodes or are they random ones on random nodes? Generally, if it is the same items, it means that your
monitoring code is randomly returning an invalid value type or a non-zero exit status.
> Another option, based on what you have said so far, is that you might be using Zabbix Agent instead of Zabbix
Agent (Active). Zabbix Agent mode is pretty expensive, as the server has to go out and query each item
individually... If a single item hangs or times out, the Zabbix server may reap the thread. When it does
this, the server will cease to collect any data that was going to be collected by that thread until the next
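If switching to active checks, both sides need a change: the item type in the frontend becomes "Zabbix agent (active)", and the agent must know which server to report to. A minimal zabbix_agentd.conf sketch (the server name is an example, not taken from this thread):

```ini
# zabbix_agentd.conf -- minimal active-checks setup; names are examples.
ServerActive=zabbix.example.com
Hostname=node0.localdomain
```

With active checks the agent fetches its item list and pushes values itself, so one slow item no longer stalls a server poller thread.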
> On Mon, Feb 23, 2015 at 11:14 AM, Christian Rößner <c <at> roessner-network-solutions.com> wrote:
> > I have a Zabbix server, and sometimes one of my graphs is broken, with lines and/or dots.
> > FYI, see this graphic at http://postimg.org/image/6ycrwwve3/
> > I've changed (increased/decreased) some parameters such as StartPollers and StartPingers, for example.
> > This server monitors about 300 hosts and less than 1k items.
> I guess I have similar problems:
> Neither 2.2.7 nor 2.4.3 has resolved this. It always seems to occur when there is a lot of data to be
graphed, i.e. when there is a peak.
> I use SQL partitions, so there is no housekeeping active.
> But I also want to check the cache idea from above.
> I got some warnings from Zabbix about several processes being more than 75% busy, and I have adapted them
all over time. Memory should not be a big problem. Or would this be a setting for PHP somewhere?
> I monitor 3424 items for 121 devices.
> If there is any news on this issue, please let me know, as this is somewhat annoying.
> Kind regards
> Bachelor of Science Informatik
> Erlenwiese 14, 36304 Alsfeld
> T: +49 6631 78823400, F: +49 6631 78823409, M: +49 171 9905345
> USt-IdNr.: DE225643613, http://www.roessner-network-solutions.com
> Zabbix-users mailing list
> Zabbix-users <at> lists.sourceforge.net