[00:00:42] New patchset: Sara; "Apply ganglia gmetad changes from test branch to production branch. gmetad will now use a template. manutius will monitor only "Upload caches eqiad"." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12132 [00:01:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12132 [00:05:27] New review: Sara; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12132 [00:05:30] Change merged: Sara; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12132 [00:06:05] New review: Asher; "(no comment)" [operations/debs/squid] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12129 [00:06:07] Change merged: Asher; [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/12129 [00:08:06] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:10:23] New patchset: Lcarr; "fixing ipv6 monitors for lvs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12133 [00:10:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12133 [00:11:04] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12133 [00:11:07] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12133 [00:13:14] ssmollett: is your ganglia change ready to be merged ? [00:13:30] LeslieCarr: yup. [00:13:37] merging [00:38:33] binasher: have you read much about scalebase? [00:38:51] * AaronSchulz tries to find details [00:39:47] AaronSchulz: are they in the cloud?? [01:03:43] New review: Asher; "(no comment)" [operations/debs/squid] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12126 [01:03:45] Change merged: Asher; [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/12126 [01:41:35] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 213 seconds [01:41:35] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 255 seconds [01:47:35] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 22 seconds [01:49:32] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 690s [01:52:05] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 17 seconds [01:52:32] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 16s [02:40:16] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [04:43:57] New patchset: Tim Starling; "Removed redundant "Apaches 8 CPU" data source" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12143 [04:44:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12143 [04:44:30] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12143 [04:44:32] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12143 [06:10:05] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [06:41:58] New patchset: ArielGlenn; "sigpipe sometimes shows as 141; fix up noop/lastlinks to work again" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12144 [06:47:12] New patchset: ArielGlenn; "sigpipe sometimes shows as 141; fix up noop/lastlinks to work again" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12144 [06:48:10] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12144 [06:48:12] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12144 [07:09:47] New patchset: ArielGlenn; "fix name of dsh node group for snapshot hosts" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12145 [07:10:30] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12145 [07:10:32] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12145 [07:12:27] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [08:06:58] New patchset: QChris; "Repairung status for items with toBeRun false." [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12147 [08:09:57] New review: QChris; "While moving the check for Maintenance mode around in commit" [operations/dumps] (ariel) C: 0; - https://gerrit.wikimedia.org/r/12147 [08:27:20] New review: ArielGlenn; "The problem is that we want to do md5sum and status markup of all jobs, not just the ones that actua..." [operations/dumps] (ariel); V: 0 C: 0; - https://gerrit.wikimedia.org/r/12147 [08:58:00] New patchset: Hashar; "class to cleanup /var/cache/apt/archives" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12151 [08:58:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12151 [09:05:06] New review: Hashar; "So should we just abandon this change and fill a request for the above DNS entry?" [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/9874 [09:11:46] New review: Dereckson; "Logo protected." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/11741 [09:23:03] New patchset: Hashar; "(bug 37327) Configure chr.wikipedia site logo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11741 [09:23:10] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11741 [09:23:33] New review: Hashar; "Patchset 2 is a rebase." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11741 [09:24:29] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11741 [09:26:43] New review: Hashar; "deployed on live site." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11741 [09:38:59] New patchset: Hashar; "$wgMobileResourceVersion does not exist anymore" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12043 [09:39:05] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/12043 [09:39:19] New review: Hashar; "Patchset 2 is just a rebase." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12043 [09:39:22] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12043 [09:40:13] New review: Hashar; "Deployed on live site." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12043 [09:42:49] New patchset: Hashar; "(bug 37457) viwikibooks can import from fr/it wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11746 [09:42:55] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11746 [09:43:02] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11746 [09:43:04] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11746 [09:43:47] New review: Hashar; "Deployed on live site." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11746 [09:53:19] New patchset: ArielGlenn; "--date option; date and wikiname in outputdirname optionally" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12152 [09:55:07] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12152 [09:55:09] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12152 [10:08:51] New patchset: QChris; "Correcting md5 copying and item updating order" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12153 [10:08:52] New patchset: QChris; "Repairing status for items with toBeRun false." [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12147 [10:09:05] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:09:34] New review: QChris; "Although commit c483a067b9eb3214ee1309ef2602bfb3c4c831f2 repairs the" [operations/dumps] (ariel) C: 0; - https://gerrit.wikimedia.org/r/12147 [10:35:47] New review: Lydia Pintscher; "Yes I think we can close this." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/9874 [12:13:44] New patchset: Mark Bergsma; "Don't blindly accept mail for non-local domains, to fix SMTP address verification" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12162 [12:14:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12162 [12:22:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12162 [12:22:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12162 [12:27:33] New patchset: Petrb; "inserted back the ram check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12163 [12:27:42] Ryan_Lane: ^ [12:28:04] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12163 [12:32:15] petan: thanks [12:35:13] New patchset: Petrb; "inserted back the ram check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12163 [12:35:36] Ryan_Lane: I don't know if we use the test branch or not, so I inserted it to both [12:35:44] see topic :) [12:35:46] New patchset: Hashar; "(bug 37740) raise account throttle for an edit marathon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12165 [12:35:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12163 [12:35:50] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/12165 [12:35:54] I'll abandon it in test [12:36:15] Ryan_Lane: ok, then how do we make nagios work again [12:36:23] Ryan_Lane: by merging this in production? [12:36:28] branch [12:36:34] I'm going to do the switch today [12:36:38] ok [12:36:39] and yes, I'll merge it into production [12:36:53] hm, we have a few open things in the test branch [12:37:01] New review: Amire80; "(no comment)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/12165 [12:37:06] I'll just ignore them [12:37:43] ok [12:38:04] so when you merge it to production it will be in labs as well? [12:38:32] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12165 [12:38:34] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12165 [12:40:44] New patchset: Ryan Lane; "Merge remote-tracking branch 'origin/production' into test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12021 [12:40:59] petan: when I finish the switchover, yeah [12:41:14] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [12:41:15] wait. that doesn't look right [12:41:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12021 [12:41:17] heh [12:41:29] New review: Hashar; "deployed live." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12165 [12:42:03] well, the comment isn't correct anyway [12:42:20] Ryan_Lane: which [12:42:28] for the change I just pushed in [12:43:47] New patchset: Ryan Lane; "Merge test into production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12021 [12:43:58] that's better [12:44:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12021 [12:44:48] !log merging puppet test branch into production branch, lot's of changes going through…. [12:46:16] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12021 [12:46:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12021 [13:07:57] New patchset: Ryan Lane; "Remove duplicate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12167 [13:19:04] New patchset: Ryan Lane; "Not setting config options based on a version of something is a bad idea" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12169 [13:19:38] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12169 [13:19:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12169 [13:20:12] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12169 [13:20:28] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12167 [13:20:39] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12167 [13:20:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12169 [13:20:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12167 [13:27:30] New patchset: Ryan Lane; "Remove labs specific fact from production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12172 [13:28:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12172 [13:29:13] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12172 [13:29:16] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12172 [13:38:54] PROBLEM - Swift HTTP on copper is CRITICAL: Connection refused [13:39:03] PROBLEM - Swift HTTP on magnesium is CRITICAL: Connection refused [13:39:12] PROBLEM - Swift HTTP on owa3 is CRITICAL: Connection refused [13:39:21] PROBLEM - Swift HTTP on owa2 is CRITICAL: Connection refused [13:39:39] PROBLEM - Swift HTTP on zinc is CRITICAL: Connection refused [13:41:42] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12163 [13:41:47] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12163 [13:45:39] PROBLEM - Varnish HTCP daemon on cp1029 is CRITICAL: Connection refused by host [13:46:33] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: Connection refused by host [13:46:52] PROBLEM - Lucene disk space on search1019 is CRITICAL: Connection refused by host [13:47:09] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: Connection refused by host [13:47:14] are those swift errors actually a problem? [13:47:18] PROBLEM - MySQL Slave Running on db1022 is CRITICAL: Connection refused by host [13:47:18] PROBLEM - MySQL Idle Transactions on db1022 is CRITICAL: Connection refused by host [13:47:18] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: Connection refused by host [13:47:19] are those systems in use? 
[13:47:36] PROBLEM - Varnish HTCP daemon on cp1030 is CRITICAL: Connection refused by host [13:47:36] PROBLEM - MySQL Slave Delay on db57 is CRITICAL: Connection refused by host [13:47:36] PROBLEM - MySQL Slave Running on db57 is CRITICAL: Connection refused by host [13:47:36] PROBLEM - MySQL Idle Transactions on db57 is CRITICAL: Connection refused by host [13:47:41] o.O [13:47:45] PROBLEM - mysqld processes on db57 is CRITICAL: Connection refused by host [13:48:03] PROBLEM - Lucene disk space on search19 is CRITICAL: Connection refused by host [13:48:03] PROBLEM - Full LVS Snapshot on db57 is CRITICAL: Connection refused by host [13:48:03] PROBLEM - MySQL Recent Restart on db57 is CRITICAL: Connection refused by host [13:48:03] PROBLEM - MySQL disk space on db57 is CRITICAL: Connection refused by host [13:48:11] this looks like nagios is broken [13:48:12] PROBLEM - MySQL Recent Restart on db1022 is CRITICAL: Connection refused by host [13:48:12] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: Connection refused by host [13:48:12] PROBLEM - Lucene disk space on search1015 is CRITICAL: Connection refused by host [13:48:14] yay [13:48:21] PROBLEM - MySQL Slave Delay on db1022 is CRITICAL: Connection refused by host [13:48:21] PROBLEM - Full LVS Snapshot on db1022 is CRITICAL: Connection refused by host [13:48:21] PROBLEM - mysqld processes on db1022 is CRITICAL: Connection refused by host [13:48:24] Ryan_Lane: did you copy the REALM? [13:48:28] eh? [13:48:34] what do you mean? [13:48:36] because production has different IP than labs [13:48:41] if we are using the same config for both [13:48:47] we need to use realm [13:48:50] !$realm [13:48:50] $realm is a variable used in puppet to determine which cluster a system is in. See also $site. [13:49:13] Ryan_Lane: in nrpe config I mean [13:49:51] PROBLEM - Lucene disk space on search1021 is CRITICAL: Connection refused by host [13:50:00] PROBLEM - Lucene disk space on search1002 is CRITICAL: Connection refused by host [13:50:12] allowed_hosts is fine [13:51:27] is neon not the nagios server? [13:51:57] seems its spense [13:51:58] spence [13:52:15] PROBLEM - MySQL disk space on db60 is CRITICAL: Connection refused by host [13:52:24] PROBLEM - MySQL disk space on db59 is CRITICAL: Connection refused by host [13:52:24] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: Connection refused by host [13:52:24] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: Connection refused by host [13:52:24] PROBLEM - MySQL Recent Restart on db59 is CRITICAL: Connection refused by host [13:52:24] PROBLEM - Full LVS Snapshot on db60 is CRITICAL: Connection refused by host [13:52:33] PROBLEM - Lucene disk space on search21 is CRITICAL: Connection refused by host [13:52:33] PROBLEM - MySQL Idle Transactions on db60 is CRITICAL: Connection refused by host [13:52:42] PROBLEM - Lucene disk space on search1016 is CRITICAL: Connection refused by host [13:52:51] PROBLEM - MySQL Replication Heartbeat on db59 is CRITICAL: Connection refused by host [13:52:54] hm [13:52:56] Ryan_Lane: that's for documentation on wikitech would be [13:53:00] why is nrpe not running on them? 
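The flood of "Connection refused by host" alerts above usually points at the NRPE daemon itself rather than at allowed_hosts: a refused TCP connection means nothing is listening on the check port at all, whereas an allowed_hosts mismatch still lets the TCP connect succeed before the check is rejected. A minimal sketch of that distinction in Python (host names are purely illustrative; NRPE's default port is 5666):

    import socket

    NRPE_PORT = 5666  # NRPE's default listening port
    hosts = ["db57.pmtpa.wmnet", "search1019.eqiad.wmnet"]  # illustrative names only

    for host in hosts:
        try:
            # A successful connect means the nrpe daemon is at least listening,
            # even if allowed_hosts would later reject the monitoring server.
            with socket.create_connection((host, NRPE_PORT), timeout=5):
                print(f"{host}: nrpe port open")
        except ConnectionRefusedError:
            print(f"{host}: connection refused - nrpe probably not running")
        except OSError as exc:
            print(f"{host}: {exc}")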
[13:53:00] PROBLEM - Lucene disk space on search1020 is CRITICAL: Connection refused by host [13:53:00] PROBLEM - Lucene disk space on search1010 is CRITICAL: Connection refused by host [13:53:02] if it was actually correct [13:53:03] :D [13:53:09] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: Connection refused by host [13:53:27] PROBLEM - Lucene disk space on search1001 is CRITICAL: Connection refused by host [13:53:27] PROBLEM - MySQL Slave Running on db59 is CRITICAL: Connection refused by host [13:53:27] PROBLEM - MySQL Idle Transactions on db59 is CRITICAL: Connection refused by host [13:53:36] PROBLEM - Full LVS Snapshot on db59 is CRITICAL: Connection refused by host [13:53:36] PROBLEM - mysqld processes on db59 is CRITICAL: Connection refused by host [13:53:45] RECOVERY - MySQL disk space on db60 is OK: DISK OK [13:53:52] I wonder if the nrpe restart is broken or something [13:53:54] RECOVERY - MySQL Slave Delay on db60 is OK: OK replication delay 0 seconds [13:53:54] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds [13:53:54] RECOVERY - Full LVS Snapshot on db60 is OK: OK no full LVM snapshot volumes [13:54:03] RECOVERY - MySQL Idle Transactions on db60 is OK: OK longest blocking idle transaction sleeps for 0 seconds [13:54:47] I think mutante was talking with me on hackaton about something what is running so long with no restart that he isn't even sure if it's able to restart [13:54:51] it was nagios related :P [13:54:57] PROBLEM - Lucene disk space on searchidx2 is CRITICAL: Connection refused by host [13:54:57] PROBLEM - Varnish HTCP daemon on cp1034 is CRITICAL: Connection refused by host [13:55:15] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: Connection refused by host [13:55:15] RECOVERY - MySQL disk space on db59 is OK: DISK OK [13:55:33] RECOVERY - MySQL Recent Restart on db59 is OK: OK 4621659 seconds since restart [13:55:42] RECOVERY - MySQL Replication Heartbeat on db59 is OK: OK replication delay 1 seconds [13:56:09] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 0 seconds [13:56:18] RECOVERY - MySQL Slave Running on db59 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [13:56:18] RECOVERY - MySQL Idle Transactions on db59 is OK: OK longest blocking idle transaction sleeps for 0 seconds [13:56:27] RECOVERY - Full LVS Snapshot on db59 is OK: OK no full LVM snapshot volumes [13:56:27] RECOVERY - mysqld processes on db59 is OK: PROCS OK: 1 process with command name mysqld [13:56:45] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: Connection refused by host [13:57:10] oh, when puppet re-runs, it'll fix this [13:57:12] PROBLEM - Varnish HTCP daemon on cp1035 is CRITICAL: Connection refused by host [13:57:48] RECOVERY - Lucene disk space on search1001 is OK: DISK OK [13:58:24] PROBLEM - Lucene disk space on search13 is CRITICAL: Connection refused by host [13:58:42] PROBLEM - Lucene disk space on search1014 is CRITICAL: Connection refused by host [14:00:12] PROBLEM - Lucene disk space on search1008 is CRITICAL: Connection refused by host [14:00:12] PROBLEM - Lucene disk space on search1011 is CRITICAL: Connection refused by host [14:00:39] PROBLEM - MySQL Slave Running on db1001 is CRITICAL: Connection refused by host [14:00:39] PROBLEM - MySQL Recent Restart on db1001 is CRITICAL: Connection refused by host [14:00:39] PROBLEM - MySQL disk space on db1001 is CRITICAL: Connection refused by host [14:00:39] PROBLEM - MySQL Idle Transactions on db1001 is CRITICAL: 
Connection refused by host [14:00:48] PROBLEM - MySQL Slave Delay on db1001 is CRITICAL: Connection refused by host [14:00:48] PROBLEM - mysqld processes on db1001 is CRITICAL: Connection refused by host [14:00:48] PROBLEM - MySQL Recent Restart on db58 is CRITICAL: Connection refused by host [14:00:57] PROBLEM - Full LVS Snapshot on db1001 is CRITICAL: Connection refused by host [14:00:57] PROBLEM - MySQL Slave Running on db58 is CRITICAL: Connection refused by host [14:01:06] PROBLEM - Full LVS Snapshot on db58 is CRITICAL: Connection refused by host [14:01:06] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: Connection refused by host [14:01:06] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: Connection refused by host [14:01:15] PROBLEM - Lucene disk space on search36 is CRITICAL: Connection refused by host [14:01:19] oh yeah, this is going to be annoying as hell [14:01:24] PROBLEM - mysqld processes on db58 is CRITICAL: Connection refused by host [14:01:24] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: Connection refused by host [14:01:33] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: Connection refused by host [14:01:51] PROBLEM - MySQL Idle Transactions on db58 is CRITICAL: Connection refused by host [14:02:00] PROBLEM - Varnish HTCP daemon on cp1033 is CRITICAL: Connection refused by host [14:02:00] PROBLEM - MySQL disk space on db58 is CRITICAL: Connection refused by host [14:02:04] het at least it's not paging [14:02:27] PROBLEM - MySQL Recent Restart on db1020 is CRITICAL: Connection refused by host [14:02:36] PROBLEM - MySQL disk space on db1020 is CRITICAL: Connection refused by host [14:03:03] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: Connection refused by host [14:03:03] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: Connection refused by host [14:03:03] PROBLEM - Full LVS Snapshot on bellin is CRITICAL: Connection refused by host [14:03:03] PROBLEM - MySQL Slave Delay on bellin is CRITICAL: Connection refused by host [14:03:12] PROBLEM - Lucene disk space on search31 is CRITICAL: Connection refused by host [14:03:12] PROBLEM - Full LVS Snapshot on db1020 is CRITICAL: Connection refused by host [14:03:12] PROBLEM - MySQL Idle Transactions on db1020 is CRITICAL: Connection refused by host [14:03:12] PROBLEM - MySQL Slave Running on db1020 is CRITICAL: Connection refused by host [14:03:12] PROBLEM - mysqld processes on db1020 is CRITICAL: Connection refused by host [14:03:21] PROBLEM - MySQL Replication Heartbeat on bellin is CRITICAL: Connection refused by host [14:03:30] PROBLEM - MySQL Idle Transactions on bellin is CRITICAL: Connection refused by host [14:03:30] PROBLEM - mysqld processes on bellin is CRITICAL: Connection refused by host [14:03:39] PROBLEM - MySQL Slave Running on bellin is CRITICAL: Connection refused by host [14:03:48] PROBLEM - MySQL disk space on bellin is CRITICAL: Connection refused by host [14:03:48] PROBLEM - MySQL Recent Restart on bellin is CRITICAL: Connection refused by host [14:04:24] PROBLEM - Lucene disk space on search33 is CRITICAL: Connection refused by host [14:04:42] PROBLEM - Lucene disk space on search1006 is CRITICAL: Connection refused by host [14:04:42] PROBLEM - MySQL Replication Heartbeat on db1004 is CRITICAL: Connection refused by host [14:04:51] PROBLEM - MySQL Recent Restart on db1004 is CRITICAL: Connection refused by host [14:05:00] PROBLEM - MySQL Slave Delay on db1004 is CRITICAL: Connection refused by host [14:05:00] PROBLEM - Varnish traffic logger on cp1032 is 
CRITICAL: Connection refused by host [14:05:18] PROBLEM - Varnish HTCP daemon on cp1032 is CRITICAL: Connection refused by host [14:05:36] PROBLEM - Full LVS Snapshot on db1004 is CRITICAL: Connection refused by host [14:05:45] PROBLEM - MySQL Idle Transactions on db1004 is CRITICAL: Connection refused by host [14:05:45] PROBLEM - MySQL Slave Running on db1004 is CRITICAL: Connection refused by host [14:05:45] PROBLEM - MySQL disk space on db1004 is CRITICAL: Connection refused by host [14:05:45] PROBLEM - Lucene disk space on search1022 is CRITICAL: Connection refused by host [14:05:54] PROBLEM - Lucene disk space on search1013 is CRITICAL: Connection refused by host [14:05:54] PROBLEM - mysqld processes on db1004 is CRITICAL: Connection refused by host [14:06:03] PROBLEM - Lucene disk space on search1018 is CRITICAL: Connection refused by host [14:06:21] PROBLEM - MySQL disk space on db1026 is CRITICAL: Connection refused by host [14:08:09] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: Connection refused by host [14:08:27] PROBLEM - Lucene disk space on search28 is CRITICAL: Connection refused by host [14:08:36] PROBLEM - Varnish HTCP daemon on cp1036 is CRITICAL: Connection refused by host [14:10:15] PROBLEM - Lucene disk space on search35 is CRITICAL: Connection refused by host [14:10:24] PROBLEM - Lucene disk space on search24 is CRITICAL: Connection refused by host [14:12:30] PROBLEM - Varnish HTCP daemon on cp1031 is CRITICAL: Connection refused by host [14:12:57] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: Connection refused by host [14:13:15] RECOVERY - Varnish HTCP daemon on cp1029 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:13:33] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: Connection refused by host [14:13:33] PROBLEM - Lucene disk space on search1007 is CRITICAL: Connection refused by host [14:13:51] PROBLEM - Full LVS Snapshot on db53 is CRITICAL: Connection refused by host [14:13:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: Connection refused by host [14:13:51] PROBLEM - mysqld processes on db53 is CRITICAL: Connection refused by host [14:14:00] RECOVERY - Varnish HTCP daemon on cp1030 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:14:09] PROBLEM - MySQL Slave Running on db53 is CRITICAL: Connection refused by host [14:14:09] PROBLEM - MySQL Recent Restart on db53 is CRITICAL: Connection refused by host [14:14:09] PROBLEM - MySQL disk space on db53 is CRITICAL: Connection refused by host [14:14:18] RECOVERY - MySQL Recent Restart on db57 is OK: OK 4325036 seconds since restart [14:14:27] RECOVERY - Lucene disk space on search1019 is OK: DISK OK [14:14:27] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [14:14:27] PROBLEM - MySQL Idle Transactions on db53 is CRITICAL: Connection refused by host [14:14:27] RECOVERY - MySQL Idle Transactions on db57 is OK: OK longest blocking idle transaction sleeps for 0 seconds [14:14:27] RECOVERY - Full LVS Snapshot on db57 is OK: OK no full LVM snapshot volumes [14:14:49] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [14:14:49] RECOVERY - MySQL Idle Transactions on db1022 is OK: OK longest blocking idle transaction sleeps for 0 seconds [14:14:49] RECOVERY - MySQL Slave Running on db1022 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:14:58] RECOVERY - MySQL Slave Delay 
on db57 is OK: OK replication delay 0 seconds [14:15:16] RECOVERY - MySQL Recent Restart on db1022 is OK: OK 3688373 seconds since restart [14:15:16] RECOVERY - mysqld processes on db57 is OK: PROCS OK: 1 process with command name mysqld [14:15:16] RECOVERY - MySQL Slave Running on db57 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:15:34] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay 0 seconds [14:15:43] RECOVERY - MySQL disk space on db57 is OK: DISK OK [14:15:52] PROBLEM - Lucene disk space on search16 is CRITICAL: Connection refused by host [14:15:52] RECOVERY - Lucene disk space on search19 is OK: DISK OK [14:15:52] RECOVERY - mysqld processes on db1022 is OK: PROCS OK: 1 process with command name mysqld [14:15:52] RECOVERY - Full LVS Snapshot on db1022 is OK: OK no full LVM snapshot volumes [14:16:01] RECOVERY - MySQL Slave Delay on db1022 is OK: OK replication delay 0 seconds [14:16:01] PROBLEM - Lucene disk space on search1017 is CRITICAL: Connection refused by host [14:16:01] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay 0 seconds [14:16:01] RECOVERY - Lucene disk space on search1015 is OK: DISK OK [14:16:07] New patchset: Hashar; "appsudoers is not a puppet class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12178 [14:16:10] PROBLEM - Lucene disk space on search1012 is CRITICAL: Connection refused by host [14:16:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12178 [14:17:13] RECOVERY - Lucene disk space on search1021 is OK: DISK OK [14:17:31] PROBLEM - Lucene disk space on search1003 is CRITICAL: Connection refused by host [14:17:31] RECOVERY - Lucene disk space on search1002 is OK: DISK OK [14:17:40] PROBLEM - Lucene disk space on search1004 is CRITICAL: Connection refused by host [14:20:31] RECOVERY - Lucene disk space on search1020 is OK: DISK OK [14:20:31] RECOVERY - Lucene disk space on search1010 is OK: DISK OK [14:20:40] RECOVERY - Lucene disk space on search1016 is OK: DISK OK [14:21:34] RECOVERY - Lucene disk space on search21 is OK: DISK OK [14:22:01] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [14:22:19] RECOVERY - Lucene disk space on searchidx2 is OK: DISK OK [14:22:28] RECOVERY - Varnish HTCP daemon on cp1034 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:24:25] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [14:24:34] RECOVERY - Varnish HTCP daemon on cp1035 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:25:46] RECOVERY - Lucene disk space on search13 is OK: DISK OK [14:26:22] RECOVERY - Lucene disk space on search1014 is OK: DISK OK [14:27:34] RECOVERY - Lucene disk space on search1011 is OK: DISK OK [14:27:43] RECOVERY - Lucene disk space on search1008 is OK: DISK OK [14:27:52] RECOVERY - mysqld processes on db1001 is OK: PROCS OK: 1 process with command name mysqld [14:28:01] RECOVERY - MySQL Slave Running on db1001 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:28:10] RECOVERY - MySQL disk space on db1001 is OK: DISK OK [14:28:10] RECOVERY - MySQL Recent Restart on db1001 is OK: OK 2424579 seconds since restart [14:28:19] RECOVERY - MySQL Idle Transactions on db1001 is OK: OK longest blocking idle transaction sleeps for 0 seconds [14:28:19] RECOVERY 
- MySQL Replication Heartbeat on db58 is OK: OK replication delay 0 seconds [14:28:37] RECOVERY - Full LVS Snapshot on db58 is OK: OK no full LVM snapshot volumes [14:28:37] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 0 seconds [14:28:37] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds [14:28:46] RECOVERY - Lucene disk space on search36 is OK: DISK OK [14:28:46] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [14:28:55] RECOVERY - Varnish HTCP daemon on cp1033 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:29:04] RECOVERY - MySQL Idle Transactions on db58 is OK: OK longest blocking idle transaction sleeps for 0 seconds [14:29:04] RECOVERY - MySQL Slave Running on db58 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:29:04] RECOVERY - mysqld processes on db58 is OK: PROCS OK: 1 process with command name mysqld [14:29:04] RECOVERY - Full LVS Snapshot on db1001 is OK: OK no full LVM snapshot volumes [14:29:13] RECOVERY - MySQL Slave Delay on db1001 is OK: OK replication delay 0 seconds [14:29:31] RECOVERY - MySQL disk space on db58 is OK: DISK OK [14:29:31] RECOVERY - MySQL Recent Restart on db58 is OK: OK 4878710 seconds since restart [14:29:58] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 1 seconds [14:30:25] RECOVERY - Full LVS Snapshot on db1020 is OK: OK no full LVM snapshot volumes [14:30:35] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [14:30:35] RECOVERY - MySQL Slave Delay on bellin is OK: OK replication delay seconds [14:30:43] RECOVERY - MySQL Slave Running on db1020 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:30:52] RECOVERY - Full LVS Snapshot on bellin is OK: OK no full LVM snapshot volumes [14:30:52] RECOVERY - mysqld processes on db1020 is OK: PROCS OK: 1 process with command name mysqld [14:31:01] RECOVERY - mysqld processes on bellin is OK: PROCS OK: 1 process with command name mysqld [14:31:01] RECOVERY - MySQL Replication Heartbeat on bellin is OK: OK replication delay seconds [14:31:10] RECOVERY - MySQL Idle Transactions on db1020 is OK: OK longest blocking idle transaction sleeps for 0 seconds [14:31:10] RECOVERY - MySQL Idle Transactions on bellin is OK: OK longest blocking idle transaction sleeps for seconds [14:31:19] RECOVERY - MySQL disk space on db1020 is OK: DISK OK [14:31:19] RECOVERY - MySQL disk space on bellin is OK: DISK OK [14:31:28] RECOVERY - MySQL Recent Restart on db1020 is OK: OK 2426681 seconds since restart [14:31:28] RECOVERY - MySQL Slave Running on bellin is OK: OK replication [14:31:37] RECOVERY - MySQL Recent Restart on bellin is OK: OK seconds since restart [14:31:37] RECOVERY - Lucene disk space on search31 is OK: DISK OK [14:31:46] RECOVERY - Lucene disk space on search33 is OK: DISK OK [14:32:22] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [14:32:31] RECOVERY - Varnish HTCP daemon on cp1032 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:32:40] RECOVERY - Lucene disk space on search1018 is OK: DISK OK [14:32:40] RECOVERY - MySQL Slave Delay on db1004 is OK: OK replication delay 1 seconds [14:32:40] RECOVERY - Full LVS Snapshot on db1004 is OK: OK no full LVM snapshot volumes [14:32:49] RECOVERY - MySQL Idle Transactions on db1004 is OK: OK longest blocking 
idle transaction sleeps for 0 seconds [14:32:58] RECOVERY - MySQL Slave Running on db1004 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:33:07] RECOVERY - mysqld processes on db1004 is OK: PROCS OK: 1 process with command name mysqld [14:33:07] RECOVERY - MySQL disk space on db1004 is OK: DISK OK [14:33:25] RECOVERY - Lucene disk space on search1006 is OK: DISK OK [14:33:25] RECOVERY - MySQL Recent Restart on db1004 is OK: OK 3689535 seconds since restart [14:33:34] RECOVERY - MySQL Replication Heartbeat on db1004 is OK: OK replication delay 0 seconds [14:33:43] RECOVERY - Lucene disk space on search1013 is OK: DISK OK [14:34:34] hey mark, if you are around [14:34:37] RECOVERY - Lucene disk space on search1022 is OK: DISK OK [14:34:41] asher has approved my /var/run/ mysql commit [14:34:43] https://gerrit.wikimedia.org/r/#/c/11296/ [14:34:46] but I need a merge still [14:34:55] RECOVERY - MySQL disk space on db1026 is OK: DISK OK [14:35:22] RECOVERY - Varnish HTCP daemon on cp1036 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:35:58] RECOVERY - Lucene disk space on search28 is OK: DISK OK [14:36:23] New patchset: Hashar; "Change autopatrol-related mediawikwiki userrights" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11748 [14:36:29] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11748 [14:36:34] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [14:37:07] New review: Hashar; "- did a rebase" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11748 [14:37:14] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11748 [14:38:49] RECOVERY - Lucene disk space on search24 is OK: DISK OK [14:38:54] New review: Ottomata; "Ok, I will use rsync daemon modules on the udp2log machines for this. I need this commit merged first:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/11898 [14:39:16] RECOVERY - Lucene disk space on search35 is OK: DISK OK [14:40:01] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [14:40:37] RECOVERY - Varnish HTCP daemon on cp1031 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:40:46] RECOVERY - Lucene disk space on search1007 is OK: DISK OK [14:40:55] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [14:41:22] RECOVERY - Full LVS Snapshot on db53 is OK: OK no full LVM snapshot volumes [14:41:22] RECOVERY - MySQL Idle Transactions on db53 is OK: OK longest blocking idle transaction sleeps for 0 seconds [14:41:22] RECOVERY - mysqld processes on db53 is OK: PROCS OK: 1 process with command name mysqld [14:41:31] RECOVERY - MySQL Slave Running on db53 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:41:31] RECOVERY - MySQL disk space on db53 is OK: DISK OK [14:41:58] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [14:41:58] RECOVERY - MySQL Recent Restart on db53 is OK: OK 4241103 seconds since restart [14:42:07] RECOVERY - Lucene disk space on search1012 is OK: DISK OK [14:42:58] New review: Ottomata; "I am waiting for this to be merged. 
Currently this is blocking:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/11574 [14:42:59] hmm- just got an error when adding a reviewer in Gerrit: 'Application Error\nServer Error\nServer smtp.pmtpa.wmnet rejected body' [14:43:11] could this be related to the anti-DOS config changes? [14:43:28] RECOVERY - Lucene disk space on search1004 is OK: DISK OK [14:43:37] RECOVERY - Lucene disk space on search1003 is OK: DISK OK [14:43:37] RECOVERY - Lucene disk space on search1017 is OK: DISK OK [14:44:31] RECOVERY - Lucene disk space on search16 is OK: DISK OK [14:46:25] New patchset: Ottomata; "Removing sampled-1000 logs from oxygen." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12181 [14:47:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12181 [14:59:52] New patchset: Alex Monk; "(bug 37741) Raise account creation limit for outreach event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12183 [14:59:59] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/12183 [15:04:11] New patchset: Hashar; "enhance account throttling" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12185 [15:04:17] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/12185 [15:08:28] New review: Alex Monk; "Just a spelling error. Apart from that, looks good." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/12185 [15:10:24] gwicke: yes [15:10:35] paravoid: ^^ another one of those errors [15:16:19] ? [15:16:38] Ryan_Lane: ? [15:16:49] New review: Platonides; "You're not taking into account the wiki db." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/12185 [15:16:54] emails from gerrit are getting rejected, occasionally [15:17:03] I think its when gerrit tries to send to aliases [15:17:11] or maybe when adding a CC [15:18:11] or when they hit the secondary MX [15:18:18] while mchenry is overloaded [15:18:23] ah [15:18:34] yes, sodium does address verification [15:18:56] sender address verification that is [15:19:02] * Ryan_Lane nods [15:19:04] mails originate from gerrit@wikimedia.org [15:19:09] but that address does not exist [15:19:13] that's wrong [15:19:13] so, we need to add the address? [15:19:16] I'll add an alias to root [15:19:20] * Ryan_Lane nods [15:20:39] and we should also fix the discrepancy between the MX [15:20:47] either we should do sender verification or not [15:21:14] but I'm not sure if I should dare touch that huge exim4 conf :) [15:21:45] heh [15:21:49] and that's why I never make exim changes [15:22:23] New review: Alex Monk; "Platonides, line 14: $wgAccountCreationThrottle = $throttle['value'];" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/12185 [15:25:42] New review: Platonides; "No, not the overrider, the one which is replaced." 
[operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/12185 [15:27:22] -> MAIL FROM: [15:27:22] <- 250 OK [15:27:22] -> RCPT TO: [15:27:22] <- 250 Accepted [15:27:22] -> QUIT [15:27:24] done [15:27:51] took me a while to find that aliases are local to mcherny [15:28:43] hey Ryan_Lane [15:28:45] that's almost the only thing I know about the mail system :-D [15:28:53] drdee: hello [15:28:59] paravoid: sweet [15:29:04] hm, they both have sender verification, I wonder why this was a problem now [15:29:08] paravoid: aliases are either local, or in gmail [15:29:10] could you please create a gerrit mysql account with just the SELECT privilege? [15:29:15] drdee: ah, yeah [15:29:20] I was going to do that this week [15:30:57] sweet [15:31:19] and this week can also be maybe today? :D [15:33:50] paravoid, Ryan_Lane: does bugzilla and RT have a email address with an inbox as well? [15:34:00] with an inbox? [15:34:10] no, they don't have inboxes [15:34:17] only users have inboxes [15:34:29] maybe using wrong terminology [15:34:32] well, I guess I lie [15:34:36] RT has an smtp server [15:34:40] can bugzilla receive email? [15:34:48] no [15:34:50] rt can [15:34:58] cool, on which alias? [15:35:11] bugzilla can, it's just not setup for us ;) [15:35:13] ops-requests@rt.wikimedia.org [15:35:34] Reedy: would it be useful to setup bugzilla to receive email? [15:35:43] thanks Ryan_Lane [15:35:44] Possibly yeah [15:35:53] shall I create an RT ticket? [15:35:57] https://bugzilla.wikimedia.org/show_bug.cgi?id=22629 [15:36:11] nice [15:36:26] i'll create an RT ticket as well (assuming that there is none yet) [15:36:57] I don't think there is [15:37:27] https://rt.wikimedia.org/Ticket/Display.html?id=3159 [15:37:29] now there is [15:37:31] :D [15:56:24] hey apergos: are you around? [15:57:23] who is responsible for taking care of the backup tapes? [16:05:09] New patchset: Ottomata; "Adding Accept-Language and X-Carrier headers to web access log sources (squid,nginx,varnish)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12188 [16:05:41] Change abandoned: Ottomata; "Abandoning this in favor of:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6526 [16:05:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12188 [16:11:23] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [16:26:01] drdee_: not (dinner then off),and I am not really sure who the amanda person is [16:38:50] apergos: ok thx [16:39:23] notpeter: https://gerrit.wikimedia.org/r/12193 [16:40:38] preilly: now? or in 20? [16:40:45] not in 20 [16:40:52] I mean in 20 [16:40:57] hah, ok [16:46:00] New patchset: Pyoungmeister; "making all eqiad apaches to instlal precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12197 [16:46:33] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12197 [16:50:23] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12197 [16:50:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12197 [16:58:49] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12193 [16:58:58] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12193 [16:59:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12193 [16:59:25] preilly: merged and I'm going to start forcing puppet runs [17:01:05] preilly: there's an error in there [17:01:21] notpeter: what is it? [17:01:47] preilly: missing semi [17:01:52] on one line [17:02:15] orange camaroon [17:04:09] New patchset: Pyoungmeister; "adding missing semi" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12198 [17:04:41] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12198 [17:04:41] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12198 [17:06:41] notpeter: I've got no idea how that happened [17:06:52] preilly: eh, typos happen [17:07:28] fixed version is live [17:08:51] notpeter: okay cool [17:09:55] semicolons are my nemeses [17:13:14] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [17:13:25] is this you, mark? http://www.mediawiki.org/w/index.php?oldid=551981&rcid=625570 [17:14:57] Jasper_Deng: no [17:15:10] that's marktraceur [17:15:18] he's not in this channel [17:15:27] was just wondering [17:23:05] New review: Hashar; "Maybe I should make the helper function efRaiseAccountCreationThrottle() to just accept an array of ..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/12185 [17:27:35] New patchset: preilly; "update dolphin browser code to latest Wikipedia code on GitHub" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12199 [17:27:42] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/12199 [17:27:59] New review: preilly; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12199 [17:28:01] Change merged: preilly; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12199 [17:28:51] Ryan_Lane: how do I sync this https://gerrit.wikimedia.org/r/#/c/12199/ [17:29:18] what do you mean sync? [17:29:34] you mean deploy? [17:29:39] ah [17:29:45] Ryan_Lane: deploy to cluster [17:30:06] Ryan_Lane: it's in fenari:/apache/common/docroot/bits I think [17:30:17] and I get error: cannot open .git/FETCH_HEAD: Permission denied [17:30:21] sync-docroot [17:30:30] may need to be root to do it [17:30:42] hey guys, this one is easy [17:30:49] can someone approve/merge this real quick? [17:30:49] https://gerrit.wikimedia.org/r/#/c/12181/ [17:30:54] Ryan_Lane: do you have any spare cycles to sync it for me [17:31:01] Ryan_Lane: well I mean update it and sync it [17:31:06] preilly: it works for me [17:31:08] as laner [17:31:28] no. 
I can't do this [17:31:34] I'm leaving in 30 mins [17:31:36] actually [17:31:41] I'm leaving basically right now [17:32:00] Ryan_Lane: okay [17:32:03] preilly: /home/w/common/docroot/bits [17:32:05] git pull [17:32:07] sync-docroot [17:33:02] Ryan_Lane: okay I guess I was in the wrong directory is all [17:33:06] * Ryan_Lane nods [17:33:40] Ryan_Lane: so does sync-docroot actually scap it? [17:33:44] yes [17:33:55] <^demon> Remind me to never give preilly a job next door to a nuclear launch facility, in case he goes to the wrong desk at work ;-) [17:34:10] ^demon: maybe we shouldn't have the same crap in multiple places [17:34:17] where one place is the wrong one [17:34:19] ^demon: seriously [17:34:31] ^demon: that wasn't a necessary comment at all [17:34:31] <^demon> Ryan_Lane: That too :) [17:35:00] <^demon> preilly: I didn't mean to offend, I'm sorry. [17:41:43] New review: Jeremyb; "The RT is filed and I think had some new activity today even." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/9874 [17:42:40] * jeremyb spies a ^demon... [17:43:03] could use some help with gerrit today. worked on it some yesterday. [17:43:12] also, could just use a regex wizard in general [17:43:37] (/me sometimes fills that role but i'm kinda stumped here) [17:44:04] anyway, i should be back around 3ish i guess [18:04:44] @replag [18:04:46] Krinkle: [s4] db33: 11s; [s7] db26: 1s [18:05:28] If I read my logs correctly, pywikibots have been disallowed from editing for almost a day on commons due to db replag [18:06:01] If I read my logs correctly, my pywikibots (and from reports also other people's) have been disallowed from editing for almost a day now, on commons (due to db replag). [18:06:11] any ideas where/what is going wrong? [18:06:33] maybe that slave should be depooled? Or is that 11s replay on db33 only since recently (not over 24 hours) [18:06:45] @replag [18:06:47] Krinkle: [s4] db33: 21s [18:06:52] oh, increasing [18:07:43] It's seconds behind the master [18:07:59] so even a few minutes isn't too bad [18:09:23] New patchset: Sara; "Comment ganglia manifest to improve maintainability. Remove obsolate ganglia gmetad.conf file." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12205 [18:09:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12205 [18:12:11] Reedy: So bots should just continue editing with that replag? [18:12:19] indeed [18:12:27] it's not worth worrying about [18:12:37] Maybe a treshold in pywikibot is tripping up [18:12:38] it is forcing the bot to pause up to 60 seconds between edit attempts. [18:13:22] maxlag is somewhat of a useless metric at times [18:13:49] 1 Fatal error: Invalid host name (wikipedia.geo.blitzed.org), can't determine language.#012 in /usr/local/apache/common-local/multiversion/MWMultiVersion.php on line 354 [18:13:59] :/ [18:14:00] Reedy: That would be me [18:14:03] (I think) [18:14:09] Doing what? [18:14:12] I was checking out hostnames in apache files [18:14:19] Saw the blitzed one [18:14:27] which is still actively set to wmf ip [18:14:40] and apparently the software doesn't know how to deal with that [18:14:45] ... 
that hostname [18:14:53] New patchset: preilly; "remove session cookie in the absence of X-V-O support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12206 [18:14:54] the error used to be worse [18:14:55] i [18:14:56] t [18:14:57] h [18:14:57] as [18:15:00] (stupid irc client) [18:15:04] it has improved :) [18:15:23] notpeter: can you merge and push this https://gerrit.wikimedia.org/r/12206 [18:15:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12206 [18:15:41] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/12206 [18:16:59] New patchset: Sara; "Comment ganglia manifest to improve maintainability. Remove obsolate ganglia gmetad.conf file." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12205 [18:17:11] notpeter: ping [18:17:32] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12206 [18:17:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12206 [18:17:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12205 [18:17:53] preilly: forcing puppet runs now [18:18:03] notpeter: okay great thanks man [18:18:07] notpeter: you're the best [18:18:09] no prob! [18:18:10] New review: Sara; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12205 [18:18:13] Change merged: Sara; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12205 [18:18:23] preilly: I'm hella good at +2ing your changes :) [18:18:31] notpeter: can you let me know when it completes [18:18:37] yep [18:18:43] notpeter: ha ha ha I +2 your +2'ing [18:20:02] "I like the fact that you like the fact that I did X" [18:20:21] New review: Trevor Parscal; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/5289 [18:20:27] ok. +2 to your changes being live [18:21:23] preilly: ^ [18:21:30] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 182 seconds [18:21:43] notpeter: thanks [18:22:15] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 190 seconds [18:23:01] !log reloading mr1-pmtpa for sw upgrade (fixing a cpu bug) [18:23:06] Logged the message, Mistress of the network gear. 
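The db33 slave-delay alerts just above correspond to the replication-lag figures the @replag bot keeps quoting; MediaWiki also exposes the lag it sees per database server through its API. A minimal sketch of such a check, assuming the standard siteinfo dbrepllag property of api.php:

    import json
    import urllib.request

    API = "https://commons.wikimedia.org/w/api.php"
    PARAMS = "action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=1&format=json"

    with urllib.request.urlopen(API + "?" + PARAMS) as resp:
        data = json.loads(resp.read().decode("utf-8"))

    # Each entry reports one database server and its current lag in seconds.
    for server in data["query"]["dbrepllag"]:
        print(server["host"], server["lag"], "seconds behind")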
[18:25:06] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [18:25:06] PROBLEM - Host ps1-b1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.6) [18:25:06] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [18:25:06] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [18:25:06] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.18) [18:25:07] PROBLEM - Host ps1-c2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.12) [18:25:07] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [18:25:08] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [18:25:08] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [18:25:09] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [18:25:09] PROBLEM - Host ps1-d3-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:25:10] PROBLEM - Host ps1-b5-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:25:10] PROBLEM - Host ps1-d1-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:25:11] PROBLEM - Host ps1-d2-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:25:11] PROBLEM - Host ps1-b4-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:25:31] oops [18:25:37] from the mr1 reboot, ignore [18:25:39] heh heh [18:25:47] promise all the power supplies didn't go down [18:25:51] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [18:25:55] New review: Krinkle; "(no comment)" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/12185 [18:27:48] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [18:28:33] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.83 ms [18:28:33] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.11 ms [18:28:33] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms [18:28:33] RECOVERY - Host ps1-b3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.10 ms [18:28:33] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.42 ms [18:28:34] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.94 ms [18:28:34] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.37 ms [18:28:42] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.70 ms [18:28:42] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.78 ms [18:28:42] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.82 ms [18:28:42] RECOVERY - Host ps1-b1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.86 ms [18:28:42] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.20 ms [18:28:43] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.41 ms [18:28:51] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.14 ms [18:29:00] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.45 ms [18:29:59] Reedy: db33 still okay? 190 seconds now. nagios says it is too much :P [18:30:02] @replag [18:30:04] Krinkle: [s4] db33: 226s; [s7] db26: 1s [18:30:39] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.24 ms [18:31:11] Ganglia derived db data recent as of: Thu Jan 1 0:00:00 GMT 1970 [18:31:12] wheee [18:31:22] New patchset: Aaron Schulz; "Fix up w/s on sync script." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/12209 [18:31:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12209 [18:32:09] LeslieCarr: https://gerrit.wikimedia.org/r/12209 [18:32:11] It's not increasing 1s/s yet [18:32:24] AaronSchulz: looking [18:32:49] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12209 [18:32:54] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12209 [18:32:54] yay whitespace cleanups [18:32:57] easiest rewviews [18:33:01] LeslieCarr: I should make a script that automatically pings you when I do patches there [18:33:15] only for whitespace type cleanups :p [18:33:21] Reedy: I dug up pywikibot source to check why it is refusing to edit, turns out it isn't requesting dbrreplag at all. the APi is erroing out with { error: { code: "maxlag", .. } }. So its not pywiki to blame (the edit in question was not finished done either, this is in a response to an edit) - [18:33:21] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [18:33:31] Reedy: the bots haven't edited in over 24 hours, not even once in 60 seconds. [18:33:44] all attempts result in a maxlag error and then waiting another minute [18:34:02] replication from the master side looks ok [18:34:14] ssmollett: merging more of your ganglia changes [18:37:05] Krinkle: might be worth poking Asher when he's around.. There's loads of "Waiting for the slave SQL thread to advance position" processes on db33 [18:37:40] Requesting API query from commons:commons [18:37:41] error occured, { code: "maxlag", [18:37:42] info: "Waiting for 10.0.6.43: 204 seconds lagged" } [18:37:43] HTTP Status: 200 [18:37:43] HTTP Response: OK [18:37:47] @info 10.0.6.43 [18:37:47] Krinkle: [10.0.6.43: s4] db33 [18:37:51] just checking :D [18:38:12] 1 | system user | | NULL | Connect | 700776 | Waiting for master to send event [18:38:27] Reedy: So basically all API write queries for commons are down now, and have been for almost 24 hours [18:38:41] "down"? [18:38:46] Tim only stopped purge [18:38:52] refused, error.code = maxlag [18:39:51] it can't do anything [18:40:06] weird though.. I can do it from the js console [18:40:33] don't pass a maxlag parameter [18:40:36] don't obey maxlag [18:40:37] fixed [18:42:01] Hm.. Indeed, pywiki is POSTing maxlag=5 [18:42:04] wtf [18:42:38] but when should a script pass maxlag ? [18:42:40] Is it any good? [18:42:41] I imagine they added it on request [18:42:59] "Recommended usage for Wikimedia wikis is as follows: Use maxlag=5 (5 seconds)." [18:43:03] https://www.mediawiki.org/wiki/Manual:Maxlag_parameter [18:43:04] :D [18:43:15] Maxlag is strange, just because one slave is lagged, doesn't mean you should stop [18:43:16] ie this [18:43:18] it's one upset [18:43:22] it's one upset slave [18:43:24] not an upset cluster [18:43:46] I think we removed it on AWB [18:45:05] In soviet russia, maxlag doesn't wait for you [18:45:09] Reedy: but what about that slave, isn't that also causing "lag warnings" to be shown on RecentChanges and Watchlist [18:45:22] I don't see those on commons [18:46:10] I don't think those work on maxlag [18:46:27] *they don't work on maxlag [18:47:10] k, I'll see if I can get it removed from pywiki [18:47:56] back here though, this slave does seem to be problematic. Is it still in the pool? 
as in, if I do a read query (either through api or through GUI, like page history), is that one in the mix? [18:51:34] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 189 seconds [18:51:44] yeah, it's still pooled [18:51:58] you might hit it, you might not [18:52:10] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 196 seconds [18:52:18] it's not terribly lagged (ie so it's urgent to deal with it) [18:53:18] * robla starts hunting around for a maxlag graph [18:53:41] hey guys [18:54:02] drdee_ has a script that sends out an email with some project management weekly status stuff [18:54:04] to us in analytics [18:54:19] do we have an smtp server that a labs instance can use to send email? [18:54:28] sodium? mchenry? [18:54:37] @replag [18:54:39] Krinkle: [s4] db33: 321s [18:55:57] Reedy: okay,I found out that as of the 2011 version I can set maxlag=None in user-config.py. The default is still maxlag=5 as of now [18:56:07] I'll do that then, so that at least MY bots are back up :P [18:56:24] we should get it resolved what the best practice is, though [18:56:51] indeed [18:58:39] New patchset: Bhartshorne; "partman recipe for swift storage nodes with ssds is broken; removing ms-be5+ until I get a fixed recipe." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12211 [18:59:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12211 [18:59:28] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12211 [18:59:31] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12211 [18:59:54] can someone review this for me? [18:59:55] https://gerrit.wikimedia.org/r/#/c/12181/ [18:59:57] it is a quick and easy one [19:00:09] LeslieCarr maybe? [19:00:48] so i actually don't know what the "file" command does in this instance, can you explain ? [19:01:01] i'm just removing a log file [19:01:05] it is in place on all udp2log instances now [19:01:06] I thought there was some reason that all udp2log servers kept the 1000-sampled log. [19:01:11] robla says no [19:01:17] huh. [19:01:27] we only need it on one, but I think whenever people set it up they just kept it on all [19:01:37] drdee_ said to keep it on emery and locke, so we'd ahve one backup, i guess [19:01:40] but we don't need it on all 3 [19:02:12] agree with ottomata [19:02:34] LeslieCarr, the file command just tells udp2log to write logs to a file [19:02:43] the 1000 means 1/1000 of the lines go to the file [19:03:03] I am just removing the sampled-1000 log from oxygen, as we don't need it there [19:03:17] ok [19:03:52] New review: Lcarr; "this removes the log file, which is unneeded on oxygen" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12181 [19:03:54] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12181 [19:04:17] maplebed: i merge your config ? [19:04:23] yaeh. 
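Picking up the maxlag thread from above: maxlag asks api.php to refuse work whenever the most-lagged slave exceeds the given threshold, returning error code "maxlag" (with HTTP 200, as in Krinkle's paste) instead of performing the request, which is why every bot edit was bouncing off db33's lag. Below is a rough sketch of a client that honors it, assuming the standard api.php endpoint and an arbitrary 60-second retry; it is not pywikibot's actual code (there the knob is simply the maxlag setting in user-config.py, default 5, as noted above).

    # Illustrative maxlag-aware API client -- a sketch, not pywikibot.
    # Endpoint URL and retry interval are assumptions for the example.
    import json
    import time
    import urllib.parse
    import urllib.request

    API = "https://commons.wikimedia.org/w/api.php"  # assumed endpoint

    def api_get(params, maxlag=5, retries=5, wait=60):
        """Call the API with maxlag; on a 'maxlag' error, wait and retry."""
        params = dict(params, format="json", maxlag=maxlag)
        for _ in range(retries):
            url = API + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as resp:
                data = json.loads(resp.read().decode("utf-8"))
            err = data.get("error", {})
            if err.get("code") == "maxlag":
                # e.g. info: "Waiting for 10.0.6.43: 204 seconds lagged"
                print("lagged:", err.get("info"), "- sleeping", wait, "s")
                time.sleep(wait)
                continue
            return data
        raise RuntimeError("gave up: slaves stayed lagged")

    # Example: the per-database lag report itself (roughly what a helper
    # like the @replag bot prints), via siteinfo's dbrepllag property.
    # print(api_get({"action": "query", "meta": "siteinfo",
    #                "siprop": "dbrepllag", "sishowalldb": 1}))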
[19:04:31] merging now [19:05:47] dankeeee [19:05:48] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/12213 [19:06:06] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12213 [19:06:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12213 [20:09:41] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 5 seconds [20:09:58] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 4 seconds [20:10:13] Krinkle: ^ there you go :p [20:10:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:10:22] wee [20:10:26] @replag [20:10:28] Krinkle: [s4] db33: 3s [20:12:58] testing livehack on rt [20:21:13] PROBLEM - Apache HTTP on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:40] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:40] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:40] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:40] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.18:11000 (timeout) [20:21:49] PROBLEM - Apache HTTP on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:49] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:49] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:49] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:58] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:58] PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:58] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:58] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:58] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:59] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:59] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:00] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:00] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:01] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:01] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:02] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:02] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:03] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:07] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:07] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:07] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:07] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:07] PROBLEM - Apache HTTP 
on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:08] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:08] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:16] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:17] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:17] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:25] PROBLEM - SSH on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:34] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:52] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:10] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:10] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:10] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:10] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:19] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39112 bytes in 1.074 seconds [20:23:19] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:19] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:19] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:19] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:20] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:20] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:21] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:21] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:22] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:22] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:23] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:23] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:28] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:28] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:28] PROBLEM - Apache HTTP on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:37] PROBLEM - Apache HTTP on mw56 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:37] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:37] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39122 bytes in 4.564 seconds [20:23:46] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:46] PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:22] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:31] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [20:24:31] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:49] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:34] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:37] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 60758 bytes in 1.602 seconds [20:26:46] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:27:10] wtf? [20:27:56] someone disconnected a rack? [20:28:07] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 48282 bytes in 6.469 seconds [20:28:16] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:28:52] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 38929 bytes in 7.259 seconds [20:29:19] PROBLEM - Apache HTTP on srv197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:37] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:55] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 38920 bytes in 0.171 seconds [20:30:49] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:07] eep [20:32:10] RECOVERY - Apache HTTP on srv197 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.009 second response time [20:32:15] ok, let me see if maybe switch flipped out ? [20:32:28] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:37] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39119 bytes in 3.317 seconds [20:32:42] LeslieCarr: yeah, it looks like traffic dropped across the baord [20:33:21] esp for lvs 3 and 4 [20:33:47] althoguh might be upstream fo that [20:33:49] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39115 bytes in 2.600 seconds [20:33:58] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 48482 bytes in 2.670 seconds [20:33:58] 3 and 4 ? 
[20:34:25] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:34:34] that's what graphs look like [20:35:10] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 48472 bytes in 0.678 seconds [20:35:26] looking at all the net gear now [20:35:28] and rack tables [20:35:34] this has all the signs of network issue [20:35:41] or someone unplugging a whole rack ;) [20:35:50] chris has said that that is not the case [20:35:56] and I'm inclined to believe him [20:37:43] PROBLEM - Apache HTTP on srv241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:56] i would believe chris [20:38:02] totally [20:38:28] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:44] especially since i am not seeing tons of ethernet interfaces having flapped, as i would with physical powerdown/powerup [20:39:13] RECOVERY - Apache HTTP on srv241 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.152 second response time [20:39:49] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:49] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39112 bytes in 5.053 seconds [20:40:16] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:43] PROBLEM - Apache HTTP on srv242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:41:37] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39122 bytes in 1.459 seconds [20:41:46] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 48289 bytes in 0.159 seconds [20:42:04] RECOVERY - Apache HTTP on srv242 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.380 second response time [20:42:29] why are page load times so long for diffs [20:42:32] and page history [20:42:49] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.877 second response time [20:43:33] srv268 is dying [20:43:45] from the load [20:43:51] spiking at 60 or so [20:44:11] i'm not seeing anything that looks like actual physical issues [20:44:19] PROBLEM - NTP on srv268 is CRITICAL: NTP CRITICAL: No response from NTP server [20:44:28] PROBLEM - MySQL Idle Transactions on db38 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:44:36] i wonder, could anything else cause some sort of cascading failure -- overloading of one or more srv's or mw's ? 
[20:44:37] srv268 dead I guess [20:44:41] (since some appeared overloaded) [20:45:52] so, the wierd part is that it's mostly the mw's that aren't responding on 80 [20:45:59] only a couple of the srvs are dead [20:46:07] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:46:13] which makes me thing something physically localized, be it switch, power, etc [20:46:34] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:47:28] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 48482 bytes in 0.907 seconds [20:47:28] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:48:40] mobile is down - at least for pages needing to hit the apaches [20:48:59] binasher: cached pages seem fine [20:49:13] binasher: anything that involves an apache is having issues [20:49:25] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 38966 bytes in 2.069 seconds [20:49:41] they're on different switches which seem fine … how many active connections is too many for one of the apaches ? [20:49:43] lvs4 is the active appserver lvs server [20:49:47] i guess this has been going on for around 30min? [20:49:51] pybal is complaining of too many downs [20:49:56] binasher: yes [20:49:59] first alert :21 [20:50:11] ah, i never got paged [20:50:19] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39159 bytes in 4.755 seconds [20:50:20] me neither [20:50:24] i didn't get paged either [20:50:24] not being on irc was so blissful [20:50:37] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 11 seconds [20:50:46] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:50:56] hrm, this is happening on esams as well ? [20:51:31] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:51:31] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 3 seconds [20:52:07] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39163 bytes in 1.415 seconds [20:52:13] I'm in mw32 right now [20:52:23] it's awfully slow, but its load is 2 [20:52:38] its http is slow that is [20:52:41] the machine is responsive [20:52:48] even for GET / HTTP/1.0 [20:52:52] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 38973 bytes in 0.965 seconds [20:52:56] multiple seconds slow [20:53:12] i'm on mw21 , a telnet localhost 80, get / is also incredibly slow [20:53:19] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:22] (mw21 is different floor, different rack) [20:53:37] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:44] was anything deployed recently? 
[20:53:51] like, application-wise [20:54:29] https://gerrit.wikimedia.org/r/#/c/12209/ [20:54:41] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 48473 bytes in 0.746 seconds [20:54:46] https://gerrit.wikimedia.org/r/#/c/12044/ [20:55:16] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:55:25] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:55:25] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:55:26] enwiki job queue is lockalicious.. killing some stuff [20:56:19] RECOVERY - MySQL Idle Transactions on db38 is OK: OK longest blocking idle transaction sleeps for 16 seconds [20:56:21] thats all cleared [20:56:28] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60226 bytes in 4.727 seconds [20:56:34] wee [20:56:46] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 48289 bytes in 0.150 seconds [20:56:49] would a memcache server being borked cause any of these errors ? [20:56:54] no [20:56:58] one day! [20:56:59] no,asher's right [20:57:06] jobqueue. is a common thread [20:57:13] ah :) [20:57:20] binasher: what was blocked? [20:57:26] srv268 seems to have its own issues [20:57:30] yeah [20:57:58] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:58:07] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 48282 bytes in 0.123 seconds [20:58:42] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 38966 bytes in 0.135 seconds [20:58:55] srv268 should be pulled from memc - leslie could be right too, considering the current mw mc client.. slot 0 could contain the keys with enwiki replag info [20:59:25] binasher: could mw32 be extremely slow because it can't contact srv268's memcached? [20:59:26] i'll switch srv268 out [20:59:27] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:59:29] that being unavailable and a couple other special keys could make mw get abusive [20:59:38] ah [20:59:38] using my magic powers of "i learned how to memcache server swap" [20:59:54] go go gadget spof! [21:00:48] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 60758 bytes in 1.310 seconds [21:01:12] LeslieCarr: so? 
[21:01:15] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39159 bytes in 1.976 seconds [21:02:02] I'm logging into the management of srv268 [21:02:09] LeslieCarr: deploy deploy deploy [21:02:19] sorry, committing first now [21:02:26] and kill its power [21:02:42] it's fast again [21:02:54] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [21:02:57] no it's not [21:03:39] !log powercycling srv268; unreachable due to load spike [21:03:45] Logged the message, Master [21:04:06] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:04:29] deploying now [21:04:40] phew [21:04:42] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.192 second response time [21:04:51] RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.611 second response time [21:04:54] !log replaced srv268 with srv245 in memcached list [21:04:59] Logged the message, Mistress of the network gear. [21:05:00] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [21:05:00] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.147 second response time [21:05:00] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.200 second response time [21:05:00] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.326 second response time [21:05:00] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.421 second response time [21:05:09] yep, things recovered [21:05:17] mw32 is fast again [21:05:18] so, [21:05:20] wtf??? [21:05:43] 1) a single server being slow killing everything? [21:05:49] so binasher i know that fixed it, can you explain to us why it fixed it ? [21:06:03] because of memcached hashing I presume [21:06:06] so we're not flailing around in the dark as much :) [21:06:30] 2) getting alerts just for foundation-lb & wikiversity-lb while enwiki was obviously slow? [21:06:36] 3) not getting paged? [21:06:37] i think the site started to recover when i killed some locking stuff in the enwilki master, not 30 seconds ago [21:06:44] this wasn't confined to enwiki [21:06:54] Jasper_Deng: that doesn't matter at all [21:06:56] binasher: mw32 was very very very slow on GET / [21:07:00] binasher: until 30s ago [21:07:10] as in 5-6 seconds slow [21:07:46] ok, i believe you [21:07:59] we might havd had /two/ issues though [21:08:06] i was going off the vip alerts clearing and my being able to use the website again [21:08:50] R=smart_route defer (-1): lookup of host "lily.esams.wikipedia.org" failed in smart_route router [21:08:59] yay for broken mail in nagios [21:09:13] ahha [21:09:21] i don't think it paged because most of these have critical => false [21:09:25] paravoid: were you providing a host header for you GET / to mw32? [21:09:31] LeslieCarr: don't the lb vips page? [21:09:37] binasher: no [21:09:44] binasher: initially I did, but was failing for localhost too [21:09:49] a lot of them don't .. looking at line 845 and above on lvs.pp [21:09:50] so I was just trying localhost [21:09:52] why was it trying to use "lily.esams.wikipedia.org" ? 
[21:10:02] # exipick -i|wc -l [21:10:02] 92 [21:10:08] mails in spence's queue [21:11:16] route_list = * mchenry.wikimedia.org:lily.esams.wikipedia.org [21:11:17] yay [21:12:06] it should be lily.esams.wikimedia.org [21:12:10] dunno why it didn't use mchenry though, looking [21:12:32] doesn't reply to 25 from spence [21:12:43] but otherwise correct Platonides, thanks [21:16:28] do we have a backup smtp relay? [21:17:00] (lily is decomissioned) [21:17:27] binasher: also, do you know why srv268's slowness caused this? [21:19:52] yeah, mc slot 0 is what the key "enwiki:lag_times:db38" maps to, which is one of the top requested keys on srv245 now.. where db38 = the enwiki master.. that key is requested on every enwiki article view and since srv268 wasn't all the way down and since our cray cray memc client which is about to be replaced doesn't handle timeouts well, tons of apache children were probably waiting up to a tcp timeout + then waiting on the enwiki master bei [21:19:52] slow [21:19:59] the funny thing is [21:20:17] it appears to be a mediawiki bug that there's a lag_time memcache key for the master [21:20:53] not a new bug either [21:21:00] what's the lag_time memcache key? [21:21:43] paravoid: there's one for every slave, i.e. [21:21:45] VALUE frwiki:lag_times:db31 1 67 [21:21:45] a:5:{i:0;i:0;i:1;i:0;i:2;i:1;i:3;i:0;s:9:"timestamp";i:1340227282;} [21:21:57] mediawiki has a lot of logic around not sending queries to lagged slaved [21:22:02] slaves [21:22:11] oh [21:22:20] but there's no need to do so for the master [21:22:22] slave status is shared via memc for up to a few seconds [21:22:27] whuich is why it's a bug [21:22:29] okay [21:22:30] yeah [21:22:31] but [21:23:42] well, i guess i can't bitch too much more about our memc client. this may have been its final outage before retirement! [21:24:07] heh [21:24:09] btw, since you're here [21:24:15] I have another question about igbinary [21:24:34] opjectcache::MemcacheClient - yes, that was a challenge [21:24:47] as far as I understand, it's a module to do faster serialization of data [21:25:03] to put php data structures into memcache [21:25:18] but does it keep the same format? [21:25:23] it replaces php serialize/unserialize [21:25:35] no, totally incompatible [21:25:40] so, when we migrate to that [21:25:47] we're going to have to invalidate everything [21:25:51] its binary, as the name implies :) [21:25:58] in our memcaches [21:26:30] and essentially run with a cold cache [21:26:33] or am I missing something? [21:26:41] yep - we also have to invalidate everything when we switch to the new client since the hashing algorithm is going to be ketama (consistent!) and when we switch to the new memcached servers and shut down every currently running instance of memcached, returning that ram to php [21:26:51] so [21:27:06] this is all going to happen at once, where at once is over time but as part of a single elongated deploy [21:27:15] aha [21:27:25] are you sure robla is aware of that? [21:27:46] he made it sound like "installe php5-memcached on the servers and we'll just enable turn a flag to 1 on MW" [21:28:10] j/k, i'm going to press a button to do it all at once, get on plane, and i'll check in a week later when the dust settles [21:28:58] :-) [21:29:07] so, back to topic [21:29:22] paravoid: i have no idea what robla, i'm not sure if any of the managers know at a deep technical level. 
but tim knows, who wrote the client, aaron knows, who might do the deploy, i know, you know, terry knows who has been supporting the project in management meetings, etc. [21:29:56] do we have a second MX to have for spence's pages instead of lily? [21:30:09] paravoid: you'll have to get used to not taking managers literally any time they talk about how anything is going to work at a technical level, especially in ops. [21:30:13] and who's going to prepare the incident report? I can, but I'll just copy what binasher said :) [21:31:02] paravoid: wait, todays incident report, or the one in a few weeks when we switch to a bunch of new shit? ;) [21:33:38] when the hashing algorithm doesn't change, it could be possible to make the format dependant of the target server [21:33:56] would lead to a bit confusing code, but easy to do [21:35:20] paravoid: for todays, i was only online or aware of what was going on for the last 1/4th of the action.. and you're really good at writing non-snarky incident responses.. so if you could, that'd be awesome [21:35:36] sure, if you're cool with me quoting you :) [21:36:39] ok, ok [21:38:34] Platonides: yup. that we're replacing the servers memcached runs on is kind of fortunate in terms of not making the serialization change any costlier or more complex [21:43:43] the big problem is the hashing change [21:43:55] otherwise, it could be seamlessly done one at a time [21:45:02] (I don't see why ibinary couldn't additionally accept serialize format as a fallback, though) [21:45:43] the hashing change doesn't have to be a problem either [21:45:58] if we can figure out how to do a very staggered deploy [21:47:40] I can't figure out why spence's exim failed [21:47:55] maybe a merge from test->production that Ryan did? [21:48:04] well, how to do a staggered deploy? :) [21:48:16] I'm thinking, it would be possible to make a double-objectcache [21:48:25] our git history is a fucking mess [21:48:26] which stores in two child memcached servers [21:48:31] !log freezing many bounce messages on mchenry (all older than 2400 minutes) [21:48:35] merge every second commit [21:48:36] Logged the message, Mistress of the network gear. [21:48:48] double win given that the new memcached would be on different servers [21:49:05] then after X time you just flip the kind used for reads [21:49:06] !log see RT3170 for more details on above change and mchenry pain [21:49:11] Logged the message, Mistress of the network gear. [21:50:04] LeslieCarr: you know exipick? [21:50:10] paravoid: do not [21:50:17] should i research this for awesome ? [21:50:17] have a look [21:50:20] yes. [21:50:26] man exipick [21:50:32] it's installed with exim4 so it should be everywhere [21:52:35] Platonides: if we enable the new client + servers over, say a 24 hour period, where at first only a few servers are using the new stuff, those servers will be slow for a bit, but won't overload the mysql parsercache or result in much of a user impact, while starting to populate the new cache [21:52:41] binasher: btw, this affected all services? or just text? [21:53:14] binasher, they could be serving old stuff [21:53:26] because invalidations from the other servers won't touch their memcached [21:53:28] and viceversa [21:53:57] Platonides: we don't really explicitly invalidate anything in memcached except for sessions [21:54:05] (for now) [21:54:11] o rly? [21:54:13] but we perform replaces [21:54:25] binasher: and any idea on why we got alerts for just foundation & wikiversity? 
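Platonides' "double-objectcache" is essentially a write-through wrapper over both pools with a switch for which one serves reads: writes and deletes go to old and new alike, and once the new pool has warmed up you flip reads over. A minimal sketch of the idea, assuming generic client objects exposing get/set/delete (illustrative names, not MediaWiki's actual object-cache classes):

    # Minimal dual-write cache wrapper -- a sketch of the migration idea only.
    # `old` and `new` can be any clients with get/set/delete (for example two
    # python-memcached instances pointed at the old and new pools).
    class DualWriteCache:
        def __init__(self, old, new, read_from_new=False):
            self.old = old
            self.new = new
            self.read_from_new = read_from_new  # flip once the new pool is warm

        def set(self, key, value, ttl=0):
            # Every write lands in both pools during the overlap window.
            self.old.set(key, value, ttl)
            self.new.set(key, value, ttl)

        def delete(self, key):
            self.old.delete(key)
            self.new.delete(key)

        def get(self, key):
            primary = self.new if self.read_from_new else self.old
            return primary.get(key)

Because every set and delete hits both pools during the overlap, flipping read_from_new later cannot serve anything staler than the old pool would have, which sidesteps the stale-read scenario raised just below.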
[21:54:29] (there are probably deletes, though) [21:56:01] there are deletes, but only user related [21:56:03] "only" [21:56:42] centralauth:login-token and wiki:user:id are the only classes of keys we delete [21:57:01] binasher suppose you add the different memcached on 2% servers [21:57:29] then someone edits [[Obama]] on an "old" server to say "Obama is gay" [21:57:41] someone visits on a "new" server, it gets loaded into the new memcached [21:57:48] it is reverted on "old" server [21:58:05] 2% of users, hitting the new memcached get the "Obama is Gay" [21:58:19] moreover, you need to make the purge to happen on a new server [21:58:30] or it wouldn't work [21:59:34] (perhaps there's some extra check for staleness, but even then, the same problem would be at different -smaller- places) [21:59:58] the double-write objectcache seems better [22:00:09] as far as there's memory available for that [22:00:24] which if they'll live on new servers, there will be [22:01:21] Platonides: we don't purge revision / parser cache [22:02:08] afaik, we do a database query every single time (grr!!!!) to see if what's cached is valid [22:02:16] paravoid/binasher i have a thought as to why we only got those pages -- perhaps they are visited so rarely their main page wasn't in cache … and all the other sites we use the main page (which would be in varnish) [22:02:25] we do purge image rows in cache [22:02:35] and user rows [22:03:06] i haven't seen an image key delete yet but do see the user deletes [22:03:08] using multwrite for a brief while seems like it would avoid this stuff [22:03:18] yeah [22:03:43] if the transition will take hours, it would pretty useful then [22:03:59] we probably don't want a huge window of iffy data [22:04:47] LeslieCarr: we should also have gotten paged by watchmouse which monitors enwiki random and also has a check that should bypass squid via login + cookies… or at least used to. could someone check out watchmouse? [22:06:17] another (not necessarily better) option would be to set all ttls to short values during the switch [22:07:01] binasher: what it the switch goes bad? ;P [22:07:09] excitement? [22:07:16] heh [22:07:22] secure.wikimedia.org went down [22:07:46] now? [22:07:55] 20:45 [22:08:16] 1h 20m ago [22:08:18] which lines up [22:08:24] but it only texted a few people [22:08:54] i got the secure page, and an ok a few minutes later [22:10:24] i'm not on it … added myself [22:10:29] oh dear, what a shame ;) [22:10:38] faidon, want to get paged by watchmouse ? [22:11:07] you tell me [22:11:59] well, while nobody wants a page, i think you're knowledgeable enough to get paged and fix shit [22:12:07] yeah, I'm kidding [22:12:10] sure, add me [22:12:28] hehe [22:12:46] it's not like I pay for received SMS [22:12:47] emotions come through so badly on irc [22:12:53] *cough* *cough* [22:14:02] LeslieCarr: send paravoid ALL of the spam messages [22:14:55] ok, set up watchmouse …. and also redirected all cronspam to your sms ;) [22:16:41] LeslieCarr: did watchmouse pick up an outage? [22:17:40] binasher: only for secure.wm [22:17:45] hrm [22:17:58] i wonder if the "logged in" check is broken [22:20:19] New patchset: Faidon; "Replace lily with lists as a backup relay" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12279 [22:20:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12279 [22:27:20] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12279 [22:27:24] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12279 [23:10:50] New review: Andrew Bogott; "It occurs to me to wonder... can I approve this patch myself? And if I can, isn't that somewhat dan..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11892 [23:19:38] New review: Platonides; "It depends if you have a +2 option in code review." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/11892 [23:22:39] binasher: sorry for not mentioning that, completely forgot about it, didn't mean to misquote you [23:27:37] New patchset: Ottomata; "Adding Evan Rosen. RT 3119" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12299 [23:28:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12299 [23:29:34] paravoid: looking more, i think the srv268 pull was more important to ending the outage. the db bit reduced 503's from approximately 1000/sec to 666/sec, but it took the mc.php deploy to drop back to around 0 [23:32:15] so you're saying srv268 was possessed ? [23:32:25] yes! [23:32:31] binasher: ah, just replied [23:32:40] i am surprised/saddened that we didn't get the alert that a memcache machine is down [23:32:56] binasher: that the alerts came back after your clearing [23:32:59] btw [23:33:06] what did you to do change mc.php? [23:33:18] where is mc.php and how did you deploy it? [23:33:31] LeslieCarr: i thought i saw the memcache alert go ok after your mc.php deploy [23:33:47] /h/w/c/wmf-config/mc.php [23:33:48] so it should have alerted in irc before i logged in? [23:33:57] http://wikitech.wikimedia.org/view/Memcached [23:34:09] maybe i should mention that it needs to be git pushed too [23:34:28] oops, slapped with the fine manual [23:34:37] my bad [23:34:41] well most of the time manual doesn't always work :) [23:34:47] just this time I had fixed it a bit and it does work [23:34:48] hehe [23:34:53] binasher: and how did you count the 503s? [23:35:24] sorry for all the questions, I'm trying to get better prepared for the next outge :-) [23:36:23] questions are good! [23:36:35] binasher: what's mysql do? [23:36:35] http://gdash.wikimedia.org/dashboards/reqerror/ [23:37:00] binasher: coool! [23:37:09] Ryan_Lane: it's a sophisticated engine for introducing random sleep values into applications [23:37:14] paravoid: does it make sense now ? [23:37:14] :D [23:37:28] LeslieCarr: what does? [23:37:39] binasher: and memcache? [23:38:06] oh the page [23:38:10] i switched a few things [23:38:20] ah, sure, yes :) [23:38:21] i want ot make sure someone can look at http://wikitech.wikimedia.org/view/Memcached and fix memcache [23:39:00] yay :) [23:39:30] Ryan_Lane: memcache is magic smoke that makes social nosql webscale in the cloud [23:39:38] \o/ [23:39:59] isn't that memsql? [23:40:00] I like the answers for my questions [23:41:32] hehe i like this graph http://gdash.wikimedia.org/dashboards/reqerror/ [23:41:48] I like the graph, don't like much its output [23:41:49] i mean, in a way, we did even better since we reduced our 4xx responses ;) [23:42:04] i.e. 
the thousands 503s [23:47:17] binasher: https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blobdiff;f=templates/varnish/mobile-frontend.inc.vcl.erb;h=6fc0b6be72d754e6846a44cda97c80c2d61cc731;hp=051c33cd8d319e6dfdc53a82bc0c7ca8f0b6ad9d;hb=23482bb8bb9662d4bf68ad62071054682dbb4989;hpb=f67b3277269199083bf76b156d654a6357f7998f