[00:00:42] New patchset: Sara; "Apply ganglia gmetad changes from test branch to production branch. gmetad will now use a template. manutius will monitor only "Upload caches eqiad"." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12132 [00:01:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12132 [00:05:27] New review: Sara; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12132 [00:05:30] Change merged: Sara; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12132 [00:06:05] New review: Asher; "(no comment)" [operations/debs/squid] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12129 [00:06:07] Change merged: Asher; [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/12129 [00:08:06] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:10:23] New patchset: Lcarr; "fixing ipv6 monitors for lvs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12133 [00:10:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12133 [00:11:04] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12133 [00:11:07] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12133 [00:13:14] ssmollett: is your ganglia change ready to be merged ? [00:13:30] LeslieCarr: yup. [00:13:37] merging [00:38:33] binasher: have you read much about scalebase? [00:38:51] * AaronSchulz tries to find details [00:39:47] AaronSchulz: are they in the cloud?? [01:03:43] New review: Asher; "(no comment)" [operations/debs/squid] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12126 [01:03:45] Change merged: Asher; [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/12126 [01:41:35] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 213 seconds [01:41:35] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 255 seconds [01:47:35] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 22 seconds [01:49:32] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 690s [01:52:05] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 17 seconds [01:52:32] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 16s [02:40:16] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [04:43:57] New patchset: Tim Starling; "Removed redundant "Apaches 8 CPU" data source" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12143 [04:44:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12143 [04:44:30] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12143 [04:44:32] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12143 [06:10:05] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [06:41:58] New patchset: ArielGlenn; "sigpipe sometimes shows as 141; fix up noop/lastlinks to work again" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12144 [06:47:12] New patchset: ArielGlenn; "sigpipe sometimes shows as 141; fix up noop/lastlinks to work again" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12144 [06:48:10] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12144 [06:48:12] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12144 [07:09:47] New patchset: ArielGlenn; "fix name of dsh node group for snapshot hosts" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12145 [07:10:30] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12145 [07:10:32] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12145 [07:12:27] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [08:06:58] New patchset: QChris; "Repairung status for items with toBeRun false." [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12147 [08:09:57] New review: QChris; "While moving the check for Maintenance mode around in commit" [operations/dumps] (ariel) C: 0; - https://gerrit.wikimedia.org/r/12147 [08:27:20] New review: ArielGlenn; "The problem is that we want to do md5sum and status markup of all jobs, not just the ones that actua..." [operations/dumps] (ariel); V: 0 C: 0; - https://gerrit.wikimedia.org/r/12147 [08:58:00] New patchset: Hashar; "class to cleanup /var/cache/apt/archives" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12151 [08:58:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12151 [09:05:06] New review: Hashar; "So should we just abandon this change and fill a request for the above DNS entry?" [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/9874 [09:11:46] New review: Dereckson; "Logo protected." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/11741 [09:23:03] New patchset: Hashar; "(bug 37327) Configure chr.wikipedia site logo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11741 [09:23:10] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11741 [09:23:33] New review: Hashar; "Patchset 2 is a rebase." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11741 [09:24:29] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11741 [09:26:43] New review: Hashar; "deployed on live site." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11741 [09:38:59] New patchset: Hashar; "$wgMobileResourceVersion does not exist anymore" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12043 [09:39:05] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/12043 [09:39:19] New review: Hashar; "Patchset 2 is just a rebase." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12043 [09:39:22] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12043 [09:40:13] New review: Hashar; "Deployed on live site." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12043 [09:42:49] New patchset: Hashar; "(bug 37457) viwikibooks can import from fr/it wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11746 [09:42:55] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11746 [09:43:02] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11746 [09:43:04] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11746 [09:43:47] New review: Hashar; "Deployed on live site." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11746 [09:53:19] New patchset: ArielGlenn; "--date option; date and wikiname in outputdirname optionally" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12152 [09:55:07] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12152 [09:55:09] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12152 [10:08:51] New patchset: QChris; "Correcting md5 copying and item updating order" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12153 [10:08:52] New patchset: QChris; "Repairing status for items with toBeRun false." [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/12147 [10:09:05] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:09:34] New review: QChris; "Although commit c483a067b9eb3214ee1309ef2602bfb3c4c831f2 repairs the" [operations/dumps] (ariel) C: 0; - https://gerrit.wikimedia.org/r/12147 [10:35:47] New review: Lydia Pintscher; "Yes I think we can close this." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/9874 [12:13:44] New patchset: Mark Bergsma; "Don't blindly accept mail for non-local domains, to fix SMTP address verification" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12162 [12:14:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12162 [12:22:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12162 [12:22:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12162 [12:27:33] New patchset: Petrb; "inserted back the ram check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12163 [12:27:42] Ryan_Lane: ^ [12:28:04] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12163 [12:32:15] petan: thanks [12:35:13] New patchset: Petrb; "inserted back the ram check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12163 [12:35:36] Ryan_Lane: I don't know if we use the test branch or not, so I inserted it to both [12:35:44] see topic :) [12:35:46] New patchset: Hashar; "(bug 37740) raise account throttle for an edit marathon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12165 [12:35:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12163 [12:35:50] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/12165 [12:35:54] I'll abandon it in test [12:36:15] Ryan_Lane: ok, then how do we make nagios work again [12:36:23] Ryan_Lane: by merging this in production? [12:36:28] branch [12:36:34] I'm going to do the switch today [12:36:38] ok [12:36:39] and yes, I'll merge it into production [12:36:53] hm, we have a few open things in the test branch [12:37:01] New review: Amire80; "(no comment)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/12165 [12:37:06] I'll just ignore them [12:37:43] ok [12:38:04] so when you merge it to production it will be in labs as well? [12:38:32] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12165 [12:38:34] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12165 [12:40:44] New patchset: Ryan Lane; "Merge remote-tracking branch 'origin/production' into test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12021 [12:40:59] petan: when I finish the switchover, yeah [12:41:14] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [12:41:15] wait. that doesn't look right [12:41:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12021 [12:41:17] heh [12:41:29] New review: Hashar; "deployed live." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12165 [12:42:03] well, the comment isn't correct anyway [12:42:20] Ryan_Lane: which [12:42:28] for the change I just pushed in [12:43:47] New patchset: Ryan Lane; "Merge test into production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12021 [12:43:58] that's better [12:44:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12021 [12:44:48] !log merging puppet test branch into production branch, lot's of changes going through…. [12:46:16] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12021 [12:46:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12021 [13:07:57] New patchset: Ryan Lane; "Remove duplicate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12167 [13:19:04] New patchset: Ryan Lane; "Not setting config options based on a version of something is a bad idea" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12169 [13:19:38] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12169 [13:19:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12169 [13:20:12] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12169 [13:20:28] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12167 [13:20:39] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12167 [13:20:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12169 [13:20:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12167 [13:27:30] New patchset: Ryan Lane; "Remove labs specific fact from production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12172 [13:28:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12172 [13:29:13] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12172 [13:29:16] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12172 [13:38:54] PROBLEM - Swift HTTP on copper is CRITICAL: Connection refused [13:39:03] PROBLEM - Swift HTTP on magnesium is CRITICAL: Connection refused [13:39:12] PROBLEM - Swift HTTP on owa3 is CRITICAL: Connection refused [13:39:21] PROBLEM - Swift HTTP on owa2 is CRITICAL: Connection refused [13:39:39] PROBLEM - Swift HTTP on zinc is CRITICAL: Connection refused [13:41:42] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12163 [13:41:47] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12163 [13:45:39] PROBLEM - Varnish HTCP daemon on cp1029 is CRITICAL: Connection refused by host [13:46:33] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: Connection refused by host [13:46:52] PROBLEM - Lucene disk space on search1019 is CRITICAL: Connection refused by host [13:47:09] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: Connection refused by host [13:47:14] are those swift errors actually a problem? [13:47:18] PROBLEM - MySQL Slave Running on db1022 is CRITICAL: Connection refused by host [13:47:18] PROBLEM - MySQL Idle Transactions on db1022 is CRITICAL: Connection refused by host [13:47:18] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: Connection refused by host [13:47:19] are those systems in use? 
[13:47:36] PROBLEM - Varnish HTCP daemon on cp1030 is CRITICAL: Connection refused by host [13:47:36] PROBLEM - MySQL Slave Delay on db57 is CRITICAL: Connection refused by host [13:47:36] PROBLEM - MySQL Slave Running on db57 is CRITICAL: Connection refused by host [13:47:36] PROBLEM - MySQL Idle Transactions on db57 is CRITICAL: Connection refused by host [13:47:41] o.O [13:47:45] PROBLEM - mysqld processes on db57 is CRITICAL: Connection refused by host [13:48:03] PROBLEM - Lucene disk space on search19 is CRITICAL: Connection refused by host [13:48:03] PROBLEM - Full LVS Snapshot on db57 is CRITICAL: Connection refused by host [13:48:03] PROBLEM - MySQL Recent Restart on db57 is CRITICAL: Connection refused by host [13:48:03] PROBLEM - MySQL disk space on db57 is CRITICAL: Connection refused by host [13:48:11] this looks like nagios is broken [13:48:12] PROBLEM - MySQL Recent Restart on db1022 is CRITICAL: Connection refused by host [13:48:12] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: Connection refused by host [13:48:12] PROBLEM - Lucene disk space on search1015 is CRITICAL: Connection refused by host [13:48:14] yay [13:48:21] PROBLEM - MySQL Slave Delay on db1022 is CRITICAL: Connection refused by host [13:48:21] PROBLEM - Full LVS Snapshot on db1022 is CRITICAL: Connection refused by host [13:48:21] PROBLEM - mysqld processes on db1022 is CRITICAL: Connection refused by host [13:48:24] Ryan_Lane: did you copy the REALM? [13:48:28] eh? [13:48:34] what do you mean? [13:48:36] because production has different IP than labs [13:48:41] if we are using the same config for both [13:48:47] we need to use realm [13:48:50] !$realm [13:48:50] $realm is a variable used in puppet to determine which cluster a system is in. See also $site. [13:49:13] Ryan_Lane: in nrpe config I mean [13:49:51] PROBLEM - Lucene disk space on search1021 is CRITICAL: Connection refused by host [13:50:00] PROBLEM - Lucene disk space on search1002 is CRITICAL: Connection refused by host [13:50:12] allowed_hosts is fine [13:51:27] is neon not the nagios server? [13:51:57] seems its spense [13:51:58] spence [13:52:15] PROBLEM - MySQL disk space on db60 is CRITICAL: Connection refused by host [13:52:24] PROBLEM - MySQL disk space on db59 is CRITICAL: Connection refused by host [13:52:24] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: Connection refused by host [13:52:24] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: Connection refused by host [13:52:24] PROBLEM - MySQL Recent Restart on db59 is CRITICAL: Connection refused by host [13:52:24] PROBLEM - Full LVS Snapshot on db60 is CRITICAL: Connection refused by host [13:52:33] PROBLEM - Lucene disk space on search21 is CRITICAL: Connection refused by host [13:52:33] PROBLEM - MySQL Idle Transactions on db60 is CRITICAL: Connection refused by host [13:52:42] PROBLEM - Lucene disk space on search1016 is CRITICAL: Connection refused by host [13:52:51] PROBLEM - MySQL Replication Heartbeat on db59 is CRITICAL: Connection refused by host [13:52:54] hm [13:52:56] Ryan_Lane: that's for documentation on wikitech would be [13:53:00] why is nrpe not running on them? 
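The flood of "Connection refused by host" alerts above usually points at the NRPE daemon itself rather than at allowed_hosts: a refused TCP connection means nothing is listening on the check port at all, whereas an allowed_hosts mismatch still lets the TCP connect succeed before the check is rejected. A minimal sketch of that distinction in Python (host names are purely illustrative; NRPE's default port is 5666):

    import socket

    NRPE_PORT = 5666  # NRPE's default listening port
    hosts = ["db57.pmtpa.wmnet", "search1019.eqiad.wmnet"]  # illustrative names only

    for host in hosts:
        try:
            # A successful connect means the nrpe daemon is at least listening,
            # even if allowed_hosts would later reject the monitoring server.
            with socket.create_connection((host, NRPE_PORT), timeout=5):
                print(f"{host}: nrpe port open")
        except ConnectionRefusedError:
            print(f"{host}: connection refused - nrpe probably not running")
        except OSError as exc:
            print(f"{host}: {exc}")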
[13:53:00] PROBLEM - Lucene disk space on search1020 is CRITICAL: Connection refused by host [13:53:00] PROBLEM - Lucene disk space on search1010 is CRITICAL: Connection refused by host [13:53:02] if it was actually correct [13:53:03] :D [13:53:09] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: Connection refused by host [13:53:27] PROBLEM - Lucene disk space on search1001 is CRITICAL: Connection refused by host [13:53:27] PROBLEM - MySQL Slave Running on db59 is CRITICAL: Connection refused by host [13:53:27] PROBLEM - MySQL Idle Transactions on db59 is CRITICAL: Connection refused by host [13:53:36] PROBLEM - Full LVS Snapshot on db59 is CRITICAL: Connection refused by host [13:53:36] PROBLEM - mysqld processes on db59 is CRITICAL: Connection refused by host [13:53:45] RECOVERY - MySQL disk space on db60 is OK: DISK OK [13:53:52] I wonder if the nrpe restart is broken or something [13:53:54] RECOVERY - MySQL Slave Delay on db60 is OK: OK replication delay 0 seconds [13:53:54] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds [13:53:54] RECOVERY - Full LVS Snapshot on db60 is OK: OK no full LVM snapshot volumes [13:54:03] RECOVERY - MySQL Idle Transactions on db60 is OK: OK longest blocking idle transaction sleeps for 0 seconds [13:54:47] I think mutante was talking with me on hackaton about something what is running so long with no restart that he isn't even sure if it's able to restart [13:54:51] it was nagios related :P [13:54:57] PROBLEM - Lucene disk space on searchidx2 is CRITICAL: Connection refused by host [13:54:57] PROBLEM - Varnish HTCP daemon on cp1034 is CRITICAL: Connection refused by host [13:55:15] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: Connection refused by host [13:55:15] RECOVERY - MySQL disk space on db59 is OK: DISK OK [13:55:33] RECOVERY - MySQL Recent Restart on db59 is OK: OK 4621659 seconds since restart [13:55:42] RECOVERY - MySQL Replication Heartbeat on db59 is OK: OK replication delay 1 seconds [13:56:09] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 0 seconds [13:56:18] RECOVERY - MySQL Slave Running on db59 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [13:56:18] RECOVERY - MySQL Idle Transactions on db59 is OK: OK longest blocking idle transaction sleeps for 0 seconds [13:56:27] RECOVERY - Full LVS Snapshot on db59 is OK: OK no full LVM snapshot volumes [13:56:27] RECOVERY - mysqld processes on db59 is OK: PROCS OK: 1 process with command name mysqld [13:56:45] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: Connection refused by host [13:57:10] oh, when puppet re-runs, it'll fix this [13:57:12] PROBLEM - Varnish HTCP daemon on cp1035 is CRITICAL: Connection refused by host [13:57:48] RECOVERY - Lucene disk space on search1001 is OK: DISK OK [13:58:24] PROBLEM - Lucene disk space on search13 is CRITICAL: Connection refused by host [13:58:42] PROBLEM - Lucene disk space on search1014 is CRITICAL: Connection refused by host [14:00:12] PROBLEM - Lucene disk space on search1008 is CRITICAL: Connection refused by host [14:00:12] PROBLEM - Lucene disk space on search1011 is CRITICAL: Connection refused by host [14:00:39] PROBLEM - MySQL Slave Running on db1001 is CRITICAL: Connection refused by host [14:00:39] PROBLEM - MySQL Recent Restart on db1001 is CRITICAL: Connection refused by host [14:00:39] PROBLEM - MySQL disk space on db1001 is CRITICAL: Connection refused by host [14:00:39] PROBLEM - MySQL Idle Transactions on db1001 is CRITICAL: 
Connection refused by host [14:00:48] PROBLEM - MySQL Slave Delay on db1001 is CRITICAL: Connection refused by host [14:00:48] PROBLEM - mysqld processes on db1001 is CRITICAL: Connection refused by host [14:00:48] PROBLEM - MySQL Recent Restart on db58 is CRITICAL: Connection refused by host [14:00:57] PROBLEM - Full LVS Snapshot on db1001 is CRITICAL: Connection refused by host [14:00:57] PROBLEM - MySQL Slave Running on db58 is CRITICAL: Connection refused by host [14:01:06] PROBLEM - Full LVS Snapshot on db58 is CRITICAL: Connection refused by host [14:01:06] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: Connection refused by host [14:01:06] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: Connection refused by host [14:01:15] PROBLEM - Lucene disk space on search36 is CRITICAL: Connection refused by host [14:01:19] oh yeah, this is going to be annoying as hell [14:01:24] PROBLEM - mysqld processes on db58 is CRITICAL: Connection refused by host [14:01:24] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: Connection refused by host [14:01:33] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: Connection refused by host [14:01:51] PROBLEM - MySQL Idle Transactions on db58 is CRITICAL: Connection refused by host [14:02:00] PROBLEM - Varnish HTCP daemon on cp1033 is CRITICAL: Connection refused by host [14:02:00] PROBLEM - MySQL disk space on db58 is CRITICAL: Connection refused by host [14:02:04] het at least it's not paging [14:02:27] PROBLEM - MySQL Recent Restart on db1020 is CRITICAL: Connection refused by host [14:02:36] PROBLEM - MySQL disk space on db1020 is CRITICAL: Connection refused by host [14:03:03] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: Connection refused by host [14:03:03] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: Connection refused by host [14:03:03] PROBLEM - Full LVS Snapshot on bellin is CRITICAL: Connection refused by host [14:03:03] PROBLEM - MySQL Slave Delay on bellin is CRITICAL: Connection refused by host [14:03:12] PROBLEM - Lucene disk space on search31 is CRITICAL: Connection refused by host [14:03:12] PROBLEM - Full LVS Snapshot on db1020 is CRITICAL: Connection refused by host [14:03:12] PROBLEM - MySQL Idle Transactions on db1020 is CRITICAL: Connection refused by host [14:03:12] PROBLEM - MySQL Slave Running on db1020 is CRITICAL: Connection refused by host [14:03:12] PROBLEM - mysqld processes on db1020 is CRITICAL: Connection refused by host [14:03:21] PROBLEM - MySQL Replication Heartbeat on bellin is CRITICAL: Connection refused by host [14:03:30] PROBLEM - MySQL Idle Transactions on bellin is CRITICAL: Connection refused by host [14:03:30] PROBLEM - mysqld processes on bellin is CRITICAL: Connection refused by host [14:03:39] PROBLEM - MySQL Slave Running on bellin is CRITICAL: Connection refused by host [14:03:48] PROBLEM - MySQL disk space on bellin is CRITICAL: Connection refused by host [14:03:48] PROBLEM - MySQL Recent Restart on bellin is CRITICAL: Connection refused by host [14:04:24] PROBLEM - Lucene disk space on search33 is CRITICAL: Connection refused by host [14:04:42] PROBLEM - Lucene disk space on search1006 is CRITICAL: Connection refused by host [14:04:42] PROBLEM - MySQL Replication Heartbeat on db1004 is CRITICAL: Connection refused by host [14:04:51] PROBLEM - MySQL Recent Restart on db1004 is CRITICAL: Connection refused by host [14:05:00] PROBLEM - MySQL Slave Delay on db1004 is CRITICAL: Connection refused by host [14:05:00] PROBLEM - Varnish traffic logger on cp1032 is 
CRITICAL: Connection refused by host [14:05:18] PROBLEM - Varnish HTCP daemon on cp1032 is CRITICAL: Connection refused by host [14:05:36] PROBLEM - Full LVS Snapshot on db1004 is CRITICAL: Connection refused by host [14:05:45] PROBLEM - MySQL Idle Transactions on db1004 is CRITICAL: Connection refused by host [14:05:45] PROBLEM - MySQL Slave Running on db1004 is CRITICAL: Connection refused by host [14:05:45] PROBLEM - MySQL disk space on db1004 is CRITICAL: Connection refused by host [14:05:45] PROBLEM - Lucene disk space on search1022 is CRITICAL: Connection refused by host [14:05:54] PROBLEM - Lucene disk space on search1013 is CRITICAL: Connection refused by host [14:05:54] PROBLEM - mysqld processes on db1004 is CRITICAL: Connection refused by host [14:06:03] PROBLEM - Lucene disk space on search1018 is CRITICAL: Connection refused by host [14:06:21] PROBLEM - MySQL disk space on db1026 is CRITICAL: Connection refused by host [14:08:09] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: Connection refused by host [14:08:27] PROBLEM - Lucene disk space on search28 is CRITICAL: Connection refused by host [14:08:36] PROBLEM - Varnish HTCP daemon on cp1036 is CRITICAL: Connection refused by host [14:10:15] PROBLEM - Lucene disk space on search35 is CRITICAL: Connection refused by host [14:10:24] PROBLEM - Lucene disk space on search24 is CRITICAL: Connection refused by host [14:12:30] PROBLEM - Varnish HTCP daemon on cp1031 is CRITICAL: Connection refused by host [14:12:57] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: Connection refused by host [14:13:15] RECOVERY - Varnish HTCP daemon on cp1029 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:13:33] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: Connection refused by host [14:13:33] PROBLEM - Lucene disk space on search1007 is CRITICAL: Connection refused by host [14:13:51] PROBLEM - Full LVS Snapshot on db53 is CRITICAL: Connection refused by host [14:13:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: Connection refused by host [14:13:51] PROBLEM - mysqld processes on db53 is CRITICAL: Connection refused by host [14:14:00] RECOVERY - Varnish HTCP daemon on cp1030 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:14:09] PROBLEM - MySQL Slave Running on db53 is CRITICAL: Connection refused by host [14:14:09] PROBLEM - MySQL Recent Restart on db53 is CRITICAL: Connection refused by host [14:14:09] PROBLEM - MySQL disk space on db53 is CRITICAL: Connection refused by host [14:14:18] RECOVERY - MySQL Recent Restart on db57 is OK: OK 4325036 seconds since restart [14:14:27] RECOVERY - Lucene disk space on search1019 is OK: DISK OK [14:14:27] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [14:14:27] PROBLEM - MySQL Idle Transactions on db53 is CRITICAL: Connection refused by host [14:14:27] RECOVERY - MySQL Idle Transactions on db57 is OK: OK longest blocking idle transaction sleeps for 0 seconds [14:14:27] RECOVERY - Full LVS Snapshot on db57 is OK: OK no full LVM snapshot volumes [14:14:49] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [14:14:49] RECOVERY - MySQL Idle Transactions on db1022 is OK: OK longest blocking idle transaction sleeps for 0 seconds [14:14:49] RECOVERY - MySQL Slave Running on db1022 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:14:58] RECOVERY - MySQL Slave Delay 
on db57 is OK: OK replication delay 0 seconds [14:15:16] RECOVERY - MySQL Recent Restart on db1022 is OK: OK 3688373 seconds since restart [14:15:16] RECOVERY - mysqld processes on db57 is OK: PROCS OK: 1 process with command name mysqld [14:15:16] RECOVERY - MySQL Slave Running on db57 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:15:34] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay 0 seconds [14:15:43] RECOVERY - MySQL disk space on db57 is OK: DISK OK [14:15:52] PROBLEM - Lucene disk space on search16 is CRITICAL: Connection refused by host [14:15:52] RECOVERY - Lucene disk space on search19 is OK: DISK OK [14:15:52] RECOVERY - mysqld processes on db1022 is OK: PROCS OK: 1 process with command name mysqld [14:15:52] RECOVERY - Full LVS Snapshot on db1022 is OK: OK no full LVM snapshot volumes [14:16:01] RECOVERY - MySQL Slave Delay on db1022 is OK: OK replication delay 0 seconds [14:16:01] PROBLEM - Lucene disk space on search1017 is CRITICAL: Connection refused by host [14:16:01] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay 0 seconds [14:16:01] RECOVERY - Lucene disk space on search1015 is OK: DISK OK [14:16:07] New patchset: Hashar; "appsudoers is not a puppet class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12178 [14:16:10] PROBLEM - Lucene disk space on search1012 is CRITICAL: Connection refused by host [14:16:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12178 [14:17:13] RECOVERY - Lucene disk space on search1021 is OK: DISK OK [14:17:31] PROBLEM - Lucene disk space on search1003 is CRITICAL: Connection refused by host [14:17:31] RECOVERY - Lucene disk space on search1002 is OK: DISK OK [14:17:40] PROBLEM - Lucene disk space on search1004 is CRITICAL: Connection refused by host [14:20:31] RECOVERY - Lucene disk space on search1020 is OK: DISK OK [14:20:31] RECOVERY - Lucene disk space on search1010 is OK: DISK OK [14:20:40] RECOVERY - Lucene disk space on search1016 is OK: DISK OK [14:21:34] RECOVERY - Lucene disk space on search21 is OK: DISK OK [14:22:01] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [14:22:19] RECOVERY - Lucene disk space on searchidx2 is OK: DISK OK [14:22:28] RECOVERY - Varnish HTCP daemon on cp1034 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:24:25] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [14:24:34] RECOVERY - Varnish HTCP daemon on cp1035 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:25:46] RECOVERY - Lucene disk space on search13 is OK: DISK OK [14:26:22] RECOVERY - Lucene disk space on search1014 is OK: DISK OK [14:27:34] RECOVERY - Lucene disk space on search1011 is OK: DISK OK [14:27:43] RECOVERY - Lucene disk space on search1008 is OK: DISK OK [14:27:52] RECOVERY - mysqld processes on db1001 is OK: PROCS OK: 1 process with command name mysqld [14:28:01] RECOVERY - MySQL Slave Running on db1001 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:28:10] RECOVERY - MySQL disk space on db1001 is OK: DISK OK [14:28:10] RECOVERY - MySQL Recent Restart on db1001 is OK: OK 2424579 seconds since restart [14:28:19] RECOVERY - MySQL Idle Transactions on db1001 is OK: OK longest blocking idle transaction sleeps for 0 seconds [14:28:19] RECOVERY 
- MySQL Replication Heartbeat on db58 is OK: OK replication delay 0 seconds [14:28:37] RECOVERY - Full LVS Snapshot on db58 is OK: OK no full LVM snapshot volumes [14:28:37] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 0 seconds [14:28:37] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds [14:28:46] RECOVERY - Lucene disk space on search36 is OK: DISK OK [14:28:46] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [14:28:55] RECOVERY - Varnish HTCP daemon on cp1033 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:29:04] RECOVERY - MySQL Idle Transactions on db58 is OK: OK longest blocking idle transaction sleeps for 0 seconds [14:29:04] RECOVERY - MySQL Slave Running on db58 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:29:04] RECOVERY - mysqld processes on db58 is OK: PROCS OK: 1 process with command name mysqld [14:29:04] RECOVERY - Full LVS Snapshot on db1001 is OK: OK no full LVM snapshot volumes [14:29:13] RECOVERY - MySQL Slave Delay on db1001 is OK: OK replication delay 0 seconds [14:29:31] RECOVERY - MySQL disk space on db58 is OK: DISK OK [14:29:31] RECOVERY - MySQL Recent Restart on db58 is OK: OK 4878710 seconds since restart [14:29:58] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 1 seconds [14:30:25] RECOVERY - Full LVS Snapshot on db1020 is OK: OK no full LVM snapshot volumes [14:30:35] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [14:30:35] RECOVERY - MySQL Slave Delay on bellin is OK: OK replication delay seconds [14:30:43] RECOVERY - MySQL Slave Running on db1020 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:30:52] RECOVERY - Full LVS Snapshot on bellin is OK: OK no full LVM snapshot volumes [14:30:52] RECOVERY - mysqld processes on db1020 is OK: PROCS OK: 1 process with command name mysqld [14:31:01] RECOVERY - mysqld processes on bellin is OK: PROCS OK: 1 process with command name mysqld [14:31:01] RECOVERY - MySQL Replication Heartbeat on bellin is OK: OK replication delay seconds [14:31:10] RECOVERY - MySQL Idle Transactions on db1020 is OK: OK longest blocking idle transaction sleeps for 0 seconds [14:31:10] RECOVERY - MySQL Idle Transactions on bellin is OK: OK longest blocking idle transaction sleeps for seconds [14:31:19] RECOVERY - MySQL disk space on db1020 is OK: DISK OK [14:31:19] RECOVERY - MySQL disk space on bellin is OK: DISK OK [14:31:28] RECOVERY - MySQL Recent Restart on db1020 is OK: OK 2426681 seconds since restart [14:31:28] RECOVERY - MySQL Slave Running on bellin is OK: OK replication [14:31:37] RECOVERY - MySQL Recent Restart on bellin is OK: OK seconds since restart [14:31:37] RECOVERY - Lucene disk space on search31 is OK: DISK OK [14:31:46] RECOVERY - Lucene disk space on search33 is OK: DISK OK [14:32:22] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [14:32:31] RECOVERY - Varnish HTCP daemon on cp1032 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:32:40] RECOVERY - Lucene disk space on search1018 is OK: DISK OK [14:32:40] RECOVERY - MySQL Slave Delay on db1004 is OK: OK replication delay 1 seconds [14:32:40] RECOVERY - Full LVS Snapshot on db1004 is OK: OK no full LVM snapshot volumes [14:32:49] RECOVERY - MySQL Idle Transactions on db1004 is OK: OK longest blocking 
idle transaction sleeps for 0 seconds [14:32:58] RECOVERY - MySQL Slave Running on db1004 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:33:07] RECOVERY - mysqld processes on db1004 is OK: PROCS OK: 1 process with command name mysqld [14:33:07] RECOVERY - MySQL disk space on db1004 is OK: DISK OK [14:33:25] RECOVERY - Lucene disk space on search1006 is OK: DISK OK [14:33:25] RECOVERY - MySQL Recent Restart on db1004 is OK: OK 3689535 seconds since restart [14:33:34] RECOVERY - MySQL Replication Heartbeat on db1004 is OK: OK replication delay 0 seconds [14:33:43] RECOVERY - Lucene disk space on search1013 is OK: DISK OK [14:34:34] hey mark, if you are around [14:34:37] RECOVERY - Lucene disk space on search1022 is OK: DISK OK [14:34:41] asher has approved my /var/run/ mysql commit [14:34:43] https://gerrit.wikimedia.org/r/#/c/11296/ [14:34:46] but I need a merge still [14:34:55] RECOVERY - MySQL disk space on db1026 is OK: DISK OK [14:35:22] RECOVERY - Varnish HTCP daemon on cp1036 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:35:58] RECOVERY - Lucene disk space on search28 is OK: DISK OK [14:36:23] New patchset: Hashar; "Change autopatrol-related mediawikwiki userrights" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11748 [14:36:29] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11748 [14:36:34] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [14:37:07] New review: Hashar; "- did a rebase" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11748 [14:37:14] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11748 [14:38:49] RECOVERY - Lucene disk space on search24 is OK: DISK OK [14:38:54] New review: Ottomata; "Ok, I will use rsync daemon modules on the udp2log machines for this. I need this commit merged first:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/11898 [14:39:16] RECOVERY - Lucene disk space on search35 is OK: DISK OK [14:40:01] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [14:40:37] RECOVERY - Varnish HTCP daemon on cp1031 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:40:46] RECOVERY - Lucene disk space on search1007 is OK: DISK OK [14:40:55] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [14:41:22] RECOVERY - Full LVS Snapshot on db53 is OK: OK no full LVM snapshot volumes [14:41:22] RECOVERY - MySQL Idle Transactions on db53 is OK: OK longest blocking idle transaction sleeps for 0 seconds [14:41:22] RECOVERY - mysqld processes on db53 is OK: PROCS OK: 1 process with command name mysqld [14:41:31] RECOVERY - MySQL Slave Running on db53 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [14:41:31] RECOVERY - MySQL disk space on db53 is OK: DISK OK [14:41:58] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [14:41:58] RECOVERY - MySQL Recent Restart on db53 is OK: OK 4241103 seconds since restart [14:42:07] RECOVERY - Lucene disk space on search1012 is OK: DISK OK [14:42:58] New review: Ottomata; "I am waiting for this to be merged. 
Currently this is blocking:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/11574 [14:42:59] hmm- just got an error when adding a reviewer in Gerrit: 'Application Error\nServer Error\nServer smtp.pmtpa.wmnet rejected body' [14:43:11] could this be related to the anti-DOS config changes? [14:43:28] RECOVERY - Lucene disk space on search1004 is OK: DISK OK [14:43:37] RECOVERY - Lucene disk space on search1003 is OK: DISK OK [14:43:37] RECOVERY - Lucene disk space on search1017 is OK: DISK OK [14:44:31] RECOVERY - Lucene disk space on search16 is OK: DISK OK [14:46:25] New patchset: Ottomata; "Removing sampled-1000 logs from oxygen." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12181 [14:47:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12181 [14:59:52] New patchset: Alex Monk; "(bug 37741) Raise account creation limit for outreach event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12183 [14:59:59] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/12183 [15:04:11] New patchset: Hashar; "enhance account throttling" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12185 [15:04:17] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/12185 [15:08:28] New review: Alex Monk; "Just a spelling error. Apart from that, looks good." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/12185 [15:10:24] gwicke: yes [15:10:35] paravoid: ^^ another one of those errors [15:16:19] ? [15:16:38] Ryan_Lane: ? [15:16:49] New review: Platonides; "You're not taking into account the wiki db." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/12185 [15:16:54] emails from gerrit are getting rejected, occasionally [15:17:03] I think its when gerrit tries to send to aliases [15:17:11] or maybe when adding a CC [15:18:11] or when they hit the secondary MX [15:18:18] while mchenry is overloaded [15:18:23] ah [15:18:34] yes, sodium does address verification [15:18:56] sender address verification that is [15:19:02] * Ryan_Lane nods [15:19:04] mails originate from gerrit@wikimedia.org [15:19:09] but that address does not exist [15:19:13] that's wrong [15:19:13] so, we need to add the address? [15:19:16] I'll add an alias to root [15:19:20] * Ryan_Lane nods [15:20:39] and we should also fix the discrepancy between the MX [15:20:47] either we should do sender verification or not [15:21:14] but I'm not sure if I should dare touch that huge exim4 conf :) [15:21:45] heh [15:21:49] and that's why I never make exim changes [15:22:23] New review: Alex Monk; "Platonides, line 14: $wgAccountCreationThrottle = $throttle['value'];" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/12185 [15:25:42] New review: Platonides; "No, not the overrider, the one which is replaced." 
[operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/12185 [15:27:22] -> MAIL FROM: [15:27:22] <- 250 OK [15:27:22] -> RCPT TO: [15:27:22] <- 250 Accepted [15:27:22] -> QUIT [15:27:24] done [15:27:51] took me a while to find that aliases are local to mcherny [15:28:43] hey Ryan_Lane [15:28:45] that's almost the only thing I know about the mail system :-D [15:28:53] drdee: hello [15:28:59] paravoid: sweet [15:29:04] hm, they both have sender verification, I wonder why this was a problem now [15:29:08] paravoid: aliases are either local, or in gmail [15:29:10] could you please create a gerrit mysql account with just the SELECT privilege? [15:29:15] drdee: ah, yeah [15:29:20] I was going to do that this week [15:30:57] sweet [15:31:19] and this week can also be maybe today? :D [15:33:50] paravoid, Ryan_Lane: does bugzilla and RT have a email address with an inbox as well? [15:34:00] with an inbox? [15:34:10] no, they don't have inboxes [15:34:17] only users have inboxes [15:34:29] maybe using wrong terminology [15:34:32] well, I guess I lie [15:34:36] RT has an smtp server [15:34:40] can bugzilla receive email? [15:34:48] no [15:34:50] rt can [15:34:58] cool, on which alias? [15:35:11] bugzilla can, it's just not setup for us ;) [15:35:13] ops-requests@rt.wikimedia.org [15:35:34] Reedy: would it be useful to setup bugzilla to receive email? [15:35:43] thanks Ryan_Lane [15:35:44] Possibly yeah [15:35:53] shall I create an RT ticket? [15:35:57] https://bugzilla.wikimedia.org/show_bug.cgi?id=22629 [15:36:11] nice [15:36:26] i'll create an RT ticket as well (assuming that there is none yet) [15:36:57] I don't think there is [15:37:27] https://rt.wikimedia.org/Ticket/Display.html?id=3159 [15:37:29] now there is [15:37:31] :D [15:56:24] hey apergos: are you around? [15:57:23] who is responsible for taking care of the backup tapes? [16:05:09] New patchset: Ottomata; "Adding Accept-Language and X-Carrier headers to web access log sources (squid,nginx,varnish)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12188 [16:05:41] Change abandoned: Ottomata; "Abandoning this in favor of:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6526 [16:05:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12188 [16:11:23] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [16:26:01] drdee_: not (dinner then off),and I am not really sure who the amanda person is [16:38:50] apergos: ok thx [16:39:23] notpeter: https://gerrit.wikimedia.org/r/12193 [16:40:38] preilly: now? or in 20? [16:40:45] not in 20 [16:40:52] I mean in 20 [16:40:57] hah, ok [16:46:00] New patchset: Pyoungmeister; "making all eqiad apaches to instlal precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12197 [16:46:33] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12197 [16:50:23] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12197 [16:50:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12197 [16:58:49] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12193 [16:58:58] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12193 [16:59:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12193 [16:59:25] preilly: merged and I'm going to start forcing puppet runs [17:01:05] preilly: there's an error in there [17:01:21] notpeter: what is it? [17:01:47] preilly: missing semi [17:01:52] on one line [17:02:15] orange camaroon [17:04:09] New patchset: Pyoungmeister; "adding missing semi" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12198 [17:04:41] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12198 [17:04:41] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12198 [17:06:41] notpeter: I've got no idea how that happened [17:06:52] preilly: eh, typos happen [17:07:28] fixed version is live [17:08:51] notpeter: okay cool [17:09:55] semicolons are my nemeses [17:13:14] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [17:13:25] is this you, mark? http://www.mediawiki.org/w/index.php?oldid=551981&rcid=625570 [17:14:57] Jasper_Deng: no [17:15:10] that's marktraceur [17:15:18] he's not in this channel [17:15:27] was just wondering [17:23:05] New review: Hashar; "Maybe I should make the helper function efRaiseAccountCreationThrottle() to just accept an array of ..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/12185 [17:27:35] New patchset: preilly; "update dolphin browser code to latest Wikipedia code on GitHub" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12199 [17:27:42] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/12199 [17:27:59] New review: preilly; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12199 [17:28:01] Change merged: preilly; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12199 [17:28:51] Ryan_Lane: how do I sync this https://gerrit.wikimedia.org/r/#/c/12199/ [17:29:18] what do you mean sync? [17:29:34] you mean deploy? [17:29:39] ah [17:29:45] Ryan_Lane: deploy to cluster [17:30:06] Ryan_Lane: it's in fenari:/apache/common/docroot/bits I think [17:30:17] and I get error: cannot open .git/FETCH_HEAD: Permission denied [17:30:21] sync-docroot [17:30:30] may need to be root to do it [17:30:42] hey guys, this one is easy [17:30:49] can someone approve/merge this real quick? [17:30:49] https://gerrit.wikimedia.org/r/#/c/12181/ [17:30:54] Ryan_Lane: do you have any spare cycles to sync it for me [17:31:01] Ryan_Lane: well I mean update it and sync it [17:31:06] preilly: it works for me [17:31:08] as laner [17:31:28] no. 
I can't do this [17:31:34] I'm leaving in 30 mins [17:31:36] actually [17:31:41] I'm leaving basically right now [17:32:00] Ryan_Lane: okay [17:32:03] preilly: /home/w/common/docroot/bits [17:32:05] git pull [17:32:07] sync-docroot [17:33:02] Ryan_Lane: okay I guess I was in the wrong directory is all [17:33:06] * Ryan_Lane nods [17:33:40] Ryan_Lane: so does sync-docroot actually scap it? [17:33:44] yes [17:33:55] <^demon> Remind me to never give preilly a job next door to a nuclear launch facility, in case he goes to the wrong desk at work ;-) [17:34:10] ^demon: maybe we shouldn't have the same crap in multiple places [17:34:17] where one place is the wrong one [17:34:19] ^demon: seriously [17:34:31] ^demon: that wasn't a necessary comment at all [17:34:31] <^demon> Ryan_Lane: That too :) [17:35:00] <^demon> preilly: I didn't mean to offend, I'm sorry. [17:41:43] New review: Jeremyb; "The RT is filed and I think had some new activity today even." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/9874 [17:42:40] * jeremyb spies a ^demon... [17:43:03] could use some help with gerrit today. worked on it some yesterday. [17:43:12] also, could just use a regex wizard in general [17:43:37] (/me sometimes fills that role but i'm kinda stumped here) [17:44:04] anyway, i should be back around 3ish i guess [18:04:44] @replag [18:04:46] Krinkle: [s4] db33: 11s; [s7] db26: 1s [18:05:28] If I read my logs correctly, pywikibots have been disallowed from editing for almost a day on commons due to db replag [18:06:01] If I read my logs correctly, my pywikibots (and from reports also other people's) have been disallowed from editing for almost a day now, on commons (due to db replag). [18:06:11] any ideas where/what is going wrong? [18:06:33] maybe that slave should be depooled? Or is that 11s replay on db33 only since recently (not over 24 hours) [18:06:45] @replag [18:06:47] Krinkle: [s4] db33: 21s [18:06:52] oh, increasing [18:07:43] It's seconds behind the master [18:07:59] so even a few minutes isn't too bad [18:09:23] New patchset: Sara; "Comment ganglia manifest to improve maintainability. Remove obsolate ganglia gmetad.conf file." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12205 [18:09:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12205 [18:12:11] Reedy: So bots should just continue editing with that replag? [18:12:19] indeed [18:12:27] it's not worth worrying about [18:12:37] Maybe a treshold in pywikibot is tripping up [18:12:38] it is forcing the bot to pause up to 60 seconds between edit attempts. [18:13:22] maxlag is somewhat of a useless metric at times [18:13:49] 1 Fatal error: Invalid host name (wikipedia.geo.blitzed.org), can't determine language.#012 in /usr/local/apache/common-local/multiversion/MWMultiVersion.php on line 354 [18:13:59] :/ [18:14:00] Reedy: That would be me [18:14:03] (I think) [18:14:09] Doing what? [18:14:12] I was checking out hostnames in apache files [18:14:19] Saw the blitzed one [18:14:27] which is still actively set to wmf ip [18:14:40] and apparently the software doesn't know how to deal with that [18:14:45] ... 
that hostname [18:14:53] New patchset: preilly; "remove session cookie in the absence of X-V-O support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12206 [18:14:54] the error used to be worse [18:14:55] i [18:14:56] t [18:14:57] h [18:14:57] as [18:15:00] (stupid irc client) [18:15:04] it has improved :) [18:15:23] notpeter: can you merge and push this https://gerrit.wikimedia.org/r/12206 [18:15:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12206 [18:15:41] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/12206 [18:16:59] New patchset: Sara; "Comment ganglia manifest to improve maintainability. Remove obsolate ganglia gmetad.conf file." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12205 [18:17:11] notpeter: ping [18:17:32] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12206 [18:17:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12206 [18:17:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12205 [18:17:53] preilly: forcing puppet runs now [18:18:03] notpeter: okay great thanks man [18:18:07] notpeter: you're the best [18:18:09] no prob! [18:18:10] New review: Sara; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12205 [18:18:13] Change merged: Sara; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12205 [18:18:23] preilly: I'm hella good at +2ing your changes :) [18:18:31] notpeter: can you let me know when it completes [18:18:37] yep [18:18:43] notpeter: ha ha ha I +2 your +2'ing [18:20:02] "I like the fact that you like the fact that I did X" [18:20:21] New review: Trevor Parscal; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/5289 [18:20:27] ok. +2 to your changes being live [18:21:23] preilly: ^ [18:21:30] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 182 seconds [18:21:43] notpeter: thanks [18:22:15] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 190 seconds [18:23:01] !log reloading mr1-pmtpa for sw upgrade (fixing a cpu bug) [18:23:06] Logged the message, Mistress of the network gear. 
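The db33 slave-delay alerts just above correspond to the replication-lag figures the @replag bot keeps quoting; MediaWiki also exposes the lag it sees per database server through its API. A minimal sketch of such a check, assuming the standard siteinfo dbrepllag property of api.php:

    import json
    import urllib.request

    API = "https://commons.wikimedia.org/w/api.php"
    PARAMS = "action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=1&format=json"

    with urllib.request.urlopen(API + "?" + PARAMS) as resp:
        data = json.loads(resp.read().decode("utf-8"))

    # Each entry reports one database server and its current lag in seconds.
    for server in data["query"]["dbrepllag"]:
        print(server["host"], server["lag"], "seconds behind")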
[18:25:06] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [18:25:06] PROBLEM - Host ps1-b1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.6) [18:25:06] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [18:25:06] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [18:25:06] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.18) [18:25:07] PROBLEM - Host ps1-c2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.12) [18:25:07] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [18:25:08] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [18:25:08] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [18:25:09] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [18:25:09] PROBLEM - Host ps1-d3-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:25:10] PROBLEM - Host ps1-b5-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:25:10] PROBLEM - Host ps1-d1-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:25:11] PROBLEM - Host ps1-d2-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:25:11] PROBLEM - Host ps1-b4-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [18:25:31] oops [18:25:37] from the mr1 reboot, ignore [18:25:39] heh heh [18:25:47] promise all the power supplies didn't go down [18:25:51] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [18:25:55] New review: Krinkle; "(no comment)" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/12185 [18:27:48] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [18:28:33] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.83 ms [18:28:33] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.11 ms [18:28:33] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms [18:28:33] RECOVERY - Host ps1-b3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.10 ms [18:28:33] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.42 ms [18:28:34] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.94 ms [18:28:34] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.37 ms [18:28:42] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.70 ms [18:28:42] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.78 ms [18:28:42] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.82 ms [18:28:42] RECOVERY - Host ps1-b1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.86 ms [18:28:42] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.20 ms [18:28:43] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.41 ms [18:28:51] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.14 ms [18:29:00] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.45 ms [18:29:59] Reedy: db33 still okay? 190 seconds now. nagios says it is too much :P [18:30:02] @replag [18:30:04] Krinkle: [s4] db33: 226s; [s7] db26: 1s [18:30:39] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.24 ms [18:31:11] Ganglia derived db data recent as of: Thu Jan 1 0:00:00 GMT 1970 [18:31:12] wheee [18:31:22] New patchset: Aaron Schulz; "Fix up w/s on sync script." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/12209 [18:31:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12209 [18:32:09] LeslieCarr: https://gerrit.wikimedia.org/r/12209 [18:32:11] It's not increasing 1s/s yet [18:32:24] AaronSchulz: looking [18:32:49] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12209 [18:32:54] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12209 [18:32:54] yay whitespace cleanups [18:32:57] easiest rewviews [18:33:01] LeslieCarr: I should make a script that automatically pings you when I do patches there [18:33:15] only for whitespace type cleanups :p [18:33:21] Reedy: I dug up pywikibot source to check why it is refusing to edit, turns out it isn't requesting dbrreplag at all. the APi is erroing out with { error: { code: "maxlag", .. } }. So its not pywiki to blame (the edit in question was not finished done either, this is in a response to an edit) - [18:33:21] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [18:33:31] Reedy: the bots haven't edited in over 24 hours, not even once in 60 seconds. [18:33:44] all attempts result in a maxlag error and then waiting another minute [18:34:02] replication from the master side looks ok [18:34:14] ssmollett: merging more of your ganglia changes [18:37:05] Krinkle: might be worth poking Asher when he's around.. There's loads of "Waiting for the slave SQL thread to advance position" processes on db33 [18:37:40] Requesting API query from commons:commons [18:37:41] error occured, { code: "maxlag", [18:37:42] info: "Waiting for 10.0.6.43: 204 seconds lagged" } [18:37:43] HTTP Status: 200 [18:37:43] HTTP Response: OK [18:37:47] @info 10.0.6.43 [18:37:47] Krinkle: [10.0.6.43: s4] db33 [18:37:51] just checking :D [18:38:12] 1 | system user | | NULL | Connect | 700776 | Waiting for master to send event [18:38:27] Reedy: So basically all API write queries for commons are down now, and have been for almost 24 hours [18:38:41] "down"? [18:38:46] Tim only stopped purge [18:38:52] refused, error.code = maxlag [18:39:51] it can't do anything [18:40:06] weird though.. I can do it from the js console [18:40:33] don't pass a maxlag parameter [18:40:36] don't obey maxlag [18:40:37] fixed [18:42:01] Hm.. Indeed, pywiki is POSTing maxlag=5 [18:42:04] wtf [18:42:38] but when should a script pass maxlag ? [18:42:40] Is it any good? [18:42:41] I imagine they added it on request [18:42:59] "Recommended usage for Wikimedia wikis is as follows: Use maxlag=5 (5 seconds)." [18:43:03] https://www.mediawiki.org/wiki/Manual:Maxlag_parameter [18:43:04] :D [18:43:15] Maxlag is strange, just because one slave is lagged, doesn't mean you should stop [18:43:16] ie this [18:43:18] it's one upset [18:43:22] it's one upset slave [18:43:24] not an upset cluster [18:43:46] I think we removed it on AWB [18:45:05] In soviet russia, maxlag doesn't wait for you [18:45:09] Reedy: but what about that slave, isn't that also causing "lag warnings" to be shown on RecentChanges and Watchlist [18:45:22] I don't see those on commons [18:46:10] I don't think those work on maxlag [18:46:27] *they don't work on maxlag [18:47:10] k, I'll see if I can get it removed from pywiki [18:47:56] back here though, this slave does seem to be problematic. Is it still in the pool? 
as in, if I do a read query (either through api or through GUI, like page history), is that one in the mix? [18:51:34] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 189 seconds [18:51:44] yeah, it's still pooled [18:51:58] you might hit it, you might not [18:52:10] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 196 seconds [18:52:18] it's not terribly lagged (ie so it's urgent to deal with it) [18:53:18] * robla starts hunting around for a maxlag graph [18:53:41] hey guys [18:54:02] drdee_ has a script that sends out an email with some project management weekly status stuff [18:54:04] to us in analytics [18:54:19] do we have an smtp server that a labs instance can use to send email? [18:54:28] sodium? mchenry? [18:54:37] @replag [18:54:39] Krinkle: [s4] db33: 321s [18:55:57] Reedy: okay,I found out that as of the 2011 version I can set maxlag=None in user-config.py. The default is still maxlag=5 as of now [18:56:07] I'll do that then, so that at least MY bots are back up :P [18:56:24] we should get it resolved what the best practice is, though [18:56:51] indeed [18:58:39] New patchset: Bhartshorne; "partman recipe for swift storage nodes with ssds is broken; removing ms-be5+ until I get a fixed recipe." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12211 [18:59:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12211 [18:59:28] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12211 [18:59:31] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12211 [18:59:54] can someone review this for me? [18:59:55] https://gerrit.wikimedia.org/r/#/c/12181/ [18:59:57] it is a quick and easy one [19:00:09] LeslieCarr maybe? [19:00:48] so i actually don't know what the "file" command does in this instance, can you explain ? [19:01:01] i'm just removing a log file [19:01:05] it is in place on all udp2log instances now [19:01:06] I thought there was some reason that all udp2log servers kept the 1000-sampled log. [19:01:11] robla says no [19:01:17] huh. [19:01:27] we only need it on one, but I think whenever people set it up they just kept it on all [19:01:37] drdee_ said to keep it on emery and locke, so we'd ahve one backup, i guess [19:01:40] but we don't need it on all 3 [19:02:12] agree with ottomata [19:02:34] LeslieCarr, the file command just tells udp2log to write logs to a file [19:02:43] the 1000 means 1/1000 of the lines go to the file [19:03:03] I am just removing the sampled-1000 log from oxygen, as we don't need it there [19:03:17] ok [19:03:52] New review: Lcarr; "this removes the log file, which is unneeded on oxygen" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12181 [19:03:54] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12181 [19:04:17] maplebed: i merge your config ? [19:04:23] yaeh. 
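Picking up the maxlag thread from above: maxlag asks api.php to refuse work whenever the most-lagged slave exceeds the given threshold, returning error code "maxlag" (with HTTP 200, as in Krinkle's paste) instead of performing the request, which is why every bot edit was bouncing off db33's lag. Below is a rough sketch of a client that honors it, assuming the standard api.php endpoint and an arbitrary 60-second retry; it is not pywikibot's actual code (there the knob is simply the maxlag setting in user-config.py, default 5, as noted above).

    # Illustrative maxlag-aware API client -- a sketch, not pywikibot.
    # Endpoint URL and retry interval are assumptions for the example.
    import json
    import time
    import urllib.parse
    import urllib.request

    API = "https://commons.wikimedia.org/w/api.php"  # assumed endpoint

    def api_get(params, maxlag=5, retries=5, wait=60):
        """Call the API with maxlag; on a 'maxlag' error, wait and retry."""
        params = dict(params, format="json", maxlag=maxlag)
        for _ in range(retries):
            url = API + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as resp:
                data = json.loads(resp.read().decode("utf-8"))
            err = data.get("error", {})
            if err.get("code") == "maxlag":
                # e.g. info: "Waiting for 10.0.6.43: 204 seconds lagged"
                print("lagged:", err.get("info"), "- sleeping", wait, "s")
                time.sleep(wait)
                continue
            return data
        raise RuntimeError("gave up: slaves stayed lagged")

    # Example: the per-database lag report itself (roughly what a helper
    # like the @replag bot prints), via siteinfo's dbrepllag property.
    # print(api_get({"action": "query", "meta": "siteinfo",
    #                "siprop": "dbrepllag", "sishowalldb": 1}))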
[19:04:31] merging now [19:05:47] dankeeee [19:05:48] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/12213 [19:06:06] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/12213 [19:06:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12213 [20:09:41] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 5 seconds [20:09:58] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 4 seconds [20:10:13] Krinkle: ^ there you go :p [20:10:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:10:22] wee [20:10:26] @replag [20:10:28] Krinkle: [s4] db33: 3s [20:12:58] testing livehack on rt [20:21:13] PROBLEM - Apache HTTP on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:40] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:40] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:40] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:40] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.18:11000 (timeout) [20:21:49] PROBLEM - Apache HTTP on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:49] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:49] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:49] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:58] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:58] PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:58] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:58] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:58] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:59] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:59] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:00] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:00] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:01] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:01] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:02] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:02] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:03] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:07] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:07] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:07] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:07] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:07] PROBLEM - Apache HTTP 
on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:08] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:08] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:16] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:17] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:17] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:25] PROBLEM - SSH on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:34] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:52] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:10] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:10] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:10] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:10] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:19] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39112 bytes in 1.074 seconds [20:23:19] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:19] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:19] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:19] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:20] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:20] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:21] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:21] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:22] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:22] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:23] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:23] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:28] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:28] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:28] PROBLEM - Apache HTTP on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:37] PROBLEM - Apache HTTP on mw56 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:37] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:37] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39122 bytes in 4.564 seconds [20:23:46] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:46] PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:22] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:31] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [20:24:31] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:49] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:34] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:37] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 60758 bytes in 1.602 seconds [20:26:46] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:27:10] wtf? [20:27:56] someone disconnected a rack? [20:28:07] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 48282 bytes in 6.469 seconds [20:28:16] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:28:52] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 38929 bytes in 7.259 seconds [20:29:19] PROBLEM - Apache HTTP on srv197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:37] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:55] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 38920 bytes in 0.171 seconds [20:30:49] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:07] eep [20:32:10] RECOVERY - Apache HTTP on srv197 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.009 second response time [20:32:15] ok, let me see if maybe switch flipped out ? [20:32:28] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:37] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39119 bytes in 3.317 seconds [20:32:42] LeslieCarr: yeah, it looks like traffic dropped across the baord [20:33:21] esp for lvs 3 and 4 [20:33:47] althoguh might be upstream fo that [20:33:49] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39115 bytes in 2.600 seconds [20:33:58] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 48482 bytes in 2.670 seconds [20:33:58] 3 and 4 ? 
[20:34:25] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:34:34] that's what graphs look like [20:35:10] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 48472 bytes in 0.678 seconds [20:35:26] looking at all the net gear now [20:35:28] and rack tables [20:35:34] this has all the signs of network issue [20:35:41] or someone unplugging a whole rack ;) [20:35:50] chris has said that that is not the case [20:35:56] and I'm inclined to believe him [20:37:43] PROBLEM - Apache HTTP on srv241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:56] i would believe chris [20:38:02] totally [20:38:28] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:44] especially since i am not seeing tons of ethernet interfaces having flapped, as i would with physical powerdown/powerup [20:39:13] RECOVERY - Apache HTTP on srv241 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.152 second response time [20:39:49] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:49] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39112 bytes in 5.053 seconds [20:40:16] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:43] PROBLEM - Apache HTTP on srv242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:41:37] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39122 bytes in 1.459 seconds [20:41:46] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 48289 bytes in 0.159 seconds [20:42:04] RECOVERY - Apache HTTP on srv242 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.380 second response time [20:42:29] why are page load times so long for diffs [20:42:32] and page history [20:42:49] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.877 second response time [20:43:33] srv268 is dying [20:43:45] from the load [20:43:51] spiking at 60 or so [20:44:11] i'm not seeing anything that looks like actual physical issues [20:44:19] PROBLEM - NTP on srv268 is CRITICAL: NTP CRITICAL: No response from NTP server [20:44:28] PROBLEM - MySQL Idle Transactions on db38 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:44:36] i wonder, could anything else cause some sort of cascading failure -- overloading of one or more srv's or mw's ? 
[20:44:37] srv268 dead I guess [20:44:41] (since some appeared overloaded) [20:45:52] so, the wierd part is that it's mostly the mw's that aren't responding on 80 [20:45:59] only a couple of the srvs are dead [20:46:07] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:46:13] which makes me thing something physically localized, be it switch, power, etc [20:46:34] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:47:28] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 48482 bytes in 0.907 seconds [20:47:28] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:48:40] mobile is down - at least for pages needing to hit the apaches [20:48:59] binasher: cached pages seem fine [20:49:13] binasher: anything that involves an apache is having issues [20:49:25] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 38966 bytes in 2.069 seconds [20:49:41] they're on different switches which seem fine … how many active connections is too many for one of the apaches ? [20:49:43] lvs4 is the active appserver lvs server [20:49:47] i guess this has been going on for around 30min? [20:49:51] pybal is complaining of too many downs [20:49:56] binasher: yes [20:49:59] first alert :21 [20:50:11] ah, i never got paged [20:50:19] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39159 bytes in 4.755 seconds [20:50:20] me neither [20:50:24] i didn't get paged either [20:50:24] not being on irc was so blissful [20:50:37] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 11 seconds [20:50:46] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:50:56] hrm, this is happening on esams as well ? [20:51:31] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:51:31] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 3 seconds [20:52:07] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39163 bytes in 1.415 seconds [20:52:13] I'm in mw32 right now [20:52:23] it's awfully slow, but its load is 2 [20:52:38] its http is slow that is [20:52:41] the machine is responsive [20:52:48] even for GET / HTTP/1.0 [20:52:52] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 38973 bytes in 0.965 seconds [20:52:56] multiple seconds slow [20:53:12] i'm on mw21 , a telnet localhost 80, get / is also incredibly slow [20:53:19] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:22] (mw21 is different floor, different rack) [20:53:37] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:44] was anything deployed recently? 
[20:53:51] like, application-wise [20:54:29] https://gerrit.wikimedia.org/r/#/c/12209/ [20:54:41] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 48473 bytes in 0.746 seconds [20:54:46] https://gerrit.wikimedia.org/r/#/c/12044/ [20:55:16] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:55:25] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:55:25] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:55:26] enwiki job queue is lockalicious.. killing some stuff [20:56:19] RECOVERY - MySQL Idle Transactions on db38 is OK: OK longest blocking idle transaction sleeps for 16 seconds [20:56:21] thats all cleared [20:56:28] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60226 bytes in 4.727 seconds [20:56:34] wee [20:56:46] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 48289 bytes in 0.150 seconds [20:56:49] would a memcache server being borked cause any of these errors ? [20:56:54] no [20:56:58] one day! [20:56:59] no,asher's right [20:57:06] jobqueue. is a common thread [20:57:13] ah :) [20:57:20] binasher: what was blocked? [20:57:26] srv268 seems to have its own issues [20:57:30] yeah [20:57:58] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:58:07] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 48282 bytes in 0.123 seconds [20:58:42] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 38966 bytes in 0.135 seconds [20:58:55] srv268 should be pulled from memc - leslie could be right too, considering the current mw mc client.. slot 0 could contain the keys with enwiki replag info [20:59:25] binasher: could mw32 be extremely slow because it can't contact srv268's memcached? [20:59:26] i'll switch srv268 out [20:59:27] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:59:29] that being unavailable and a couple other special keys could make mw get abusive [20:59:38] ah [20:59:38] using my magic powers of "i learned how to memcache server swap" [20:59:54] go go gadget spof! [21:00:48] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 60758 bytes in 1.310 seconds [21:01:12] LeslieCarr: so? 
[21:01:15] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39159 bytes in 1.976 seconds [21:02:02] I'm logging into the management of srv268 [21:02:09] LeslieCarr: deploy deploy deploy [21:02:19] sorry, committing first now [21:02:26] and kill its power [21:02:42] it's fast again [21:02:54] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [21:02:57] no it's not [21:03:39] !log powercycling srv268; unreachable due to load spike [21:03:45] Logged the message, Master [21:04:06] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:04:29] deploying now [21:04:40] phew [21:04:42] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.192 second response time [21:04:51] RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.611 second response time [21:04:54] !log replaced srv268 with srv245 in memcached list [21:04:59] Logged the message, Mistress of the network gear. [21:05:00] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [21:05:00] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.147 second response time [21:05:00] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.200 second response time [21:05:00] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.326 second response time [21:05:00] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.421 second response time [21:05:09] yep, things recovered [21:05:17] mw32 is fast again [21:05:18] so, [21:05:20] wtf??? [21:05:43] 1) a single server being slow killing everything? [21:05:49] so binasher i know that fixed it, can you explain to us why it fixed it ? [21:06:03] because of memcached hashing I presume [21:06:06] so we're not flailing around in the dark as much :) [21:06:30] 2) getting alerts just for foundation-lb & wikiversity-lb while enwiki was obviously slow? [21:06:36] 3) not getting paged? [21:06:37] i think the site started to recover when i killed some locking stuff in the enwilki master, not 30 seconds ago [21:06:44] this wasn't confined to enwiki [21:06:54] Jasper_Deng: that doesn't matter at all [21:06:56] binasher: mw32 was very very very slow on GET / [21:07:00] binasher: until 30s ago [21:07:10] as in 5-6 seconds slow [21:07:46] ok, i believe you [21:07:59] we might havd had /two/ issues though [21:08:06] i was going off the vip alerts clearing and my being able to use the website again [21:08:50] R=smart_route defer (-1): lookup of host "lily.esams.wikipedia.org" failed in smart_route router [21:08:59] yay for broken mail in nagios [21:09:13] ahha [21:09:21] i don't think it paged because most of these have critical => false [21:09:25] paravoid: were you providing a host header for you GET / to mw32? [21:09:31] LeslieCarr: don't the lb vips page? [21:09:37] binasher: no [21:09:44] binasher: initially I did, but was failing for localhost too [21:09:49] a lot of them don't .. looking at line 845 and above on lvs.pp [21:09:50] so I was just trying localhost [21:09:52] why was it trying to use "lily.esams.wikipedia.org" ? 
[21:10:02] # exipick -i|wc -l [21:10:02] 92 [21:10:08] mails in spence's queue [21:11:16] route_list = * mchenry.wikimedia.org:lily.esams.wikipedia.org [21:11:17] yay [21:12:06] it should be lily.esams.wikimedia.org [21:12:10] dunno why it didn't use mchenry though, looking [21:12:32] doesn't reply to 25 from spence [21:12:43] but otherwise correct Platonides, thanks [21:16:28] do we have a backup smtp relay? [21:17:00] (lily is decomissioned) [21:17:27] binasher: also, do you know why srv268's slowness caused this? [21:19:52] yeah, mc slot 0 is what the key "enwiki:lag_times:db38" maps to, which is one of the top requested keys on srv245 now.. where db38 = the enwiki master.. that key is requested on every enwiki article view and since srv268 wasn't all the way down and since our cray cray memc client which is about to be replaced doesn't handle timeouts well, tons of apache children were probably waiting up to a tcp timeout + then waiting on the enwiki master bei [21:19:52] slow [21:19:59] the funny thing is [21:20:17] it appears to be a mediawiki bug that there's a lag_time memcache key for the master [21:20:53] not a new bug either [21:21:00] what's the lag_time memcache key? [21:21:43] paravoid: there's one for every slave, i.e. [21:21:45] VALUE frwiki:lag_times:db31 1 67 [21:21:45] a:5:{i:0;i:0;i:1;i:0;i:2;i:1;i:3;i:0;s:9:"timestamp";i:1340227282;} [21:21:57] mediawiki has a lot of logic around not sending queries to lagged slaved [21:22:02] slaves [21:22:11] oh [21:22:20] but there's no need to do so for the master [21:22:22] slave status is shared via memc for up to a few seconds [21:22:27] whuich is why it's a bug [21:22:29] okay [21:22:30] yeah [21:22:31] but [21:23:42] well, i guess i can't bitch too much more about our memc client. this may have been its final outage before retirement! [21:24:07] heh [21:24:09] btw, since you're here [21:24:15] I have another question about igbinary [21:24:34] opjectcache::MemcacheClient - yes, that was a challenge [21:24:47] as far as I understand, it's a module to do faster serialization of data [21:25:03] to put php data structures into memcache [21:25:18] but does it keep the same format? [21:25:23] it replaces php serialize/unserialize [21:25:35] no, totally incompatible [21:25:40] so, when we migrate to that [21:25:47] we're going to have to invalidate everything [21:25:51] its binary, as the name implies :) [21:25:58] in our memcaches [21:26:30] and essentially run with a cold cache [21:26:33] or am I missing something? [21:26:41] yep - we also have to invalidate everything when we switch to the new client since the hashing algorithm is going to be ketama (consistent!) and when we switch to the new memcached servers and shut down every currently running instance of memcached, returning that ram to php [21:26:51] so [21:27:06] this is all going to happen at once, where at once is over time but as part of a single elongated deploy [21:27:15] aha [21:27:25] are you sure robla is aware of that? [21:27:46] he made it sound like "installe php5-memcached on the servers and we'll just enable turn a flag to 1 on MW" [21:28:10] j/k, i'm going to press a button to do it all at once, get on plane, and i'll check in a week later when the dust settles [21:28:58] :-) [21:29:07] so, back to topic [21:29:22] paravoid: i have no idea what robla, i'm not sure if any of the managers know at a deep technical level. 
but tim knows, who wrote the client, aaron knows, who might do the deploy, i know, you know, terry knows who has been supporting the project in management meetings, etc. [21:29:56] do we have a second MX to have for spence's pages instead of lily? [21:30:09] paravoid: you'll have to get used to not taking managers literally any time they talk about how anything is going to work at a technical level, especially in ops. [21:30:13] and who's going to prepare the incident report? I can, but I'll just copy what binasher said :) [21:31:02] paravoid: wait, todays incident report, or the one in a few weeks when we switch to a bunch of new shit? ;) [21:33:38] when the hashing algorithm doesn't change, it could be possible to make the format dependant of the target server [21:33:56] would lead to a bit confusing code, but easy to do [21:35:20] paravoid: for todays, i was only online or aware of what was going on for the last 1/4th of the action.. and you're really good at writing non-snarky incident responses.. so if you could, that'd be awesome [21:35:36] sure, if you're cool with me quoting you :) [21:36:39] ok, ok [21:38:34] Platonides: yup. that we're replacing the servers memcached runs on is kind of fortunate in terms of not making the serialization change any costlier or more complex [21:43:43] the big problem is the hashing change [21:43:55] otherwise, it could be seamlessly done one at a time [21:45:02] (I don't see why ibinary couldn't additionally accept serialize format as a fallback, though) [21:45:43] the hashing change doesn't have to be a problem either [21:45:58] if we can figure out how to do a very staggered deploy [21:47:40] I can't figure out why spence's exim failed [21:47:55] maybe a merge from test->production that Ryan did? [21:48:04] well, how to do a staggered deploy? :) [21:48:16] I'm thinking, it would be possible to make a double-objectcache [21:48:25] our git history is a fucking mess [21:48:26] which stores in two child memcached servers [21:48:31] !log freezing many bounce messages on mchenry (all older than 2400 minutes) [21:48:35] merge every second commit [21:48:36] Logged the message, Mistress of the network gear. [21:48:48] double win given that the new memcached would be on different servers [21:49:05] then after X time you just flip the kind used for reads [21:49:06] !log see RT3170 for more details on above change and mchenry pain [21:49:11] Logged the message, Mistress of the network gear. [21:50:04] LeslieCarr: you know exipick? [21:50:10] paravoid: do not [21:50:17] should i research this for awesome ? [21:50:17] have a look [21:50:20] yes. [21:50:26] man exipick [21:50:32] it's installed with exim4 so it should be everywhere [21:52:35] Platonides: if we enable the new client + servers over, say a 24 hour period, where at first only a few servers are using the new stuff, those servers will be slow for a bit, but won't overload the mysql parsercache or result in much of a user impact, while starting to populate the new cache [21:52:41] binasher: btw, this affected all services? or just text? [21:53:14] binasher, they could be serving old stuff [21:53:26] because invalidations from the other servers won't touch their memcached [21:53:28] and viceversa [21:53:57] Platonides: we don't really explicitly invalidate anything in memcached except for sessions [21:54:05] (for now) [21:54:11] o rly? [21:54:13] but we perform replaces [21:54:25] binasher: and any idea on why we got alerts for just foundation & wikiversity? 
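Platonides' "double-objectcache" is essentially a write-through wrapper over both pools with a switch for which one serves reads: writes and deletes go to old and new alike, and once the new pool has warmed up you flip reads over. A minimal sketch of the idea, assuming generic client objects exposing get/set/delete (illustrative names, not MediaWiki's actual object-cache classes):

    # Minimal dual-write cache wrapper -- a sketch of the migration idea only.
    # `old` and `new` can be any clients with get/set/delete (for example two
    # python-memcached instances pointed at the old and new pools).
    class DualWriteCache:
        def __init__(self, old, new, read_from_new=False):
            self.old = old
            self.new = new
            self.read_from_new = read_from_new  # flip once the new pool is warm

        def set(self, key, value, ttl=0):
            # Every write lands in both pools during the overlap window.
            self.old.set(key, value, ttl)
            self.new.set(key, value, ttl)

        def delete(self, key):
            self.old.delete(key)
            self.new.delete(key)

        def get(self, key):
            primary = self.new if self.read_from_new else self.old
            return primary.get(key)

Because every set and delete hits both pools during the overlap, flipping read_from_new later cannot serve anything staler than the old pool would have, which sidesteps the stale-read scenario raised just below.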
[21:54:29] (there are probably deletes, though) [21:56:01] there are deletes, but only user related [21:56:03] "only" [21:56:42] centralauth:login-token and wiki:user:id are the only classes of keys we delete [21:57:01] binasher suppose you add the different memcached on 2% servers [21:57:29] then someone edits [[Obama]] on an "old" server to say "Obama is gay" [21:57:41] someone visits on a "new" server, it gets loaded into the new memcached [21:57:48] it is reverted on "old" server [21:58:05] 2% of users, hitting the new memcached get the "Obama is Gay" [21:58:19] moreover, you need to make the purge to happen on a new server [21:58:30] or it wouldn't work [21:59:34] (perhaps there's some extra check for staleness, but even then, the same problem would be at different -smaller- places) [21:59:58] the double-write objectcache seems better [22:00:09] as far as there's memory available for that [22:00:24] which if they'll live on new servers, there will be [22:01:21] Platonides: we don't purge revision / parser cache [22:02:08] afaik, we do a database query every single time (grr!!!!) to see if what's cached is valid [22:02:16] paravoid/binasher i have a thought as to why we only got those pages -- perhaps they are visited so rarely their main page wasn't in cache … and all the other sites we use the main page (which would be in varnish) [22:02:25] we do purge image rows in cache [22:02:35] and user rows [22:03:06] i haven't seen an image key delete yet but do see the user deletes [22:03:08] using multwrite for a brief while seems like it would avoid this stuff [22:03:18] yeah [22:03:43] if the transition will take hours, it would pretty useful then [22:03:59] we probably don't want a huge window of iffy data [22:04:47] LeslieCarr: we should also have gotten paged by watchmouse which monitors enwiki random and also has a check that should bypass squid via login + cookies… or at least used to. could someone check out watchmouse? [22:06:17] another (not necessarily better) option would be to set all ttls to short values during the switch [22:07:01] binasher: what it the switch goes bad? ;P [22:07:09] excitement? [22:07:16] heh [22:07:22] secure.wikimedia.org went down [22:07:46] now? [22:07:55] 20:45 [22:08:16] 1h 20m ago [22:08:18] which lines up [22:08:24] but it only texted a few people [22:08:54] i got the secure page, and an ok a few minutes later [22:10:24] i'm not on it … added myself [22:10:29] oh dear, what a shame ;) [22:10:38] faidon, want to get paged by watchmouse ? [22:11:07] you tell me [22:11:59] well, while nobody wants a page, i think you're knowledgeable enough to get paged and fix shit [22:12:07] yeah, I'm kidding [22:12:10] sure, add me [22:12:28] hehe [22:12:46] it's not like I pay for received SMS [22:12:47] emotions come through so badly on irc [22:12:53] *cough* *cough* [22:14:02] LeslieCarr: send paravoid ALL of the spam messages [22:14:55] ok, set up watchmouse …. and also redirected all cronspam to your sms ;) [22:16:41] LeslieCarr: did watchmouse pick up an outage? [22:17:40] binasher: only for secure.wm [22:17:45] hrm [22:17:58] i wonder if the "logged in" check is broken [22:20:19] New patchset: Faidon; "Replace lily with lists as a backup relay" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12279 [22:20:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12279 [22:27:20] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/12279 [22:27:24] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12279 [23:10:50] New review: Andrew Bogott; "It occurs to me to wonder... can I approve this patch myself? And if I can, isn't that somewhat dan..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11892 [23:19:38] New review: Platonides; "It depends if you have a +2 option in code review." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/11892 [23:22:39] binasher: sorry for not mentioning that, completely forgot about it, didn't mean to misquote you [23:27:37] New patchset: Ottomata; "Adding Evan Rosen. RT 3119" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12299 [23:28:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/12299 [23:29:34] paravoid: looking more, i think the srv268 pull was more important to ending the outage. the db bit reduced 503's from approximately 1000/sec to 666/sec, but it took the mc.php deploy to drop back to around 0 [23:32:15] so you're saying srv268 was possessed ? [23:32:25] yes! [23:32:31] binasher: ah, just replied [23:32:40] i am surprised/saddened that we didn't get the alert that a memcache machine is down [23:32:56] binasher: that the alerts came back after your clearing [23:32:59] btw [23:33:06] what did you to do change mc.php? [23:33:18] where is mc.php and how did you deploy it? [23:33:31] LeslieCarr: i thought i saw the memcache alert go ok after your mc.php deploy [23:33:47] /h/w/c/wmf-config/mc.php [23:33:48] so it should have alerted in irc before i logged in? [23:33:57] http://wikitech.wikimedia.org/view/Memcached [23:34:09] maybe i should mention that it needs to be git pushed too [23:34:28] oops, slapped with the fine manual [23:34:37] my bad [23:34:41] well most of the time manual doesn't always work :) [23:34:47] just this time I had fixed it a bit and it does work [23:34:48] hehe [23:34:53] binasher: and how did you count the 503s? [23:35:24] sorry for all the questions, I'm trying to get better prepared for the next outge :-) [23:36:23] questions are good! [23:36:35] binasher: what's mysql do? [23:36:35] http://gdash.wikimedia.org/dashboards/reqerror/ [23:37:00] binasher: coool! [23:37:09] Ryan_Lane: it's a sophisticated engine for introducing random sleep values into applications [23:37:14] paravoid: does it make sense now ? [23:37:14] :D [23:37:28] LeslieCarr: what does? [23:37:39] binasher: and memcache? [23:38:06] oh the page [23:38:10] i switched a few things [23:38:20] ah, sure, yes :) [23:38:21] i want ot make sure someone can look at http://wikitech.wikimedia.org/view/Memcached and fix memcache [23:39:00] yay :) [23:39:30] Ryan_Lane: memcache is magic smoke that makes social nosql webscale in the cloud [23:39:38] \o/ [23:39:59] isn't that memsql? [23:40:00] I like the answers for my questions [23:41:32] hehe i like this graph http://gdash.wikimedia.org/dashboards/reqerror/ [23:41:48] I like the graph, don't like much its output [23:41:49] i mean, in a way, we did even better since we reduced our 4xx responses ;) [23:42:04] i.e. 
the thousands 503s [23:47:17] binasher: https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blobdiff;f=templates/varnish/mobile-frontend.inc.vcl.erb;h=6fc0b6be72d754e6846a44cda97c80c2d61cc731;hp=051c33cd8d319e6dfdc53a82bc0c7ca8f0b6ad9d;hb=23482bb8bb9662d4bf68ad62071054682dbb4989;hpb=f67b3277269199083bf76b156d654a6357f7998f