[00:00:03] no [00:01:42] New patchset: Pyoungmeister; "yet more apache upgrades" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34450 [00:03:28] !log depooling mw53-mw49 for upgrade to apache [00:03:34] Logged the message, notpeter [00:03:45] !log depooling mw66-mw99 for upgrade to apache [00:03:50] !log maxsem synchronized php-1.21wmf3/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/#/c/34440/' [00:03:51] Logged the message, notpeter [00:03:57] Logged the message, Master [00:06:19] !log maxsem synchronized php-1.21wmf4/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/#/c/34440/' [00:06:26] Logged the message, Master [00:09:36] New patchset: Asher; "path fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34451 [00:11:09] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34451 [00:13:52] * notpeter is bringin' the nagios alerts [00:14:23] woo, https statistics [00:14:34] binasher: this will include ipv6 hits too though, wouldn't it? [00:15:18] ipv6 isn't a lot of traffic, but https isn't either [00:16:07] PROBLEM - Host mw66 is DOWN: PING CRITICAL - Packet loss = 100% [00:16:16] PROBLEM - Host mw35 is DOWN: PING CRITICAL - Packet loss = 100% [00:16:34] PROBLEM - Host mw67 is DOWN: PING CRITICAL - Packet loss = 100% [00:16:43] PROBLEM - Host mw36 is DOWN: PING CRITICAL - Packet loss = 100% [00:17:10] PROBLEM - Host mw68 is DOWN: PING CRITICAL - Packet loss = 100% [00:17:19] PROBLEM - Host mw37 is DOWN: PING CRITICAL - Packet loss = 100% [00:17:46] PROBLEM - Host mw69 is DOWN: PING CRITICAL - Packet loss = 100% [00:17:55] PROBLEM - Host mw38 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:20] paravoid: i think it will be both and won't distinguish between nginx sites [00:18:22] PROBLEM - Host mw39 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:40] PROBLEM - Host mw40 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:40] and just realized we are down 2/4 hosts in esams and have no data on how nginx is actually doing [00:18:40] PROBLEM - Host mw43 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:40] PROBLEM - Host mw41 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:59] yes, Ryan_Lane was messing with it yesterday [00:19:27] sounds like one of the hosts didn't come back up after a reboot or something like that [00:19:30] I told ma rk our morning, since the mgmt lan for that rack is out and he said somethinga bout going to esams this week [00:19:39] yes [00:20:25] yeah the status page doesn't distinguish between vhosts [00:20:28] PROBLEM - Host mw44 is DOWN: PING CRITICAL - Packet loss = 100% [00:21:04] PROBLEM - Host mw45 is DOWN: PING CRITICAL - Packet loss = 100% [00:21:58] RECOVERY - Host mw66 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [00:22:07] RECOVERY - Host mw35 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [00:22:16] PROBLEM - Host mw47 is DOWN: PING CRITICAL - Packet loss = 100% [00:22:25] RECOVERY - Host mw67 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [00:22:34] RECOVERY - Host mw36 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [00:22:43] PROBLEM - Host mw48 is DOWN: PING CRITICAL - Packet loss = 100% [00:22:52] RECOVERY - Host mw68 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [00:23:01] RECOVERY - Host mw37 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [00:23:15] Ryan_Lane: are you still debugging this or should I file a RT for esams? 
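The "https statistics" and status page discussed above around [00:14:23]-[00:20:25] presumably come from nginx's stub_status module, which only exports aggregate per-instance counters; that is why it cannot break traffic out per vhost, and so cannot separate the IPv6 and HTTPS sites either. A minimal sketch of querying it, assuming a stub_status location is exposed locally (the port, path and numbers here are illustrative, not the production values):

    $ curl -s http://127.0.0.1:8080/nginx_status
    Active connections: 291
    server accepts handled requests
     16630948 16630948 31070465
    Reading: 6 Writing: 179 Waiting: 106

A ganglia plugin built on top of this, like the one merged later at [01:50:27], can therefore only report totals per server, not per site.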
[00:23:28] RECOVERY - Host mw69 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [00:23:28] PROBLEM - Host mw49 is DOWN: PING CRITICAL - Packet loss = 100% [00:23:38] RECOVERY - Host mw38 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [00:23:46] PROBLEM - Memcached on mw42 is CRITICAL: Connection refused [00:24:04] RECOVERY - Host mw39 is UP: PING OK - Packet loss = 0%, RTA = 1.87 ms [00:24:22] RECOVERY - Host mw43 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [00:24:22] RECOVERY - Host mw41 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [00:24:22] RECOVERY - Host mw40 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [00:24:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34450 [00:24:27] debugging what? [00:24:38] this :) [00:24:55] paravoid: the mgmt stuff? [00:24:56] ssl300x [00:24:58] PROBLEM - Apache HTTP on mw42 is CRITICAL: Connection refused [00:25:02] yes [00:25:05] apparently the mgmt switches aren't manages [00:25:07] managed* [00:25:12] someone will need to go to the datacenter for it [00:25:16] PROBLEM - SSH on mw42 is CRITICAL: Connection refused [00:25:17] nothing to be done about it [00:25:26] PROBLEM - Memcached on mw46 is CRITICAL: Connection refused [00:25:26] PROBLEM - SSH on mw35 is CRITICAL: Connection refused [00:25:26] PROBLEM - Apache HTTP on mw46 is CRITICAL: Connection refused [00:25:27] well, RT ticket in the esams queue? [00:25:30] doing it [00:25:34] I did that when I noticed it [00:25:46] oh [00:25:46] sorry [00:25:49] and ssl3004 has been down for months [00:25:52] PROBLEM - Apache HTTP on mw35 is CRITICAL: Connection refused [00:25:55] yeah [00:26:01] PROBLEM - Apache HTTP on mw66 is CRITICAL: Connection refused [00:26:01] PROBLEM - SSH on mw67 is CRITICAL: Connection refused [00:26:10] PROBLEM - SSH on mw68 is CRITICAL: Connection refused [00:26:10] PROBLEM - Apache HTTP on mw36 is CRITICAL: Connection refused [00:26:10] PROBLEM - Apache HTTP on mw67 is CRITICAL: Connection refused [00:26:10] RECOVERY - Host mw44 is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [00:26:20] PROBLEM - Apache HTTP on mw68 is CRITICAL: Connection refused [00:26:20] PROBLEM - SSH on mw36 is CRITICAL: Connection refused [00:26:28] PROBLEM - Memcached on mw36 is CRITICAL: Connection refused [00:26:28] PROBLEM - Memcached on mw35 is CRITICAL: Connection refused [00:26:37] PROBLEM - SSH on mw46 is CRITICAL: Connection refused [00:26:37] PROBLEM - Memcached on mw37 is CRITICAL: Connection refused [00:26:46] PROBLEM - SSH on mw66 is CRITICAL: Connection refused [00:26:46] RECOVERY - Host mw45 is UP: PING OK - Packet loss = 0%, RTA = 3.06 ms [00:26:52] http://ganglia.wikimedia.org/latest/?c=SSL%20cluster%20esams&h=ssl3002.esams.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [00:26:55] PROBLEM - SSH on mw69 is CRITICAL: Connection refused [00:26:55] PROBLEM - SSH on mw38 is CRITICAL: Connection refused [00:27:04] PROBLEM - SSH on mw37 is CRITICAL: Connection refused [00:27:23] PROBLEM - Apache HTTP on mw38 is CRITICAL: Connection refused [00:27:23] PROBLEM - Apache HTTP on mw39 is CRITICAL: Connection refused [00:27:31] PROBLEM - Apache HTTP on mw37 is CRITICAL: Connection refused [00:27:40] PROBLEM - SSH on mw41 is CRITICAL: Connection refused [00:27:40] PROBLEM - Apache HTTP on mw69 is CRITICAL: Connection refused [00:27:43] binasher: btw, we can use http 1.1 for the backend in nginx now [00:27:48] which also means connection pooling [00:27:49] PROBLEM - Memcached on mw39 is CRITICAL: Connection refused [00:27:49] 
PROBLEM - Memcached on mw38 is CRITICAL: Connection refused [00:27:58] RECOVERY - Host mw47 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [00:28:07] PROBLEM - SSH on mw43 is CRITICAL: Connection refused [00:28:08] I haven't started to test that yet [00:28:16] PROBLEM - SSH on mw40 is CRITICAL: Connection refused [00:28:25] PROBLEM - Apache HTTP on mw40 is CRITICAL: Connection refused [00:28:25] RECOVERY - Host mw48 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [00:28:34] PROBLEM - Memcached on mw40 is CRITICAL: Connection refused [00:28:34] PROBLEM - Memcached on mw41 is CRITICAL: Connection refused [00:28:43] PROBLEM - Memcached on mw43 is CRITICAL: Connection refused [00:28:43] PROBLEM - SSH on mw39 is CRITICAL: Connection refused [00:28:52] PROBLEM - Apache HTTP on mw41 is CRITICAL: Connection refused [00:28:52] PROBLEM - Apache HTTP on mw43 is CRITICAL: Connection refused [00:29:10] RECOVERY - Host mw49 is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [00:29:37] PROBLEM - SSH on mw44 is CRITICAL: Connection refused [00:30:04] PROBLEM - Apache HTTP on mw44 is CRITICAL: Connection refused [00:30:05] RECOVERY - SSH on mw66 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:30:05] PROBLEM - Memcached on mw45 is CRITICAL: Connection refused [00:30:08] it's kind of amazing how the box uses half of the memory it used to use before the precise reinstall [00:30:13] RECOVERY - SSH on mw35 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:30:13] PROBLEM - Apache HTTP on mw45 is CRITICAL: Connection refused [00:30:16] yep [00:30:22] and half the cpu [00:30:25] not that the cpu was much used [00:30:50] RECOVERY - SSH on mw67 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:30:58] PROBLEM - Memcached on mw44 is CRITICAL: Connection refused [00:30:58] RECOVERY - SSH on mw68 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:31:07] RECOVERY - SSH on mw36 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:31:18] and it's even using half the memory with all of ssl3001's load added up [00:31:25] PROBLEM - SSH on mw45 is CRITICAL: Connection refused [00:31:25] PROBLEM - SSH on mw47 is CRITICAL: Connection refused [00:31:27] well, half of it anyway [00:31:43] RECOVERY - SSH on mw69 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:31:43] PROBLEM - Memcached on mw47 is CRITICAL: Connection refused [00:31:43] PROBLEM - Apache HTTP on mw48 is CRITICAL: Connection refused [00:31:43] PROBLEM - Memcached on mw48 is CRITICAL: Connection refused [00:31:43] RECOVERY - SSH on mw38 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:31:52] RECOVERY - SSH on mw37 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:32:01] RECOVERY - SSH on mw39 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:32:01] PROBLEM - Apache HTTP on mw47 is CRITICAL: Connection refused [00:32:02] hmm, less processes [00:32:05] ah [00:32:08] that's odd [00:32:12] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=SSL+cluster+esams&h=ssl3002.esams.wikimedia.org&v=414&m=proc_total&jr=&js=&vl=+&ti=Total+Processes [00:32:12] did that change in puppet? [00:32:28] RECOVERY - SSH on mw41 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:32:32] hm. does it use http 1.1 by default now? 
[00:32:53] nope [00:33:00] so that shouldn't be the difference [00:33:13] PROBLEM - SSH on mw49 is CRITICAL: Connection refused [00:33:13] RECOVERY - SSH on mw40 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:33:13] PROBLEM - SSH on mw48 is CRITICAL: Connection refused [00:33:22] PROBLEM - Memcached on mw49 is CRITICAL: Connection refused [00:33:28] we actually run way more processes than is necessary [00:33:31] PROBLEM - Apache HTTP on mw49 is CRITICAL: Connection refused [00:33:48] the default nginx config changed for one of the ms boxes and the ssl servers got the change along with it [00:34:34] RECOVERY - SSH on mw44 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:34:34] RECOVERY - SSH on mw43 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:34:34] RECOVERY - SSH on mw45 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:34:43] RECOVERY - SSH on mw46 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:35:01] RECOVERY - SSH on mw42 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:36:13] RECOVERY - SSH on mw47 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:36:31] RECOVERY - SSH on mw48 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:37:38] !log preilly synchronized php-1.21wmf3/extensions/ZeroRatedMobileAccess 'update post deploy' [00:37:45] Logged the message, Master [00:38:01] RECOVERY - SSH on mw49 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:39:58] !log preilly synchronized php-1.21wmf4/extensions/ZeroRatedMobileAccess 'update post deploy' [00:40:06] Logged the message, Master [00:41:46] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.013 seconds [00:41:46] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.010 seconds [00:41:55] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.009 seconds [00:41:55] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [00:42:04] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [00:44:55] PROBLEM - NTP on mw35 is CRITICAL: NTP CRITICAL: Offset unknown [00:46:34] PROBLEM - NTP on mw36 is CRITICAL: NTP CRITICAL: No response from NTP server [00:46:52] PROBLEM - NTP on mw37 is CRITICAL: NTP CRITICAL: No response from NTP server [00:46:52] PROBLEM - NTP on mw68 is CRITICAL: NTP CRITICAL: No response from NTP server [00:47:37] PROBLEM - NTP on mw41 is CRITICAL: NTP CRITICAL: No response from NTP server [00:47:46] PROBLEM - NTP on mw40 is CRITICAL: NTP CRITICAL: No response from NTP server [00:48:04] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [00:48:31] PROBLEM - NTP on mw43 is CRITICAL: NTP CRITICAL: Offset unknown [00:50:01] PROBLEM - NTP on mw45 is CRITICAL: NTP CRITICAL: No response from NTP server [00:50:28] PROBLEM - NTP on mw44 is CRITICAL: NTP CRITICAL: No response from NTP server [00:53:08] RECOVERY - NTP on mw43 is OK: NTP OK: Offset -0.006959319115 secs [00:53:08] PROBLEM - NTP on mw66 is CRITICAL: NTP CRITICAL: Offset unknown [00:53:26] PROBLEM - NTP on mw48 is CRITICAL: NTP CRITICAL: No response from NTP server [00:54:02] RECOVERY - NTP on mw35 is OK: NTP OK: Offset -0.008198618889 secs [00:54:29] PROBLEM - NTP on mw67 is CRITICAL: NTP CRITICAL: Offset unknown [00:54:47] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [00:55:14] RECOVERY - 
Apache HTTP on mw44 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [00:55:23] PROBLEM - NTP on mw69 is CRITICAL: NTP CRITICAL: No response from NTP server [00:55:32] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [00:55:32] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [00:55:32] RECOVERY - Apache HTTP on mw67 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [00:55:41] PROBLEM - NTP on mw38 is CRITICAL: NTP CRITICAL: No response from NTP server [00:56:35] PROBLEM - NTP on mw42 is CRITICAL: NTP CRITICAL: No response from NTP server [00:59:53] PROBLEM - NTP on mw49 is CRITICAL: NTP CRITICAL: No response from NTP server [01:00:02] PROBLEM - NTP on mw46 is CRITICAL: NTP CRITICAL: No response from NTP server [01:07:41] RECOVERY - NTP on mw40 is OK: NTP OK: Offset -0.05681073666 secs [01:07:50] RECOVERY - NTP on mw48 is OK: NTP OK: Offset 0.08070385456 secs [01:08:26] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.004 seconds [01:08:53] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [01:09:11] RECOVERY - NTP on mw66 is OK: NTP OK: Offset -0.004572749138 secs [01:09:11] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.008 seconds [01:09:56] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.014 seconds [01:09:56] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [01:11:44] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [01:21:11] RECOVERY - NTP on mw41 is OK: NTP OK: Offset 0.01474070549 secs [01:22:05] RECOVERY - NTP on mw68 is OK: NTP OK: Offset -0.0178142786 secs [01:22:23] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.008 seconds [01:22:41] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [01:22:41] RECOVERY - NTP on mw36 is OK: NTP OK: Offset -0.01192617416 secs [01:23:08] RECOVERY - NTP on mw67 is OK: NTP OK: Offset -0.009715795517 secs [01:23:26] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [01:23:26] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [01:23:53] RECOVERY - NTP on mw44 is OK: NTP OK: Offset -0.01133227348 secs [01:34:05] RECOVERY - NTP on mw42 is OK: NTP OK: Offset -0.07879531384 secs [01:34:32] RECOVERY - NTP on mw38 is OK: NTP OK: Offset -0.03637301922 secs [01:35:44] RECOVERY - NTP on mw46 is OK: NTP OK: Offset -0.06795620918 secs [01:37:05] RECOVERY - NTP on mw49 is OK: NTP OK: Offset -0.007266640663 secs [01:37:14] RECOVERY - NTP on mw37 is OK: NTP OK: Offset -0.0003443956375 secs [01:37:14] RECOVERY - NTP on mw45 is OK: NTP OK: Offset -0.01063299179 secs [01:40:05] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 267 seconds [01:40:50] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 311 seconds [01:43:23] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:45:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [01:47:06] New patchset: Asher; "ganglia plugin for nginx stats" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34467 [01:50:27] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34467 [01:50:53] RECOVERY - NTP on mw69 
is OK: NTP OK: Offset -0.006066560745 secs [01:54:47] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [02:00:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 267 seconds [02:01:14] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 320 seconds [02:08:44] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [02:21:11] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [02:24:47] !log LocalisationUpdate completed (1.21wmf4) at Wed Nov 21 02:24:47 UTC 2012 [02:24:54] Logged the message, Master [02:30:56] RECOVERY - SSH on ms-be7 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:31:05] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [02:45:42] !log LocalisationUpdate completed (1.21wmf3) at Wed Nov 21 02:45:42 UTC 2012 [02:45:52] Logged the message, Master [02:54:47] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [02:57:38] RECOVERY - Puppet freshness on analytics1002 is OK: puppet ran at Wed Nov 21 02:57:12 UTC 2012 [03:01:05] !log depooling mw20-mw34 for upgrade to precise [03:01:13] Logged the message, notpeter [03:01:36] !log depooling mw70-mw74 for upgrade to precise [03:01:42] Logged the message, notpeter [03:05:08] RECOVERY - Puppet freshness on ms-fe1 is OK: puppet ran at Wed Nov 21 03:04:53 UTC 2012 [03:05:59] gonna make a lot of nagios noise (in this channel I mean) [03:10:24] PROBLEM - Host mw70 is DOWN: PING CRITICAL - Packet loss = 100% [03:10:32] PROBLEM - Host mw20 is DOWN: PING CRITICAL - Packet loss = 100% [03:10:50] PROBLEM - Host mw22 is DOWN: PING CRITICAL - Packet loss = 100% [03:10:50] PROBLEM - Host mw21 is DOWN: PING CRITICAL - Packet loss = 100% [03:11:08] PROBLEM - Host mw71 is DOWN: PING CRITICAL - Packet loss = 100% [03:11:08] PROBLEM - Host mw72 is DOWN: PING CRITICAL - Packet loss = 100% [03:11:24] New patchset: Faidon; "partman: fix grub install on non-/dev/sda" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34476 [03:11:24] New patchset: Faidon; "partman: fixes to ms-be with SSDs recipe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34477 [03:11:24] New patchset: Faidon; "partman: prepare for two flavors of ms-be" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34478 [03:11:40] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34476 [03:11:51] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34477 [03:11:58] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34478 [03:12:38] PROBLEM - Host mw25 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:25] New patchset: Pyoungmeister; "last apache upgrades" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34479 [03:15:03] a few hours more down the partman/d-i drain [03:15:10] working around bugs [03:15:21] PROBLEM - Host mw30 is DOWN: PING CRITICAL - Packet loss = 100% [03:15:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34479 [03:15:56] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [03:15:57] paravoid: :( [03:16:05] PROBLEM - Apache HTTP on mw73 is CRITICAL: Connection refused [03:16:05] PROBLEM - Memcached on mw24 is CRITICAL: Connection refused [03:16:05] RECOVERY - Host mw70 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms 
[03:16:14] RECOVERY - Host mw20 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [03:16:32] RECOVERY - Host mw22 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [03:16:32] RECOVERY - Host mw21 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [03:16:32] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [03:16:32] PROBLEM - Host mw32 is DOWN: PING CRITICAL - Packet loss = 100% [03:16:41] PROBLEM - Host mw34 is DOWN: PING CRITICAL - Packet loss = 100% [03:16:41] PROBLEM - Host mw33 is DOWN: PING CRITICAL - Packet loss = 100% [03:16:50] PROBLEM - Apache HTTP on mw26 is CRITICAL: Connection refused [03:16:50] RECOVERY - Host mw71 is UP: PING OK - Packet loss = 0%, RTA = 1.55 ms [03:16:50] RECOVERY - Host mw72 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [03:16:59] PROBLEM - Apache HTTP on mw74 is CRITICAL: Connection refused [03:16:59] PROBLEM - SSH on mw73 is CRITICAL: Connection refused [03:16:59] PROBLEM - SSH on mw74 is CRITICAL: Connection refused [03:17:08] PROBLEM - SSH on mw26 is CRITICAL: Connection refused [03:17:17] PROBLEM - SSH on mw24 is CRITICAL: Connection refused [03:17:26] PROBLEM - Apache HTTP on mw24 is CRITICAL: Connection refused [03:17:29] notpeter: I wonder if scheduling nagios downtimes would help in such cases [03:17:40] well, I know they would, I wonder if we should start using them [03:17:44] PROBLEM - Memcached on mw26 is CRITICAL: Connection refused [03:17:53] PROBLEM - Memcached on mw27 is CRITICAL: Connection refused [03:17:53] PROBLEM - SSH on mw27 is CRITICAL: Connection refused [03:18:11] PROBLEM - SSH on mw28 is CRITICAL: Connection refused [03:18:20] PROBLEM - Memcached on mw28 is CRITICAL: Connection refused [03:18:20] RECOVERY - Host mw25 is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [03:18:29] PROBLEM - Apache HTTP on mw29 is CRITICAL: Connection refused [03:18:42] paravoid: I'm fine either way [03:18:50] ma_rk has been anti in the past [03:18:56] PROBLEM - Memcached on mw29 is CRITICAL: Connection refused [03:18:56] PROBLEM - Apache HTTP on mw28 is CRITICAL: Connection refused [03:18:58] I generally don't care as long as it's not paging [03:19:01] why? [03:19:06] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused [03:19:12] he, presonally, likes to see things go down and come up [03:19:20] aha [03:19:29] I like to check the nagios.w.o page when I'm doing osmething like this [03:19:32] PROBLEM - Memcached on mw20 is CRITICAL: Connection refused [03:19:39] as it's a better way to check 20 apaches at once [03:19:41] iunno [03:19:41] PROBLEM - SSH on mw29 is CRITICAL: Connection refused [03:19:47] I have no opinoins on this [03:20:17] PROBLEM - SSH on mw72 is CRITICAL: Connection refused [03:20:17] PROBLEM - Apache HTTP on mw72 is CRITICAL: Connection refused [03:20:17] PROBLEM - Apache HTTP on mw21 is CRITICAL: Connection refused [03:20:22] I can silence things if people, genearlly, would prefer it [03:20:26] PROBLEM - Memcached on mw22 is CRITICAL: Connection refused [03:20:26] PROBLEM - SSH on mw20 is CRITICAL: Connection refused [03:20:26] PROBLEM - SSH on mw71 is CRITICAL: Connection refused [03:20:30] (he says, on his last round of upgrades....) 
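For reference, the downtime scheduling being debated at [03:17:29]-[03:20:30] is normally driven through Nagios's external command file rather than clicking through the web UI host by host. A rough sketch, assuming the command pipe sits at /var/lib/nagios/rw/nagios.cmd (that path, the host name, the author and the two-hour window are all illustrative):

    # silence the host check and every service check on mw20 for two hours
    now=$(date +%s); end=$((now + 7200))
    cmd=/var/lib/nagios/rw/nagios.cmd
    printf '[%s] SCHEDULE_HOST_DOWNTIME;mw20;%s;%s;1;0;7200;notpeter;precise reinstall\n' \
        "$now" "$now" "$end" > "$cmd"
    printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;mw20;%s;%s;1;0;7200;notpeter;precise reinstall\n' \
        "$now" "$now" "$end" > "$cmd"

Looping the same two commands over a batch like mw20-mw34 before a reinstall round would keep the channel quiet without touching the alert definitions themselves.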
[03:20:35] PROBLEM - SSH on mw70 is CRITICAL: Connection refused [03:20:35] PROBLEM - Apache HTTP on mw70 is CRITICAL: Connection refused [03:20:35] PROBLEM - Apache HTTP on mw71 is CRITICAL: Connection refused [03:20:35] PROBLEM - Apache HTTP on mw20 is CRITICAL: Connection refused [03:20:38] ((until 2 year from now.......)) [03:20:44] PROBLEM - Memcached on mw21 is CRITICAL: Connection refused [03:20:44] PROBLEM - SSH on mw21 is CRITICAL: Connection refused [03:21:02] RECOVERY - Host mw30 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [03:21:11] PROBLEM - SSH on mw22 is CRITICAL: Connection refused [03:21:11] PROBLEM - Apache HTTP on mw22 is CRITICAL: Connection refused [03:21:38] PROBLEM - SSH on mw25 is CRITICAL: Connection refused [03:21:38] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [03:21:56] PROBLEM - Memcached on mw25 is CRITICAL: Connection refused [03:22:14] RECOVERY - Host mw32 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [03:22:14] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [03:22:23] RECOVERY - Host mw34 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [03:22:23] RECOVERY - Host mw33 is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [03:22:39] yeah I know you're finishing off, this was a bigger "I wonder..." [03:22:58] ah, gotcha [03:23:08] PROBLEM - Apache HTTP on mw25 is CRITICAL: Connection refused [03:23:21] the thing that I think is more useful is making notes in the nagios.w.o page [03:23:26] for things that are being worked on [03:23:28] etc [03:23:44] RECOVERY - SSH on mw70 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:24:20] PROBLEM - SSH on mw30 is CRITICAL: Connection refused [03:24:20] RECOVERY - SSH on mw22 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:24:38] PROBLEM - Memcached on mw30 is CRITICAL: Connection refused [03:25:23] RECOVERY - SSH on mw20 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:25:23] RECOVERY - SSH on mw71 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:25:23] RECOVERY - SSH on mw72 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:25:23] PROBLEM - Apache HTTP on mw30 is CRITICAL: Connection refused [03:25:23] RECOVERY - SSH on mw73 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:25:32] RECOVERY - SSH on mw21 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:25:59] PROBLEM - Memcached on mw31 is CRITICAL: Connection refused [03:25:59] PROBLEM - SSH on mw31 is CRITICAL: Connection refused [03:26:08] PROBLEM - SSH on mw34 is CRITICAL: Connection refused [03:26:08] PROBLEM - Apache HTTP on mw32 is CRITICAL: Connection refused [03:26:17] PROBLEM - Apache HTTP on mw34 is CRITICAL: Connection refused [03:26:26] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused [03:26:26] PROBLEM - Memcached on mw34 is CRITICAL: Connection refused [03:26:26] RECOVERY - SSH on mw25 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:26:26] PROBLEM - Memcached on mw33 is CRITICAL: Connection refused [03:26:35] PROBLEM - Apache HTTP on mw33 is CRITICAL: Connection refused [03:26:44] PROBLEM - Memcached on mw32 is CRITICAL: Connection refused [03:27:02] RECOVERY - SSH on mw24 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:27:02] RECOVERY - SSH on mw26 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:27:02] PROBLEM - SSH on mw33 is CRITICAL: Connection refused [03:27:11] RECOVERY - SSH on mw74 is OK: SSH OK - OpenSSH_5.9p1 
Debian-5ubuntu1 (protocol 2.0) [03:27:11] PROBLEM - SSH on mw32 is CRITICAL: Connection refused [03:27:11] PROBLEM - SSH on ms-be7 is CRITICAL: Connection refused [03:27:47] RECOVERY - SSH on mw27 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:27:47] RECOVERY - SSH on mw28 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:29:08] RECOVERY - SSH on mw30 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:29:17] RECOVERY - SSH on mw31 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:29:26] RECOVERY - SSH on mw29 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:30:11] RECOVERY - SSH on mw33 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:30:29] RECOVERY - SSH on mw32 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:30:56] RECOVERY - SSH on mw34 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:35:26] RECOVERY - Apache HTTP on mw73 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.008 seconds [03:35:33] New patchset: Tim Starling; "Add timeouts to RMI communications" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/34481 [03:35:35] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [03:36:47] RECOVERY - SSH on ms-be7 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:37:14] PROBLEM - NTP on mw73 is CRITICAL: NTP CRITICAL: Offset unknown [03:39:29] PROBLEM - NTP on mw21 is CRITICAL: NTP CRITICAL: Offset unknown [03:39:38] PROBLEM - NTP on mw70 is CRITICAL: NTP CRITICAL: Offset unknown [03:40:32] PROBLEM - NTP on mw72 is CRITICAL: NTP CRITICAL: No response from NTP server [03:41:17] PROBLEM - NTP on mw22 is CRITICAL: NTP CRITICAL: No response from NTP server [03:42:36] New review: Tim Starling; "I only tested compilation. It would be nice if someone could test this properly." 
[operations/debs/lucene-search-2] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/34481 [03:42:57] RECOVERY - NTP on mw21 is OK: NTP OK: Offset 0.02046966553 secs [03:45:56] PROBLEM - NTP on mw33 is CRITICAL: NTP CRITICAL: Offset unknown [03:46:05] RECOVERY - NTP on mw22 is OK: NTP OK: Offset -0.03259980679 secs [03:47:26] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:47:53] PROBLEM - NTP on mw71 is CRITICAL: NTP CRITICAL: No response from NTP server [03:48:20] RECOVERY - NTP on mw73 is OK: NTP OK: Offset 0.05045318604 secs [03:48:56] PROBLEM - NTP on mw24 is CRITICAL: NTP CRITICAL: Offset unknown [03:49:14] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [03:49:41] PROBLEM - NTP on mw74 is CRITICAL: NTP CRITICAL: Offset unknown [03:49:41] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [03:50:44] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [03:50:54] PROBLEM - NTP on mw28 is CRITICAL: NTP CRITICAL: Offset unknown [03:51:20] PROBLEM - NTP on mw26 is CRITICAL: NTP CRITICAL: Offset unknown [03:52:14] PROBLEM - NTP on mw29 is CRITICAL: NTP CRITICAL: Offset unknown [03:52:14] PROBLEM - NTP on mw32 is CRITICAL: NTP CRITICAL: Offset unknown [03:52:23] PROBLEM - NTP on mw30 is CRITICAL: NTP CRITICAL: Offset unknown [03:53:44] RECOVERY - NTP on mw24 is OK: NTP OK: Offset -0.00827395916 secs [03:53:53] RECOVERY - NTP on mw32 is OK: NTP OK: Offset -0.006015896797 secs [03:54:02] RECOVERY - NTP on mw28 is OK: NTP OK: Offset -0.008961677551 secs [03:57:02] RECOVERY - NTP on mw29 is OK: NTP OK: Offset -0.01075148582 secs [03:58:59] RECOVERY - NTP on mw33 is OK: NTP OK: Offset -0.007286667824 secs [04:00:56] RECOVERY - NTP on mw71 is OK: NTP OK: Offset -0.0475628376 secs [04:00:56] RECOVERY - NTP on mw26 is OK: NTP OK: Offset -0.005421996117 secs [04:01:59] RECOVERY - NTP on mw30 is OK: NTP OK: Offset -0.007488250732 secs [04:02:26] RECOVERY - NTP on mw74 is OK: NTP OK: Offset -0.001455903053 secs [04:03:29] RECOVERY - NTP on mw70 is OK: NTP OK: Offset -0.01495325565 secs [04:03:29] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [04:12:30] New patchset: Tim Starling; "Add timeouts to RMI communications" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/34481 [04:13:41] RECOVERY - NTP on mw72 is OK: NTP OK: Offset -0.008569478989 secs [04:29:31] apergos: ms-be7 is done, preseeding should work for new systems, see git [04:29:44] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [04:30:26] apergos: I didn't do puppet though; could you set it up and put it in the rings (along with setting 100 for ms-be6) while I'm sleeping? 
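For context on the request just above: "putting it in the rings" means adding the new host's devices to the swift ring builder files and rebalancing, and the staged weights discussed later (33 to 66 to 100, around [07:10]-[07:19]) are plain set_weight calls. A minimal sketch with swift-ring-builder, assuming the builder files are in the working directory; the zone, IP, port and device names are illustrative rather than the real ms-be6/ms-be7 values:

    # add one of ms-be7's devices at an initial weight of 33
    # (repeat per device, and likewise for account.builder and container.builder)
    swift-ring-builder object.builder add z3-10.0.6.207:6000/sda3 33
    # later, step the weights up: ms-be7 33 -> 66, ms-be6 66 -> 100
    swift-ring-builder object.builder set_weight z3-10.0.6.207 66
    swift-ring-builder object.builder set_weight z2-10.0.6.206 100
    # recompute partition placement and push the new *.ring.gz files to every node
    swift-ring-builder object.builder rebalance

A single rebalance moves at most one replica of any given partition, which is the "one replica per rebalance" point mentioned at [07:18:37].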
[05:06:47] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [06:11:35] PROBLEM - Lucene on search14 is CRITICAL: Connection timed out [06:14:35] RECOVERY - Lucene on search14 is OK: TCP OK - 2.995 second response time on port 8123 [06:24:47] PROBLEM - Lucene on search14 is CRITICAL: Connection timed out [06:27:47] RECOVERY - Lucene on search14 is OK: TCP OK - 0.003 second response time on port 8123 [06:35:44] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [06:35:44] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:35:44] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [06:37:50] PROBLEM - Lucene on search14 is CRITICAL: Connection timed out [06:55:37] paravoid: ms-be6 is not done moving the data around, you can see this from the df and from this: http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=network_report&s=by+name&c=Swift+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [06:56:55] RECOVERY - Lucene on search14 is OK: TCP OK - 0.002 second response time on port 8123 [06:58:31] also it looks like you wasted the same hours I did on walking through the grub-installer code, I guess you did not see the backread in here, sorry about that [07:08:08] didn't we agree that I was going to look at it? [07:08:31] yes but then you were gone (or doing other things) for a long time [07:08:35] so I went aheaa [07:08:46] long time? that was just yesterday [07:08:50] yes [07:08:53] and I fixed it yesterday [07:09:07] I'm not complaining that you fixed it [07:09:25] so, I think ms-be6 is close to the target [07:09:30] and if you're going to add ms-be7 [07:09:41] why not make it 100 for ms-be6 too [07:10:09] alternatively you can add ms-be7 at weight 33 and then on friday or so make ms-be7 33->66, ms-be6 66->100 [07:10:28] the df on ms-be1 shows 1.3t in use, on ms-b6 only 540gb in use, that seems pretty far off of 66% [07:11:01] per disk [07:11:33] ok [07:11:51] I was looking at ms7 cruft yesterday too [07:11:56] ok great [07:12:25] it basically needs a half hour's work [07:12:38] cleanup the wiki page + file up a couple of bugzillas or so [07:12:50] ah not to do them ourselves you mean [07:12:51] for extdist and captcha I think [07:13:04] no, for the MW needed code changes [07:13:58] there were a few things which I wasn't sure about [07:14:33] pybaltestfile.txt you had a plan for this [07:15:05] yeah let's fix the bigger problems first [07:15:11] 404 see the comments, I have no idea aobut that [07:15:25] (they are in the list of cruft) [07:15:53] and the docroot [07:17:26] anyways after the grub-installer stuff this was next on my list [07:17:37] what's the difference between hits and hits__ ? 
in ganglia [07:18:37] therefore after I put ms-be7 in the rings and maybe move ms-be6 to 70% (to catch 'we only move one replica per reblanace' issues), I'll be doublechecking all these [07:18:54] I don't think 66->70 makes much sense (but it won't hurt either) [07:19:05] I think we can just do 33->66 on 7 and 66->100 on 6 at the same time [07:19:11] end of this week possibly [07:19:36] ah I guess the rebalance will happen anyways since we are adding ms-be7 [07:19:39] so yes [07:20:03] no point to moev to 70 [07:20:19] PROBLEM - Lucene on search14 is CRITICAL: Connection timed out [07:20:28] well you should be sleeping I think [07:21:40] RECOVERY - Lucene on search14 is OK: TCP OK - 0.006 second response time on port 8123 [07:21:41] !log restarted lucene search on search14 [07:21:48] Logged the message, Master [07:36:27] * Aaron|home can't wait till swift 1.7.5 [07:38:58] paravoid: is the next upgrade scheduled? [07:39:11] no [07:39:31] we're not going to move to 1.7.5 immediately, since it's the grizzly release [07:39:41] also, no way in hell I'm going to do an upgrade 4 days before the fundraiser :) [07:41:34] * Aaron|home would love to see that double GET fix make it in, but ah well [07:43:10] indeed, it's nasty [07:50:28] !log repooling mw70-mw74 [07:50:34] Logged the message, notpeter [07:52:49] New patchset: Pyoungmeister; "typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34491 [07:53:24] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34491 [07:53:46] hashar: see my comment on bug 41778 [07:54:04] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [07:54:26] Jasper_Deng: please give me the context / link :-) [07:55:06] http://bugzilla.wikimedia.org/show_bug.cgi?id=41778 [07:55:10] I replied to your comment [07:55:34] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [07:55:43] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [07:56:01] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [07:57:13] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [07:57:47] Jasper_Deng: oooo :-] [07:58:07] Jasper_Deng: I am not Asher Feldman, his nickname is binasher and he is working from the San Francisco office. [07:58:07] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [07:58:15] Jasper_Deng: I am Antoine Musso, working on CI :-] [07:58:20] continuous integration [07:58:21] oh [07:59:02] but still, shouldn't ipblocks have gotten fixed on all wikis rather than broken on the majority of them? [07:59:04] Jasper_Deng: in case he is not reading the bug notification, you might want to drop him an email too [07:59:25] * Jasper_Deng doesn't know email addresses of WMF sysadmins [07:59:36] Asher email is in bugzilla [08:00:00] as for the ipblocks / ipv6 stuff, I have no knowledge of this issue and would prefer avoiding spending one hour figuring it out :-] [08:00:13] afeldman@wikimedia.org ? 
[08:00:17] yeah [08:00:37] if the issue is still happening, you might want to reopen the bug :-] [08:00:41] I think he's already CC'd [08:00:48] but all that matters is that the bug is fisxed [08:04:56] https://wikitech.wikimedia.org/view/Schema_changes [08:06:49] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.010 seconds [08:06:58] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.007 seconds [08:07:16] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [08:08:10] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [08:08:21] what could've explained the undersize ipblocks column size? [08:08:55] does it matter? :) [08:09:02] I think it should just be added to that list [08:09:09] with a link to that bugzilla [08:09:18] hashar might know more. [08:09:59] is the bug fixed or not ? [08:10:06] I don't understand what is your request Jasper_Deng_away [08:10:09] oh away [08:10:10] ) [08:10:20] I think he needs a db migration. [08:10:27] * Jasper_Deng_away doesn't need one [08:12:05] okay, never mind me then [08:16:52] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [08:17:01] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [08:17:37] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.001 seconds [08:18:13] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [08:19:52] RECOVERY - Puppet freshness on ms-be7 is OK: puppet ran at Wed Nov 21 08:19:22 UTC 2012 [08:20:27] paravoid: do you happen to know if there are any training courses for Debian packaging ? [08:20:57] you mean professional courses? [08:20:58] paravoid: I thought that instead of either asking you to package or ranting about how I am unable to do any package, I might as well train myself :)) [08:21:02] no I don't think so [08:21:06] there are a few good presentations [08:21:10] yeah with a root@debian.org or something :) [08:21:18] a what? [08:21:34] I meant, receiving a training by someone actually knowing what he is talking about [08:21:47] ah, because there are only 6 root@d.o [08:22:01] ahh [08:22:10] http://git.debian.org/?p=collab-maint/packaging-tutorial.git;a=blob_plain;f=packaging-tutorial.pdf;hb=refs/heads/pdf [08:22:13] I probably meant package manager [08:22:41] I have been using Debian for a loooong time, still don't know the community / organization :( [08:22:54] well [08:22:57] I'll be in Paris this weekend [08:23:04] there's a France mini-DebConf [08:23:12] lots of french Debian people :) [08:23:21] you're welcome to attend [08:23:26] http://fr2012.mini.debconf.org/ [08:23:37] oh man [08:24:33] :-) [08:25:02] Debian's role in establishing an alternative to Skype [08:25:04] \O/ [08:25:25] over 100 people attending [08:25:34] will talk to my wife about it [08:25:50] maybe she could enjoy a week-end in Paris [08:26:10] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [08:26:17] are you staying there after the conf or just for the 2 days? 
[08:26:19] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [08:27:49] I'm staying until Monday evening [08:28:28] I have a quite a full schedule though [08:28:32] I can imagine [08:29:42] the guy who wrote the packaging tutorial I gave you above (apt-get install packaging-tutorial too) [08:29:55] RECOVERY - swift-object-server on ms-be7 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [08:30:01] is French and attending the conf [08:30:04] RECOVERY - swift-container-server on ms-be7 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [08:30:10] ;-) [08:30:22] RECOVERY - swift-container-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:30:22] RECOVERY - swift-account-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [08:30:23] RECOVERY - swift-container-updater on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [08:30:40] RECOVERY - swift-container-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [08:30:40] RECOVERY - swift-account-server on ms-be7 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [08:30:40] RECOVERY - swift-account-reaper on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [08:30:49] RECOVERY - swift-object-updater on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [08:31:07] RECOVERY - swift-object-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [08:31:16] RECOVERY - swift-object-auditor on ms-be7 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:32:43] yay [08:33:22] RECOVERY - swift-account-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [08:33:40] not completely, sdm3 and sdn3 aren't in fstab or have mount points etc. but almost yay :-D [08:33:55] yeah, known issue [08:34:02] needs fixing [08:34:09] yes indeed [08:37:56] paravoid: thanks for the tutorial :-] [08:43:16] RECOVERY - NTP on ms-be7 is OK: NTP OK: Offset -0.0154916048 secs [08:44:11] [08:45:58] got a tiny typo fix for you guys https://gerrit.wikimedia.org/r/#/c/34305 [08:46:18] some string had an extra $ which is unneeded and clutter a generated URL [08:46:23] not urgent htough [08:50:09] New patchset: ArielGlenn; "mount points and fstab for ssds on ms-be hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34494 [08:56:16] apergos: jenkins said it failed [08:56:19] do not merge [08:56:28] yeah I'm trying to find the typo [08:56:33] and it can'tmerge it anyways [08:56:34] why did you verified it yourself? 
[08:56:52] the typo is that you're missing a : [08:56:58] cause I didn't notice until after I hit the button that I was given only the publish option [08:57:33] you should only select one of the the Code review radios, not the verify ones [08:57:48] ok, I don't undrstand that syntax (with the :), guess I better look it up [08:57:54] always leave 0 no score for verified unless there's a very good reason not to [08:58:12] ok well we were told early on to verify, I suppose that changed at some point [08:58:14] so typically you do the code review part, and jenkins does the verified part [09:08:36] New patchset: MaxSem; "Switch Translate to ext:Solaruim" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34495 [09:18:56] !log repooling mw20-mw34 [09:19:02] Logged the message, notpeter [09:50:28] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:51:58] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [09:58:36] notpeter: ping? [09:59:55] i guess apergos is looking into kaulen? [09:59:57] if I can't get onto the box I'll power cycle it I guess [10:00:04] (it's taking a long time) [10:00:11] ah I got a front page [10:00:19] * jeremyb too [10:00:26] fairly snappy now [10:00:26] I'm on [10:00:56] load is dropping [10:02:05] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Miscellaneous+pmtpa&h=kaulen.wikimedia.org&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 looks like it had a lot of incoming there for a bit, hm [10:02:48] hrmmmm, watchmouse ain't recovering so fast [10:02:54] i wonder why [10:19:40] things look pretty normal over there and on the db so I'm going to call it good [10:26:50] New patchset: Jeremyb; "bug 42319 - throttle.php: bclwiki bday event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34499 [10:37:09] New patchset: ArielGlenn; "mount points and fstab for ssds on ms-be hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34494 [10:38:24] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34494 [10:49:07] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [10:55:03] New patchset: Silke Meyer; "Added puppet files for Wikidata on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30593 [10:56:30] New review: Silke Meyer; "Replaced the spaces with tabs." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/30593 [11:13:07] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [12:22:07] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [12:56:01] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [13:13:11] New patchset: Demon; "Configure changeMerge features for Gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34516 [13:13:35] New review: Demon; "Please don't merge until we deploy 2.5/2.6." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/34516 [14:05:08] New patchset: ArielGlenn; "provide for xfs filesystem labels without making the filesystem (ms-bexx)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34525 [14:33:28] New review: Jens Ohlig; "Looking good :)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/30593 [14:46:20] !log Halted nas1001-b for NIC upgrade [14:46:27] Logged the message, Master [14:48:56] whenever someone context switch, would you mind merging a typo I made to a zuul config file? https://gerrit.wikimedia.org/r/#/c/34305/ ;) [14:49:03] it is not in production, not going to kill anything [14:49:08] thanks in advance ! ;-] [14:49:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34305 [14:52:45] New review: Andrew Bogott; "Thanks -- looks good." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/30593 [14:52:47] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30593 [14:56:33] Bedankt mark ! [14:58:32] alsjeblieft [15:08:10] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [15:11:57] hello guys [15:12:02] I need access on stat1001 please [15:12:12] I need that to publish some reports for wikistats [15:12:17] I'm part of the wikimedia-analytics team [15:12:41] my username is spetrea, I already have access to stat1, build1 and build2 [15:13:16] paravoid: hello, can you please help me with this ? [15:13:21] mark apergos paravoid : could you grant average_drifter an access on stat1001 ? account in puppet is spetrea [15:13:22] ;) [15:14:03] paravoid , apergos please let me know if you could help me out with this [15:20:23] is there an RT ticket? [15:22:30] mark: I can make one [15:22:41] mark: can you please give me a link to the RT ? [15:25:53] send an email to access-requests@rt.wikimedia.org with your request, the reason why you need access [15:25:58] your manager will need to approve as well [15:35:19] mark: [15:35:20] RT could not load a valid user, and RT's configuration does not allow [15:35:23] for the creation of a new user for your email. [15:35:26] mark: why does it say that ? [15:35:44] mark: I have 2 e-mails stefan.petrea@gmail.com and stefan@garage-coding.com [15:35:54] mark: could you please check, maybe the other one is in the database of RT ? [15:36:20] it should be your wikimedia.org address, if that account has been made [15:36:36] if that doesn't work, ask your manager to file the request I guess? [15:37:21] mark: I will try again with my other e-mail [15:38:28] mark: tried with my other e-mail and it says [15:38:30] No permission to create tickets in the queue 'access-requests' [15:39:39] which email is that? [15:40:27] mark: stefan@garage-coding.com [15:40:34] like I said, that's not gonna work [15:40:41] use your @wikimedia.org address if you have one [15:40:52] mark: I don't have one [15:40:52] if that's not setup, ask diederik to sort things out for you [15:41:15] drdee: can you please help with this ? 
[15:41:27] drdee: or we can wait for ottomata [15:41:40] as i said, let's talk with ottomata first, i think he can set this up for us [15:41:45] no he can't [15:41:49] you need to file an RT ticket [15:41:59] ottomata needs to follow processes just like everyone else [15:42:58] mark, i know, i am just saying ottomata has already stuff running that we can use so average_drifter does not need access to stat1001 in the first place [15:43:07] ok [15:49:54] !log nas1001-b back up and running [15:50:00] Logged the message, Master [15:50:07] !log Initiated takeover of nas1001-a to nas1001-b [15:50:14] Logged the message, Master [15:58:16] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [16:05:25] !log remove known troublemakers srv266 and srv284 from dsh group "apaches" [16:05:32] Logged the message, Master [16:08:42] !log powercycling srv284 [16:08:48] Logged the message, Master [16:12:04] RECOVERY - Host srv284 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [16:15:40] !log Aborted synchronous snapmirror relationships between nas1-a and nas1001-a to initiate cf giveback [16:15:47] Logged the message, Master [16:16:43] PROBLEM - Apache HTTP on srv284 is CRITICAL: Connection refused [16:23:02] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:51] !log Takeback by nas1001-a completed, snapmirror relationships back in sync [16:24:57] Logged the message, Master [16:28:34] PROBLEM - Host srv284 is DOWN: PING CRITICAL - Packet loss = 100% [16:29:08] !log reinstalling srv284 "last lucid standing" [16:29:14] Logged the message, Master [16:31:22] !seen ottomata [16:34:16] RECOVERY - Host srv284 is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms [16:37:07] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [16:37:07] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:37:07] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:37:43] PROBLEM - Memcached on srv284 is CRITICAL: Connection refused [16:38:32] !log powering on analytics1007 [16:38:39] Logged the message, Master [16:39:04] PROBLEM - SSH on srv284 is CRITICAL: Connection refused [16:53:01] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [16:53:01] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [16:53:20] PROBLEM - Host srv284 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:22] RECOVERY - SSH on srv284 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:00:31] RECOVERY - Host srv284 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:03:35] apergos: ping [17:03:48] Vito: ponngg [17:04:08] apergos: an user is reporting latex problems [17:04:12] missing texvc [17:04:33] is the user in a channell somewhere? [17:06:34] apergos: ^musaz is your man [17:06:47] ah [17:06:56] what project is this on ^musaz ? [17:07:05] <^musaz> it [17:07:30] and can you give me a step by step of what you did and the errors you saw? 
[17:07:55] in fact if you have an error page open right now that would be great to have the text and the page source [17:08:00] (pastebin) [17:08:05] <^musaz> apergos: in the last few days i'm in trouble with latex expressions [17:08:33] <^musaz> this problem is usually solved repeating the overview [17:08:40] <^musaz> since today [17:09:52] <^musaz> http://it.wikipedia.org/w/index.php?title=Amplificatore_operazionale&oldid=54075876 example [17:10:00] apergos: afaik he can solve the issue by makins several previews [17:10:09] hmm [17:10:35] could be one unappy server, the question is which one [17:10:39] <^musaz> it happens with all the expressions [17:10:47] ok [17:10:49] <^musaz> also the trivial ones [17:10:50] he wandered whatever it was someway related to his own system but I'm quite sure it depends on the server side [17:11:06] do you recall the error message at all? [17:11:14] apergos: maybe getting the srv # would be usefull, wouldn't it? [17:11:37] yes, that's why I was hoping [17:11:45] oh apergos I got it too [17:11:50] ah [17:11:58] so: page text? then, page source? [17:12:14] [17:12:28] Errore del parser (Eseguibile texvc mancante; per favore consultare math/README per la configurazione.): G = \frac{V_{\text{out}}} {V_{\text{in}}} = - \frac{R_{\text{f}}} {R_{\text{in}}} [17:13:04] "parser error (texvc executable missing; please read math/README for configuration.) [17:13:18] <^musaz> apergos: something similar to parse error (missing texvc , please visit math/README for the config [17:13:30] ok [17:14:01] brb [17:14:18] can one of you give me the source of the page where it is successful (just the 'Served by' part)? [17:14:48] so the first preview that works, look at the page source and find the line [17:22:48] thanks [17:24:40] PROBLEM - NTP on srv284 is CRITICAL: NTP CRITICAL: No response from NTP server [17:26:09] ok so [17:26:27] $wgTexvc = "/usr/local/apache/uncommon/$wmfVersionNumber/bin/texvc"; [17:26:41] this directory is now missing on some (all? precise?) hosts [17:35:56] mw1-16, srv193-198 (maybe 199), srv219-224 have [17:36:44] tmh1 and 2 have [17:36:53] and soe irrelevant boxes. [17:40:12] apergos: Easily fixed [17:40:44] it's in some package I guess? except maybe not any more [17:41:09] !log Running scap-recompile on all hosts in mediawiki-installation group [17:41:15] Logged the message, Master [17:41:21] ohh I dunno that script [17:41:34] Aaron moved it out of the normal script at some point [17:41:54] looking at it [17:42:15] How about now? [17:42:15] srv284: Host key verification failed. [17:42:15] meh, there's always 1 ;) [17:42:17] so probably this needs to be called separately after install of precise hosts? [17:42:21] (yes there is. always.) 
[17:42:31] Likely [17:42:37] I now run it when I deploy new mw versions [17:42:45] purely because it has to be built at least once [17:43:07] PROBLEM - Puppet freshness on mw70 is CRITICAL: Puppet has not run in the last 10 hours [17:43:07] PROBLEM - Puppet freshness on mw73 is CRITICAL: Puppet has not run in the last 10 hours [17:43:39] Vito: ^musaz: Try again [17:43:54] thanks for that, whew [17:44:01] PROBLEM - Puppet freshness on mw71 is CRITICAL: Puppet has not run in the last 10 hours [17:44:33] We didn't realise that was going to break after tim rebuilt the appserver package and deployed it [17:44:33] so we were WTFing at it for a while [17:45:04] PROBLEM - Puppet freshness on mw72 is CRITICAL: Puppet has not run in the last 10 hours [17:45:04] PROBLEM - Puppet freshness on mw74 is CRITICAL: Puppet has not run in the last 10 hours [17:46:19] hahaha [17:46:25] so you've already been through this, nice [17:47:17] About a month ago :D [17:47:25] sweet [17:49:24] New patchset: Hashar; "deploy zuul on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34555 [17:49:59] New review: Hashar; "Do not submit yet, need to be scheduled with ops." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/34555 [17:55:07] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:58:07] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [18:11:03] <^musaz> Reedy , apergos it works, thank you [18:11:21] yay! [18:29:30] apergos: apache can't add top-level dirs to /mnt/thumbs2 [18:29:45] * AaronSchulz needs a wikivoyage dir [18:30:01] uh [18:30:02] netapp? [18:30:10] yes [18:31:04] hm [18:32:18] ok done [18:32:26] no subdirs, just that one [18:34:15] !log prepared all wikivoyage file directories and containers [18:34:22] Logged the message, Master [18:34:25] that was fast [18:34:51] well it's not sharded or anything...not that that would take too long either [18:35:03] hey sometime later AaronSchulz I woudl love the 5 minute summary of how stuff that used to be in /mnt/upload7/private works in swift now as far as img_auth etc [18:35:23] I can see that images are retrieved out of swift but I couldn't figure out where the code path was [18:36:25] Change abandoned: Demon; "Not necessary after all--I did this via labsconsole like I should've." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32605 [18:36:46] Change abandoned: Krinkle; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22858 [18:37:06] New review: Krinkle; "nvm if it takes 3 months" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22858 [18:38:30] apergos: isn't that stuff still there? [18:38:41] well I mean [18:38:43] I mean we don't read from it though [18:38:54] I was trying to figure out how that stuff mapped to swift containers [18:38:58] and really I failed :-D [18:40:09] (yes the dir is on the netapp and is used afaik for writes) [18:40:52] have you seen http://wikitech.wikimedia.org/view/Swift/Swift_Container_Name_Conventions ? [18:41:07] it's slightly old in that it uses upload6, but is otherwise recentish [18:43:19] oohh private wikis [18:43:22] nope didn't see it [18:45:01] so that default stanza really does cover (almost) all of em [18:45:19] !log Copying over all math/timeline files to nas1 that are not already there for DR [18:45:26] Logged the message, Master [18:45:43] only thin I gotta figure out now is how, when I send an http://upload... 
url for an image on a private wiki, swift (I guess) does the right thing and denies me [18:45:48] apergos: suffice it to say the paths for nfs are a mess :) [18:46:11] yeoowww [18:46:59] hey [18:47:01] what's up? [18:47:08] apergos: rewrite doesn't do anything to check for private wikis (it does prevent viewing deleted stuff though), but the ACLs on private wikis require authentication [18:47:30] the ACLs for the deletion containers require auth on all wikis as well [18:47:33] yeah, do swift ... stat [18:47:50] apergos: there is a setZoneAccess script in WikimediaMaintenance that does this, as well as MW when it creates containers [18:47:52] and we don't serve upload.wm.org URLs in private wikis [18:47:52] ah speeking of swift commands [18:48:20] New patchset: Demon; "Removing some old cruft from gerrit manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34561 [18:48:39] where do I find the key/passwd these days, it seems not to be in the file as documented here: [18:48:40] http://wikitech.wikimedia.org/view/Swift/How_To#List_containers_and_contents [18:48:51] any more... [18:49:40] ( paravoid ) [18:49:59] New review: Demon; "I verified that neither of these crons exists on either of the gerrit boxes." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/34561 [18:50:08] and yes, I know we don't serve private media via upload.wm urls, that's why I checked one [18:51:18] apergos: PrivateSettings [18:51:36] paravoid: and we need to clean up the user/password sometime ;) [18:52:05] * AaronSchulz likes the long keys that radosgw makes [18:52:08] I didn't even know about this file [18:53:16] where does that get updated/commited? [18:55:19] AaronSchulz: [18:55:34] RECOVERY - Apache HTTP on srv284 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [18:55:38] !log re-signing srv284 on puppetmaster, package installs... [18:55:44] Logged the message, Master [18:56:18] apergos: svn in wmf-config [18:56:21] private svn [18:56:36] ok (I doubt I'd need to change it but better to know and not need it) [18:57:18] all right, off for the day [18:57:28] apergos: puppet private repo [19:00:45] ah not those ones, the ones in privatesettings I wanted. 
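The point being made about private wikis is that the rewrite middleware does not itself check wiki privacy (it only blocks deleted files); the container ACLs do, which is why an unauthenticated upload.wikimedia.org-style request gets denied. A minimal sketch of inspecting that with the stock swift client, where the auth URL, account:user, key and container name are all placeholders (the real credentials live in PrivateSettings, as noted above):

    # Show a container's metadata, including its Read/Write ACLs.
    # AUTH_URL, ACCOUNT:USER, SECRET_KEY and the container name are placeholders.
    swift -A http://swift-frontend.example/auth/v1.0 \
          -U ACCOUNT:USER -K SECRET_KEY \
          stat wikipedia-private-local-public
    # A private-wiki or deleted-files container should report a restrictive
    # "Read ACL:" rather than the world-readable ".r:*" used on public wikis.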
thanks though [19:02:28] New patchset: Demon; "Stop using gerrit2 for replication purposes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25508 [19:13:07] PROBLEM - Puppet freshness on mw20 is CRITICAL: Puppet has not run in the last 10 hours [19:13:07] PROBLEM - Puppet freshness on mw22 is CRITICAL: Puppet has not run in the last 10 hours [19:14:01] PROBLEM - Puppet freshness on mw21 is CRITICAL: Puppet has not run in the last 10 hours [19:14:01] PROBLEM - Puppet freshness on mw25 is CRITICAL: Puppet has not run in the last 10 hours [19:14:01] PROBLEM - Puppet freshness on mw24 is CRITICAL: Puppet has not run in the last 10 hours [19:14:01] PROBLEM - Puppet freshness on mw26 is CRITICAL: Puppet has not run in the last 10 hours [19:14:01] PROBLEM - Puppet freshness on mw28 is CRITICAL: Puppet has not run in the last 10 hours [19:14:02] PROBLEM - Puppet freshness on mw27 is CRITICAL: Puppet has not run in the last 10 hours [19:14:02] PROBLEM - Puppet freshness on mw31 is CRITICAL: Puppet has not run in the last 10 hours [19:14:03] PROBLEM - Puppet freshness on mw33 is CRITICAL: Puppet has not run in the last 10 hours [19:15:04] PROBLEM - Puppet freshness on mw29 is CRITICAL: Puppet has not run in the last 10 hours [19:15:04] PROBLEM - Puppet freshness on mw32 is CRITICAL: Puppet has not run in the last 10 hours [19:15:04] PROBLEM - Puppet freshness on mw34 is CRITICAL: Puppet has not run in the last 10 hours [19:15:30] New patchset: Cmjohnson; "Changin macs for ms-be8 and ms-be10 to reflect h/w change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34565 [19:16:34] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34565 [19:17:10] PROBLEM - Puppet freshness on mw30 is CRITICAL: Puppet has not run in the last 10 hours [19:17:46] merging on sockpuppet for chris, incl. unmerged changes to wikidata.pp [19:17:51] cmjohnson1: ^ [19:18:08] thx mutante [19:18:27] it also merged changes on labsmediawiki and zuul [19:19:36] apergos: if you are still around...ms-be8 and 10 are ready [19:20:38] ok, great, thanks! [19:20:48] (around but done) [19:21:08] :-) [19:23:28] RECOVERY - NTP on srv284 is OK: NTP OK: Offset -0.03766560555 secs [19:24:16] !log installing package upgrades in the srv193-srv199 range [19:24:22] Logged the message, Master [19:31:26] PROBLEM - Apache HTTP on srv193 is CRITICAL: Connection refused [19:33:49] PROBLEM - Apache HTTP on srv194 is CRITICAL: Connection refused [19:34:51] mutante: did you depool the boxes you're patching? [19:37:01] notpeter: srv284 for reinstall i did, but srv193 194 just for package upgrades (non-kernel) i did not [19:37:15] (well, it was already depooled) [19:37:45] (it is not by script, just 2 have been touched at this point) [19:38:37] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.065 second response time [19:39:20] ganglia is a little yellowish but not red [19:39:22] RECOVERY - Apache HTTP on srv193 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.093 second response time [19:43:14] ah, ok, cool [19:49:31] apergos, Reedy: sorry but I had to go, anyway it works now, thank you! 
[19:54:12] New patchset: Pyoungmeister; "removing old applicationserver role classes from site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34574 [19:55:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34574 [20:00:37] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: rest of wikipedias to 1.21wmf4 [20:00:44] Logged the message, Master [20:05:50] New review: Jeremyb; "We can always do followups. This is ready for merge as is." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/34499 [20:07:35] New review: Demon; "Test comment." [operations/debs/gerrit] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/27531 [20:20:42] New patchset: Reedy; "Everything wikipedia to 1.21wmf4" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34577 [20:23:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34577 [20:24:50] binasher: Wikibase\TermSqlCache::saveTermsOfEntity seems deadlock prone [20:25:19] * AaronSchulz eliminates job queue deadlocks and gets something in their place ;) [20:25:25] RECOVERY - Puppet freshness on spence is OK: puppet ran at Wed Nov 21 20:25:14 UTC 2012 [20:25:44] sql cache? hmm [20:26:17] wtf [20:34:10] !log preilly synchronized php-1.21wmf3/extensions/ZeroRatedMobileAccess 'update post deploy' [20:34:16] Logged the message, Master [20:34:39] !log preilly synchronized php-1.21wmf4/extensions/ZeroRatedMobileAccess 'update post deploy' [20:34:46] Logged the message, Master [20:36:23] binasher, we've prepared changes for mobile redirection on Wikivoyage: https://gerrit.wikimedia.org/r/#/c/34281/ and https://gerrit.wikimedia.org/r/#/c/29895/ [20:36:56] does it sound uncasry enough to deploy it today? [20:37:04] *unscary [20:37:32] MaxSem: i'll defer that to others on the ops team. [20:47:46] !log preilly Started syncing Wikimedia installation... : update zero rated mobile access [20:47:54] !log applying changes to fix no creds issue when setting preferences on labsconsole [20:47:54] Logged the message, Master [20:48:01] Logged the message, Master [20:50:36] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [20:56:19] !log deploying Change If5f6bc33: (bug 42334) on labsconsole [20:56:25] Logged the message, Master [20:58:24] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25508 [21:03:57] PROBLEM - Apache HTTP on srv196 is CRITICAL: Connection refused [21:06:00] New patchset: Demon; "Rename groups to extra_groups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34589 [21:06:26] 83haiku: debug: /dev/sda3 is not a BeFS partition: exiting [21:06:35] BeFS!? [21:07:04] [[w.Be File System]] [21:07:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34589 [21:07:46] "(BFS, occasionally misnamed as BeFS — the name BeFS is used in the Linux kernel to avoid any confusion with Boot File System)" ...o .kk... [21:08:34] mutante: where are you seeing this? 
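The BeFS/FAT/minix lines turning up below are os-prober's mounted-partition probes being logged at debug level on the freshly installed precise app server; harmless, but noisy. A minimal sketch of one way to quiet it, assuming the box has no use for os-prober at all (it is normally only wanted for grub's multi-OS menu):

    # Confirm the package is the source of the probe chatter, then drop it.
    dpkg -l os-prober
    grep os-probes /var/log/syslog | tail -n 5
    sudo apt-get -y remove --purge os-prober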
[21:08:54] srv196 [21:09:05] fresh apache app server on precise [21:09:48] but it looks like it also tries all other things just to find out they are not it [21:09:51] /dev/sda3 is not a FAT partition: exiting [21:10:37] and "os-proper" stuff debug: running /usr/lib/os-probes/mounted/80minix on mounted /dev/sda3 [21:10:52] syslog is just pretty chatty about debug [21:11:35] s/proper/prober/ [21:14:36] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [21:16:52] RECOVERY - Apache HTTP on srv196 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.073 second response time [21:17:05] !log preilly Finished syncing Wikimedia installation... : update zero rated mobile access [21:17:12] Logged the message, Master [21:23:19] oh, there is another issue with the apache init script and nice [21:23:22] /etc/init.d/apache2: 55: [: nice: unexpected operator [21:24:05] it's not nice [21:24:13] it's probably a bashism in a /bin/sh script [21:24:14] heh, yea [21:25:36] APACHE_HTTPD=$(. $APACHE_ENVVARS && echo $APACHE_HTTPD) [21:25:45] if [ ! -x $APACHE_HTTPD ] ; then [21:25:48] etc... [21:26:13] it can still start Apache anyways [21:27:30] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [21:32:36] New patchset: preilly; "add http.X-Carrier-Short header" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34600 [21:32:47] binasher: ^^ [21:32:55] sweet [21:37:33] New patchset: Asher; "add http.X-Carrier-Short header (as X-CS)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34600 [21:40:27] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34600 [21:51:40] !log installing package upgrades in the mw55-mw99 range [21:51:46] Logged the message, Master [21:51:55] make that mw59 [22:01:44] Change abandoned: Dzahn; "outdated, that issue has been fixed" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32753 [22:03:35] New review: Dzahn; "redirect to toolserver works" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/31302 [22:06:57] New patchset: Asher; "filter the superfluous junk from the apache syslog logfile that makes up a plurality of log lines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34605 [22:10:08] New patchset: Demon; "Fix up more naming issues with gerrit manifest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34606 [22:10:09] New review: Dzahn; "yeah, Apache config can go now for sure. just maybe you wanna reuse wlm.pp next year though? really ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/31302 [22:10:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34606 [22:11:21] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34605 [22:12:42] mutante: if you're up for a reviewing mood, https://gerrit.wikimedia.org/r/#/c/34113/ [22:13:51] PROBLEM - Apache HTTP on mw57 is CRITICAL: Connection refused [22:15:43] New patchset: Asher; "more apache syslog cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34607 [22:15:50] paravoid: want to give a review while you wait for one? ;) https://gerrit.wikimedia.org/r/34499 [22:16:25] to mw-config? 
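On the init-script error above: the "[: nice: unexpected operator" appears because the sourced envvars file can set APACHE_HTTPD to something like "nice ... /usr/sbin/apache2", and the unquoted test `[ ! -x $APACHE_HTTPD ]` then hands /bin/sh several words. A minimal sketch of a POSIX-safe version of that check (variable names follow the snippet quoted above; treating the last word as the binary is an assumption about how the wrapper is written):

    # APACHE_HTTPD may legitimately contain a wrapper, e.g. "nice /usr/sbin/apache2",
    # so test only the last word (the actual binary) and quote everything.
    APACHE_HTTPD=$(. "$APACHE_ENVVARS" && echo "$APACHE_HTTPD")
    for word in $APACHE_HTTPD; do httpd_bin=$word; done   # keep the last word
    if [ ! -x "${httpd_bin:-}" ]; then
        echo "No executable apache2 binary found ('$httpd_bin')" >&2
        exit 1
    fi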
I don't feel authoritative for that [22:16:37] heh [22:17:09] iirc apergos and Reedy are most relevant [22:17:20] but hashar has worked on it too [22:18:46] jeremyb: can you add a comment explaining what the ip address is, or pointing back to the request? [22:19:04] binasher: it's in the commit msg. you want it in the code too? [22:19:26] oh, so it is [22:20:02] jeremyb, are you sure those are the right times? if so looks good to me [22:20:10] New patchset: Demon; "Use proper SSH key for replication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34610 [22:20:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34607 [22:20:55] Krenair: i'm happy to have some double checking... he contradicted himself between UTC and PHT on the bug and also contradicted himself the same way in IRC. (irc and bugzilla were copy/paste of each other) [22:21:00] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34499 [22:21:06] New review: Dzahn; "using NE flag sounds reasonable. indeed it conflicts with the other one, but i would rather merge th..." [operations/apache-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/34113 [22:21:09] ugh, i hate timezones [22:21:28] Krenair: and then he changed his mind and said a different start time. i went with what he finally said [22:21:47] hrm [22:21:50] Krenair: also, i gave him an extra minute at the end. (midnight instead of the requested 23:59) [22:22:07] i merged it but the request for commons too is probably legit if it's a photo scavenger hunt [22:22:21] jeremyb: sorry bed time for me :( [22:22:29] hashar: binasher's doing it [22:23:24] or maybe not if it's just for account reg [22:23:25] binasher: i didn't click through to the event links. just seemed kinda like he might never stop changing his mind ;( [22:24:09] it's a bit confusing but this will be better than nothing regardless of what they're actually trying to request [22:24:30] hashar: i would like to delete or move misc::jenkins from misc-servers [22:24:35] right. and they *do* have local sysops on site. who can in turn override for their own wiki [22:24:38] # FIXME: merge with misc::contint::test, or remove [22:24:54] mutante: I don't think I ever used misc::jenkins [22:24:55] hashar: but no need to look now, will do it via gerrit and add you for later [22:25:01] oh.. ok. [22:25:02] mutante: been using the misc::contint ones [22:25:14] well it says this is supposed to be merged with contint [22:25:19] at least in that comment [22:25:20] mutante: which I will probably move to a module whenever I find a good name for it [22:25:25] fair [22:25:34] <^demon> That jenkins is/was fundraising's jenkins. [22:25:37] yeah that comment has always been there I think [22:25:40] !log asher synchronized wmf-config/throttle.php 'increased api limit for bug 42319' [22:25:46] <^demon> We've never shared manifest code, although in an ideal world we sould. [22:25:46] oh [22:25:47] Logged the message, Master [22:25:56] if possible i want to clean the entire misc-servers file [22:25:58] <^demon> *should, even [22:26:04] kk [22:26:13] bed time for real [22:26:18] good night [22:26:21] thanks!
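The back-and-forth above is the usual UTC-versus-PHT (Asia/Manila, UTC+8) confusion when reading a requested throttle window. A minimal sketch of double-checking such a conversion with GNU date; the timestamp shown is illustrative:

    # Convert a Philippine local time to UTC before putting it in throttle.php.
    TZ=UTC date -d 'TZ="Asia/Manila" 2012-11-22 08:00'
    # -> Thu Nov 22 00:00:00 UTC 2012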
[22:27:57] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.067 second response time [22:30:30] PROBLEM - Apache HTTP on srv284 is CRITICAL: Connection refused [22:31:26] New patchset: MaxSem; "Kill wlm.wikimedia.org with fire" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31302 [22:33:49] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34610 [22:34:32] New review: MaxSem; "In the new changeset, this node remains listed in site.pp with a comment based on https://rt.wikimed..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/31302 [22:36:20] binasher: merged your change on sockpuppet [22:38:59] New review: Dzahn; "alright, killing old wlm.wikimedia, now redirect to toolserver" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/31302 [22:39:00] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31302 [22:41:36] PROBLEM - Apache HTTP on mw59 is CRITICAL: Connection refused [22:41:56] Kaulen doesn't seem happy again [22:42:05] bugzilla is dead [22:42:13] ok, let me see [22:42:41] it's not even giving me a login prompt via ssh [22:42:41] heh [22:42:47] ack [22:42:51] switching to mgmt :p [22:43:07] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.062 second response time [22:43:24] now it has... [22:43:25] sloooow [22:43:31] need to reset mgmt [22:43:40] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=ascending&c=Miscellaneous+pmtpa&h=kaulen.wikimedia.org&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [22:43:46] swapping and wait cpu [22:43:48] as usual.. [22:44:27] PROBLEM - HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:44:36] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:45:34] sloooooowness [22:45:55] come on, gimme shell [22:46:51] I've logged in [22:46:55] not got a useable prompt yet : [22:46:59] same here [22:47:07] poor bz [22:49:18] if you can't get a shell on it, just power cycle it [22:49:31] it's not a DB server [22:49:40] literally waiting for a kill command to do something. but ack, powercycling [22:50:08] !log powercycling kaulen [22:50:14] Logged the message, Master [22:52:33] PROBLEM - Host kaulen is DOWN: CRITICAL - Host Unreachable (208.80.152.149) [22:53:04] ugh, went to BIOS? .. now! [22:53:18] of course fsck :p [22:54:00] works for me again [22:54:05] Meh. I can't work without Bugzilla. [22:54:12] RECOVERY - HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.026 seconds [22:54:15] Ah. Back it is. 
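When a host is swapping so hard that ssh never yields a prompt, the fallback used above is the out-of-band management interface. A minimal sketch of doing the power cycle over IPMI; the management hostname, credentials, and the choice of IPMI at all (rather than a vendor tool) are assumptions, and as the log shows an fsck on the way back up is part of the price:

    # Power-cycle a wedged host via its management interface (placeholders throughout).
    ipmitool -I lanplus -H kaulen.mgmt.example.org -U root -P "$MGMT_PASSWORD" \
        chassis power cycle
    # Then watch the serial console to catch BIOS and fsck:
    ipmitool -I lanplus -H kaulen.mgmt.example.org -U root -P "$MGMT_PASSWORD" \
        sol activate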
[22:54:21] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [22:54:21] RECOVERY - Host kaulen is UP: PING OK - Packet loss = 0%, RTA = 2.12 ms [22:54:22] yep, just now [22:55:07] !log bugzilla back up [22:55:14] Logged the message, Master [22:55:22] you should disable swap on it [22:55:40] funny how everyone seems to think locking up and requiring a power cycle is the best possible failure mode for an OOM condition [22:57:10] I really don't get why swap is so prevalent in our infrastructure [22:57:27] swapoff /dev/mapper/kaulen-swap_1 [22:57:27] swift proxies even had 48G of swap each [22:57:30] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [22:57:55] and comment out the /etc/fstab line [22:58:43] I tried to disable swap on my laptop, but then I discovered that hibernate depends on it [22:58:53] !log disabled swap on kaulen per Tim's advice [22:59:00] Logged the message, Master [22:59:00] done in fstab [22:59:04] I need a script that does a swapon right before hibernate, then does swapoff after resume [23:01:09] TimStarling: /etc/pm or pm-utils or whatever ubuntu ships these days? [23:01:23] there should be a place to put hibernate/resume hooks in /etc [23:01:47] binasher: what's your take on statsd? [23:04:06] PROBLEM - HTTP on formey is CRITICAL: Connection refused [23:05:00] PROBLEM - HTTPS on formey is CRITICAL: Connection refused [23:05:45] ^demon: no apache on formey, you working? [23:05:52] <^demon> Yes. [23:05:54] k [23:08:24] apergos: ms-be7 runs an ancient swift version. [23:09:21] RECOVERY - Puppet freshness on ms-be7 is OK: puppet ran at Wed Nov 21 23:09:07 UTC 2012 [23:09:58] it just needed an apt-get dist-upgrade [23:10:09] I haven't made puppet recipes to ensure => latest, as I find this a bit scary [23:10:10] <^demon> mutante: I can fix it for now, but https://gerrit.wikimedia.org/r/#/c/32476/ will fix it from breaking on future puppet runs. [23:11:10] ^demon: oh, just one ".pem" too much? you want that merged? [23:11:21] <^demon> Yep :) [23:11:27] RECOVERY - HTTPS on formey is OK: OK - Certificate will expire on 08/22/2015 22:23. [23:11:57] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32476 [23:12:03] it makes sense, like the others [23:12:03] RECOVERY - HTTP on formey is OK: HTTP OK HTTP/1.1 200 OK - 3596 bytes in 0.008 seconds [23:14:38] paravoid: statsd + graphite = good [23:14:55] ^demon: puppet run looks good to me [23:15:06] <^demon> Yep, everything looks happy now. [23:15:17] binasher: so, swift has added since a few releases statsd support with a lot of metrics [23:15:20] SSLCACertificateFile /etc/ssl/certs/wmf-ca.pem.pem#012+#011SSLCACertificateFile /etc/ssl/certs/wmf-ca.pem#012 [23:15:27] ack , it was ".pem.pem" [23:15:33] hah [23:15:56] <^demon> Yeah, so formey should survive puppet runs now. [23:16:17] it does. finished catalog run [23:34:33] !log preilly synchronized php-1.21wmf3/extensions/ZeroRatedMobileAccess 'update post deploy' [23:34:40] Logged the message, Master [23:37:51] !log preilly synchronized php-1.21wmf4/extensions/ZeroRatedMobileAccess 'update post deploy' [23:37:57] Logged the message, Master [23:47:15] !log tmp. 
depooling and rebooting a few mw5x servers for kernel upgrades, one by one [23:47:23] Logged the message, Master [23:50:36] PROBLEM - Host mw55 is DOWN: PING CRITICAL - Packet loss = 100% [23:51:48] RECOVERY - Host mw55 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [23:55:24] PROBLEM - Apache HTTP on mw55 is CRITICAL: Connection refused [23:58:42] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.074 second response time
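Back on the earlier wish for "a script that does a swapon right before hibernate, then does swapoff after resume": pm-utils does provide hook points under /etc/pm/sleep.d/ that are invoked with the phase as their first argument. A minimal sketch, with the hook filename and swap device as placeholders:

    #!/bin/sh
    # Hypothetical /etc/pm/sleep.d/10_hibernate_swap: pm-utils calls hooks with
    # "hibernate"/"thaw" (and "suspend"/"resume") as $1.
    SWAPDEV=/dev/mapper/example-swap_1   # placeholder swap device

    case "$1" in
        hibernate)
            swapon "$SWAPDEV" ;;    # make swap available for the hibernate image
        thaw)
            swapoff "$SWAPDEV" ;;   # drop it again once resumed
    esac
    exit 0

Whether resume actually finds the image also depends on the kernel's resume device being set, so this is only the userspace half of the trick.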