[00:00:03] no [00:01:42] New patchset: Pyoungmeister; "yet more apache upgrades" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34450 [00:03:28] !log depooling mw53-mw49 for upgrade to apache [00:03:34] Logged the message, notpeter [00:03:45] !log depooling mw66-mw99 for upgrade to apache [00:03:50] !log maxsem synchronized php-1.21wmf3/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/#/c/34440/' [00:03:51] Logged the message, notpeter [00:03:57] Logged the message, Master [00:06:19] !log maxsem synchronized php-1.21wmf4/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/#/c/34440/' [00:06:26] Logged the message, Master [00:09:36] New patchset: Asher; "path fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34451 [00:11:09] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34451 [00:13:52] * notpeter is bringin' the nagios alerts [00:14:23] woo, https statistics [00:14:34] binasher: this will include ipv6 hits too though, wouldn't it? [00:15:18] ipv6 isn't a lot of traffic, but https isn't either [00:16:07] PROBLEM - Host mw66 is DOWN: PING CRITICAL - Packet loss = 100% [00:16:16] PROBLEM - Host mw35 is DOWN: PING CRITICAL - Packet loss = 100% [00:16:34] PROBLEM - Host mw67 is DOWN: PING CRITICAL - Packet loss = 100% [00:16:43] PROBLEM - Host mw36 is DOWN: PING CRITICAL - Packet loss = 100% [00:17:10] PROBLEM - Host mw68 is DOWN: PING CRITICAL - Packet loss = 100% [00:17:19] PROBLEM - Host mw37 is DOWN: PING CRITICAL - Packet loss = 100% [00:17:46] PROBLEM - Host mw69 is DOWN: PING CRITICAL - Packet loss = 100% [00:17:55] PROBLEM - Host mw38 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:20] paravoid: i think it will be both and won't distinguish between nginx sites [00:18:22] PROBLEM - Host mw39 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:40] PROBLEM - Host mw40 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:40] and just realized we are down 2/4 hosts in esams and have no data on how nginx is actually doing [00:18:40] PROBLEM - Host mw43 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:40] PROBLEM - Host mw41 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:59] yes, Ryan_Lane was messing with it yesterday [00:19:27] sounds like one of the hosts didn't come back up after a reboot or something like that [00:19:30] I told ma rk our morning, since the mgmt lan for that rack is out and he said somethinga bout going to esams this week [00:19:39] yes [00:20:25] yeah the status page doesn't distinguish between vhosts [00:20:28] PROBLEM - Host mw44 is DOWN: PING CRITICAL - Packet loss = 100% [00:21:04] PROBLEM - Host mw45 is DOWN: PING CRITICAL - Packet loss = 100% [00:21:58] RECOVERY - Host mw66 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [00:22:07] RECOVERY - Host mw35 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [00:22:16] PROBLEM - Host mw47 is DOWN: PING CRITICAL - Packet loss = 100% [00:22:25] RECOVERY - Host mw67 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [00:22:34] RECOVERY - Host mw36 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [00:22:43] PROBLEM - Host mw48 is DOWN: PING CRITICAL - Packet loss = 100% [00:22:52] RECOVERY - Host mw68 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [00:23:01] RECOVERY - Host mw37 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [00:23:15] Ryan_Lane: are you still debugging this or should I file a RT for esams? 
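The "https statistics" and status page discussed above around [00:14:23]-[00:20:25] presumably come from nginx's stub_status module, which only exports aggregate per-instance counters; that is why it cannot break traffic out per vhost, and so cannot separate the IPv6 and HTTPS sites either. A minimal sketch of querying it, assuming a stub_status location is exposed locally (the port, path and numbers here are illustrative, not the production values):

    $ curl -s http://127.0.0.1:8080/nginx_status
    Active connections: 291
    server accepts handled requests
     16630948 16630948 31070465
    Reading: 6 Writing: 179 Waiting: 106

A ganglia plugin built on top of this, like the one merged later at [01:50:27], can therefore only report totals per server, not per site.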
[00:23:28] RECOVERY - Host mw69 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [00:23:28] PROBLEM - Host mw49 is DOWN: PING CRITICAL - Packet loss = 100% [00:23:38] RECOVERY - Host mw38 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [00:23:46] PROBLEM - Memcached on mw42 is CRITICAL: Connection refused [00:24:04] RECOVERY - Host mw39 is UP: PING OK - Packet loss = 0%, RTA = 1.87 ms [00:24:22] RECOVERY - Host mw43 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [00:24:22] RECOVERY - Host mw41 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [00:24:22] RECOVERY - Host mw40 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [00:24:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34450 [00:24:27] debugging what? [00:24:38] this :) [00:24:55] paravoid: the mgmt stuff? [00:24:56] ssl300x [00:24:58] PROBLEM - Apache HTTP on mw42 is CRITICAL: Connection refused [00:25:02] yes [00:25:05] apparently the mgmt switches aren't manages [00:25:07] managed* [00:25:12] someone will need to go to the datacenter for it [00:25:16] PROBLEM - SSH on mw42 is CRITICAL: Connection refused [00:25:17] nothing to be done about it [00:25:26] PROBLEM - Memcached on mw46 is CRITICAL: Connection refused [00:25:26] PROBLEM - SSH on mw35 is CRITICAL: Connection refused [00:25:26] PROBLEM - Apache HTTP on mw46 is CRITICAL: Connection refused [00:25:27] well, RT ticket in the esams queue? [00:25:30] doing it [00:25:34] I did that when I noticed it [00:25:46] oh [00:25:46] sorry [00:25:49] and ssl3004 has been down for months [00:25:52] PROBLEM - Apache HTTP on mw35 is CRITICAL: Connection refused [00:25:55] yeah [00:26:01] PROBLEM - Apache HTTP on mw66 is CRITICAL: Connection refused [00:26:01] PROBLEM - SSH on mw67 is CRITICAL: Connection refused [00:26:10] PROBLEM - SSH on mw68 is CRITICAL: Connection refused [00:26:10] PROBLEM - Apache HTTP on mw36 is CRITICAL: Connection refused [00:26:10] PROBLEM - Apache HTTP on mw67 is CRITICAL: Connection refused [00:26:10] RECOVERY - Host mw44 is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [00:26:20] PROBLEM - Apache HTTP on mw68 is CRITICAL: Connection refused [00:26:20] PROBLEM - SSH on mw36 is CRITICAL: Connection refused [00:26:28] PROBLEM - Memcached on mw36 is CRITICAL: Connection refused [00:26:28] PROBLEM - Memcached on mw35 is CRITICAL: Connection refused [00:26:37] PROBLEM - SSH on mw46 is CRITICAL: Connection refused [00:26:37] PROBLEM - Memcached on mw37 is CRITICAL: Connection refused [00:26:46] PROBLEM - SSH on mw66 is CRITICAL: Connection refused [00:26:46] RECOVERY - Host mw45 is UP: PING OK - Packet loss = 0%, RTA = 3.06 ms [00:26:52] http://ganglia.wikimedia.org/latest/?c=SSL%20cluster%20esams&h=ssl3002.esams.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [00:26:55] PROBLEM - SSH on mw69 is CRITICAL: Connection refused [00:26:55] PROBLEM - SSH on mw38 is CRITICAL: Connection refused [00:27:04] PROBLEM - SSH on mw37 is CRITICAL: Connection refused [00:27:23] PROBLEM - Apache HTTP on mw38 is CRITICAL: Connection refused [00:27:23] PROBLEM - Apache HTTP on mw39 is CRITICAL: Connection refused [00:27:31] PROBLEM - Apache HTTP on mw37 is CRITICAL: Connection refused [00:27:40] PROBLEM - SSH on mw41 is CRITICAL: Connection refused [00:27:40] PROBLEM - Apache HTTP on mw69 is CRITICAL: Connection refused [00:27:43] binasher: btw, we can use http 1.1 for the backend in nginx now [00:27:48] which also means connection pooling [00:27:49] PROBLEM - Memcached on mw39 is CRITICAL: Connection refused [00:27:49] 
PROBLEM - Memcached on mw38 is CRITICAL: Connection refused [00:27:58] RECOVERY - Host mw47 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [00:28:07] PROBLEM - SSH on mw43 is CRITICAL: Connection refused [00:28:08] I haven't started to test that yet [00:28:16] PROBLEM - SSH on mw40 is CRITICAL: Connection refused [00:28:25] PROBLEM - Apache HTTP on mw40 is CRITICAL: Connection refused [00:28:25] RECOVERY - Host mw48 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [00:28:34] PROBLEM - Memcached on mw40 is CRITICAL: Connection refused [00:28:34] PROBLEM - Memcached on mw41 is CRITICAL: Connection refused [00:28:43] PROBLEM - Memcached on mw43 is CRITICAL: Connection refused [00:28:43] PROBLEM - SSH on mw39 is CRITICAL: Connection refused [00:28:52] PROBLEM - Apache HTTP on mw41 is CRITICAL: Connection refused [00:28:52] PROBLEM - Apache HTTP on mw43 is CRITICAL: Connection refused [00:29:10] RECOVERY - Host mw49 is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [00:29:37] PROBLEM - SSH on mw44 is CRITICAL: Connection refused [00:30:04] PROBLEM - Apache HTTP on mw44 is CRITICAL: Connection refused [00:30:05] RECOVERY - SSH on mw66 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:30:05] PROBLEM - Memcached on mw45 is CRITICAL: Connection refused [00:30:08] it's kind of amazing how the box uses half of the memory it used to use before the precise reinstall [00:30:13] RECOVERY - SSH on mw35 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:30:13] PROBLEM - Apache HTTP on mw45 is CRITICAL: Connection refused [00:30:16] yep [00:30:22] and half the cpu [00:30:25] not that the cpu was much used [00:30:50] RECOVERY - SSH on mw67 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:30:58] PROBLEM - Memcached on mw44 is CRITICAL: Connection refused [00:30:58] RECOVERY - SSH on mw68 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:31:07] RECOVERY - SSH on mw36 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:31:18] and it's even using half the memory with all of ssl3001's load added up [00:31:25] PROBLEM - SSH on mw45 is CRITICAL: Connection refused [00:31:25] PROBLEM - SSH on mw47 is CRITICAL: Connection refused [00:31:27] well, half of it anyway [00:31:43] RECOVERY - SSH on mw69 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:31:43] PROBLEM - Memcached on mw47 is CRITICAL: Connection refused [00:31:43] PROBLEM - Apache HTTP on mw48 is CRITICAL: Connection refused [00:31:43] PROBLEM - Memcached on mw48 is CRITICAL: Connection refused [00:31:43] RECOVERY - SSH on mw38 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:31:52] RECOVERY - SSH on mw37 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:32:01] RECOVERY - SSH on mw39 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:32:01] PROBLEM - Apache HTTP on mw47 is CRITICAL: Connection refused [00:32:02] hmm, less processes [00:32:05] ah [00:32:08] that's odd [00:32:12] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=SSL+cluster+esams&h=ssl3002.esams.wikimedia.org&v=414&m=proc_total&jr=&js=&vl=+&ti=Total+Processes [00:32:12] did that change in puppet? [00:32:28] RECOVERY - SSH on mw41 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:32:32] hm. does it use http 1.1 by default now? 
[00:32:53] nope [00:33:00] so that shouldn't be the difference [00:33:13] PROBLEM - SSH on mw49 is CRITICAL: Connection refused [00:33:13] RECOVERY - SSH on mw40 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:33:13] PROBLEM - SSH on mw48 is CRITICAL: Connection refused [00:33:22] PROBLEM - Memcached on mw49 is CRITICAL: Connection refused [00:33:28] we actually run way more processes than is necessary [00:33:31] PROBLEM - Apache HTTP on mw49 is CRITICAL: Connection refused [00:33:48] the default nginx config changed for one of the ms boxes and the ssl servers got the change along with it [00:34:34] RECOVERY - SSH on mw44 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:34:34] RECOVERY - SSH on mw43 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:34:34] RECOVERY - SSH on mw45 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:34:43] RECOVERY - SSH on mw46 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:35:01] RECOVERY - SSH on mw42 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:36:13] RECOVERY - SSH on mw47 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:36:31] RECOVERY - SSH on mw48 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:37:38] !log preilly synchronized php-1.21wmf3/extensions/ZeroRatedMobileAccess 'update post deploy' [00:37:45] Logged the message, Master [00:38:01] RECOVERY - SSH on mw49 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:39:58] !log preilly synchronized php-1.21wmf4/extensions/ZeroRatedMobileAccess 'update post deploy' [00:40:06] Logged the message, Master [00:41:46] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.013 seconds [00:41:46] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.010 seconds [00:41:55] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.009 seconds [00:41:55] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [00:42:04] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [00:44:55] PROBLEM - NTP on mw35 is CRITICAL: NTP CRITICAL: Offset unknown [00:46:34] PROBLEM - NTP on mw36 is CRITICAL: NTP CRITICAL: No response from NTP server [00:46:52] PROBLEM - NTP on mw37 is CRITICAL: NTP CRITICAL: No response from NTP server [00:46:52] PROBLEM - NTP on mw68 is CRITICAL: NTP CRITICAL: No response from NTP server [00:47:37] PROBLEM - NTP on mw41 is CRITICAL: NTP CRITICAL: No response from NTP server [00:47:46] PROBLEM - NTP on mw40 is CRITICAL: NTP CRITICAL: No response from NTP server [00:48:04] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [00:48:31] PROBLEM - NTP on mw43 is CRITICAL: NTP CRITICAL: Offset unknown [00:50:01] PROBLEM - NTP on mw45 is CRITICAL: NTP CRITICAL: No response from NTP server [00:50:28] PROBLEM - NTP on mw44 is CRITICAL: NTP CRITICAL: No response from NTP server [00:53:08] RECOVERY - NTP on mw43 is OK: NTP OK: Offset -0.006959319115 secs [00:53:08] PROBLEM - NTP on mw66 is CRITICAL: NTP CRITICAL: Offset unknown [00:53:26] PROBLEM - NTP on mw48 is CRITICAL: NTP CRITICAL: No response from NTP server [00:54:02] RECOVERY - NTP on mw35 is OK: NTP OK: Offset -0.008198618889 secs [00:54:29] PROBLEM - NTP on mw67 is CRITICAL: NTP CRITICAL: Offset unknown [00:54:47] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [00:55:14] RECOVERY - 
Apache HTTP on mw44 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [00:55:23] PROBLEM - NTP on mw69 is CRITICAL: NTP CRITICAL: No response from NTP server [00:55:32] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [00:55:32] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [00:55:32] RECOVERY - Apache HTTP on mw67 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [00:55:41] PROBLEM - NTP on mw38 is CRITICAL: NTP CRITICAL: No response from NTP server [00:56:35] PROBLEM - NTP on mw42 is CRITICAL: NTP CRITICAL: No response from NTP server [00:59:53] PROBLEM - NTP on mw49 is CRITICAL: NTP CRITICAL: No response from NTP server [01:00:02] PROBLEM - NTP on mw46 is CRITICAL: NTP CRITICAL: No response from NTP server [01:07:41] RECOVERY - NTP on mw40 is OK: NTP OK: Offset -0.05681073666 secs [01:07:50] RECOVERY - NTP on mw48 is OK: NTP OK: Offset 0.08070385456 secs [01:08:26] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.004 seconds [01:08:53] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [01:09:11] RECOVERY - NTP on mw66 is OK: NTP OK: Offset -0.004572749138 secs [01:09:11] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.008 seconds [01:09:56] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.014 seconds [01:09:56] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [01:11:44] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [01:21:11] RECOVERY - NTP on mw41 is OK: NTP OK: Offset 0.01474070549 secs [01:22:05] RECOVERY - NTP on mw68 is OK: NTP OK: Offset -0.0178142786 secs [01:22:23] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.008 seconds [01:22:41] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [01:22:41] RECOVERY - NTP on mw36 is OK: NTP OK: Offset -0.01192617416 secs [01:23:08] RECOVERY - NTP on mw67 is OK: NTP OK: Offset -0.009715795517 secs [01:23:26] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [01:23:26] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [01:23:53] RECOVERY - NTP on mw44 is OK: NTP OK: Offset -0.01133227348 secs [01:34:05] RECOVERY - NTP on mw42 is OK: NTP OK: Offset -0.07879531384 secs [01:34:32] RECOVERY - NTP on mw38 is OK: NTP OK: Offset -0.03637301922 secs [01:35:44] RECOVERY - NTP on mw46 is OK: NTP OK: Offset -0.06795620918 secs [01:37:05] RECOVERY - NTP on mw49 is OK: NTP OK: Offset -0.007266640663 secs [01:37:14] RECOVERY - NTP on mw37 is OK: NTP OK: Offset -0.0003443956375 secs [01:37:14] RECOVERY - NTP on mw45 is OK: NTP OK: Offset -0.01063299179 secs [01:40:05] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 267 seconds [01:40:50] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 311 seconds [01:43:23] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:45:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [01:47:06] New patchset: Asher; "ganglia plugin for nginx stats" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34467 [01:50:27] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34467 [01:50:53] RECOVERY - NTP on mw69 
is OK: NTP OK: Offset -0.006066560745 secs [01:54:47] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [02:00:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 267 seconds [02:01:14] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 320 seconds [02:08:44] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [02:21:11] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [02:24:47] !log LocalisationUpdate completed (1.21wmf4) at Wed Nov 21 02:24:47 UTC 2012 [02:24:54] Logged the message, Master [02:30:56] RECOVERY - SSH on ms-be7 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:31:05] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [02:45:42] !log LocalisationUpdate completed (1.21wmf3) at Wed Nov 21 02:45:42 UTC 2012 [02:45:52] Logged the message, Master [02:54:47] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [02:57:38] RECOVERY - Puppet freshness on analytics1002 is OK: puppet ran at Wed Nov 21 02:57:12 UTC 2012 [03:01:05] !log depooling mw20-mw34 for upgrade to precise [03:01:13] Logged the message, notpeter [03:01:36] !log depooling mw70-mw74 for upgrade to precise [03:01:42] Logged the message, notpeter [03:05:08] RECOVERY - Puppet freshness on ms-fe1 is OK: puppet ran at Wed Nov 21 03:04:53 UTC 2012 [03:05:59] gonna make a lot of nagios noise (in this channel I mean) [03:10:24] PROBLEM - Host mw70 is DOWN: PING CRITICAL - Packet loss = 100% [03:10:32] PROBLEM - Host mw20 is DOWN: PING CRITICAL - Packet loss = 100% [03:10:50] PROBLEM - Host mw22 is DOWN: PING CRITICAL - Packet loss = 100% [03:10:50] PROBLEM - Host mw21 is DOWN: PING CRITICAL - Packet loss = 100% [03:11:08] PROBLEM - Host mw71 is DOWN: PING CRITICAL - Packet loss = 100% [03:11:08] PROBLEM - Host mw72 is DOWN: PING CRITICAL - Packet loss = 100% [03:11:24] New patchset: Faidon; "partman: fix grub install on non-/dev/sda" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34476 [03:11:24] New patchset: Faidon; "partman: fixes to ms-be with SSDs recipe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34477 [03:11:24] New patchset: Faidon; "partman: prepare for two flavors of ms-be" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34478 [03:11:40] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34476 [03:11:51] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34477 [03:11:58] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34478 [03:12:38] PROBLEM - Host mw25 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:25] New patchset: Pyoungmeister; "last apache upgrades" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34479 [03:15:03] a few hours more down the partman/d-i drain [03:15:10] working around bugs [03:15:21] PROBLEM - Host mw30 is DOWN: PING CRITICAL - Packet loss = 100% [03:15:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34479 [03:15:56] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [03:15:57] paravoid: :( [03:16:05] PROBLEM - Apache HTTP on mw73 is CRITICAL: Connection refused [03:16:05] PROBLEM - Memcached on mw24 is CRITICAL: Connection refused [03:16:05] RECOVERY - Host mw70 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms 
[03:16:14] RECOVERY - Host mw20 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [03:16:32] RECOVERY - Host mw22 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [03:16:32] RECOVERY - Host mw21 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [03:16:32] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [03:16:32] PROBLEM - Host mw32 is DOWN: PING CRITICAL - Packet loss = 100% [03:16:41] PROBLEM - Host mw34 is DOWN: PING CRITICAL - Packet loss = 100% [03:16:41] PROBLEM - Host mw33 is DOWN: PING CRITICAL - Packet loss = 100% [03:16:50] PROBLEM - Apache HTTP on mw26 is CRITICAL: Connection refused [03:16:50] RECOVERY - Host mw71 is UP: PING OK - Packet loss = 0%, RTA = 1.55 ms [03:16:50] RECOVERY - Host mw72 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [03:16:59] PROBLEM - Apache HTTP on mw74 is CRITICAL: Connection refused [03:16:59] PROBLEM - SSH on mw73 is CRITICAL: Connection refused [03:16:59] PROBLEM - SSH on mw74 is CRITICAL: Connection refused [03:17:08] PROBLEM - SSH on mw26 is CRITICAL: Connection refused [03:17:17] PROBLEM - SSH on mw24 is CRITICAL: Connection refused [03:17:26] PROBLEM - Apache HTTP on mw24 is CRITICAL: Connection refused [03:17:29] notpeter: I wonder if scheduling nagios downtimes would help in such cases [03:17:40] well, I know they would, I wonder if we should start using them [03:17:44] PROBLEM - Memcached on mw26 is CRITICAL: Connection refused [03:17:53] PROBLEM - Memcached on mw27 is CRITICAL: Connection refused [03:17:53] PROBLEM - SSH on mw27 is CRITICAL: Connection refused [03:18:11] PROBLEM - SSH on mw28 is CRITICAL: Connection refused [03:18:20] PROBLEM - Memcached on mw28 is CRITICAL: Connection refused [03:18:20] RECOVERY - Host mw25 is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [03:18:29] PROBLEM - Apache HTTP on mw29 is CRITICAL: Connection refused [03:18:42] paravoid: I'm fine either way [03:18:50] ma_rk has been anti in the past [03:18:56] PROBLEM - Memcached on mw29 is CRITICAL: Connection refused [03:18:56] PROBLEM - Apache HTTP on mw28 is CRITICAL: Connection refused [03:18:58] I generally don't care as long as it's not paging [03:19:01] why? [03:19:06] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused [03:19:12] he, presonally, likes to see things go down and come up [03:19:20] aha [03:19:29] I like to check the nagios.w.o page when I'm doing osmething like this [03:19:32] PROBLEM - Memcached on mw20 is CRITICAL: Connection refused [03:19:39] as it's a better way to check 20 apaches at once [03:19:41] iunno [03:19:41] PROBLEM - SSH on mw29 is CRITICAL: Connection refused [03:19:47] I have no opinoins on this [03:20:17] PROBLEM - SSH on mw72 is CRITICAL: Connection refused [03:20:17] PROBLEM - Apache HTTP on mw72 is CRITICAL: Connection refused [03:20:17] PROBLEM - Apache HTTP on mw21 is CRITICAL: Connection refused [03:20:22] I can silence things if people, genearlly, would prefer it [03:20:26] PROBLEM - Memcached on mw22 is CRITICAL: Connection refused [03:20:26] PROBLEM - SSH on mw20 is CRITICAL: Connection refused [03:20:26] PROBLEM - SSH on mw71 is CRITICAL: Connection refused [03:20:30] (he says, on his last round of upgrades....) 
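For reference, the downtime scheduling being debated at [03:17:29]-[03:20:30] is normally driven through Nagios's external command file rather than clicking through the web UI host by host. A rough sketch, assuming the command pipe sits at /var/lib/nagios/rw/nagios.cmd (that path, the host name, the author and the two-hour window are all illustrative):

    # silence the host check and every service check on mw20 for two hours
    now=$(date +%s); end=$((now + 7200))
    cmd=/var/lib/nagios/rw/nagios.cmd
    printf '[%s] SCHEDULE_HOST_DOWNTIME;mw20;%s;%s;1;0;7200;notpeter;precise reinstall\n' \
        "$now" "$now" "$end" > "$cmd"
    printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;mw20;%s;%s;1;0;7200;notpeter;precise reinstall\n' \
        "$now" "$now" "$end" > "$cmd"

Looping the same two commands over a batch like mw20-mw34 before a reinstall round would keep the channel quiet without touching the alert definitions themselves.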
[03:20:35] PROBLEM - SSH on mw70 is CRITICAL: Connection refused [03:20:35] PROBLEM - Apache HTTP on mw70 is CRITICAL: Connection refused [03:20:35] PROBLEM - Apache HTTP on mw71 is CRITICAL: Connection refused [03:20:35] PROBLEM - Apache HTTP on mw20 is CRITICAL: Connection refused [03:20:38] ((until 2 year from now.......)) [03:20:44] PROBLEM - Memcached on mw21 is CRITICAL: Connection refused [03:20:44] PROBLEM - SSH on mw21 is CRITICAL: Connection refused [03:21:02] RECOVERY - Host mw30 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [03:21:11] PROBLEM - SSH on mw22 is CRITICAL: Connection refused [03:21:11] PROBLEM - Apache HTTP on mw22 is CRITICAL: Connection refused [03:21:38] PROBLEM - SSH on mw25 is CRITICAL: Connection refused [03:21:38] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [03:21:56] PROBLEM - Memcached on mw25 is CRITICAL: Connection refused [03:22:14] RECOVERY - Host mw32 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [03:22:14] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [03:22:23] RECOVERY - Host mw34 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [03:22:23] RECOVERY - Host mw33 is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [03:22:39] yeah I know you're finishing off, this was a bigger "I wonder..." [03:22:58] ah, gotcha [03:23:08] PROBLEM - Apache HTTP on mw25 is CRITICAL: Connection refused [03:23:21] the thing that I think is more useful is making notes in the nagios.w.o page [03:23:26] for things that are being worked on [03:23:28] etc [03:23:44] RECOVERY - SSH on mw70 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:24:20] PROBLEM - SSH on mw30 is CRITICAL: Connection refused [03:24:20] RECOVERY - SSH on mw22 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:24:38] PROBLEM - Memcached on mw30 is CRITICAL: Connection refused [03:25:23] RECOVERY - SSH on mw20 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:25:23] RECOVERY - SSH on mw71 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:25:23] RECOVERY - SSH on mw72 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:25:23] PROBLEM - Apache HTTP on mw30 is CRITICAL: Connection refused [03:25:23] RECOVERY - SSH on mw73 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:25:32] RECOVERY - SSH on mw21 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:25:59] PROBLEM - Memcached on mw31 is CRITICAL: Connection refused [03:25:59] PROBLEM - SSH on mw31 is CRITICAL: Connection refused [03:26:08] PROBLEM - SSH on mw34 is CRITICAL: Connection refused [03:26:08] PROBLEM - Apache HTTP on mw32 is CRITICAL: Connection refused [03:26:17] PROBLEM - Apache HTTP on mw34 is CRITICAL: Connection refused [03:26:26] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused [03:26:26] PROBLEM - Memcached on mw34 is CRITICAL: Connection refused [03:26:26] RECOVERY - SSH on mw25 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:26:26] PROBLEM - Memcached on mw33 is CRITICAL: Connection refused [03:26:35] PROBLEM - Apache HTTP on mw33 is CRITICAL: Connection refused [03:26:44] PROBLEM - Memcached on mw32 is CRITICAL: Connection refused [03:27:02] RECOVERY - SSH on mw24 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:27:02] RECOVERY - SSH on mw26 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:27:02] PROBLEM - SSH on mw33 is CRITICAL: Connection refused [03:27:11] RECOVERY - SSH on mw74 is OK: SSH OK - OpenSSH_5.9p1 
Debian-5ubuntu1 (protocol 2.0) [03:27:11] PROBLEM - SSH on mw32 is CRITICAL: Connection refused [03:27:11] PROBLEM - SSH on ms-be7 is CRITICAL: Connection refused [03:27:47] RECOVERY - SSH on mw27 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:27:47] RECOVERY - SSH on mw28 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:29:08] RECOVERY - SSH on mw30 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:29:17] RECOVERY - SSH on mw31 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:29:26] RECOVERY - SSH on mw29 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:30:11] RECOVERY - SSH on mw33 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:30:29] RECOVERY - SSH on mw32 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:30:56] RECOVERY - SSH on mw34 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:35:26] RECOVERY - Apache HTTP on mw73 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.008 seconds [03:35:33] New patchset: Tim Starling; "Add timeouts to RMI communications" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/34481 [03:35:35] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [03:36:47] RECOVERY - SSH on ms-be7 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:37:14] PROBLEM - NTP on mw73 is CRITICAL: NTP CRITICAL: Offset unknown [03:39:29] PROBLEM - NTP on mw21 is CRITICAL: NTP CRITICAL: Offset unknown [03:39:38] PROBLEM - NTP on mw70 is CRITICAL: NTP CRITICAL: Offset unknown [03:40:32] PROBLEM - NTP on mw72 is CRITICAL: NTP CRITICAL: No response from NTP server [03:41:17] PROBLEM - NTP on mw22 is CRITICAL: NTP CRITICAL: No response from NTP server [03:42:36] New review: Tim Starling; "I only tested compilation. It would be nice if someone could test this properly." 
[operations/debs/lucene-search-2] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/34481 [03:42:57] RECOVERY - NTP on mw21 is OK: NTP OK: Offset 0.02046966553 secs [03:45:56] PROBLEM - NTP on mw33 is CRITICAL: NTP CRITICAL: Offset unknown [03:46:05] RECOVERY - NTP on mw22 is OK: NTP OK: Offset -0.03259980679 secs [03:47:26] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:47:53] PROBLEM - NTP on mw71 is CRITICAL: NTP CRITICAL: No response from NTP server [03:48:20] RECOVERY - NTP on mw73 is OK: NTP OK: Offset 0.05045318604 secs [03:48:56] PROBLEM - NTP on mw24 is CRITICAL: NTP CRITICAL: Offset unknown [03:49:14] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [03:49:41] PROBLEM - NTP on mw74 is CRITICAL: NTP CRITICAL: Offset unknown [03:49:41] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [03:50:44] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [03:50:54] PROBLEM - NTP on mw28 is CRITICAL: NTP CRITICAL: Offset unknown [03:51:20] PROBLEM - NTP on mw26 is CRITICAL: NTP CRITICAL: Offset unknown [03:52:14] PROBLEM - NTP on mw29 is CRITICAL: NTP CRITICAL: Offset unknown [03:52:14] PROBLEM - NTP on mw32 is CRITICAL: NTP CRITICAL: Offset unknown [03:52:23] PROBLEM - NTP on mw30 is CRITICAL: NTP CRITICAL: Offset unknown [03:53:44] RECOVERY - NTP on mw24 is OK: NTP OK: Offset -0.00827395916 secs [03:53:53] RECOVERY - NTP on mw32 is OK: NTP OK: Offset -0.006015896797 secs [03:54:02] RECOVERY - NTP on mw28 is OK: NTP OK: Offset -0.008961677551 secs [03:57:02] RECOVERY - NTP on mw29 is OK: NTP OK: Offset -0.01075148582 secs [03:58:59] RECOVERY - NTP on mw33 is OK: NTP OK: Offset -0.007286667824 secs [04:00:56] RECOVERY - NTP on mw71 is OK: NTP OK: Offset -0.0475628376 secs [04:00:56] RECOVERY - NTP on mw26 is OK: NTP OK: Offset -0.005421996117 secs [04:01:59] RECOVERY - NTP on mw30 is OK: NTP OK: Offset -0.007488250732 secs [04:02:26] RECOVERY - NTP on mw74 is OK: NTP OK: Offset -0.001455903053 secs [04:03:29] RECOVERY - NTP on mw70 is OK: NTP OK: Offset -0.01495325565 secs [04:03:29] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [04:12:30] New patchset: Tim Starling; "Add timeouts to RMI communications" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/34481 [04:13:41] RECOVERY - NTP on mw72 is OK: NTP OK: Offset -0.008569478989 secs [04:29:31] apergos: ms-be7 is done, preseeding should work for new systems, see git [04:29:44] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [04:30:26] apergos: I didn't do puppet though; could you set it up and put it in the rings (along with setting 100 for ms-be6) while I'm sleeping? 
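For context on the request just above: "putting it in the rings" means adding the new host's devices to the swift ring builder files and rebalancing, and the staged weights discussed later (33 to 66 to 100, around [07:10]-[07:19]) are plain set_weight calls. A minimal sketch with swift-ring-builder, assuming the builder files are in the working directory; the zone, IP, port and device names are illustrative rather than the real ms-be6/ms-be7 values:

    # add one of ms-be7's devices at an initial weight of 33
    # (repeat per device, and likewise for account.builder and container.builder)
    swift-ring-builder object.builder add z3-10.0.6.207:6000/sda3 33
    # later, step the weights up: ms-be7 33 -> 66, ms-be6 66 -> 100
    swift-ring-builder object.builder set_weight z3-10.0.6.207 66
    swift-ring-builder object.builder set_weight z2-10.0.6.206 100
    # recompute partition placement and push the new *.ring.gz files to every node
    swift-ring-builder object.builder rebalance

A single rebalance moves at most one replica of any given partition, which is the "one replica per rebalance" point mentioned at [07:18:37].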
[05:06:47] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [06:11:35] PROBLEM - Lucene on search14 is CRITICAL: Connection timed out [06:14:35] RECOVERY - Lucene on search14 is OK: TCP OK - 2.995 second response time on port 8123 [06:24:47] PROBLEM - Lucene on search14 is CRITICAL: Connection timed out [06:27:47] RECOVERY - Lucene on search14 is OK: TCP OK - 0.003 second response time on port 8123 [06:35:44] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [06:35:44] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:35:44] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [06:37:50] PROBLEM - Lucene on search14 is CRITICAL: Connection timed out [06:55:37] paravoid: ms-be6 is not done moving the data around, you can see this from the df and from this: http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=network_report&s=by+name&c=Swift+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [06:56:55] RECOVERY - Lucene on search14 is OK: TCP OK - 0.002 second response time on port 8123 [06:58:31] also it looks like you wasted the same hours I did on walking through the grub-installer code, I guess you did not see the backread in here, sorry about that [07:08:08] didn't we agree that I was going to look at it? [07:08:31] yes but then you were gone (or doing other things) for a long time [07:08:35] so I went aheaa [07:08:46] long time? that was just yesterday [07:08:50] yes [07:08:53] and I fixed it yesterday [07:09:07] I'm not complaining that you fixed it [07:09:25] so, I think ms-be6 is close to the target [07:09:30] and if you're going to add ms-be7 [07:09:41] why not make it 100 for ms-be6 too [07:10:09] alternatively you can add ms-be7 at weight 33 and then on friday or so make ms-be7 33->66, ms-be6 66->100 [07:10:28] the df on ms-be1 shows 1.3t in use, on ms-b6 only 540gb in use, that seems pretty far off of 66% [07:11:01] per disk [07:11:33] ok [07:11:51] I was looking at ms7 cruft yesterday too [07:11:56] ok great [07:12:25] it basically needs a half hour's work [07:12:38] cleanup the wiki page + file up a couple of bugzillas or so [07:12:50] ah not to do them ourselves you mean [07:12:51] for extdist and captcha I think [07:13:04] no, for the MW needed code changes [07:13:58] there were a few things which I wasn't sure about [07:14:33] pybaltestfile.txt you had a plan for this [07:15:05] yeah let's fix the bigger problems first [07:15:11] 404 see the comments, I have no idea aobut that [07:15:25] (they are in the list of cruft) [07:15:53] and the docroot [07:17:26] anyways after the grub-installer stuff this was next on my list [07:17:37] what's the difference between hits and hits__ ? 
in ganglia [07:18:37] therefore after I put ms-be7 in the rings and maybe move ms-be6 to 70% (to catch 'we only move one replica per reblanace' issues), I'll be doublechecking all these [07:18:54] I don't think 66->70 makes much sense (but it won't hurt either) [07:19:05] I think we can just do 33->66 on 7 and 66->100 on 6 at the same time [07:19:11] end of this week possibly [07:19:36] ah I guess the rebalance will happen anyways since we are adding ms-be7 [07:19:39] so yes [07:20:03] no point to moev to 70 [07:20:19] PROBLEM - Lucene on search14 is CRITICAL: Connection timed out [07:20:28] well you should be sleeping I think [07:21:40] RECOVERY - Lucene on search14 is OK: TCP OK - 0.006 second response time on port 8123 [07:21:41] !log restarted lucene search on search14 [07:21:48] Logged the message, Master [07:36:27] * Aaron|home can't wait till swift 1.7.5 [07:38:58] paravoid: is the next upgrade scheduled? [07:39:11] no [07:39:31] we're not going to move to 1.7.5 immediately, since it's the grizzly release [07:39:41] also, no way in hell I'm going to do an upgrade 4 days before the fundraiser :) [07:41:34] * Aaron|home would love to see that double GET fix make it in, but ah well [07:43:10] indeed, it's nasty [07:50:28] !log repooling mw70-mw74 [07:50:34] Logged the message, notpeter [07:52:49] New patchset: Pyoungmeister; "typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34491 [07:53:24] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34491 [07:53:46] hashar: see my comment on bug 41778 [07:54:04] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [07:54:26] Jasper_Deng: please give me the context / link :-) [07:55:06] http://bugzilla.wikimedia.org/show_bug.cgi?id=41778 [07:55:10] I replied to your comment [07:55:34] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [07:55:43] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [07:56:01] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [07:57:13] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [07:57:47] Jasper_Deng: oooo :-] [07:58:07] Jasper_Deng: I am not Asher Feldman, his nickname is binasher and he is working from the San Francisco office. [07:58:07] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [07:58:15] Jasper_Deng: I am Antoine Musso, working on CI :-] [07:58:20] continuous integration [07:58:21] oh [07:59:02] but still, shouldn't ipblocks have gotten fixed on all wikis rather than broken on the majority of them? [07:59:04] Jasper_Deng: in case he is not reading the bug notification, you might want to drop him an email too [07:59:25] * Jasper_Deng doesn't know email addresses of WMF sysadmins [07:59:36] Asher email is in bugzilla [08:00:00] as for the ipblocks / ipv6 stuff, I have no knowledge of this issue and would prefer avoiding spending one hour figuring it out :-] [08:00:13] afeldman@wikimedia.org ? 
[08:00:17] yeah [08:00:37] if the issue is still happening, you might want to reopen the bug :-] [08:00:41] I think he's already CC'd [08:00:48] but all that matters is that the bug is fisxed [08:04:56] https://wikitech.wikimedia.org/view/Schema_changes [08:06:49] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.010 seconds [08:06:58] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.007 seconds [08:07:16] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [08:08:10] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [08:08:21] what could've explained the undersize ipblocks column size? [08:08:55] does it matter? :) [08:09:02] I think it should just be added to that list [08:09:09] with a link to that bugzilla [08:09:18] hashar might know more. [08:09:59] is the bug fixed or not ? [08:10:06] I don't understand what is your request Jasper_Deng_away [08:10:09] oh away [08:10:10] ) [08:10:20] I think he needs a db migration. [08:10:27] * Jasper_Deng_away doesn't need one [08:12:05] okay, never mind me then [08:16:52] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [08:17:01] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [08:17:37] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.001 seconds [08:18:13] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [08:19:52] RECOVERY - Puppet freshness on ms-be7 is OK: puppet ran at Wed Nov 21 08:19:22 UTC 2012 [08:20:27] paravoid: do you happen to know if there are any training courses for Debian packaging ? [08:20:57] you mean professional courses? [08:20:58] paravoid: I thought that instead of either asking you to package or ranting about how I am unable to do any package, I might as well train myself :)) [08:21:02] no I don't think so [08:21:06] there are a few good presentations [08:21:10] yeah with a root@debian.org or something :) [08:21:18] a what? [08:21:34] I meant, receiving a training by someone actually knowing what he is talking about [08:21:47] ah, because there are only 6 root@d.o [08:22:01] ahh [08:22:10] http://git.debian.org/?p=collab-maint/packaging-tutorial.git;a=blob_plain;f=packaging-tutorial.pdf;hb=refs/heads/pdf [08:22:13] I probably meant package manager [08:22:41] I have been using Debian for a loooong time, still don't know the community / organization :( [08:22:54] well [08:22:57] I'll be in Paris this weekend [08:23:04] there's a France mini-DebConf [08:23:12] lots of french Debian people :) [08:23:21] you're welcome to attend [08:23:26] http://fr2012.mini.debconf.org/ [08:23:37] oh man [08:24:33] :-) [08:25:02] Debian's role in establishing an alternative to Skype [08:25:04] \O/ [08:25:25] over 100 people attending [08:25:34] will talk to my wife about it [08:25:50] maybe she could enjoy a week-end in Paris [08:26:10] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [08:26:17] are you staying there after the conf or just for the 2 days? 
[08:26:19] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [08:27:49] I'm staying until Monday evening [08:28:28] I have a quite a full schedule though [08:28:32] I can imagine [08:29:42] the guy who wrote the packaging tutorial I gave you above (apt-get install packaging-tutorial too) [08:29:55] RECOVERY - swift-object-server on ms-be7 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [08:30:01] is French and attending the conf [08:30:04] RECOVERY - swift-container-server on ms-be7 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [08:30:10] ;-) [08:30:22] RECOVERY - swift-container-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:30:22] RECOVERY - swift-account-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [08:30:23] RECOVERY - swift-container-updater on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [08:30:40] RECOVERY - swift-container-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [08:30:40] RECOVERY - swift-account-server on ms-be7 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [08:30:40] RECOVERY - swift-account-reaper on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [08:30:49] RECOVERY - swift-object-updater on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [08:31:07] RECOVERY - swift-object-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [08:31:16] RECOVERY - swift-object-auditor on ms-be7 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:32:43] yay [08:33:22] RECOVERY - swift-account-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [08:33:40] not completely, sdm3 and sdn3 aren't in fstab or have mount points etc. but almost yay :-D [08:33:55] yeah, known issue [08:34:02] needs fixing [08:34:09] yes indeed [08:37:56] paravoid: thanks for the tutorial :-] [08:43:16] RECOVERY - NTP on ms-be7 is OK: NTP OK: Offset -0.0154916048 secs [08:44:11] [08:45:58] got a tiny typo fix for you guys https://gerrit.wikimedia.org/r/#/c/34305 [08:46:18] some string had an extra $ which is unneeded and clutter a generated URL [08:46:23] not urgent htough [08:50:09] New patchset: ArielGlenn; "mount points and fstab for ssds on ms-be hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34494 [08:56:16] apergos: jenkins said it failed [08:56:19] do not merge [08:56:28] yeah I'm trying to find the typo [08:56:33] and it can'tmerge it anyways [08:56:34] why did you verified it yourself? 
[08:56:52] the typo is that you're missing a : [08:56:58] cause I didn't notice until after I hit the button that I was given only the publish option [08:57:33] you should only select one of the the Code review radios, not the verify ones [08:57:48] ok, I don't undrstand that syntax (with the :), guess I better look it up [08:57:54] always leave 0 no score for verified unless there's a very good reason not to [08:58:12] ok well we were told early on to verify, I suppose that changed at some point [08:58:14] so typically you do the code review part, and jenkins does the verified part [09:08:36] New patchset: MaxSem; "Switch Translate to ext:Solaruim" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34495 [09:18:56] !log repooling mw20-mw34 [09:19:02] Logged the message, notpeter [09:50:28] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:51:58] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [09:58:36] notpeter: ping? [09:59:55] i guess apergos is looking into kaulen? [09:59:57] if I can't get onto the box I'll power cycle it I guess [10:00:04] (it's taking a long time) [10:00:11] ah I got a front page [10:00:19] * jeremyb too [10:00:26] fairly snappy now [10:00:26] I'm on [10:00:56] load is dropping [10:02:05] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Miscellaneous+pmtpa&h=kaulen.wikimedia.org&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 looks like it had a lot of incoming there for a bit, hm [10:02:48] hrmmmm, watchmouse ain't recovering so fast [10:02:54] i wonder why [10:19:40] things look pretty normal over there and on the db so I'm going to call it good [10:26:50] New patchset: Jeremyb; "bug 42319 - throttle.php: bclwiki bday event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34499 [10:37:09] New patchset: ArielGlenn; "mount points and fstab for ssds on ms-be hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34494 [10:38:24] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34494 [10:49:07] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [10:55:03] New patchset: Silke Meyer; "Added puppet files for Wikidata on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30593 [10:56:30] New review: Silke Meyer; "Replaced the spaces with tabs." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/30593 [11:13:07] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [12:22:07] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [12:56:01] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [13:13:11] New patchset: Demon; "Configure changeMerge features for Gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34516 [13:13:35] New review: Demon; "Please don't merge until we deploy 2.5/2.6." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/34516 [14:05:08] New patchset: ArielGlenn; "provide for xfs filesystem labels without making the filesystem (ms-bexx)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34525 [14:33:28] New review: Jens Ohlig; "Looking good :)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/30593 [14:46:20] !log Halted nas1001-b for NIC upgrade [14:46:27] Logged the message, Master [14:48:56] whenever someone context switch, would you mind merging a typo I made to a zuul config file? https://gerrit.wikimedia.org/r/#/c/34305/ ;) [14:49:03] it is not in production, not going to kill anything [14:49:08] thanks in advance ! ;-] [14:49:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34305 [14:52:45] New review: Andrew Bogott; "Thanks -- looks good." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/30593 [14:52:47] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30593 [14:56:33] Bedankt mark ! [14:58:32] alsjeblieft [15:08:10] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [15:11:57] hello guys [15:12:02] I need access on stat1001 please [15:12:12] I need that to publish some reports for wikistats [15:12:17] I'm part of the wikimedia-analytics team [15:12:41] my username is spetrea, I already have access to stat1, build1 and build2 [15:13:16] paravoid: hello, can you please help me with this ? [15:13:21] mark apergos paravoid : could you grant average_drifter an access on stat1001 ? account in puppet is spetrea [15:13:22] ;) [15:14:03] paravoid , apergos please let me know if you could help me out with this [15:20:23] is there an RT ticket? [15:22:30] mark: I can make one [15:22:41] mark: can you please give me a link to the RT ? [15:25:53] send an email to access-requests@rt.wikimedia.org with your request, the reason why you need access [15:25:58] your manager will need to approve as well [15:35:19] mark: [15:35:20] RT could not load a valid user, and RT's configuration does not allow [15:35:23] for the creation of a new user for your email. [15:35:26] mark: why does it say that ? [15:35:44] mark: I have 2 e-mails stefan.petrea@gmail.com and stefan@garage-coding.com [15:35:54] mark: could you please check, maybe the other one is in the database of RT ? [15:36:20] it should be your wikimedia.org address, if that account has been made [15:36:36] if that doesn't work, ask your manager to file the request I guess? [15:37:21] mark: I will try again with my other e-mail [15:38:28] mark: tried with my other e-mail and it says [15:38:30] No permission to create tickets in the queue 'access-requests' [15:39:39] which email is that? [15:40:27] mark: stefan@garage-coding.com [15:40:34] like I said, that's not gonna work [15:40:41] use your @wikimedia.org address if you have one [15:40:52] mark: I don't have one [15:40:52] if that's not setup, ask diederik to sort things out for you [15:41:15] drdee: can you please help with this ? 
[15:41:27] drdee: or we can wait for ottomata [15:41:40] as i said, let's talk with ottomata first, i think he can set this up for us [15:41:45] no he can't [15:41:49] you need to file an RT ticket [15:41:59] ottomata needs to follow processes just like everyone else [15:42:58] mark, i know, i am just saying ottomata has already stuff running that we can use so average_drifter does not need access to stat1001 in the first place [15:43:07] ok [15:49:54] !log nas1001-b back up and running [15:50:00] Logged the message, Master [15:50:07] !log Initiated takeover of nas1001-a to nas1001-b [15:50:14] Logged the message, Master [15:58:16] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 30.91 ms [16:05:25] !log remove known troublemakers srv266 and srv284 from dsh group "apaches" [16:05:32] Logged the message, Master [16:08:42] !log powercycling srv284 [16:08:48] Logged the message, Master [16:12:04] RECOVERY - Host srv284 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [16:15:40] !log Aborted synchronous snapmirror relationships between nas1-a and nas1001-a to initiate cf giveback [16:15:47] Logged the message, Master [16:16:43] PROBLEM - Apache HTTP on srv284 is CRITICAL: Connection refused [16:23:02] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:51] !log Takeback by nas1001-a completed, snapmirror relationships back in sync [16:24:57] Logged the message, Master [16:28:34] PROBLEM - Host srv284 is DOWN: PING CRITICAL - Packet loss = 100% [16:29:08] !log reinstalling srv284 "last lucid standing" [16:29:14] Logged the message, Master [16:31:22] !seen ottomata [16:34:16] RECOVERY - Host srv284 is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms [16:37:07] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [16:37:07] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:37:07] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:37:43] PROBLEM - Memcached on srv284 is CRITICAL: Connection refused [16:38:32] !log powering on analytics1007 [16:38:39] Logged the message, Master [16:39:04] PROBLEM - SSH on srv284 is CRITICAL: Connection refused [16:53:01] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [16:53:01] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [16:53:20] PROBLEM - Host srv284 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:22] RECOVERY - SSH on srv284 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:00:31] RECOVERY - Host srv284 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:03:35] apergos: ping [17:03:48] Vito: ponngg [17:04:08] apergos: an user is reporting latex problems [17:04:12] missing texvc [17:04:33] is the user in a channell somewhere? [17:06:34] apergos: ^musaz is your man [17:06:47] ah [17:06:56] what project is this on ^musaz ? [17:07:05] <^musaz> it [17:07:30] and can you give me a step by step of what you did and the errors you saw? 
[17:07:55] in fact if you have an error page open right now that would be great to have the text and the page source [17:08:00] (pastebin) [17:08:05] <^musaz> apergos: in the last few days i'm in trouble with latex expressions [17:08:33] <^musaz> this problem is usually solved repeating the overview [17:08:40] <^musaz> since today [17:09:52] <^musaz> http://it.wikipedia.org/w/index.php?title=Amplificatore_operazionale&oldid=54075876 example [17:10:00] apergos: afaik he can solve the issue by makins several previews [17:10:09] hmm [17:10:35] could be one unappy server, the question is which one [17:10:39] <^musaz> it happens with all the expressions [17:10:47] ok [17:10:49] <^musaz> also the trivial ones [17:10:50] he wandered whatever it was someway related to his own system but I'm quite sure it depends on the server side [17:11:06] do you recall the error message at all? [17:11:14] apergos: maybe getting the srv # would be usefull, wouldn't it? [17:11:37] yes, that's why I was hoping [17:11:45] oh apergos I got it too [17:11:50] ah [17:11:58] so: page text? then, page source? [17:12:14] [17:12:28] Errore del parser (Eseguibile texvc mancante; per favore consultare math/README per la configurazione.): G = \frac{V_{\text{out}}} {V_{\text{in}}} = - \frac{R_{\text{f}}} {R_{\text{in}}} [17:13:04] "parser error (texvc executable missing; please read math/README for configuration.) [17:13:18] <^musaz> apergos: something similar to parse error (missing texvc , please visit math/README for the config [17:13:30] ok [17:14:01] brb [17:14:18] can one of you give me the source of the page where it is successful (just the 'Served by' part)? [17:14:48] so the first preview that works, look at the page source and find the line [17:22:48] thanks [17:24:40] PROBLEM - NTP on srv284 is CRITICAL: NTP CRITICAL: No response from NTP server [17:26:09] ok so [17:26:27] $wgTexvc = "/usr/local/apache/uncommon/$wmfVersionNumber/bin/texvc"; [17:26:41] this directory is now missing on some (all? precise?) hosts [17:35:56] mw1-16, srv193-198 (maybe 199), srv219-224 have [17:36:44] tmh1 and 2 have [17:36:53] and soe irrelevant boxes. [17:40:12] apergos: Easily fixed [17:40:44] it's in some package I guess? except maybe not any more [17:41:09] !log Running scap-recompile on all hosts in mediawiki-installation group [17:41:15] Logged the message, Master [17:41:21] ohh I dunno that script [17:41:34] Aaron moved it out of the normal script at some point [17:41:54] looking at it [17:42:15] How about now? [17:42:15] srv284: Host key verification failed. [17:42:15] meh, there's always 1 ;) [17:42:17] so probably this needs to be called separately after install of precise hosts? [17:42:21] (yes there is. always.) 
[17:42:31] Likely [17:42:37] I now run it when I deploy new mw versions [17:42:45] purely because it has to be built at least once [17:43:07] PROBLEM - Puppet freshness on mw70 is CRITICAL: Puppet has not run in the last 10 hours [17:43:07] PROBLEM - Puppet freshness on mw73 is CRITICAL: Puppet has not run in the last 10 hours [17:43:39] Vito: ^musaz: Try again [17:43:54] thanks for that, whew [17:44:01] PROBLEM - Puppet freshness on mw71 is CRITICAL: Puppet has not run in the last 10 hours [17:44:33] We didn't realise that was going to break after tim rebuilt the appserver package and deployed it [17:44:33] so we were WTFing at it for a while [17:45:04] PROBLEM - Puppet freshness on mw72 is CRITICAL: Puppet has not run in the last 10 hours [17:45:04] PROBLEM - Puppet freshness on mw74 is CRITICAL: Puppet has not run in the last 10 hours [17:46:19] hahaha [17:46:25] so you've already been through this, nice [17:47:17] About a month ago :D [17:47:25] sweet [17:49:24] New patchset: Hashar; "deploy zuul on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34555 [17:49:59] New review: Hashar; "Do not submit yet, need to be scheduled with ops." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/34555 [17:55:07] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:58:07] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [18:11:03] <^musaz> Reedy , apergos it works, thank you [18:11:21] yay! [18:29:30] apergos: apache can't add top-level dirs to /mnt/thumbs2 [18:29:45] * AaronSchulz needs a wikivoyage dir [18:30:01] uh [18:30:02] netapp? [18:30:10] yes [18:31:04] hm [18:32:18] ok done [18:32:26] no subdirs, just that one [18:34:15] !log prepared all wikivoyage file directories and containers [18:34:22] Logged the message, Master [18:34:25] that was fast [18:34:51] well it's not sharded or anything...not that that would take too long either [18:35:03] hey sometime later AaronSchulz I woudl love the 5 minute summary of how stuff that used to be in /mnt/upload7/private works in swift now as far as img_auth etc [18:35:23] I can see that images are retrieved out of swift but I couldn't figure out where the code path was [18:36:25] Change abandoned: Demon; "Not necessary after all--I did this via labsconsole like I should've." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32605 [18:36:46] Change abandoned: Krinkle; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22858 [18:37:06] New review: Krinkle; "nvm if it takes 3 months" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22858 [18:38:30] apergos: isn't that stuff still there? [18:38:41] well I mean [18:38:43] I mean we don't read from it though [18:38:54] I was trying to figure out how that stuff mapped to swift containers [18:38:58] and really I failed :-D [18:40:09] (yes the dir is on the netapp and is used afaik for writes) [18:40:52] have you seen http://wikitech.wikimedia.org/view/Swift/Swift_Container_Name_Conventions ? [18:41:07] it's slightly old in that it uses upload6, but is otherwise recentish [18:43:19] oohh private wikis [18:43:22] nope didn't see it [18:45:01] so that default stanza really does cover (almost) all of em [18:45:19] !log Copying over all math/timeline files to nas1 that are not already there for DR [18:45:26] Logged the message, Master [18:45:43] only thin I gotta figure out now is how, when I send an http://upload... 
url for an image on a private wiki, swift (I guess) does the right thing and denies me [18:45:48] apergos: suffice it to say the paths for nfs are a mess :) [18:46:11] yeoowww [18:46:59] hey [18:47:01] what's up? [18:47:08] apergos: rewrite doesn't do anything to check for private wikis (it does prevent viewing deleted stuff though), but the ACLs on private wikis require authentication [18:47:30] the ACLs for the deletion containers require auth on all wikis as well [18:47:33] yeah, do swift ... stat [18:47:50] apergos: there is a setZoneAccess script in WikimediaMaintenance that does this, as well as MW when it creates containers [18:47:52] and we don't serve upload.wm.org URLs in private wikis [18:47:52] ah speeking of swift commands [18:48:20] New patchset: Demon; "Removing some old cruft from gerrit manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34561 [18:48:39] where do I find the key/passwd these days, it seems not to be in the file as documented here: [18:48:40] http://wikitech.wikimedia.org/view/Swift/How_To#List_containers_and_contents [18:48:51] any more... [18:49:40] ( paravoid ) [18:49:59] New review: Demon; "I verified that neither of these crons exists on either of the gerrit boxes." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/34561 [18:50:08] and yes, I know we don't serve private media via upload.wm urls, that's why I checked one [18:51:18] apergos: PrivateSettings [18:51:36] paravoid: and we need to clean up the user/password sometime ;) [18:52:05] * AaronSchulz likes the long keys that radosgw makes [18:52:08] I didn't even know about this file [18:53:16] where does that get updated/commited? [18:55:19] AaronSchulz: [18:55:34] RECOVERY - Apache HTTP on srv284 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [18:55:38] !log re-signing srv284 on puppetmaster, package installs... [18:55:44] Logged the message, Master [18:56:18] apergos: svn in wmf-config [18:56:21] private svn [18:56:36] ok (I doubt I'd need to change it but better to know and not need it) [18:57:18] all right, off for the day [18:57:28] apergos: puppet private repo [19:00:45] ah not those ones, the ones in privatesettings I wanted. 
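The point being made about private wikis is that the rewrite middleware does not itself check wiki privacy (it only blocks deleted files); the container ACLs do, which is why an unauthenticated upload.wikimedia.org-style request gets denied. A minimal sketch of inspecting that with the stock swift client, where the auth URL, account:user, key and container name are all placeholders (the real credentials live in PrivateSettings, as noted above):

    # Show a container's metadata, including its Read/Write ACLs.
    # AUTH_URL, ACCOUNT:USER, SECRET_KEY and the container name are placeholders.
    swift -A http://swift-frontend.example/auth/v1.0 \
          -U ACCOUNT:USER -K SECRET_KEY \
          stat wikipedia-private-local-public
    # A private-wiki or deleted-files container should report a restrictive
    # "Read ACL:" rather than the world-readable ".r:*" used on public wikis.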
thanks though [19:02:28] New patchset: Demon; "Stop using gerrit2 for replication purposes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25508 [19:13:07] PROBLEM - Puppet freshness on mw20 is CRITICAL: Puppet has not run in the last 10 hours [19:13:07] PROBLEM - Puppet freshness on mw22 is CRITICAL: Puppet has not run in the last 10 hours [19:14:01] PROBLEM - Puppet freshness on mw21 is CRITICAL: Puppet has not run in the last 10 hours [19:14:01] PROBLEM - Puppet freshness on mw25 is CRITICAL: Puppet has not run in the last 10 hours [19:14:01] PROBLEM - Puppet freshness on mw24 is CRITICAL: Puppet has not run in the last 10 hours [19:14:01] PROBLEM - Puppet freshness on mw26 is CRITICAL: Puppet has not run in the last 10 hours [19:14:01] PROBLEM - Puppet freshness on mw28 is CRITICAL: Puppet has not run in the last 10 hours [19:14:02] PROBLEM - Puppet freshness on mw27 is CRITICAL: Puppet has not run in the last 10 hours [19:14:02] PROBLEM - Puppet freshness on mw31 is CRITICAL: Puppet has not run in the last 10 hours [19:14:03] PROBLEM - Puppet freshness on mw33 is CRITICAL: Puppet has not run in the last 10 hours [19:15:04] PROBLEM - Puppet freshness on mw29 is CRITICAL: Puppet has not run in the last 10 hours [19:15:04] PROBLEM - Puppet freshness on mw32 is CRITICAL: Puppet has not run in the last 10 hours [19:15:04] PROBLEM - Puppet freshness on mw34 is CRITICAL: Puppet has not run in the last 10 hours [19:15:30] New patchset: Cmjohnson; "Changin macs for ms-be8 and ms-be10 to reflect h/w change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34565 [19:16:34] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34565 [19:17:10] PROBLEM - Puppet freshness on mw30 is CRITICAL: Puppet has not run in the last 10 hours [19:17:46] merging on sockpuppet for chris, incl. unmerged changes to wikidata.pp [19:17:51] cmjohnson1: ^ [19:18:08] thx mutante [19:18:27] it also merged changes on labsmediawiki and zuul [19:19:36] apergos: if you are still around...ms-be8 and 10 are ready [19:20:38] ok, great, thanks! [19:20:48] (around but done) [19:21:08] :-) [19:23:28] RECOVERY - NTP on srv284 is OK: NTP OK: Offset -0.03766560555 secs [19:24:16] !log installing package upgrades in the srv193-srv199 range [19:24:22] Logged the message, Master [19:31:26] PROBLEM - Apache HTTP on srv193 is CRITICAL: Connection refused [19:33:49] PROBLEM - Apache HTTP on srv194 is CRITICAL: Connection refused [19:34:51] mutante: did you depool the boxes you're patching? [19:37:01] notpeter: srv284 for reinstall i did, but srv193 194 just for package upgrades (non-kernel) i did not [19:37:15] (well, it was already depooled) [19:37:45] (it is not by script, just 2 have been touched at this point) [19:38:37] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.065 second response time [19:39:20] ganglia is a little yellowish but not red [19:39:22] RECOVERY - Apache HTTP on srv193 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.093 second response time [19:43:14] ah, ok, cool [19:49:31] apergos, Reedy: sorry but I had to go, anyway it works now, thank you! 
[19:54:12] New patchset: Pyoungmeister; "removing old applicationserver role classes from site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34574 [19:55:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34574 [20:00:37] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: rest of wikipedias to 1.21wmf4 [20:00:44] Logged the message, Master [20:05:50] New review: Jeremyb; "We can always do followups. This is ready for merge as is." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/34499 [20:07:35] New review: Demon; "Test comment." [operations/debs/gerrit] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/27531 [20:20:42] New patchset: Reedy; "Everything wikipedia to 1.21wmf4" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34577 [20:23:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34577 [20:24:50] binasher: Wikibase\TermSqlCache::saveTermsOfEntity seems deadlock prone [20:25:19] * AaronSchulz eliminates job queue deadlocks and gets something in their place ;) [20:25:25] RECOVERY - Puppet freshness on spence is OK: puppet ran at Wed Nov 21 20:25:14 UTC 2012 [20:25:44] sql cache? hmm [20:26:17] wtf [20:34:10] !log preilly synchronized php-1.21wmf3/extensions/ZeroRatedMobileAccess 'update post deploy' [20:34:16] Logged the message, Master [20:34:39] !log preilly synchronized php-1.21wmf4/extensions/ZeroRatedMobileAccess 'update post deploy' [20:34:46] Logged the message, Master [20:36:23] binasher, we've prepared changes for mobile redirection on Wikivoyage: https://gerrit.wikimedia.org/r/#/c/34281/ and https://gerrit.wikimedia.org/r/#/c/29895/ [20:36:56] does it sound uncasry enough to deploy it today? [20:37:04] *unscary [20:37:32] MaxSem: i'll defer that to others on the ops team. [20:47:46] !log preilly Started syncing Wikimedia installation... : update zero rated mobile access [20:47:54] !log applying changes to fix no creds issue when setting preferences on labsconsole [20:47:54] Logged the message, Master [20:48:01] Logged the message, Master [20:50:36] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [20:56:19] !log deploying Change If5f6bc33: (bug 42334) on labsconsole [20:56:25] Logged the message, Master [20:58:24] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25508 [21:03:57] PROBLEM - Apache HTTP on srv196 is CRITICAL: Connection refused [21:06:00] New patchset: Demon; "Rename groups to extra_groups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34589 [21:06:26] 83haiku: debug: /dev/sda3 is not a BeFS partition: exiting [21:06:35] BeFS!? [21:07:04] [[w.Be File System]] [21:07:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34589 [21:07:46] "(BFS, occasionally misnamed as BeFS — the name BeFS is used in the Linux kernel to avoid any confusion with Boot File System)" ...o .kk... [21:08:34] mutante: where are you seeing this? 
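The BeFS/FAT/minix lines turning up below are os-prober's mounted-partition probes being logged at debug level on the freshly installed precise app server; harmless, but noisy. A minimal sketch of one way to quiet it, assuming the box has no use for os-prober at all (it is normally only wanted for grub's multi-OS menu):

    # Confirm the package is the source of the probe chatter, then drop it.
    dpkg -l os-prober
    grep os-probes /var/log/syslog | tail -n 5
    sudo apt-get -y remove --purge os-prober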
[21:08:54] srv196 [21:09:05] fresh apache app server on precise [21:09:48] but it looks like it also tries all other things just to find out they are not it [21:09:51] /dev/sda3 is not a FAT partition: exiting [21:10:37] and "os-proper" stuff debug: running /usr/lib/os-probes/mounted/80minix on mounted /dev/sda3 [21:10:52] syslog is just pretty chatty about debug [21:11:35] s/proper/prober/ [21:14:36] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [21:16:52] RECOVERY - Apache HTTP on srv196 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.073 second response time [21:17:05] !log preilly Finished syncing Wikimedia installation... : update zero rated mobile access [21:17:12] Logged the message, Master [21:23:19] oh, there is another issue with the apache init script and nice [21:23:22] /etc/init.d/apache2: 55: [: nice: unexpected operator [21:24:05] it's not nice [21:24:13] it's probably a bashism in a /bin/sh script [21:24:14] heh, yea [21:25:36] APACHE_HTTPD=$(. $APACHE_ENVVARS && echo $APACHE_HTTPD) [21:25:45] if [ ! -x $APACHE_HTTPD ] ; then [21:25:48] etc... [21:26:13] it can still start Apache anyways [21:27:30] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [21:32:36] New patchset: preilly; "add http.X-Carrier-Short header" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34600 [21:32:47] binasher: ^^ [21:32:55] sweet [21:37:33] New patchset: Asher; "add http.X-Carrier-Short header (as X-CS)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34600 [21:40:27] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34600 [21:51:40] !log installing package upgrades in the mw55-mw99 range [21:51:46] Logged the message, Master [21:51:55] make that mw59 [22:01:44] Change abandoned: Dzahn; "outdated, that issue has been fixed" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32753 [22:03:35] New review: Dzahn; "redirect to toolserver works" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/31302 [22:06:57] New patchset: Asher; "filter the superfluous junk from the apache syslog logfile that makes up a plurality of log lines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34605 [22:10:08] New patchset: Demon; "Fix up more naming issues with gerrit manifest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34606 [22:10:09] New review: Dzahn; "yeah, Apache config can go now for sure. just maybe you wanna reuse wlm.pp next year though? really ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/31302 [22:10:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34606 [22:11:21] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34605 [22:12:42] mutante: if you're up for a reviewing mood, https://gerrit.wikimedia.org/r/#/c/34113/ [22:13:51] PROBLEM - Apache HTTP on mw57 is CRITICAL: Connection refused [22:15:43] New patchset: Asher; "more apache syslog cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34607 [22:15:50] paravoid: want to give a review while you wait for one? ;) https://gerrit.wikimedia.org/r/34499 [22:16:25] to mw-config? 
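On the init-script error above: the "[: nice: unexpected operator" appears because the sourced envvars file can set APACHE_HTTPD to something like "nice ... /usr/sbin/apache2", and the unquoted test `[ ! -x $APACHE_HTTPD ]` then hands /bin/sh several words. A minimal sketch of a POSIX-safe version of that check (variable names follow the snippet quoted above; treating the last word as the binary is an assumption about how the wrapper is written):

    # APACHE_HTTPD may legitimately contain a wrapper, e.g. "nice /usr/sbin/apache2",
    # so test only the last word (the actual binary) and quote everything.
    APACHE_HTTPD=$(. "$APACHE_ENVVARS" && echo "$APACHE_HTTPD")
    for word in $APACHE_HTTPD; do httpd_bin=$word; done   # keep the last word
    if [ ! -x "${httpd_bin:-}" ]; then
        echo "No executable apache2 binary found ('$httpd_bin')" >&2
        exit 1
    fi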
I don't feel authoritative for that [22:16:37] heh [22:17:09] iirc apergos and Reedy are most relevant [22:17:20] but hashar has worked on it too [22:18:46] jeremyb: can you add a comment explaining what the ip address is, or pointing back to the request? [22:19:04] binasher: it's in the commit msg. you want it in the code too? [22:19:26] oh, so it is [22:20:02] jeremyb, are you sure those are the right times? if so looks good to me [22:20:10] New patchset: Demon; "Use proper SSH key for replication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34610 [22:20:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34607 [22:20:55] Krenair: i'm happy to have some double checking... he contradicted himself between UTC and PHT on the bug and also contradicted himself the same way in IRC. (irc and bugzilla were copy/paste of each other) [22:21:00] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34499 [22:21:06] New review: Dzahn; "using NE flag sounds reasonable. indeed it conflicts with the other one, but i would rather merge th..." [operations/apache-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/34113 [22:21:09] ugh, i hate timezones [22:21:28] Krenair: and then he changed his mind and said a different start time. i went with what he finally said [22:21:47] hrm [22:21:50] Krenair: also, i gave him an extra minute at the end. (midnight instead of the requested 23:59) [22:22:07] i merged it but the request for commons too is probably legit if it's a photo scavenger hunt [22:22:21] jeremyb: sorry bed time for me :( [22:22:29] hashar: binasher's doing it [22:23:24] or maybe not if it's just for account reg [22:23:25] binasher: i didn't click through to the event links. just seemed kinda like he might never stop changing his mind ;( [22:24:09] it's a bit confusing but this will be better than nothing regardless of what they're actually trying to request [22:24:30] hashar: i would like to delete or move misc::jenkins from misc-servers [22:24:35] right. and they *do* have local sysops on site. who can in turn override for their own wiki [22:24:38] # FIXME: merge with misc::contint::test, or remove [22:24:54] mutante: I don't think I ever used misc::jenkins [22:24:55] hashar: but no need to look now, will do it via gerrit and add you for later [22:25:01] oh.. ok. [22:25:02] mutante: been using the misc::contint ones [22:25:14] well it says this is supposed to be merged with contint [22:25:19] at least in that comment [22:25:20] mutante: which I will probably move to a module whenever I find a good name for it [22:25:25] fair [22:25:34] <^demon> That jenkins is/was fundraising's jenkins. [22:25:37] yeah that comment has always been there I think [22:25:40] !log asher synchronized wmf-config/throttle.php 'increased api limit for bug 42319' [22:25:46] <^demon> We've never shared manifest code, although in an ideal world we sould. [22:25:46] oh [22:25:47] Logged the message, Master [22:25:56] if possible i want to clean the entire misc-servers file [22:25:58] <^demon> *should, even [22:26:04] kk [22:26:13] bed time for real [22:26:18] good night [22:26:21] thanks!
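The back-and-forth above is the usual UTC-versus-PHT (Asia/Manila, UTC+8) confusion when reading a requested throttle window. A minimal sketch of double-checking such a conversion with GNU date; the timestamp shown is illustrative:

    # Convert a Philippine local time to UTC before putting it in throttle.php.
    TZ=UTC date -d 'TZ="Asia/Manila" 2012-11-22 08:00'
    # -> Thu Nov 22 00:00:00 UTC 2012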
[22:27:57] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.067 second response time [22:30:30] PROBLEM - Apache HTTP on srv284 is CRITICAL: Connection refused [22:31:26] New patchset: MaxSem; "Kill wlm.wikimedia.org with fire" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31302 [22:33:49] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34610 [22:34:32] New review: MaxSem; "In the new changeset, this node remains listed in site.pp with a comment based on https://rt.wikimed..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/31302 [22:36:20] binasher: merged your change on sockpuppet [22:38:59] New review: Dzahn; "alright, killing old wlm.wikimedia, now redirect to toolserver" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/31302 [22:39:00] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31302 [22:41:36] PROBLEM - Apache HTTP on mw59 is CRITICAL: Connection refused [22:41:56] Kaulen doesn't seem happy again [22:42:05] bugzilla is dead [22:42:13] ok, let me see [22:42:41] it's not even giving me a login prompt via ssh [22:42:41] heh [22:42:47] ack [22:42:51] switching to mgmt :p [22:43:07] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.062 second response time [22:43:24] now it has... [22:43:25] sloooow [22:43:31] need to reset mgmt [22:43:40] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=ascending&c=Miscellaneous+pmtpa&h=kaulen.wikimedia.org&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [22:43:46] swapping and wait cpu [22:43:48] as usual.. [22:44:27] PROBLEM - HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:44:36] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:45:34] sloooooowness [22:45:55] come on, gimme shell [22:46:51] I've logged in [22:46:55] not got a useable prompt yet : [22:46:59] same here [22:47:07] poor bz [22:49:18] if you can't get a shell on it, just power cycle it [22:49:31] it's not a DB server [22:49:40] literally waiting for a kill command to do something. but ack, powercycling [22:50:08] !log powercycling kaulen [22:50:14] Logged the message, Master [22:52:33] PROBLEM - Host kaulen is DOWN: CRITICAL - Host Unreachable (208.80.152.149) [22:53:04] ugh, went to BIOS? .. now! [22:53:18] of course fsck :p [22:54:00] works for me again [22:54:05] Meh. I can't work without Bugzilla. [22:54:12] RECOVERY - HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.026 seconds [22:54:15] Ah. Back it is. 
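When a host is swapping so hard that ssh never yields a prompt, the fallback used above is the out-of-band management interface. A minimal sketch of doing the power cycle over IPMI; the management hostname, credentials, and the choice of IPMI at all (rather than a vendor tool) are assumptions, and as the log shows an fsck on the way back up is part of the price:

    # Power-cycle a wedged host via its management interface (placeholders throughout).
    ipmitool -I lanplus -H kaulen.mgmt.example.org -U root -P "$MGMT_PASSWORD" \
        chassis power cycle
    # Then watch the serial console to catch BIOS and fsck:
    ipmitool -I lanplus -H kaulen.mgmt.example.org -U root -P "$MGMT_PASSWORD" \
        sol activate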
[22:54:21] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [22:54:21] RECOVERY - Host kaulen is UP: PING OK - Packet loss = 0%, RTA = 2.12 ms [22:54:22] yep, just now [22:55:07] !log bugzilla back up [22:55:14] Logged the message, Master [22:55:22] you should disable swap on it [22:55:40] funny how everyone seems to think locking up and requiring a power cycle is the best possible failure mode for an OOM condition [22:57:10] I really don't get why swap is so prevalent in our infrastructure [22:57:27] swapoff /dev/mapper/kaulen-swap_1 [22:57:27] swift proxies even had 48G of swap each [22:57:30] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [22:57:55] and comment out the /etc/fstab line [22:58:43] I tried to disable swap on my laptop, but then I discovered that hibernate depends on it [22:58:53] !log disabled swap on kaulen per Tim's advice [22:59:00] Logged the message, Master [22:59:00] done in fstab [22:59:04] I need a script that does a swapon right before hibernate, then does swapoff after resume [23:01:09] TimStarling: /etc/pm or pm-utils or whatever ubuntu ships these days? [23:01:23] there should be a place to put hibernate/resume hooks in /etc [23:01:47] binasher: what's your take on statsd? [23:04:06] PROBLEM - HTTP on formey is CRITICAL: Connection refused [23:05:00] PROBLEM - HTTPS on formey is CRITICAL: Connection refused [23:05:45] ^demon: no apache on formey, you working? [23:05:52] <^demon> Yes. [23:05:54] k [23:08:24] apergos: ms-be7 runs an ancient swift version. [23:09:21] RECOVERY - Puppet freshness on ms-be7 is OK: puppet ran at Wed Nov 21 23:09:07 UTC 2012 [23:09:58] it just needed an apt-get dist-upgrade [23:10:09] I haven't made puppet recipes to ensure => latest, as I find this a bit scary [23:10:10] <^demon> mutante: I can fix it for now, but https://gerrit.wikimedia.org/r/#/c/32476/ will fix it from breaking on future puppet runs. [23:11:10] ^demon: oh, just one ".pem" too much? you want that merged? [23:11:21] <^demon> Yep :) [23:11:27] RECOVERY - HTTPS on formey is OK: OK - Certificate will expire on 08/22/2015 22:23. [23:11:57] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32476 [23:12:03] it makes sense, like the others [23:12:03] RECOVERY - HTTP on formey is OK: HTTP OK HTTP/1.1 200 OK - 3596 bytes in 0.008 seconds [23:14:38] paravoid: statsd + graphite = good [23:14:55] ^demon: puppet run looks good to me [23:15:06] <^demon> Yep, everything looks happy now. [23:15:17] binasher: so, swift has added since a few releases statsd support with a lot of metrics [23:15:20] SSLCACertificateFile /etc/ssl/certs/wmf-ca.pem.pem#012+#011SSLCACertificateFile /etc/ssl/certs/wmf-ca.pem#012 [23:15:27] ack , it was ".pem.pem" [23:15:33] hah [23:15:56] <^demon> Yeah, so formey should survive puppet runs now. [23:16:17] it does. finished catalog run [23:34:33] !log preilly synchronized php-1.21wmf3/extensions/ZeroRatedMobileAccess 'update post deploy' [23:34:40] Logged the message, Master [23:37:51] !log preilly synchronized php-1.21wmf4/extensions/ZeroRatedMobileAccess 'update post deploy' [23:37:57] Logged the message, Master [23:47:15] !log tmp. 
depooling and rebooting a few mw5x servers for kernel upgrades, one by one [23:47:23] Logged the message, Master [23:50:36] PROBLEM - Host mw55 is DOWN: PING CRITICAL - Packet loss = 100% [23:51:48] RECOVERY - Host mw55 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [23:55:24] PROBLEM - Apache HTTP on mw55 is CRITICAL: Connection refused [23:58:42] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.074 second response time
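Back on the earlier wish for "a script that does a swapon right before hibernate, then does swapoff after resume": pm-utils does provide hook points under /etc/pm/sleep.d/ that are invoked with the phase as their first argument. A minimal sketch, with the hook filename and swap device as placeholders:

    #!/bin/sh
    # Hypothetical /etc/pm/sleep.d/10_hibernate_swap: pm-utils calls hooks with
    # "hibernate"/"thaw" (and "suspend"/"resume") as $1.
    SWAPDEV=/dev/mapper/example-swap_1   # placeholder swap device

    case "$1" in
        hibernate)
            swapon "$SWAPDEV" ;;    # make swap available for the hibernate image
        thaw)
            swapoff "$SWAPDEV" ;;   # drop it again once resumed
    esac
    exit 0

Whether resume actually finds the image also depends on the kernel's resume device being set, so this is only the userspace half of the trick.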