[00:08:01] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Mon Jan 28 00:07:59 UTC 2013
[00:08:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[00:09:01] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Mon Jan 28 00:08:51 UTC 2013
[00:09:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[00:14:30] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[00:14:59] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[00:17:20] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[00:17:49] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 201 seconds
[00:17:59] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 216 seconds
[00:18:24] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds
[00:19:49] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds
[00:21:59] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 19 seconds
[00:22:01] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 16 seconds
[00:22:19] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100%
[00:25:09] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100%
[00:27:30] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[00:31:00] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms
[00:50:58] i was pinged a while ago
[00:51:09] oh that was ori-l
[00:51:19] yeah rt duty only really counts on weekdays
[00:51:29] right now i'm online due to "juniper duty"
[00:52:27] argh
[00:52:28] still?
[01:04:40] still
[01:04:50] i did go out on a bike ride
[01:05:25] they are certain it is a bug in the mx80 - not properly sending pim joins
[01:05:34] which makes sense that new groups work, until they get pruned
[01:19:58] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[01:44:04] I don't know what 'pim joins' are, but they sound cool.
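For readers in the same position: PIM (Protocol Independent Multicast) joins are the control messages a multicast router sends upstream to keep a group's traffic flowing. If the MX80 stops refreshing them, established groups get pruned and go silent while freshly joined groups still work for a while — exactly the symptom described above. A sketch of how one might confirm this from the JunOS CLI (the group address is illustrative, not taken from the log):

    show pim neighbors                   # are PIM adjacencies up at all?
    show pim join extensive 239.0.0.1    # is join state present and being refreshed?
    show igmp group                      # are receivers still asking for the group?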
[02:29:56] !log LocalisationUpdate completed (1.21wmf8) at Mon Jan 28 02:29:55 UTC 2013
[02:30:17] Logged the message, Master
[02:47:09] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds
[02:47:49] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 192 seconds
[02:47:59] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 191 seconds
[02:52:54] !log LocalisationUpdate completed (1.21wmf7) at Mon Jan 28 02:52:53 UTC 2013
[02:53:06] Logged the message, Master
[02:57:59] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 26 seconds
[02:58:09] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 29 seconds
[02:58:10] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 27 seconds
[02:58:59] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[03:02:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[03:03:05] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:49] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:49] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:49] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[03:31:49] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:49] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:50] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:50] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:51] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[04:07:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Mon Jan 28 04:07:44 UTC 2013
[04:08:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[04:08:36] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Mon Jan 28 04:08:34 UTC 2013
[04:09:36] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[04:11:41] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[04:25:51] PROBLEM - swift-container-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:25:52] PROBLEM - swift-account-reaper on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:25:52] PROBLEM - swift-container-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:01] PROBLEM - swift-object-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:02] PROBLEM - swift-object-auditor on ms-be11 is CRITICAL: Timeout while attempting connection
[04:26:02] PROBLEM - swift-container-replicator on ms-be11 is CRITICAL: Timeout while attempting connection
[04:26:02] PROBLEM - swift-account-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:21] PROBLEM - swift-object-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:22] PROBLEM - swift-account-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:31] PROBLEM - swift-container-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
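A note on the two db53 checks above: "MySQL Slave Delay" reads the slave's own lag estimate, while "MySQL Replication Heartbeat" compares a timestamp row periodically written on the master (pt-heartbeat style), which stays honest even when the slave's own estimate does not. A minimal sketch of both measurements — the hostname, credentials and the heartbeat table name are assumptions, not the production check definitions:

    # the slave's self-reported lag
    mysql -h db53 -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master

    # heartbeat-style lag: replicated master timestamp vs. current time
    mysql -h db53 -e 'SELECT TIMESTAMPDIFF(SECOND, ts, UTC_TIMESTAMP()) AS lag FROM heartbeat.heartbeat'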
[04:26:32] PROBLEM - swift-account-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:41] PROBLEM - swift-object-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:28] PROBLEM - swift-account-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:28] PROBLEM - swift-container-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:29] PROBLEM - swift-object-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:56] PROBLEM - swift-container-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:56] PROBLEM - swift-object-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:04] PROBLEM - swift-object-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:13] PROBLEM - swift-container-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:14] PROBLEM - swift-account-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:22] PROBLEM - swift-object-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:31] PROBLEM - swift-account-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:40] PROBLEM - swift-container-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:58] PROBLEM - swift-account-reaper on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:33:17] fixed!!!!
[04:38:31] RECOVERY - swift-container-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[04:38:32] RECOVERY - swift-account-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[04:38:32] RECOVERY - swift-object-updater on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[04:38:32] RECOVERY - swift-account-reaper on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[04:38:41] RECOVERY - swift-container-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[04:38:41] RECOVERY - swift-object-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[04:38:42] RECOVERY - swift-container-server on ms-be11 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[04:38:42] RECOVERY - swift-account-reaper on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[04:38:42] RECOVERY - swift-container-updater on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[04:38:52] RECOVERY - swift-object-server on ms-be11 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[04:38:52] RECOVERY - swift-account-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[04:38:52] RECOVERY - swift-object-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[04:38:52] RECOVERY - swift-container-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[04:38:59] RECOVERY - swift-container-server on ms-be11 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[04:38:59] RECOVERY - swift-account-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[04:39:00] RECOVERY - swift-object-server on ms-be11 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[04:39:11] RECOVERY - swift-object-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[04:39:11] RECOVERY - swift-account-server on ms-be11 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[04:39:26] RECOVERY - swift-account-server on ms-be11 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[04:39:26] RECOVERY - swift-container-updater on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[04:39:26] RECOVERY - swift-object-updater on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[04:39:53] RECOVERY - swift-container-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[04:39:54] RECOVERY - swift-account-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[04:39:54] RECOVERY - swift-object-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[04:49:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds
[04:50:01] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 185 seconds
[04:50:14] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 190 seconds
[04:51:17] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 205 seconds
[05:04:38] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds
[05:04:58] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 189 seconds
[05:05:14] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 189 seconds
[05:06:08] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 184 seconds
[05:42:36] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 201 seconds
[05:52:37] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[05:52:46] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[05:53:14] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[05:54:17] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[06:19:29] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours
[06:21:35] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours
[06:24:35] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours
[06:25:29] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours
[06:25:29] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours
[06:26:32] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours
[06:27:35] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours
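The PROCS lines above are NRPE running the stock check_procs plugin with a regex over process arguments; when NRPE itself times out (the CHECK_NRPE lines), every such check on the host goes CRITICAL at once, which is why the whole block flaps and then recovers together after the "fixed!!!!". One of those checks plausibly looks like this on the host — the plugin path and threshold are assumptions, not the production definition:

    /usr/lib/nagios/plugins/check_procs -c 1: \
        --ereg-argument-array='^/usr/bin/python /usr/bin/swift-object-replicator'
    # output matches the log: "PROCS OK: 1 process with regex args ^/usr/bin/python ..."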
[06:38:43] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours
[06:41:39] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours
[06:41:40] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours
[06:45:53] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours
[07:23:32] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[07:28:47] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[08:52:06] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[08:54:07] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[08:55:06] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.027 second response time on port 11000
[08:55:55] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.001 second response time on port 11000
[10:15:50] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[10:17:25] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[10:19:46] New patchset: Silke Meyer; "Create and set $wgCacheDir for Wikidata repo or client" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46220
[10:28:57] New patchset: Silke Meyer; "Removing Moodbar extension from Wikidata setup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46221
[11:20:35] New patchset: Mark Bergsma; "Add cp3010 as upload varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46228
[11:22:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46228
[11:39:37] PROBLEM - Varnish HTTP upload-frontend on cp3010 is CRITICAL: Connection refused
[11:49:37] RECOVERY - Varnish HTTP upload-frontend on cp3010 is OK: HTTP OK: HTTP/1.1 200 OK - 641 bytes in 0.185 second response time
[12:13:07] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.19784284672 (gt 8.0)
[12:37:34] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.0
[12:51:04] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 12.9331811511 (gt 8.0)
[12:55:46] New patchset: Silke Meyer; "Fixed a copy and paste error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46234
[12:59:21] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[12:59:50] New patchset: Mark Bergsma; "Add cp3007-cp3009" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46235
[13:00:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46235
[13:00:52] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100%
[13:01:31] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 92.11 ms
[13:02:50] New patchset: Silke Meyer; "Added variable to Wikidata client to enable other SiteIDS than "enwiki"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46236
[13:23:11] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100%
[13:24:32] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 92.14 ms
[13:26:41] PROBLEM - Varnish HTTP upload-frontend on cp3010 is CRITICAL: Connection refused
[13:26:51] PROBLEM - Varnish traffic logger on cp3010 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[13:27:47] PROBLEM - Varnish traffic logger on cp3010 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[13:28:59] PROBLEM - Varnish HTTP upload-frontend on cp3010 is CRITICAL: Connection refused
[13:33:20] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[13:33:20] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[13:33:20] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[13:33:20] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[13:33:21] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours
[13:33:21] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[13:33:21] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[13:33:22] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[13:34:32] New patchset: Mark Bergsma; "Proof of concept for hashing thumbs to original's hash key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29805
[13:35:50] RECOVERY - Varnish traffic logger on cp3010 is OK: PROCS OK: 3 processes with command name varnishncsa
[13:36:11] RECOVERY - Varnish HTTP upload-frontend on cp3010 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.234 seconds
[13:36:40] RECOVERY - Varnish HTTP upload-frontend on cp3010 is OK: HTTP OK: HTTP/1.1 200 OK - 641 bytes in 0.185 second response time
[13:36:47] RECOVERY - Varnish traffic logger on cp3010 is OK: PROCS OK: 3 processes with command name varnishncsa
[13:37:34] New review: Amire80; "Can probably be combined with I3a43ad74." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/46234
[13:38:43] New review: Amire80; "Looks sane, but I am not familiar with Puppet. Can probably be combined with I5c62a7f7." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/46236
[13:43:52] interesting. did https handling somehow change with the eqiad migration?
[13:44:09] YuviPanda: what do you mean ? :-]
[13:44:13] any weird behavior?
[13:44:19] hashar: well, 'different' behavior
[13:44:24] (or I think)
[13:44:28] https/1.1 for https
[13:44:30] err
[13:44:32] 1.1 for https
[13:44:34] 1.0 for http
[13:45:12] hashar: end difference being, transfer encoding chunked works on https
[13:45:15] and not on http
[13:45:23] maybe for the keep alive ?
[13:45:50] AFAIK the https traffic is sent to a nginx reverse proxy
[13:45:55] maybe something changed there
[13:46:00] or it always add 1/1
[13:46:08] or it always had HTTP 1/1
[13:46:27] hashar: yeah, i'm investigating
[13:46:39] it might be a false alarm, since you were able to upload without problems yesterday
[13:47:19] mark, ^^
[13:47:32] New patchset: Mark Bergsma; "Proof of concept for hashing thumbs to original's hash key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29805
[13:47:55] hashar: plus, if it is still nginx -> squid -> apache for https, transfer-encoding: chunked should *not* work, since squid (or at least the version we used to run) did not support that
[13:48:10] that part of the infrastructure has not changed at all during the eqiad migration
[13:48:36] mark: so, we support full http/1.1 on https, and http/1.0 on http?
[13:49:04] i don't think we support full http/1.1 anywhere
[13:49:20] curl -I https://en.wikipedia.org says http/1.1
[13:50:37] what I'm saying is, even though some of our services may announce http 1.1, I don't think the full http 1.1 spec/featureset is supported anywhere
[13:50:56] hmm, alright. so nothing's changed.
[13:50:58] * YuviPanda looks elsewhere
[13:51:01] thanks mark
[13:51:12] no, you're still talking to the exact same servers as last month
[13:51:45] oh? https was always in eqiad?
[13:51:55] for about a year already
[13:52:15] http too
[13:52:23] * YuviPanda trouts self
[13:52:30] alright, the bug is elsewhere then!
[13:52:47] however, *weirdly*, Transfer-Encoding: Chunked actually *does* work now
[13:53:00] I can swear that it didn't when I first tried it, 2-3 months ago
[13:53:11] i believe ryan upgraded the nginx ssl servers to a newer version about a month ago
[13:53:18] or a bit more
[13:53:22] aaah
[13:53:42] so it is possible that they now buffer it before sending it off to the squids
[13:54:44] yes
[13:54:50] chunked is part of http/1.1
[13:55:04] nginx before the upgrade supported proxying only with http/1.0
[13:55:19] this was one of the reasons ryan upgraded those
[13:56:26] paravoid: wheee, awesome.
[13:56:28] that explains it
[13:56:35] so it's been on from before the move.
[13:56:39] either way, good :)
[13:57:01] He had told me about this, but for some reason I thought it was not enabled everywhere
[13:57:19] hmm, but when I last talked to him it might indeed not have been enabled everywhere (more than a month ago)
[14:05:00] ApiSandbox, o_0
[14:07:09] I think pipelining is not enabled yet
[14:07:15] but chunked might just be, who knows
[14:17:39] paravoid: it does seem to work (for me)
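To unpack the conclusion: chunked transfer encoding is an HTTP/1.1 feature, and nginx before the upgrade could only proxy to its upstream with HTTP/1.0, so chunked request bodies through the SSL terminators had nowhere to go until the newer nginx started buffering them. A sketch of how to probe this from outside, plus the generic nginx knob involved (the upload URL is a made-up example, and the config line is the standard nginx directive, not WMF's actual config):

    # what does the front end announce?
    curl -sI https://en.wikipedia.org/ | head -1        # HTTP/1.1 200 OK

    # does a chunked request body make it through?
    curl -s -o /dev/null -w '%{http_code}\n' \
        -H 'Transfer-Encoding: chunked' -T payload.bin \
        https://commons.wikimedia.org/w/api.php

    # on the proxy side, nginx >= 1.1.4 can speak HTTP/1.1 to its backend:
    #     proxy_http_version 1.1;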
[14:28:35] New patchset: Hashar; "(bug 44424) wikiversions.cdb for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46240
[14:29:24] New patchset: Mark Bergsma; "Put cp3009/cp3010 in a standard frontend/backend configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46241
[14:32:11] why isn't gerrit merging
[14:32:39] that change seems to depend on https://gerrit.wikimedia.org/r/#/c/29805/4
[14:32:46] which is not merged
[14:32:51] doh
[14:32:57] i'm an idiot
[14:33:08] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46241
[14:34:54] New patchset: Mark Bergsma; "Put cp3009/cp3010 in a standard frontend/backend configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46242
[14:35:07] <^demon> mark: You can just rebase it and amend, rather than abandoning :)
[14:35:24] i realize that
[14:35:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46242
[14:37:31] paravoid: would we be able to write the README.md file for the wikimedia module ? :-]
[14:38:14] New patchset: Mark Bergsma; "I really am an idiot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46244
[14:38:51] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46244
[14:43:23] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100%
[14:44:00] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:56] New review: Anomie; "The full path needs to be passed to getRealmSpecificFilename(). As written, this will test if wikive..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/46240
[14:50:18] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 118.57 ms
[14:50:30] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 92.28 ms
[14:52:51] PROBLEM - Varnish traffic logger on cp3010 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[14:52:51] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:53:01] PROBLEM - Varnish HTTP upload-frontend on cp3010 is CRITICAL: Connection refused
[14:53:01] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[14:53:21] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[14:53:44] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:54:38] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[14:54:51] RECOVERY - Varnish traffic logger on cp3010 is OK: PROCS OK: 3 processes with command name varnishncsa
[14:55:00] RECOVERY - Varnish HTTP upload-frontend on cp3010 is OK: HTTP OK: HTTP/1.1 200 OK - 643 bytes in 0.182 second response time
[14:55:14] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[14:57:12] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.0
[14:57:20] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 92.30 ms
[14:57:29] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 118.88 ms
[14:59:31] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100%
[15:01:20] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 92.12 ms
[15:01:23] PROBLEM - Varnish HTTP upload-frontend on cp3009 is CRITICAL: Connection refused
[15:03:02] RECOVERY - Varnish HTTP upload-frontend on cp3009 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.237 seconds
[15:03:20] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:03:21] PROBLEM - Varnish traffic logger on cp3009 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[15:04:05] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:05:08] PROBLEM - Varnish traffic logger on cp3009 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[15:05:59] !log Pooled cp3009 and cp3010 in the esams upload pool
[15:06:13] Logged the message, Master
[15:15:56] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:16:05] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 20.2831671111 (gt 8.0)
[15:16:07] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:18:07] emery again
[15:18:07] sigh
[15:19:27] [18446744025.112439] BUG: soft lockup - CPU#4 stuck for 17163091968s! [perl:21796]
[15:19:32] that's interesting...
[15:20:06] yes, i am looking at it with ottomata
[15:20:13] it's not clear to me what is causing
[15:20:23] this, because nothing has changed in the recent weeks
[15:20:28] more traffic
[15:20:50] but then also oxygen and locke would have the same issues, and it started on sunday
[15:20:52] are we sure it's firing on all cylinders? i'm not entirely sure what the soft lockup means
[15:21:14] no not necessarily
[15:21:44] it could be some kernel bug (we've seen it before) on a certain type of hardware under quite peculiar load conditions
[15:21:52] ok
[15:22:02] i will ask ottomata to stop the sqstat script first
[15:22:06] [18446744025.112439] BUG: soft lockup - CPU#4 stuck for 17163091968s! [perl:21796]
[15:22:09] 15:22:03 up 209 days, 17:32, 5 users, load average: 2.81, 2.86, 2.87
[15:22:13] typical 208 days bug
[15:22:14] needs a reboot
[15:22:34] what is 208 days bug?
[15:22:41] and an upgrade to precise
[15:22:50] it's a kernel bug on older versions of 2.6.32
[15:22:57] ok
[15:23:05] that fires off after 208 days of uptime
[15:23:09] and messes with the scheduler
[15:23:09] i will talk with ottomata about upgrading to precise
[15:23:12] and then weird things happen
[15:23:17] RECOVERY - Varnish traffic logger on cp3009 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:23:19] apparently :)
[15:23:37] there's a thread from last year in ops@
[15:24:05] paravoid: do the stock kernel updates fix the bug?
[15:24:06] ok
[15:24:10] yes
[15:24:15] RECOVERY - Varnish traffic logger on cp3009 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:24:19] running an upgrade to latest 2.6.32 now
[15:24:21] k
[15:24:40] drdee: can I reboot?
[15:25:03] yes
[15:25:23] cool
[15:25:30] will do after upgrade finishes
[15:25:43] please arrange a precise upgrade nevertheless
[15:25:47] ok
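The "208 days" figure is not folklore: the commonly cited explanation for this 2.6.32-era bug is that sched_clock()'s fixed-point nanosecond arithmetic overflows a 64-bit value after 2^54 ns, after which the scheduler and the soft-lockup watchdog see garbage — hence the absurd "stuck for 17163091968s" above. The arithmetic as a quick sanity check (the 2^54 constant is the figure usually quoted, not re-derived from kernel source here):

    awk 'BEGIN { printf "%.1f days\n", 2^54 / 1e9 / 86400 }'   # -> 208.5 days
    # which lines up with the host: "up 209 days" when the lockups began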
[15:26:10] 1 question i have is how to determine whether a box is fully puppetized, would you have to make an image of the box, then run the puppet on a different box and make image as well and compare them or something like that?
[15:27:20] that would work, although we never do that ;)
[15:27:44] because i think that both emery and locke still contain unpuppetized things
[15:29:11] !log upgraded & rebooting emery, 208 days bug
[15:29:22] Logged the message, Master
[15:29:28] ty paravoid
[15:29:50] New review: Hashar; "Mailed Ryan about it." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45108
[15:31:46] can I get a few changes merged for gallium please? I will take care of running puppet on the host. https://gerrit.wikimedia.org/r/#/c/44974/ install pyflakes a python linter and https://gerrit.wikimedia.org/r/#/c/46188/ which updates the frontage at http://integration.mediawiki.org/ ;-D
[15:32:07] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[15:32:18] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[15:33:47] PROBLEM - SSH on emery is CRITICAL: Connection refused
[15:33:51] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[15:34:07] PROBLEM - udp2log log age for aft on emery is CRITICAL: Connection refused by host
[15:34:08] PROBLEM - udp2log log age for emery on emery is CRITICAL: Connection refused by host
[15:35:12] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[15:36:15] PROBLEM - SSH on emery is CRITICAL: Connection refused
[15:36:51] PROBLEM - udp2log log age for aft on emery is CRITICAL: Connection refused by host
[15:37:27] PROBLEM - udp2log log age for emery on emery is CRITICAL: Connection refused by host
[15:38:00] powercycling emery
[15:38:33] with the console open
[15:38:48] PROBLEM - Host kaulen is DOWN: CRITICAL - Host Unreachable (208.80.152.149)
[15:39:44] thanks paravoid for babysitting emery ;)
[15:39:48] np
[15:39:51] RECOVERY - Host kaulen is UP: PING OK - Packet loss = 0%, RTA = 0.14 ms
[15:39:56] fsck check it seems, although I did press C before
[15:40:43] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44974
[15:40:58] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46188
[15:42:09] hashar: both merged
[15:42:40] holy crap, my gerrit queue is huge
[15:43:06] paravoid: thhhannkkkkkss :-]
[15:43:08] New review: Faidon; "Thanks." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/45657
[15:43:09] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45657
[15:45:14] New review: Hashar; "Works on https://integration.mediawiki.org/ :-] Thanks!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46188
[15:45:19] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46234
[15:45:33] Silke_WMDE: the second one needs a rebase
[15:46:16] j^: around?
[15:50:53] paravoid: ssh to emery is still borked :(
[15:51:13] it was fscking
[15:51:30] should be ok now
[15:51:30] oh sorry
[15:51:33] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active
[15:51:37] yup
[15:51:41] 307 days without being checked kinda does that :)
[15:51:42] ty!
[15:52:00] RECOVERY - SSH on emery is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[15:52:07] RECOVERY - udp2log log age for aft on emery is OK: OK: all log files active
[15:52:07] RECOVERY - SSH on emery is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[15:52:08] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active
[15:52:41] New review: Faidon; "Needs a manual rebase." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46236
[15:52:45] RECOVERY - udp2log log age for aft on emery is OK: OK: all log files active
[15:54:02] paravoid: and we are back at 25% packet loss at emery, could you kill the sqstat script on emery and see if that helps
[15:55:01] New patchset: Silke Meyer; "Added variable to Wikidata client to enable other SiteIDS than "enwiki"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46236
[15:56:23] is ottomata not available?
[15:56:33] ottomata is here!
[15:56:36] hey
[15:56:38] it looks better now
[15:56:46] but still, could you monitor it a bit and help drdee?
[15:56:53] I can help if it's something you can't handle
[15:56:57] yeah, i'm helping him, i was just in a levelup chat with chad on it
[15:57:03] for the last hour
[15:57:09] sorry
[15:57:10] New patchset: Silke Meyer; "Added variable to Wikidata client to enable other SiteIDS than "enwiki"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46236
[15:57:11] not 'on it'
[15:57:12] but ja
[15:57:18] no worries
[15:57:26] did you reboot emery?
[15:57:33] I did
[15:57:33] aye cool
[15:57:38] it looks much better now
[15:57:42] paravoid: rebase is done
[15:57:51] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.0
[15:58:19] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46236
[15:58:41] paravoid: THANKS! :)
[15:59:21] Silke_WMDE: done :) for future reference, #wikimedia-labs is the right place for labs-related support, questions and puppet commits and here (#wikimedia-operations) for general puppet questions etc.
[15:59:32] ok
[16:04:18] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa
[16:05:12] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa
[16:08:21] grr salt messages in dmesg are so annoying
[16:08:38] every 30'
[16:16:36] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[16:17:17] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[16:18:39] New patchset: MaxSem; "Postgres module for OSM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36155
[16:21:06] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours
[16:23:03] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours
[16:24:13] paravoid: whats up?
[16:24:41] j^: hey, do you have any news regarding your cgroups patches?
[16:24:53] I replied to the bug report a few minutes ago asking for Tim's input
[16:25:03] I remember he was also doing something related
[16:25:54] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours
[16:26:07] paravoid: i rebased it after tim added the hard timeout not sure about the open question about labs
[16:27:06] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours
[16:27:07] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours
[16:27:53] but would rather try to get swift or whats used now setup in labs
[16:28:09] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours
[16:29:12] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours
[16:36:09] omg the cronspam these days...
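Background on the cgroups exchange above: the patches under discussion confine MediaWiki's shelled-out media jobs (ffmpeg and friends) so a runaway transcode cannot take a scaler down. With the cgroup v1 interface of that era, the core trick is to drop the shell into a memory-capped group before exec'ing the job, so everything it spawns inherits the cap. A minimal sketch — the mount point, group name, and 1 GiB limit are illustrative, not the actual patch:

    mkdir -p /sys/fs/cgroup/memory/transcode-job
    echo $((1024 * 1024 * 1024)) > /sys/fs/cgroup/memory/transcode-job/memory.limit_in_bytes
    echo $$ > /sys/fs/cgroup/memory/transcode-job/tasks   # move this shell into the group
    exec ffmpeg -i input.webm output.ogv                  # the child inherits the memory cap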
[16:40:09] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours
[16:43:09] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours
[16:43:09] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours
[16:47:12] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours
[17:46:17] !Log shutting down ms-be11 for H/W replacement
[17:47:28] PROBLEM - MySQL Replication Heartbeat on db1009 is CRITICAL: CRIT replication delay 203 seconds
[17:48:28] RECOVERY - MySQL Replication Heartbeat on db1009 is OK: OK replication delay 0 seconds
[18:00:40] !log shutting down ms-be11 for h/w replacement
[18:00:54] Logged the message, Master
[18:08:53] !log upgrading ffmpeg/libav for USN-1705-1/USN-1706-1 on all image/video scalers in eqiad/pmtpa
[18:09:06] Logged the message, Master
[18:24:55] PROBLEM - Host ms-be11 is DOWN: PING CRITICAL - Packet loss = 100%
[18:25:40] PROBLEM - Host ms-be11 is DOWN: PING CRITICAL - Packet loss = 100%
[18:25:52] ^demon: I just created a project in gerrit but clicked the wrong box and it's on the top level rather than under Operations/Debs like I intended. Can you delete it so that I can start again? (Or, tell me how to delete it and/or move it?)
[18:26:30] <^demon> There's a command to change the parent :)
[18:26:38] <^demon> What's the repo?
[18:26:43] New patchset: Reedy; "Disable Mostcategories in $wgDisableQueryPageUpdate for frwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46262
[18:27:00] ^demon: rt-authen-externalauth
[18:27:26] <^demon> Ah, it was mis-named, not wrong parent.
[18:27:39] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46262
[18:28:43] ^demon: Does that mean that in the web interface I need to specify the parent and also prepend /the/parent/name to the project name?
[18:28:56] <^demon> Yep.
[18:29:32] <^demon> Parent is independent of project name. You could make foo/bar/baz inherit from omg/wtf/lol
[18:30:45] hm, ok.
[18:31:11] <^demon> We use the foo/bar/baz inherit from foo/bar out of convention (and because it makes sense, imho).
[18:31:17] <^demon> :)
[18:31:23] Anyway, thanks for moving! Looks right now.
[18:31:41] !log reedy synchronized wmf-config/InitialiseSettings.php
[18:31:45] <^demon> yw.
[18:31:52] Logged the message, Master
[18:34:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45108
[18:42:52] ^demon: externalauth is packaged in >= wheezy >= quantal
[18:42:53] er
[18:42:58] andrewbogott: externalauth is packaged in >= wheezy >= quantal
[18:43:06] andrewbogott: rt4 might be nice to have too
[18:43:08] <^demon> I was like um wtf?
[18:43:15] sorry :)
[18:44:12] paravoid: I just now had to sing the alphabet a couple of times… aren't both of those in the future?
[18:45:27] we regularly backport stuff
[18:45:45] oh btw, I've independently packaged externalauth years ago
[18:45:52] for rt 3.4? something like that
[18:45:54] it wasn't fun at all
[18:46:07] RT4 is also much *much* nicer UI wise
[18:46:24] expands a bit the scope of your endeavour
[18:46:29] but I think it'll make your job easier
[18:46:35] so it might be win-win
[18:46:49] paravoid: OK… mutante and I discussed doing ldap and upgrading to 4 later, but maybe that's silly and we should do it the other way 'round
[18:47:21] yep
[18:47:33] rt4 is also perl, so a straight import of the packages (no rebuild) is probably possible
[18:48:16] paravoid, do you happen to know if the upgrade path from 3.8 to 4 is smooth or rocky?
[18:48:53] https://github.com/bestpractical/rt/blob/stable/docs/UPGRADING-4.0
[18:49:10] I remember it being like a 3.6->3.8 upgrade
[18:49:16] but I don't remember that much tbh :)
[18:49:22] andrewbogott: hi?
[18:49:42] mutante, paravoid is advocating that we upgrade to RT 4 before doing anything else… because it'll make our life easier going forward.
[18:49:54] I know we'd talked about doing it the other way 'round but I don't think we had a good reason :)
[18:50:00] yes, because of http://packages.debian.org/sid/rt4-extension-authenexternalauth
[18:50:10] or rather http://packages.ubuntu.com/rt4-extension-authenexternalauth
[18:50:19] New patchset: Reedy; "Reduce duplication in $wgContentNamespaces" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46276
[18:50:31] paravoid: I just wanted to point out to you that https://github.com/wikimedia/sartoris is starting to take better shape
[18:50:49] ah, that extension looks interesting for us indeed
[18:50:50] paravoid: if you have any spare cycles and would like to review it at all… that would be great
[18:50:53] preilly: oh thanks. sorry, I didn't manager to have a look yet
[18:50:58] seeing the "alternate SSO" part and stuff
[18:51:02] s/manager/manage/
[18:51:13] mutante: that's the LDAP one
[18:51:16] mutante: That's the extension I'm planning to use, it's just that with 3.8 I would have to package it myself
[18:51:18] paravoid: no worries
[18:51:34] andrewbogott: paravoid : i agree we did not have a reason which order to do it in
[18:51:51] andrewbogott: yeah, *don't*, been there done that and it took me days to package it
[18:51:51] we merely talked about both, hooking up to LDAP and upgrading
[18:52:02] and I know my way around packaging better than you I'd guess :)
[18:52:25] yea, it totally makes sense to not try and package it ourselves then
[18:52:29] paravoid: Very likely!
[18:52:46] mutante, want to take over my RT labs box and do a practice upgrade?
[18:53:01] (Although that one is already polluted with extauth I guess...)
[18:53:24] (Or I don't mind doing the upgrade myself; either way.)
[18:53:35] yes, please add me as a sysadmin to your project. i just don't think i will get much done in this week, since i will be travelling
[18:53:46] 'k
[18:54:11] suggesting a fresh instance but in the same project then
[18:54:56] mutante: I was actually working in the 'openstack' project just because it already had the firewall rules I needed :) Lemme see about making a fresh project and instance
[18:55:18] ah, ok, yes, i think we should just make an RT project then
[18:55:34] cool
[18:55:55] extauth should be easy if you have packages btw
[19:02:53] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46276
[19:05:05] !log reedy synchronized wmf-config/InitialiseSettings.php
[19:05:18] Logged the message, Master
[19:09:12] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.21wmf8
[19:09:22] Logged the message, Master
[19:11:37] New patchset: Reedy; "enwiki to 1.21wmf8" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46278
[19:11:53] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46278
[19:18:54] New patchset: Reedy; "$wgFilterRobotsWL was removed in 1.7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46279
[19:19:14] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46279
[19:24:16] New patchset: Reedy; "Remove some exact duplicate default configs of wgNamespacesToBeSearchedDefault" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46280
[19:24:44] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46280
[19:25:20] !log reedy synchronized wmf-config/InitialiseSettings.php
[19:25:32] Logged the message, Master
[19:27:16] binasher: can you add job-pop-duplicate and job-insert-duplicate stats to gdash?
[19:27:37] AaronSchulz: sure
[19:27:48] will do later today
[19:27:53] * AaronSchulz wishes he had access to change graphs on that box
[19:28:25] * AaronSchulz still wants to split LockManager graphs out from https://gdash.wikimedia.org/dashboards/filebackend/
[19:28:28] AaronSchulz: i'll finish puppetizing the gdash graph definitions
[19:28:47] so you can change em in the future with just an ops merge
[19:29:00] should lockmanager get its own set of graphs?
[19:29:10] under FileBackendStore yes
[19:29:18] they are mixed with streamfile now, which is not related
[19:29:42] New patchset: Reedy; "Fixup whitespace for defaults in $wgNamespacesWithSubpages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46282
[19:30:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46282
[19:32:27] New patchset: Reedy; "Removed $wgUseCategoryBrowser as unused/same as default" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46283
[19:32:40] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46283
[19:33:17] RobH: how stoney are the dell 420's?
[19:34:32] binasher: lulz
[19:36:19] hashar: can you look at https://gerrit.wikimedia.org/r/#/c/14068/ ?
[19:36:33] is that abandoned?
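On "we regularly backport stuff": since rt4-extension-authenexternalauth already exists in wheezy/quantal, the job is an import or rebuild rather than packaging from scratch. The usual motions look roughly like this — it assumes a deb-src entry for the newer distro and is not a record of what was actually run; for a pure-perl package, as paravoid notes, even the rebuild may be unnecessary:

    apt-get source rt4-extension-authenexternalauth   # fetch the newer source package
    cd rt4-extension-authenexternalauth-*
    dpkg-buildpackage -us -uc -b                      # rebuild against the local toolchain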
[19:38:35] New patchset: Reedy; "Remove duplication in $wgNamespaceProtection" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46284
[19:39:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46284
[19:40:06] AaronSchulz: hi
[19:40:41] AaronSchulz: I discover that last comment
[19:40:50] AaronSchulz: feel free to fix it up in what ever way you want :-]
[19:41:01] AaronSchulz: or we can abandon it
[19:41:07] and fix it later :-]
[19:41:23] amending too much breaks authorship
[19:56:21] RobH, can you close https://rt.wikimedia.org/Ticket/Display.html?id=4210 ?
[20:09:00] fyi, I've cleaned up http://wikitech.wikimedia.org/view/Multicast_HTCP_purging ...please fix if I've gotten anything wrong
[20:17:23] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[20:17:40] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[20:21:10] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[20:24:11] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[20:27:00] Could someone please clean up all the stupid logs in fluorine:/a/mw-log ?
[20:27:07] -rw-r--r-- 1 udp2log udp2log 0 Jan 28 06:53 >userCan('read',.log
[20:27:07] -rw-r--r-- 1 udp2log udp2log 0 Jan 28 06:53 ->userCan('read',.log
[20:27:07] -rw-r--r-- 1 udp2log udp2log 0 Jan 28 06:53 userCan('read',.log
[20:27:08] etc etc
[20:39:06] Reedy: would you like us to remove files in there ?
[20:39:32] Yeah, there's quite a lot of bad ones, but a few are legit
[20:39:46] Anything size 0 is good to go
[20:40:10] eww
[20:41:19] Reedy: cleaned the size 0's
[20:41:23] yay find
[20:41:23] !log taking down osm-web1 for h/w installation
[20:41:26] such a nice command
[20:41:33] Logged the message, Master
[20:42:06] Thanks!
[20:42:19] !log cleared a lot of 0 size badly named logs on fluorine:/a/mw-log
[20:42:30] Logged the message, Mistress of the network gear.
[21:02:45] PROBLEM - check_mysql on payments1001 is CRITICAL: Access denied for user jgreen@localhost (using password: YES)
[21:05:18] RECOVERY - check_mysql on payments1001 is OK: Uptime: 5463390 Threads: 4 Questions: 19851457 Slow queries: 547 Opens: 445 Flush tables: 1 Open tables: 64 Queries per second avg: 3.633 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[21:26:23] !log pgehres synchronized php-1.21wmf8/extensions/LandingCheck 'Updating LandingCheck to master'
[21:26:35] Logged the message, Master
[21:34:36] I don't see the patch there
[21:35:21] wrong channel
[21:35:34] New patchset: RobH; "granting shell access to stat1001 to ryan faulkner per rt 4258" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46296
[21:35:35] rfaulkner: im adding your access to stat1001 now
[21:35:49] RobH: woohoo!
[21:37:08] RobH: thanks. I'll need to tail apache logs, will my access include permissions that enable me to do that?
[21:37:32] uhhh, you guys didnt ask for that in ticket ;p
[21:37:38] heh, i'll check.
[21:37:47] if those logs are accessible by the other non roots
[21:37:48] then yes
[21:38:09] ezachte, diederik, dsc, dandreecu users have identical permissions
[21:38:21] rfaulkner: when i finish pushing change, you can try it out and see ;]
[21:38:24] and if not we fix
[21:38:34] RobH: sounds good to me. thanks.
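The fluorine cleanup ("yay find", "such a nice command") was almost certainly a one-liner along these lines; the exact flags are an assumption, but -size 0 is the standard idiom for exactly this job:

    # review, then delete, the zero-byte files in the udp2log output directory
    find /a/mw-log -maxdepth 1 -type f -size 0
    find /a/mw-log -maxdepth 1 -type f -size 0 -delete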
[21:39:04] rfaulkner: i would check it right now because i have a feeling that you won't be able to do that
[21:39:34] drdee: is that something the rest of you do already or nah?
[21:39:44] cuz if not, then we prolly have to hack some puppet changes to set file permissions
[21:39:53] (we being any of us can do it and I can roll it live ;)
[21:40:04] none of us do that
[21:40:06] New review: RobH; "make it so" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46296
[21:40:07] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46296
[21:40:09] heh, then yea
[21:40:23] rfaulkner: i bet it wont work, the permissions wont be set, checking shortly
[21:40:28] one ticket at a time, heh
[21:41:13] ok. no rush. thanks for the heads up. just tried to logon - still no perm
[21:41:22] RobH: :)
[21:41:24] yea, im running on stat1001 right now
[21:41:29] puppet is slow
[21:41:39] ........
[21:41:43] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Apache_site[000_default] is already defined in file /var/lib/git/operations/puppet/manifests/webserver.pp at line 320; cannot redefine at /var/lib/git/operations/puppet/manifests/webserver.pp:320 on node stat1001.wikimedia.org
[21:41:57] to err is human
[21:42:03] puppet is broken on stat1001
[21:42:07] my change didnt do that ;]
[21:43:09] checking!
[21:43:14] oh
[21:43:15] yes
[21:43:16] i know about that
[21:43:22] not mine either
[21:43:26] I commented on an RT that did it
[21:43:28] one sec
[21:43:50] i think i know what this is related to
[21:44:00] see how it defines 000_default
[21:44:28] recently we added new stuff to site.pp to disable the default page on a couple hosts
[21:44:47] yeah
[21:45:00] and it was also in webserver.pp it seems..hence the duplicate
[21:45:37] hmm
[21:45:54] apache_site { 000_default: name => "000-default", ensure => absent }
[21:46:01] this must be elsewhere too
[21:46:26] the duplicate error references both places to be the exact same
[21:46:28] which is fubar.
[21:46:50] mutante: any idea where you added that?
[21:47:03] and which version should exist?
[21:47:05] Leslie did.. hold on
[21:47:12] :q
[21:47:15] rfaulkner: See, its all LeslieCarr's fault!
[21:47:26] someone heat up the tar and find the feathers!
[21:49:47] hey
[21:49:48] is that an idiomatic RobH ? :D
[21:50:15] like "it is raining cats and dogs"
[21:50:33] oh man
[21:50:33] tar
[21:51:05] RobH: haha
[21:51:37] include webserver::apache
[21:51:37] webserver::apache::module { "rewrite": require => Class["webserver::apache"] }
[21:51:40] webserver::apache::site { $site_name:
[21:51:44] http://www.bdenvrac.com/images/lucky/villeimage/poker.gif !!
[21:51:45] misc/statistics.pp
[21:54:33] rfaulkner: leslie helped us figure it out, shes submitting a change shortly
[21:54:39] i'll ping ya when you should try to login again
[21:56:12] great! and thanks again RobH for being so informative
[21:56:43] New patchset: Lcarr; "removing webserver from being called multiple times and moving it to be called once" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46359
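The error above is Puppet's standard duplicate-resource failure: two manifests each declared Apache_site['000_default'], and stat1001's catalog ended up with both. Before deduplicating into a single place, finding every declaration is a grep away — a sketch, run from a checkout of operations/puppet:

    grep -rn '000_default' manifests/ modules/     # every declaration of the colliding title
    puppet parser validate manifests/webserver.pp  # syntax-check the edit before pushing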
[21:56:49] ^demon: So how hard is it to give someone +2 and who the hell actually rolls that change?
[21:57:01] <^demon> To what?
[21:57:06] <^demon> (Not hard, generally)
[21:57:10] New patchset: Mwalker; "Enable CentralNotice/Translate on Test" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46365
[21:57:16] operations/mediawiki-config
[21:57:27] if its something i can do, and you can teach me, that would be awesome =]
[21:57:31] looks good to me
[21:57:55] New review: Dzahn; "yep, avoids duplicate webserver definition" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46359
[21:57:55] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46359
[21:57:59] <^demon> RobH: This group https://gerrit.wikimedia.org/r/#/admin/groups/21,members
[21:58:34] ^demon: ok, added her
[21:58:36] thats it?
[21:58:39] cuz thats too easy.
[21:59:01] <^demon> I haz good acls ;-)
[21:59:12] this is obviously this easy due to countless hours of development by whoever deployed our gerrit.
[21:59:13] ;]
[21:59:21] ^demon: thx!
[21:59:26] New patchset: Mwalker; "Enable CentralNotice/Translate on Test" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46365
[21:59:34] <^demon> yw
[22:01:12] New patchset: Mwalker; "Enable CentralNotice/Translate on Test" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46365
[22:01:46] RobH: two weeks ago we have been talking about a possible equivalent of hume in EQIAD. Does it mean anything to you ?
[22:01:59] i recall the conversation yep!
[22:02:04] New patchset: RobH; "added sumanah to deployment access on cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46406
[22:02:09] and its on my list of 'stuff that needs a home in eqiad'
[22:02:22] RobH: I think faidon said we might have a machine already
[22:02:36] RobH: anyway, if you ever find out, let Rob Lanphier know about it please :-]
[22:02:37] already assigned or already deployed?
[22:02:48] ah
[22:02:49] cuz well, i had no less than 5 machines as 'dev deployment hosts' in eqiad
[22:02:50] good question
[22:02:56] most of which werent even powered up
[22:03:00] ohh
[22:03:07] but thx for info, ill make sure to touch base with him
[22:03:12] so I guess if we need a hume equivalent one can pick one from that pool can't we ?
[22:03:29] i think hume replacement has to be a bit heftier
[22:03:33] as its going to be the scripting host
[22:03:47] so im prolly putting it on a new high performance misc server (just arrived last week)
[22:03:52] as we want the memory overhead
[22:03:53] niceeee
[22:04:01] we might even phase out hume in favor of that new host so
[22:04:06] indeed
[22:04:16] i want this to be better than hume, give you all reason to leave it
[22:04:25] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45707
[22:04:51] RobH: nice!!! thanks :-]
[22:05:05] RobH: it is nowhere urgent nor critical, I was just wondering.
[22:06:34] no problem, never hurts to ask
[22:06:40] New patchset: RobH; "added sumanah to deployment access on cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46406
[22:08:01] yeah I learned two things when joining the project: please ask and be bold
[22:08:04] :-D
[22:08:12] damn you gerritbot
[22:08:16] run my parsing tests!
[22:08:54] bleh... spacing bad.
[22:09:08] New patchset: RobH; "added sumanah to deployment access on cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46406
[22:14:52] New review: RobH; "once more with feeling" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46406
[22:14:53] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46406
[22:18:54] robla1: So sumanah now has deployment access. There is a question if her labs and production keys are the same, so i put the info in the ticket
[22:19:04] but you may wanna touch base with her and ensure she isnt using the same key for both.
[22:19:14] otherwise, she is all set.
[22:19:54] paravoid: still around ? if not - wanted to ask about dbeacon - i think it could be good for interdatacenter connectivity issues/monitoring, putting >= 1 beacon in each physical location -- is it fairly lightweight? if so, i was wondering if we could put it in ganglia and have the ganglia listeners for each group be beacons
[22:20:03] paravoid: or would that create too big of a mesh ?
[22:20:18] !log pgehres Started syncing Wikimedia installation... : Dark deploying Translate integration with CentralNotice
[22:20:28] Logged the message, Master
[22:22:44] New patchset: Pyoungmeister; "add -e option and an expected of 200 to solr checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46444
[22:23:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46444
[22:23:46] heading home to get my car towed to the shop… wee!
[22:28:51] New patchset: Dzahn; "remove one more webserver::apache inclusion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46445
[22:31:06] PROBLEM - Solr on solr1003 is CRITICAL: The command defined for service Solr does not exist
[22:31:06] PROBLEM - Solr on solr1001 is CRITICAL: The command defined for service Solr does not exist
[22:31:06] PROBLEM - Solr on solr3 is CRITICAL: The command defined for service Solr does not exist
[22:31:15] PROBLEM - Solr on vanadium is CRITICAL: The command defined for service Solr does not exist
[22:31:16] PROBLEM - Solr on solr1002 is CRITICAL: The command defined for service Solr does not exist
[22:31:26] PROBLEM - Solr on solr2 is CRITICAL: The command defined for service Solr does not exist
[22:31:26] PROBLEM - Solr on solr1 is CRITICAL: The command defined for service Solr does not exist
[22:34:03] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46445
[22:34:32] cmjohnson1: You about?
[22:34:39] yep
[22:34:43] I am now confused on what is happening on the OSM servers that sbernardin1 is working on
[22:34:51] lets discuss in here cuz its too hard to keep the three of us in sync in PM
[22:35:05] whats up?
[22:35:13] RECOVERY - Solr on solr1002 is OK: HTTP CRITICAL - Invalid HTTP response received from host on port 8983: HTTP/1.1 200 OK
[22:35:13] RECOVERY - Solr on vanadium is OK: HTTP CRITICAL - Invalid HTTP response received from host on port 8983: HTTP/1.1 200 OK
[22:35:22] RECOVERY - Solr on solr1001 is OK: HTTP CRITICAL - Invalid HTTP response received from host on port 8983: HTTP/1.1 200 OK
[22:35:22] ohhh
[22:35:25] i see what you did in ticket
[22:35:29] !log pgehres Finished syncing Wikimedia installation... : Dark deploying Translate integration with CentralNotice
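Two of the Solr messages above are worth decoding: "The command defined for service Solr does not exist" means the Icinga command definition itself was broken while the change rolled out, and the self-contradictory "OK: HTTP CRITICAL ..." recoveries show a check whose exit status and output text disagreed until the -e/expected-200 change (and the follow-up "order of options matters" fix below) landed. The intended check is plain check_http; a plausible corrected invocation, with the plugin path and Solr URL as assumptions:

    /usr/lib/nagios/plugins/check_http -H solr1 -p 8983 -u /solr/ -e 200
    # matches the healthy output: "HTTP OK: Status line output matched 200 - ..."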
: Dark deploying Translate integration with CentralNotice [22:35:32] RECOVERY - Solr on solr2 is OK: HTTP CRITICAL - Invalid HTTP response received from host on port 8983: HTTP/1.1 200 OK [22:35:38] you renamed a db2 to cp2 to keep them split up in differing racks [22:35:39] Logged the message, Master [22:35:44] sbernardin1: ok, i see what he did [22:35:53] RECOVERY - Solr on solr3 is OK: HTTP CRITICAL - Invalid HTTP response received from host on port 8983: HTTP/1.1 200 OK [22:36:12] RECOVERY - Solr on solr1003 is OK: HTTP CRITICAL - Invalid HTTP response received from host on port 8983: HTTP/1.1 200 OK [22:36:35] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46365 [22:37:02] sbernardin1: sorry for confusion, so once they are relabeled and controller's in let me know [22:37:08] Robh: So am I OK so far? [22:37:22] i didnt realize chris relabeled some of them to keep them in differing racks [22:37:26] i re-read the ticket, you are ok so far =] [22:37:32] Robh: The first one is done... [22:37:41] sbernardin1: So you can boot and it can see the disks? [22:37:46] !log jenkins: applied pep8 and pyflakes jobs to sartoris repository. [22:37:55] RECOVERY - Solr on solr1 is OK: HTTP OK: Status line output matched 200 - 3115 bytes in 0.056 second response time [22:37:56] Logged the message, Master [22:38:37] Robh: It's booted up already [22:39:55] PROBLEM - Solr on solr2 is CRITICAL: The command defined for service Solr does not exist [22:39:56] PROBLEM - Solr on solr3 is CRITICAL: The command defined for service Solr does not exist [22:40:05] PROBLEM - Solr on solr1001 is CRITICAL: The command defined for service Solr does not exist [22:40:06] PROBLEM - Solr on solr1003 is CRITICAL: The command defined for service Solr does not exist [22:40:06] PROBLEM - Solr on vanadium is CRITICAL: The command defined for service Solr does not exist [22:40:15] PROBLEM - Solr on solr1002 is CRITICAL: The command defined for service Solr does not exist [22:40:37] New patchset: Pyoungmeister; "solr monitoring: order of options matters..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46451 [22:40:45] !log pgehres synchronized wmf-config/CommonSettings.php 'Enabling CN:Translate on testwiki' [22:40:55] Logged the message, Master [22:40:59] sbernardin1: right, my question is does the new controller see all 4 disks? [22:41:12] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46451 [22:41:13] i can login and check, but i want you to be aware of how to confirm that before you hand it off =] [22:41:39] if you dont know, i can walk you through how, no worries [22:43:18] notpeter: I have a question about search, so I am asking you, but if you dunno or whatever just lemme know. ticket https://rt.wikimedia.org/Ticket/Display.html?id=4387 is to give a new user shell access to the search boxen [22:43:25] so he can read the search logs [22:43:36] sure, sounds fine [22:43:40] so where are these search logs located? [22:43:46] cuz i need to check the permission set on them [22:43:55] ticket isnt very detailed ;] [22:43:56] /a/search/log/ [22:44:23] so is there a drawback to giving shell users on that host the search group [22:44:25] or is that bad? [22:44:26] they're 644 [22:44:31] oh, nm [22:44:33] i see that now [22:44:38] huzzah, easy ticket! [22:44:40] thx [22:44:45] yeah, just add his user [22:44:49] should be fine [22:45:29] once i find a free uid.
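The permission check above (the logs under /a/search/log/ are mode 644, so a plain shell account can read them and no extra group membership is needed) is easy to verify mechanically. A minimal Python sketch; the file path is illustrative, and the 644 figure comes from the conversation itself:

    import os
    import stat

    # Mode 644 sets the other-read bit, so any shell account can read the
    # file without belonging to the owning group.
    def others_can_read(path: str) -> bool:
        return bool(os.stat(path).st_mode & stat.S_IROTH)

    # e.g. others_can_read("/a/search/log/some.log")  # hypothetical path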
[22:45:30] =P [22:48:55] PROBLEM - Solr on solr1 is CRITICAL: The command defined for service Solr does not exist [22:49:40] New patchset: Hashar; "tox package on contint server (gallium)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46452 [22:49:52] grmbmblblbl [22:50:07] screw you ubuntu [22:50:14] Robh: drives show up during boot prompts [22:50:56] Robh: what other method can I use to verify? [22:50:58] Change abandoned: Hashar; "package is named python-tox and is not there yet :(" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46452 [22:51:47] sbernardin: thats about it [22:51:54] if the controller sees all 4 disks, you are set =] [22:52:09] OK ....so that one is all set [22:52:12] you could enter bios and see them there (or under the controller bios if it has any) [22:52:12] cool [22:52:22] when you are all done i'll sping them all up at once [22:52:30] spin even [22:55:21] RECOVERY - Solr on solr1002 is OK: HTTP OK: Status line output matched 200 - 3115 bytes in 0.002 second response time [22:55:34] New patchset: RobH; "added user ram per rt4387, corrected space vs tab in chris s account stanza" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46454 [22:56:42] man look at all that corrected tab from spacebar spacing [22:56:45] its so pretty! [22:56:52] RECOVERY - Solr on solr1 is OK: HTTP OK: Status line output matched 200 - 3115 bytes in 0.056 second response time [22:57:08] mark should be proud to have passed his neurosis about this to me. [22:58:36] New review: RobH; "if it wasn't for self review, i wouldn't have no review at all" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46454 [22:58:37] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46454 [22:59:53] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [23:01:41] RECOVERY - Solr on solr1003 is OK: HTTP OK: Status line output matched 200 - 3116 bytes in 0.062 second response time [23:03:03] AaronSchulz: I know how mail works [23:03:11] ;-] [23:03:51] PROBLEM - Solr on solr1 is CRITICAL: The command defined for service Solr does not exist [23:04:02] TimStarling: hm? [23:04:11] exim::simple-mail-sender is in the "standard" class in puppet, so it's on all servers [23:04:27] oh, I though you were being sarcastic or something [23:04:30] LeslieCarr: icinga does not know how to monitor Solr on solr1 ^^^^^ [23:04:31] *thought [23:04:33] it installs exim with configuration from exim4.minimal.erb [23:04:40] MaxSem: you there? [23:04:49] notpeter, yep [23:05:02] i think that I have fixed solr monitoring [23:05:02] which has some special configuration for MW: [23:05:06] condition = ${if eqi{$header_X-Mailer:}{MediaWiki mailer}} [23:05:07] is there something we can break to test it? [23:05:10] route_list = * <%= exim_mediawiki_route_list %> [23:05:32] notpeter, awesome! sudo service jetty stop [23:05:41] what box should i do that on? [23:05:47] or, what dc is currently active? [23:05:54] exim_mediawiki_route_list is smtp.pmtpa.wmnet (from realm.pp) [23:06:47] any but the master (solr1001).
both DCs are used in MW config [23:07:32] MaxSem: ok, going to try flipping off solr1002 for a sec [23:07:55] so MW will just send to localhost:25 via mail()/sendmail [23:08:07] right [23:08:25] because it's not configured to use PEAR, as you noticed [23:08:41] RECOVERY - Solr on solr3 is OK: HTTP OK: Status line output matched 200 - 3115 bytes in 0.056 second response time [23:08:55] MaxSem: nope. wtf!?!?!? [23:09:13] will keep trying. this makes no sense [23:09:41] HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.055 seconds [23:10:01] I got it to say "connection refused OK" [23:10:08] New patchset: RobH; "giving spetrea access to analytics machines per rt4402" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46458 [23:10:17] ie: I have no fucking idea what I'm talking about OK! [23:10:21] one result is that mchenry can't have the "standard" class [23:10:22] notpeter: thats ok, neither do i. [23:11:15] New review: RobH; "self review, not to be confused with self help" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46458 [23:11:16] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46458 [23:11:23] RobH: no no, I know what i'm talking about [23:11:29] nagios seems to be... confused [23:11:33] about what's ok and what's not [23:11:39] it doesn't understand boundaries [23:12:07] no means no nagios! [23:15:16] New patchset: RobH; "i need to read my changeset more closely" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46459 [23:16:40] New review: RobH; "spelling bee champion" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46459 [23:16:41] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46459 [23:20:06] New patchset: RobH; "i am not smart" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46460 [23:20:26] * AaronSchulz watches RobH have fun [23:20:52] New review: RobH; "dont look at me, so ashamed" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46460 [23:20:55] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46460 [23:21:04] the clipboard is your friend! ;) [23:21:16] yea, i typed it all [23:21:16] and learned my lesson. [23:21:35] if this isnt an advertisement against self review, i dunno what is. [23:21:43] RobH: ottomata: so i found one more duplicate webserver definition and removed that..yet it STILL throws the same error :/ [23:22:05] hm [23:22:35] if it makes you feel better, i just had no less than three patches to do the job of one. [23:22:38] cuz i merged prematurely [23:22:39] the weird thing about review in puppet is that you have to deploy things before you can test them [23:22:54] usually when I write code, I like to test it before I let other people look at it [23:22:54] totally! [23:23:09] i test locally for more generic things, like modules [23:23:19] PROBLEM - Solr on solr3 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:23:20] PROBLEM - Solr on solr1001 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:23:20] PROBLEM - Solr on vanadium is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:23:20] but no way I'm testing for user accounts, etc. 
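An aside on the exim snippets quoted just above (the eqi condition on X-Mailer and the route_list pointing at smtp.pmtpa.wmnet): the router effectively does a case-insensitive header match and relays matching mail to the MediaWiki smarthost, while MediaWiki itself just hands mail to localhost:25 via mail()/sendmail. A rough sketch of that routing decision in Python, purely illustrative; the real logic lives in exim's router configuration, not in code like this:

    # Sketch of what exim4.minimal.erb's MediaWiki router condition does.
    # "eqi" is exim's case-insensitive string equality test.
    MEDIAWIKI_SMARTHOST = "smtp.pmtpa.wmnet"  # exim_mediawiki_route_list, per realm.pp

    def pick_route(headers: dict) -> str:
        if headers.get("X-Mailer", "").lower() == "mediawiki mailer":
            return MEDIAWIKI_SMARTHOST   # MW-generated mail goes to the smarthost
        return "local default"           # everything else follows the normal path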
[23:23:44] New patchset: Lcarr; "fixing iptables on neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46461 [23:23:46] PROBLEM - Solr on solr1 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:23:46] PROBLEM - Solr on solr2 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:24:05] ottomata: wait.. but you did not switch to puppetmaster::self or something and it just does not pick up our changes now, heh? :) [23:24:13] PROBLEM - Solr on solr1003 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:24:17] re: testing locally [23:24:41] PROBLEM - Solr on solr1002 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:24:42] I asked how to test things in #puppet [23:25:03] New patchset: Pyoungmeister; "solr monitoring: must escape url for check to actually work..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46462 [23:25:10] they said, sure, if you have a server which is identical to the production server in every way: same files, same processes running, etc., then you could deploy to it first to test [23:25:21] New patchset: Lcarr; "fixing iptables on neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46461 [23:25:41] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46462 [23:25:44] binasher, could you weigh in on https://bugzilla.wikimedia.org/show_bug.cgi?id=44126 please? [23:26:28] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45272 [23:26:52] heya TimStarling, do you know anything about udp2log UDP packets containing about 3 log lines each? [23:26:55] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46461 [23:27:02] TimStarling: the theoretical solution would be to test in labs using this i guess https://labsconsole.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [23:27:37] ottomata: do you mean UDP packets sent by the squid patch? [23:27:38] that way you dont have to merge until its right [23:27:40] or by varnish? [23:27:52] maybe both? I was examining varnish packets [23:28:04] i just didn't realize this was the case for a while, and it was throwing me off [23:28:17] buuut, ja just curious, what was the reason for that? just to get fewer packets on the network? [23:29:16] ottomata: yes. [23:29:27] MaxSem: will do [23:29:34] thanks! [23:29:53] it reduces CPU usage on the receiver [23:30:04] oh [23:30:22] is it more for the receiver's benefit than the network then? [23:30:25] and probably gives efficiency gains in every other part of the system too, but the receiver is especially limited [23:30:48] aye, does the patch always do 3 packets, or does it just examine size of the lines and try to add more in if it thinks they will fit? [23:30:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45564 [23:31:22] RECOVERY - Solr on solr2 is OK: HTTP OK: Status line output matched 200 - 3117 bytes in 0.161 second response time [23:31:37] the latter [23:31:42] RECOVERY - Solr on solr1 is OK: HTTP OK: Status line output matched 200 - 3117 bytes in 0.173 second response time [23:31:43] RECOVERY - Solr on solr1001 is OK: HTTP OK: Status line output matched 200 - 3116 bytes in 0.062 second response time [23:31:55] not sure why you think it is always 3 [23:32:06] maybe the log lines just happen to be that long?
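The two Solr-monitoring patches in this stretch ("order of options matters" and "must escape url for check to actually work") point at a classic pitfall: when a check command line is interpreted by a shell, an unescaped '?', '&' or '*' in the Solr query URL mangles the command, which is consistent with the return-code-127 noise above. A generic demonstration, not the actual Wikimedia check definition; check_http with -H/-p/-u is the standard Nagios plugin, and the URL here is hypothetical:

    import subprocess

    url = "/solr/select?q=*:*&rows=0"  # hypothetical Solr health-check URL

    # Handed to /bin/sh unquoted, the '&' splits the line and backgrounds the
    # first half, so the plugin never sees the full URL ('echo' stands in for
    # check_http so the sketch is runnable anywhere):
    subprocess.run("echo check_http -H localhost -p 8983 -u " + url, shell=True)

    # Quoted, or passed as an argument vector, the URL survives intact:
    subprocess.run(["echo", "check_http", "-H", "localhost", "-p", "8983", "-u", url])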
[23:32:07] oh I don't think it is always 3 [23:32:09] yeah [23:32:16] just the few I looked at were 3 [23:32:20] the squid patch implements this logic directly [23:32:21] there is a fixed buffer slightly smaller than the maximum size of a udp packet [23:32:24] the lines are about 300 or 400 bytes [23:32:39] the varnish patch uses fdopen() and setvbuf() to achieve the same result [23:32:39] if a full line can be appended, it is. if not, the buffer is flushed [23:32:40] right, and MTU is around 1200 or 1400 or something, right? [23:32:48] aye cool [23:33:11] the limit in both cases is 1450 bytes [23:33:17] ok [23:34:13] hm, so does udp2log have to examine the content in order to extract each line? or do log lines have the \n already in them, and it just uses them as is? [23:34:15] > SELECT SUBSTRING(job_timestamp,1,8) AS ts,COUNT(*) FROM job WHERE job_cmd='enotifNotify' AND job_attempts=3 GROUP BY ts ORDER BY ts; [23:34:17] stdClass Object [23:34:19] ( [23:34:20] [ts] => 20130122 [23:34:22] [COUNT(*)] => 505 [23:34:23] ) [23:34:25] stdClass Object [23:34:26] ( [23:34:28] [ts] => 20130123 [23:34:30] [COUNT(*)] => 23735 [23:34:31] ) [23:34:33] TimStarling: looks like it was just that time [23:34:44] so it's not still a problem [23:34:58] someone changed a page with a lot of watchers? [23:35:01] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [23:35:02] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [23:35:02] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours [23:35:02] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [23:35:02] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [23:35:02] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [23:35:02] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [23:35:03] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [23:35:24] ottomata: log lines are separated by \n in the packet [23:35:34] * AaronSchulz looks for how that matches with the wmf8 deploy [23:35:49] udp2log splits packets on \n for sampled streams [23:36:04] for unsampled streams, it just passes the packets directly through to the pipe processors using tee() [23:37:37] actually no it doesn't [23:38:01] for unsampled streams it combines multiple packets into larger blocks, for a further reduction in CPU usage [23:40:07] MaxSem: am I right in thinking that logged in users not going thru mobilefrontend do not generate an api.php?action=query&prop=info&inprop=watched request for every individual /wiki/article request? [23:40:10] nope it wasn't wmf8 [23:40:24] ottomata: see Udp2LogConfig::ProcessBlock() [23:40:52] binasher, yer right [23:41:20] MaxSem: my generic answer is that MF shouldn't do it if the rest of the site doesn't :) [23:41:23] binasher, basically, I proposed to do it as desktop skins do [23:41:24] i am confused as shit [23:41:27] damn you analytics1001 [23:41:29] do what i want [23:41:42] TimStarling: I don't see anything in SAL around 2013012223 that's interesting [23:41:46] analytics1001 info: Applying configuration version '1359416382' [23:42:22] I wonder if I should recycle all those jobs back into the queue before they expire [23:42:29] ok, thanks TimStarling!
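Tim's description above pins down the sender-side batching: a fixed buffer just under the UDP payload limit (1450 bytes in both the squid and varnish patches); a whole newline-terminated line is appended if it fits, otherwise the buffer is flushed first. With the 300-400 byte lines mentioned, that naturally yields three or four lines per datagram, which matches what ottomata observed. A rough Python rendering of that logic, not the actual patch code:

    import socket

    BUF_LIMIT = 1450  # payload cap used by both patches, per the discussion above

    class LineBatcher:
        """Pack whole newline-terminated log lines into one UDP datagram per flush."""
        def __init__(self, host, port):
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            self.dest = (host, port)
            self.buf = b""

        def log(self, line: bytes) -> None:
            line += b"\n"                    # udp2log splits on \n downstream
            if len(self.buf) + len(line) > BUF_LIMIT:
                self.flush()                 # next line would not fit: flush first
            self.buf += line

        def flush(self) -> None:
            if self.buf:
                self.sock.sendto(self.buf, self.dest)
                self.buf = b""

On the receiving side, per the same exchange, udp2log splits on \n for sampled streams and coalesces packets into larger blocks for unsampled ones (see Udp2LogConfig::ProcessBlock()).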
[23:42:32] Unixaccount[Stefan Petrea]/User[spetrea]/uid: change from 2548 to 612 failed: Could not set uid on user[spetrea]: Execution of '/usr/sbin/usermod -u 612 spetrea' returned 6: usermod: user 'spetrea' does not exist in /etc/passwd [23:42:32] MaxSem: yeah, i agree with that general approach, even if there isn't a mobile equivalent to skins [23:42:46] ohhhh [23:42:47] i know what that is [23:42:50] ldap [23:42:53] well, there's a SkinMobile [23:43:00] mutante, RobH [23:43:16] you just can't choose it in preferences [23:43:23] these nodes use LDAP, but also /etc/passwd, which I think is real weird [23:43:31] they were originally using /etc/passwd [23:43:36] the ldap is only for hadoop/hue auth [23:43:44] for groups [23:43:47] 1) spetrea used to have UID 2548. puppet is trying to fix the UID to 612 [23:43:49] then again people would get 5 day old emails, heh [23:44:01] 2) but it can't find the user in /etc/passwd because it isnt local [23:44:01] ottomata: lets chat in here cuz leslie is confused too [23:44:02] TimStarling: I guess I should just leave them then? [23:44:03] 3) fail [23:44:08] ok cool [23:44:21] yeah, i have a todo to fix something very related [23:44:23] but not this exactly [23:44:25] binasher, will you leave a comment in BZ or should I just copy your response? [23:44:35] So what is the solution to get him on these machines? [23:44:43] but I guess I haven't needed to add any shell accounts to these nodes since we enabled ldap [23:44:44] (gotta love the fvassard red herring) [23:44:46] i'm not sure [23:44:49] ha, yeah [23:44:51] MaxSem: i'll still comment on the ticket [23:45:11] uid=2548(spetrea) gid=500(wikidev) groups=50062(project-bastion),50090(project-analytics),500(wikidev) [23:45:22] I think I need to talk to Ryan_Lane to know for sure, I talked with him once about deleting all local shell accounts and just using ldap [23:45:26] ottomata: last admins.pp user added to these nodes seems to be dandreescu:x [23:45:29] buuut, we don't want to allow just anyone to sign into these machines [23:45:39] yeah, and that was a while ago [23:45:45] the ldap stuff turned on maybe a month ago? [23:45:48] about that i think [23:46:01] well, not just anyone could log in [23:46:06] only if you add an ssh key for them [23:46:09] oh right duh [23:46:10] use the puppet UID in LDAP if its not used? [23:46:21] AaronSchulz: it's a bit concerning, why would those jobs fail? [23:46:27] yea cuz im not about to start tossing crazy odd uid into puppet [23:46:27] we could go another step further and require addition to a local group and enable pam_security [23:46:31] but we'd have to add the key using puppet, and LDAP accounts are not managed by puppet [23:46:31] right? [23:46:35] whatever the email problem was it stopped around 20130123155907 [23:46:47] yep. [23:46:48] it was definitely just email jobs? [23:47:04] cant use Labs LDAP and puppet to handle users at the same time [23:47:10] Ryan_Lane, are the analytics nodes the only ones that use LDAP aside from labs? [23:47:36] ottomata: no.
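The usermod failure pasted at the top of this exchange is exactly the split being described: name-service lookups (what id uses, and what produced the uid=2548 line above) resolve the user through LDAP, while usermod edits /etc/passwd, where no such entry exists. A small diagnostic sketch of that mismatch, standard library only; the username is the one from the log:

    import pwd

    def where_is(user: str) -> None:
        try:
            ent = pwd.getpwnam(user)        # resolves via NSS, so LDAP counts
            print(f"NSS sees {user} with uid {ent.pw_uid}")
        except KeyError:
            print(f"NSS does not know {user}")
        with open("/etc/passwd") as f:      # the file usermod actually edits
            local = any(line.split(":")[0] == user for line in f)
        print(f"local /etc/passwd entry: {local}")

    # where_is("spetrea")  # on these nodes: NSS yes (uid 2548), local no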
formey is as well [23:47:41] and it's been a pain in the ass on there as well [23:47:45] aye [23:48:05] it's possible to install ssh keys separately from user accounts [23:48:09] fwiw, the idea here was to allow people who had signed the data nda access to data in hadoop and hue (hadoop web UI) without giving them shell accounts [23:48:12] well, not the way we have it setup, I guess [23:48:33] ideally we'd be able to separate the account management and ssh keys [23:48:43] yeah [23:48:43] but [23:48:51] will things be weird if he has a different uid here vs elsewhere? [23:48:53] i guess not [23:49:13] mutante, RobH, feel free to assign that ticket to me, and I will take care of it tomorrow/very soon [23:49:15] no. we have mismatched uids and gids all over the place [23:49:15] TimStarling: only those jobs failed en masse, refreshLinks and a few TMH jobs fail a small portion of the time [23:49:17] ok cool [23:49:30] those failed jobs are spread out and do not fit that neat date range [23:49:35] i'm happy to figure out how to make puppet deal with this case better [23:49:39] it's not ideal if you use tar or rsync in a way that keeps uid/gid [23:49:42] mainly just installing ssh key without managing user [23:50:20] so Ryan_Lane, if /home/spetrea/.ssh/authorized_keys existed [23:50:22] with his key in it [23:50:24] he'd be able to log in? [23:50:26] a few fail almost every day, whereas email jobs failed heavily during that small range and not at all otherwise [23:50:27] yes [23:50:30] right [23:50:35] add him on analytics1001 manually .. just so that puppet can fix it [23:50:41] adduser, run puppet [23:50:44] ok, I can figure out a good way to puppetize that [23:50:47] not adduser [23:50:52] just add the ssh key [23:50:57] New patchset: Hashar; "pylint on contint server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46466 [23:50:58] puppet is trying to run adduser too [23:50:59] AaronSchulz: but if mail() returned false, the job wouldn't fail, would it? [23:51:12] ottomata: thats not the issue though [23:51:16] no? [23:51:32] New review: Hashar; "installed on gallium, feel free to merge on sight." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/46466 [23:51:40] ottomata: the issue is "change from 2548 to 612 failed:" because "usermod: user 'spetrea' does not exist in /etc/passwd" [23:51:45] TimStarling: it would not "fail" in the system true [23:51:51] so add him to /etc/passwd somehow... then let puppet fix the UID [23:51:55] ack() is always called regardless of what run() returns [23:52:04] errrrrrrrr i guess so [23:52:10] but a fatal error would have done it [23:52:10] yeah hmmm [23:52:11] "failing" really means "runner dying" or "losing contact with DB" [23:52:11] actually yeah [23:52:20] that would work, because I already have a shell account there and puppet is fine with it [23:52:25] Ryan_Lane: are you working on Sartoris today at all? [23:52:30] preilly: no.
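One of the workarounds floated above (Ryan's "just add the ssh key", so login works without letting account management get in the way) boils down to writing ~/.ssh/authorized_keys with the ownership and modes sshd insists on. A hedged sketch; the home path, uid and gid echo the log above (uid=2548, gid=500 wikidev) but are placeholders here:

    import os

    def install_key(home: str, uid: int, gid: int, pubkey: str) -> None:
        """Drop an authorized_keys file without touching account management."""
        ssh_dir = os.path.join(home, ".ssh")
        os.makedirs(ssh_dir, exist_ok=True)
        os.chmod(ssh_dir, 0o700)            # sshd rejects group/world-writable dirs
        path = os.path.join(ssh_dir, "authorized_keys")
        with open(path, "w") as f:
            f.write(pubkey.rstrip() + "\n")
        os.chmod(path, 0o600)
        for p in (ssh_dir, path):
            os.chown(p, uid, gid)           # sshd requires the user to own these

    # install_key("/home/spetrea", 2548, 500, "ssh-rsa AAAA... spetrea")  # illustrative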
labs [23:52:34] yeah, an exception or fatal or process getting killed [23:52:38] okok, you guys got it, RobH, feel free to assign to me if you want me to figure it out [23:52:40] Ryan_Lane: okay [23:52:47] i gotta run real soon, guests coming over [23:52:50] thanks all [23:52:51] and amazingly, I'm seeing the exact same memcache bug in labs that I'm seeing in production [23:52:58] it's definitely not related to bad hardware [23:53:02] * AaronSchulz looks at the log archives [23:56:13] ottomata: ok [23:57:16] ottomata: https://rt.wikimedia.org/Ticket/Display.html?id=4402 is all yers [23:57:26] his user is added and such, so once its fixed he'll get access [23:57:41] hashar: merged https://gerrit.wikimedia.org/r/#/c/46463/ [23:57:46] hokay, thanks [23:58:42] preilly: thaanks :) [23:59:22] hashar: as-is https://gerrit.wikimedia.org/r/#/c/46464/ [23:59:27] TimStarling: I'm not seeing any wave of fatals or exceptions on the 22nd, 23rd, nor 24th in the logs [23:59:51] not many affect runJobs.php and the ones that did were for refreshLinks