[00:08:01] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Mon Jan 28 00:07:59 UTC 2013
[00:08:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[00:09:01] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Mon Jan 28 00:08:51 UTC 2013
[00:09:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[00:14:30] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[00:14:59] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[00:17:20] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[00:17:49] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 201 seconds
[00:17:59] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 216 seconds
[00:18:24] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds
[00:19:49] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds
[00:21:59] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 19 seconds
[00:22:01] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 16 seconds
[00:22:19] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100%
[00:25:09] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100%
[00:27:30] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[00:31:00] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms
[00:50:58] i was pinged a while ago
[00:51:09] oh that was ori-l
[00:51:19] yeah rt duty only really counts on weekdays
[00:51:29] right now i'm online due to "juniper duty"
[00:52:27] argh
[00:52:28] still?
[01:04:40] still
[01:04:50] i did go out on a bike ride
[01:05:25] they are certain it is a bug in the mx80 - not properly sending pim joins
[01:05:34] which makes sense that new groups work, until they get pruned
[01:19:58] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[01:44:04] I don't know what 'pim joins' are, but they sound cool.
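For readers in the same position: PIM (Protocol Independent Multicast) joins are the control messages a multicast router sends upstream to keep a group's traffic flowing. If the MX80 stops refreshing them, established groups get pruned and go silent while freshly joined groups still work for a while — exactly the symptom described above. A sketch of how one might confirm this from the JunOS CLI (the group address is illustrative, not taken from the log):

    show pim neighbors                   # are PIM adjacencies up at all?
    show pim join extensive 239.0.0.1    # is join state present and being refreshed?
    show igmp group                      # are receivers still asking for the group?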
[02:29:56] !log LocalisationUpdate completed (1.21wmf8) at Mon Jan 28 02:29:55 UTC 2013
[02:30:17] Logged the message, Master
[02:47:09] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds
[02:47:49] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 192 seconds
[02:47:59] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 191 seconds
[02:52:54] !log LocalisationUpdate completed (1.21wmf7) at Mon Jan 28 02:52:53 UTC 2013
[02:53:06] Logged the message, Master
[02:57:59] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 26 seconds
[02:58:09] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 29 seconds
[02:58:10] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 27 seconds
[02:58:59] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[03:02:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[03:03:05] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:49] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:49] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:49] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[03:31:49] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:49] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:50] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:50] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[03:31:51] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[04:07:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Mon Jan 28 04:07:44 UTC 2013
[04:08:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[04:08:36] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Mon Jan 28 04:08:34 UTC 2013
[04:09:36] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[04:11:41] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[04:25:51] PROBLEM - swift-container-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:25:52] PROBLEM - swift-account-reaper on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:25:52] PROBLEM - swift-container-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:01] PROBLEM - swift-object-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:02] PROBLEM - swift-object-auditor on ms-be11 is CRITICAL: Timeout while attempting connection
[04:26:02] PROBLEM - swift-container-replicator on ms-be11 is CRITICAL: Timeout while attempting connection
[04:26:02] PROBLEM - swift-account-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:21] PROBLEM - swift-object-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:22] PROBLEM - swift-account-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:31] PROBLEM - swift-container-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
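A note on the two db53 checks above: "MySQL Slave Delay" reads the slave's own lag estimate, while "MySQL Replication Heartbeat" compares a timestamp row periodically written on the master (pt-heartbeat style), which stays honest even when the slave's own estimate does not. A minimal sketch of both measurements — the hostname, credentials and the heartbeat table name are assumptions, not the production check definitions:

    # the slave's self-reported lag
    mysql -h db53 -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master

    # heartbeat-style lag: replicated master timestamp vs. current time
    mysql -h db53 -e 'SELECT TIMESTAMPDIFF(SECOND, ts, UTC_TIMESTAMP()) AS lag FROM heartbeat.heartbeat'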
[04:26:32] PROBLEM - swift-account-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:41] PROBLEM - swift-object-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:28] PROBLEM - swift-account-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:28] PROBLEM - swift-container-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:29] PROBLEM - swift-object-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:56] PROBLEM - swift-container-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:56] PROBLEM - swift-object-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:04] PROBLEM - swift-object-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:13] PROBLEM - swift-container-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:14] PROBLEM - swift-account-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:22] PROBLEM - swift-object-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:31] PROBLEM - swift-account-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:40] PROBLEM - swift-container-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:58] PROBLEM - swift-account-reaper on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:33:17] fixed!!!!
[04:38:31] RECOVERY - swift-container-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[04:38:32] RECOVERY - swift-account-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[04:38:32] RECOVERY - swift-object-updater on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[04:38:32] RECOVERY - swift-account-reaper on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[04:38:41] RECOVERY - swift-container-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[04:38:41] RECOVERY - swift-object-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[04:38:42] RECOVERY - swift-container-server on ms-be11 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[04:38:42] RECOVERY - swift-account-reaper on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[04:38:42] RECOVERY - swift-container-updater on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[04:38:52] RECOVERY - swift-object-server on ms-be11 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[04:38:52] RECOVERY - swift-account-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[04:38:52] RECOVERY - swift-object-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[04:38:52] RECOVERY - swift-container-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[04:38:59] RECOVERY - swift-container-server on ms-be11 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[04:38:59] RECOVERY - swift-account-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[04:39:00] RECOVERY - swift-object-server on ms-be11 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[04:39:11] RECOVERY - swift-object-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[04:39:11] RECOVERY - swift-account-server on ms-be11 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[04:39:26] RECOVERY - swift-account-server on ms-be11 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[04:39:26] RECOVERY - swift-container-updater on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[04:39:26] RECOVERY - swift-object-updater on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[04:39:53] RECOVERY - swift-container-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[04:39:54] RECOVERY - swift-account-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[04:39:54] RECOVERY - swift-object-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[04:49:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds
[04:50:01] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 185 seconds
[04:50:14] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 190 seconds
[04:51:17] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 205 seconds
[05:04:38] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds
[05:04:58] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 189 seconds
[05:05:14] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 189 seconds
[05:06:08] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 184 seconds
[05:42:36] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 201 seconds
[05:52:37] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[05:52:46] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[05:53:14] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[05:54:17] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[06:19:29] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours
[06:21:35] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours
[06:24:35] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours
[06:25:29] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours
[06:25:29] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours
[06:26:32] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours
[06:27:35] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours
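The PROCS lines above are NRPE running the stock check_procs plugin with a regex over process arguments; when NRPE itself times out (the CHECK_NRPE lines), every such check on the host goes CRITICAL at once, which is why the whole block flaps and then recovers together after the "fixed!!!!". One of those checks plausibly looks like this on the host — the plugin path and threshold are assumptions, not the production definition:

    /usr/lib/nagios/plugins/check_procs -c 1: \
        --ereg-argument-array='^/usr/bin/python /usr/bin/swift-object-replicator'
    # output matches the log: "PROCS OK: 1 process with regex args ^/usr/bin/python ..."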
[06:38:43] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours
[06:41:39] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours
[06:41:40] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours
[06:45:53] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours
[07:23:32] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[07:28:47] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[08:52:06] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[08:54:07] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[08:55:06] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.027 second response time on port 11000
[08:55:55] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.001 second response time on port 11000
[10:15:50] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[10:17:25] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[10:19:46] New patchset: Silke Meyer; "Create and set $wgCacheDir for Wikidata repo or client" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46220
[10:28:57] New patchset: Silke Meyer; "Removing Moodbar extension from Wikidata setup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46221
[11:20:35] New patchset: Mark Bergsma; "Add cp3010 as upload varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46228
[11:22:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46228
[11:39:37] PROBLEM - Varnish HTTP upload-frontend on cp3010 is CRITICAL: Connection refused
[11:49:37] RECOVERY - Varnish HTTP upload-frontend on cp3010 is OK: HTTP OK: HTTP/1.1 200 OK - 641 bytes in 0.185 second response time
[12:13:07] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.19784284672 (gt 8.0)
[12:37:34] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.0
[12:51:04] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 12.9331811511 (gt 8.0)
[12:55:46] New patchset: Silke Meyer; "Fixed a copy and paste error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46234
[12:59:21] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[12:59:50] New patchset: Mark Bergsma; "Add cp3007-cp3009" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46235
[13:00:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46235
[13:00:52] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100%
[13:01:31] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 92.11 ms
[13:02:50] New patchset: Silke Meyer; "Added variable to Wikidata client to enable other SiteIDS than "enwiki"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46236
[13:23:11] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100%
[13:24:32] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 92.14 ms
[13:26:41] PROBLEM - Varnish HTTP upload-frontend on cp3010 is CRITICAL: Connection refused
[13:26:51] PROBLEM - Varnish traffic logger on cp3010 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[13:27:47] PROBLEM - Varnish traffic logger on cp3010 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[13:28:59] PROBLEM - Varnish HTTP upload-frontend on cp3010 is CRITICAL: Connection refused
[13:33:20] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[13:33:20] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[13:33:20] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[13:33:20] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[13:33:21] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours
[13:33:21] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[13:33:21] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[13:33:22] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[13:34:32] New patchset: Mark Bergsma; "Proof of concept for hashing thumbs to original's hash key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29805
[13:35:50] RECOVERY - Varnish traffic logger on cp3010 is OK: PROCS OK: 3 processes with command name varnishncsa
[13:36:11] RECOVERY - Varnish HTTP upload-frontend on cp3010 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.234 seconds
[13:36:40] RECOVERY - Varnish HTTP upload-frontend on cp3010 is OK: HTTP OK: HTTP/1.1 200 OK - 641 bytes in 0.185 second response time
[13:36:47] RECOVERY - Varnish traffic logger on cp3010 is OK: PROCS OK: 3 processes with command name varnishncsa
[13:37:34] New review: Amire80; "Can probably be combined with I3a43ad74." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/46234
[13:38:43] New review: Amire80; "Looks sane, but I am not familiar with Puppet. Can probably be combined with I5c62a7f7." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/46236
[13:43:52] interesting. did https handling somehow change with the eqiad migration?
[13:44:09] YuviPanda: what do you mean ? :-]
[13:44:13] any weird behavior?
[13:44:19] hashar: well, 'different' behavior
[13:44:24] (or I think)
[13:44:28] https/1.1 for https
[13:44:30] err
[13:44:32] 1.1 for https
[13:44:34] 1.0 for http
[13:45:12] hashar: end difference being, transfer encoding chunked works on https
[13:45:15] and not on http
[13:45:23] maybe for the keep alive ?
[13:45:50] AFAIK the https traffic is sent to a nginx reverse proxy
[13:45:55] maybe something changed there
[13:46:00] or it always add 1/1
[13:46:08] or it always had HTTP 1/1
[13:46:27] hashar: yeah, i'm investigating
[13:46:39] it might be a false alarm, since you were able to upload without problems yesterday
[13:47:19] mark, ^^
[13:47:32] New patchset: Mark Bergsma; "Proof of concept for hashing thumbs to original's hash key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29805
[13:47:55] hashar: plus, if it is still nginx -> squid -> apache for https, transfer-encoding: chunked should *not* work, since squid (or at least the version we used to run) did not support that
[13:48:10] that part of the infrastructure has not changed at all during the eqiad migration
[13:48:36] mark: so, we support full http/1.1 on https, and http/1.0 on http?
[13:49:04] i don't think we support full http/1.1 anywhere
[13:49:20] curl -I https://en.wikipedia.org says http/1.1
[13:50:37] what I'm saying is, even though some of our services may announce http 1.1, I don't think the full http 1.1 spec/featureset is supported anywhere
[13:50:56] hmm, alright. so nothing's changed.
[13:50:58] * YuviPanda looks elsewhere
[13:51:01] thanks mark
[13:51:12] no, you're still talking to the exact same servers as last month
[13:51:45] oh? https was always in eqiad?
[13:51:55] for about a year already
[13:52:15] http too
[13:52:23] * YuviPanda trouts self
[13:52:30] alright, the bug is elsewhere then!
[13:52:47] however, *weirdly*, Transfer-Encoding: Chunked actually *does* work now
[13:53:00] I can swear that it didn't when I first tried it, 2-3 months ago
[13:53:11] i believe ryan upgraded the nginx ssl servers to a newer version about a month ago
[13:53:18] or a bit more
[13:53:22] aaah
[13:53:42] so it is possible that they now buffer it before sending it off to the squids
[13:54:44] yes
[13:54:50] chunked is part of http/1.1
[13:55:04] nginx before the upgrade supported proxying only with http/1.0
[13:55:19] this was one of the reasons ryan upgraded those
[13:56:26] paravoid: wheee, awesome.
[13:56:28] that explains it
[13:56:35] so it's been on from before the move.
[13:56:39] either way, good :)
[13:57:01] He had told me about this, but for some reason I thought it was not enabled everywhere
[13:57:19] hmm, but when I last talked to him it might indeed not have been enabled everywhere (more than a month ago)
[14:05:00] ApiSandbox, o_0
[14:07:09] I think pipelining is not enabled yet
[14:07:15] but chunked might just be, who knows
[14:17:39] paravoid: it does seem to work (for me)
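To unpack the conclusion: chunked transfer encoding is an HTTP/1.1 feature, and nginx before the upgrade could only proxy to its upstream with HTTP/1.0, so chunked request bodies through the SSL terminators had nowhere to go until the newer nginx started buffering them. A sketch of how to probe this from outside, plus the generic nginx knob involved (the upload URL is a made-up example, and the config line is the standard nginx directive, not WMF's actual config):

    # what does the front end announce?
    curl -sI https://en.wikipedia.org/ | head -1        # HTTP/1.1 200 OK

    # does a chunked request body make it through?
    curl -s -o /dev/null -w '%{http_code}\n' \
        -H 'Transfer-Encoding: chunked' -T payload.bin \
        https://commons.wikimedia.org/w/api.php

    # on the proxy side, nginx >= 1.1.4 can speak HTTP/1.1 to its backend:
    #     proxy_http_version 1.1;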
[14:28:35] New patchset: Hashar; "(bug 44424) wikiversions.cdb for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46240
[14:29:24] New patchset: Mark Bergsma; "Put cp3009/cp3010 in a standard frontend/backend configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46241
[14:32:11] why isn't gerrit merging
[14:32:39] that change seems to depend on https://gerrit.wikimedia.org/r/#/c/29805/4
[14:32:46] which is not merged
[14:32:51] doh
[14:32:57] i'm an idiot
[14:33:08] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46241
[14:34:54] New patchset: Mark Bergsma; "Put cp3009/cp3010 in a standard frontend/backend configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46242
[14:35:07] <^demon> mark: You can just rebase it and amend, rather than abandoning :)
[14:35:24] i realize that
[14:35:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46242
[14:37:31] paravoid: would we be able to write the README.md file for the wikimedia module ? :-]
[14:38:14] New patchset: Mark Bergsma; "I really am an idiot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46244
[14:38:51] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46244
[14:43:23] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100%
[14:44:00] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:56] New review: Anomie; "The full path needs to be passed to getRealmSpecificFilename(). As written, this will test if wikive..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/46240
[14:50:18] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 118.57 ms
[14:50:30] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 92.28 ms
[14:52:51] PROBLEM - Varnish traffic logger on cp3010 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[14:52:51] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:53:01] PROBLEM - Varnish HTTP upload-frontend on cp3010 is CRITICAL: Connection refused
[14:53:01] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[14:53:21] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[14:53:44] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:54:38] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[14:54:51] RECOVERY - Varnish traffic logger on cp3010 is OK: PROCS OK: 3 processes with command name varnishncsa
[14:55:00] RECOVERY - Varnish HTTP upload-frontend on cp3010 is OK: HTTP OK: HTTP/1.1 200 OK - 643 bytes in 0.182 second response time
[14:55:14] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[14:57:12] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.0
[14:57:20] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 92.30 ms
[14:57:29] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 118.88 ms
[14:59:31] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100%
[15:01:20] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 92.12 ms
[15:01:23] PROBLEM - Varnish HTTP upload-frontend on cp3009 is CRITICAL: Connection refused
[15:03:02] RECOVERY - Varnish HTTP upload-frontend on cp3009 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.237 seconds
[15:03:20] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:03:21] PROBLEM - Varnish traffic logger on cp3009 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[15:04:05] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:05:08] PROBLEM - Varnish traffic logger on cp3009 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[15:05:59] !log Pooled cp3009 and cp3010 in the esams upload pool
[15:06:13] Logged the message, Master
[15:15:56] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:16:05] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 20.2831671111 (gt 8.0)
[15:16:07] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:18:07] emery again
[15:18:07] sigh
[15:19:27] [18446744025.112439] BUG: soft lockup - CPU#4 stuck for 17163091968s! [perl:21796]
[15:19:32] that's interesting...
[15:20:06] yes, i am looking at it with ottomata
[15:20:13] it's not clear to me what is causing
[15:20:23] this, because nothing has changed in the recent weeks
[15:20:28] more traffic
[15:20:50] but then also oxygen and locke would have the same issues, and it started on sunday
[15:20:52] are we sure it's firing on all cylinders? i'm not entirely sure what the soft lockup means
[15:21:14] no not necessarily
[15:21:44] it could be some kernel bug (we've seen it before) on a certain type of hardware under quite peculiar load conditions
[15:21:52] ok
[15:22:02] i will ask ottomata to stop the sqstat script first
[15:22:06] [18446744025.112439] BUG: soft lockup - CPU#4 stuck for 17163091968s! [perl:21796]
[15:22:09] 15:22:03 up 209 days, 17:32, 5 users, load average: 2.81, 2.86, 2.87
[15:22:13] typical 208 days bug
[15:22:14] needs a reboot
[15:22:34] what is 208 days bug?
[15:22:41] and an upgrade to precise
[15:22:50] it's a kernel bug on older versions of 2.6.32
[15:22:57] ok
[15:23:05] that fires off after 208 days of uptime
[15:23:09] and messes with the scheduler
[15:23:09] i will talk with ottomata about upgrading to precise
[15:23:12] and then weird things happen
[15:23:17] RECOVERY - Varnish traffic logger on cp3009 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:23:19] apparently :)
[15:23:37] there's a thread from last year in ops@
[15:24:05] paravoid: do the stock kernel updates fix the bug?
[15:24:06] ok
[15:24:10] yes
[15:24:15] RECOVERY - Varnish traffic logger on cp3009 is OK: PROCS OK: 3 processes with command name varnishncsa
[15:24:19] running an upgrade to latest 2.6.32 now
[15:24:21] k
[15:24:40] drdee: can I reboot?
[15:25:03] yes
[15:25:23] cool
[15:25:30] will do after upgrade finishes
[15:25:43] please arrange a precise upgrade nevertheless
[15:25:47] ok
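The "208 days" figure is not folklore: the commonly cited explanation for this 2.6.32-era bug is that sched_clock()'s fixed-point nanosecond arithmetic overflows a 64-bit value after 2^54 ns, after which the scheduler and the soft-lockup watchdog see garbage — hence the absurd "stuck for 17163091968s" above. The arithmetic as a quick sanity check (the 2^54 constant is the figure usually quoted, not re-derived from kernel source here):

    awk 'BEGIN { printf "%.1f days\n", 2^54 / 1e9 / 86400 }'   # -> 208.5 days
    # which lines up with the host: "up 209 days" when the lockups began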
[15:26:10] 1 question i have is how to determine whether a box is fully puppetized, would you have to make an image of the box, then run the puppet on a different box and make image as well and compare them or something like that?
[15:27:20] that would work, although we never do that ;)
[15:27:44] because i think that both emery and locke still contain unpuppetized things
[15:29:11] !log upgraded & rebooting emery, 208 days bug
[15:29:22] Logged the message, Master
[15:29:28] ty paravoid
[15:29:50] New review: Hashar; "Mailed Ryan about it." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45108
[15:31:46] can I get a few changes merged for gallium please? I will take care of running puppet on the host. https://gerrit.wikimedia.org/r/#/c/44974/ install pyflakes a python linter and https://gerrit.wikimedia.org/r/#/c/46188/ which updates the frontage at http://integration.mediawiki.org/ ;-D
[15:32:07] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[15:32:18] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[15:33:47] PROBLEM - SSH on emery is CRITICAL: Connection refused
[15:33:51] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[15:34:07] PROBLEM - udp2log log age for aft on emery is CRITICAL: Connection refused by host
[15:34:08] PROBLEM - udp2log log age for emery on emery is CRITICAL: Connection refused by host
[15:35:12] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[15:36:15] PROBLEM - SSH on emery is CRITICAL: Connection refused
[15:36:51] PROBLEM - udp2log log age for aft on emery is CRITICAL: Connection refused by host
[15:37:27] PROBLEM - udp2log log age for emery on emery is CRITICAL: Connection refused by host
[15:38:00] powercycling emery
[15:38:33] with the console open
[15:38:48] PROBLEM - Host kaulen is DOWN: CRITICAL - Host Unreachable (208.80.152.149)
[15:39:44] thanks paravoid for babysitting emery ;)
[15:39:48] np
[15:39:51] RECOVERY - Host kaulen is UP: PING OK - Packet loss = 0%, RTA = 0.14 ms
[15:39:56] fsck check it seems, although I did press C before
[15:40:43] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44974
[15:40:58] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46188
[15:42:09] hashar: both merged
[15:42:40] holy crap, my gerrit queue is huge
[15:43:06] paravoid: thhhannkkkkkss :-]
[15:43:08] New review: Faidon; "Thanks." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/45657
[15:43:09] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45657
[15:45:14] New review: Hashar; "Works on https://integration.mediawiki.org/ :-] Thanks!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46188
[15:45:19] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46234
[15:45:33] Silke_WMDE: the second one needs a rebase
[15:46:16] j^: around?
[15:50:53] paravoid: ssh to emery is still borked :(
[15:51:13] it was fscking
[15:51:30] should be ok now
[15:51:30] oh sorry
[15:51:33] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active
[15:51:37] yup
[15:51:41] 307 days without being checked kinda does that :)
[15:51:42] ty!
[15:52:00] RECOVERY - SSH on emery is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[15:52:07] RECOVERY - udp2log log age for aft on emery is OK: OK: all log files active
[15:52:07] RECOVERY - SSH on emery is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[15:52:08] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active
[15:52:41] New review: Faidon; "Needs a manual rebase." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46236
[15:52:45] RECOVERY - udp2log log age for aft on emery is OK: OK: all log files active
[15:54:02] paravoid: and we are back at 25% packet loss at emery, could you kill the sqstat script on emery and see if that helps
[15:55:01] New patchset: Silke Meyer; "Added variable to Wikidata client to enable other SiteIDS than "enwiki"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46236
[15:56:23] is ottomata not available?
[15:56:33] ottomata is here!
[15:56:36] hey
[15:56:38] it looks better now
[15:56:46] but still, could you monitor it a bit and help drdee?
[15:56:53] I can help if it's something you can't handle
[15:56:57] yeah, i'm helping him, i was just in a levelup chat with chad on it
[15:57:03] for the last hour
[15:57:09] sorry
[15:57:10] New patchset: Silke Meyer; "Added variable to Wikidata client to enable other SiteIDS than "enwiki"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46236
[15:57:11] not 'on it'
[15:57:12] but ja
[15:57:18] no worries
[15:57:26] did you reboot emery?
[15:57:33] I did
[15:57:33] aye cool
[15:57:38] it looks much better now
[15:57:42] paravoid: rebase is done
[15:57:51] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.0
[15:58:19] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46236
[15:58:41] paravoid: THANKS! :)
[15:59:21] Silke_WMDE: done :) for future reference, #wikimedia-labs is the right place for labs-related support, questions and puppet commits and here (#wikimedia-operations) for general puppet questions etc.
[15:59:32] ok
[16:04:18] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa
[16:05:12] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa
[16:08:21] grr salt messages in dmesg are so annoying
[16:08:38] every 30'
[16:16:36] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[16:17:17] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[16:18:39] New patchset: MaxSem; "Postgres module for OSM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36155
[16:21:06] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours
[16:23:03] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours
[16:24:13] paravoid: whats up?
[16:24:41] j^: hey, do you have any news regarding your cgroups patches?
[16:24:53] I replied to the bug report a few minutes ago asking for Tim's input
[16:25:03] I remember he was also doing something related
[16:25:54] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours
[16:26:07] paravoid: i rebased it after tim added the hard timeout not sure about the open question about labs
[16:27:06] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours
[16:27:07] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours
[16:27:53] but would rather try to get swift or whats used now setup in labs
[16:28:09] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours
[16:29:12] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours
[16:36:09] omg the cronspam these days...
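Background on the cgroups exchange above: the patches under discussion confine MediaWiki's shelled-out media jobs (ffmpeg and friends) so a runaway transcode cannot take a scaler down. With the cgroup v1 interface of that era, the core trick is to drop the shell into a memory-capped group before exec'ing the job, so everything it spawns inherits the cap. A minimal sketch — the mount point, group name, and 1 GiB limit are illustrative, not the actual patch:

    mkdir -p /sys/fs/cgroup/memory/transcode-job
    echo $((1024 * 1024 * 1024)) > /sys/fs/cgroup/memory/transcode-job/memory.limit_in_bytes
    echo $$ > /sys/fs/cgroup/memory/transcode-job/tasks   # move this shell into the group
    exec ffmpeg -i input.webm output.ogv                  # the child inherits the memory cap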
[16:40:09] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours
[16:43:09] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours
[16:43:09] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours
[16:47:12] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours
[17:46:17] !Log shutting down ms-be11 for H/W replacement
[17:47:28] PROBLEM - MySQL Replication Heartbeat on db1009 is CRITICAL: CRIT replication delay 203 seconds
[17:48:28] RECOVERY - MySQL Replication Heartbeat on db1009 is OK: OK replication delay 0 seconds
[18:00:40] !log shutting down ms-be11 for h/w replacement
[18:00:54] Logged the message, Master
[18:08:53] !log upgrading ffmpeg/libav for USN-1705-1/USN-1706-1 on all image/video scalers in eqiad/pmtpa
[18:09:06] Logged the message, Master
[18:24:55] PROBLEM - Host ms-be11 is DOWN: PING CRITICAL - Packet loss = 100%
[18:25:40] PROBLEM - Host ms-be11 is DOWN: PING CRITICAL - Packet loss = 100%
[18:25:52] ^demon: I just created a project in gerrit but clicked the wrong box and it's on the top level rather than under Operations/Debs like I intended. Can you delete it so that I can start again? (Or, tell me how to delete it and/or move it?)
[18:26:30] <^demon> There's a command to change the parent :)
[18:26:38] <^demon> What's the repo?
[18:26:43] New patchset: Reedy; "Disable Mostcategories in $wgDisableQueryPageUpdate for frwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46262
[18:27:00] ^demon: rt-authen-externalauth
[18:27:26] <^demon> Ah, it was mis-named, not wrong parent.
[18:27:39] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46262
[18:28:43] ^demon: Does that mean that in the web interface I need to specify the parent and also prepend /the/parent/name to the project name?
[18:28:56] <^demon> Yep.
[18:29:32] <^demon> Parent is independent of project name. You could make foo/bar/baz inherit from omg/wtf/lol
[18:30:45] hm, ok.
[18:31:11] <^demon> We use the foo/bar/baz inherit from foo/bar out of convention (and because it makes sense, imho).
[18:31:17] <^demon> :)
[18:31:23] Anyway, thanks for moving! Looks right now.
[18:31:41] !log reedy synchronized wmf-config/InitialiseSettings.php
[18:31:45] <^demon> yw.
[18:31:52] Logged the message, Master
[18:34:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45108
[18:42:52] ^demon: externalauth is packaged in >= wheezy >= quantal
[18:42:53] er
[18:42:58] andrewbogott: externalauth is packaged in >= wheezy >= quantal
[18:43:06] andrewbogott: rt4 might be nice to have too
[18:43:08] <^demon> I was like um wtf?
[18:43:15] sorry :)
[18:44:12] paravoid: I just now had to sing the alphabet a couple of times… aren't both of those in the future?
[18:45:27] we regularly backport stuff
[18:45:45] oh btw, I've independently packaged externalauth years ago
[18:45:52] for rt 3.4? something like that
[18:45:54] it wasn't fun at all
[18:46:07] RT4 is also much *much* nicer UI wise
[18:46:24] expands a bit the scope of your endeavour
[18:46:29] but I think it'll make your job easier
[18:46:35] so it might be win-win
[18:46:49] paravoid: OK… mutante and I discussed doing ldap and upgrading to 4 later, but maybe that's silly and we should do it the other way 'round
[18:47:21] yep
[18:47:33] rt4 is also perl, so a straight import of the packages (no rebuild) is probably possible
[18:48:16] paravoid, do you happen to know if the upgrade path from 3.8 to 4 is smooth or rocky?
[18:48:53] https://github.com/bestpractical/rt/blob/stable/docs/UPGRADING-4.0
[18:49:10] I remember it being like a 3.6->3.8 upgrade
[18:49:16] but I don't remember that much tbh :)
[18:49:22] andrewbogott: hi?
[18:49:42] mutante, paravoid is advocating that we upgrade to RT 4 before doing anything else… because it'll make our life easier going forward.
[18:49:54] I know we'd talked about doing it the other way 'round but I don't think we had a good reason :)
[18:50:00] yes, because of http://packages.debian.org/sid/rt4-extension-authenexternalauth
[18:50:10] or rather http://packages.ubuntu.com/rt4-extension-authenexternalauth
[18:50:19] New patchset: Reedy; "Reduce duplication in $wgContentNamespaces" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46276
[18:50:31] paravoid: I just wanted to point out to you that https://github.com/wikimedia/sartoris is starting to take better shape
[18:50:49] ah, that extension looks interesting for us indeed
[18:50:50] paravoid: if you have any spare cycles and would like to review it at all… that would be great
[18:50:53] preilly: oh thanks. sorry, I didn't manager to have a look yet
[18:50:58] seeing the "alternate SSO" part and stuff
[18:51:02] s/manager/manage/
[18:51:13] mutante: that's the LDAP one
[18:51:16] mutante: That's the extension I'm planning to use, it's just that with 3.8 I would have to package it myself
[18:51:18] paravoid: no worries
[18:51:34] andrewbogott: paravoid : i agree we did not have a reason which order to do it in
[18:51:51] andrewbogott: yeah, *don't*, been there done that and it took me days to package it
[18:51:51] we merely talked about both, hooking up to LDAP and upgrading
[18:52:02] and I know my way around packaging better than you I'd guess :)
[18:52:25] yea, it totally makes sense to not try and package it ourselves then
[18:52:29] paravoid: Very likely!
[18:52:46] mutante, want to take over my RT labs box and do a practice upgrade?
[18:53:01] (Although that one is already polluted with extauth I guess...)
[18:53:24] (Or I don't mind doing the upgrade myself; either way.)
[18:53:35] yes, please add me as a sysadmin to your project. i just don't think i will get much done in this week, since i will be travelling
[18:53:46] 'k
[18:54:11] suggesting a fresh instance but in the same project then
[18:54:56] mutante: I was actually working in the 'openstack' project just because it already had the firewall rules I needed :) Lemme see about making a fresh project and instance
[18:55:18] ah, ok, yes, i think we should just make an RT project then
[18:55:34] cool
[18:55:55] extauth should be easy if you have packages btw
[19:02:53] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46276
[19:05:05] !log reedy synchronized wmf-config/InitialiseSettings.php
[19:05:18] Logged the message, Master
[19:09:12] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.21wmf8
[19:09:22] Logged the message, Master
[19:11:37] New patchset: Reedy; "enwiki to 1.21wmf8" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46278
[19:11:53] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46278
[19:18:54] New patchset: Reedy; "$wgFilterRobotsWL was removed in 1.7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46279
[19:19:14] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46279
[19:24:16] New patchset: Reedy; "Remove some exact duplicate default configs of wgNamespacesToBeSearchedDefault" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46280
[19:24:44] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46280
[19:25:20] !log reedy synchronized wmf-config/InitialiseSettings.php
[19:25:32] Logged the message, Master
[19:27:16] binasher: can you add job-pop-duplicate and job-insert-duplicate stats to gdash?
[19:27:37] AaronSchulz: sure
[19:27:48] will do later today
[19:27:53] * AaronSchulz wishes he had access to change graphs on that box
[19:28:25] * AaronSchulz still wants to split LockManager graphs out from https://gdash.wikimedia.org/dashboards/filebackend/
[19:28:28] AaronSchulz: i'll finish puppetizing the gdash graph definitions
[19:28:47] so you can change em in the future with just an ops merge
[19:29:00] should lockmanager get its own set of graphs?
[19:29:10] under FileBackendStore yes
[19:29:18] they are mixed with streamfile now, which is not related
[19:29:42] New patchset: Reedy; "Fixup whitespace for defaults in $wgNamespacesWithSubpages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46282
[19:30:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46282
[19:32:27] New patchset: Reedy; "Removed $wgUseCategoryBrowser as unused/same as default" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46283
[19:32:40] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46283
[19:33:17] RobH: how stoney are the dell 420's?
[19:34:32] binasher: lulz
[19:36:19] hashar: can you look at https://gerrit.wikimedia.org/r/#/c/14068/ ?
[19:36:33] is that abandoned?
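On "we regularly backport stuff": since rt4-extension-authenexternalauth already exists in wheezy/quantal, the job is an import or rebuild rather than packaging from scratch. The usual motions look roughly like this — it assumes a deb-src entry for the newer distro and is not a record of what was actually run; for a pure-perl package, as paravoid notes, even the rebuild may be unnecessary:

    apt-get source rt4-extension-authenexternalauth   # fetch the newer source package
    cd rt4-extension-authenexternalauth-*
    dpkg-buildpackage -us -uc -b                      # rebuild against the local toolchain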
[19:38:35] New patchset: Reedy; "Remove duplication in $wgNamespaceProtection" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46284
[19:39:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46284
[19:40:06] AaronSchulz: hi
[19:40:41] AaronSchulz: I discover that last comment
[19:40:50] AaronSchulz: feel free to fix it up in what ever way you want :-]
[19:41:01] AaronSchulz: or we can abandon it
[19:41:07] and fix it later :-]
[19:41:23] amending too much breaks authorship
[19:56:21] RobH, can you close https://rt.wikimedia.org/Ticket/Display.html?id=4210 ?
[20:09:00] fyi, I've cleaned up http://wikitech.wikimedia.org/view/Multicast_HTCP_purging ...please fix if I've gotten anything wrong
[20:17:23] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[20:17:40] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[20:21:10] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[20:24:11] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[20:27:00] Could someone please clean up all the stupid logs in fluorine:/a/mw-log ?
[20:27:07] -rw-r--r-- 1 udp2log udp2log 0 Jan 28 06:53 >userCan('read',.log
[20:27:07] -rw-r--r-- 1 udp2log udp2log 0 Jan 28 06:53 ->userCan('read',.log
[20:27:07] -rw-r--r-- 1 udp2log udp2log 0 Jan 28 06:53 userCan('read',.log
[20:27:08] etc etc
[20:39:06] Reedy: would you like us to remove files in there ?
[20:39:32] Yeah, there's quite a lot of bad ones, but a few are legit
[20:39:46] Anything size 0 is good to go
[20:40:10] eww
[20:41:19] Reedy: cleaned the size 0's
[20:41:23] yay find
[20:41:23] !log taking down osm-web1 for h/w installation
[20:41:26] such a nice command
[20:41:33] Logged the message, Master
[20:42:06] Thanks!
[20:42:19] !log cleared a lot of 0 size badly named logs on fluorine:/a/mw-log
[20:42:30] Logged the message, Mistress of the network gear.
[21:02:45] PROBLEM - check_mysql on payments1001 is CRITICAL: Access denied for user jgreen@localhost (using password: YES)
[21:05:18] RECOVERY - check_mysql on payments1001 is OK: Uptime: 5463390 Threads: 4 Questions: 19851457 Slow queries: 547 Opens: 445 Flush tables: 1 Open tables: 64 Queries per second avg: 3.633 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[21:26:23] !log pgehres synchronized php-1.21wmf8/extensions/LandingCheck 'Updating LandingCheck to master'
[21:26:35] Logged the message, Master
[21:34:36] I don't see the patch there
[21:35:21] wrong channel
[21:35:34] New patchset: RobH; "granting shell access to stat1001 to ryan faulkner per rt 4258" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46296
[21:35:35] rfaulkner: im adding your access to stat1001 now
[21:35:49] RobH: woohoo!
[21:37:08] RobH: thanks. I'll need to tail apache logs, will my access include permissions that enable me to do that?
[21:37:32] uhhh, you guys didnt ask for that in ticket ;p
[21:37:38] heh, i'll check.
[21:37:47] if those logs are accessible by the other non roots
[21:37:48] then yes
[21:38:09] ezachte, diederik, dsc, dandreecu users have identical permissions
[21:38:21] rfaulkner: when i finish pushing change, you can try it out and see ;]
[21:38:24] and if not we fix
[21:38:34] RobH: sounds good to me. thanks.
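The fluorine cleanup ("yay find", "such a nice command") was almost certainly a one-liner along these lines; the exact flags are an assumption, but -size 0 is the standard idiom for exactly this job:

    # review, then delete, the zero-byte files in the udp2log output directory
    find /a/mw-log -maxdepth 1 -type f -size 0
    find /a/mw-log -maxdepth 1 -type f -size 0 -delete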
[21:39:04] rfaulkner: i would check it right now because i have a feeling that you won't be able to do that
[21:39:34] drdee: is that something the rest of you do already or nah?
[21:39:44] cuz if not, then we prolly have to hack some puppet changes to set file permissions
[21:39:53] (we being any of us can do it and I can roll it live ;)
[21:40:04] none of us do that
[21:40:06] New review: RobH; "make it so" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46296
[21:40:07] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46296
[21:40:09] heh, then yea
[21:40:23] rfaulkner: i bet it wont work, the permissions wont be set, checking shortly
[21:40:28] one ticket at a time, heh
[21:41:13] ok. no rush. thanks for the heads up. just tried to logon - still no perm
[21:41:22] RobH: :)
[21:41:24] yea, im running on stat1001 right now
[21:41:29] puppet is slow
[21:41:39] ........
[21:41:43] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Apache_site[000_default] is already defined in file /var/lib/git/operations/puppet/manifests/webserver.pp at line 320; cannot redefine at /var/lib/git/operations/puppet/manifests/webserver.pp:320 on node stat1001.wikimedia.org
[21:41:57] to err is human
[21:42:03] puppet is broken on stat1001
[21:42:07] my change didnt do that ;]
[21:43:09] checking!
[21:43:14] oh
[21:43:15] yes
[21:43:16] i know about that
[21:43:22] not mine either
[21:43:26] I commented on an RT that did it
[21:43:28] one sec
[21:43:50] i think i know what this is related to
[21:44:00] see how it defines 000_default
[21:44:28] recently we added new stuff to site.pp to disable the default page on a couple hosts
[21:44:47] yeah
[21:45:00] and it was also in webserver.pp it seems..hence the duplicate
[21:45:37] hmm
[21:45:54] apache_site { 000_default: name => "000-default", ensure => absent }
[21:46:01] this must be elsewhere too
[21:46:26] the duplicate error references both places to be the exact same
[21:46:28] which is fubar.
[21:46:50] mutante: any idea where you added that?
[21:47:03] and which version should exist?
[21:47:05] Leslie did.. hold on
[21:47:12] :q
[21:47:15] rfaulkner: See, its all LeslieCarr's fault!
[21:47:26] someone heat up the tar and find the feathers!
[21:49:47] hey
[21:49:48] is that an idiomatic RobH ? :D
[21:50:15] like "it is raining cats and dogs"
[21:50:33] oh man
[21:50:33] tar
[21:51:05] RobH: haha
[21:51:37] include webserver::apache
[21:51:37] webserver::apache::module { "rewrite": require => Class["webserver::apache"] }
[21:51:40] webserver::apache::site { $site_name:
[21:51:44] http://www.bdenvrac.com/images/lucky/villeimage/poker.gif !!
[21:51:45] misc/statistics.pp
[21:54:33] rfaulkner: leslie helped us figure it out, shes submitting a change shortly
[21:54:39] i'll ping ya when you should try to login again
[21:56:12] great! and thanks again RobH for being so informative
[21:56:43] New patchset: Lcarr; "removing webserver from being called multiple times and moving it to be called once" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46359
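The error above is Puppet's standard duplicate-resource failure: two manifests each declared Apache_site['000_default'], and stat1001's catalog ended up with both. Before deduplicating into a single place, finding every declaration is a grep away — a sketch, run from a checkout of operations/puppet:

    grep -rn '000_default' manifests/ modules/     # every declaration of the colliding title
    puppet parser validate manifests/webserver.pp  # syntax-check the edit before pushing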
[21:56:49] ^demon: So how hard is it to give someone +2 and who the hell actually rolls that change?
[21:57:01] <^demon> To what?
[21:57:06] <^demon> (Not hard, generally)
[21:57:10] New patchset: Mwalker; "Enable CentralNotice/Translate on Test" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46365
[21:57:16] operations/mediawiki-config
[21:57:27] if its something i can do, and you can teach me, that would be awesome =]
[21:57:31] looks good to me
[21:57:55] New review: Dzahn; "yep, avoids duplicate webserver definition" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46359
[21:57:55] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46359
[21:57:59] <^demon> RobH: This group https://gerrit.wikimedia.org/r/#/admin/groups/21,members
[21:58:34] ^demon: ok, added her
[21:58:36] thats it?
[21:58:39] cuz thats too easy.
[21:59:01] <^demon> I haz good acls ;-)
[21:59:12] this is obviously this easy due to countless hours of development by whoever deployed our gerrit.
[21:59:13] ;]
[21:59:21] ^demon: thx!
[21:59:26] New patchset: Mwalker; "Enable CentralNotice/Translate on Test" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46365
[21:59:34] <^demon> yw
[22:01:12] New patchset: Mwalker; "Enable CentralNotice/Translate on Test" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46365
[22:01:46] RobH: two weeks ago we have been talking about a possible equivalent of hume in EQIAD. Does it mean anything to you ?
[22:01:59] i recall the conversation yep!
[22:02:04] New patchset: RobH; "added sumanah to deployment access on cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46406
[22:02:09] and its on my list of 'stuff that needs a home in eqiad'
[22:02:22] RobH: I think faidon said we might have a machine already
[22:02:36] RobH: anyway, if you ever find out, let Rob Lanphier know about it please :-]
[22:02:37] already assigned or already deployed?
[22:02:48] ah
[22:02:49] cuz well, i had no less than 5 machines as 'dev deployment hosts' in eqiad
[22:02:50] good question
[22:02:56] most of which werent even powered up
[22:03:00] ohh
[22:03:07] but thx for info, ill make sure to touch base with him
[22:03:12] so I guess if we need a hume equivalent one can pick one from that pool can't we ?
[22:03:29] i think hume replacement has to be a bit heftier
[22:03:33] as its going to be the scripting host
[22:03:47] so im prolly putting it on a new high performance misc server (just arrived last week)
[22:03:52] as we want the memory overhead
[22:03:53] niceeee
[22:04:01] we might even phase out hume in favor of that new host so
[22:04:06] indeed
[22:04:16] i want this to be better than hume, give you all reason to leave it
[22:04:25] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45707
[22:04:51] RobH: nice!!! thanks :-]
[22:05:05] RobH: it is nowhere urgent nor critical, I was just wondering.
[22:06:34] no problem, never hurts to ask
[22:06:40] New patchset: RobH; "added sumanah to deployment access on cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46406
[22:08:01] yeah I learned two things when joining the project: please ask and be bold
[22:08:04] :-D
[22:08:12] damn you gerritbot
[22:08:16] run my parsing tests!
[22:08:54] bleh... spacing bad.
[22:09:08] New patchset: RobH; "added sumanah to deployment access on cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46406
[22:14:52] New review: RobH; "once more with feeling" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46406
[22:14:53] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46406
[22:18:54] robla1: So sumanah now has deployment access. There is a question if her labs and production keys are the same, so i put the info in the ticket
[22:19:04] but you may wanna touch base with her and ensure she isnt using the same key for both.
[22:19:14] otherwise, she is all set.
[22:19:54] paravoid: still around ? if not - wanted to ask about dbeacon - i think it could be good for interdatacenter connectivity issues/monitoring, putting >= 1 beacon in each physical location -- is it fairly lightweight? if so, i was wondering if we could put it in ganglia and have the ganglia listeners for each group be beacons
[22:20:03] paravoid: or would that create too big of a mesh ?
[22:20:18] !log pgehres Started syncing Wikimedia installation... : Dark deploying Translate integration with CentralNotice
[22:20:28] Logged the message, Master
[22:22:44] New patchset: Pyoungmeister; "add -e option and an expected of 200 to solr checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46444
[22:23:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46444
[22:23:46] heading home to get my car towed to the shop… wee!
[22:28:51] New patchset: Dzahn; "remove one more webserver::apache inclusion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46445
[22:31:06] PROBLEM - Solr on solr1003 is CRITICAL: The command defined for service Solr does not exist
[22:31:06] PROBLEM - Solr on solr1001 is CRITICAL: The command defined for service Solr does not exist
[22:31:06] PROBLEM - Solr on solr3 is CRITICAL: The command defined for service Solr does not exist
[22:31:15] PROBLEM - Solr on vanadium is CRITICAL: The command defined for service Solr does not exist
[22:31:16] PROBLEM - Solr on solr1002 is CRITICAL: The command defined for service Solr does not exist
[22:31:26] PROBLEM - Solr on solr2 is CRITICAL: The command defined for service Solr does not exist
[22:31:26] PROBLEM - Solr on solr1 is CRITICAL: The command defined for service Solr does not exist
[22:34:03] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46445
[22:34:32] cmjohnson1: You about?
[22:34:39] yep
[22:34:43] I am now confused on what is happening on the OSM servers that sbernardin1 is working on
[22:34:51] lets discuss in here cuz its too hard to keep the three of us in sync in PM
[22:35:05] whats up?
[22:35:13] RECOVERY - Solr on solr1002 is OK: HTTP CRITICAL - Invalid HTTP response received from host on port 8983: HTTP/1.1 200 OK
[22:35:13] RECOVERY - Solr on vanadium is OK: HTTP CRITICAL - Invalid HTTP response received from host on port 8983: HTTP/1.1 200 OK
[22:35:22] RECOVERY - Solr on solr1001 is OK: HTTP CRITICAL - Invalid HTTP response received from host on port 8983: HTTP/1.1 200 OK
[22:35:22] ohhh
[22:35:25] i see what you did in ticket
[22:35:29] !log pgehres Finished syncing Wikimedia installation... : Dark deploying Translate integration with CentralNotice
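Two of the Solr messages above are worth decoding: "The command defined for service Solr does not exist" means the Icinga command definition itself was broken while the change rolled out, and the self-contradictory "OK: HTTP CRITICAL ..." recoveries show a check whose exit status and output text disagreed until the -e/expected-200 change (and the follow-up "order of options matters" fix below) landed. The intended check is plain check_http; a plausible corrected invocation, with the plugin path and Solr URL as assumptions:

    /usr/lib/nagios/plugins/check_http -H solr1 -p 8983 -u /solr/ -e 200
    # matches the healthy output: "HTTP OK: Status line output matched 200 - ..."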
: Dark deploying Translate integration with CentralNotice [22:35:32] RECOVERY - Solr on solr2 is OK: HTTP CRITICAL - Invalid HTTP response received from host on port 8983: HTTP/1.1 200 OK [22:35:38] you renamed a db2 to cp2 to keep them split up in differing racks [22:35:39] Logged the message, Master [22:35:44] sbernardin1: ok, i see what he did [22:35:53] RECOVERY - Solr on solr3 is OK: HTTP CRITICAL - Invalid HTTP response received from host on port 8983: HTTP/1.1 200 OK [22:36:12] RECOVERY - Solr on solr1003 is OK: HTTP CRITICAL - Invalid HTTP response received from host on port 8983: HTTP/1.1 200 OK [22:36:35] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46365 [22:37:02] sbernardin1: sorry for confusion, so once they are relabeled and controller's in let me know [22:37:08] Robh: So am I OK so far? [22:37:22] i didnt realize chris relabeled some of them to keep them in differing racks [22:37:26] i re-read the ticket, you are ok so far =] [22:37:32] Robh: The first one is done... [22:37:41] sbernardin1: So you can boot and it can see the disks? [22:37:46] !log jenkins: applied pep8 and pyflakes jobs to sartoris repository. [22:37:55] RECOVERY - Solr on solr1 is OK: HTTP OK: Status line output matched 200 - 3115 bytes in 0.056 second response time [22:37:56] Logged the message, Master [22:38:37] Robh: It's booted up already [22:39:55] PROBLEM - Solr on solr2 is CRITICAL: The command defined for service Solr does not exist [22:39:56] PROBLEM - Solr on solr3 is CRITICAL: The command defined for service Solr does not exist [22:40:05] PROBLEM - Solr on solr1001 is CRITICAL: The command defined for service Solr does not exist [22:40:06] PROBLEM - Solr on solr1003 is CRITICAL: The command defined for service Solr does not exist [22:40:06] PROBLEM - Solr on vanadium is CRITICAL: The command defined for service Solr does not exist [22:40:15] PROBLEM - Solr on solr1002 is CRITICAL: The command defined for service Solr does not exist [22:40:37] New patchset: Pyoungmeister; "solr monitoring: order of options matters..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46451 [22:40:45] !log pgehres synchronized wmf-config/CommonSettings.php 'Enabling CN:Translate on testwiki' [22:40:55] Logged the message, Master [22:40:59] sbernardin1: right, my question is does the new controller see all 4 disks? [22:41:12] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46451 [22:41:13] i can login and check, but i want you to be aware of how to confirm that before you hand it off =] [22:41:39] if you dont know, i can walk you through how, no worries [22:43:18] notpeter: I have a question about search, so I am asking you, but if you dunno or whatever just lemme know. ticket https://rt.wikimedia.org/Ticket/Display.html?id=4387 is to give a new user shell access to the search boxen [22:43:25] so he can read the search logs [22:43:36] sure, sounds fine [22:43:40] so where are these search logs located? [22:43:46] cuz i need to check the permission set on them [22:43:55] ticket isnt very detailed ;] [22:43:56] /a/search/log/ [22:44:23] so is there a drawback to giving shell users on that host the search group [22:44:25] or is that bad? [22:44:26] they're 644 [22:44:31] oh, nm [22:44:33] i see that now [22:44:38] huzzah, easy ticket! [22:44:40] thx [22:44:45] yeah, just add his user [22:44:49] should be fine [22:45:29] once i find a free uid.
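The permission check above (the logs under /a/search/log/ are mode 644, so a plain shell account can read them and no extra group membership is needed) is easy to verify mechanically. A minimal Python sketch; the file path is illustrative, and the 644 figure comes from the conversation itself:

    import os
    import stat

    # Mode 644 sets the other-read bit, so any shell account can read the
    # file without belonging to the owning group.
    def others_can_read(path: str) -> bool:
        return bool(os.stat(path).st_mode & stat.S_IROTH)

    # e.g. others_can_read("/a/search/log/some.log")  # hypothetical path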
[22:45:30] =P [22:48:55] PROBLEM - Solr on solr1 is CRITICAL: The command defined for service Solr does not exist [22:49:40] New patchset: Hashar; "tox package on contint server (gallium)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46452 [22:49:52] grmbmblblbl [22:50:07] screw you ubuntu [22:50:14] Robh: drives show up during boot prompts [22:50:56] Robh: what other method can I use to verify? [22:50:58] Change abandoned: Hashar; "package is named python-tox and is not there yet :(" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46452 [22:51:47] sbernardin: thats about it [22:51:54] if the controller sees all 4 disks, you are set =] [22:52:09] OK ....so that one is all set [22:52:12] you could enter bios and see them there (or under the controller bios if it has any) [22:52:12] cool [22:52:22] when you are all done i'll sping them all up at once [22:52:30] spin even [22:55:21] RECOVERY - Solr on solr1002 is OK: HTTP OK: Status line output matched 200 - 3115 bytes in 0.002 second response time [22:55:34] New patchset: RobH; "added user ram per rt4387, corrected space vs tab in chris s account stanza" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46454 [22:56:42] man look at all that corrected tab from spacebar spacing [22:56:45] its so pretty! [22:56:52] RECOVERY - Solr on solr1 is OK: HTTP OK: Status line output matched 200 - 3115 bytes in 0.056 second response time [22:57:08] mark should be proud to have passed his neurosis about this to me. [22:58:36] New review: RobH; "if it wasn't for self review, i wouldn't have no review at all" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46454 [22:58:37] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46454 [22:59:53] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [23:01:41] RECOVERY - Solr on solr1003 is OK: HTTP OK: Status line output matched 200 - 3116 bytes in 0.062 second response time [23:03:03] AaronSchulz: I know how mail works [23:03:11] ;-] [23:03:51] PROBLEM - Solr on solr1 is CRITICAL: The command defined for service Solr does not exist [23:04:02] TimStarling: hm? [23:04:11] exim::simple-mail-sender is in the "standard" class in puppet, so it's on all servers [23:04:27] oh, I though you were being sarcastic or something [23:04:30] LeslieCarr: icinga does not know how to monitor Solr on solr1 ^^^^^ [23:04:31] *thought [23:04:33] it installs exim with configuration from exim4.minimal.erb [23:04:40] MaxSem: you there? [23:04:49] notpeter, yep [23:05:02] i think that I have fixed solr monitoring [23:05:02] which has some special configuration for MW: [23:05:06] condition = ${if eqi{$header_X-Mailer:}{MediaWiki mailer}} [23:05:07] is there something we can break to test it? [23:05:10] route_list = * <%= exim_mediawiki_route_list %> [23:05:32] notpeter, awesome! sudo service jetty stop [23:05:41] what box should i do that on? [23:05:47] or, what dc is currently active? [23:05:54] exim_mediawiki_route_list is smtp.pmtpa.wmnet (from realm.pp) [23:06:47] any but the master (solr1001).
both DCs are used in MW config [23:07:32] MaxSem: ok, going to try flipping off solr1002 for a sec [23:07:55] so MW will just send to localhost:25 via mail()/sendmail [23:08:07] right [23:08:25] because it's not configured to use PEAR, as you noticed [23:08:41] RECOVERY - Solr on solr3 is OK: HTTP OK: Status line output matched 200 - 3115 bytes in 0.056 second response time [23:08:55] MaxSem: nope. wtf!?!?!? [23:09:13] will keep trying. this makes no sense [23:09:41] HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.055 seconds [23:10:01] I got it to say "connection refused OK" [23:10:08] New patchset: RobH; "giving spetrea access to analytics machines per rt4402" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46458 [23:10:17] ie: I have no fucking idea what I'm talking about OK! [23:10:21] one result is that mchenry can't have the "standard" class [23:10:22] notpeter: thats ok, neither do i. [23:11:15] New review: RobH; "self review, not to be confused with self help" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46458 [23:11:16] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46458 [23:11:23] RobH: no no, I know what i'm talking about [23:11:29] nagios seems to be... confused [23:11:33] about what's ok and what's not [23:11:39] it doesn't understand boundaries [23:12:07] no means no nagios! [23:15:16] New patchset: RobH; "i need to read my changeset more closely" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46459 [23:16:40] New review: RobH; "spelling bee champion" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46459 [23:16:41] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46459 [23:20:06] New patchset: RobH; "i am not smart" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46460 [23:20:26] * AaronSchulz watches RobH have fun [23:20:52] New review: RobH; "dont look at me, so ashamed" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46460 [23:20:55] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46460 [23:21:04] the clipboard is your friend! ;) [23:21:16] yea, i typed it all [23:21:16] and learned my lesson. [23:21:35] if this isnt an advertisement against self review, i dunno what is. [23:21:43] RobH: ottomata: so i found one more duplicate webserver definition and removed that..yet it STILL throws the same error :/ [23:22:05] hm [23:22:35] if it makes you feel better, i just had no less than three patches to do the job of one. [23:22:38] cuz i merged prematurely [23:22:39] the weird thing about review in puppet is that you have to deploy things before you can test them [23:22:54] usually when I write code, I like to test it before I let other people look at it [23:22:54] totally! [23:23:09] i test locally for more generic things, like modules [23:23:19] PROBLEM - Solr on solr3 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:23:20] PROBLEM - Solr on solr1001 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:23:20] PROBLEM - Solr on vanadium is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:23:20] but no way I'm testing for user accounts, etc. 
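An aside on the exim snippets quoted just above (the eqi condition on X-Mailer and the route_list pointing at smtp.pmtpa.wmnet): the router effectively does a case-insensitive header match and relays matching mail to the MediaWiki smarthost, while MediaWiki itself just hands mail to localhost:25 via mail()/sendmail. A rough sketch of that routing decision in Python, purely illustrative; the real logic lives in exim's router configuration, not in code like this:

    # Sketch of what exim4.minimal.erb's MediaWiki router condition does.
    # "eqi" is exim's case-insensitive string equality test.
    MEDIAWIKI_SMARTHOST = "smtp.pmtpa.wmnet"  # exim_mediawiki_route_list, per realm.pp

    def pick_route(headers: dict) -> str:
        if headers.get("X-Mailer", "").lower() == "mediawiki mailer":
            return MEDIAWIKI_SMARTHOST   # MW-generated mail goes to the smarthost
        return "local default"           # everything else follows the normal path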
[23:23:44] New patchset: Lcarr; "fixing iptables on neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46461 [23:23:46] PROBLEM - Solr on solr1 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:23:46] PROBLEM - Solr on solr2 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:24:05] ottomata: wait.. but you did not switch to puppetmaster::self or something and it just does not pick up our changes now, heh? :) [23:24:13] PROBLEM - Solr on solr1003 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:24:17] re: testing locally [23:24:41] PROBLEM - Solr on solr1002 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:24:42] I asked how to test things in #puppet [23:25:03] New patchset: Pyoungmeister; "solr monitoring: must escape url for check to actually work..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46462 [23:25:10] they said, sure, if you have a server which is identical to the production server in every way: same files, same processes running, etc., then you could deploy to it first to test [23:25:21] New patchset: Lcarr; "fixing iptables on neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46461 [23:25:41] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46462 [23:25:44] binasher, could you weigh in on https://bugzilla.wikimedia.org/show_bug.cgi?id=44126 please? [23:26:28] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45272 [23:26:52] heya TimStarling, do you know anything about udp2log UDP packets containing about 3 log lines each? [23:26:55] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46461 [23:27:02] TimStarling: the theoretical solution would be to test in labs using this i guess https://labsconsole.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [23:27:37] ottomata: do you mean UDP packets sent by the squid patch? [23:27:38] that way you dont have to merge until its right [23:27:40] or by varnish? [23:27:52] maybe both? I was examining varnish packets [23:28:04] i just didn't realize this was the case for a while, and it was throwing me off [23:28:17] buuut, ja just curious, what was the reason for that? just to get fewer packets on the network? [23:29:16] ottomata: yes. [23:29:27] MaxSem: will do [23:29:34] thanks! [23:29:53] it reduces CPU usage on the receiver [23:30:04] oh [23:30:22] is it more for the receiver's benefit than the network then? [23:30:25] and probably gives efficiency gains in every other part of the system too, but the receiver is especially limited [23:30:48] aye, does the patch always do 3 packets, or does it just examine size of the lines and try to add more in if it thinks they will fit? [23:30:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45564 [23:31:22] RECOVERY - Solr on solr2 is OK: HTTP OK: Status line output matched 200 - 3117 bytes in 0.161 second response time [23:31:37] the latter [23:31:42] RECOVERY - Solr on solr1 is OK: HTTP OK: Status line output matched 200 - 3117 bytes in 0.173 second response time [23:31:43] RECOVERY - Solr on solr1001 is OK: HTTP OK: Status line output matched 200 - 3116 bytes in 0.062 second response time [23:31:55] not sure why you think it is always 3 [23:32:06] maybe the log lines just happen to be that long?
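The two Solr-monitoring patches in this stretch ("order of options matters" and "must escape url for check to actually work") point at a classic pitfall: when a check command line is interpreted by a shell, an unescaped '?', '&' or '*' in the Solr query URL mangles the command, which is consistent with the return-code-127 noise above. A generic demonstration, not the actual Wikimedia check definition; check_http with -H/-p/-u is the standard Nagios plugin, and the URL here is hypothetical:

    import subprocess

    url = "/solr/select?q=*:*&rows=0"  # hypothetical Solr health-check URL

    # Handed to /bin/sh unquoted, the '&' splits the line and backgrounds the
    # first half, so the plugin never sees the full URL ('echo' stands in for
    # check_http so the sketch is runnable anywhere):
    subprocess.run("echo check_http -H localhost -p 8983 -u " + url, shell=True)

    # Quoted, or passed as an argument vector, the URL survives intact:
    subprocess.run(["echo", "check_http", "-H", "localhost", "-p", "8983", "-u", url])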
[23:32:07] oh I don't think it is always 3 [23:32:09] yeah [23:32:16] just the few I looked at were 3 [23:32:20] the squid patch implements this logic directly [23:32:21] there is a fixed buffer slightly smaller than the maximum size of a udp packet [23:32:24] the lines are about 300 or 400 bytes [23:32:39] the varnish patch uses fdopen() and setvbuf() to achieve the same result [23:32:39] if a full line can be appended, it is. if not, the buffer is flushed [23:32:40] right, and MTU is around 1200 or 1400 or something, right? [23:32:48] aye cool [23:33:11] the limit in both cases is 1450 bytes [23:33:17] ok [23:34:13] hm, so does udp2log have to examine the content in order to extract each line? or do log lines have the \n already in them, and it just uses them as is? [23:34:15] > SELECT SUBSTRING(job_timestamp,1,8) AS ts,COUNT(*) FROM job WHERE job_cmd='enotifNotify' AND job_attempts=3 GROUP BY ts ORDER BY ts; [23:34:17] stdClass Object [23:34:19] ( [23:34:20] [ts] => 20130122 [23:34:22] [COUNT(*)] => 505 [23:34:23] ) [23:34:25] stdClass Object [23:34:26] ( [23:34:28] [ts] => 20130123 [23:34:30] [COUNT(*)] => 23735 [23:34:31] ) [23:34:33] TimStarling: looks like it was just that time [23:34:44] so it's not still a problem [23:34:58] someone changed a page with a lot of watchers? [23:35:01] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [23:35:02] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [23:35:02] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours [23:35:02] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [23:35:02] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [23:35:02] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [23:35:02] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [23:35:03] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [23:35:24] ottomata: log lines are separated by \n in the packet [23:35:34] * AaronSchulz looks for how that matches with the wmf8 deploy [23:35:49] udp2log splits packets on \n for sampled streams [23:36:04] for unsampled streams, it just passes the packets directly through to the pipe processors using tee() [23:37:37] actually no it doesn't [23:38:01] for unsampled streams it combines multiple packets into larger blocks, for a further reduction in CPU usage [23:40:07] MaxSem: am I right in thinking that logged in users not going thru mobilefrontend do not generate an api.php?action=query&prop=info&inprop=watched request for every individual /wiki/article request? [23:40:10] nope it wasn't wmf8 [23:40:24] ottomata: see Udp2LogConfig::ProcessBlock() [23:40:52] binasher, yer right [23:41:20] MaxSem: my generic answer is that MF shouldn't do it if the rest of the site doesn't :) [23:41:23] binasher, basically, I proposed to do it as desktop skins do [23:41:24] i am confused as shit [23:41:27] damn you analytics1001 [23:41:29] do what i want [23:41:42] TimStarling: I don't see anything in SAL around 2013012223 that's interesting [23:41:46] analytics1001 info: Applying configuration version '1359416382' [23:42:22] I wonder if I should recycle all those jobs back into the queue before they expire [23:42:29] ok, thanks TimStarling!
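Tim's description above pins down the sender-side batching: a fixed buffer just under the UDP payload limit (1450 bytes in both the squid and varnish patches); a whole newline-terminated line is appended if it fits, otherwise the buffer is flushed first. With the 300-400 byte lines mentioned, that naturally yields three or four lines per datagram, which matches what ottomata observed. A rough Python rendering of that logic, not the actual patch code:

    import socket

    BUF_LIMIT = 1450  # payload cap used by both patches, per the discussion above

    class LineBatcher:
        """Pack whole newline-terminated log lines into one UDP datagram per flush."""
        def __init__(self, host, port):
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            self.dest = (host, port)
            self.buf = b""

        def log(self, line: bytes) -> None:
            line += b"\n"                    # udp2log splits on \n downstream
            if len(self.buf) + len(line) > BUF_LIMIT:
                self.flush()                 # next line would not fit: flush first
            self.buf += line

        def flush(self) -> None:
            if self.buf:
                self.sock.sendto(self.buf, self.dest)
                self.buf = b""

On the receiving side, per the same exchange, udp2log splits on \n for sampled streams and coalesces packets into larger blocks for unsampled ones (see Udp2LogConfig::ProcessBlock()).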
[23:42:32] Unixaccount[Stefan Petrea]/User[spetrea]/uid: change from 2548 to 612 failed: Could not set uid on user[spetrea]: Execution of '/usr/sbin/usermod -u 612 spetrea' returned 6: usermod: user 'spetrea' does not exist in /etc/passwd [23:42:32] MaxSem: yeah, i agree with that general approach, even if there isn't a mobile equivalent to skins [23:42:46] ohhhh [23:42:47] i know what that is [23:42:50] ldap [23:42:53] well, there's a SkinMobile [23:43:00] mutante, RobH [23:43:16] you just can't choose it in preferences [23:43:23] these nodes use LDAP, but also /etc/passwd, which I think is real weird [23:43:31] they were originally using /etc/passwd [23:43:36] the ldap is only for hadoop/hue auth [23:43:44] for groups [23:43:47] 1) spetrea used to have UID 2548. puppet is trying to fix the UID to 612 [23:43:49] then again people would get 5 day old emails, heh [23:44:01] 2) but it can't find the user in /etc/passwd because it isnt local [23:44:01] ottomata: lets chat in here cuz leslie is confused too [23:44:02] TimStarling: I guess I should just leave them then? [23:44:03] 3) fail [23:44:08] ok cool [23:44:21] yeah, i have a todo to fix something very related [23:44:23] but not this exactly [23:44:25] binasher, will you leave a comment in BZ or should I just copy your response? [23:44:35] So what is the solution to get him on these machines? [23:44:43] but I guess I haven't needed to add any shell accounts to these nodes since we enabled ldap [23:44:44] (gotta love the fvassard red herring) [23:44:46] i'm not sure [23:44:49] ha, yeah [23:44:51] MaxSem: i'll still comment on the ticket [23:45:11] uid=2548(spetrea) gid=500(wikidev) groups=50062(project-bastion),50090(project-analytics),500(wikidev) [23:45:22] I think I need to talk to Ryan_Lane to know for sure, I talked with him once about deleting all local shell accounts and just using ldap [23:45:26] ottomata: last admins.pp user added to these nodes seems to be dandreescu:x [23:45:29] buuut, we don't want to allow just anyone to sign into these machines [23:45:39] yeah, and that was a while ago [23:45:45] the ldap stuff turned on maybe a month ago? [23:45:48] about that i think [23:46:01] well, not just anyone could log in [23:46:06] only if you add an ssh key for them [23:46:09] oh right duh [23:46:10] use the puppet UID in LDAP if its not used? [23:46:21] AaronSchulz: it's a bit concerning, why would those jobs fail? [23:46:27] yea cuz im not about to start tossing crazy odd uid into puppet [23:46:27] we could go another step further and require addition to a local group and enable pam_security [23:46:31] but we'd have to add the key using puppet, and LDAP accounts are not managed by puppet [23:46:31] right? [23:46:35] whatever the email problem was it stopped around 20130123155907 [23:46:47] yep. [23:46:48] it was definitely just email jobs? [23:47:04] cant use Labs LDAP and puppet to handle users at the same time [23:47:10] Ryan_Lane, are the analytics nodes the only ones that use LDAP aside from labs? [23:47:36] ottomata: no.
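The usermod failure pasted at the top of this exchange is exactly the split being described: name-service lookups (what id uses, and what produced the uid=2548 line above) resolve the user through LDAP, while usermod edits /etc/passwd, where no such entry exists. A small diagnostic sketch of that mismatch, standard library only; the username is the one from the log:

    import pwd

    def where_is(user: str) -> None:
        try:
            ent = pwd.getpwnam(user)        # resolves via NSS, so LDAP counts
            print(f"NSS sees {user} with uid {ent.pw_uid}")
        except KeyError:
            print(f"NSS does not know {user}")
        with open("/etc/passwd") as f:      # the file usermod actually edits
            local = any(line.split(":")[0] == user for line in f)
        print(f"local /etc/passwd entry: {local}")

    # where_is("spetrea")  # on these nodes: NSS yes (uid 2548), local no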
formey is as well [23:47:41] and it's been a pain in the ass on there as well [23:47:45] aye [23:48:05] it's possible to install ssh keys separately from user accounts [23:48:09] fwiw, the idea here was to allow people who had signed the data nda access to data in hadoop and hue (hadoop web UI) without giving them shell accounts [23:48:12] well, not the way we have it setup, I guess [23:48:33] ideally we'd be able to separate the account management and ssh keys [23:48:43] yeah [23:48:43] but [23:48:51] will things be weird if he has a different uid here vs elsewhere? [23:48:53] i guess not [23:49:13] mutante, RobH, feel free to assign that ticket to me, and I will take care of it tomorrow/very soon [23:49:15] no. we have mismatched uids and gids all over the place [23:49:15] TimStarling: only those jobs failed en masse, refreshLinks and a few TMH jobs fail a small portion of the time [23:49:17] ok cool [23:49:30] those failed jobs are spread out and do not fit that neat date range [23:49:35] i'm happy to figure out how to make puppet deal with this case better [23:49:39] it's not ideal if you use tar or rsync in a way that keeps uid/gid [23:49:42] mainly just installing ssh key without managing user [23:50:20] so Ryan_Lane, if /home/spetrea/.ssh/authorized_keys existed [23:50:22] with his key in it [23:50:24] he'd be able to log in? [23:50:26] a few fail almost every day, whereas email jobs failed heavily during that small range and not at all otherwise [23:50:27] yes [23:50:30] right [23:50:35] add him on analytics1001 manually .. just so that puppet can fix it [23:50:41] adduser, run puppet [23:50:44] ok, I can figure out a good way to puppetize that [23:50:47] not adduser [23:50:52] just add the ssh key [23:50:57] New patchset: Hashar; "pylint on contint server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46466 [23:50:58] puppet is trying to run adduser too [23:50:59] AaronSchulz: but if mail() returned false, the job wouldn't fail, would it? [23:51:12] ottomata: thats not the issue though [23:51:16] no? [23:51:32] New review: Hashar; "installed on gallium, feel free to merge on sight." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/46466 [23:51:40] ottomata: the issue is "change from 2548 to 612 failed:" because "usermod: user 'spetrea' does not exist in /etc/passwd" [23:51:45] TimStarling: it would not "fail" in the system true [23:51:51] so add him to /etc/passwd somehow... then let puppet fix the UID [23:51:55] ack() is always called regardless of what run() returns [23:52:04] errrrrrrrr i guess so [23:52:10] but a fatal error would have done it [23:52:10] yeah hmmm [23:52:11] "failing" really means "runner dying" or "losing contact with DB" [23:52:11] actually yeah [23:52:20] that would work, because I already have a shell account there and puppet is fine with it [23:52:25] Ryan_Lane: are you working on Sartoris today at all? [23:52:30] preilly: no.
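One of the workarounds floated above (Ryan's "just add the ssh key", so login works without letting account management get in the way) boils down to writing ~/.ssh/authorized_keys with the ownership and modes sshd insists on. A hedged sketch; the home path, uid and gid echo the log above (uid=2548, gid=500 wikidev) but are placeholders here:

    import os

    def install_key(home: str, uid: int, gid: int, pubkey: str) -> None:
        """Drop an authorized_keys file without touching account management."""
        ssh_dir = os.path.join(home, ".ssh")
        os.makedirs(ssh_dir, exist_ok=True)
        os.chmod(ssh_dir, 0o700)            # sshd rejects group/world-writable dirs
        path = os.path.join(ssh_dir, "authorized_keys")
        with open(path, "w") as f:
            f.write(pubkey.rstrip() + "\n")
        os.chmod(path, 0o600)
        for p in (ssh_dir, path):
            os.chown(p, uid, gid)           # sshd requires the user to own these

    # install_key("/home/spetrea", 2548, 500, "ssh-rsa AAAA... spetrea")  # illustrative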
labs [23:52:34] yeah, an exception or fatal or process getting killed [23:52:38] okok, you guys got it, RobH, feel free to assign to me if you want me to figure it out [23:52:40] Ryan_Lane: okay [23:52:47] i gotta run real soon, guests coming over [23:52:50] thanks all [23:52:51] and amazingly, I'm seeing the exact same memcache bug in labs that I'm seeing in production [23:52:58] it's definitely not related to bad hardware [23:53:02] * AaronSchulz looks at the log archives [23:56:13] ottomata: ok [23:57:16] ottomata: https://rt.wikimedia.org/Ticket/Display.html?id=4402 is all yers [23:57:26] his user is added and such, so once its fixed he'll get access [23:57:41] hashar: merged https://gerrit.wikimedia.org/r/#/c/46463/ [23:57:46] hokay, thanks [23:58:42] preilly: thaanks :) [23:59:22] hashar: as-is https://gerrit.wikimedia.org/r/#/c/46464/ [23:59:27] TimStarling: I'm not seeing any wave of fatals or exceptions on the 22nd, 23rd, nor 24th in the logs [23:59:51] not many affect runJobs.php and the ones that did were for refreshLinks