[00:00:04] spagewmf: Yes both of them [00:00:13] !log esanders Finished scap: SWAT deploy (duration: 28m 39s) [00:00:19] Logged the message, Master [00:00:19] i deleted it, i think it's back up [00:00:50] i'm an idiot, though; i should have dumped the value so we could see why it couldn't be decoded [00:00:51] ori: now test.wikipedia is down [00:01:02] same looking stack trace [00:01:07] oh great, so i get to try that again :) [00:01:43] yep, who said life never gives you a second chance? [00:07:54] greg-g: i have a fix, not sure where to commit it [00:08:01] i'll update the bug [00:09:05] chrismcmahon: do you happen to remember/see which commit katie reverted to fix beta cluster? [00:09:19] chrismcmahon: I see https://gerrit.wikimedia.org/r/#/c/159818/ which is huge [00:10:35] greg-g: yikes, pretty sure that's the one. not even a +1 for that [00:10:59] oh wait, i saw her say "per break all the rules" [00:11:08] when she couldn't find a reviewer :) [00:11:11] hah [00:11:22] greg-g, marktraceur any chance we can still SWAT that mediaviewer bug [00:11:23] at least there was self awareness [00:12:08] Eloquence: I want to get test.wikipedia back up, wasn't aware of a waiting MV fix [00:12:08] !log esanders Synchronized php-1.24wmf21/extensions/MultimediaViewer/: (no message) (duration: 00m 07s) [00:12:13] Logged the message, Master [00:12:15] ah, there it is ^ [00:12:19] \o/ [00:12:25] !log esanders Synchronized php-1.24wmf21/resources/lib/oojs-ui/: (no message) (duration: 00m 03s) [00:12:30] Logged the message, Master [00:13:12] marktraceur, MV deploy done [00:13:30] spagewmf, Flow deploy done [00:13:51] thanks edsanders [00:14:10] * marktraceur testing [00:14:49] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [00:15:19] Seems good to me with debug=true, caching issues still but I'm not sure that's a bad thing [00:16:00] touch and sync the js file [00:16:32] bsitu, is it necessary to rescap to fix broken RL (https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.gettingstarted.lightbulb.flyout), or just change any file in the module? [00:16:48] The message is defined, but I think RL cached it during scap before it had it. [00:17:08] Look at the bottom of that to see what I mean. [00:23:12] greg-g, we'd like to do some kind of scap or sync to fix two issues. [00:23:17] Stale RL messages and a CSS problem. [00:23:19] Is that possible [00:23:20] ? [00:24:21] superm401: for flow? yes, one second, ori is fixing test.wikipedia with a wikidata revert, I want that to be clean [00:24:41] greg-g, for GettingStarted. We're not ready yet anyway. We won't go until we have the OK. [00:24:49] awesome [00:24:53] will let you know [00:25:47] greg-g: flow can go ahead; i need a minute anyway [00:26:25] s/flow/gettingstarted/, superm401 if you're close [00:26:47] ori: 20:08 logmsgbot: reedy Synchronized php-1.24wmf21/extensions/Wikidata/: (no message) (duration: 00m 17s) [00:27:55] superm401: my bible (How to deploy code) claims "Fixing this issue requires that you touch the files in question and then re-sync them to the cluster.", not a scap. But I'm not sure how that applies to RL messages [00:28:07] greg-g, almost, not quite. [00:31:28] spagewmf, just touch resources in the same modules with those messages [00:31:46] (oh, and did you scap the messages?) [00:32:12] aha, see why you asked :) [00:34:04] MaxSem, yeah, the messages are scapped, I think RL just needs to rebuild the blob. 
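[editor's note: a minimal sketch of the touch-and-sync workaround MaxSem describes above, as run from the deploy host. The module path and the sync-dir invocation are assumptions based on the chat, not a copy of what was actually run.]
    # Bump the mtime of a resource in the affected module so ResourceLoader
    # recomputes the module version and rebuilds the stale message blob...
    touch php-1.24wmf21/extensions/GettingStarted/resources/*.js
    # ...then push just that directory to the cluster instead of a full scap:
    sync-dir php-1.24wmf21/extensions/GettingStarted/resources 'Touch GettingStarted resources to refresh RL message blobs'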
[00:36:31] !log ori Synchronized php-1.24wmf21/extensions/Wikidata: Update Wikidata to tip of master for I23b7eb54b8e (Bug: 70747) (duration: 00m 08s) [00:36:36] Logged the message, Master [00:38:28] greg-g: i think it's ok now [00:38:50] tada! [00:39:11] I had a new stack trace right after you finished sync'ing, and was diff'ing it, but it's back! [00:39:40] i added details to the bug [00:39:41] thank you muchly, ori [00:39:51] including a raw dump of the value from memcached (it didn't contain any sensitive data) [00:39:57] * greg-g nods [00:40:01] so the wikidata folks should be able to reproduce it [00:56:48] (03PS2) 10Ori.livneh: mediawiki::sync: re-declare deployment paths [puppet] - 10https://gerrit.wikimedia.org/r/159674 [00:56:52] (03CR) 10jenkins-bot: [V: 04-1] mediawiki::sync: re-declare deployment paths [puppet] - 10https://gerrit.wikimedia.org/r/159674 (owner: 10Ori.livneh) [00:58:05] (03PS3) 10Ori.livneh: mediawiki::sync: re-declare deployment paths [puppet] - 10https://gerrit.wikimedia.org/r/159674 [00:59:54] (03CR) 10Ori.livneh: [C: 032] mediawiki::sync: re-declare deployment paths [puppet] - 10https://gerrit.wikimedia.org/r/159674 (owner: 10Ori.livneh) [01:09:26] greg-g, are we okay to sync? Basically, we just have a couple small CSS and JS changes, and apparently syncing those will also fix our stale RL message problem. [01:11:10] superm401: gah, yeah, sorry, I had a phone call [01:11:23] superm401: go ahead, ori fixed wikidata stuffs [01:11:40] It's okay, thank you. [01:13:43] (03CR) 10MaxSem: mediawiki::sync: re-declare deployment paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/159674 (owner: 10Ori.livneh) [01:15:44] wikitech has stopped working for me: https://wikitech.wikimedia.org/ => "Not Found: The requested URL /wiki/ was not found on this server.". Must have happened in the last half hour or so. [01:15:50] Yeah me too [01:15:59] Investigating [01:16:01] probably one of my patches [01:16:04] sec [01:16:49] which machine is it on again? virt1000? [01:17:01] I think so. [01:17:10] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures [01:18:35] back now [01:19:05] !log manually migrated /u/l/a/common-local to /srv/mediawiki on virt1000 [01:19:14] Logged the message, Master [01:19:15] Thanks ori [01:19:36] Yep, works for me again. Thanks! 
[01:21:23] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [01:24:47] (03PS1) 10Ori.livneh: Update path reference for /srv/mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159945 [01:26:30] (03PS2) 10Ori.livneh: Update path reference for /srv/mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159945 [01:26:38] (03CR) 10Ori.livneh: [C: 032] Update path reference for /srv/mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159945 (owner: 10Ori.livneh) [01:26:43] (03Merged) 10jenkins-bot: Update path reference for /srv/mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159945 (owner: 10Ori.livneh) [01:28:47] !log ori updated /a/common to {{Gerrit|Ia5b81076e}}: Update path reference for /srv/mediawiki [01:28:52] Logged the message, Master [01:29:18] !log ori Synchronized wmf-config/wikitech.php: Ia5b81076e: Update path reference for /srv/mediawiki (duration: 00m 04s) [01:29:23] Logged the message, Master [01:30:39] PROBLEM - Disk space on searchidx1001 is CRITICAL: DISK CRITICAL - free space: / 20 MB (0% inode=23%): [01:30:58] (03PS1) 10Ori.livneh: beta: update one last path reference for /srv/mediawiki change [puppet] - 10https://gerrit.wikimedia.org/r/159946 [01:32:33] !log mattflaschen Synchronized php-1.24wmf20/extensions/GettingStarted/: CSS tweaks for GettingStarted A/B test (duration: 00m 21s) [01:32:37] Logged the message, Master [01:32:50] i think searchidx1001 might be related too, fixing [01:32:50] RECOVERY - Disk space on searchidx1001 is OK: DISK OK [01:33:01] !log mattflaschen Synchronized php-1.24wmf21/extensions/GettingStarted/: CSS tweaks for GettingStarted A/B test (duration: 00m 07s) [01:33:06] Logged the message, Master [01:33:19] (03CR) 10Dzahn: [C: 031] beta: update one last path reference for /srv/mediawiki change [puppet] - 10https://gerrit.wikimedia.org/r/159946 (owner: 10Ori.livneh) [01:34:00] (03PS1) 10Aude: Bump shared cache key for Wikidata (memcached storage of items, etc.) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159948 [01:34:29] MaxSem, that doesn't seem to have fixed the RL message problem for us: https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=true&lang=en&modules=ext.gettingstarted.lightbulb.flyout [01:34:38] Do I need to wait for something to expire, or am I going to have to scap? [01:35:05] does it work with debug=true? [01:35:25] No: https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=true&lang=en&modules=ext.gettingstarted.lightbulb.flyout [01:35:54] I'm getting broken messages at the bottom of that (and I verified much earlier the actual GUI was broken too) [01:39:25] (03CR) 10Ori.livneh: [C: 032] beta: update one last path reference for /srv/mediawiki change [puppet] - 10https://gerrit.wikimedia.org/r/159946 (owner: 10Ori.livneh) [01:39:27] then there's something wrong with localisation cache [01:39:59] MaxSem, are you seeing the broken messages there too? [01:40:40] yup [01:40:43] https://dpaste.de/3vs8/raw [01:40:53] scapscapscap [01:41:11] * ori concurs [01:42:42] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures [01:42:49] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet has 1 failures [01:42:59] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: Puppet has 1 failures [01:43:56] Thanks, ori, MaxSem. Our CSS fix introduced another problem, so I need to do one more round of cherry-picking. [01:43:58] Then, I'll scap. 
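[editor's note: for reference, the "scapscapscap" being resorted to here is the full-cluster deploy command, which just takes a log message; the invocation below reuses the message from the !log entries that follow.]
    scap 'One last CSS fix (wrapping issue for error state) for GettingStarted A/B test'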
[01:44:00] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [01:46:46] mutante: snapshot1xxx is me, fixing [01:52:52] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet has 1 failures [01:55:19] PROBLEM - puppet last run on virt0 is CRITICAL: CRITICAL: Puppet has 1 failures [02:00:32] RECOVERY - puppet last run on virt0 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [02:02:59] PROBLEM - RAID on es1005 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [02:07:49] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [02:08:45] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3614 MB (3% inode=99%): [02:09:10] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [02:17:20] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [02:23:29] PROBLEM - puppet last run on mw1130 is CRITICAL: CRITICAL: Puppet has 1 failures [02:26:44] (03PS5) 10Rush: *.wmfusercontent.org ssl termination on web-misc cluster [puppet] - 10https://gerrit.wikimedia.org/r/159820 [02:26:46] (03PS1) 10Rush: allow only for localssl protoproxy [puppet] - 10https://gerrit.wikimedia.org/r/159961 [02:27:31] (03CR) 10jenkins-bot: [V: 04-1] allow only for localssl protoproxy [puppet] - 10https://gerrit.wikimedia.org/r/159961 (owner: 10Rush) [02:27:39] RECOVERY - puppet last run on mw1130 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [02:28:54] greg-g, okay, I'm going to scap for that CSS fix. It's just a oneliner to fix an issue introduced by the last hotfix. [02:29:08] But the scap is needed to fix https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=true&lang=en&modules=ext.gettingstarted.lightbulb.flyout [02:32:33] (03PS2) 10Rush: allow only for localssl protoproxy [puppet] - 10https://gerrit.wikimedia.org/r/159961 [02:32:36] (03PS1) 10Dzahn: detect mw configuration errors in header [debs/wikistats] - 10https://gerrit.wikimedia.org/r/159963 (https://bugzilla.wikimedia.org/54161) [02:34:32] (03PS2) 10Dzahn: detect mw configuration errors in header [debs/wikistats] - 10https://gerrit.wikimedia.org/r/159963 (https://bugzilla.wikimedia.org/54161) [02:39:35] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-12 02:39:35+00:00 [02:39:41] Logged the message, Master [02:39:49] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: Puppet has 1 failures [02:43:30] !log mattflaschen Started scap: One last CSS fix (wrapping issue for error state) for GettingStarted A/B test [02:43:36] Logged the message, Master [02:58:10] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [03:00:05] "02:57:55 30 apaches had sync errors" [03:00:21] 30 out of 229 is a lot. [03:00:39] RECOVERY - Disk space on virt0 is OK: DISK OK [03:03:04] PROBLEM - puppet last run on mw1062 is CRITICAL: CRITICAL: Puppet has 1 failures [03:05:00] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: Puppet has 1 failures [03:08:11] !log mattflaschen Finished scap: One last CSS fix (wrapping issue for error state) for GettingStarted A/B test (duration: 24m 38s) [03:08:17] Logged the message, Master [03:08:29] ^ Well, it was quick, but that's the only good thing I can say about it. 
[03:09:02] Lots of errors, and https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=true&lang=en&modules=ext.gettingstarted.lightbulb.flyout is still broken. [03:11:22] superm401: i see some "hosts had scap-rebuild-cdbs errors" in the logs [03:11:33] aude, yep [03:11:54] aude, do you know where I can get a human-readable scap log (if not, I have it in my terminal)? [03:12:07] also rsync issues [03:12:25] human readable, depends... i am looking at fluorine [03:12:31] scap.log [03:13:09] "some files vanished before they could be transferred" [03:13:31] aude, yeah, I'm going to file a bug about that one in particular. I think it may be close to the root cause. [03:13:37] hosts had sync_wikiversions errors [03:13:39] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Puppet has 1 failures [03:13:43] 2 [03:14:12] this came after localisation update [03:14:24] shouldn't be a problem though [03:15:58] !log LocalisationUpdate completed (1.24wmf21) at 2014-09-12 03:15:57+00:00 [03:16:02] Logged the message, Master [03:17:10] springle: I get connection errors when I try to connect from command-line PHP [03:17:27] Lost connection to MySQL server at 'reading authorization packet' [03:19:06] springle: http://paste.tstarling.com/p/VJFbOR.html [03:20:17] i get a 503 on https://test.wikidata.org/wiki/Q22 [03:20:23] RECOVERY - puppet last run on mw1062 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [03:22:06] ok, i think it's just wikibase, as deployed there [03:23:17] well, I know the site's not down, because apache traffic is normal [03:23:20] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [03:23:46] but tin's PHP seems to be screwed up [03:23:52] which could explain the scap failures [03:26:45] TimStarling, hmm, scap itself is written in Python but it might call into PHP somewhere. [03:30:59] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [03:42:03] Looks like the ResourceLoader messages finally started working. [03:50:35] greg-g: around? [03:50:35] aude: You sent me a contentless ping. This is a contentless pong. Please provide a bit of information about what you want and I will respond when I am around. [03:51:13] greg-g: test.wikidata is broken (polluted cache and who knows what else) [03:51:38] not sure about waiting until monday, since apparently it's important for pywikibot folks [04:03:20] TimStarling: no idea what happened on tin. but something else has raised db traffic on at least commons and enwiki, which maybe contributed [04:04:51] investigating. am presently on leave and travelling with 3g, so need to catch up [04:06:22] right, who is responsible if you're not available? me? [04:06:32] heh, LinksUpdate queries everywhere [04:07:36] TimStarling: I'm still on call. hence present [04:08:00] lucky you [04:12:05] plenty of 'Too many connections' but no real site 5xx impact. the mariadb 10 slaves with connection pools took the brunt [04:12:33] same for external storage [04:27:20] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Sep 12 04:27:17 UTC 2014 (duration 27m 16s) [04:27:26] Logged the message, Master [04:53:21] <_joe_> springle: hey, I see in the backlog a good deal of trouble [04:53:29] <_joe_> may I help in any way? [04:55:34] _joe_: something caused tin to have issues. I didn't see it in action, but TimStarling did. 
db connection spikes occurred (i think unrelated to tin) due to LinksUpdate queries, but site not affected [04:55:53] <_joe_> ok [04:56:23] <_joe_> TimStarling: which issues does tin have? Or, I'll take a look after coffee [04:56:40] http://paste.tstarling.com/p/VJFbOR.html [04:56:43] at this stage i think little can be done, except check out tin. i need to quiz people on what is causing LinksUpdate query spikes on commons and enwiki [04:56:45] maybe it wasn't just tin [04:56:58] I assumed it was because when I tried it on another server, it didn't do it [04:57:09] but I see that it doesn't do it on tin now [04:57:17] if it was transient it may have affected the whole cluster [04:57:19] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:58:00] basically it is a series of slave DB connection failures with message "Lost connection to MySQL server at 'reading authorization packet', system error: 0" [04:58:32] I had a few script failures before I ran that minimal test case for the pastebin [05:01:42] <_joe_> TimStarling: the error you see, is either from a network failure or a db failure usually [05:05:57] _joe_: only tin complained in dberror.log at the time TimStarling noted the issues. my guess is something tin-side, or maybe the swarm of rsync processes that appear for LocalisationUpdate [05:06:31] i've previously seen scap slow due to that, though not failing outright [05:06:33] <_joe_> springle: if so, it could have been hogged with network traffic [05:06:39] maybe [05:06:52] <_joe_> that would be seen in ganglia [05:07:52] <_joe_> https://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&h=tin.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Miscellaneous+eqiad [05:08:49] <_joe_> TimStarling: does the timing of those two network spikes coincide with your issues? [05:09:21] yes [05:10:02] possibly also coinciding with aude's scap problems [05:10:13] don't know whether that was a cause or effect though [05:13:07] <_joe_> that particular message happens when there is a lag in the connection between the client and the server. So my best guess is that tin was capped out on network bandwidth [05:14:39] funny kind of error message then [05:15:28] <_joe_> TimStarling: it's quite common in mysqland to have funny error messages [05:16:10] <_joe_> (what really happens is that the client reaches its connect_timeout setting during the auth phase of the connection) [05:16:39] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:23:42] :! [05:25:26] _joe_: our connect_timeout is 3sec. hence... [05:25:40] <_joe_> springle: makes sense [05:25:50] wonder what was the cause there, though, on tin [05:25:50] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:25:58] <_joe_> springle: scap [05:26:14] <_joe_> filling up the tin network capabilities [05:26:25] do we usually get such a reaction from tin during scap? 
[05:26:36] <_joe_> springle: not really [06:00:46] ACKNOWLEDGEMENT - RAID on es1005 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Sean Pringle RT 8346 [06:28:10] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Epic puppet fail [06:28:10] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Epic puppet fail [06:28:10] PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Epic puppet fail [06:28:51] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Epic puppet fail [06:29:00] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Epic puppet fail [06:29:00] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Epic puppet fail [06:29:00] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Epic puppet fail [06:29:10] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:40] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:02] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:10] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:11] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:13] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 3 failures [06:30:19] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:19] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:20] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:20] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:30] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:39] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:49] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:59] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:03] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:10] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [06:43:49] <_joe_> it's mod_passenger o'clock again [06:45:12] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:45:19] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:45:19] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:45:30] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:45:30] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:45:32] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:45:49] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:45:50] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:45:59] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:46:00] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently 
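[editor's note: a hedged aside on the 'reading authorization packet' diagnosis above. If the TCP connect succeeds but the auth handshake stalls for longer than the server's connect_timeout (3s here, per springle), mysqld drops the connection with exactly that error. The hostname below is a placeholder, not a host named in the chat.]
    # Check the server-side setting _joe_ and springle are talking about:
    mysql -h db1056.eqiad.wmnet -e "SHOW GLOBAL VARIABLES LIKE 'connect_timeout';"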
enabled, last run 54 seconds ago with 0 failures [06:46:09] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:46:10] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:46:21] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:46:21] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:46:30] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:46:30] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:47:09] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:47:09] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:47:29] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:47:29] RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:47:30] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:48:30] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures [06:58:48] (03CR) 10Filippo Giunchedi: [C: 032] Puppetize icinga log file permission fix. [puppet] - 10https://gerrit.wikimedia.org/r/158633 (owner: 10JanZerebecki) [06:59:53] mod_passenger <3 [07:06:40] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:09:15] godog, did you ever get your network device access sorted out? [07:10:44] apergos: not AFAICT, I'll poke akosiaris tho (hi!) [07:10:54] (morning!) 
[07:15:04] <_joe_> ok so, I'm going to change the pybal config host today [07:15:19] <_joe_> if no one objects [07:19:26] (03CR) 10Filippo Giunchedi: wmflib: add to_milliseconds() / to_seconds() (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/159692 (owner: 10Ori.livneh) [07:33:07] (03PS2) 10Giuseppe Lavagetto: pybal: change configuration host address [puppet] - 10https://gerrit.wikimedia.org/r/159739 [07:39:33] <_joe_> !log changing pybal config place; stopping puppet on all loadbalancers [07:39:38] Logged the message, Master [07:40:58] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: change configuration host address [puppet] - 10https://gerrit.wikimedia.org/r/159739 (owner: 10Giuseppe Lavagetto) [07:44:11] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Epic puppet fail [07:46:08] <_joe_> damn [07:46:22] <_joe_> I just resolved this one cm short of doing a WTF [07:48:02] (03PS1) 10Giuseppe Lavagetto: pybal conf: follow symlinks [puppet] - 10https://gerrit.wikimedia.org/r/159981 [07:48:25] (03PS2) 10Giuseppe Lavagetto: pybal conf: follow symlinks [puppet] - 10https://gerrit.wikimedia.org/r/159981 [07:48:32] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] pybal conf: follow symlinks [puppet] - 10https://gerrit.wikimedia.org/r/159981 (owner: 10Giuseppe Lavagetto) [07:49:08] <_joe_> .win 26 [07:51:21] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:57:49] <_joe_> if anyone needs to edit pybal config, please ping me [08:06:19] <_joe_> !log new pybal conf applied in all of ulsfo [08:06:25] Logged the message, Master [08:13:07] PROBLEM - SSH on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:13:24] PROBLEM - nutcracker port on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:13:24] PROBLEM - puppet last run on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:13:24] PROBLEM - DPKG on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:13:48] PROBLEM - nutcracker process on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:13:48] PROBLEM - check configured eth on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:13:48] PROBLEM - check if dhclient is running on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:13:50] PROBLEM - Disk space on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:13:58] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:14:15] <_joe_> eh [08:14:26] <_joe_> oom, again [08:14:41] <_joe_> I have no time for it now [08:17:47] RECOVERY - nutcracker process on mw1053 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [08:17:47] RECOVERY - check if dhclient is running on mw1053 is OK: PROCS OK: 0 processes with command name dhclient [08:17:47] RECOVERY - check configured eth on mw1053 is OK: NRPE: Unable to read output [08:17:50] RECOVERY - Disk space on mw1053 is OK: DISK OK [08:17:58] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [08:18:08] RECOVERY - SSH on mw1053 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:18:27] RECOVERY - nutcracker port on mw1053 is OK: TCP OK - 0.000 second response time on port 11212 [08:18:27] RECOVERY - DPKG on mw1053 is OK: All packages OK [08:19:32] <_joe_> !log reactivated puppet on all lvs hosts, esams almost done, pending eqiad [08:19:38] Logged the message, Master [08:31:40] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [08:37:02] <_joe_> !log rolling restart of pybal finished. Adding note on Fenari [08:37:08] Logged the message, Master [08:38:02] exciting [08:38:23] <_joe_> apergos: I'm sending an email to ops@ in ~ 10 minutes [08:38:34] <_joe_> so that everyone knows where to edit pybal configs [08:38:53] good [08:39:07] maybe you want to move the fenari ones to pybal_DO_NOT_EDIT or something [08:39:51] <_joe_> apergos: mmmh not right now [08:39:59] heh [08:40:08] <_joe_> I moved pybal, not sure there isn't something checking that for some reason [08:40:17] <_joe_> I'll do a little testing btw [08:56:22] <_joe_> brb [09:39:11] (03PS2) 10Aude: Bump shared cache key for test.wikidata (memcached storage of items, etc.) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159948 [10:19:45] apergos: I've found a workaround for now which is to bug someone else with access :) [10:42:28] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:49:45] YuviPanda: awesome: does this mean I can close the ticket? :-D [11:10:55] (03PS3) 10Giuseppe Lavagetto: puppet: hiera backend for the WMF [puppet] - 10https://gerrit.wikimedia.org/r/151869 [11:20:11] (03PS4) 10Giuseppe Lavagetto: puppet: hiera backend for the WMF [puppet] - 10https://gerrit.wikimedia.org/r/151869 [11:34:47] apergos: ya [11:35:00] apergos: looks like I'll just have to wait until I 'officially' move to ops (Nov) before I get that [11:35:01] seems ok [11:35:10] just a lot of people to bug before that :) [11:52:38] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:52:48] okey dokey! [12:11:00] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:19:13] (03PS5) 10Giuseppe Lavagetto: puppet: hiera backend for the WMF [puppet] - 10https://gerrit.wikimedia.org/r/151869 [12:26:25] (03CR) 10Mark Bergsma: "I think I need to review a bit more, but here are my initial comments." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/159820 (owner: 10Rush) [12:35:59] (03CR) 10Mark Bergsma: [C: 04-1] "This will break due to a duplicate definition of the localssl nginx site config." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/159820 (owner: 10Rush) [12:40:10] so localssl was a simplification of the original protoproxy stuff that used multiple sites [12:40:23] I guess with SNI, we should perhaps not try to make localssl use multiple sites [12:40:51] <_joe_> but... SNI means both ie6/7 and android <3 will be cut off our SSL [12:41:07] <_joe_> I just realized this is for the misc varnishes [12:41:36] it's 2014 [12:45:31] hmm looks like protoproxy::localssl is a define but should really be a class [12:45:36] interesting, android < 3 is 12% according to https://developer.android.com/about/dashboards/index.html [12:45:50] <_joe_> well, I took great care in supporting everything with the PFS changes [12:46:22] 12%? [12:46:23] wtf [12:46:35] _joe_: yeah for our main sites we definitely have to [12:46:54] for phab images, i'm not so sure [12:46:59] but 12% [12:47:16] <_joe_> mark: el cheapo android devices [12:47:49] <_joe_> and devices in the developing countries, which are often used phones coming from the richer ones, I guess [12:48:29] can't find trends though [12:49:33] ok [12:49:35] so what options do we have [12:49:51] a unified cert [12:49:55] or a separate site on a separate ip [12:50:07] or not putting this on the misc web cluster [12:50:26] <_joe_> or, we decide misc can live with SNI [12:51:14] well 12% of android traffic is a bit more than I'd like [12:52:27] <_joe_> Android default browser on Android 2.x[46] (Fixed in Honeycomb for tablets and Ice Cream Sandwich for phones) to be precise [12:53:33] <_joe_> http://en.wikipedia.org/wiki/Server_Name_Indication#Client_side [12:54:45] I really want to move the other dcs to the ulsfo model on nginx+varnish for ssl rather than separate ssl boxes [12:54:54] yes [12:54:55] if it works out for perf, etc. but it seems to. [12:55:27] perhaps we can pull some more precise user-agent stats from analytics, but even half that would be still high I think [12:55:38] mark: have you looked at https://gerrit.wikimedia.org/r/#/c/157978/ ? I know it's kind of a bitch of a change :) [12:56:04] damn bblack :P [12:56:08] i used to do those very incrementally [12:56:15] I've validated most of the stuff re: the disappearing IPs. I'm a little less certain about possible SSL impact of what the "proxy_server" value means (does it even matter) when so many domains are proxying through one IP anyways [12:56:42] e.g. proxy_server_name => '*.wikipedia.org', [12:57:11] but most ssl v4 traffic for wikipedia.org is currently going fine through the one that's '*.wikimedia.org' anyways [12:57:12] i don't know either [12:57:18] doesn't that end up using the unified cert? [12:57:26] so it seems like it doesn't matter, they're all on unified certs [12:57:34] yeah likely [12:57:40] which makes me wonder if we still need the star certs installed at all in those cases [12:57:47] probably not [12:57:59] but perhaps migrating towards localssl is easier than cleaning this up now? [12:58:21] well the protoproxy fixes just mirror the LVS IP removals, I figured it was easier this way around. [12:58:39] i'd say, do it manually on a box and test? 
;) [12:58:51] :) [12:58:55] i agree with it in principle [12:59:03] can't say I am confident that this change doesn't break anything [12:59:25] well, basically everywhere a listener is being removed, it doesn't get any traffic anyways [12:59:31] both in the nginx and lvs cases [12:59:47] right [13:00:25] so mostly I guess I shouldn't be concerned at all, it just troubled me a bit with the seemingly pointless proxy_server_name values and star certs [13:00:53] it made sense before the unified cert [13:00:54] that got added later [13:01:25] we didn't have a unified cert for a long time [13:01:37] so we only had the star certs, and that's why we got all the different ips [13:02:06] of course the drawback of the unified is that it's a huge cert [13:02:14] so ideally we -would- use the star certs [13:02:20] which I don't think is done now [13:02:25] but for clients supporting SNI we could [13:02:25] as it stands, I almost deployed that yesterday but backed off. while actually prepping to do so, I realized it was going to be ridiculously complicated (27 lvs or ssl hosts to log into, manually disable puppet, watch the puppet runs, manually clean up removed IPs that puppet doesn't actually shut off or remove from the interfaces file, etc) [13:02:30] as long as we can fall back to the unified [13:02:46] yeah [13:02:46] so yeah, I might still go refactor it into some smaller subsets just to make the deploy less harrowing [13:02:51] that's why I always do this very incrementally [13:03:04] so I think [13:03:11] well I was in a mood at the time that I could totally do it. then I lost that mood :) [13:03:19] hmm [13:03:26] so we want to move to a localssl model [13:03:34] but we also want to support SNI to have smaller certs [13:03:46] so localssl is overly simplified atm, it only supports a single site and a single cert [13:04:01] perhaps we should create something intermediate that supports multiple certs/sites with SNI [13:04:05] huh? [13:04:13] but still only supports a single backend (varnish on the same host) [13:04:20] localssl in ulsfo doesn't seem to be supporting a single site per host? [13:04:29] what do you mean? [13:04:48] I mean text-lb in ulsfo all goes through one IP address for many hostnames in unrelated domains [13:04:59] yes [13:05:03] so, protoproxy::localssl [13:05:08] that's a define, but should be a class [13:05:14] effectively, it listens on all ips [13:05:17] doesn't care what the servername is [13:05:26] takes a single (possibly unified) cert [13:05:29] and uses that for all traffic it gets [13:05:32] yeah but only one of those IPs is really getting traffic, for many domains [13:05:35] and uses the localhost as backend [13:05:39] yes [13:05:51] my point is [13:05:58] you can't add more certs to that setup as it stands now [13:06:02] (which Chase is trying to do atm) [13:06:07] it'll break unless we extend it [13:06:09] oh, I see [13:06:20] we would actually like to support SNI [13:06:29] so we don't have to serve everyone the huge unified cert [13:06:44] and then we go back to star certs? [13:06:57] star certs for SNI capable clients, fallback to unified I think [13:06:59] but all on one ip [13:07:07] right, ok [13:07:38] what's it worth? how much size difference are we talking about? does a unified cert have just one key and lots of metadata, or many embedded keys? 
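[editor's note: a minimal nginx sketch of the "star certs for SNI-capable clients, unified cert as fallback" idea being discussed above; certificate paths and names are illustrative assumptions, not the actual localssl puppetization. Non-SNI clients land on the default_server block and get the unified cert; SNI clients matching a star server_name get the smaller star cert.]
    server {
        listen 443 ssl default_server;   # non-SNI clients are handed this cert
        server_name _;
        ssl_certificate     /etc/ssl/localcerts/unified.chained.pem;
        ssl_certificate_key /etc/ssl/private/unified.key;
        location / { proxy_pass http://127.0.0.1:80; }  # varnish on the same box
    }
    server {
        listen 443 ssl;                  # only selected when the client sends SNI
        server_name *.wikipedia.org;
        ssl_certificate     /etc/ssl/localcerts/star.wikipedia.org.chained.pem;
        ssl_certificate_key /etc/ssl/private/star.wikipedia.org.key;
        location / { proxy_pass http://127.0.0.1:80; }
    }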
[13:07:40] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:42] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:42] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:51] uh oh [13:07:59] oh wth? I just woke up, I haven't even had time to break anything yet on my own :p [13:08:07] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 75.77 ms [13:08:12] hmm it's back [13:08:18] RECOVERY - Host cp4018 is UP: PING OK - Packet loss = 0%, RTA = 75.37 ms [13:08:18] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 75.19 ms [13:08:36] bblack: so, see https://wikitech.wikimedia.org/wiki/HTTPS/Future_work [13:08:40] performance enhancements [13:08:48] but basically yes [13:08:57] the unified ssl cert increases our RTT at connection setup [13:09:16] i haven't investigated this deeply, paravoid and ryan were investigating this at the time [13:09:27] but anyway [13:09:33] what we seem to need for ssl termination [13:09:38] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: Puppet has 3 failures [13:09:40] is something in between localssl and the proxy.erb setup [13:09:47] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [13:10:00] so we probably want to make a default site that's a lot like current localssl, setup with the unified ssl cert [13:10:09] and then setup additional sites with the star certs, supporting SNI [13:10:20] yeah that's just puppet refactoring though. fundamentally, I think we can do the localssl model on the boxes and support multiple hosts where currently necessary. [13:10:21] and we -also- seem to need this for misc web cluster now [13:10:25] just needs the puppetwork [13:10:28] yep [13:10:29] PROBLEM - puppet last run on amssq51 is CRITICAL: CRITICAL: Epic puppet fail [13:10:32] so chase needs this today [13:10:32] hehe [13:10:36] ok [13:10:54] and he's done some patchsets yesterday [13:10:59] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 3 failures [13:11:02] but they currently break, as localssl wasn't designed for this [13:11:04] and they're slightly messy too [13:11:11] is it time for the morning puppetmaster fallover already or something? [13:11:19] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [13:11:23] https://gerrit.wikimedia.org/r/#/c/159820/ is the main one [13:11:33] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: Puppet has 3 failures [13:11:41] PROBLEM - puppet last run on amssq34 is CRITICAL: CRITICAL: Puppet has 1 failures [13:12:08] bblack: nah it whined this morning like clockwork [13:12:12] perhaps it's easiest to take protoproxy::localssl, leave it alone but copy it [13:12:15] and extend it a bit [13:12:25] so we can switch to that on misc-web cluster now [13:12:30] and then on the main sites later too [13:12:47] <_joe_> bblack: getaddrinfo: Temporary failure in name resolution [13:12:49] <_joe_> !! [13:12:55] ? [13:12:55] where? [13:12:56] <_joe_> on cp4004 [13:13:11] maybe recdns is hurting somewhere [13:14:57] eh I donno [13:15:08] or puppet sucks at doing anything reliably [13:15:18] <_joe_> or both [13:15:48] kinda like when it gets a single 5xx for one of the 4,000 requests it makes to the master in a run and just barfs instead of retrying the request... 
[13:16:14] <_joe_> bblack: it's exactly the same scenario [13:17:03] well clearly we just had a hickup on the ulsfo - eqiad connection(s) [13:17:36] and ulsfo is dependent on that for recdns, it doesn't have its own servers [13:17:54] it might be nice to put one there, maybe on the bast host or something? [13:18:12] (just to be first in the local list, not to use from other sites) [13:19:40] <_joe_> mmmmh it was a network issue [13:19:53] (03PS3) 10Aude: Bump shared cache key for test.wikidata (memcached storage of items, etc.) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159948 [13:20:14] <_joe_> as on an esams server i find: Could not evaluate: end of file reached Could not retrieve file metadata for puppet:///modules/ganglia_new/upstart/ganglia-monitor.conf: end of file reached [13:21:41] <_joe_> ok I have one member of the italian community reporting he failed to load js and css for a couple of minutes [13:22:23] bblack: mmh I dunno, when we add codfw that should be enough [13:22:39] seems like puppet would be far more efficient if it compiled the catalog and all of the file/template resources and sent it in one chunk. the flow would basically be client: request plugins, server: send plugins, client: send facts, server: send everything the client needs, client: do work, report success. [13:23:09] <_joe_> bblack: no, the files themselves are served without ruby intervention, luckily enough [13:23:24] <_joe_> or the masters would die an enormous pain [13:23:30] <_joe_> *in [13:24:07] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [13:24:13] eh [13:24:27] fix ruby then, or don't use it. I'm not accepting that from puppet as an excuse [13:25:06] <_joe_> bblack: puppet sucks badly - it should use REST and HTTP headers to understand if a resource is changed [13:25:14] yeah [13:25:25] <_joe_> instead of transmitting the source, the metadata (in 10 useful http calls) [13:25:37] <_joe_> and some other useless noise. 
[13:25:49] or even in the current model, it could inform the client over the main connection that the following 427 files haven't changed since the last run [13:25:56] <_joe_> of course there is no retry and it will fail miserably if one of those fails [13:26:12] <_joe_> bblack: oh, VERSIONING [13:26:26] <_joe_> bblack: also, rsync :P [13:26:32] I think the more fundamental problem there is the client still wouldn't have enough info, it wants to diff locally after retrieval [13:26:38] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [13:27:17] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [13:27:22] we *could* do something crazy like rsync the files:// collection to each host and have them fetch those from themselves on disk :) [13:27:37] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:27:47] RECOVERY - puppet last run on amssq34 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [13:28:27] it should be <100M of data, and it doesn't change that rapidly [13:28:38] RECOVERY - puppet last run on amssq51 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [13:28:57] <_joe_> bblack: I thought about that in the past [13:28:57] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:29:16] (and use phases to have the rsync update happen before everything else on the client, via puppet itself?) [13:29:48] err not "phase", whatever it is puppet calls the meta-dependency grouping thing [13:30:05] stages [13:31:54] <_joe_> bblack: I don't think it's really feasible [13:33:50] is there no way via puppet.conf to say that puppet:///files/... comes from elsewhere, where elsewhere could be local? [13:34:16] e.g. a small lighty instance running on every host on an alternate port or something, so that it's still http? [13:35:32] bblack: not sure, but of course you can use puppet://host.name/files/.... [13:35:40] and we could standardize on a function for the host.name [13:35:50] it's just that we'd have to adapt /all/ manifests to use that function [13:41:31] (03PS1) 10Alexandros Kosiaris: Introduce mathoid LVS IP [dns] - 10https://gerrit.wikimedia.org/r/159996 [13:44:34] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [13:45:12] (03CR) 10Giuseppe Lavagetto: "Cherry-picked on beta, it works and does not interfere with puppet compilation. On monday I'll merge this and deploy it to labs as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/151869 (owner: 10Giuseppe Lavagetto) [13:58:52] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [14:10:59] (03PS2) 10Alexandros Kosiaris: Introduce mathoid LVS IP [dns] - 10https://gerrit.wikimedia.org/r/159996 [14:11:49] (03CR) 10Ottomata: [C: 032 V: 032] Adding gzip compression for several file types [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/159181 (owner: 10Nuria) [14:13:59] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [14:17:32] (03PS5) 10Alexandros Kosiaris: Introducing Service Cluster A, hosting mathoid [puppet] - 10https://gerrit.wikimedia.org/r/156576 (https://bugzilla.wikimedia.org/69990) (owner: 10Physikerwelt) [14:17:54] (03PS1) 10Nuria: Bumping up wikimetrics module [puppet] - 10https://gerrit.wikimedia.org/r/159998 [14:18:59] PROBLEM - puppet last run on virt0 is CRITICAL: CRITICAL: Epic puppet fail [14:19:25] (03CR) 10Ottomata: [C: 032 V: 032] Bumping up wikimetrics module [puppet] - 10https://gerrit.wikimedia.org/r/159998 (owner: 10Nuria) [14:19:28] (03CR) 10Giuseppe Lavagetto: [C: 031] Introduce mathoid LVS IP [dns] - 10https://gerrit.wikimedia.org/r/159996 (owner: 10Alexandros Kosiaris) [14:20:38] (03CR) 10Giuseppe Lavagetto: "I can't care less about Qualys grade, I care about functionality and ease of rollback." [puppet] - 10https://gerrit.wikimedia.org/r/159729 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [14:21:40] (03PS1) 10Dan-nl: Please add the domain *.scienceimage.csiro.au to the wgCopyUploadsDomains whitelist. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159999 (https://bugzilla.wikimedia.org/70771) [14:26:06] (03PS2) 10Giuseppe Lavagetto: salt: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/159463 (owner: 10Matanya) [14:26:47] (03CR) 10Giuseppe Lavagetto: [C: 032] salt: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/159463 (owner: 10Matanya) [14:28:39] bblack: mark: https://gerrit.wikimedia.org/r/156576 [14:30:40] akosiaris: I think it's missing the bit that actually puts mathoid on some LVS [14:30:43] yeah [14:30:46] (03PS2) 10Giuseppe Lavagetto: rsync: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/159462 (owner: 10Matanya) [14:30:47] and perhaps you should split it up [14:31:05] don't do everything in one huge commit [14:31:12] add the lvs service separately etc [14:32:07] the bit that's missing is the stanza in $lvs_services at the bottom of modules/lvs/manifests/configuration.pp [14:37:32] (03CR) 10JanZerebecki: "Rollback (as in exactly reverting the patch) works fine. The browser will pick up the shorter max-age again. I don't see any reason why th" [puppet] - 10https://gerrit.wikimedia.org/r/159729 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [14:37:47] _joe_: /win 14 [14:38:01] sorry, mistyped [14:38:55] <_joe_> jzerebecki: I'm pretty sure that if I set HSTS today at 1 year, the browsers will not degrade that. Or at least that was the supposed behaviour last I checked [14:39:28] RECOVERY - puppet last run on virt0 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:41:01] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "see also http://puppet-compiler.wmflabs.org/338/change/159462/html/labstore1001.eqiad.wmnet.html" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/159462 (owner: 10Matanya) [14:44:37] _joe_: never heard of anyone implementing it that way. 
that would be in violation of https://tools.ietf.org/html/rfc6797#section-5.3 which says: "Specifying a zero time duration signals the UA to delete the HSTS Policy" [14:45:07] and "Thus, UAs cache the "freshest" HSTS Policy information on behalf of an HSTS Host." [14:45:35] <_joe_> jzerebecki: no you kinda missed my point, and I expressed it badly [14:45:58] <_joe_> say we have a problem for which we have https down suddenly for a service [14:46:02] <_joe_> for say a week [14:46:34] <_joe_> in that scenario, you won't be able to recover connection from anyone [14:46:49] <_joe_> as you cannot reset it [14:46:57] yup [14:46:58] <_joe_> if people can't connect to https [14:47:07] you would need to pick a new hostname [14:47:28] which you already need to do if you do not want to wait for a week [14:47:30] and get everyone to use it [14:48:05] <_joe_> I'm not a fan of HSTS in general I have to say [14:48:26] <_joe_> but 1 year seemed to me like something foolish [14:48:29] making people use http on gerrit seems like a bad idea, more so if they are not notified of it beforehand [14:49:28] <_joe_> say $some_country bans https; the answer to people in those countries will be "clear your cache". It may be acceptable. I dunno [14:49:54] "clear cache" probably doesn't clear HSTS, it would probably have to be some other button, which may not exist yet [14:49:56] <_joe_> (I think gerrit should simply not accept connections in plain http anyway) [14:50:44] <^d> gerrit just redirects http traffic to https. [14:51:13] What I don't get is the need for HSTS policy to begin with. Could browsers not just (a) assume that resources reachable once via HTTPS should be used as HTTPS-only from there forward, and then (b) if downgrade seems necessary for connectivity because https isn't reachable, just throw up a big warning to the user that something fishy is going on and let them decide? [14:51:18] <_joe_> ^d: HSTS should protect you from MITM if you accidentally type it without HTTPS [14:51:43] <_joe_> s/it/the url/ [14:52:10] ^d: the redirect itself is problematic, as the redirect is easy to hijack [14:52:24] <^d> Someone needs a better hobby than mitm gerrit :) [14:52:28] <_joe_> bblack: exactly what HSTS tries to solve [14:52:32] poorly [14:52:49] <_joe_> ^d: well, one could inject a backdoor in our hhvm sources [14:52:54] !log rebuilding enwiki's Cirrus index for more performance testing. Please be faster now. k? [14:52:59] Logged the message, Master [14:53:06] <_joe_> just the first thing that comes out of my mind [14:53:16] <^d> _joe_: They could? We push over ssh. [14:53:17] <_joe_> bblack: I agree, poorly [14:53:30] <_joe_> ^d: if someone fetches via http [14:53:31] bblack: how is that doing it poorly? [14:53:51] <_joe_> jzerebecki: it's the same problem you have with http caches [14:54:06] we didn't need a new protocol. how much harder would it have been for browsers to implement sane policy locally instead of implementing HSTS protocol? [14:54:07] <_joe_> you'd like them to be very long, but that tends to creep functionality [14:54:12] <_joe_> and operations agility [14:54:18] <_joe_> bblack: +1 [14:54:36] browsers could default to https-only mode with a heavy warning for fallback for any site that ever redirects them to http once. 
err, to https once [14:55:02] <_joe_> bbl [14:55:21] <_joe_> jzerebecki: If everybody agrees that hsts=1yr is good, go on [14:55:30] bblack: that would break all sites that use some https but redirect back to http or break on certain urls, of which there are many [14:55:44] which is broken security anyways, and should result in warning to the user [14:55:59] either way the user gets a warning when they're using broken security, and not when they're not [14:56:27] it's like an HSTS=infinity without the potential future breakage implications, and with more pressure on sites to upgrade, and more user awareness [14:56:27] but 95% of web security is broken [14:56:48] so nobody would use web browsers that implemented that [14:57:47] improvements need to be in small steps with almost full backwards compatibility otherwise nobody uses them [14:57:53] the alternative is scary from our end. we're setting a policy that says "we'd like to tie our hands for a period of one year from making certain changes, even though we have no idea what the future will bring or how the strictness of those policies will evolve on the browser side" [14:58:41] we don't even put 1 year TTLs in DNS. :p [14:58:42] the only thing that we can not do is switch to http only [14:59:21] TTLs are something different as one does not need to do a refresh until they expire [14:59:24] no, the only thing we cannot do is use an http URL for that hostname, ever. Really it's more than that, a non-HTTPS URL. What if we like an alternate method of security later and HSTS isn't interpreted to include it? [14:59:34] HSTS refreshes on every connection [15:00:59] not all users are going to refresh, though. we can't say that just because it refreshes it's not a TTL [15:00:59] sure we can do that: just offer https with an sts header with max-age=0 which fully disables sts for any connection afterwards [15:01:25] if you publish a 1yr TTL on HSTS, you can't change for a year. You can't assume everyone's refreshed the policy recently and then downgrade the TTL arbitrarily and make a change in a short window [15:01:33] as long as we still also offer https we can still change the sts header any way we like [15:02:12] which means sts for 1 year equals not being able to switch off https for one year [15:02:32] what about the guy who visits en.wp.o every day but only hits wikiquote.org once every 6 months? We publish 1yr HSTS on both, then we decide we need to make a change. We can't downgrade to 1week and then make a change a week later. Our hands are tied for 1 full year from the decision point. 
[15:06:42] and https infrastructure is pretty broken to begin with re: CA trust [15:07:08] sure, next step is certificate pinning :) [15:07:36] which is just putting lipstick on users accepting self-signed certs :) [15:08:37] fundamentally all of my opposition to a long HSTS is that *anything* where we're making a commitment to anything that far in the future is a potential source of problems. The future is unpredictable. [15:09:19] it is but if https is blocked for enough users of gerrit, we have a much bigger problem [15:09:27] (03PS1) 10Mark Bergsma: Replace role::cache::ssl if statement with a selector [puppet] - 10https://gerrit.wikimedia.org/r/160001 [15:09:30] (and also that it doesn't eliminate the attack surface anyways, it just reduces probabilities. over the average of millions of users, it may reduce http-redirect hits by several orders of magnitude, but they're still going to hit it regularly on new browser/device installs and HSTS expiries) [15:10:19] turning off http completely would be a more-complete solution if it were feasible. [15:10:21] with certificate pinning it would be as proken as ssh security [15:10:34] (03PS2) 10Mark Bergsma: Replace role::cache::ssl if statement with a selector [puppet] - 10https://gerrit.wikimedia.org/r/160001 [15:11:53] if we were to switch to http only on gerrit that would be nearly as bad as switching to telnet for production shell access [15:12:07] oh I agree [15:12:38] in the case of gerrit, since it's not a public service and doesn't have non-technical users, I don't see why we don't completely shut down http access and consider all http:// URLs invalid to begin with. [15:12:58] I'm much more concerned about HSTS for the truly public-facing stuff [15:13:27] this downside we are talking about it is like the downside of using gpg withouth giving someone else your unencrypted private key: if you loose the passphrase you loos your encrypted data. [15:14:04] yea i agree regarding HSTS for things like wikipedia.org [15:14:37] in any case 1yr seems rather arbitrary. [15:14:45] if there are no future-facing issues, why not 100y? [15:15:00] if we can't make a decision without a year's notice we basically can't make a decision [15:15:09] (03CR) 10Mark Bergsma: [C: 032] Replace role::cache::ssl if statement with a selector [puppet] - 10https://gerrit.wikimedia.org/r/160001 (owner: 10Mark Bergsma) [15:15:58] has anyone looked at the percentage reduction in attack surface from http:// hits on various HSTS lengths and tried to figure out where the limit of pragmatism is? 
[15:16:04] yes though it is not that arbitrary, it is more than firefox and chrome require for inclusion in their HSTS preloaded list [15:16:17] (probably related to the average rate at which users install new browsers/OS's/phones/etc) [15:17:06] chrome's requirement is eighteen weeks [15:18:35] others have higher requirements for "proper" sts, so we just picked the biggest one, because we thought anything beyond that is already, as you said, "we can't make a decision" [15:19:43] not aware of any public statistical justification for those numbers [15:22:44] I know it's not directly comparable, but I keep thinking in terms of DNS TTL [15:23:19] we limit those (everyone does) because the future is so uncertain, and long TTLs can really tie your hands [15:23:47] even the NS list for the root servers publishes with a 6 day TTL [15:24:01] (although it's much longer in practice because people embed that stuff in software for bootstrapping) [15:24:53] 2 days for the .com server TTLs, which aren't embedded anywhere [15:26:06] ah 41 days for the actual addresses on the rootservers, that's a little better [15:26:12] yes it only makes sense because it is basically a flag to protect against MITM [15:26:34] or against any sniffing of content [15:27:14] well I get that, it's a powerful mechanism for security (against I guess URL typos as http, and old links/bookmarks that are http on an https site) [15:27:23] (and search results) [15:28:31] and reversing on myself, the really legitimate future concern about anti-https firewalls... it's not that great a burden to ask that users in those countries recognize what happened explicitly and do something to clear HSTS in their browsers before they can reach us again [15:32:46] i wonder if offering a different domain like unsecure-wikimedia.org for countries that currently do that might be a way to make STS a viable thing in the long run on things like wikipedia.org [15:34:22] on a totally different topic, does anyone have a clear understanding of how gerrit topics relate to real git? apparently pushing from a local differently-named branch does nothing. [15:34:41] there's git-review -t, but I generally don't start my commits in a git-review branch [15:34:57] gosh these git modules are confusing [15:35:01] what's the workflow for "I want to generate a set of N patches in a named topic branch in gerrit, starting on the commandline) [15:35:07] oops s/)/"/ [15:35:17] i'm working on the nginx module, this confused me a bit: [15:35:18] Changes not staged for commit: [15:35:18] (use "git add <file>..." to update what will be committed) [15:35:18] (use "git checkout -- <file>..." to discard changes in working directory) [15:35:18] modified: manifests/site.pp [15:35:30] manifests/site.pp means something else in my mind ;) [15:35:35] heh [15:36:05] bblack: git review -t mytopic does exactly that [15:36:29] _joe_: did you, Tim and Sean figure out what was going on last night? I just peered in before bed but since you two were awake I didn't look closely [15:36:48] jzerebecki: starting from a regular checkout of the production branch, can I fork a local branch, then do git-review -t, then make several commits, then push to refs/for/topicname, or? [15:37:02] I just don't get the workflow when starting with that intention [15:37:22] (well or git-review instead of push to refs/for I guess) [15:37:33] does git-review -t work on a local branch created without git-review?
[15:37:39] yes [15:37:55] if you use -t it has no relation to the local branch [15:38:12] it just sets the topic in gerrit [15:38:51] maybe I should try this with some smaller and less complicated changes first :) [15:39:02] <^d> bblack: refs/for/<branch>/<topic> [15:39:13] ahhhh! [15:39:41] ^d: so for the initial push of a local branch of say 10 related commits, I could just "git push HEAD:refs/for/production/topic" and get a topic branch without touching git-review yet? [15:40:19] or do I still need the git-review -t as well? [15:40:48] the fundamental issue here seems to be that I generally don't grok what "git-review" actually does to the local repo and the push commands [15:40:56] <^d> Yeah, you can do that. [15:40:59] <^d> I never use git-review. [15:41:01] <^d> It's so bad :p [15:41:09] I only resort to using git-review when I have to, to download someone else's changeset and amend it or whatever [15:41:18] and then I'm like, in magic-land and I don't know what's really going on [15:41:33] <^d> That's what the copy+pastable cherry-pick/checkout links on gerrit are for :) [15:41:43] yep, no need for git review at all [15:41:47] although it's quite handy [15:42:40] ok thanks guys :) [15:43:17] (03PS2) 10Mark Bergsma: Handle ensure == absent in nginx::site [puppet/nginx] - 10https://gerrit.wikimedia.org/r/160004 [15:43:18] yes if you know the ref to push to you don't need it to even amend a gerrit change [15:44:49] yeah refs/for/<branch>/<topic> seems to be a well-kept secret [15:44:54] took me a while before I figured that one out [15:45:19] <^d> It's completely documented ;-) [15:45:34] <_joe_> greg-g: on tin, it was simply topping out its network capabilities [15:45:59] (03PS3) 10Mark Bergsma: Handle ensure == absent in nginx::site [puppet/nginx] - 10https://gerrit.wikimedia.org/r/160004 [15:46:06] stupid puppet aligned arrows [15:48:30] _joe_: potentially related? https://bugzilla.wikimedia.org/show_bug.cgi?id=70752 [15:48:33] now that you say it is documented, half of https://gerrit.wikimedia.org/r/Documentation/user-upload.html is totally new to me, like %r=a@b.com in the ref for adding a reviewer [15:48:40] _joe_: (I'm trying to line up timestamps now) [15:49:00] <_joe_> greg-g: not at all [15:49:38] <_joe_> greg-g: the timeframe is the same [15:50:03] <_joe_> and probably the origin of both is the same (a large scap sync) [15:50:58] right, that's what I'm thinking: somehow this big scap made things worse than normal [15:51:03] mark: too bad we can't easily measure how much time we lose aligning arrows [15:51:19] * aude thinks localisation update and scap at same time [15:51:30] aude: probably a good guess [15:51:32] i think it finished 1.24wmf20 [15:51:36] but then not the other [15:51:59] didn't realize it was not complete [15:52:51] (03PS6) 10Alexandros Kosiaris: Introducing Service Cluster A, hosting mathoid [puppet] - 10https://gerrit.wikimedia.org/r/156576 (https://bugzilla.wikimedia.org/69990) (owner: 10Physikerwelt) [15:52:52] godog: I use https://github.com/vim-scripts/Align . visual select + :Align => = profit [15:53:13] (03CR) 10Mark Bergsma: "The Exim configuration is wrong, the notion of 'routers' and 'transports' are confused. More comments inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/155753 (owner: 1001tonythomas) [15:54:09] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Gonna split it to 2 commit's per Mark's suggestion on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/156576 (https://bugzilla.wikimedia.org/69990) (owner: 10Physikerwelt) [15:54:21] bd808: oh! that's nice, going to try it!
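A sketch of the push ^d and jzerebecki describe above; the remote, branch, and topic names are illustrative, and the subprocess wrapper is only there to make the example runnable:

    import subprocess

    # Push the commits on the current local branch for review against the
    # remote "production" branch, grouped under the gerrit topic "my-topic".
    # Equivalent to running: git push origin HEAD:refs/for/production/my-topic
    subprocess.check_call(
        ["git", "push", "origin", "HEAD:refs/for/production/my-topic"])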
[15:55:17] godog: i only do it when I'm in a good mood [15:57:05] * csteipp is late to the conversation, but bblack, I'd be happy to set up a sslstrip demo next time you're in the office :) [15:58:00] mark: jenkins won't be amused! [15:58:20] it's either jenkins or me [15:58:59] csteipp: in the office in the office [15:59:00] that's too easy [15:59:15] when working from home is the only way not to compromise the infrastructure :) [16:01:32] * mark logs in to jenkins for the very first time [16:04:37] mark: Yeah, but forwarding an entire city's routing to my laptop just to demo it for him seemed like overkill. Unless I can borrow a few servers from you... [16:04:59] call it a challenge ;) [16:08:16] can someone review https://gerrit.wikimedia.org/r/#/c/160004/3 perhaps? [16:10:14] (03PS1) 10Mark Bergsma: Use the protoproxy::localssl $name as nginx site config name [puppet] - 10https://gerrit.wikimedia.org/r/160009 [16:10:50] (03CR) 10jenkins-bot: [V: 04-1] Use the protoproxy::localssl $name as nginx site config name [puppet] - 10https://gerrit.wikimedia.org/r/160009 (owner: 10Mark Bergsma) [16:11:25] stupid puppet commas [16:11:29] (03PS37) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - 10https://gerrit.wikimedia.org/r/155753 [16:11:35] (03PS2) 10Mark Bergsma: Use the protoproxy::localssl $name as nginx site config name [puppet] - 10https://gerrit.wikimedia.org/r/160009 [16:11:50] manifests/site.pp ? [16:12:03] gerrit has mixed up something or is it me ? [16:12:05] confusing isn't it [16:12:06] no [16:12:11] (03PS38) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - 10https://gerrit.wikimedia.org/r/155753 [16:12:11] i complained about that above [16:12:16] nginx is a separate git repo, a sub module [16:12:22] that's just nginx::site [16:12:53] I expected it to bite at some point [16:12:57] that and apache::site [16:16:12] (03Abandoned) 10BBlack: LVS/Protoproxy cleanup [puppet] - 10https://gerrit.wikimedia.org/r/157978 (owner: 10BBlack) [16:17:24] (03PS1) 10BBlack: Flip ed1a::0 and ed1a::1 in protoproxy [puppet] - 10https://gerrit.wikimedia.org/r/160010 [16:17:26] (03PS1) 10BBlack: Sanitize text-related addrs for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/160011 [16:17:28] (03PS1) 10BBlack: remove old mobile/bits addrs in eqiad+esams [puppet] - 10https://gerrit.wikimedia.org/r/160012 [16:17:30] (03PS1) 10BBlack: remove dead esams donatelbsecure [puppet] - 10https://gerrit.wikimedia.org/r/160013 [16:17:32] (03PS1) 10BBlack: add textsvc/uploadsvc in ulsfo for consistency [puppet] - 10https://gerrit.wikimedia.org/r/160014 [16:17:34] (03PS1) 10BBlack: Remove dead addrs from protoproxies [puppet] - 10https://gerrit.wikimedia.org/r/160015 [16:17:36] (03PS1) 10BBlack: add login-lb to eqiad protoproxy [puppet] - 10https://gerrit.wikimedia.org/r/160016 [16:17:38] (03PS1) 10BBlack: Remove dead protoproxy entries completely [puppet] - 10https://gerrit.wikimedia.org/r/160017 [16:18:10] nice, the topic branch thing worked :) [16:19:49] (03CR) 10Alexandros Kosiaris: [C: 032] "A step to the right direction. apache::site (well more correctly, apache::conf) has a few other nice ideas (input validation, standard ext" [puppet/nginx] - 10https://gerrit.wikimedia.org/r/160004 (owner: 10Mark Bergsma) [16:20:53] argh [16:20:59] puppet-merge doesn't work for submodules? [16:21:09] how am I supposed to merge that now? [16:21:14] aude: so, what time on Monday do you want to do the wikidata deploy?
mark: it's supposed to, I think [16:21:31] perhaps I checked it out wrong [16:21:34] aude: you or whoever :) [16:23:14] * mark reads [16:23:32] greg-g: maybe before swat or after? [16:23:40] need enough time to run scap [16:25:33] so, I'll start pushing out those topic changes on monday, I'm still not going to mess with it today: https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:lvs-protoproxy-cleanup,n,z [16:27:06] greg-g: signed up for before swat [16:27:47] (03PS39) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - 10https://gerrit.wikimedia.org/r/155753 [16:31:26] heya manybubbles [16:31:36] do you have experience using python and urllib2 to send JSON as POST? [16:31:41] i'm trying now, getting close but haven't quite got it [16:32:07] ok i see i screwed it up by editing the submodule under operations/puppet/modules/ directly [16:33:49] aude: cool [16:35:52] (03PS3) 10Mark Bergsma: Use the protoproxy::localssl $name as nginx site config name [puppet] - 10https://gerrit.wikimedia.org/r/160009 [16:37:29] mark: looks like I get this error from the labs instance now on $ exim status : option "mw" unknown [16:38:07] due to our naming convention ? [16:38:22] mw-verp-api ? [16:38:30] (03PS4) 10Mark Bergsma: Use the protoproxy::localssl $name as nginx site config name [puppet] - 10https://gerrit.wikimedia.org/r/160009 [16:38:32] (03PS1) 10Mark Bergsma: Update nginx sub module [puppet] - 10https://gerrit.wikimedia.org/r/160020 [16:39:39] (03CR) 10Mark Bergsma: [C: 032] Update nginx sub module [puppet] - 10https://gerrit.wikimedia.org/r/160020 (owner: 10Mark Bergsma) [16:39:41] ok, manybubbles, i've got something kinda working using python requests, but now i'm getting a query parse error, lemme know when you are back [16:43:39] <^d> ottomata: I'm about if you want me to have a look. Pastebin? [16:45:46] bblack: I think I'll merge this change now: [16:45:47] https://gerrit.wikimedia.org/r/#/c/160009/4 [16:45:55] should only affect 1) all ulsfo caches and 2) misc-web [16:46:05] but afaict nginx won't get restarted, so I can go over them manually [16:46:34] mark: ok [16:47:24] (03CR) 10Mark Bergsma: [C: 032] Use the protoproxy::localssl $name as nginx site config name [puppet] - 10https://gerrit.wikimedia.org/r/160009 (owner: 10Mark Bergsma) [16:47:51] (03PS40) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - 10https://gerrit.wikimedia.org/r/155753 [16:48:32] in theory converting the other sites to localssl isn't hard, esp if we turn it on in the varnish boxes first and then disable protoproxy afterwards. But I still want to just clean up the existing defs before I do it. Everything is too complicated by confusion right now for me :) [16:50:29] ok [16:50:36] i'm doing this mainly to help chase for his phabricator work now [16:51:16] ok ^d [16:52:46] one sec [16:52:51] mark: unrelated, but any thoughts on recursive DNS service IP? I mean technically it doesn't have to be in any of the 4x public subnets since it's private. the current one in eqiad happens to land in Mobile & Zero subnet.
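For ottomata's urllib2 question above, a minimal sketch of POSTing JSON from Python with the requests library; the endpoint and query are illustrative:

    import json
    import requests

    # Hypothetical Elasticsearch endpoint and a trivial query body.
    url = "http://elastic1016:9200/enwiki_content/_search"
    payload = {"query": {"match_all": {}}}
    resp = requests.post(url, data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
    print(resp.status_code, resp.json())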
[16:53:01] we could place it elsewhere in codfw and not waste that space [16:53:44] or put in the misc public subnet [16:53:59] there's not really another specific block allocated that makes sense for it [16:54:16] well it shouldn't overlap with any subnet anyway [16:54:23] but yeah it could be private [16:54:32] i'd prefer it not to be yet though [16:54:39] esams has only one link to the US still [16:54:49] so over the internet is a bit more redundant [16:55:30] so, place it in misc public subnet? [16:55:41] (the one with e.g. parsoid/stream) [16:55:56] yeah I think so [16:56:01] we can move it out later if we need to [16:56:07] although it's a bit of a pain for anything not managed by puppet [16:56:08] yeah true [16:56:09] ^d: https://gist.github.com/ottomata/a4bf7159bfa688682e94 [16:56:17] routers etc ;) [16:56:28] "anything not managed by puppet" is a pain by definition [16:56:44] error is at the bottom [16:56:45] someone will revisit this in like 5 years and change it and life will go on :) [16:57:08] <^d> ottomata: Ew :\ [16:57:13] <^d> Lemme have a look in ES logs [16:57:20] oo ok [16:57:49] hm, maybe it is the fact that I have part of my query in GET params? e.g. the _only_node preference [16:57:53] going to try to put that into the post [16:58:02] (can I do that?) [16:59:47] <^d> org.elasticsearch.search.SearchParseException: [enwiki_content_1407944746][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [_na_]] [17:01:51] hm, wait i have more of an error this time, but it looks the same [17:02:09] ^d: [17:02:09] http://etherpad.wikimedia.org/p/elastic-single-node-testing [17:02:10] at bottom [17:02:20] ah [17:02:29] i think that's because I took the _only_node out of GET and put it in POST [17:02:32] looks like that didn't work [17:03:16] <^d> Weird it would make a difference. [17:03:58] ottomata: why are there two '{'s and '}'s everywhere where I expect just one? [17:04:17] (03PS1) 10RobH: setting mgmt ip for labcontrol2001 [dns] - 10https://gerrit.wikimedia.org/r/160025 [17:04:19] python string fromat [17:04:21] format [17:04:31] expects things like {0} [17:04:32] bleh [17:04:36] it will escape them if you use {{ and }} [17:04:52] can you print the query that is actually being sent? [17:04:55] ja [17:04:56] <^d> {{}} looks like wikitext :p [17:05:06] but it's escaped { [17:05:20] if you print it and curl it manually you'll get more luck I think [17:06:02] manybubbles: http://www.codeshare.io/aITwc [17:06:12] hm k will try that [17:06:36] (03CR) 10RobH: [C: 032] setting mgmt ip for labcontrol2001 [dns] - 10https://gerrit.wikimedia.org/r/160025 (owner: 10RobH) [17:07:40] manybubbles: [17:07:44] {"error":"SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[fDeoGVcgSbmV88k-Nd4Kpg][enwiki_content_1407944746][0]: GroovyScriptExecutionException[MissingPropertyException[No such property: incoming_links for class: Script21]]}]","status":400} [17:07:46] got that when curling [17:08:18] unlight it please in the editor so I can read it [17:09:12] ottomata: you need quotes around incoming_links on line 39 [17:09:30] hm [17:09:49] hmmm [17:10:10] I see them here: https://en.wikipedia.org/wiki/Special:Search/foo%20bar%20baz?cirrusDumpQuery=yes [17:10:22] so I imagine they got thrown out somehow [17:13:46] getting cloooser [17:14:09] i got a curl to work [17:19:06] Ooo got it! thanks guys [17:22:00] <^d> Cannot parse 'çfilkjkllçnşxlslsslslsld .& & & & &. &&': Encountered "" [17:22:08] <^d> Why do people bang their keyboards and search? [17:23:30] for our amusement
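A small sketch of the two pitfalls hit above, with illustrative values: str.format treats { and } as placeholders, so literal JSON braces must be doubled, and the _only_node routing hint belongs in the query string rather than the POST body:

    import requests

    # Literal JSON braces in a .format template must be doubled as {{ and }}.
    template = '{{"query": {{"match": {{"title": "{0}"}}}}}}'
    body = template.format("foo bar baz")
    print(body)  # {"query": {"match": {"title": "foo bar baz"}}}

    # The node id is the one from the error above, purely for illustration.
    resp = requests.post(
        "http://elastic1016:9200/enwiki_content/_search",
        params={"preference": "_only_node:fDeoGVcgSbmV88k-Nd4Kpg"},
        data=body,
        headers={"Content-Type": "application/json"})
    print(resp.status_code)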
[17:23:34] (03PS1) 10Ori.livneh: Rename some constants to clarify their meaning and purpose [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160029 [17:23:53] ^d: çfilkjkllçnşxlslsslslsld is an active volcano in southeastern turkey [17:24:19] or at least it should be [17:24:44] <^d> You forgot the .& & & & &. && [17:26:20] that looks like valid ascii art [17:26:37] also, maybe it wasn't people [17:27:55] (03PS1) 10Mark Bergsma: Add protoproxy::localssl default_server and server_name parameters [puppet] - 10https://gerrit.wikimedia.org/r/160030 [17:33:51] (03PS2) 10Ori.livneh: Rename some constants to clarify their meaning and purpose [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160029 [17:34:07] ^d: ^ sanity-check? [17:34:21] <^d> sure [17:35:04] (03PS1) 10BBlack: Remove unused dns_auth LVS defs [puppet] - 10https://gerrit.wikimedia.org/r/160032 [17:35:54] (03CR) 10BBlack: [C: 032] Remove unused dns_auth LVS defs [puppet] - 10https://gerrit.wikimedia.org/r/160032 (owner: 10BBlack) [17:38:11] ok, manybubbles, ^d, got something working: https://gist.github.com/ottomata/fb03fd03267aa0eb1767 [17:38:14] check it out if you like [17:39:48] this one hits sda too! :) http://graphite.wikimedia.org/render/?width=1057&height=510&_salt=1410459461.018&from=-3hours&target=secondYAxis(servers.elastic1016.iostat.sda.iops.value)&target=secondYAxis(servers.elastic1016.iostat.sdb.iops.value)&target=servers.elastic1016.iostat.sdb.reads_byte.value&target=servers.elastic1016.iostat.sda.reads_byte.value [17:40:04] manybubbles: where do I look to see if this causes production problems? [17:40:12] should I tail logs on elastic1016? [17:40:13] <^d> ori: Are you wanting to deploy now or during swat or ??? [17:40:58] ^d: i was thinking now. do you think it's worrisome? [17:41:12] <^d> No, just wondering if I should +1 or +2 [17:41:44] ottomata: nah - check for anything cirrus related in the right logs. fenari's /home/wikipedia/syslog/apache.log should be good. or logstash [17:41:45] +2 then [17:42:56] (03CR) 10Chad: [C: 032] Rename some constants to clarify their meaning and purpose [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160029 (owner: 10Ori.livneh) [17:43:01] (03Merged) 10jenkins-bot: Rename some constants to clarify their meaning and purpose [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160029 (owner: 10Ori.livneh) [17:43:08] thanks! [17:43:24] !log ori updated /a/common to {{Gerrit|I4e4187285}}: Rename some constants to clarify their meaning and purpose [17:43:29] Logged the message, Master [17:44:27] (03CR) 10Rush: [C: 031] "seems good" [puppet] - 10https://gerrit.wikimedia.org/r/160030 (owner: 10Mark Bergsma) [17:46:35] * YuviPanda pokes springle [17:46:35] around? [17:46:51] oh [17:46:53] probably not [17:46:57] 3AM in Australia.
heh [17:49:46] springle: when you're back: I'm wondering if there's an easy way to check replag that I can get into graphite [17:50:04] (03PS2) 10Mark Bergsma: Add protoproxy::localssl default_server and server_name parameters [puppet] - 10https://gerrit.wikimedia.org/r/160030 [17:51:32] (03CR) 10Mark Bergsma: [C: 032] Add protoproxy::localssl default_server and server_name parameters [puppet] - 10https://gerrit.wikimedia.org/r/160030 (owner: 10Mark Bergsma) [17:52:13] YuviPanda: the standard way to check replag is to issue 'SHOW SLAVE STATUS' and use the 'Seconds_Behind_Master' row [17:52:19] right [17:52:23] YuviPanda: dunno if that's what you mean though :_) [17:52:24] I suppose normal users can't access that [17:52:38] so would need to create a special user, manage passwords, and then set up a mysql monitor diamond handler [17:53:10] YuviPanda: yea apparently you need the Replication_Client privilege [17:53:16] yeah [17:53:26] should be fine, I think. super limited, and the passwords can go in the private repo [17:58:48] (03PS1) 10BBlack: add DNS for codfw dns-rec-lb [dns] - 10https://gerrit.wikimedia.org/r/160036 [17:58:55] (03PS1) 10BBlack: define codfw recursive DNS [puppet] - 10https://gerrit.wikimedia.org/r/160037 [17:59:10] (03PS2) 10BBlack: add DNS for codfw dns-rec-lb [dns] - 10https://gerrit.wikimedia.org/r/160036 [18:00:05] (03CR) 10BBlack: [C: 032] add DNS for codfw dns-rec-lb [dns] - 10https://gerrit.wikimedia.org/r/160036 (owner: 10BBlack) [18:00:19] (03PS2) 10BBlack: define codfw recursive DNS [puppet] - 10https://gerrit.wikimedia.org/r/160037 [18:01:02] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [18:01:27] (03CR) 10BBlack: [C: 032] define codfw recursive DNS [puppet] - 10https://gerrit.wikimedia.org/r/160037 (owner: 10BBlack) [18:02:44] ottomata: high sys load but otherwise no io load [18:03:12] are you looping the same few queries perchance? [18:04:04] i don't think so, i'm running an hour of queries from 09-09, and i just selected a new hour that I hadn't run before [18:05:03] there were a couple of bumps in cpu_wio, but nothing lasting [18:05:04] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=&vl=&x=&n=&hreg[]=elastic101%5B26%5D&mreg[]=cpu_wio&gtype=line&glegend=show&aggregate=1&embed=1&_=1410544996019 [18:05:18] sda is reading a bunch [18:05:27] sda and sdb, sorry [18:05:33] http://graphite.wikimedia.org/render/?width=1057&height=510&_salt=1410459461.018&from=-3hours&target=secondYAxis(servers.elastic1016.iostat.sda.iops.value)&target=secondYAxis(servers.elastic1016.iostat.sdb.iops.value)&target=servers.elastic1016.iostat.sdb.reads_byte.value&target=servers.elastic1016.iostat.sda.reads_byte.value [18:05:38] but reads have gone down over the last few minutes...
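A sketch of the replag check discussed above, with illustrative host names and credentials; it assumes a MySQL user granted Replication_Client, the pymysql library, and graphite's plaintext listener on its conventional port 2003:

    import socket
    import time

    import pymysql

    conn = pymysql.connect(host="db1001", user="repl_watcher",
                           password="(from the private repo)",
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
    lag = row["Seconds_Behind_Master"]  # None if replication is stopped

    if lag is not None:
        # graphite's plaintext protocol: "<metric> <value> <timestamp>\n"
        msg = "mysql.db1001.replag %d %d\n" % (lag, time.time())
        sock = socket.create_connection(("graphite.wikimedia.org", 2003))
        sock.sendall(msg.encode())
        sock.close()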
[18:11:22] (03PS1) 10Ori.livneh: Rename MW_COMMON{_SOURCE} vars [puppet] - 10https://gerrit.wikimedia.org/r/160046 [18:12:05] (03PS1) 10Mark Bergsma: Replace role::cache::ssl::wikimedia with an SNI capable role::cache::ssl::misc [puppet] - 10https://gerrit.wikimedia.org/r/160047 [18:14:12] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:14:53] (03PS2) 10Mark Bergsma: Replace role::cache::ssl::wikimedia with an SNI capable role::cache::ssl::misc [puppet] - 10https://gerrit.wikimedia.org/r/160047 [18:16:19] (03Abandoned) 10Ori.livneh: Re-implement apache::mod_conf as custom type [puppet] - 10https://gerrit.wikimedia.org/r/151652 (owner: 10Ori.livneh) [18:16:25] (03Abandoned) 10Ori.livneh: HHVM: warm up the JIT by making web requests in Upstart post-start [puppet] - 10https://gerrit.wikimedia.org/r/150992 (owner: 10Ori.livneh) [18:17:04] (03Abandoned) 10Ori.livneh: Fix-ups for I3d002968c [puppet] - 10https://gerrit.wikimedia.org/r/154027 (owner: 10Ori.livneh) [18:17:10] (03Abandoned) 10Ori.livneh: HHVM: Increase Eval.PCRETableSize [puppet] - 10https://gerrit.wikimedia.org/r/157290 (owner: 10Ori.livneh) [18:17:54] <^d> chasemp: Can fab.wm.o be an alias to phabricator.wm.o? Less characters to type :) [18:18:19] (03PS1) 10RobH: setting production ip for ms-be2001-2012 [dns] - 10https://gerrit.wikimedia.org/r/160051 [18:19:52] why f?ab [18:19:55] phap, ja? [18:20:02] phab* [18:21:48] (03PS1) 10Mark Bergsma: Cleanup: remove old 'localssl' nginx site [puppet] - 10https://gerrit.wikimedia.org/r/160052 [18:21:49] <^d> phab is cool :) [18:22:19] (03CR) 10RobH: [C: 032] setting production ip for ms-be2001-2012 [dns] - 10https://gerrit.wikimedia.org/r/160051 (owner: 10RobH) [18:22:44] <^d> ottomata: I just think an alias would be phantastic. [18:22:50] OH DAMN [18:22:54] i forgot to align the arrows [18:23:23] making multiple patchsets for cosmetic changes, that's my territory! [18:24:27] (03PS2) 10Mark Bergsma: Cleanup: remove old 'localssl' nginx site [puppet] - 10https://gerrit.wikimedia.org/r/160052 [18:24:29] (03PS3) 10Mark Bergsma: Replace role::cache::ssl::wikimedia with an SNI capable role::cache::ssl::misc [puppet] - 10https://gerrit.wikimedia.org/r/160047 [18:27:14] (03CR) 10Rush: [C: 031] Replace role::cache::ssl::wikimedia with an SNI capable role::cache::ssl::misc [puppet] - 10https://gerrit.wikimedia.org/r/160047 (owner: 10Mark Bergsma) [18:27:46] (03CR) 10Ori.livneh: [C: 032] Rename MW_COMMON{_SOURCE} vars [puppet] - 10https://gerrit.wikimedia.org/r/160046 (owner: 10Ori.livneh) [18:27:56] (03PS4) 10Rush: Replace role::cache::ssl::wikimedia with an SNI capable role::cache::ssl::misc [puppet] - 10https://gerrit.wikimedia.org/r/160047 (owner: 10Mark Bergsma) [18:28:05] (03CR) 10Rush: [C: 032 V: 032] "let's go!"
[puppet] - 10https://gerrit.wikimedia.org/r/160047 (owner: 10Mark Bergsma) [18:29:57] (03CR) 10Giuseppe Lavagetto: [C: 031] rcstream: add a 5s post-stop sleep delay [puppet] - 10https://gerrit.wikimedia.org/r/146001 (owner: 10Ori.livneh) [18:31:38] (03PS3) 10Rush: Cleanup: remove old 'localssl' nginx site [puppet] - 10https://gerrit.wikimedia.org/r/160052 (owner: 10Mark Bergsma) [18:31:45] (03CR) 10Rush: [C: 032 V: 032] Cleanup: remove old 'localssl' nginx site [puppet] - 10https://gerrit.wikimedia.org/r/160052 (owner: 10Mark Bergsma) [18:32:01] PROBLEM - puppet last run on cp1043 is CRITICAL: CRITICAL: Epic puppet fail [18:32:22] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: Epic puppet fail [18:39:25] (03PS1) 10Ori.livneh: decom /etc/cluster file on app servers [puppet] - 10https://gerrit.wikimedia.org/r/160056 [18:41:06] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [18:44:54] (03CR) 10Gage: [C: 032] don't filter out stack traces, don't break hdfs [puppet/cdh] - 10https://gerrit.wikimedia.org/r/159768 (owner: 10Gage) [18:50:48] (03PS1) 10Gage: merge regression fix & re-enable GELF logging [puppet] - 10https://gerrit.wikimedia.org/r/160061 [18:54:33] (03PS1) 10Dzahn: ssl certs - configure CA for wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/160063 [18:54:37] RECOVERY - puppet last run on cp1044 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:55:56] (03CR) 10Rush: [C: 031] ssl certs - configure CA for wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/160063 (owner: 10Dzahn) [18:55:59] (03CR) 10Ottomata: [C: 032] merge regression fix & re-enable GELF logging [puppet] - 10https://gerrit.wikimedia.org/r/160061 (owner: 10Gage) [18:56:58] (03CR) 10Gage: [C: 032] merge regression fix & re-enable GELF logging [puppet] - 10https://gerrit.wikimedia.org/r/160061 (owner: 10Gage) [18:57:11] ottomata: elastic1012 is very sad [18:57:47] I have a hunch that is going to cause trouble for users [18:59:00] well, it actually hasn't hurt anyone yet [18:59:04] so maybe ok [19:00:14] (03PS1) 10Ori.livneh: fold deployment::vars into mediawiki::scap [puppet] - 10https://gerrit.wikimedia.org/r/160064 [19:00:16] ottomata: respond? [19:01:06] (03CR) 10Dzahn: [C: 032] ".. or we get a broken chain with the default CA in it" [puppet] - 10https://gerrit.wikimedia.org/r/160063 (owner: 10Dzahn) [19:01:52] manybubbles: i was just chatting with him a moment ago, i expect he'll reappear shortly [19:02:45] jgage: thanks [19:03:14] manybubbles: sorry [19:03:14] hi [19:03:18] ottomata: it's ok [19:03:24] manybubbles: i'm doing it to 1012 now [19:03:26] so I _think_ it's actually ok [19:03:27] i'm watching logs on fenari [19:03:33] (03CR) 10Ori.livneh: [C: 032] fold deployment::vars into mediawiki::scap [puppet] - 10https://gerrit.wikimedia.org/r/160064 (owner: 10Ori.livneh) [19:05:16] ottomata: so what was the difference between these two machines?
[19:05:27] because one of them didn't mind this kind of damage at all [19:05:30] 1016 has the newer ssds [19:05:37] 1012 the older [19:05:37] that's it, afaik [19:05:46] they both have the same enwiki shard [19:05:52] so I am running the same set of queries on each [19:05:57] as fast as I can [19:06:33] good news, I guess, is that the disk cache is filling up [19:06:39] and disk io is actually going down [19:06:39] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: Puppet has 1 failures [19:08:06] I'm not sure I trust the results of your experiment [19:08:14] I mean, I don't know the flaw, but wow [19:08:23] it's like the ssd is fucking magic [19:11:13] manybubbles: i [19:11:17] i'm not sure i trust it either [19:11:18] ! [19:11:32] let's see if the total is faster [19:11:35] aye [19:11:45] but the way the load was distributed is like a billion times better on the new disks [19:12:16] PROBLEM - puppet last run on cp1043 is CRITICAL: CRITICAL: Puppet has 1 failures [19:13:35] ottomata: what were the different model numbers? [19:15:39] RECOVERY - puppet last run on cp1044 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [19:15:40] old: intel 320 SSD [19:15:40] new: intel S3500 ssds [19:18:13] (03PS1) 10Dzahn: SSL certs - add class for GlobalSign CA [puppet] - 10https://gerrit.wikimedia.org/r/160066 [19:18:51] ottomata: so there is a distinct reason so far for the 3500s (nice since we've purchased a few of them now ;) [19:18:53] <^d> ottomata: 320 -> 3500? Shouldn't they be ~10x better? [19:19:00] =p [19:19:18] well, the 320 isn't enterprise grade ssd so the difference isn't a shock [19:20:01] (03CR) 10Dzahn: [C: 04-1] "DigiCertHighAssuranceCA-3.pem doesn't belong in here" [puppet] - 10https://gerrit.wikimedia.org/r/160066 (owner: 10Dzahn) [19:21:09] http://ark.intel.com/products/56563/Intel-SSD-320-Series-120GB-2_5in-SATA-3Gbs-25nm-MLC [19:21:13] http://ark.intel.com/products/75684/Intel-SSD-DC-S3500-Series-480GB-2_5in-SATA-6Gbs-20nm-MLC [19:21:22] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [19:21:51] yea, practically double the performance specs on paper [19:23:09] RobH: ^ we need an entire new class when we add a new CA [19:23:53] but can't copy it exactly from DigiCert [19:24:24] makes sense [19:25:59] (03PS2) 10Dzahn: SSL certs - add class for GlobalSign CA [puppet] - 10https://gerrit.wikimedia.org/r/160066 [19:26:00] chasemp: wait.. here.. this [19:26:10] ?
PS2 should be ok now [19:26:48] (03CR) 10RobH: [C: 031] "this makes sense to me" [puppet] - 10https://gerrit.wikimedia.org/r/160066 (owner: 10Dzahn) [19:26:54] reading [19:27:07] i just didn't replace all occurrences of the same string [19:27:31] it's a copy of the class above [19:27:47] another change but I think we need to add certificates::globalsign_ca to role::cache::ssl::misc [19:28:26] yes [19:28:33] putting up now [19:28:36] +1 [19:29:34] (03CR) 10Dzahn: [C: 032] SSL certs - add class for GlobalSign CA [puppet] - 10https://gerrit.wikimedia.org/r/160066 (owner: 10Dzahn) [19:29:50] (03PS1) 10Rush: need to have ca certs for globalsign [puppet] - 10https://gerrit.wikimedia.org/r/160069 [19:30:24] (03PS2) 10Dzahn: need to have ca certs for globalsign [puppet] - 10https://gerrit.wikimedia.org/r/160069 (owner: 10Rush) [19:31:15] (03CR) 10Dzahn: [C: 031] need to have ca certs for globalsign [puppet] - 10https://gerrit.wikimedia.org/r/160069 (owner: 10Rush) [19:31:22] so the hash will link but the changeset # will not [19:31:36] (03CR) 10Rush: [C: 032] need to have ca certs for globalsign [puppet] - 10https://gerrit.wikimedia.org/r/160069 (owner: 10Rush) [19:31:47] yes, the hash starting with I [19:32:06] gotcha [19:32:07] maybe the number does with "change: " in front, but i always just c/p the hash [19:32:14] from web ui [19:33:21] wrong topic. hrmm [19:33:27] on my change [19:34:39] !log running migratePass0.php across all CentralAuth wikis [19:34:45] Logged the message, Master [19:38:44] (03CR) 10Dzahn: "ottomata: yes, 3 of them (non-ops/WMF) already +1ed it above, so they want it" [puppet] - 10https://gerrit.wikimedia.org/r/159419 (owner: 10Dzahn) [19:44:32] (03CR) 10Dzahn: "fwiw, Reedy for example would still not be able to login because he is neither in ops nor in the NDA group, while we all know he has trust" [puppet] - 10https://gerrit.wikimedia.org/r/159419 (owner: 10Dzahn) [19:44:54] oh manybubbles, i see what you were saying before, iowait hardly blipped on 1016 during the test [19:44:55] right? [19:45:17] ottomata: yeah- did you happen to check what the actual io rates were?
[19:45:22] I don't see that in ganglia [19:45:39] http://graphite.wikimedia.org/render/?width=1057&height=510&_salt=1410459461.018&from=-3hours&target=secondYAxis(servers.elastic1016.iostat.sda.iops.value)&target=secondYAxis(servers.elastic1016.iostat.sdb.iops.value)&target=servers.elastic1016.iostat.sdb.reads_byte.value&target=servers.elastic1016.iostat.sda.reads_byte.value [19:46:10] ottomata: so it read a _ton_ of data at first and didn't have any real iowait [19:47:28] ja, vs 1012: [19:47:28] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=&vl=&x=&n=&hreg[]=elastic101%5B26%5D&mreg[]=cpu_wio&gtype=line&glegend=show&aggregate=1&embed=1&_=1410544996019 [19:47:32] http://graphite.wikimedia.org/render/?width=1057&height=510&_salt=1410459461.018&from=-3hours&target=secondYAxis%28servers.elastic1012.iostat.sda.iops.value%29&target=secondYAxis%28servers.elastic1012.iostat.sdb.iops.value%29&target=servers.elastic1012.iostat.sdb.reads_byte.value&target=servers.elastic1012.iostat.sda.reads_byte.value [19:48:30] both read close to same amount at first: ~4G, 1016 pushed more iops through too [19:48:30] ottomata: graphite.wikimedia.org/render/?width=1057&height=510&_salt=1410459461.018&from=-3hours&target=secondYAxis(servers.elastic1016.iostat.sda.iops.value)&target=secondYAxis(servers.elastic1016.iostat.sdb.iops.value)&target=servers.elastic1016.iostat.sdb.reads_byte.value&target=servers.elastic1016.iostat.sda.reads_byte.value&target=secondYAxis(servers.elastic1012.iostat.sda.iops.value)&target=secondYAxis(servers.elastic1012. [19:48:30] iostat.sdb.iops.value)&target=servers.elastic1012.iostat.sdb.reads_byte.value&target=servers.elastic1012.iostat.sda.reads_byte.value [19:48:52] new disk is better [19:48:54] cool, thanks [19:48:55] how much better [19:48:59] url hacking [19:49:16] new disk is clearly magic [19:49:17] missing 1012 sdb iops! [19:49:31] Why do I have a ping on sdb, tell me. :P [19:50:01] sjoerddebruin: fun! we'll make sure to talk about disks as often as possible [19:50:07] XD [19:50:18] got it [19:50:20] was missing a dot [19:50:21] http://graphite.wikimedia.org/render/?width=1057&height=510&_salt=1410459461.018&from=-3hours&target=secondYAxis(servers.elastic1016.iostat.sda.iops.value)&target=secondYAxis(servers.elastic1016.iostat.sdb.iops.value)&target=servers.elastic1016.iostat.sdb.reads_byte.value&target=servers.elastic1016.iostat.sda.reads_byte.value&target=secondYAxis(servers.elastic1012.iostat.sda.iops.value)&target=secondYAxis(servers.elastic1012.iostat.sdb. [19:51:43] PROBLEM - HTTPS on cp1043 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [19:52:53] RECOVERY - HTTPS on cp1043 is OK: SSL_CERT OK - X.509 certificate for *.wikimedia.org from RapidSSL CA valid until Oct 18 08:51:36 2015 GMT (expires in 401 days) [19:55:07] manybubbles: more url hacking: http://graphite.wikimedia.org/render/?width=1057&height=510&_salt=1410459461.018&from=-3hours&target=secondYAxis(sum(servers.elastic1016.iostat.sd*.iops.value))&target=secondYAxis(sum(servers.elastic1012.iostat.sd*.iops.value))&target=sum(servers.elastic1016.iostat.sd*.reads_byte.value)&target=sum(servers.elastic1012.iostat.sd*.reads_byte.value) [19:55:08] :)
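The same comparison can be pulled programmatically rather than by URL hacking; a sketch against graphite's render API with format=json, using metric names taken from the URLs above:

    import requests

    # The render API returns each target as a list of [value, timestamp] pairs.
    resp = requests.get("http://graphite.wikimedia.org/render/", params={
        "from": "-3hours",
        "format": "json",
        "target": [
            "sum(servers.elastic1016.iostat.sd*.iops.value)",
            "sum(servers.elastic1012.iostat.sd*.iops.value)",
        ],
    })
    for series in resp.json():
        points = [v for v, t in series["datapoints"] if v is not None]
        print(series["target"], max(points) if points else "no data")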
[19:56:59] <_joe_> <+icinga-wm> PROBLEM - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection refused [19:57:06] yeah, that's not good, what's up? [19:57:10] <_joe_> is anyone on this? [19:57:16] <_joe_> I got paged [19:57:19] chase and mutante over in that other channel [19:57:20] ottomata: is there any information about the disk cache we can graph? [19:57:59] for some definition of "on it" [20:00:14] ottomata: your thing finished [20:00:45] RECOVERY - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 227 bytes in 0.079 second response time [20:00:47] why yes it did [20:01:12] results: [20:01:12] https://gist.githubusercontent.com/ottomata/fb03fd03267aa0eb1767/raw/8a4db0ecb4b938fe6a59d9e78803c6ced2d3a0ff/results [20:02:22] manybubbles: i'm searching for a way to know that, not sure yet [20:02:49] <^d> ~5m of real time difference. [20:03:03] <^d> Much more noticeable with 6h of traffic. [20:03:34] ottomata: its looks to me like the new disks were way way way better at handling the initial shot but once the cache was hot it made little difference [20:04:01] aye, makes sense [20:04:33] was that 6 horus? [20:04:35] no [20:04:37] that was 1 hour [20:04:40] of traffic [20:04:45] no rate limiting though [20:04:50] only 32 processes (on a 16 core box) [20:04:55] ah [20:04:56] yeah [20:05:21] well - one good thing - we'll doing that across three machine in real life - which is good [20:05:24] haha, i wonder if I could make the analytics cluster do a distributed hammer...:p [20:05:31] as a yarn python streaming app... [20:05:31] haha [20:05:41] that is why the analtyics cluster cannot talk to production though :( [20:05:42] haha [20:05:47] you _could_ do that, but then I'd have to firewall it [20:06:05] aye, ha [20:06:19] welp hm, manybubbles would it help to restart es on these and then watch these metrics again [20:06:23] as it has to re-read everything from disk? [20:06:31] orrrrr is this good enough? [20:06:55] its pretty clear to me that the new disks are better [20:07:09] what were we trying to find out? [20:07:14] that's all! [20:07:24] or, that they were not worse, i guess [20:07:32] they are better. not worse [20:07:33] not at all [20:07:39] i mean, we were pretty sure they wouldnt' be, i thikn we just wanted to verify that before we went and ordered a ton of them [20:07:44] unless you messed up and put the new one in 1012 [20:07:47] haha [20:07:58] welp, then! [20:08:00] we're just talking about putting them in the new nodes right? [20:08:00] i say: let's procure! [20:08:02] what say you? [20:08:04] yes i think so [20:08:07] procure [20:09:40] (03PS1) 10BBlack: disable wmfusercontent.org site on misc for now, nginx borked [puppet] - 10https://gerrit.wikimedia.org/r/160080 [20:10:11] (03CR) 10Dzahn: [C: 031] disable wmfusercontent.org site on misc for now, nginx borked [puppet] - 10https://gerrit.wikimedia.org/r/160080 (owner: 10BBlack) [20:10:58] (03CR) 10BBlack: [C: 032] disable wmfusercontent.org site on misc for now, nginx borked [puppet] - 10https://gerrit.wikimedia.org/r/160080 (owner: 10BBlack) [20:16:20] manybubbles: want to comment on this ticket and tell RobH what to do next (i.e. proceed with the dual s3500 setup?) 
[20:17:02] if you guys have the metrics to back up the upgrade, please do update ticket with them =] [20:18:07] RobH, I updated the dependent ticket with them [20:18:13] https://rt.wikimedia.org/Ticket/Display.html?id=8071 [20:20:08] (03CR) 10Steinsplitter: [C: 031] "ok." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159999 (https://bugzilla.wikimedia.org/70771) (owner: 10Dan-nl) [20:23:07] We have greg-g's permission to do an out-of-cycle deployment for https://bugzilla.wikimedia.org/show_bug.cgi?id=70759 which I'm starting now. [20:23:12] No i18n changes, just a sync-dir. [20:23:49] * greg-g nods [20:37:20] !log mattflaschen Synchronized php-1.24wmf20/extensions/GettingStarted/: Deploy to fix GettingStarted bucketting for users with null registration date (duration: 00m 07s) [20:37:24] Logged the message, Master [20:37:55] !log mattflaschen Synchronized php-1.24wmf21/extensions/GettingStarted/: Deploy to fix GettingStarted bucketting for users with null registration date (duration: 00m 05s) [20:38:00] Logged the message, Master [20:54:18] * jamesofur is getting search errors [20:54:36] <^d> which wiki? [20:54:42] en [20:54:54] <^d> Do you have beta search turned on? [20:55:11] checking (it's maggie who is getting it, she's next to me) [20:55:19] An error has occurred while searching: The search backend returned an error: [20:55:26] <^d> Yep that's old search. [20:55:47] yeah, new search is not on [20:55:54] <^d> Go in her beta prefs and turn it on ;-) [20:56:07] hm, godog, i just noticed that it looks like all the elasticsearch health check icinga alerts are not working [20:56:17] https://icinga.wikimedia.org/icinga/ [20:56:19] ack [20:56:24] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=elastic1013&service=ElasticSearch+health+check+for+shards [20:56:32] * jamesofur goes to meeting [20:57:58] <^d> jamesofur: So yeah, known bug in old search. We've never managed to fix it. Solution is turn on new search beta and pray for us to finish the last wikis soon :) [21:00:27] (03PS1) 10Ottomata: Cast cluster health keys and values to strings before attempting to utf8 encode them [puppet] - 10https://gerrit.wikimedia.org/r/160090 [21:01:18] <^d> mutante: Has https://rt.wikimedia.org/Ticket/Display.html?id=7924 still been a problem? [21:20:58] ^d: kind of, by which i mean it's a WARN, but not a CRIT [21:21:13] fluorine.eqiad.wmnet WARN: 'CirrusSearch-slow.log_line_rate' [21:23:36] (03PS1) 10Ori.livneh: mediawiki::scap: update for I083d6e58e [puppet] - 10https://gerrit.wikimedia.org/r/160094 [21:26:17] !log deployed fixes for bugs 70620, 69008 [21:26:23] Logged the message, Master [21:27:40] <^d> mutante: Ok, yeah we should fix it one way or the other then. [21:34:22] <^d> mutante: Is our math bad? http://p.defau.lt/?ZZRh5pp5fADzpjO09__Jxw - shouldn't it warn at around 0.0027 and critical at about 0.0055? [21:43:09] ^d: ? why those values, i just see warning => '0.00004', [21:43:37] <^d> Well it says in the comment above, we want to warn when we go over 10/hr [21:45:50] ^d: can you find this actually in ganglia ? [21:45:50] icinga just checks ganglia but there i don't see values? [21:45:50] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=mediaWiki&m=cpu_report&r=hour&s=descending&hc=4&mc=2#metric_mediaWiki.wfDebug.CirrusSearch ??
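A quick check of the arithmetic in ^d's question, assuming the ganglia metric is a per-second line rate:

    # 10 and 20 log lines per hour, expressed per second:
    warn = 10 / 3600.0   # ~0.00278, matching the ~0.0027 above
    crit = 20 / 3600.0   # ~0.00556, matching the ~0.0055 above
    print(warn, crit)    # vs. the configured warning of 0.00004 (~0.14/hr)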
mutante: [21:49:17] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=fluorine&mreg%5B%5D=CirrusSearch-slow.log_line_rate&gtype=line&glegend=show&aggregate=1 [21:49:20] looks a little jerky [21:49:48] i wrote this thing, and if it is being stupid i'll try and fix it next week [21:50:03] i'd kinda like to have a chance to learn to do this through graphite...so maybe that will be better [21:56:28] ottomata: cool, i was just answering ^d's question, didn't know [21:56:32] ^d: you saw that? [21:57:31] ^d: btw, [21:57:34] ElasticSearch health check for shards ? [21:57:48] yeah, i just pinged them about that [21:57:55] i think it's this [21:58:03] https://gerrit.wikimedia.org/r/#/c/160090/ [21:58:33] :) [21:59:03] there are 2 RT's for that Cirrus stuff, but they are only warnings now [21:59:18] 7924 and 7779 [21:59:40] oh yea, the latter, that's the "maybe more hardware" one [22:02:35] <^d> Hmm [22:02:59] (03PS2) 10Ori.livneh: rcstream: add a 5s post-stop sleep delay [puppet] - 10https://gerrit.wikimedia.org/r/146001 [22:03:04] (03CR) 10Ori.livneh: [C: 032 V: 032] rcstream: add a 5s post-stop sleep delay [puppet] - 10https://gerrit.wikimedia.org/r/146001 (owner: 10Ori.livneh) [22:03:16] (03CR) 10Ori.livneh: [C: 032 V: 032] decom /etc/cluster file on app servers [puppet] - 10https://gerrit.wikimedia.org/r/160056 (owner: 10Ori.livneh) [22:03:27] (03PS2) 10Ori.livneh: mediawiki::scap: update for I083d6e58e [puppet] - 10https://gerrit.wikimedia.org/r/160094 [22:03:32] (03CR) 10Ori.livneh: [C: 032 V: 032] mediawiki::scap: update for I083d6e58e [puppet] - 10https://gerrit.wikimedia.org/r/160094 (owner: 10Ori.livneh) [22:06:32] (03CR) 10Chad: [C: 031] Cast cluster health keys and values to strings before attempting to utf8 encode them [puppet] - 10https://gerrit.wikimedia.org/r/160090 (owner: 10Ottomata) [22:07:57] (03Abandoned) 10Dzahn: install wmfusercontent.org cert on protoproxies [puppet] - 10https://gerrit.wikimedia.org/r/159867 (owner: 10Dzahn) [22:10:27] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.009 second response time [22:11:03] ori, YuviPanda, I'm looking at the 'packaging' labs project. there are a bunch of instances that have been shut off since March (the migration from Tampa). Think there's any reason for me not to wipe them all? [22:11:06] (03PS1) 10Ori.livneh: deployment: remove declarations of obsoleted paths [puppet] - 10https://gerrit.wikimedia.org/r/160112 [22:11:34] andrewbogott: if i have anything there, i don't need it any more; not sure about others [22:12:06] lemme see if I can find attribution [22:13:54] (03PS2) 10Ori.livneh: deployment: remove declarations of obsoleted paths [puppet] - 10https://gerrit.wikimedia.org/r/160112 [22:14:10] (03CR) 10Dzahn: [C: 04-2] "jeremyb: wrong repo meanwhile, is it still desired?" [apache-config] - 10https://gerrit.wikimedia.org/r/24407 (owner: 10Jeremyb) [22:14:36] andrewbogott: if you have a chance afterwards, could you take a look at https://gerrit.wikimedia.org/r/#/c/160112/ and +1/-1? [22:14:41] it's very small [22:16:43] ori: well, I don't see an attribution for who created these instances, so I think i'm going to just delete them. They are called… build-lucid1, php-packaging, rt, udp-filter [22:16:49] (03CR) 10RobH: [C: 04-2] "so this is indeed wrong, but i dont want to reject it, rather leave it sitting in my reviews until i get back to it.
(so im voting -2 so " [puppet] - 10https://gerrit.wikimedia.org/r/131087 (owner: 10RobH) [22:17:37] andrewbogott: sure; i dunno what the protocol is for such things. [22:18:08] ori: There isn't really one, except when I mothballed things I warned that they were subject to subsequent deletion. [22:18:19] andrewbogott: seems fair then [22:19:31] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.018 second response time [22:19:48] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 46 data above and 0 below the confidence bounds [22:19:48] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 46 data above and 0 below the confidence bounds [22:19:53] The Fair ended on Labor Day. [22:20:34] (03CR) 10Dzahn: "hmm.. added some reviewers to bump. do we want "RewriteRule ^/robots.txt$ /w/robots.php [L]" on public wikis?" [puppet] - 10https://gerrit.wikimedia.org/r/147487 (owner: 10Reedy) [22:20:36] (03CR) 10Andrew Bogott: [C: 031] "Looks right -- could use some babysitting on tin when it merges though. (Which I mention because I merged a similar patch on virt1000 las" [puppet] - 10https://gerrit.wikimedia.org/r/160112 (owner: 10Ori.livneh) [22:21:14] andrewbogott: danke, will do (babysit on tin) [22:22:27] (03PS3) 10Ori.livneh: deployment: remove declarations of obsoleted paths [puppet] - 10https://gerrit.wikimedia.org/r/160112 [22:22:51] (03CR) 10Ori.livneh: [C: 032 V: 032] deployment: remove declarations of obsoleted paths [puppet] - 10https://gerrit.wikimedia.org/r/160112 (owner: 10Ori.livneh) [22:23:53] (03CR) 10Dzahn: "it seems like this is now "files/logrotate.d_mediawiki_apache:# logrotate config for MediaWiki Apache logs" [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [22:26:12] (03CR) 10Dzahn: "abandoning per Jan and comments above. Christopher, if you ever want to pick this up again, we can also just hit the "restore" button and " [puppet] - 10https://gerrit.wikimedia.org/r/136095 (owner: 10Christopher Johnson (WMDE)) [22:26:23] (03Abandoned) 10Dzahn: Icinga: Check Dispatch command for Wikidata notification [puppet] - 10https://gerrit.wikimedia.org/r/136095 (owner: 10Christopher Johnson (WMDE)) [22:27:48] (03CR) 10Dzahn: "ori, do you know that? the related bug is resolved but this is unmerged. is that a contradiction or not?" [puppet] - 10https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: 10Hashar) [22:53:41] can we remove osmium from icinga? [23:07:23] Reedy: @protactinium :) [23:14:04] (03PS1) 10Dzahn: include Wikimania group photo on people.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/160157 [23:15:28] (03CR) 10Dzahn: "this code is what we provide ourselves for 3rd party users to embed images. 
from http://commons.wikimedia.org/wiki/Category:Wikimania_20" [puppet] - 10https://gerrit.wikimedia.org/r/160157 (owner: 10Dzahn) [23:35:30] (03CR) 10Dzahn: include Wikimania group photo on people.wm.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/160157 (owner: 10Dzahn) [23:36:06] (03PS2) 10Jackmcbarn: Re-enable the Lua profiler on HHVM in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159903 [23:36:29] (03CR) 10Dzahn: "modules/cdh - raaage" [puppet] - 10https://gerrit.wikimedia.org/r/160164 (owner: 10Dzahn) [23:38:58] (03PS2) 10Dzahn: redirect http->https on racktables [puppet] - 10https://gerrit.wikimedia.org/r/160164 [23:39:42] (03PS3) 10Dzahn: redirect http->https on racktables [puppet] - 10https://gerrit.wikimedia.org/r/160164 [23:42:11] (03PS2) 10Dzahn: include Wikimania group photo on people.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/160157 [23:42:42] (03CR) 10Ori.livneh: [C: 032] Re-enable the Lua profiler on HHVM in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159903 (owner: 10Jackmcbarn) [23:42:47] (03Merged) 10jenkins-bot: Re-enable the Lua profiler on HHVM in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159903 (owner: 10Jackmcbarn) [23:43:29] (03CR) 10Dzahn: "removed protocol from links except the attribution link for the photographer seems broken with or without https (http://www.roletschek.de " [puppet] - 10https://gerrit.wikimedia.org/r/160157 (owner: 10Dzahn) [23:44:09] (03CR) 10BryanDavis: [C: 031] "Why not." [puppet] - 10https://gerrit.wikimedia.org/r/160157 (owner: 10Dzahn) [23:44:58] (03CR) 10Dzahn: include Wikimania group photo on people.wm.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/160157 (owner: 10Dzahn) [23:46:58] is zuul/jenkins/whatever still acting up and making beta cluster changes take 6 hours to deploy? [23:48:09] (03Abandoned) 10Dzahn: Add Chinese fonts for VE screenshots feature [puppet] - 10https://gerrit.wikimedia.org/r/154086 (https://bugzilla.wikimedia.org/69535) (owner: 10KartikMistry) [23:48:12] never mind, it's there now [23:51:52] (03CR) 10Dzahn: [C: 032] "labs-only-stats-stuff" [debs/wikistats] - 10https://gerrit.wikimedia.org/r/159963 (https://bugzilla.wikimedia.org/54161) (owner: 10Dzahn) [23:54:33] (03CR) 10Dzahn: [C: 032] "exact same way we solve this in a bunch of other services, just copied from contacts.wm" [puppet] - 10https://gerrit.wikimedia.org/r/160164 (owner: 10Dzahn) [23:59:17] (03PS1) 10Dzahn: racktables - apache, load mod_headers [puppet] - 10https://gerrit.wikimedia.org/r/160168