[00:00:44] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [00:00:46] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [00:06:22] RECOVERY - puppet last run on mw2115 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:24:17] 10Ops-Access-Requests, 6operations: Requesting access to analytics-privatedata-users for Bryan Davis - https://phabricator.wikimedia.org/T115548#1727029 (10bd808) 3NEW [00:25:18] 10Ops-Access-Requests, 6operations: Requesting access to analytics-privatedata-users for Bryan Davis - https://phabricator.wikimedia.org/T115548#1727041 (10bd808) [01:32:05] (03CR) 10Greg Grossmeier: "I respect your opinion, Krenair (always have), but in this specific case we need to move ahead with the decision by Chris Steipp and mysel" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio) [01:33:23] (03CR) 10Alex Monk: "So this (and perhaps the other open one) can go through, and then no more EducationProgram extension deployments?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio) [01:33:45] Krenair: which other one? [01:34:09] enwikiversity [01:37:10] Krenair: thanks. I need to confirm with chris as we were just asked about serbian... [01:37:24] if enwikiversity wasn't in the mix, then yes, your conclusion is correct [01:37:39] but I want to make sure chris and I are on the same page re enwikiversity [01:37:40] Who asked you exactly? 
[01:38:01] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [01:39:22] Krenair: Anna/Floor [01:42:51] bbl, dinner and such [01:44:50] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 1 below the confidence bounds [02:31:35] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 06m 57s) [02:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:35:16] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-15 02:35:15+00:00 [02:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:37:16] hi [02:49:02] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 05m 02s) [02:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:51:28] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-15 02:51:27+00:00 [02:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:52:42] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [03:01:21] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [03:57:36] 6operations, 7Varnish, 7Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1727261 (10intracer) Just did a quick test with one thumbnail - you spend about 200ms to get uncached thumbnail, 40ms to get cached thumbnail... 
[04:39:32] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: puppet fail [05:06:21] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [05:07:50] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: acct (process and login accounting) fill up instances /var/ partition - https://phabricator.wikimedia.org/T71604#1727302 (10bd808) /var is still filling up on deployment-bastion on a fairly regular basis because of these logs with the current re... [05:38:21] PROBLEM - puppet last run on mw1108 is CRITICAL: CRITICAL: Puppet has 1 failures [05:41:25] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Oct 15 05:41:25 UTC 2015 (duration 41m 24s) [05:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:03:40] RECOVERY - puppet last run on mw1108 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:20:02] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [06:23:22] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [06:30:02] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:11] PROBLEM - puppet last run on db2064 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:11] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:21] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:50] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:20] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:32] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:40] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:12] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 
failures [06:33:12] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures [06:53:21] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail [07:04:23] 6operations, 7Varnish, 7Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1727392 (10Tgr) Yes, the imageinfo delay for the median user is around 200 ms. If you show an interactive interface, that's nontrivial; if yo... [07:10:51] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:18:23] 6operations, 7Varnish, 7Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1727393 (10intracer) I do show an interactive interface https://commons.wikimedia.org/wiki/Commons:WLX_Jury_Tool [07:20:11] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:26:21] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:26:30] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:26:41] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [07:26:41] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [07:27:01] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [07:27:11] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:12] RECOVERY - puppet last run on db2064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:31] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [07:27:50] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:51] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [08:36:33] (03CR) 10Hashar: [C: 031] Move base::firewall include into the gerrit::production role [puppet] - 10https://gerrit.wikimedia.org/r/245975 (owner: 10Muehlenhoff) [08:37:40] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [08:40:51] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [08:41:31] (03CR) 10Hashar: [C: 031] "EBernhardson thanks a ton for the follow up on my nitpicks :-}" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: 10EBernhardson) [09:05:52] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail [09:32:54] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [09:54:46] <_joe_> !log restarted gitblit, unresponsive [09:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:57:12] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 55021 bytes in 0.072 second response time [11:45:22] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: puppet fail [11:58:44] 6operations, 10Wikimedia-General-or-Unknown: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860#1727804 (10Aklapper) I received a private email by a user who would like to report a security issue to Wikimedia. I pointed to https://ww... 
[12:12:21] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:23:11] 6operations, 10Wikimedia-General-or-Unknown: Can't see any page, special:RandomPage gives databse error - https://phabricator.wikimedia.org/T115505#1727866 (10Luke081515) [12:25:08] (03CR) 10Muehlenhoff: "I made this role to ensure that systems which are currently unused (and for which the previous role was removed) still get all security up" [puppet] - 10https://gerrit.wikimedia.org/r/246392 (https://phabricator.wikimedia.org/T115489) (owner: 10Muehlenhoff) [12:25:41] (03Abandoned) 10Muehlenhoff: Mark former bits clusters as spares [puppet] - 10https://gerrit.wikimedia.org/r/246390 (https://phabricator.wikimedia.org/T115489) (owner: 10Muehlenhoff) [12:35:16] 6operations, 10Wikimedia-General-or-Unknown: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860#1727900 (10Krenair) Can't they create a phabricator task properly? It'll be via https... [12:43:33] 6operations, 10Wikimedia-General-or-Unknown: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860#1727923 (10Joe) @Krenair GPG will ensure that any communication is reserved for the eyes of the intended people much better than ACLs on p... 
[13:02:30] (03PS1) 10Ori.livneh: Revert I1d7969a9d5 and follow-up fixes for T115505 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246677 [13:02:59] (03CR) 10Ori.livneh: [C: 032] Revert I1d7969a9d5 and follow-up fixes for T115505 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246677 (owner: 10Ori.livneh) [13:03:06] (03Merged) 10jenkins-bot: Revert I1d7969a9d5 and follow-up fixes for T115505 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246677 (owner: 10Ori.livneh) [13:03:25] <_joe_> good morning ori :) [13:03:31] hey [13:03:34] good morning [13:03:41] not going to deploy that, but staging it on tin [13:04:01] (03Abandoned) 10Ori.livneh: Revert "group1 wikis to 1.27.0-wmf.3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246283 (owner: 10Ori.livneh) [13:04:12] <_joe_> well, it might be a good idea to deploy it when we see that the logs are in a good shape [13:04:24] the logs are in good shape [13:04:25] <_joe_> meaning no more purges happen [13:04:34] the purges that are happening are not legitimate [13:04:38] i tailed it earlier [13:04:44] all the legit ones dropped off [13:04:44] <_joe_> oh ok [13:05:14] how do you know? [13:05:18] could be some rarely-accessed page [13:05:28] that gets accessed once a week or once a month or whatever [13:05:51] <_joe_> paravoid: doesn't parsercache expires? [13:05:53] so that's one page view that will be broken [13:05:57] if it's once a month [13:06:15] <_joe_> or did I understand that wrong? 
[13:06:34] I've yet to understand how this fix fixes varnish hits either tbh [13:06:47] because it sends a purge [13:07:02] but to get to the appservers in the first place it has to be a miss [13:07:08] or a logged-in user [13:07:21] a logged-in user visiting a page will fix it for all users, including anons that hit varnish [13:07:47] yes, but what logged-in users are far less than anons [13:07:57] s/what// [13:08:13] but i guess it is a random sample [13:08:32] it would be interesting to measure, but i think the wikis get effectively reparsed every 24h or so [13:08:42] so theoretically the set of logged-in users will visit all pages [13:08:46] keep in mind template updates and link changes cause re-parses too [13:09:04] yeah I've thought of that already [13:10:24] and don't forget bots that crawl the site for grammar errors etc [13:10:40] <_joe_> https://en.wikipedia.org/wiki/Poincar%C3%A9_recurrence_theorem [13:11:05] _joe_: oh, this looks very interesting. i haven't heard of this. [13:11:11] <_joe_> (there is always a Poincare' theorem that is relevant to any discussion) [13:11:35] <_joe_> ori: your hypothesis is that the wikis are an ergodic system, basically :) [13:11:41] (not surprising given that my knowledge of physics is, ummm, pre-newtonian) [13:12:00] maybe aristotelian [13:12:04] rocks fall because it's in their nature [13:12:08] <_joe_> lol [13:12:19] <_joe_> Jeff_Green: backup4001 has a full disk [13:12:27] <_joe_> Jeff_Green: also, good morning! [13:12:39] yup just saw the SMS too [13:12:41] fixing... [13:12:49] <_joe_> k [13:14:35] for those of you not fortunate enough to attend velocity, i can disclose to you my biggest takeaway / insight from velocity 2015, which is that velocity is an expensive and mediocre conference not worth paying for [13:15:21] Did you come to that realisation quickly? 
[13:15:23] it may be possible to recuperate the attendance costs in t-shirts and schwag [13:15:35] Reedy: i see what you did there [13:15:56] * Reedy grins [13:17:22] * Jeff_Green wonders whether icinga has a per-metric alert frequency knob [13:17:47] <_joe_> Jeff_Green: what do you mean? [13:18:29] _joe_: I'd like an SMS for disk capacity on backup4001 but hourly or even every 4H would be more than adequate [13:18:59] <_joe_> notification_interval I guess? [13:19:06] yeah [13:19:29] <_joe_> but that might not work with passive checks [13:19:36] <_joe_> or checks that change the error message [13:22:57] out of curiosity, is there any reason we don't have a more useful DNS search list on iron? [13:23:30] "search wikimedia.org" why not "search wikimedia.org eqiad.wmnet codfw.wmnet ..." [13:23:54] probably no one has bothered to update it [13:24:35] it seems like it got downgraded somewhere along the way, my .ssh/config used to work fine for short hostnames but at some point stopped [13:27:19] Jeff_Green: It looks like a hieradata/hosts/iron.yaml file needs adding [13:27:29] * Reedy makes a commit [13:27:42] nice! thank you! [13:30:56] (03PS1) 10Reedy: Add base::resolving::domain_search for iron [puppet] - 10https://gerrit.wikimedia.org/r/246679 [13:33:37] <_joe_> Jeff_Green: AFAIR it had it [13:33:49] <_joe_> but yeah no merges today [13:33:54] <_joe_> and until monday [13:34:53] (03PS1) 10Reedy: Remove tmh100[12].yaml hieradata [puppet] - 10https://gerrit.wikimedia.org/r/246680 [13:34:58] yup. also it probably makes sense to make the change for all production bastions rather than just iron [13:38:51] shouldn't this be in the bastionhost::opsonly file? 
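The search-list question above (why short hostnames stop resolving when only "search wikimedia.org" is configured) can be illustrated with a minimal sketch. It assumes the usual stub-resolver behaviour described in resolv.conf(5): a name with fewer dots than `ndots` is tried with each search domain appended before it is tried as-is. The function name `candidate_names` is hypothetical, and the snippet only builds the list of names to try; it performs no lookups.

```python
def candidate_names(hostname, search_domains, ndots=1):
    """Return the fully qualified names a stub resolver would try, in order.

    Illustrative only: mirrors the search/ndots rules from resolv.conf(5)
    and does not perform any DNS queries.
    """
    # A trailing dot marks the name as already fully qualified.
    if hostname.endswith("."):
        return [hostname]

    absolute = hostname + "."
    suffixed = [f"{hostname}.{domain}." for domain in search_domains]

    # Names with at least `ndots` dots are tried as-is first,
    # shorter names go through the search list first.
    if hostname.count(".") >= ndots:
        return [absolute] + suffixed
    return suffixed + [absolute]


if __name__ == "__main__":
    short_list = ["wikimedia.org"]
    long_list = ["wikimedia.org", "eqiad.wmnet", "codfw.wmnet"]
    print(candidate_names("mw1017", short_list))
    print(candidate_names("mw1017", long_list))
```

With only "wikimedia.org" in the list, a short name is never tried under eqiad.wmnet or codfw.wmnet, which matches the symptom described for .ssh/config short hostnames.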
[13:39:00] or just all bastion hosts [14:03:02] (03CR) 10Alex Monk: "Yep, these became mw1259 and mw1260 IIRC" [puppet] - 10https://gerrit.wikimedia.org/r/246680 (owner: 10Reedy) [14:05:15] 6operations, 10Analytics, 6Services: Automatic monitoring not working for AQS - https://phabricator.wikimedia.org/T115588#1728063 (10mobrovac) 3NEW [14:06:09] _joe_, why is there tmh200[12] but no entries in puppet about them? [14:06:11] 6operations, 6Analytics-Kanban, 10netops, 5Patch-For-Review: Puppetize a server with a role that sets up Cassandra on Analytics machines [13 pts] {slug} - https://phabricator.wikimedia.org/T107056#1728076 (10mobrovac) [14:08:58] at least there are mgmt entries about those... but then which actual hosts are they? [14:12:34] <_joe_> Krenair: the hiera data? because I didn't remove it [14:12:57] sorry, two separate things going on here [14:12:58] <_joe_> and the dns, I added the new names but didn't remove the old ones, take a look [14:13:05] there is hiera data for tmh1* [14:13:10] <_joe_> they're now named mw1259 and mw1260 [14:13:13] but there is also dns entries for tmh2* mgmt [14:13:22] <_joe_> tmh2* ? [14:13:25] yes [14:13:34] <_joe_> ok I had nothing to do with it and that's blatantly wrong [14:13:34] templates/10.in-addr.arpa:8 1H IN PTR tmh2001.mgmt.codfw.wmnet. [14:13:34] templates/10.in-addr.arpa:9 1H IN PTR tmh2002.mgmt.codfw.wmnet. 
[14:13:34] templates/wmnet:tmh2001 1H IN A 10.193.1.8 [14:13:34] templates/wmnet:tmh2002 1H IN A 10.193.1.9 [14:13:43] <_joe_> we have no tmh2001/2 [14:13:52] <_joe_> but lemme take a look in a few [14:13:54] k [14:14:36] I can't ping those IPs, fwiw [14:25:52] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Can't see any page, special:RandomPage gives database error - https://phabricator.wikimedia.org/T115505#1728109 (10Graham87) [14:40:43] phab 503 :/ [14:41:40] yep, can't edit a project [14:43:00] wfm [14:44:13] 6operations, 6Release-Engineering-Team: Monitor Phabricator and Gerrit availability - https://phabricator.wikimedia.org/T115611#1728229 (10ori) 3NEW [14:44:33] i'm editing a project, so presumably it is more database intensive than just creating a new task, since it involves updating a number of existing records [14:46:45] well I edited a project description, but yeah just a one-character update to a desc field heh [14:46:45] <_joe_> ori: I would've ingenuously thought that would have meant just changing a field in a table [14:47:01] <_joe_> since we do have a db and foreign keys and all :P [14:47:29] foreign keys?? that sounds like a potential homeland security issue. Threat-level: orange. [14:55:57] <_joe_> lol [15:15:17] bblack: myISAM ftw [15:18:38] 6operations, 10Analytics, 6Services: Automatic monitoring not working for AQS - https://phabricator.wikimedia.org/T115588#1728301 (10mobrovac) a:3mobrovac [15:18:58] (03PS1) 10Mobrovac: RESTBase: make the domain to monitor configurable [puppet] - 10https://gerrit.wikimedia.org/r/246687 (https://phabricator.wikimedia.org/T115588) [15:22:02] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [15:22:33] bblack: _joe_ aren't ya'll supposed to be in the rain forest today or something? [15:22:41] or is there cell coverage? 
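The tmh2001/tmh2002 records pasted above pair an A record (10.193.1.8) with a PTR entry whose owner label is "8" in the reverse zone file. A small sketch of how one might check that such a pair is consistent follows. It uses Python's standard `ipaddress` module; the helper names and the zone origin `1.193.10.in-addr.arpa.` are assumptions for illustration, since the log shows only the relative label and not the `$ORIGIN` of the template.

```python
import ipaddress


def reverse_name(ip):
    """Return the fully qualified reverse-DNS owner name for an IP address."""
    return ipaddress.ip_address(ip).reverse_pointer + "."


def ptr_matches(ip, zone_origin, label):
    """Check that a relative PTR owner label in a reverse zone maps to `ip`.

    `zone_origin` is assumed here; the log does not show the real $ORIGIN.
    """
    owner = f"{label}.{zone_origin}".rstrip(".") + "."
    return owner == reverse_name(ip)


if __name__ == "__main__":
    assumed_origin = "1.193.10.in-addr.arpa."  # hypothetical $ORIGIN
    print(reverse_name("10.193.1.8"))
    print(ptr_matches("10.193.1.8", assumed_origin, "8"))
    print(ptr_matches("10.193.1.9", assumed_origin, "9"))
```

A check like this only confirms that forward and reverse entries agree with each other; it says nothing about whether the hosts exist, which was the actual question in the channel.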
[15:23:33] (03PS1) 10Ori.livneh: Turn off UserDailyContribs extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246689 (https://phabricator.wikimedia.org/T85984) [15:23:50] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [15:24:45] <_joe_> greg-g: I am in the hotel as I got a bad flu [15:25:43] _joe_: ugh, sorry man [15:26:04] <_joe_> greg-g: heh, I'm holding the fort meanwhile :P [15:52:20] (03PS1) 10Ori.livneh: add ferm rule opening port 80 for grafana [puppet] - 10https://gerrit.wikimedia.org/r/246691 [16:23:28] (03Abandoned) 10Zfilipin: rubocop: Ignore Style/TrailingComma offense [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [16:24:30] (03Restored) 10Chad: Phabricator: Fetch all gerrit references in Git [puppet] - 10https://gerrit.wikimedia.org/r/227489 (owner: 10Chad) [16:24:42] (03PS5) 10Chad: Phabricator: Fetch all references in Git [puppet] - 10https://gerrit.wikimedia.org/r/227489 [16:35:40] (03CR) 10Greg Grossmeier: "Just this (not the other one, enwikiversity) and then only new ones after Chris and I are satisfied of the Education Program's plans (with" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio) [16:36:01] Krenair: sorry for the delay in responding ^ ;) [16:38:41] (03CR) 10Alex Monk: Enable Education Program extension at srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio) [16:40:37] (03PS3) 10DCausse: Enable config for all three search clusters, but only write to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246443 (https://phabricator.wikimedia.org/T115434) (owner: 10EBernhardson) [16:44:16] (03PS1) 10Bartosz Dziewoński: Move ForeignUploadTargets config to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246703 [17:07:08] 
6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to mw1017 - https://phabricator.wikimedia.org/T115631#1728646 (10Krenair) [17:10:13] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to mw1017 - https://phabricator.wikimedia.org/T115631#1728662 (10MaxSem) While definitely interesting, this idea has its own downsides that need to be weighed - namely, that if something goes out of sync between mw1017... [17:11:36] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to mw1017 - https://phabricator.wikimedia.org/T115631#1728667 (10greg) [17:13:26] (03PS4) 10DCausse: Enable config for all three search clusters, but only write to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246443 (https://phabricator.wikimedia.org/T115434) (owner: 10EBernhardson) [17:13:28] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to mw1017 - https://phabricator.wikimedia.org/T115631#1728630 (10greg) [17:20:35] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to mw1017 - https://phabricator.wikimedia.org/T115631#1728699 (10Joe) I think this has way more downsides than upsides. I strongly recommend against this. [17:21:11] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to mw1017 - https://phabricator.wikimedia.org/T115631#1728700 (10Krenair) What about the ability to use X-Wikimedia-Debug: 0 to disable any automatic(?) use of mw1017? [17:21:49] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to mw1017 - https://phabricator.wikimedia.org/T115631#1728704 (10greg) >>! In T115631#1728699, @Joe wrote: > I think this has way more downsides than upsides. I strongly recommend against this. without actually enumerat... 
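The T115631 proposal and the opt-out idea raised in its comments (sending "X-Wikimedia-Debug: 0" to disable automatic rerouting) can be sketched as a small routing decision. This is a hypothetical illustration, not how any Wikimedia component is implemented: the function `choose_backend`, the backend labels "canary" and "production", and the office network (a reserved documentation range) are all assumptions made for the example.

```python
import ipaddress

# Placeholder office range (RFC 5737 documentation network), NOT a real WMF range.
OFFICE_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def choose_backend(client_ip, headers):
    """Pick a backend pool for a request under the proposed rule.

    Office clients go to the canary pool unless they opt out by sending
    "X-Wikimedia-Debug: 0". Header lookup here is a plain dict access,
    so it is case-sensitive; a real proxy would normalise header names.
    """
    # Explicit opt-out always wins, as suggested in the task comments.
    if headers.get("X-Wikimedia-Debug") == "0":
        return "production"

    addr = ipaddress.ip_address(client_ip)
    if any(addr in network for network in OFFICE_NETWORKS):
        return "canary"
    return "production"


if __name__ == "__main__":
    print(choose_backend("192.0.2.10", {}))
    print(choose_backend("192.0.2.10", {"X-Wikimedia-Debug": "0"}))
    print(choose_backend("203.0.113.5", {}))
```

Putting the opt-out check first keeps the escape hatch independent of how office membership is determined, which seems to be the concern behind asking for an easy way out of the experimental pool.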
[17:22:16] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to mw1017 - https://phabricator.wikimedia.org/T115631#1728705 (10greg) >>! In T115631#1728700, @Krenair wrote: > What about the ability to use X-Wikimedia-Debug: 0 to disable any automatic(?) use of mw1017? Right, there... [17:22:55] (03CR) 10DCausse: [C: 031] Enable config for all three search clusters, but only write to eqiad (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246443 (https://phabricator.wikimedia.org/T115434) (owner: 10EBernhardson) [17:25:12] (03CR) 10EBernhardson: Enable config for all three search clusters, but only write to eqiad (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246443 (https://phabricator.wikimedia.org/T115434) (owner: 10EBernhardson) [17:26:37] (03PS5) 10EBernhardson: Enable config for all three search clusters, but only write to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246443 (https://phabricator.wikimedia.org/T115434) [17:26:45] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to mw1017 - https://phabricator.wikimedia.org/T115631#1728720 (10Joe) >>! In T115631#1728704, @greg wrote: >>>! In T115631#1728699, @Joe wrote: >> I think this has way more downsides than upsides. I strongly recommend ag... [17:28:36] (03PS5) 10EBernhardson: Drop cirrussearch write jobs after 3 hours of failures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 [17:30:56] (03PS1) 10Luke081515: Add throttle exception for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246709 (https://phabricator.wikimedia.org/T115632) [17:32:10] (03CR) 10DCausse: [C: 031] "looks good, in any case if something really bad happens we should be able to remove the lab replica from the write clusters." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson) [17:32:56] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to mw1017 - https://phabricator.wikimedia.org/T115631#1728740 (10greg) >>! In T115631#1728720, @Joe wrote: > - mw1017 runs a version of mediawiki that is ahead of the version actually run on the major wikis; bugs with co... [17:33:07] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1728741 (10Jgreen) >>! In T97676#1726675, @ellery wrote: > 1:10 is already much better. Pgheres has a campaign fi... [17:33:11] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to a machine/machines similar to mw1017 - https://phabricator.wikimedia.org/T115631#1728742 (10greg) [17:36:23] (03PS2) 10Luke081515: Add throttle exception for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246709 (https://phabricator.wikimedia.org/T115632) [17:39:10] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1728747 (10awight) @ellery: just noting that there's still an open question for you, which is blocking us. I don... 
[17:40:46] _joe_: just to be clear: there is no timeline for this proposal; it's just a discussion right now ;) [17:41:46] _joe_: also, I'd appreciate a more courious take on this as opposed to down right pre-determined outcome from the beginning [17:42:00] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to a machine/machines similar to the canary cluster - https://phabricator.wikimedia.org/T115631#1728750 (10Joe) [17:42:17] ah, a better name, yeah :) [17:44:23] <_joe_> greg-g: I think forcing everyone in office to get to an experimental cluster without a way to have an easy in-browser opt-out is wrong [17:44:55] <_joe_> So maybe setting/unsetting a cookie might be a safer approach [17:45:32] <_joe_> that would allow people to constantly browse the sites in an "experimental" state, and opt-out to it by just opening an incognito window [17:46:12] <_joe_> (note that we will probably want to do that with a fraction of users too, when/if such an infrastructure is ready) [17:46:56] <_joe_> greg-g: so yeah, my main concerns were related to a 1-machine cluster and people reporting false positives on outages/issues [17:47:20] 17:44 < _joe_> greg-g: I think forcing everyone in office to get to an experimental cluster without a way to have an easy in-browser opt-out is wrong [17:47:33] there's no reason ori's extension couldn't be written in the opposite way [17:48:05] but yeah, agreed re 1-machine issue, that was an oversight in my "quick, jot this idea down after the hallway conversation this morning" :) [17:48:34] we could recommend to eg Finance/HR to install that extension (that'd de-reroute to the canary cluster)... [17:48:47] * greg-g shrugs [17:48:53] * greg-g is still chewing on the idea [17:49:05] <_joe_> or we could recommend everyone in engineering/product to install it, even outside the office :) [17:49:16] right... 
that's not a bad recommendation either :) [17:50:25] we'd need a Fx version of it :) [17:50:45] <_joe_> and an IE version!!1! [17:51:02] <_joe_> (does IE allow extensions?) [17:51:17] heh [17:51:39] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to a machine/machines similar to the canary cluster - https://phabricator.wikimedia.org/T115631#1728765 (10greg) [17:51:49] * greg-g added the alternative to description [17:57:48] weird, I'm not getting the email notifications on that task... oh right.... spam folder :( [18:09:20] _joe_: Better question: does anyone on staff actually use IE? :p [18:10:12] maybe some pool soul in Finance? [18:10:17] poor [18:23:18] (03PS6) 10Chad: Phabricator: Fetch all references in Git [puppet] - 10https://gerrit.wikimedia.org/r/227489 [18:39:53] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to a machine/machines similar to the canary cluster - https://phabricator.wikimedia.org/T115631#1728863 (10Krenair) >>! In T115631#1728740, @greg wrote: >>>! In T115631#1728720, @Joe wrote: >> - mw1017 runs a version of... [19:20:35] 6operations, 10Wikimedia-Site-Requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1729009 (10Amire80) [19:28:13] (03CR) 10Legoktm: "Do we still need grunt-cli to be installed globally?" 
[puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar) [19:42:54] (03CR) 10Dereckson: [C: 031] Add throttle exception for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246709 (https://phabricator.wikimedia.org/T115632) (owner: 10Luke081515) [19:46:22] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: puppet fail [20:09:50] PROBLEM - very high load average likely xfs on ms-be1011 is CRITICAL: CRITICAL - load average: 227.13, 143.54, 70.49 [20:15:02] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:29:39] (03PS2) 10Smalyshev: Switch to git-based portal [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) [20:30:27] <_joe_> !log rebooting ms-be1011 [20:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:52] RECOVERY - very high load average likely xfs on ms-be1011 is OK: OK - load average: 12.91, 2.95, 0.97 [21:05:54] (03PS1) 10Aude: Add MediaWiki, Meta-Wiki and Wikispecies to Wikibase special site groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246782 [21:11:01] PROBLEM - puppet last run on db2042 is CRITICAL: CRITICAL: puppet fail [21:12:02] (03CR) 10Aude: [C: 04-2] "not to enable until next Tuesday, but please code review :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246782 (owner: 10Aude) [21:12:10] (03PS2) 10Aude: Add MediaWiki, Meta-Wiki and Wikispecies to Wikibase special site groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246782 (https://phabricator.wikimedia.org/T115653) [21:16:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [21:20:11] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence 
bounds [21:25:20] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [21:27:29] _joe_: hi, mind changing the channel status ? [21:27:52] <_joe_> thanks greg-g :) [21:27:57] :) [21:28:03] thanks both [21:28:04] <_joe_> you were faster than me :) [21:28:28] I think anyone can change the topic in here matanya? [21:28:46] <_joe_> Krenair: probably yes [21:28:48] Krenair: yes, but i didn't know of something was actually down [21:28:53] *if [21:29:05] so i didn't [21:29:25] (03PS3) 10Smalyshev: Switch to git-based portal [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) [21:30:05] sadly, nowadays i say useless comments like this one above and don't submit patches [21:34:00] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [21:38:21] RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:40:52] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [21:45:50] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [21:50:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [21:54:22] should I worry about that ^^ [21:55:05] there is weirdness on https://gdash.wikimedia.org/dashboards/reqerror/ [21:56:00] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [21:56:16] actually, stuff like 164 Array to string conversion in /srv/mediawiki/php-1.27.0-wmf.2/includes/deferred/LinksUpdate.php on line 682 looks pretty scary [21:56:34] but it shouldn't be 
generating http errors [21:56:52] <_joe_> greg-g: it's just the scale I think [21:57:01] <_joe_> since we had no peaks of errors today [21:57:03] (03PS4) 10Yurik: maps: Add tileratorui service [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T112914) (owner: 10Alexandros Kosiaris) [21:57:08] <_joe_> the scale is pretty limited [21:57:37] ewww [21:57:40] <_joe_> greg-g: I took a look earlier, I saw no clear outliers, btw [21:57:48] function getPropertyDeletions( $existing ) { [21:57:48] return array_diff_assoc( $existing, $this->mProperties ); [21:57:48] } [21:58:47] I don't know if it's screwing up properties being saved [22:03:19] 10Ops-Access-Requests, 6operations: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T115666#1729674 (10atgo) 3NEW [22:03:57] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1729683 (10Catrope) Could you edit the title of the task to replace the placeholders "RESOURCE" and "USER"? [22:03:58] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1729685 (10atgo) [22:04:04] heh [22:04:10] haha [22:04:15] :) [22:04:21] And her edit was just a few seconds after mine so now I don't look stupid :) [22:04:37] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1729674 (10atgo) @catrope - done. Sorry about that! Had it right and then had to switch computers :) [22:05:05] I sometimes feel pressure right after reporting a bug that i know isn't done to get it all fleshed out asap. see also: that proposal re office ip users this morning :) [22:05:22] What proposal? [22:05:30] Oh the one for sending them to an experimental cluster? 
[22:05:33] https://phabricator.wikimedia.org/T115631 [22:05:55] 7Blocked-on-Operations, 6operations, 3Discovery-Maps-Sprint: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#1729695 (10Yurik) @bblack, are there any blockers/thoughts about this? [22:06:02] Max, James, and I were talking about stuff and that idea came up, so I quickly wrote it down [22:06:56] the idea's not mine, I swear! [22:07:05] * MaxSem hides [22:08:36] !log catrope@tin Synchronized php-1.27.0-wmf.3/extensions/Flow/: Back out categories in sidebar feature (duration: 00m 20s) [22:08:40] greg-g: ---^^ [22:08:45] That's that done [22:08:51] coolio [22:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:09:33] !log don't worry, _joe_ was around and we approved Roan's last deploy as an exception :) [22:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:18:11] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [22:18:51] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1729718 (10Krenair) It sounds like you need the `statistics-privatedata-users` group then, I think... But maybe `researcher`? sigh... [22:20:36] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1729726 (10Krenair) And, of course, bastiononly, because that still isn't fixed. Also, do you have an LDAP account (providing login to labs, gerrit, phabricator... 
[22:29:58] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic, 7Blocked-on-Security: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1729753 (10greg) Upstream docs: https://secure.phabricator.com/book/phabricator/article/notifications/ [22:47:10] (03PS3) 10Luke081515: Add throttle exception for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246709 (https://phabricator.wikimedia.org/T115632) [22:53:52] (03PS10) 10EBernhardson: Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) [22:57:34] (03CR) 10EBernhardson: Refactor monolog handling for kafka logs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: 10EBernhardson) [22:59:56] (03PS11) 10EBernhardson: Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505)