[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150702T0000). [00:01:23] RECOVERY - puppet last run on virt1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [00:02:03] RECOVERY - puppet last run on labvirt1005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [00:04:02] RECOVERY - puppet last run on netmon1001 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [00:05:31] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail [00:05:52] RECOVERY - puppet last run on labvirt1003 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [00:07:52] RECOVERY - puppet last run on virt1004 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [00:09:22] RECOVERY - puppet last run on virt1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [00:10:42] 6operations, 7Graphite, 7Monitoring: evaluate tessera dashboards - https://phabricator.wikimedia.org/T104366#1419417 (10Krinkle) [00:10:52] 6operations, 7Graphite, 7Monitoring: Restrict edit rights in grafana / enable dashboard deletion - https://phabricator.wikimedia.org/T93710#1419419 (10Krinkle) [00:11:52] 6operations, 7Graphite, 7Monitoring: deprecate gdash - https://phabricator.wikimedia.org/T104365#1419421 (10Krinkle) [00:12:54] (03PS2) 10Dzahn: add bromine as a misc-web backend [puppet] - 10https://gerrit.wikimedia.org/r/222198 (https://phabricator.wikimedia.org/T101734) [00:13:22] gonna do the phabricator upgrade ... [00:15:31] 6operations, 10vm-requests, 5Patch-For-Review: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1419428 (10Dzahn) ``` ┌─────────────────┤ [!!] Finish the installation ├─────────────────┐ ┌───│ │ ─┐ │ │... [00:17:59] 6operations, 10vm-requests, 5Patch-For-Review: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1419431 (10Dzahn) after that the installation finished and the system powered down. 
when i powered it up again the installer started from scratch [00:22:52] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [00:25:13] (03CR) 10Dzahn: [C: 032] add bromine as a misc-web backend [puppet] - 10https://gerrit.wikimedia.org/r/222198 (https://phabricator.wikimedia.org/T101734) (owner: 10Dzahn) [00:25:52] (03PS2) 10Dzahn: switch static-bugzilla to backend bromine [puppet] - 10https://gerrit.wikimedia.org/r/222200 (https://phabricator.wikimedia.org/T101734) [00:26:56] (03PS2) 10Dzahn: Redirect wikipedia.is to is.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/221877 (https://phabricator.wikimedia.org/T103915) (owner: 10Glaisher) [00:28:22] sitic: congratulations, you got https://gerrit.wikimedia.org/r/#/c/222222/ [00:29:04] (03CR) 10Dzahn: [C: 032] Redirect wikipedia.is to is.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/221877 (https://phabricator.wikimedia.org/T103915) (owner: 10Glaisher) [00:33:04] ori: :-) [00:33:14] *confetti* [00:33:21] (03CR) 10Dzahn: "[terbium:~] $ apache-fast-test iswiki.urls mw1033" [puppet] - 10https://gerrit.wikimedia.org/r/221877 (https://phabricator.wikimedia.org/T103915) (owner: 10Glaisher) [00:33:25] 6operations, 10ops-eqiad, 10Analytics-Cluster: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1419469 (10RobH) a:5RobH>3Cmjohnson [00:35:38] 6operations, 10Wikimedia-Apache-configuration, 10Wikimedia-DNS, 7domains: Faulty DNS setup for wikipedia.is - https://phabricator.wikimedia.org/T103915#1419475 (10Dzahn) [00:37:39] 6operations, 10vm-requests, 5Patch-For-Review: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1419478 (10Dzahn) next attempt the installer finished without these error messages above. after finishing it shuts down the machine. after you power it up again, installing starts aga... [00:41:06] (03PS2) 10Dzahn: switch analytics and analytics_kafka to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/222153 [00:41:17] (03CR) 10Dzahn: "should work now after https://phabricator.wikimedia.org/T104036 is resolved" [puppet] - 10https://gerrit.wikimedia.org/r/222153 (owner: 10Dzahn) [00:42:11] (03CR) 10Dzahn: [C: 032] switch analytics and analytics_kafka to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/222153 (owner: 10Dzahn) [00:44:40] !log Repooling mw1152 (HHVM image scaler) for testing) [00:44:46] Logged the message, Master [00:45:03] phabricator broken [00:45:21] twentyafterfour: it's the upgrade, right [00:45:22] hmm [00:45:36] no alerts? [00:45:52] he probably scheduled downtime [00:46:10] we do have a check for it [00:46:32] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 1 process with UID = 997 (phd) [00:46:41] that's another one [00:46:46] (03CR) 10Filippo Giunchedi: [C: 031] Set `uWSGIForceWSGIScheme https` for all mod_uwsgi webapps [puppet] - 10https://gerrit.wikimedia.org/r/222216 (owner: 10Ori.livneh) [00:47:18] (03PS2) 10Ori.livneh: Set `uWSGIForceWSGIScheme https` for all mod_uwsgi webapps [puppet] - 10https://gerrit.wikimedia.org/r/222216 [00:47:51] (03CR) 10Ori.livneh: [C: 032 V: 032] "(Aside: having a directory of snippets to Include from or using mod_macro might be a good way to reduce code duplication like this)" [puppet] - 10https://gerrit.wikimedia.org/r/222216 (owner: 10Ori.livneh) [00:47:53] ori: yes, it did trigger and it was scheduled downtime.. 
all good [00:48:51] there we go, also catchpoint [00:53:48] AphrontSchemaQueryException: #1054: Unknown column 'r.spacePHID' in 'where clause' [00:53:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 12 data above and 0 below the confidence bounds [00:54:57] mutante, presumably the db update has not been done yet [00:55:06] needs schema changes [00:55:50] I kind of wonder if springle/jynus are supposed to be involved [00:56:12] 5xx spike probably related to me pooling the HHVM scaler, /me checks. [00:56:33] ACKNOWLEDGEMENT - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 1 process with UID = 997 (phd) 20after4 phab upgrade [00:58:50] Krenair: yep, still being upgraded, nevermind until that's over [00:59:08] sorry all, I didn't schedule downtime for all the services and I didn't disable puppet so it restarted things prematurely [00:59:20] schema update is still running [00:59:20] Krenair: what's happening? [00:59:38] springle, oh, just the phabricator upgrade [00:59:45] springle: it's just me updating schema on phabricator dbs [00:59:46] twentyafterfour: np, that one did not page [00:59:55] ah ok [01:00:08] this schema update sure is taking forever [01:00:54] it's updated 2.3 million records so far. which seems excessive [01:02:48] it's some sort of unbatched updates. no actual schema change running (at least, atm) [01:03:43] phabricator_metamta.metamta_mail [01:04:00] yeah it's super lame [01:04:32] (03PS3) 10Dzahn: add new node bromine, add bz-static role [puppet] - 10https://gerrit.wikimedia.org/r/222203 (https://phabricator.wikimedia.org/T101734) [01:08:21] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [01:08:53] um? [01:08:59] almost done [01:09:13] no i meant mobile-lb.eqiad.wikimedia.org_ipv6 [01:09:13] (03PS2) 10Dzahn: cassandra: add team-services for cql failure [puppet] - 10https://gerrit.wikimedia.org/r/222201 (https://phabricator.wikimedia.org/T104467) (owner: 10Filippo Giunchedi) [01:09:15] um LVS is not me ;) [01:09:24] (03CR) 10Dzahn: [C: 031] cassandra: add team-services for cql failure [puppet] - 10https://gerrit.wikimedia.org/r/222201 (https://phabricator.wikimedia.org/T104467) (owner: 10Filippo Giunchedi) [01:09:44] this LVS alert is becoming familiar [01:10:10] works ok for me [01:10:12] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 19180 bytes in 3.063 second response time [01:10:32] fwiw i'm using curl -v -6 -H 'Host: en.wikipedia.org' http://mobile-lb.eqiad.wikimedia.org [01:14:08] (03CR) 10Dzahn: "it's because submodules are always a PITA" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [01:15:39] (03CR) 10Dzahn: Add Phragile module. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [01:18:10] OMG now it's started over and running through all the mail IDS again [01:18:26] jgage, it was warning about HTTPS... don't know if that might affect your curl check? [01:20:36] (03CR) 10Dzahn: [C: 031] "looks ok to me after glancing at the OTRS Apache config. as long as the .chained.crt is being regenerated it should just work" [puppet] - 10https://gerrit.wikimedia.org/r/221161 (https://phabricator.wikimedia.org/T91504) (owner: 10RobH) [01:21:38] thanks krenair, i overlooked that. 
pulled that command from my shell history, clearly it needs to be updated! [01:22:09] curl -v -k -6 -H 'Host: en.wikipedia.org' https://mobile-lb.eqiad.wikimedia.org [01:23:10] twentyafterfour: in the future, can we have a better error message than the standard one? [01:23:15] terbium:~] $ echo "https://mobile-lb.eqiad.wikimedia.org" > mobile.urls [01:23:18] [terbium:~] $ apache-fast-test mobile.urls [01:23:22] jgage: ^ [01:23:49] legoktm: the downtime should have lasted < 2 minutes [01:23:51] easier if you have multiple URLs to test or want to run against the whole cluster [01:24:08] a better error message would be good but I'm not sure how to implement that [01:24:10] heh 50 threads, nice [01:24:57] jgage: if you specify a server name as second ARG it will just test that single one, otherwise all [01:24:58] I wish this db update would use 50 threads ... instead of 1 [01:25:27] cool mutante [01:26:11] twentyafterfour, isn't there some sort of issue associated with php, threads and windows? [01:26:26] I imagine phabricator upstream + corporate users... [01:26:33] is there a way to override the standard varnish 503? Or would I have to arrange for apache to return a maintenance page? [01:26:53] Krenair: yeah it probably wouldn't do well with threads. but it could do multiple processes in parallel [01:27:01] yeah [01:27:23] phabricator is good at that. I think that they just didn't do it because upstream probably doesn't have such a big table and for them it was a fast update [01:27:30] twentyafterfour: theoretically you could switch the varnish config that tells it which backend to use for requests to phab [01:27:33] I suspect this is a symptom of something being wrong [01:28:47] like, wtf good does it do to store a couple of million phabricator emails in the db ... they must be undeliverable junk or garbage that should have been collected [01:28:54] what are URLs to see the "relic Toolserver files"? [01:29:19] they are HTML ..somewhere [01:32:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [01:33:37] (03CR) 10Dzahn: [C: 031] "before:" [puppet] - 10https://gerrit.wikimedia.org/r/221067 (owner: 10Ricordisamoa) [01:34:55] (03PS2) 10Dzahn: Make relic Toolserver files HTML5 [puppet] - 10https://gerrit.wikimedia.org/r/221067 (owner: 10Ricordisamoa) [01:36:10] (03CR) 10Dzahn: [C: 032] Make relic Toolserver files HTML5 [puppet] - 10https://gerrit.wikimedia.org/r/221067 (owner: 10Ricordisamoa) [01:37:02] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [01:39:08] (03PS1) 10Krinkle: static: Add foundationwiki-2x.png [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222230 [01:40:53] err, phab is down [01:41:36] (03PS2) 10Krinkle: static: Add foundationwiki-2x.png and foundationwiki-1.5x.png [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222230 [01:41:55] MaxSem: it's a scheduled upgrade, it just takes longer than expected [01:42:08] ahh, thanks [01:42:15] wasn't following [01:44:01] hopefully it's really almost done now [01:44:16] 2.71m out of 2.89m ids [01:44:32] mutante: may wanna remove the xmlns="" attribute as well, which is a no-op now. [01:44:42] since it's no longer xhtml [01:46:14] (03PS3) 10Krinkle: static: Add foundationwiki-2x.png and foundationwiki-1.5x.png [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222230 [01:46:28] Krenair: Could you do a sanity check on https://gerrit.wikimedia.org/r/#/c/222230/ ? 
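mutante's apache-fast-test tip above takes an optional second argument: a single app server name to test against instead of fanning out to the whole cluster. A minimal sketch of both invocations, assuming the terbium shell used in the transcript; mw1033 is reused only because it appears in the earlier iswiki test, any app server name would do.

```
# On terbium: list the URL(s) to check, then run against the whole
# cluster (default) or against one named app server.
echo "https://mobile-lb.eqiad.wikimedia.org" > mobile.urls
apache-fast-test mobile.urls            # every app server
apache-fast-test mobile.urls mw1033     # just this one host
```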
[01:46:53] Would like to push out in order to use it off-wiki in error pages still pointing to upload.wm.o for high-res [01:48:32] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [01:49:16] https://phabricator.wikimedia.org/ is giving 503 Service Unavailable [01:49:29] scheduled maintenance, twentyafterfour almost done [01:49:40] ok phab should be back in business [01:49:48] looks up to me [01:49:52] and prettier! [01:50:04] yeah notice the logo? nice huh? :D [01:50:12] font change too [01:50:17] breaks the tab layout tho [01:50:20] under 'activity' [01:50:32] RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 11 processes with UID = 997 (phd) [01:50:33] Krinkle, did you use optipng -o7? [01:50:51] ori: tab layout? you mean a tabbed dashboard page? [01:51:06] * twentyafterfour never looks at it logged out [01:51:08] Krinkle, I checked it and got a file 2 bytes smaller :D [01:51:32] twentyafterfour: umm, it's still broken weirdly in firefox with the giant left margin :( [01:51:48] Krenair: nice, will update :) [01:52:06] legoktm: damn, I'll apply a local fix, I found the problem but upstream seems to have ignored my report [01:52:07] Oh wow it looks ugly now :( [01:52:17] I don't see a left margin problem [01:52:24] But, WTF re https://phabricator.wikimedia.org/project/sprint/board/1339/ [01:52:37] RoanKattouw: that only happens when you enlarge the font with ctrl+ [01:52:42] twentyafterfour: is there a change list? [01:52:47] There's like no contrast [01:53:03] spagewmf: too many to list [01:53:05] heh [01:53:11] The columns are much less visible now [01:53:30] Krinkle: yes, removing all remaining errors to make it valid [01:53:32] RoanKattouw: yeah agreed, that looks a little funky. maybe broken css [01:53:42] And there's no top/bottom padding on the project labels [01:53:47] the styling does seem a bit off [01:53:51] See e.g. the "Flow" label at https://phabricator.wikimedia.org/T104399 [01:54:03] Yeah I wouldn't be surprised if there was CSS missing [01:54:14] Is there a reference install of this new version that we (or someone) could compare to? [01:54:21] 6operations, 7Graphite, 7HTTPS, 5Patch-For-Review: Insecure XHR for 'http://tessera.wikimedia.org/api/preferences/' has been blocked - https://phabricator.wikimedia.org/T104424#1419502 (10ori) 5Open>3Resolved a:3ori [01:54:23] the project label text appears too high up for me. less top margin than bottom [01:54:29] (03PS4) 10Krinkle: static: Add foundation logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222230 [01:54:31] RoanKattouw, secure.phabricator.com? [01:54:31] https://secure.phabricator.com/ [01:54:54] Krenair: It's the same as the original, no? [01:55:07] (03PS3) 10Dzahn: switch static-bugzilla to backend bromine [puppet] - 10https://gerrit.wikimedia.org/r/222200 (https://phabricator.wikimedia.org/T101734) [01:55:07] upstream has the same issue with the labels. don't nitpick too much, the tiny details will be addressed I'm sure.. 
it's still a work in progress [01:55:09] (03PS4) 10Dzahn: add new node bromine, add bz-static role [puppet] - 10https://gerrit.wikimedia.org/r/222203 (https://phabricator.wikimedia.org/T101734) [01:55:11] (03PS1) 10Dzahn: Make relic Toolserver files valid HTML5 [puppet] - 10https://gerrit.wikimedia.org/r/222234 [01:55:13] arrrrgg [01:55:23] Oh wow [01:55:28] Upstream is really bad too [01:55:32] https://secure.phabricator.com/tag/differential/ [01:55:37] Look at the "Manage board" button [01:55:44] The icon alignment in those buttons! [01:55:52] (03CR) 10Krinkle: [C: 04-1] Make relic Toolserver files valid HTML5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [01:55:54] yes, secure.phabricator.org has the tag text shifted higher and the big left margin in Firefox when you change zoom level [01:56:13] Even in Chrome, even with zoom level 100% [01:56:26] How did someone release this without having their eyes bleed :S [01:56:31] I can fix the left margin thing because upstream doesn't pay attention to my report. I haven't seen it with zoom level =100% though [01:56:55] RoanKattouw: overall it looks really nice to me, there are just a few little details to work out [01:57:09] Hmm. [01:57:10] wtf: https://phabricator.wikimedia.org/project/sprint/board/1297/query/5ee8RApNfawG/ [01:57:18] "For HTML or XHTML served as HTML, you should always use the tag inside " [01:57:21] Krinkle: [01:57:26] Krenair: ? [01:57:38] (03CR) 10Ricordisamoa: "https://validator.w3.org/check?uri=http%3A%2F%2Ftoolserver.org still returns "24 Errors, 2 warning(s)"" [puppet] - 10https://gerrit.wikimedia.org/r/221067 (owner: 10Ricordisamoa) [01:57:46] (03CR) 10Dzahn: Make relic Toolserver files valid HTML5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [01:58:02] James_F, it's horrible [01:58:07] do you have the old one open in another tab? [01:58:08] twentyafterfour: text clipped in heading of sprint boards, e.g. https://phabricator.wikimedia.org/tag/mobile-app-sprint-60-android-lightning-round/ [01:58:17] (03PS5) 10Krinkle: static: Add foundation logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222230 [01:58:24] twentyafterfour: Sorry, I'm just unreasonably allergic to lack of attention to detail [01:58:25] Krenair: Eh. It's about as bad as the previous one. [01:58:25] (03CR) 10Dzahn: "maybe these files are not deployed by puppet?" [puppet] - 10https://gerrit.wikimedia.org/r/221067 (owner: 10Ricordisamoa) [01:58:32] Krenair: More disappointingly, https://phabricator.wikimedia.org/p/Forrestbot/feed/ no longer works. [01:59:11] RoanKattouw: it's a complete redesign, done by one person. He's probably working on the details last. [01:59:14] there's https://phabricator.wikimedia.org/p/Forrestbot/ instead James_F [01:59:16] Sure [01:59:16] the clipping is worse in Firefox, but still there on the button actions in chromium. Also, crazy icon alignm ent [01:59:21] But someone somewhere approved this for release [01:59:29] (03CR) 10Dzahn: Make relic Toolserver files valid HTML5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [01:59:36] move fast and breakfast things [01:59:40] While the most prominent buttons in the entire interface have dramatically misaligned icons [01:59:41] (03CR) 10Ricordisamoa: Make relic Toolserver files valid HTML5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [01:59:42] Unless we're working off master? 
[01:59:47] Krenair: Yeah, which doesn't let me paginate further back. [02:00:16] oh, damn [02:00:37] (03CR) 10Ricordisamoa: Make relic Toolserver files valid HTML5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [02:00:50] they created a stable branch, then started merging master into the redesign branch. but the redesign branch has all the new stuff which we really need early access to so we get to live with design inconsistencies for a week [02:01:46] for instance, now we can ditch a bunch of nasty custom code because upstream implemented "subscribers" policy just for our security bug report use-case [02:02:17] Hmm, it doesn't seem to be broken on OSX [02:02:25] Maybe it only looks crap when you don't have certain fonts [02:02:27] RoanKattouw: lol [02:02:31] It looks fine here. [02:02:32] that would explain it [02:02:34] Well. "Fine". [02:02:38] It's a bit too jarring. [02:02:43] But I'll adapt. [02:03:58] I'm pretty sure upstream developers all use osx [02:04:00] (03CR) 10Krinkle: Make relic Toolserver files valid HTML5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [02:04:17] Thought so [02:04:46] "Why do you want a Mac? So Phabricator doesn't make me want to claw my eyes out" [02:05:12] (03PS2) 10Dzahn: Make relic Toolserver files valid HTML5 [puppet] - 10https://gerrit.wikimedia.org/r/222234 [02:05:13] "Mac is for humans, Linux for servers, Windows for testing IE?" [02:05:32] people still test IE? [02:05:48] Isn't Mobile Safari considered worse than IE now? [02:05:51] well I mean, we have automated test VMs for that right? [02:05:56] I would agree actually [02:06:10] (03CR) 10Ricordisamoa: Make relic Toolserver files valid HTML5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [02:06:23] http://nolanlawson.com/2015/06/30/safari-is-the-new-ie/ [02:06:28] A good read :) [02:07:01] we should focus on the fact that upstream added a feature for just US.. that's new [02:07:06] mutante: I'm afraid the doctype is mandatory-ish. I mean, it's a simple page, it won't matter. but without the doctype, the browser activates Netscape-IE4 compat mode [02:07:06] Yeah.. I think I saw that recently [02:07:07] instead of button alignment [02:07:09] That's HTML1.2+bugs [02:07:11] mutante: does it work? [02:07:11] :P [02:07:47] mutante: yeah and it's a really good feature [02:08:13] and it's an extensible api with a lot of possibilities [02:09:30] Gerrit-Phabricator bot seems to have broken [02:09:33] "object policies" - they are exposed in the policy dropdown right now - I had to do uncomfortable things to implement "subscribers can view" policy in security bug reports' policies [02:09:40] love [02:09:42] lovely [02:09:52] Krinkle: i did not even want to remove it, just mistake having to do all the manual rebasing [02:10:05] which is getting a bit annoying [02:10:23] ariel's law - it never takes 5 minutes [02:10:27] Krenair: is there an error log from the bot? [02:10:42] I don't even remember where it runs [02:10:55] doesn't it run on the gerrit server? [02:11:00] https://phabricator.wikimedia.org/p/gerritbot/ tells me nothing [02:11:59] yep [02:12:08] some sort of its-phabricator plugin [02:12:35] (03PS3) 10Dzahn: Make relic Toolserver files valid HTML5 [puppet] - 10https://gerrit.wikimedia.org/r/222234 [02:13:41] yeah it runs inside gerrit.. hmm [02:13:56] not sure if I have access to that server? 
[02:13:58] (03CR) 10Ricordisamoa: Make relic Toolserver files valid HTML5 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [02:14:16] it'd be ytterbium I think? [02:14:58] So it looks like the problem is pretty simple [02:14:58] Their font stack starts with a bunch of proprietary fonts, then contains 'Lato' [02:14:58] If I remove 'Lato' and let if fall back to my system's sans-serif font, it looks better [02:15:03] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 0 below the confidence bounds [02:15:08] and you are in the gerrit-admin group twentyafterfour [02:15:35] Although that doesn't fix the icons it seems [02:16:29] Krenair: but I can't ssh to ytterbium [02:17:05] it has role gerrit::production [02:17:10] ssh_exchange_identification: Connection closed by remote host [02:17:14] hieradata/role/common/gerrit/production.yaml includes gerrit-admin [02:17:20] and you are in gerrit-admin [02:17:22] hmm [02:17:23] ostriches: about? [02:17:26] so I don't see why it shouldn't work? [02:17:39] (03CR) 10Ricordisamoa: "Minor nitpicking in COMMITMSG on PS2, otherwise looks good (always wanted to write this in operations/puppet!!!)" [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [02:18:14] who else is in gerrit admin? [02:18:33] (03PS4) 10Dzahn: Make relic Toolserver files valid HTML5 [puppet] - 10https://gerrit.wikimedia.org/r/222234 [02:18:51] qchris, dduvall, hashar, thcipriani, and zfilipin [02:18:57] none of whom are around [02:19:12] twentyafterfour: are you doing ".wikimedia.org" , not "eqiad.wmnet"? [02:19:22] it has public IP [02:19:32] mutante: ssh ytterbium.eqiad.wmnet [02:19:34] ssh_exchange_identification: Connection closed by remote host [02:19:44] oh so I need to use wikimedia.org? hmm [02:19:45] twentyafterfour: ssh ytterbium.wikimedia.org [02:19:48] No permission denied? [02:20:06] ah ha works now [02:20:07] (03CR) 10Ricordisamoa: [C: 031] Make relic Toolserver files valid HTML5 [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [02:20:19] silver is like that [02:21:02] and I imagine other hosts are as well [02:21:29] well everything is locked down to root. hmm. trying to track down gerrit's log file [02:21:51] bingo found it [02:22:43] the log or the error? ;) [02:23:14] (03CR) 10Krinkle: [C: 04-1] Make relic Toolserver files valid HTML5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [02:23:48] heh [02:24:39] (03PS6) 10Krinkle: static: Add foundation logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222230 [02:26:16] Krenair: So, you mentioned the text is off? [02:26:19] the log [02:26:32] Krinkle, no...? [02:26:40] I hacked it up via console to see how it looks on wikimediafoundation.org with https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Wikimedia_Foundation_RGB_logo_with_text.svg/270px-Wikimedia_Foundation_RGB_logo_with_text.svg.png and background-size 135px [02:26:48] apparently gerrit has been having trouble replicating to a couple of github repos ...that's flooding the logs [02:27:04] Krenair: It seems the current foundationwiki.png doesn't match this commons image [02:27:10] even the 135px size [02:27:30] which repos, twentyafterfour? [02:27:33] e.g. 
open in two tabs and switch: https://wikimediafoundation.org/static/images/project-logos/foundationwiki.png https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Wikimedia_Foundation_RGB_logo_with_text.svg/135px-Wikimedia_Foundation_RGB_logo_with_text.svg.png [02:27:35] 6operations, 6Multimedia: Add monitoring of upload rate on commons to icinga alerts - https://phabricator.wikimedia.org/T92322#1419551 (10Tgr) https://gerrit.wikimedia.org/r/#/c/222224/ will make this easy again. [02:27:48] the commons' one has bolder text [02:27:57] Probably something changed in the installed fonts [02:28:04] git@github.com:wikimedia/apps-ios-wikipedia and git@github.com:wikimedia/mediawiki-services-mathoid [02:28:26] but since the SVG uses paths instead of fonts, I'll take that as the right one [02:28:46] Caused by: org.eclipse.jgit.errors.MissingObjectException: Missing unknown 6a7595de888c8d04899cc58065fa8682eb844a39 [02:30:05] !log l10nupdate Synchronized php-1.26wmf11/cache/l10n: (no message) (duration: 10m 23s) [02:30:12] Logged the message, Master [02:30:33] maybe gerrit just needs to be restarted to reconnect phabot? [02:34:05] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [02:35:19] https://phabricator.wikimedia.org/T104518 is the one I've noticed it ignoring [02:36:05] (03PS7) 10Krinkle: static: Add foundationwiki-2x.png and foundationwiki-1.5x.png [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222230 [02:36:21] replication error is a known bug: https://code.google.com/p/gerrit/issues/detail?id=2025 [02:36:30] (03PS8) 10Krinkle: static: Add foundation logo (with hidpi variants) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222230 [02:36:36] Darn, finally. [02:37:03] !log LocalisationUpdate completed (1.26wmf11) at 2015-07-02 02:37:03+00:00 [02:37:10] Logged the message, Master [02:47:31] (03CR) 10Krinkle: [C: 032] static: Add foundation logo (with hidpi variants) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222230 (owner: 10Krinkle) [02:47:37] (03Merged) 10jenkins-bot: static: Add foundation logo (with hidpi variants) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222230 (owner: 10Krinkle) [02:51:12] !log l10nupdate Synchronized php-1.26wmf12/cache/l10n: (no message) (duration: 05m 19s) [02:51:21] Logged the message, Master [02:52:55] !log krinkle Synchronized docroot and w: 245a1ff (duration: 00m 12s) [02:53:02] Logged the message, Master [02:54:06] !log LocalisationUpdate completed (1.26wmf12) at 2015-07-02 02:54:06+00:00 [02:54:13] Logged the message, Master [02:58:05] 6operations, 7Varnish: /static generates (and caches!) redirect loops on cache-miss - https://phabricator.wikimedia.org/T104532#1419619 (10Krinkle) 3NEW [02:58:16] 6operations, 7Varnish: /static generates (and caches!) redirect loops on cache-miss - https://phabricator.wikimedia.org/T104532#1419626 (10Krinkle) [03:00:40] 6operations, 7Varnish: /static generates (and caches!) redirect loops on cache-miss - https://phabricator.wikimedia.org/T104532#1419629 (10Krinkle) I've send a squid purge to both of them to resolve the immediate issue. [03:01:21] PROBLEM - puppet last run on cp4016 is CRITICAL puppet fail [03:01:36] (03PS5) 10Dzahn: Make relic Toolserver files valid HTML5 [puppet] - 10https://gerrit.wikimedia.org/r/222234 [03:01:53] Krinkle: the first paste in your ticket is missing the curl command [03:02:13] I'd like to understand better what was generating the redirect here... 
[03:02:48] bblack: just plain [03:02:48] $ curl -i 'https://www.wikimedia.org/static/images/wmf-2x.png' [03:02:52] no query parameters [03:02:56] 6operations, 7Varnish: /static generates (and caches!) redirect loops on cache-miss - https://phabricator.wikimedia.org/T104532#1419639 (10Krinkle) [03:03:15] what I mean is: why did MediaWiki/apache ever redirect that file to itself? [03:03:25] I don't know, that's what I'm asking you :P [03:03:33] Both were access right before I scapped it [03:03:44] did the file exist before the scap? was brand-new? [03:03:47] new [03:03:58] have you tried other non-existent filenames? [03:04:10] I have, they don't have a problem it seems [03:04:32] I get 404's for most others I try [03:04:36] Yeah [03:04:39] (03CR) 10Ricordisamoa: Make relic Toolserver files valid HTML5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [03:04:49] but the bottom line here is: varnish didn't generate the redirect. MediaWiki did. [03:04:56] I suspect it may have to do with it being domain-agnostic [03:05:04] mediawiki, or hhvm [03:05:18] sure or apache, but none of those answers make much sense [03:06:04] bblack: So I had wmf.png?foo open before the sync, it gave a 404. Then I synced, and refreshed and got the loop [03:06:12] Perhaps it's a combination of the 404 being cached for a short while [03:06:17] and then the file existing [03:06:22] and it getting really confused [03:06:31] (03CR) 10Ricordisamoa: [C: 031] Make relic Toolserver files valid HTML5 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [03:06:34] they both say frontend hit [03:06:36] which is odd [03:07:53] /static/images/ is in fact globally-shared regardless of $domain, right? [03:08:16] we don't have some convoluted logic somewhere that makes some of it only conditionally available via some hostnames? [03:08:39] (or worse, that tries to redirect it to the same path on another hostname, maybe?) [03:08:59] (03PS1) 10Krinkle: base-page: Use static logo and resolve wikimedia.org www-redirect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222236 [03:09:01] (03PS1) 10Dzahn: remove wikkii table entirely [debs/wikistats] - 10https://gerrit.wikimedia.org/r/222237 (https://phabricator.wikimedia.org/T104367) [03:09:16] bblack: it's domain agnostic indeed [03:09:25] which is something we've only done since very recently afaik [03:09:32] Or do we have other cases of domain agnostic caching? [03:09:44] I think it's a relatively new concept for us. Used only for /static and /beacon right now afaik [03:09:55] I mean at the apache/hhvm/mediawiki level. I know the varnish part, I wrote that. [03:09:57] where /beacon is trivial [03:10:17] well, the backends don't know any of that. No idea what they are getting from varnish for such request. [03:10:20] at the apache/hhvm/mediawiki-level, when serving requests for /static, is it domain-agnostic and domain-redirect-free? [03:10:25] Would be interesting to intercept a backend request of that kind [03:10:40] One that varnish sends to a backend for /static [03:10:48] maybe it breaks the request somehow due to that varnish logic [03:10:55] it's unlikely [03:11:07] it goes through Apache and then HHVM-static. [03:11:17] I understand the varnish part pretty well and can explain it, but my questions are about how MW generated the redirect in the first place [03:11:20] Or rather nginx ->hhvm-static [03:11:21] not apache [03:11:30] I think [03:12:03] do we still use apache in front of hhvm? 
[03:12:10] afaik, yes [03:12:20] ok, I have a new theory [03:12:21] 6operations, 10MediaWiki-extensions-WikimediaMaintenance, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: cn.wikimedia.org: short interwikis to other wikimedia projects are invalid. - https://phabricator.wikimedia.org/T104198#1419649 (10zhuyifei1999) 1.26wmf12 deployed on cnwikimedia.... [03:12:24] php > hhvm > apache > nginx (ssl) [03:12:27] wow [03:12:32] no, not even close [03:12:33] 6operations, 10MediaWiki-extensions-WikimediaMaintenance, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: cn.wikimedia.org: short interwikis to other wikimedia projects are invalid. - https://phabricator.wikimedia.org/T104198#1419652 (10zhuyifei1999) a:5zhuyifei1999>3None [03:12:34] and then lvs > varnish > lvs [03:12:48] OK :) [03:13:15] php > hhvm > apache > LVS > varnish > varnish (> varnish)? > nginx > LVS [03:13:30] Oh, right. [03:13:33] Because we can't cache ssl [03:14:02] huh? [03:14:11] ssl terminator is client specific. [03:14:20] something like that, yeah [03:14:27] !log legoktm Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 12s) [03:14:34] Logged the message, Master [03:14:40] bblack: so, theory? [03:14:48] Reedy: thank you for removing interwiki cache from git :D [03:16:21] RECOVERY - puppet last run on cp4016 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [03:16:57] Krinkle: no, my theory didn't stand up to testing [03:17:28] (03PS1) 10Dzahn: remove gentoo table entirely [debs/wikistats] - 10https://gerrit.wikimedia.org/r/222238 (https://phabricator.wikimedia.org/T104367) [03:18:35] ah, I found a variant of my theory that works! [03:19:33] 6operations, 7Graphite: Upgrade Graphite from 0.9.12 to 0.9.13 - https://phabricator.wikimedia.org/T104536#1419673 (10Krinkle) 3NEW [03:19:53] still, the only good answer is that for these special paths like /static/, from apache on down everything needs to be completely hostname-agnostic. It should never redirect. [03:19:57] and it's not. [03:20:21] 6operations, 7Graphite: Upgrade Graphite from 0.9.12 to 0.9.13 - https://phabricator.wikimedia.org/T104536#1419680 (10Krinkle) [03:20:29] bblack: So what is it doing then? [03:20:35] Is varnish sending a generic hostname? [03:21:34] what is a "generic hostname"? [03:21:45] anyways, give me a sec to nail down the details, and I'll explain with examples [03:22:33] lol https://discussions.apple.com/thread/4979563?start=0&tstart=0 [03:22:47] apparrently some dns got us confused for facebook and other famous sites at some point? [03:23:08] searching for our "unknown domain" text message in google shows up all kinds of support threads. [03:24:25] sounds more like an apple-specific problem [03:26:11] (03CR) 10Krinkle: "To test:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222236 (owner: 10Krinkle) [03:26:44] so the basic problem is this: [03:27:39] (1) for /static paths, varnish ignores the request hostname. 
specifically, when it first sees a /static/foo that's not currently in cache, it will fetch it with the domainname the user requested, but store it such that all future requests will use that cached response regardless of request domainname [03:28:20] (2) we have site-level redirects that preserve the path, implemented in apache, such as "cz.wikipedia.org -> cs.wikipedia.org" [03:28:52] Interesting [03:29:03] (3) therefore: [03:29:09] bblack-mba:puppet bblack$ curl -sv 'https://cz.wikipedia.org/static/images/wmf-2x.pngxxx' 2>&1 >/dev/null |egrep 'Moved|Location' [03:29:12] < HTTP/1.1 301 Moved Permanently [03:29:15] < Location: https://cs.wikipedia.org/static/images/wmf-2x.pngxxx [03:29:17] bblack-mba:puppet bblack$ curl -sv 'https://cs.wikipedia.org/static/images/wmf-2x.pngxxx' 2>&1 >/dev/null |egrep 'Moved|Location' [03:29:20] < HTTP/1.1 301 Moved Permanently [03:29:23] < Location: https://cs.wikipedia.org/static/images/wmf-2x.pngxxx [03:29:27] I've just redirect-looped the non-existent pathname /static/images/wmf-2x.pngxxx for all sites [03:29:33] bblack: Ah! [03:29:37] I think I know how I triggered this [03:29:42] by accessing it through a redirect domain before it's ever accessed through a real domain [03:29:44] I triggered it with wikimedia.org [03:29:45] it's a race [03:29:51] not www.wikimedia.org [03:29:59] so the www redirect got cached [03:30:03] even if later accessed through www proper [03:30:03] right [03:30:30] the answer here is that varnish and the applayer have to be in agreement about whether /static is truly $hostname-agnostic or not [03:30:37] and likewise, once cached properly, we can access it even over non-canonical domains [03:30:38] the applayer is not being $hostname-agnostic about it in all cases [03:30:45] varnish is completely agnostic about it [03:31:04] e.g. https://wikipedia.org/static/images/project-logos/enwiki.png won't redirect to www. [03:31:08] because it gets the /static cache first [03:31:14] for now [03:31:15] unless it's a cache miss [03:31:20] in whch case it screws it up [03:31:21] until someone happens to win a timing race [03:31:24] Yeah [03:31:49] So it's apache. [03:31:56] well, it's both [03:32:11] it's our mis-understanding between the two layers of how universal /static is [03:32:14] I mean, the one crafting the redirect originally [03:32:20] well, yes [03:32:22] So we shoudl excempt redirects from /static ? [03:32:31] that might be the answer, yes [03:32:34] pass through to the client [03:32:57] I think apache serves /static directly, doesn't it? [03:32:59] or not? [03:33:07] I thought so [03:33:11] until I saw HHVM-static in the header [03:33:25] I guess it's deferring all of / to hhvm [03:33:45] or w/static which /static is a symlink to [03:33:51] w/ [03:34:06] if possible we should still cache the redirect, but domain specific. [03:34:13] not sure if that's feasible inside that context. [03:34:13] /static is the canonical path from the outside POV [03:34:19] Yeah [03:34:23] Not sure why it's inside w/ [03:34:30] there's no point caching a redirect or issuing a redirect [03:34:31] we can swap it and point the other way afaic [03:34:40] it should just be universal [03:34:41] Right, true. [03:35:00] for now, I can fix this in varnish by disabling the hostname-agnostic part there, but it sucks for efficiency [03:35:11] bblack: What about mocking in a fixed hostname? 
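bblack's numbered explanation and curl demo above describe a cache-poisoning race: varnish stores /static responses under a hostname-agnostic key, while apache still applies path-preserving site redirects such as cz.wikipedia.org -> cs.wikipedia.org, so whichever hostname wins the first fetch decides what every hostname gets afterwards. An annotated replay of his demo (same commands, same intentionally non-existent path):

```
# 1) First-ever fetch of the path arrives via a redirecting site, so the
#    backend answers 301 and varnish caches that 301 hostname-agnostically.
curl -sv 'https://cz.wikipedia.org/static/images/wmf-2x.pngxxx' 2>&1 >/dev/null \
  | egrep 'Moved|Location'

# 2) The canonical domain now gets the same cached 301, whose Location
#    points back at itself: a redirect loop until the object is purged.
curl -sv 'https://cs.wikipedia.org/static/images/wmf-2x.pngxxx' 2>&1 >/dev/null \
  | egrep 'Moved|Location'
```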
[03:35:14] something predictable [03:35:24] stil caching hostname agonistic [03:35:26] ah, good idea [03:35:28] (03PS6) 10Dzahn: Make relic Toolserver files valid HTML5 [puppet] - 10https://gerrit.wikimedia.org/r/222234 [03:35:31] but have the backend request use a fixed hosntname [03:35:34] I can force all /static req.hostnames to be enwiki [03:35:42] Right [03:36:00] or www.wikimedia.org [03:36:51] (03CR) 10Krinkle: [C: 032] base-page: Use static logo and resolve wikimedia.org www-redirect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222236 (owner: 10Krinkle) [03:36:56] (03Merged) 10jenkins-bot: base-page: Use static logo and resolve wikimedia.org www-redirect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222236 (owner: 10Krinkle) [03:37:55] !log krinkle Synchronized 404.html: 6d49d229806 (duration: 00m 12s) [03:38:02] Logged the message, Master [03:38:15] !log krinkle Synchronized docroot/default/index.html: 6d49d229806 (duration: 00m 12s) [03:38:22] Logged the message, Master [03:40:11] Hm.. :( Looks like our global /favicon.ico rewrite is also broken [03:40:18] https://wikimediafoundation.org/favicon.ico https://en.wikipedia.org/favicon.ico [03:40:24] 200 empty response [03:40:37] that was served by favicon.php afair [03:40:41] Yeah [03:40:50] w/favicon.php is also 200 OK empty [03:40:53] I'll fix that one [03:40:59] probably an outdated path [03:43:38] The W3C recommends that icons be supported through the use of the following style of markup: [03:43:41] [03:43:45] https://phabricator.wikimedia.org/T19980 [03:43:51] 6operations, 7Varnish: /static generates (and caches!) redirect loops on cache-miss - https://phabricator.wikimedia.org/T104532#1419699 (10BBlack) So, after debugging this on IRC with Krinkle, the net of the problem is this: 1. Varnish in [[ https://github.com/wikimedia/operations-puppet/blob/c1fd51aa39632bf4... [03:45:14] (03PS1) 10BBlack: Fix /static hashing by forcing enwiki hostname [puppet] - 10https://gerrit.wikimedia.org/r/222242 (https://phabricator.wikimedia.org/T104532) [03:46:25] ori: ping? [03:46:46] mutante: Yeah, but all browsers still request /favicon.ico in many scenarios. [03:46:52] mutante: for example, when opening an image in a new window. [03:47:11] mutante: and more importantly, we reference /favicon.ico in some pages using [03:47:17] it's a public API of sorts [03:47:36] Krinkle: gotcha, ok, just saw that one when looking for open tickets related to favicons [03:48:21] (03CR) 10Ricordisamoa: [C: 031] Make relic Toolserver files valid HTML5 [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [03:48:30] Krinkle: if you've fixed the immediate fallout from the new files, let's hold on merging the varnish fix [03:48:34] i think somebody did report it here but we assumed it's just because after the https switch they get the lock icon now [03:49:00] we've been operating under these conditions for a while, and cache-hotness seems to generally save us or we'd have noticed before. and I'd like to get ori's feedback first before forcing enwiki. [03:49:13] (since he was involved in the /static move, and there could be consequences I don't understand) [03:49:18] Yeah [03:49:23] bblack: what fallout? [03:49:36] well the loops you already purged [03:49:42] Yeah [03:50:17] mutante: So that task is about using one or another . Neither is favicon.ico. 
That's a wmf-specific hack :D [03:50:32] PROBLEM - puppet last run on cp4014 is CRITICAL puppet fail [03:50:38] which rewrites to favicon.php, which then uses wmf-config to find the bits url, and then does a stream pass thru [03:50:42] it's amazing [03:51:11] (03CR) 10Dzahn: "just that the files did not seem to actually get updated even though they are puppetized:" [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [03:51:16] we are the rube goldberg machine of web sites [03:51:38] Krinkle: ok :) [03:57:04] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T104537" [puppet] - 10https://gerrit.wikimedia.org/r/221067 (owner: 10Ricordisamoa) [03:57:34] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T104537" [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [03:58:12] (03CR) 10Dzahn: [C: 032] Make relic Toolserver files valid HTML5 [puppet] - 10https://gerrit.wikimedia.org/r/222234 (owner: 10Dzahn) [03:58:47] OK. I'm gonna stop looking at logs or I"ll keep finding issues. [03:59:08] https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors says there's 4000+ issues with 10.192.16.17 in the last 15 minutes alone [03:59:15] :) [03:59:30] that's db2029.codfw.wmnet [03:59:30] I dunno databases, so gonna ignore that and hope someone else read that [04:00:18] it's being hit from enwiki /w/api.php requests fwiw [04:00:20] it's ignorable, if silly [04:00:40] those errors are being generated on codfw appservers, about a codfw database, neither of which are in the flow of production things [04:00:59] but I guess they're already logging to the common production logstash endpoint [04:01:13] Ah, interesting [04:01:13] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 15.38% of data above the critical threshold [100000000.0] [04:01:27] what's generating 500 requests per minute? [04:01:33] good question! [04:02:39] the requests are coming from codfw LVS... [04:02:46] Health checks would generate a few requests [04:02:51] ah, health checks [04:02:53] yes [04:02:56] But not that many, hopefully [04:02:59] one for each app server [04:03:09] yeah [04:03:12] Oh right [04:03:14] one for each app server, every 10 seconds [04:03:16] It's a DB, not an app server [04:03:30] well 2 actually, for the redundant LVS [04:03:31] the error is coming from an app server though [04:03:49] but we have more app servers than db servers [04:03:49] right [04:03:55] k [04:03:59] LVS does 0.2 reqs/sec/appserver to enwiki/Main_Page, and then the applayer has to hit a DB to try to render it [04:04:00] Right, so you only need 40 app servers for 2*40*(60/10) to be 480 reqs/min [04:05:01] (03PS2) 10Dzahn: Redirect dartar's cite-o-meter to Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/221063 (owner: 10Ricordisamoa) [04:05:25] it's probably that none of that was pooled for LVS in codfw until we recently started testing the etcd->pybal integration there [04:05:48] it's also probable that even if they were defined, they were all configured as depooled until I globally-pooled them all earlier today heh [04:06:15] (03CR) 10Dzahn: [C: 032] Redirect dartar's cite-o-meter to Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/221063 (owner: 10Ricordisamoa) [04:07:28] (03PS1) 10BBlack: add (depooled, hwfail) cp3011.esams to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/222245 [04:07:28] bblack: hey [04:07:41] i'm here. should i read the backlog or is there a tl;dr? 
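On the /favicon.ico regression Krinkle describes above (the path is rewritten to w/favicon.php, which looks the real icon URL up in wmf-config and streams it through, but is currently returning an empty 200), a hedged way to reproduce the symptom from any shell; the expected-healthy result described in the comment is an assumption, not output captured during the incident:

```
# Both URLs were reported above as "200 OK, empty response"; a healthy
# favicon would come back with a non-zero Content-Length and an image
# Content-Type.
for url in https://wikimediafoundation.org/favicon.ico \
           https://en.wikipedia.org/favicon.ico; do
  echo "== $url"
  curl -s -D - -o /dev/null "$url" | egrep -i '^HTTP/|^Content-(Type|Length)'
done
```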
[04:07:44] (03PS2) 10BBlack: add (depooled, hwfail) cp3011.esams to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/222245 [04:08:01] ori: https://phabricator.wikimedia.org/T104532 [04:08:01] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:08:28] (03CR) 10BBlack: [C: 032] add (depooled, hwfail) cp3011.esams to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/222245 (owner: 10BBlack) [04:09:17] 6operations, 7Regression, 7Varnish: [Regression] /favicon.ico broken. Serves empty 200 OK response from HHVM - https://phabricator.wikimedia.org/T104538#1419737 (10Krinkle) 3NEW [04:09:21] (03PS2) 10Dzahn: labnet1002 install params [puppet] - 10https://gerrit.wikimedia.org/r/217358 (https://phabricator.wikimedia.org/T99701) (owner: 10RobH) [04:09:27] (03CR) 10jenkins-bot: [V: 04-1] labnet1002 install params [puppet] - 10https://gerrit.wikimedia.org/r/217358 (https://phabricator.wikimedia.org/T99701) (owner: 10RobH) [04:09:43] OK. No what's causing https://phabricator.wikimedia.org/T104538 - someone else can have a go :D [04:09:46] nn [04:09:47] o/ [04:10:22] good night timo [04:10:41] nite :) [04:10:42] bblack: I had the same solution in mind as I read your comment [04:10:53] but let's canonicalize to www.wikimedia.org rather than enwiki [04:10:53] (03CR) 10Dzahn: "if nobody cares about them wouldn't that be even more reason to not keep that 180 days?" [puppet] - 10https://gerrit.wikimedia.org/r/195917 (owner: 10ArielGlenn) [04:11:08] ori: ok [04:11:21] it gets served by the same app servers -- https://www.wikimedia.org/static/images/wmf.png works [04:11:25] (03CR) 10Dzahn: [C: 031] toss mw-logs after 90 days, not 180 [puppet] - 10https://gerrit.wikimedia.org/r/195917 (owner: 10ArielGlenn) [04:12:16] 6operations, 5Patch-For-Review, 7Varnish: /static generates (and caches!) redirect loops on cache-miss - https://phabricator.wikimedia.org/T104532#1419747 (10Krinkle) This problem surfaced due to incoming requests for /static using https://wikimedia.org, which is a redirect to https://www.wikimedia.org on ca... [04:12:34] the way our URL-routing logic straddles a dozen layers really is remarkable [04:13:30] (03PS2) 10BBlack: Fix /static hashing by forcing www.wm.o hostname [puppet] - 10https://gerrit.wikimedia.org/r/222242 (https://phabricator.wikimedia.org/T104532) [04:14:42] (03CR) 10Ori.livneh: [C: 031] "Nice. I think this would have been an improvement even if this bug had not surfaced." [puppet] - 10https://gerrit.wikimedia.org/r/222242 (https://phabricator.wikimedia.org/T104532) (owner: 10BBlack) [04:15:17] technically, we could do one better and both force the hostname and skip hashing the hostname, but the only benefit over this would be saving one hash_data() operation [04:15:20] (03CR) 10Dzahn: "what keeps us from actually using the same config on beta as on production? what is the diff between them?" [puppet] - 10https://gerrit.wikimedia.org/r/173492 (owner: 10Reedy) [04:15:21] this seems simpler :) [04:16:03] (03CR) 10BBlack: [C: 031] Fix /static hashing by forcing www.wm.o hostname [puppet] - 10https://gerrit.wikimedia.org/r/222242 (https://phabricator.wikimedia.org/T104532) (owner: 10BBlack) [04:16:13] (03CR) 10BBlack: [C: 032] Fix /static hashing by forcing www.wm.o hostname [puppet] - 10https://gerrit.wikimedia.org/r/222242 (https://phabricator.wikimedia.org/T104532) (owner: 10BBlack) [04:16:16] stupid mouse [04:17:09] (03CR) 10Dzahn: "does the comment still make sense today?" 
[puppet] - 10https://gerrit.wikimedia.org/r/182141 (owner: 10AndyRussG) [04:18:32] !log depooled mw1152. [04:18:38] Logged the message, Master [04:23:32] PROBLEM - puppet last run on cp4017 is CRITICAL Puppet has 2 failures [04:23:33] PROBLEM - puppet last run on cp3030 is CRITICAL Puppet has 2 failures [04:24:06] checking [04:24:22] PROBLEM - puppet last run on cp1065 is CRITICAL Puppet has 2 failures [04:24:39] bleh, bad patch [04:24:43] PROBLEM - puppet last run on cp2021 is CRITICAL Puppet has 2 failures [04:25:18] sorry, i should have noticed that [04:26:08] me too! [04:26:17] (03PS1) 10KartikMistry: Beta: Add cxserver::restbase URL [puppet] - 10https://gerrit.wikimedia.org/r/222247 [04:27:11] I disabled cache puppets for now, wanted to check one more thing first [04:28:46] it might be safer/better to do the hostname mangling in recv than hash [04:28:53] and get rid of the hash hack entirely [04:29:08] going to revert that other one for now [04:29:20] (03PS1) 10BBlack: Revert "Fix /static hashing by forcing www.wm.o hostname" [puppet] - 10https://gerrit.wikimedia.org/r/222248 [04:29:45] (03CR) 10BBlack: [C: 032 V: 032] Revert "Fix /static hashing by forcing www.wm.o hostname" [puppet] - 10https://gerrit.wikimedia.org/r/222248 (owner: 10BBlack) [04:33:56] it's complicated heh [04:34:23] we want purges to work correctly as well, and mobile redirects happen after purges [04:34:29] hmmmm [04:42:33] (03PS1) 10BBlack: Replace static-hash with hostname normalization [puppet] - 10https://gerrit.wikimedia.org/r/222249 (https://phabricator.wikimedia.org/T104532) [04:42:37] ori: ^ ? [04:45:22] PROBLEM - puppet last run on cp2021 is CRITICAL Puppet has 2 failures [04:45:32] PROBLEM - puppet last run on cp1067 is CRITICAL Puppet has 2 failures [04:45:41] PROBLEM - puppet last run on cp1047 is CRITICAL Puppet has 2 failures [04:45:41] PROBLEM - puppet last run on cp4016 is CRITICAL Puppet has 2 failures [04:45:42] PROBLEM - puppet last run on cp4009 is CRITICAL Puppet has 2 failures [04:45:42] PROBLEM - puppet last run on cp1068 is CRITICAL Puppet has 2 failures [04:45:42] PROBLEM - puppet last run on cp1054 is CRITICAL Puppet has 2 failures [04:45:51] PROBLEM - puppet last run on cp2001 is CRITICAL Puppet has 2 failures [04:45:52] PROBLEM - puppet last run on cp2023 is CRITICAL Puppet has 2 failures [04:45:53] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 2 failures [04:45:53] PROBLEM - puppet last run on cp1066 is CRITICAL Puppet has 2 failures [04:46:02] PROBLEM - puppet last run on cp4018 is CRITICAL Puppet has 2 failures [04:46:02] PROBLEM - puppet last run on cp4017 is CRITICAL Puppet has 2 failures [04:46:02] PROBLEM - puppet last run on cp1059 is CRITICAL Puppet has 2 failures [04:46:02] PROBLEM - puppet last run on cp3030 is CRITICAL Puppet has 2 failures [04:46:03] PROBLEM - puppet last run on cp3031 is CRITICAL Puppet has 2 failures [04:46:03] PROBLEM - puppet last run on cp3040 is CRITICAL Puppet has 2 failures [04:46:03] PROBLEM - puppet last run on cp3005 is CRITICAL Puppet has 2 failures [04:46:03] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 2 failures [04:46:04] PROBLEM - puppet last run on cp3015 is CRITICAL Puppet has 2 failures [04:46:04] PROBLEM - puppet last run on cp3016 is CRITICAL Puppet has 2 failures [04:46:05] PROBLEM - puppet last run on cp3004 is CRITICAL Puppet has 2 failures [04:46:05] PROBLEM - puppet last run on cp3041 is CRITICAL Puppet has 2 failures [04:46:06] PROBLEM - puppet last run on cp3006 is CRITICAL Puppet has 2 failures 
[04:46:12] PROBLEM - puppet last run on cp2007 is CRITICAL Puppet has 2 failures [04:46:22] PROBLEM - puppet last run on cp2009 is CRITICAL Puppet has 2 failures [04:46:23] PROBLEM - puppet last run on cp2019 is CRITICAL Puppet has 2 failures [04:46:23] PROBLEM - puppet last run on cp2004 is CRITICAL Puppet has 2 failures [04:46:31] PROBLEM - puppet last run on cp1052 is CRITICAL Puppet has 2 failures [04:46:33] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 2 failures [04:46:33] PROBLEM - puppet last run on cp1055 is CRITICAL Puppet has 2 failures [04:46:38] I love how agent disable instantly-suppressess all the impending current failures, but holds them for release when re-enabled later :P [04:46:42] PROBLEM - puppet last run on cp4010 is CRITICAL Puppet has 2 failures [04:46:43] PROBLEM - puppet last run on cp3013 is CRITICAL Puppet has 2 failures [04:46:51] PROBLEM - puppet last run on cp1065 is CRITICAL Puppet has 2 failures [04:47:01] PROBLEM - puppet last run on cp3012 is CRITICAL Puppet has 2 failures [04:47:01] PROBLEM - puppet last run on cp3010 is CRITICAL Puppet has 2 failures [04:47:01] PROBLEM - puppet last run on cp3007 is CRITICAL Puppet has 2 failures [04:47:32] RECOVERY - puppet last run on cp1054 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [04:47:32] RECOVERY - puppet last run on cp4009 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [04:47:42] RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [04:48:01] RECOVERY - puppet last run on cp3030 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [04:48:01] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [04:48:02] RECOVERY - puppet last run on cp3041 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [04:48:02] RECOVERY - puppet last run on cp3006 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [04:48:13] RECOVERY - puppet last run on cp2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:48:22] RECOVERY - puppet last run on cp1052 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:48:33] RECOVERY - puppet last run on cp3013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:48:50] (03CR) 10MaxSem: [C: 04-1] "It does, however it needs an update because /some/ ugly URLs are redirected now so it's even more important that the CA usage is accuratel" [puppet] - 10https://gerrit.wikimedia.org/r/182141 (owner: 10AndyRussG) [04:48:52] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:49:12] RECOVERY - puppet last run on cp1067 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [04:49:22] RECOVERY - puppet last run on cp4016 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [04:49:42] RECOVERY - puppet last run on cp1066 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [04:49:42] RECOVERY - puppet last run on cp4018 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [04:49:51] RECOVERY - puppet last run on cp3005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:49:52] RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:50:21] RECOVERY - 
puppet last run on cp1055 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:50:31] RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:50:32] RECOVERY - puppet last run on cp1065 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:50:42] RECOVERY - puppet last run on cp3012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:50:42] RECOVERY - puppet last run on cp3007 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [04:51:13] RECOVERY - puppet last run on cp1068 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [04:51:33] RECOVERY - puppet last run on cp4017 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [04:51:41] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:51:42] RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [04:51:51] RECOVERY - puppet last run on cp2007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:52:01] RECOVERY - puppet last run on cp2019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:52:12] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:53:21] RECOVERY - puppet last run on cp2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:53:22] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:55:14] ori: https://gerrit.wikimedia.org/r/#/c/222249 (I think that's right wrt other hostname-hacking code. it's wrong for purge, but we'll just have to remember to purge /static via www.wikimedia.org directly) [04:56:03] arguably there's an inherent conflict between all of our rewriting and purges there, but that's a whole other thing to fix. [04:56:20] (03PS2) 10Ori.livneh: Replace static-hash with hostname normalization [puppet] - 10https://gerrit.wikimedia.org/r/222249 (https://phabricator.wikimedia.org/T104532) (owner: 10BBlack) [04:56:35] (commit message said mediawiki.org; patch said wikimedia.org) [04:56:44] (03CR) 10Ori.livneh: [C: 031] Replace static-hash with hostname normalization [puppet] - 10https://gerrit.wikimedia.org/r/222249 (https://phabricator.wikimedia.org/T104532) (owner: 10BBlack) [04:57:07] sigh. [04:57:15] heh [04:57:26] i am regularly mocked by my fortune file [04:57:36] i started a shell session and this one came up: [04:57:39] The unavoidable price of reliability is simplicity. [04:57:39] -- C.A.R. 
Hoare [04:57:43] lol [04:58:30] (03CR) 10BBlack: [C: 032] Replace static-hash with hostname normalization [puppet] - 10https://gerrit.wikimedia.org/r/222249 (https://phabricator.wikimedia.org/T104532) (owner: 10BBlack) [04:58:54] someone should write up an RFC for "Rename the org to The Wikipedia Foundation and end all the naming madness" [04:59:49] mediawiki vs wikipedia isn't so bad [05:00:03] but mediawiki vs wikimedia vs wikipedia is awful to keep track of at times [05:02:22] RECOVERY - puppet last run on cp2021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:03:01] RECOVERY - puppet last run on cp1059 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:03:02] RECOVERY - puppet last run on cp3015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:03:22] RECOVERY - puppet last run on cp2009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:03:31] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0] [05:03:43] bblack: iirc "wikiMedia" was chosen specially to trick people :D [05:03:55] it does a great job of that! :P [05:04:06] it tricks everyone who works here, but confuses all the normal public [05:04:23] RECOVERY - puppet last run on cp1047 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [05:04:25] I pretty much gave up and just started answering questions with "I work for Wikipedia" [05:04:33] bblack: https://lists.wikimedia.org/pipermail/foundation-l/2007-May/029991.html [05:04:50] because anything else becomes an annoying conversation [05:05:14] I too [05:05:16] well [05:06:16] "I am a freelance consultant that happens to work for an unrelated foundation based in San Francisco that helps maintain Wikipedia" [05:06:38] but yeah Wikipedia is a strong name [05:06:42] RECOVERY - puppet last run on cp3016 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [05:06:42] PROBLEM - puppet last run on cp3005 is CRITICAL puppet fail [05:06:43] brand [05:06:43] whatever [05:06:48] <3 Eloquence. Can I +1 his branding proposal 8 years after it apparently failed? :) [05:07:14] the community will shoot you on the spot! [05:07:38] they want to anyways. the only time they ever know who I am is when I break something they were using :P [05:07:40] I miss Erik.
[05:08:34] rebranding the foundation to Wiki P edia , will further shadow the other projects (like Wikisource) [05:08:42] RECOVERY - puppet last run on cp3005 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [05:09:01] ori: I too :/ [05:11:52] but just think of all the -1 code reviews for mixing up some variety of /(wiki|me[dp]ia)(me[dp]ia|wiki)/ [05:12:04] it might save us donor money in lost developer time :) [05:14:08] good point :D [05:39:22] RECOVERY - puppet last run on etcd1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:40:32] RECOVERY - puppet last run on conf1001 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [05:42:41] PROBLEM - puppet last run on conf1002 is CRITICAL Puppet last ran 12 hours ago [05:44:32] RECOVERY - puppet last run on conf1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:47:51] PROBLEM - puppet last run on conf1003 is CRITICAL Puppet last ran 7 hours ago [05:48:51] PROBLEM - puppet last run on cp3016 is CRITICAL puppet fail [05:49:42] RECOVERY - puppet last run on conf1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:53:11] RECOVERY - puppet last run on etcd1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:57:41] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [06:06:11] RECOVERY - puppet last run on cp3016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:12:54] (03PS1) 10Giuseppe Lavagetto: conftool: update etcd hosts list [puppet] - 10https://gerrit.wikimedia.org/r/222250 [06:13:49] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: update etcd hosts list [puppet] - 10https://gerrit.wikimedia.org/r/222250 (owner: 10Giuseppe Lavagetto) [06:27:57] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 2 06:27:57 UTC 2015 (duration 27m 56s) [06:28:03] Logged the message, Master [06:31:44] (03PS1) 10Jcrespo: Emergency depooling of db2029 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222251 [06:32:22] (03CR) 10Jcrespo: [C: 032] Emergency depooling of db2029 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222251 (owner: 10Jcrespo) [06:32:41] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures [06:34:13] PROBLEM - puppet last run on cp4014 is CRITICAL Puppet has 1 failures [06:34:37] !log jynus Synchronized wmf-config/db-codfw.php: Emergency depool of db2029 (duration: 00m 12s) [06:34:43] Logged the message, Master [06:35:36] kibana is happy now [06:35:51] now the investigation [06:36:01] PROBLEM - puppet last run on wtp2018 is CRITICAL Puppet has 1 failures [06:36:51] PROBLEM - puppet last run on mw2076 is CRITICAL Puppet has 1 failures [06:36:52] PROBLEM - puppet last run on db2018 is CRITICAL Puppet has 1 failures [06:36:52] PROBLEM - puppet last run on db1046 is CRITICAL Puppet has 1 failures [06:37:02] PROBLEM - puppet last run on labvirt1003 is CRITICAL Puppet has 1 failures [06:37:45] and icinga is happy, too [06:40:42] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures [06:40:52] PROBLEM - puppet last run on mw1242 is CRITICAL Puppet has 1 failures [06:41:33] PROBLEM - puppet last run on mw1173 is CRITICAL Puppet has 1 failures [06:42:03] PROBLEM - puppet last run on mw1205 is CRITICAL Puppet has 1 failures [06:43:23] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures [06:46:32] RECOVERY - puppet last run on db1046 is OK Puppet 
is currently enabled, last run 30 seconds ago with 0 failures [06:47:21] RECOVERY - puppet last run on wtp2018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:51] RECOVERY - puppet last run on mw1205 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:47:52] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:21] RECOVERY - puppet last run on mw2076 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:48:22] RECOVERY - puppet last run on db2018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:22] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:48:31] RECOVERY - puppet last run on mw1242 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:48:32] RECOVERY - puppet last run on labvirt1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:12] RECOVERY - puppet last run on mw1173 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:31] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:04] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:32] PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail [07:04:15] (03PS1) 10Krinkle: favicon/touch icon proxy: Fix broken redirect to internal (http) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222253 (https://phabricator.wikimedia.org/T104538) [07:04:43] 6operations, 5Patch-For-Review, 7Regression, 7Varnish: [Regression] /favicon.ico broken. Serves empty 200 OK response from HHVM - https://phabricator.wikimedia.org/T104538#1419855 (10Krinkle) ``` $ curl -I 'https://test.wikipedia.org/favicon.ico' HTTP/1.1 200 OK Server: nginx/1.9.2 Date: Thu, 02 Jul 2015 0... [07:05:03] (03CR) 10Krinkle: [C: 032] "Unbreak favicon.ico" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222253 (https://phabricator.wikimedia.org/T104538) (owner: 10Krinkle) [07:05:09] (03Merged) 10jenkins-bot: favicon/touch icon proxy: Fix broken redirect to internal (http) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222253 (https://phabricator.wikimedia.org/T104538) (owner: 10Krinkle) [07:05:55] !log krinkle Synchronized w/favicon.php: T104538 (duration: 00m 11s) [07:06:01] Logged the message, Master [07:06:17] !log krinkle Synchronized w/touch.php: T104538 (duration: 00m 11s) [07:06:23] Logged the message, Master [07:07:19] 6operations, 7HTTPS, 7Regression: [Regression] /favicon.ico broken. Serves empty 200 OK response from HHVM - https://phabricator.wikimedia.org/T104538#1419861 (10Krinkle) 5Open>3Resolved a:3Krinkle [07:07:49] 6operations, 10vm-requests, 5Patch-For-Review: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1419866 (10akosiaris) >>! In T103604#1418961, @Dzahn wrote: > merged DHCP config (thanks John) > > tried to get console: > > [ganeti1003:~] $ sudo gnt-instance console bromine.eqiad.... [07:08:34] 6operations, 10vm-requests, 5Patch-For-Review: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1419869 (10akosiaris) >>! In T103604#1419079, @Dzahn wrote: > after waiting a bit and restarting it i got console and saw the installer. > then, as with planet1001 at first: > > > `... 
[07:10:03] 6operations, 10vm-requests, 5Patch-For-Review: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1419881 (10akosiaris) >>! In T103604#1419478, @Dzahn wrote: > next attempt the installer finished without these error messages above. after finishing it shuts down the machine. > > af... [07:18:31] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [07:40:17] (03PS3) 10Alexandros Kosiaris: Specify etherpad.wikimedia.org logging [puppet] - 10https://gerrit.wikimedia.org/r/220086 [07:40:19] (03PS5) 10Alexandros Kosiaris: etherpad: Log the incoming request's IP address [puppet] - 10https://gerrit.wikimedia.org/r/220087 [07:40:21] (03PS3) 10Alexandros Kosiaris: etherpad: Move role into module [puppet] - 10https://gerrit.wikimedia.org/r/220085 [07:43:51] (03PS6) 10Muehlenhoff: Allow optional firejail containment for nodejs services. [puppet] - 10https://gerrit.wikimedia.org/r/219177 (https://phabricator.wikimedia.org/T101870) [07:44:44] (03CR) 10Muehlenhoff: [C: 032 V: 032] Allow optional firejail containment for nodejs services. [puppet] - 10https://gerrit.wikimedia.org/r/219177 (https://phabricator.wikimedia.org/T101870) (owner: 10Muehlenhoff) [07:51:10] gitblit dying again :-( [07:53:36] I am wondering whether it can be caused by some robot crawling git.wikimedia.org [07:53:54] there are a lot of CLOSE_WAIT connections on the server [07:55:03] could a root potentially copy antimony.wikimedia.org:/var/log/upstart/gitblit.log to my /home/hashar I would not mind looking at them :-} [08:00:53] (03PS4) 10Alexandros Kosiaris: Specify etherpad.wikimedia.org logging [puppet] - 10https://gerrit.wikimedia.org/r/220086 [08:00:55] (03PS6) 10Alexandros Kosiaris: etherpad: Log the incoming request's IP address [puppet] - 10https://gerrit.wikimedia.org/r/220087 [08:00:57] (03PS4) 10Alexandros Kosiaris: etherpad: Move role into module [puppet] - 10https://gerrit.wikimedia.org/r/220085 [08:02:29] hashar: done (the log is full of endless java tracebacks) [08:03:04] hashar: java tracebacks again :P [08:03:12] hashar: feeling brave ? [08:03:26] akosiaris: always! [08:03:28] moritzm: danke ! [08:03:42] moritzm: and thanks for the java/kernel leap seconds upgrade. Seems Jenkins was all happy yesterday morning [08:04:03] we gotta sort out that gitblit mess. Seems to be barely maintained :-/ [08:04:16] it is not maintained at all [08:04:45] we are just expecting chad to say "all projects have moved to phabricator" so we can kill it [08:05:14] ah yeah maybe that was the .plan [08:05:18] migrate to Differential [08:05:38] I think differential is gerrit's replacement [08:05:49] diffusion is the gitblit replacement [08:06:42] yeah, and https://phabricator.wikimedia.org/T616 is tracking, I think. [08:07:11] hashar: wow :P https://phabricator.wikimedia.org/T73974 [08:07:20] it hasn't been going down for just the last couple of days [08:08:54] (03PS1) 10Matanya: access: grant David Causse deployment rights [puppet] - 10https://gerrit.wikimedia.org/r/222255 [08:09:22] well, if we had event handlers in icinga and ssh icinga users and all the rest of the scaffolding needed, that would be the one service I would be ok to have an event handler issue a "service gitblit restart" on every critical [08:09:41] (03CR) 10jenkins-bot: [V: 04-1] access: grant David Causse deployment rights [puppet] - 10https://gerrit.wikimedia.org/r/222255 (owner: 10Matanya) [08:09:41] it's that much unmaintained [08:10:09] wow, that bad?
:P [08:10:16] so we can ignore gitblit and wait for Diffusion [08:15:06] Could not find class role::salt::minions for etherpad1001.eqiad.wmnet on node etherpad1001.eqiad.wmnet [08:15:10] that's what you get for using roles in modules [08:47:42] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Database: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1420009 (10jcrespo) Maybe dumps of labswiki are failing, too. Can you confirm it @ArielGlenn? I just saw `:real_connect(): (HY000/2003): Can't connect to... [08:52:19] akosiaris: did Faidon delegate ops duty to you for this week? Would have a bunch of puppet changes / debian packages to upload this afternoon if you are available :D [09:30:08] hashar: I don't think so [09:30:22] if he has, then I have not been informed [09:49:21] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0] [09:58:42] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [10:00:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 3 below the confidence bounds [10:01:33] (03PS1) 10Yuvipanda: labstore: Escape grants properly [puppet] - 10https://gerrit.wikimedia.org/r/222265 (https://phabricator.wikimedia.org/T101758) [10:02:32] PROBLEM - puppet last run on labvirt1009 is CRITICAL Puppet has 1 failures [10:08:02] 6operations, 10Traffic, 7discovery-system, 5services-tooling: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1420202 (10Joe) For Varnish switch: I am verifying that all hosts are represented correctly in the generated lists, so far v... [10:12:51] PROBLEM - LVS HTTP IPv4 on restbase.svc.eqiad.wmnet is CRITICAL - Socket timeout after 10 seconds [10:13:37] what's going on? [10:13:46] ? [10:13:52] mobrovac: ? [10:14:09] merde [10:14:33] <_joe_> mobrovac: can we help? [10:14:42] lemme take a look first [10:16:52] <_joe_> it's down since ~ 9.30 UTC according to ganglia's network graph [10:17:31] RECOVERY - puppet last run on labvirt1009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:18:21] RECOVERY - LVS HTTP IPv4 on restbase.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.005 second response time [10:18:42] ? [10:18:54] and it's ok now ? this isn't making sense [10:19:03] what is going on here? [10:19:12] all of the rb processes on all nodes are up [10:19:19] and listening on 7231 [10:19:24] local curls work on them [10:19:39] <_joe_> mobrovac: ok, moving to look at pybal [10:20:52] req/s are around 75 according to grafana.
way less than usual [10:21:04] logstash has nothing useful [10:21:22] well, almost nothing at all to be more precise [10:21:31] <_joe_> so at the moment pybal sees only restbase 1001, 1003 and 1005 as pooled [10:21:37] euh [10:21:38] wtf [10:21:38] wait [10:21:46] rb1006 conn refused on 7231 [10:21:53] <_joe_> 2015-07-02 10:21:39.000603 [restbase_7231 ProxyFetch] restbase1002.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 5.449 s [10:21:56] <_joe_> 2015-07-02 10:21:39.008012 [restbase_7231 ProxyFetch] restbase1006.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 5.454 s [10:21:59] <_joe_> 2015-07-02 10:21:39.009618 [restbase_7231 ProxyFetch] restbase1004.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 5.455 s [10:22:57] <_joe_> proxyfetch.url = ["http://restbase.svc.eqiad.wmnet"] [10:23:12] <_joe_> so it just tries to fetch the / of restbase I guess? [10:24:07] k, rb1006 is good again [10:24:18] <_joe_> what did you do? [10:24:24] _joe_: / works for rb [10:24:45] _joe_: it seems rb <-> cass conn swallowed up something [10:24:45] <_joe_> mobrovac: yeah the problem is it was taking a ludicrous amount of time to fetch that [10:24:58] <_joe_> what can cause such slowness? [10:25:34] <_joe_> so 1 and 6 are ok [10:25:37] <_joe_> restart the others [10:25:53] back-pressure is applied on the nodes, so if all of the workers are "busy" the conn hangs until there's a free one [10:26:02] PROBLEM - LVS HTTP IPv4 on restbase.svc.eqiad.wmnet is CRITICAL - Socket timeout after 10 seconds [10:26:13] doing rb1002 [10:26:17] <_joe_> yeah [10:26:35] <_joe_> because right now the pages are caused by pybal not depooling all the fucked up servers [10:26:58] the local logger logs all show the same issue - cassandra conn gone awry [10:27:14] <_joe_> rb1002 are ok [10:27:21] <_joe_> *is [10:27:38] _joe_: for now. if the problem is in the backend, it will just re-manifest itself [10:27:39] going on rb1004 [10:27:40] <_joe_> restart 3,4,5 [10:27:51] RECOVERY - LVS HTTP IPv4 on restbase.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.022 second response time [10:27:59] <_joe_> akosiaris: well yes, for now I'd like to fight the fire [10:28:09] <_joe_> mobrovac: maybe leave 1005 in this state [10:28:14] <_joe_> so you can look into it [10:28:50] looking at it [10:29:13] _joe_: I get you, I am just saying just restarting the nodejs service might be fuelling the fire [10:29:16] <_joe_> mobrovac: I'll restart the other server [10:30:52] <_joe_> now I could insert here a joke about nonblocking eventloops [10:30:55] <_joe_> :P [10:31:14] ha! [10:31:33] <_joe_> I've always said the mod_php model is the best. throw away everything at every request, and serve as many clients as you can fork() for :P [10:31:34] _joe_: insert it tomorrow, tomorrow's friday [10:32:17] (03PS1) 10Chmarkine: Wikidata - HSTS include subdomains and preload [puppet] - 10https://gerrit.wikimedia.org/r/222270 (https://phabricator.wikimedia.org/T104244) [10:32:42] <_joe_> ok, back to verifying data [10:33:51] so why the hell are we getting alerts only for the LVS IP? [10:34:11] where are the individual restbaseNNNN alerts? [10:34:40] not important right now of course :) [10:34:43] <_joe_> paravoid: nowhere?
[10:34:45] <_joe_> :) [10:35:39] k, a strange thing happened [10:36:17] so, we have a mechanism that if a worker dies for whatever reason, the master respawns it [10:36:21] <_joe_> there will be monitoring once I'm done with https://phabricator.wikimedia.org/T94831, but a simple fetch for the same url pybal calls will help [10:36:28] <_joe_> I'll add it straight away [10:37:00] in this instance, however, all of the workers died due to cassandra timeouts, but for some mysterious reason, they were not respawned [10:37:09] this will need a closer look [10:37:23] <_joe_> yes [10:37:38] <_joe_> I'll work on monitoring at least / [10:38:29] at least we have bunch of events in logstash ( https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase ) [10:39:24] yeah, but they are not helpful and/or indicative in this case [10:39:25] :( [10:39:49] seems to indicate cassandra yields nothing [10:39:57] but you are the pro, forget me :} [10:40:14] haha [10:40:18] hashar: those started showing up after the restarts [10:40:27] the error ones [10:40:32] when we got the page, that view was empty [10:41:15] well, almost empty... there were 25 events IIRC for node heap [10:41:26] maybe the restbase daemon ends up deadlocked / unable to send events [10:48:02] PROBLEM - puppet last run on mw2114 is CRITICAL puppet fail [10:48:22] PROBLEM - puppet last run on mw2083 is CRITICAL puppet fail [10:49:41] PROBLEM - RAID on db1002 is CRITICAL 1 failed LD(s) (Degraded) [10:52:40] (03PS1) 10Giuseppe Lavagetto: restbase: check http connections [puppet] - 10https://gerrit.wikimedia.org/r/222272 [10:56:45] (03PS2) 10Giuseppe Lavagetto: restbase: check http connections [puppet] - 10https://gerrit.wikimedia.org/r/222272 [11:01:09] (03CR) 10Mobrovac: [C: 031] restbase: check http connections [puppet] - 10https://gerrit.wikimedia.org/r/222272 (owner: 10Giuseppe Lavagetto) [11:02:36] (03CR) 10Giuseppe Lavagetto: [C: 032] restbase: check http connections [puppet] - 10https://gerrit.wikimedia.org/r/222272 (owner: 10Giuseppe Lavagetto) [11:02:42] (03CR) 10Mobrovac: [C: 031] cassandra: add team-services for cql failure [puppet] - 10https://gerrit.wikimedia.org/r/222201 (https://phabricator.wikimedia.org/T104467) (owner: 10Filippo Giunchedi) [11:03:13] <_joe_> mobrovac: want me to look at this and merge it in case? ^^ [11:03:40] _joe_: that'd be gr8 given the circumstances [11:03:40] thnx [11:03:57] <_joe_> uhm [11:04:01] <_joe_> that won't work btw [11:04:36] <_joe_> how do you get notifications? [11:04:46] <_joe_> I'll have to look into it [11:05:03] _joe_: we get a mail on services@wm.org [11:07:02] <_joe_> yeah but I guess only for services defined as nagios_critical? 
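The check being added above — a plain fetch of the same root URL that pybal's ProxyFetch polls, failing if it does not come back quickly — amounts to something like the sketch below. The URL and the 10-second timeout are taken from the log lines above; the real plugin is whatever the merged puppet change installs, not this script.

```
"""Minimal sketch of a root-URL health probe, in the spirit of the pybal
ProxyFetch / "Restbase root url" check discussed above. The URL and the
timeout are assumptions taken from this log."""

import sys
import time
import urllib.request

URL = "http://restbase.svc.eqiad.wmnet/"   # assumed probe target
TIMEOUT = 10                               # seconds, as in the pages above


def main() -> int:
    start = time.time()
    try:
        with urllib.request.urlopen(URL, timeout=TIMEOUT) as resp:
            body = resp.read()
            status = resp.status
    except Exception as exc:               # timeout, connection refused, HTTP error, ...
        print(f"CRITICAL: {exc}")
        return 2
    elapsed = time.time() - start
    if status != 200:
        print(f"CRITICAL: HTTP {status} in {elapsed:.3f}s")
        return 2
    print(f"OK: HTTP 200 - {len(body)} bytes in {elapsed:.3f} second response time")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```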
[11:07:46] <_joe_> no sorry, that's just for pages [11:09:48] _joe_: https://gerrit.wikimedia.org/r/#/c/216893/ [11:09:51] (03PS3) 10Giuseppe Lavagetto: cassandra: add team-services for cql failure [puppet] - 10https://gerrit.wikimedia.org/r/222201 (https://phabricator.wikimedia.org/T104467) (owner: 10Filippo Giunchedi) [11:09:56] will ack db1002 because of T103005 [11:10:02] that's all i know [11:10:52] RECOVERY - puppet last run on mw2083 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [11:11:36] (03CR) 10Giuseppe Lavagetto: [C: 032] cassandra: add team-services for cql failure [puppet] - 10https://gerrit.wikimedia.org/r/222201 (https://phabricator.wikimedia.org/T104467) (owner: 10Filippo Giunchedi) [11:12:12] ACKNOWLEDGEMENT - RAID on db1002 is CRITICAL 1 failed LD(s) (Degraded) Jcrespo may be decommed: T103005 [11:12:22] RECOVERY - puppet last run on mw2114 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:13:55] _joe_: mille grazie [11:14:15] <_joe_> mobrovac: di niente [11:14:41] k, i think i have a bit more info on what happened [11:14:47] <_joe_> good [11:14:58] and it seems it's a cassandra node driver bug [11:15:00] hm [11:15:37] so, when requesting / storing something, a local quorum of C* nodes needs to ack that [11:15:56] when the nodes are busy, as it happens, timeouts can happen [11:16:07] that is handled in the driver [11:16:09] all good [11:16:25] but, it seems it expects _at least_ 1 node to answer [11:16:54] and when all of them time out, there's an uncaught exception, causing the process to hang (as it's running in the event loop) [11:16:58] 6operations, 7Database: review eqiad database server quantities / warranties / service(s) - https://phabricator.wikimedia.org/T103936#1420379 (10jcrespo) db1002 (may be decommissioned) also just got a disk failure. Noting it only because I think it is one more of the "old disks" that may not be worth replacin... [11:17:39] mind you, that's the running theory [11:19:08] !log restbase restarting cassandra on rb1005 [11:19:14] Logged the message, Master [11:23:43] <_joe_> mobrovac: maybe an incident report would be good. [11:24:06] ack [11:25:01] <_joe_> just so that we have a canonical place to look at and a list of actionables that nails the two of us to fixing what is needed :) [11:25:15] <_joe_> bbl, lunch [11:25:31] <_joe_> mobrovac: I expect the restbase check to fail on restbase1003 at least [11:25:43] <_joe_> the restbase root url check [11:26:34] _joe_: why on 1003? [11:29:19] PROBLEM - Restbase root url on restbase1003 is CRITICAL - Socket timeout after 10 seconds [11:37:10] ^^ [11:38:40] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [11:42:15] <_joe_> mobrovac: because I see the future! [11:42:31] haha [11:42:33] <_joe_> more seriously, it was failing on pybal :) [11:42:46] so rb1003 doesn't work? [11:42:48] <_joe_> maybe you didn't restart it earlier [11:42:49] * mobrovac checking [11:43:05] <_joe_> yeah the check works :) [11:43:43] {{done}} [11:43:51] RECOVERY - Restbase root url on restbase1003 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.007 second response time [11:45:40] newbie question: how to find a paste in phab? 
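For illustration, the failure mode in the running theory above looks like this with the Python cassandra-driver (RESTBase itself uses the Node.js driver): a LOCAL_QUORUM request that no replica acknowledges in time raises a timeout, and if nothing catches that on a single-threaded event loop, the worker hangs instead of failing just that one request. The contact point, keyspace, table, and columns below are made-up placeholders.

```
"""Sketch of the all-replicas-timed-out case described above, using the
Python cassandra-driver for illustration only."""

from cassandra import ConsistencyLevel, OperationTimedOut, ReadTimeout, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["restbase1001.eqiad.wmnet"])   # assumed contact point
session = cluster.connect("example_ks")           # hypothetical keyspace

# LOCAL_QUORUM: a majority of replicas in the local DC must acknowledge.
stmt = SimpleStatement(
    "INSERT INTO revisions (title, rev, body) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)

try:
    session.execute(stmt, ("Example", 1, "..."), timeout=5.0)
except (WriteTimeout, ReadTimeout, OperationTimedOut) as exc:
    # The case the theory blames: all replicas were too slow, nothing acked.
    # Left uncaught, the equivalent error on an event loop wedges the worker.
    print(f"quorum not reached, degrading gracefully: {exc}")
```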
[11:45:58] <_joe_> no idea [11:46:11] k, found [11:46:23] need to use the "paste" app and then you get the list [11:52:09] (03CR) 10Faidon Liambotis: [C: 04-1] "Second pass :)" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) (owner: 10Yuvipanda) [11:52:27] paravoid: I'm rewriting another perl script as well atm! [11:52:41] which one? [11:53:45] paravoid: maintain-replicas.pl, creates mysql user accounts [11:53:52] good! [11:53:54] paravoid: turns out it depends on local state kept in labstore1001 which is since wiped [11:54:13] lol [11:54:24] paravoid: so I can either fix the perl script to not depend on that state... [11:54:28] paravoid: or rewrite it. [11:54:50] paravoid: I don't fully understand *why* it depends on that state in /var/cache either - it feels like it should be superfluous, but there's no documentation... [11:55:09] I have no idea [11:55:23] me neither [11:55:33] so clearly my first reaction is to rewrite it and document it. [11:55:50] sounds good [11:56:24] legoktm: around ? [11:57:14] (03CR) 10Yuvipanda: labstore: Rewrite of manage-nfs-volumes-daemon (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) (owner: 10Yuvipanda) [11:57:56] paravoid: my usual strategy is for the deamon to crash, and monitoring (which I should add!) catching it, in case of network / JSON errors. [11:58:00] I'll add a check for the gid stuff tho [11:58:14] network errors can very easily happen for whatever reason [11:58:26] silver's apache getting restarted or something [11:59:49] so being a little more resilient to such an event would be a good thing, IMO [12:01:39] paravoid: what exactly can it do? skip a turn? [12:01:53] paravoid: if so, I'd rather have systemd do that with a respawn limit [12:02:28] even if that's the case, catch the exception and exit :) [12:02:49] seeing backtraces in the logs for unrelated network failures isn't nice [12:04:22] hmm, alright. [12:04:41] I generally like actual stacktraces vs a 'helpful message' that needs to be tracked down [12:04:47] but that might be just trauma from my JS days [12:04:48] (03PS2) 10KartikMistry: Beta: Add cxserver::restbase URL [puppet] - 10https://gerrit.wikimedia.org/r/222247 [12:05:16] where this... terrible thing that I was using caught all exceptions including syntax errors and logged a 'error caught!' type generic useless message... [12:05:17] anyway [12:06:24] * mobrovac -> lunch [13:03:04] _joe_: euh, re https://gerrit.wikimedia.org/r/#/c/222272/ completely neglected the fact the putting team_services for alerts would be rather beneficial [13:08:25] (03CR) 10Tobias Gritschacher: Add Phragile module. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [13:15:49] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Package a modern version of etcd for jessie, trusty - https://phabricator.wikimedia.org/T97970#1420663 (10hashar) [13:22:54] (03CR) 10Hashar: "check experimental" [software/conftool] - 10https://gerrit.wikimedia.org/r/221087 (https://phabricator.wikimedia.org/T103972) (owner: 10Hashar) [13:30:59] (03PS1) 10John F. Lewis: remove db-secondary.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222289 [13:33:16] (03PS1) 10John F. Lewis: refresh symlinks to catch new dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222290 [13:33:27] tox for the win !!!! 
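The error-handling strategy agreed on above — treat network and JSON failures from the wikitech API as expected, log one concise line, and exit non-zero so systemd (with a respawn limit) restarts the daemon — could look roughly like this. fetch_projects() and the API URL are hypothetical stand-ins, not the real manage-nfs-volumes code.

```
"""Sketch of the catch-expected-errors-and-exit strategy discussed above."""

import json
import logging
import sys
import urllib.error
import urllib.request

API_URL = "https://wikitech.wikimedia.org/..."   # placeholder, not the real endpoint

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("manage-nfs-volumes")


def fetch_projects():
    """Hypothetical helper: fetch instance/project data as JSON."""
    with urllib.request.urlopen(API_URL, timeout=30) as resp:
        return json.load(resp)


def main() -> int:
    try:
        projects = fetch_projects()
    except (urllib.error.URLError, OSError) as exc:
        # e.g. silver's apache getting restarted: expected, not a bug.
        log.error("could not reach the API: %s", exc)
        return 1
    except json.JSONDecodeError as exc:
        log.error("API returned malformed JSON: %s", exc)
        return 1
    # ... reconcile NFS exports against `projects` here ...
    return 0


if __name__ == "__main__":
    sys.exit(main())
```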
[13:35:13] Krenair: ^ nice quick merge above (dblists) if you have a few seconds :) [13:35:36] _joe_: I am going to drop py34 from the default list of envs :D [13:35:43] maybe fix flake8 and make the job voting [13:36:38] (03PS3) 10Hashar: Setup tox for easy venv [software/conftool] - 10https://gerrit.wikimedia.org/r/221087 (https://phabricator.wikimedia.org/T103972) [13:37:10] (03CR) 10Hashar: "removed py34 from the default list of env. So running `tox` would just do flake8 and py27." [software/conftool] - 10https://gerrit.wikimedia.org/r/221087 (https://phabricator.wikimedia.org/T103972) (owner: 10Hashar) [13:40:17] 6operations, 7Database: investigate performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1420821 (10jcrespo) [13:40:27] (03CR) 10Andrew Bogott: labstore: Rewrite of manage-nfs-volumes-daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) (owner: 10Yuvipanda) [13:40:59] (03PS4) 10Hashar: Setup tox for easy venv [software/conftool] - 10https://gerrit.wikimedia.org/r/221087 (https://phabricator.wikimedia.org/T103972) [13:41:01] (03PS1) 10Hashar: Fix flake8 issues [software/conftool] - 10https://gerrit.wikimedia.org/r/222291 [13:41:18] (03CR) 10Hashar: "flake8 should be fixed with new parent change https://gerrit.wikimedia.org/r/222291" [software/conftool] - 10https://gerrit.wikimedia.org/r/221087 (https://phabricator.wikimedia.org/T103972) (owner: 10Hashar) [13:41:25] (03CR) 10Hashar: "check experimental" [software/conftool] - 10https://gerrit.wikimedia.org/r/221087 (https://phabricator.wikimedia.org/T103972) (owner: 10Hashar) [13:41:32] (03PS1) 10Jcrespo: Repool db2029; depool db2047 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222292 [13:43:04] (03CR) 10Jcrespo: [C: 032] Repool db2029; depool db2047 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222292 (owner: 10Jcrespo) [13:43:10] (03PS9) 10Yuvipanda: labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) [13:43:12] paravoid: ^ updated [13:43:16] (03CR) 10jenkins-bot: [V: 04-1] labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) (owner: 10Yuvipanda) [13:43:49] (03PS6) 10Yuvipanda: labstore: Simplify (and expand!) 
projects-config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/221856 [13:44:05] aaarggh, more merge conflicts [13:44:28] (03CR) 10Hashar: "flake8 is fixed as seen on PS4 of child change https://gerrit.wikimedia.org/r/#/c/221087/" [software/conftool] - 10https://gerrit.wikimedia.org/r/222291 (owner: 10Hashar) [13:44:49] _joe_: I fixed flake8 and proposed a change for CI to make the tox-flake8 job voting :-} [13:46:14] wtf gerrit [13:47:18] (03PS10) 10Yuvipanda: labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) [13:47:22] paravoid: ^ [13:49:26] !log jynus Synchronized wmf-config/db-codfw.php: repool db2029; depool db2047 for maintenance (duration: 00m 13s) [13:49:32] Logged the message, Master [13:50:16] let's keep an eye on kibana now [13:50:39] <_joe_> hashar: thanks [13:51:04] there definitely is something wrong with db2029 [13:54:00] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [13:55:01] PROBLEM - Cassanda CQL query interface on restbase1005 is CRITICAL: Connection refused [13:55:28] on it ^^^ [13:56:30] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review: Preload HSTS - https://phabricator.wikimedia.org/T104244#1420926 (10Chmarkine) [13:58:27] !log restarted restbase1005.eqiad [13:58:33] Logged the message, Master [13:58:40] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review: Preload HSTS - https://phabricator.wikimedia.org/T104244#1420932 (10Chmarkine) [13:58:47] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1420936 (10Qgil) [13:59:18] 6operations, 10Traffic, 7discovery-system, 5services-tooling: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1420945 (10BBlack) I've applied all the custom hardware-based weighting that matters at all levels for nginx/varnish-* pools.... [13:59:31] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [14:00:41] RECOVERY - Cassanda CQL query interface on restbase1005 is OK: TCP OK - 0.000 second response time on port 9042 [14:01:00] YuviPanda: sys.exit(1), not -1 [14:02:29] oh, right. [14:02:40] * YuviPanda wonders where I got my muscle memory from. [14:03:12] (03PS11) 10Yuvipanda: labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) [14:04:03] 6operations, 7Database: codfw frontends cannot connect to db2029 - https://phabricator.wikimedia.org/T104573#1420971 (10jcrespo) 3NEW [14:04:19] (03CR) 10Faidon Liambotis: [C: 032] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) (owner: 10Yuvipanda) [14:04:28] paravoid: woah woah, not yet [14:04:35] I did not merge [14:04:41] right, just realized [14:04:52] (03PS1) 10Jcrespo: Depool db2029 again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222295 [14:05:25] (03CR) 10Jcrespo: [C: 032] Depool db2029 again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222295 (owner: 10Jcrespo) [14:05:33] paravoid: can you +2 the parent commit? :) [14:05:41] https://gerrit.wikimedia.org/r/#/c/221856/6 [14:05:46] not really :) [14:05:56] I have no idea if those gids are correct or not :P [14:06:02] or the whole config, in fact! 
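On the sys.exit(1)-not-minus-1 review note above: POSIX exit statuses are unsigned 8-bit values, so sys.exit(-1) surfaces as 255 rather than the conventional generic-failure code 1. A quick way to see it:

```
# Why sys.exit(1) rather than sys.exit(-1): the exit status is truncated
# to 8 bits, so -1 wraps around to 255.
import subprocess
import sys

rc = subprocess.call([sys.executable, "-c", "import sys; sys.exit(-1)"])
print(rc)   # prints 255, not -1
```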
[14:06:15] hmm [14:06:21] if you're confident enough, feel free to self-merge :) [14:06:25] ok! [14:06:35] https://www.irccloud.com/pastebin/XcgZNVLo/ [14:06:40] used that to generate the gids [14:06:59] hmm, I put a password on that paste but that's already public so eh. [14:07:18] paravoid: so what happens if the gids are wrong? :) [14:07:39] mayhem [14:07:45] take a backup of /etc/exports.d first [14:08:04] well I should say: nothing happens until sync-exports (exportfs) runs [14:08:15] 6operations, 7discovery-system: confctl input-validation and/or no-create - https://phabricator.wikimedia.org/T104574#1420987 (10BBlack) 3NEW a:3Joe [14:08:28] (03PS1) 10Merlijn van Deen: [nfs/toolsbeta] Set toolsbeta config to be like tools [puppet] - 10https://gerrit.wikimedia.org/r/222296 [14:09:06] 6operations, 7discovery-system: conftools: hostname creation validation, set != create - https://phabricator.wikimedia.org/T104574#1420995 (10BBlack) [14:10:12] paravoid: yeah, and if you want me to be doubly careful I can: 1. generate it on a different dir, 2. diff the gids alone with current exports.d, 3. verify they're the same [14:10:42] that works too [14:11:02] paravoid: can you confirm that the systemd unit file as written won't autostart? :) [14:11:08] then I can merge these and do the diff test [14:13:58] !log jynus Synchronized wmf-config/db-codfw.php: depool db2029 again: T104573 (duration: 00m 12s) [14:14:05] Logged the message, Master [14:14:38] I *think* it won't [14:14:49] base::service_unit doesn't seem to be calling Service [14:16:04] (03PS1) 10Andrew Bogott: Change privs for pdns.conf [puppet] - 10https://gerrit.wikimedia.org/r/222297 [14:16:33] (03CR) 10Andrew Bogott: [C: 032] Change privs for pdns.conf [puppet] - 10https://gerrit.wikimedia.org/r/222297 (owner: 10Andrew Bogott) [14:16:38] (03PS2) 10BBlack: Wikidata - HSTS include subdomains and preload [puppet] - 10https://gerrit.wikimedia.org/r/222270 (https://phabricator.wikimedia.org/T104244) (owner: 10Chmarkine) [14:17:02] paravoid: yeah, the other uses of base::service_unit have their own thing. [14:17:16] alright, mergig now! [14:17:21] :) [14:17:22] (03PS7) 10Yuvipanda: labstore: Simplify (and expand!) projects-config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/221856 [14:17:29] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Simplify (and expand!) projects-config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/221856 (owner: 10Yuvipanda) [14:17:37] of course without a sync-export hook, this is an incomplete replacement [14:17:41] indeed [14:17:41] but you know that already, right [14:17:45] yeah [14:18:26] I feel a lot more queasy rewriting syncexports however. [14:18:42] I guess I've a lot of man page reading to catch up on [14:18:54] it's really not that complicated [14:18:59] I can explain if you run into questions [14:19:30] (03PS1) 10Merlijn van Deen: [tools] Add user keys for Tools roots [puppet] - 10https://gerrit.wikimedia.org/r/222298 [14:20:03] YuviPanda: ^ should I make this more general, and just add it next to ssh::userkey as ssh::hierakey, or something like that? 
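The "doubly careful" plan above — generate the new exports into a separate directory, then compare only the gids against the live /etc/exports.d — might look like the sketch below. The per-project file layout and the anongid= option are assumptions about what the generator writes; the regex and the staging path would need adjusting to the real format.

```
"""Rough sketch of the gid check described above: compare gids per export
file between the live directory and a staged one. Format assumptions are
noted inline."""

import re
import sys
from pathlib import Path

GID_RE = re.compile(r"anongid=(\d+)")   # assumed option name in the export entries


def gids(directory: str) -> dict:
    """Map export file name -> set of gids found in it."""
    return {p.name: set(GID_RE.findall(p.read_text()))
            for p in sorted(Path(directory).glob("*"))}


def main() -> int:
    old, new = gids("/etc/exports.d"), gids("/tmp/exports.d.staged")  # staged path is a placeholder
    ok = True
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            ok = False
            print(f"MISMATCH {name}: live={old.get(name)} staged={new.get(name)}")
    return 0 if ok else 1


if __name__ == "__main__":
    sys.exit(main())
```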
[14:20:05] 6operations, 7Database: codfw frontends cannot connect to db2029 - https://phabricator.wikimedia.org/T104573#1421023 (10jcrespo) [14:20:17] let me first test if this works :-p [14:20:17] (03CR) 10jenkins-bot: [V: 04-1] [tools] Add user keys for Tools roots [puppet] - 10https://gerrit.wikimedia.org/r/222298 (owner: 10Merlijn van Deen) [14:20:27] 6operations, 7Database: codfw frontends cannot connect to mysql at db2029 - https://phabricator.wikimedia.org/T104573#1421026 (10jcrespo) [14:21:12] valhallasw`cloud: :) ok. I'd prefer another way, though - current labs keys roots are in labs/private, perhaps have a way to add on to that from hiera. [14:21:30] valhallasw`cloud: we already have multiple ways of auth, don't want to add another (the admin module, LDAP, root keys) [14:24:09] (03PS3) 10BBlack: Wikidata - HSTS include subdomains and preload [puppet] - 10https://gerrit.wikimedia.org/r/222270 (https://phabricator.wikimedia.org/T104244) (owner: 10Chmarkine) [14:24:34] YuviPanda: unfortunately, 'we' does not include me [14:24:45] (03CR) 10BBlack: [C: 032] "But, please don't submit. We'll do that after it's rolled out everywhere." [puppet] - 10https://gerrit.wikimedia.org/r/222270 (https://phabricator.wikimedia.org/T104244) (owner: 10Chmarkine) [14:24:51] (03CR) 10BBlack: [V: 032] "But, please don't submit. We'll do that after it's rolled out everywhere." [puppet] - 10https://gerrit.wikimedia.org/r/222270 (https://phabricator.wikimedia.org/T104244) (owner: 10Chmarkine) [14:25:04] valhallasw`cloud: you know labs-private isn't actually private, right? [14:25:07] it's just a normal git repo [14:25:19] anyway, I've broken puppet on all labs hosts again [14:25:20] wheee [14:25:24] let me fix that first. [14:27:51] (03CR) 10Andrew Bogott: "I can live with this but I haaaaaate that it relies on wikitech as a go-between for instance data. Ldap was better." [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) (owner: 10Yuvipanda) [14:28:29] !log restarted apache on labcontrol1001 [14:28:36] Logged the message, Master [14:28:45] poof, at least that wasn't my patch that broke it [14:29:26] YuviPanda: labs-private only provides for labs-wide roots. modules/admin looks like it's prod-only, so only LDAP remains, as far as I can see? [14:29:45] valhallasw`cloud: indeed, which is why I said 'we should allow per-project roots to be made available via hiera' [14:29:54] (03PS1) 10Faidon Liambotis: mediawiki: remove HSTS from donate's Apache config [puppet] - 10https://gerrit.wikimedia.org/r/222299 [14:30:12] anyway, I'm going to pointedly ignore this until I get the NFS stuff rolled out. [14:30:24] sorry. but I do agree that we need out of band access for people that doesn't depend on LDAP [14:30:32] and addable via hiera. [14:30:54] YuviPanda: I don't see what your 'other way' would be, other than what I just wrote [14:31:03] * YuviPanda ignores :P [14:31:17] (03PS12) 10Yuvipanda: labstore: Rewrite of manage-nfs-volumes-daemon [puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) [14:31:50] (03CR) 10BBlack: [C: 031] mediawiki: remove HSTS from donate's Apache config [puppet] - 10https://gerrit.wikimedia.org/r/222299 (owner: 10Faidon Liambotis) [14:32:25] (03CR) 10Yuvipanda: [C: 032] "Poof." 
[puppet] - 10https://gerrit.wikimedia.org/r/221872 (https://phabricator.wikimedia.org/T102782) (owner: 10Yuvipanda) [14:35:25] (03PS1) 10BBlack: HSTS: increase to 1y, do not allow applayer override [puppet] - 10https://gerrit.wikimedia.org/r/222301 [14:36:16] (03PS2) 10BBlack: HSTS: increase to 1y, do not allow applayer override [puppet] - 10https://gerrit.wikimedia.org/r/222301 (https://phabricator.wikimedia.org/T40516) [14:36:30] PROBLEM - puppetmaster https on palladium is CRITICAL - Socket timeout after 10 seconds [14:37:10] (03PS1) 10Hashar: Create proper logger for tests [software/conftool] - 10https://gerrit.wikimedia.org/r/222302 [14:37:16] wtf puppet [14:37:37] (03CR) 10Hashar: "check experimental" [software/conftool] - 10https://gerrit.wikimedia.org/r/222302 (owner: 10Hashar) [14:38:21] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 9.070 second response time [14:38:24] (03PS2) 10Faidon Liambotis: mediawiki: remove HSTS from donate's Apache config [puppet] - 10https://gerrit.wikimedia.org/r/222299 [14:38:26] (03PS1) 10Faidon Liambotis: contint: remove doc.mediawiki.org Apache vhost [puppet] - 10https://gerrit.wikimedia.org/r/222303 [14:38:43] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mediawiki: remove HSTS from donate's Apache config [puppet] - 10https://gerrit.wikimedia.org/r/222299 (owner: 10Faidon Liambotis) [14:38:53] (03CR) 10Faidon Liambotis: [C: 032 V: 032] contint: remove doc.mediawiki.org Apache vhost [puppet] - 10https://gerrit.wikimedia.org/r/222303 (owner: 10Faidon Liambotis) [14:40:00] PROBLEM - puppet last run on mw1155 is CRITICAL puppet fail [14:40:01] PROBLEM - puppet last run on wtp1007 is CRITICAL Puppet has 11 failures [14:40:01] PROBLEM - puppet last run on mw1135 is CRITICAL puppet fail [14:40:01] PROBLEM - puppet last run on mw1110 is CRITICAL puppet fail [14:40:01] PROBLEM - puppet last run on tmh1001 is CRITICAL puppet fail [14:40:10] PROBLEM - puppet last run on wtp2016 is CRITICAL puppet fail [14:40:11] PROBLEM - puppet last run on cp4011 is CRITICAL puppet fail [14:40:11] PROBLEM - puppet last run on elastic1029 is CRITICAL puppet fail [14:40:20] PROBLEM - puppet last run on cp2008 is CRITICAL puppet fail [14:40:20] PROBLEM - puppet last run on analytics1001 is CRITICAL Puppet has 24 failures [14:40:21] PROBLEM - puppet last run on wtp1008 is CRITICAL puppet fail [14:40:21] PROBLEM - puppet last run on restbase1003 is CRITICAL puppet fail [14:40:30] PROBLEM - puppet last run on cp2002 is CRITICAL Puppet has 6 failures [14:40:31] PROBLEM - puppet last run on mw2207 is CRITICAL puppet fail [14:40:31] PROBLEM - puppet last run on wtp2017 is CRITICAL puppet fail [14:40:31] PROBLEM - puppet last run on mw2067 is CRITICAL puppet fail [14:40:31] PROBLEM - puppet last run on db2010 is CRITICAL puppet fail [14:40:31] PROBLEM - puppet last run on mw1018 is CRITICAL puppet fail [14:40:32] PROBLEM - puppet last run on mw1104 is CRITICAL puppet fail [14:40:32] PROBLEM - puppet last run on db1056 is CRITICAL puppet fail [14:40:41] PROBLEM - puppet last run on db2017 is CRITICAL puppet fail [14:40:51] PROBLEM - puppet last run on mw2120 is CRITICAL puppet fail [14:40:51] PROBLEM - puppet last run on mc2007 is CRITICAL Puppet has 8 failures [14:40:51] PROBLEM - puppet last run on cp4009 is CRITICAL Puppet has 2 failures [14:40:52] PROBLEM - puppet last run on cp1067 is CRITICAL Puppet has 8 failures [14:40:52] PROBLEM - puppet last run on mw1171 is CRITICAL Puppet has 12 failures [14:40:52] PROBLEM - puppet 
last run on db1027 is CRITICAL puppet fail [14:41:00] PROBLEM - puppet last run on mw2047 is CRITICAL puppet fail [14:41:00] PROBLEM - puppet last run on analytics1031 is CRITICAL puppet fail [14:41:01] PROBLEM - puppet last run on graphite1001 is CRITICAL Puppet has 9 failures [14:41:10] PROBLEM - puppet last run on mw1053 is CRITICAL Puppet has 25 failures [14:41:11] PROBLEM - puppet last run on cp1054 is CRITICAL Puppet has 25 failures [14:41:11] PROBLEM - puppet last run on mw2176 is CRITICAL puppet fail [14:41:11] PROBLEM - puppet last run on mw2084 is CRITICAL puppet fail [14:41:11] PROBLEM - puppet last run on rdb2001 is CRITICAL Puppet has 5 failures [14:41:20] PROBLEM - puppet last run on db2050 is CRITICAL Puppet has 6 failures [14:41:21] PROBLEM - puppet last run on cp4012 is CRITICAL Puppet has 4 failures [14:41:21] PROBLEM - puppet last run on cp3048 is CRITICAL puppet fail [14:41:22] PROBLEM - puppet last run on cp3047 is CRITICAL Puppet has 3 failures [14:41:22] PROBLEM - puppet last run on cp3038 is CRITICAL puppet fail [14:41:22] PROBLEM - puppet last run on cp3021 is CRITICAL Puppet has 1 failures [14:41:29] (03PS1) 10Yuvipanda: labstore: Rename and sacrifice to Lord Puppet [puppet] - 10https://gerrit.wikimedia.org/r/222304 [14:41:31] PROBLEM - puppet last run on db2056 is CRITICAL puppet fail [14:41:31] PROBLEM - puppet last run on es2010 is CRITICAL puppet fail [14:41:40] PROBLEM - puppet last run on ms-be1015 is CRITICAL puppet fail [14:41:41] PROBLEM - puppet last run on mw2078 is CRITICAL puppet fail [14:41:41] PROBLEM - puppet last run on mw2038 is CRITICAL Puppet has 6 failures [14:41:41] PROBLEM - puppet last run on mw1239 is CRITICAL Puppet has 24 failures [14:41:41] PROBLEM - puppet last run on mw1154 is CRITICAL puppet fail [14:41:41] PROBLEM - puppet last run on cp3018 is CRITICAL Puppet has 2 failures [14:41:42] PROBLEM - puppet last run on mw1188 is CRITICAL Puppet has 24 failures [14:41:42] PROBLEM - puppet last run on mw1128 is CRITICAL puppet fail [14:41:43] PROBLEM - puppet last run on mw2137 is CRITICAL puppet fail [14:41:43] PROBLEM - puppet last run on mw2195 is CRITICAL puppet fail [14:41:44] PROBLEM - puppet last run on mw2158 is CRITICAL puppet fail [14:41:44] PROBLEM - puppet last run on baham is CRITICAL Puppet has 5 failures [14:41:45] woooo [14:41:46] was that me? 
[14:41:50] PROBLEM - puppet last run on cp4016 is CRITICAL puppet fail [14:41:50] PROBLEM - puppet last run on elastic1005 is CRITICAL Puppet has 13 failures [14:41:50] PROBLEM - puppet last run on mw1253 is CRITICAL puppet fail [14:41:50] PROBLEM - puppet last run on cp3039 is CRITICAL Puppet has 3 failures [14:41:51] PROBLEM - puppet last run on analytics1014 is CRITICAL Puppet has 21 failures [14:42:01] PROBLEM - puppet last run on cp1045 is CRITICAL Puppet has 28 failures [14:42:02] PROBLEM - puppet last run on radon is CRITICAL Puppet has 7 failures [14:42:10] PROBLEM - puppet last run on cp1053 is CRITICAL Puppet has 17 failures [14:42:10] PROBLEM - puppet last run on db1064 is CRITICAL Puppet has 23 failures [14:42:10] PROBLEM - puppet last run on analytics1011 is CRITICAL Puppet has 17 failures [14:42:10] PROBLEM - puppet last run on cp2016 is CRITICAL Puppet has 10 failures [14:42:21] PROBLEM - puppet last run on wtp2006 is CRITICAL Puppet has 6 failures [14:42:21] PROBLEM - puppet last run on cp2011 is CRITICAL Puppet has 6 failures [14:42:27] ok, that's all codfw [14:42:31] PROBLEM - puppet last run on cp1068 is CRITICAL Puppet has 12 failures [14:42:51] it's not all codfw [14:42:58] seems like puppetmaster failure [14:43:01] PROBLEM - puppet last run on mw1210 is CRITICAL Puppet has 26 failures [14:43:10] PROBLEM - puppet last run on pc1003 is CRITICAL Puppet has 3 failures [14:43:11] PROBLEM - puppet last run on mw2189 is CRITICAL Puppet has 6 failures [14:43:11] PROBLEM - puppet last run on elastic1017 is CRITICAL Puppet has 13 failures [14:43:12] PROBLEM - puppet last run on rdb1003 is CRITICAL Puppet has 5 failures [14:43:12] PROBLEM - puppet last run on virt1008 is CRITICAL Puppet has 12 failures [14:43:19] yep, also 1s and 4s [14:43:20] PROBLEM - puppet last run on db1030 is CRITICAL Puppet has 5 failures [14:43:21] PROBLEM - puppet last run on lvs2003 is CRITICAL Puppet has 7 failures [14:43:21] PROBLEM - puppet last run on mw2065 is CRITICAL Puppet has 6 failures [14:43:22] yeah [14:43:30] PROBLEM - puppet last run on db1072 is CRITICAL Puppet has 12 failures [14:43:31] PROBLEM - puppet last run on es1010 is CRITICAL Puppet has 3 failures [14:43:31] PROBLEM - puppet last run on mw2201 is CRITICAL Puppet has 5 failures [14:43:32] PROBLEM - puppet last run on db2012 is CRITICAL Puppet has 3 failures [14:43:32] PROBLEM - puppet last run on db1011 is CRITICAL Puppet has 6 failures [14:43:35] !log kicked puppetmaster on palladium [14:43:42] PROBLEM - puppet last run on mc1018 is CRITICAL Puppet has 6 failures [14:43:42] Logged the message, Master [14:43:43] PROBLEM - puppet last run on mw1074 is CRITICAL Puppet has 18 failures [14:43:43] PROBLEM - puppet last run on mw1023 is CRITICAL Puppet has 15 failures [14:43:52] root 29479 1 0 14:36 ? 
00:00:00 /usr/sbin/apache2 -k start [14:44:05] PROBLEM - puppet last run on strontium is CRITICAL Puppet has 15 failures [14:44:05] PROBLEM - puppet last run on ms-fe3001 is CRITICAL Puppet has 2 failures [14:44:05] PROBLEM - puppet last run on db1045 is CRITICAL Puppet has 2 failures [14:44:05] PROBLEM - puppet last run on ms-be1011 is CRITICAL Puppet has 10 failures [14:44:05] PROBLEM - puppet last run on db1035 is CRITICAL Puppet has 6 failures [14:44:08] it had already just been restarted at :36, which is probably related or causal [14:44:11] PROBLEM - puppet last run on analytics1003 is CRITICAL Puppet has 4 failures [14:44:11] PROBLEM - puppet last run on db1038 is CRITICAL Puppet has 5 failures [14:44:20] PROBLEM - puppet last run on db2053 is CRITICAL Puppet has 4 failures [14:44:20] PROBLEM - puppet last run on mw2167 is CRITICAL Puppet has 6 failures [14:44:20] PROBLEM - puppet last run on mw2155 is CRITICAL Puppet has 6 failures [14:44:20] PROBLEM - puppet last run on wtp2013 is CRITICAL Puppet has 5 failures [14:44:20] PROBLEM - puppet last run on mw2116 is CRITICAL Puppet has 6 failures [14:44:21] PROBLEM - puppet last run on achernar is CRITICAL Puppet has 4 failures [14:44:21] PROBLEM - puppet last run on db2030 is CRITICAL Puppet has 4 failures [14:44:22] PROBLEM - puppet last run on db2028 is CRITICAL Puppet has 7 failures [14:44:22] PROBLEM - puppet last run on elastic1003 is CRITICAL Puppet has 1 failures [14:44:23] PROBLEM - puppet last run on db1070 is CRITICAL Puppet has 7 failures [14:44:23] PROBLEM - puppet last run on mw1034 is CRITICAL Puppet has 14 failures [14:44:24] PROBLEM - puppet last run on labvirt1008 is CRITICAL Puppet has 11 failures [14:44:30] PROBLEM - puppet last run on analytics1027 is CRITICAL Puppet has 3 failures [14:44:30] PROBLEM - puppet last run on db1037 is CRITICAL Puppet has 3 failures [14:44:30] PROBLEM - puppet last run on db2058 is CRITICAL Puppet has 1 failures [14:44:30] PROBLEM - puppet last run on labvirt1002 is CRITICAL Puppet has 6 failures [14:44:31] PROBLEM - puppet last run on wtp2020 is CRITICAL Puppet has 7 failures [14:44:31] PROBLEM - puppet last run on mw2027 is CRITICAL Puppet has 6 failures [14:44:31] PROBLEM - puppet last run on wtp1001 is CRITICAL Puppet has 9 failures [14:44:32] PROBLEM - puppet last run on db2011 is CRITICAL Puppet has 6 failures [14:44:32] PROBLEM - puppet last run on mw2068 is CRITICAL Puppet has 6 failures [14:44:33] PROBLEM - puppet last run on db1061 is CRITICAL Puppet has 8 failures [14:44:57] bblack: hmm, I think new runs are succeeding now, but just really really slow [14:45:00] PROBLEM - puppet last run on db1033 is CRITICAL Puppet has 4 failures [14:45:01] PROBLEM - puppet last run on labvirt1004 is CRITICAL Puppet has 2 failures [14:45:01] PROBLEM - puppet last run on db2052 is CRITICAL Puppet has 7 failures [14:45:02] YuviPanda: what was "kicked"? the :36 one or just then? 
[14:45:10] PROBLEM - puppet last run on analytics1018 is CRITICAL Puppet has 10 failures [14:45:11] PROBLEM - puppet last run on db2054 is CRITICAL Puppet has 6 failures [14:45:11] PROBLEM - puppet last run on ocg1003 is CRITICAL Puppet has 2 failures [14:45:11] PROBLEM - puppet last run on wtp1003 is CRITICAL Puppet has 5 failures [14:45:11] PROBLEM - puppet last run on db1047 is CRITICAL Puppet has 11 failures [14:45:20] PROBLEM - puppet last run on mc1017 is CRITICAL Puppet has 9 failures [14:45:20] PROBLEM - puppet last run on mw1029 is CRITICAL Puppet has 17 failures [14:45:21] PROBLEM - puppet last run on iridium is CRITICAL Puppet has 3 failures [14:45:21] PROBLEM - puppet last run on chromium is CRITICAL Puppet has 9 failures [14:45:21] PROBLEM - puppet last run on einsteinium is CRITICAL Puppet has 9 failures [14:45:22] (because they still started at :36, not :43) [14:45:30] PROBLEM - puppet last run on lvs1004 is CRITICAL Puppet has 8 failures [14:45:30] PROBLEM - puppet last run on mw2193 is CRITICAL Puppet has 7 failures [14:45:30] PROBLEM - puppet last run on ms-be1002 is CRITICAL Puppet has 9 failures [14:45:31] PROBLEM - puppet last run on mw1198 is CRITICAL Puppet has 18 failures [14:45:31] PROBLEM - puppet last run on db2060 is CRITICAL Puppet has 4 failures [14:45:35] bblack: that was me too. [14:45:40] PROBLEM - puppet last run on ms-fe1002 is CRITICAL Puppet has 9 failures [14:45:41] PROBLEM - puppet last run on labvirt1006 is CRITICAL Puppet has 5 failures [14:45:48] so this is from the earlier restart? [14:45:50] well nothing happened at :43, just wondering what was supposed to happen :) [14:45:51] PROBLEM - puppet last run on labstore2001 is CRITICAL puppet fail [14:46:01] PROBLEM - puppet last run on lvs4004 is CRITICAL Puppet has 3 failures [14:46:11] PROBLEM - puppet last run on wtp2011 is CRITICAL Puppet has 10 failures [14:46:11] bblack: uh, apparntly I didn't press enter... [14:46:21] PROBLEM - puppet last run on mw2187 is CRITICAL Puppet has 6 failures [14:46:31] PROBLEM - puppet last run on mw2186 is CRITICAL Puppet has 4 failures [14:46:31] PROBLEM - puppet last run on analytics1004 is CRITICAL Puppet has 7 failures [14:46:31] bblack: should I still restart? I see puppet runs succeeding now. [14:46:50] probably not [14:46:51] PROBLEM - puppet last run on mw2140 is CRITICAL Puppet has 6 failures [14:46:53] yeah [14:47:06] did previous puppetmaster restarts have this much of an effect? [14:47:11] PROBLEM - puppet last run on mw2058 is CRITICAL Puppet has 5 failures [14:47:21] PROBLEM - puppet last run on mw2026 is CRITICAL Puppet has 6 failures [14:47:30] (03PS2) 10Merlijn van Deen: [tools] Add user keys for Tools roots [puppet] - 10https://gerrit.wikimedia.org/r/222298 [14:47:41] PROBLEM - puppet last run on mw2144 is CRITICAL Puppet has 6 failures [14:47:50] (03CR) 10Merlijn van Deen: "Tested to work on Toolsbeta." 
[puppet] - 10https://gerrit.wikimedia.org/r/222298 (owner: 10Merlijn van Deen) [14:47:51] PROBLEM - puppet last run on mw1108 is CRITICAL Puppet has 5 failures [14:47:51] PROBLEM - puppet last run on mw1033 is CRITICAL Puppet has 9 failures [14:48:01] PROBLEM - puppet last run on mw2204 is CRITICAL Puppet has 10 failures [14:48:02] PROBLEM - puppet last run on mw2044 is CRITICAL Puppet has 4 failures [14:48:10] PROBLEM - puppet last run on mw1186 is CRITICAL Puppet has 17 failures [14:48:30] PROBLEM - puppet last run on mw1243 is CRITICAL Puppet has 2 failures [14:48:31] PROBLEM - puppet last run on mw2008 is CRITICAL Puppet has 2 failures [14:48:41] PROBLEM - puppet last run on mw1087 is CRITICAL Puppet has 1 failures [14:48:41] PROBLEM - puppet last run on mw1001 is CRITICAL Puppet has 5 failures [14:48:41] PROBLEM - puppet last run on mw1032 is CRITICAL Puppet has 2 failures [14:48:50] PROBLEM - puppet last run on mw2210 is CRITICAL Puppet has 11 failures [14:48:51] PROBLEM - puppet last run on mw1043 is CRITICAL Puppet has 6 failures [14:48:51] PROBLEM - puppet last run on mw1010 is CRITICAL Puppet has 3 failures [14:48:51] PROBLEM - puppet last run on mw1185 is CRITICAL Puppet has 7 failures [14:49:01] PROBLEM - puppet last run on mw2139 is CRITICAL Puppet has 7 failures [14:49:01] PROBLEM - puppet last run on mw2081 is CRITICAL Puppet has 5 failures [14:49:14] PROBLEM - puppet last run on mw2153 is CRITICAL Puppet has 6 failures [14:49:20] <_joe_> what the hell is going on? [14:49:20] PROBLEM - puppet last run on mw1022 is CRITICAL Puppet has 7 failures [14:49:21] PROBLEM - puppet last run on mw1193 is CRITICAL Puppet has 2 failures [14:49:26] <_joe_> puppetmaster down? [14:49:28] I'm going to kill icinga-wm [14:49:31] PROBLEM - puppet last run on mw1220 is CRITICAL Puppet has 6 failures [14:49:31] PROBLEM - puppet last run on mw1143 is CRITICAL Puppet has 9 failures [14:49:31] PROBLEM - puppet last run on mw1223 is CRITICAL Puppet has 3 failures [14:49:31] PROBLEM - puppet last run on mw1241 is CRITICAL Puppet has 3 failures [14:49:32] PROBLEM - puppet last run on mw1219 is CRITICAL Puppet has 3 failures [14:49:32] PROBLEM - puppet last run on mw1231 is CRITICAL Puppet has 4 failures [14:49:34] <_joe_> no [14:49:40] PROBLEM - puppet last run on fluorine is CRITICAL Puppet has 6 failures [14:49:41] PROBLEM - puppet last run on mw2091 is CRITICAL Puppet has 5 failures [14:49:41] PROBLEM - puppet last run on mw1148 is CRITICAL Puppet has 3 failures [14:49:45] _joe_: I restarted it earlier and that's caused these failures that are being spewed out. [14:49:50] PROBLEM - puppet last run on mw2087 is CRITICAL Puppet has 7 failures [14:49:50] PROBLEM - puppet last run on mw2064 is CRITICAL Puppet has 8 failures [14:49:50] PROBLEM - puppet last run on mw2031 is CRITICAL Puppet has 8 failures [14:49:50] PROBLEM - puppet last run on mw2007 is CRITICAL Puppet has 2 failures [14:49:51] PROBLEM - puppet last run on mw1229 is CRITICAL Puppet has 3 failures [14:49:51] PROBLEM - puppet last run on mw1201 is CRITICAL Puppet has 6 failures [14:49:51] PROBLEM - puppet last run on mw1077 is CRITICAL Puppet has 1 failures [14:49:51] <_joe_> oh ok [14:49:52] PROBLEM - puppet last run on mw1215 is CRITICAL Puppet has 12 failures [14:49:52] PROBLEM - puppet last run on mw2188 is CRITICAL Puppet has 3 failures [14:49:52] _joe_: puppet runs are succeeding now. 
[14:49:53] PROBLEM - puppet last run on mw2159 is CRITICAL Puppet has 8 failures [14:49:53] PROBLEM - puppet last run on mw2148 is CRITICAL Puppet has 2 failures [14:50:00] PROBLEM - puppet last run on mw2010 is CRITICAL Puppet has 2 failures [14:50:00] PROBLEM - puppet last run on mw2071 is CRITICAL Puppet has 3 failures [14:50:01] PROBLEM - puppet last run on mw1091 is CRITICAL Puppet has 3 failures [14:50:01] PROBLEM - puppet last run on mw1016 is CRITICAL Puppet has 1 failures [14:50:09] !log killed icinga-wm for a bit [14:50:15] Logged the message, Master [14:50:23] am tailing logs as penance [14:50:50] (03CR) 10Yuvipanda: [C: 032] labstore: Rename and sacrifice to Lord Puppet [puppet] - 10https://gerrit.wikimedia.org/r/222304 (owner: 10Yuvipanda) [14:54:49] !log restarted apache on strontium [14:54:55] Logged the message, Master [14:54:59] YuviPanda , strontium was still broken [14:55:04] mutante: aha! [14:55:05] i supposed you were on palladium [14:55:07] yeah [14:55:11] (03PS1) 10Ottomata: Oozie sharelib is no longer a gz [puppet/cdh] - 10https://gerrit.wikimedia.org/r/222308 [14:55:19] I see recoveries now [14:55:24] it was mod_passenger crash [14:55:42] (03PS1) 10Yuvipanda: labstore: Fix implicit cyclic dependency [puppet] - 10https://gerrit.wikimedia.org/r/222309 [14:55:59] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Fix implicit cyclic dependency [puppet] - 10https://gerrit.wikimedia.org/r/222309 (owner: 10Yuvipanda) [14:56:58] !log restarting gitblit on antimony for the 123443th time [14:57:05] Logged the message, Master [14:57:46] Do we know why it's failing? [14:57:52] (gitblit) [14:58:10] it's terrible software. [14:58:17] it was down for 9 hours this time [14:58:26] also new ticket about it again [14:59:30] paravoid: whoops, base::service_unit *does* have a service call... [14:59:35] ensure_resource('service',$name, $params) [14:59:42] and so the daemon already ran. [14:59:43] oh well [15:00:03] did you take a backup? :) [15:00:04] manybubbles anomie ostriches marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150702T1500). [15:00:04] jzerebecki: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:07] of exports.d? [15:00:12] paravoid: yes [15:00:16] <_joe_> mutante: I was sure it was the 123453th [15:00:18] awesome [15:00:39] paravoid: :) preparing a dirty script to do the diffing now [15:00:43] _joe_: my bad, 3rd, not 3th :p [15:00:51] <_joe_> eheh [15:00:53] o/ [15:01:04] YuviPanda: just diff -Nurp :) [15:01:07] (03CR) 10Ottomata: [C: 032] Oozie sharelib is no longer a gz [puppet/cdh] - 10https://gerrit.wikimedia.org/r/222308 (owner: 10Ottomata) [15:01:21] paravoid: that'll diff the instance IPs too, which are in different ordering, no? [15:01:29] * YuviPanda tries anyway [15:01:43] I guess so [15:01:48] jzerebecki: It looks like your patch is against the wrong repo. mediawiki/extensions/Wiki*base* isn't deployed on WMF wikis, we use mediawiki/extensions/Wiki*data* [15:01:51] yeah, that's what it does. 
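A note on the exports.d comparison above: each export entry lists the same client IPs in a different order, so a plain `diff -Nurp` of the /root/exports.backup copy against the regenerated /etc/exports.d reports spurious changes. A minimal sketch of the "normalize, then compare" idea follows; it assumes exports(5)-style lines of the form "<path> client1(opts) client2(opts) ...", and the directory arguments are illustrative rather than a statement of where the backups actually live.

```
#!/usr/bin/env python
"""Sketch: diff two exports.d trees while ignoring client ordering."""
import difflib
import os
import sys


def normalized_lines(dirname):
    """Yield '<path> <sorted clients>' for every export line in a directory."""
    for fname in sorted(os.listdir(dirname)):
        with open(os.path.join(dirname, fname)) as f:
            for line in f:
                parts = line.split()
                if not parts or parts[0].startswith('#'):
                    continue
                # parts[0] is the exported path, the rest are client specs
                yield '%s %s' % (parts[0], ' '.join(sorted(parts[1:])))


if __name__ == '__main__':
    old = sorted(normalized_lines(sys.argv[1]))  # e.g. /root/exports.backup
    new = sorted(normalized_lines(sys.argv[2]))  # e.g. /etc/exports.d
    for line in difflib.unified_diff(old, new, 'old', 'new', lineterm=''):
        print(line)
```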
[15:02:03] anomie: i updated the link [15:02:09] still easy enough - make copies, rip out instance ips, and diff [15:02:44] anomie: to https://gerrit.wikimedia.org/r/#/c/222311/ [15:02:49] jzerebecki: ok [15:04:02] 6operations, 7Monitoring, 5Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1421122 (10Dzahn) analytics clusters also switched now https://gerrit.wikimedia.org/r/#/c/222153/ [15:04:16] (03PS2) 10Hashar: tests: create proper loggers [software/conftool] - 10https://gerrit.wikimedia.org/r/222302 [15:04:18] (03PS1) 10Hashar: tests: catch KVObject.setup() SystemExit [software/conftool] - 10https://gerrit.wikimedia.org/r/222313 [15:04:20] andrewbogott: (switching channels) [15:04:26] andrewbogott: not yet. we haven't done the syncfs call yet - am verifying that the gids are all correct [15:04:31] ok [15:04:35] (03CR) 10Hashar: "check experimental" [software/conftool] - 10https://gerrit.wikimedia.org/r/222313 (owner: 10Hashar) [15:04:39] (03PS1) 10Muehlenhoff: Convert firewall resource declarations to an include for consistency [puppet] - 10https://gerrit.wikimedia.org/r/222314 [15:04:51] andrewbogott: should be done shortly, and then I can probably do a syncfs by hand, and then figure out how to hook that on automatically [15:04:52] Also — does your switch to a .yaml file in puppet affect how an instance determines whether or not to mount a given volume? [15:05:20] andrewbogott: yes, hiera is ineffective now. [15:05:34] the yaml file is single source of truth [15:05:35] Does if mount_nfs_volume($::instanceproject, 'scratch') still work? [15:05:41] andrewbogott: yes [15:05:45] that hooks into the yaml file [15:05:45] ok :) [15:10:00] (03PS1) 10Merlijn van Deen: [nfs/toolsbeta] re-set NFS mounts for toolsbeta [puppet] - 10https://gerrit.wikimedia.org/r/222315 [15:10:17] YuviPanda: ^ [15:10:37] (03Abandoned) 10Merlijn van Deen: [nfs/toolsbeta] Set toolsbeta config to be like tools [puppet] - 10https://gerrit.wikimedia.org/r/222296 (owner: 10Merlijn van Deen) [15:11:18] (03PS2) 10Yuvipanda: labstore: enable NFS mounts for toolsbeta [puppet] - 10https://gerrit.wikimedia.org/r/222315 (owner: 10Merlijn van Deen) [15:11:55] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: enable NFS mounts for toolsbeta [puppet] - 10https://gerrit.wikimedia.org/r/222315 (owner: 10Merlijn van Deen) [15:12:10] valhallasw`cloud: ^ done [15:12:15] thaqnks [15:14:20] RECOVERY - puppet last run on mw1148 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:21] PROBLEM - puppet last run on mw2062 is CRITICAL puppet fail [15:14:22] PROBLEM - puppet last run on mw2059 is CRITICAL puppet fail [15:14:22] RECOVERY - puppet last run on mw1229 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:31] RECOVERY - puppet last run on mw2204 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:50] RECOVERY - puppet last run on mw2209 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:14:50] RECOVERY - puppet last run on mw2068 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:51] RECOVERY - puppet last run on mw2186 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:51] RECOVERY - puppet last run on mw1243 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:51] PROBLEM - puppet last run on mc2013 is CRITICAL Puppet has 2 failures [15:14:52] RECOVERY - puppet last run on cp1067 is OK 
Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:15:01] RECOVERY - puppet last run on mw1105 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:15:01] PROBLEM - puppet last run on ms-be1018 is CRITICAL puppet fail [15:15:21] RECOVERY - puppet last run on mw2210 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:15:21] RECOVERY - puppet last run on mw1043 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:15:21] RECOVERY - puppet last run on mw1010 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:15:21] RECOVERY - puppet last run on mw2175 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:15:31] PROBLEM - puppet last run on cp3031 is CRITICAL Puppet has 1 failures [15:15:41] RECOVERY - puppet last run on mw1167 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [15:15:42] RECOVERY - puppet last run on mw2153 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:42] RECOVERY - puppet last run on mw2193 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:51] RECOVERY - puppet last run on mw1122 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:15:52] PROBLEM - puppet last run on mc1004 is CRITICAL Puppet has 2 failures [15:16:00] RECOVERY - puppet last run on mw1223 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:16:01] RECOVERY - puppet last run on mw1219 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:16:01] RECOVERY - puppet last run on mw1231 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:16:02] RECOVERY - puppet last run on fluorine is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:16:10] RECOVERY - puppet last run on mw1033 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:16:20] PROBLEM - puppet last run on ms-be1005 is CRITICAL Puppet has 4 failures [15:16:21] PROBLEM - puppet last run on es1008 is CRITICAL Puppet has 1 failures [15:16:21] (03PS1) 10BBlack: depool cp1065 for thermal stuff: T103226 [puppet] - 10https://gerrit.wikimedia.org/r/222319 [15:16:30] RECOVERY - puppet last run on mw2159 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:16:30] RECOVERY - puppet last run on mw2188 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:16:31] RECOVERY - puppet last run on mw2010 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:16:31] RECOVERY - puppet last run on mw1186 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:16:31] PROBLEM - puppet last run on ms-be1014 is CRITICAL Puppet has 9 failures [15:16:31] PROBLEM - puppet last run on ms-be1001 is CRITICAL Puppet has 1 failures [15:16:35] (03CR) 10BBlack: [C: 032 V: 032] depool cp1065 for thermal stuff: T103226 [puppet] - 10https://gerrit.wikimedia.org/r/222319 (owner: 10BBlack) [15:16:40] RECOVERY - puppet last run on mw1225 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:16:40] PROBLEM - puppet last run on zirconium is CRITICAL Puppet has 4 failures [15:16:40] RECOVERY - puppet last run on mw1236 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:16:41] PROBLEM - puppet last run on mc1010 is CRITICAL Puppet has 4 failures [15:16:41] RECOVERY - 
puppet last run on mw2009 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:16:41] RECOVERY - puppet last run on mw2180 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:16:41] PROBLEM - puppet last run on ms-be2015 is CRITICAL Puppet has 2 failures [15:16:41] RECOVERY - puppet last run on mc2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:50] RECOVERY - puppet last run on mw2075 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:16:50] PROBLEM - puppet last run on ms-be1006 is CRITICAL Puppet has 4 failures [15:16:50] PROBLEM - puppet last run on stat1001 is CRITICAL Puppet has 3 failures [15:16:51] PROBLEM - puppet last run on ms-be1017 is CRITICAL Puppet has 6 failures [15:16:51] PROBLEM - puppet last run on mc1008 is CRITICAL Puppet has 1 failures [15:17:01] RECOVERY - puppet last run on mw2020 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:17:01] RECOVERY - puppet last run on mw2004 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:17:01] RECOVERY - puppet last run on mw1112 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:17:01] PROBLEM - puppet last run on mc1011 is CRITICAL Puppet has 4 failures [15:17:02] RECOVERY - puppet last run on mw1209 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:17:02] RECOVERY - puppet last run on mw1024 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:17:02] RECOVERY - puppet last run on mw1142 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:17:10] RECOVERY - puppet last run on mw1032 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:17:10] PROBLEM - puppet last run on ms-be1013 is CRITICAL Puppet has 1 failures [15:17:10] PROBLEM - puppet last run on dbstore1001 is CRITICAL Puppet has 1 failures [15:17:11] PROBLEM - puppet last run on db1058 is CRITICAL Puppet has 1 failures [15:17:11] RECOVERY - puppet last run on mw2157 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:17:20] PROBLEM - puppet last run on uranium is CRITICAL Puppet has 2 failures [15:17:20] RECOVERY - puppet last run on mw1185 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:17:21] PROBLEM - puppet last run on ms-be2014 is CRITICAL Puppet has 2 failures [15:17:21] PROBLEM - puppet last run on mc1006 is CRITICAL Puppet has 2 failures [15:17:22] PROBLEM - puppet last run on es1006 is CRITICAL Puppet has 2 failures [15:17:22] PROBLEM - puppet last run on mc1014 is CRITICAL Puppet has 1 failures [15:17:22] PROBLEM - puppet last run on lvs3001 is CRITICAL Puppet has 1 failures [15:17:23] !log depooled cp1065 in pybal/puppet [15:17:29] Logged the message, Master [15:17:30] PROBLEM - puppet last run on mc1003 is CRITICAL Puppet has 2 failures [15:17:30] RECOVERY - puppet last run on mw2141 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:17:31] RECOVERY - puppet last run on mw2081 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:17:31] PROBLEM - puppet last run on ms-fe2004 is CRITICAL Puppet has 5 failures [15:17:31] PROBLEM - puppet last run on ms-be2013 is CRITICAL Puppet has 2 failures [15:17:31] PROBLEM - puppet last run on ms-be2002 is CRITICAL Puppet has 2 failures [15:17:40] PROBLEM - puppet last run on ms-be1010 is CRITICAL Puppet has 6 failures 
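Returning to the mount_nfs_volume / yaml exchange a little earlier: the Puppet function itself is not shown in this log, but the behaviour described — a single yaml file deciding which NFS volumes a project should get — amounts to a simple lookup. A sketch of that lookup is below; the file path and schema are assumptions made for illustration only, and the real function and config live in the operations/puppet repo.

```
import yaml

NFS_CONFIG = '/etc/nfs-mounts.yaml'  # assumed path, not necessarily the real one


def mount_nfs_volume(project, volume, config_path=NFS_CONFIG):
    """Return True if `volume` (e.g. 'scratch', 'home') is enabled for
    `project`, according to the yaml single source of truth."""
    with open(config_path) as f:
        projects = yaml.safe_load(f) or {}
    return volume in projects.get(project, {}).get('mounts', [])


# The manifest check quoted above,
#   if mount_nfs_volume($::instanceproject, 'scratch') { ... }
# would then map onto:
#   if mount_nfs_volume(instanceproject, 'scratch'): ...
```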
[15:17:41] PROBLEM - puppet last run on ms-fe2003 is CRITICAL Puppet has 3 failures [15:17:41] RECOVERY - puppet last run on mw1193 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:17:41] RECOVERY - puppet last run on mw1022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:17:41] RECOVERY - puppet last run on mw1093 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:17:41] PROBLEM - puppet last run on mc1015 is CRITICAL Puppet has 2 failures [15:17:42] RECOVERY - puppet last run on mw2154 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:17:42] RECOVERY - puppet last run on mw2115 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:17:50] PROBLEM - puppet last run on ms-be2010 is CRITICAL Puppet has 6 failures [15:17:50] PROBLEM - puppet last run on ms-be1016 is CRITICAL Puppet has 2 failures [15:17:50] PROBLEM - puppet last run on mc1005 is CRITICAL Puppet has 3 failures [15:17:51] PROBLEM - puppet last run on iron is CRITICAL Puppet has 1 failures [15:17:57] !log killed icinga-wm again [15:18:03] Logged the message, Master [15:18:28] !log anomie Synchronized php-1.26wmf12/extensions/Wikidata/: SWAT: Update Wikibase: SearchEntities return 'aliases' when not same as label [[gerrit:222311]] (duration: 00m 20s) [15:18:28] jzerebecki: ^ Test please [15:18:33] Logged the message, Master [15:19:19] anomie: works. thx. [15:19:30] * anomie is done with SWAT [15:19:37] YuviPanda, do I need to restart the puppetmaster with your change? [15:20:20] (NFS config move) [15:22:47] RECOVERY - puppet last run on ms-be2015 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:22:47] RECOVERY - puppet last run on virt1002 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:22:47] RECOVERY - puppet last run on db2051 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:22:47] RECOVERY - puppet last run on stat1001 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:22:50] YuviPanda: yeah, that seems to help. Could you please restart the puppetmasters for other people as well? Thanks. 
[15:22:50] RECOVERY - puppet last run on ms-be1017 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:22:51] RECOVERY - puppet last run on mc1008 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:22:51] RECOVERY - puppet last run on caesium is OK Puppet is currently enabled, last run 1 second ago with 0 failures [15:22:51] RECOVERY - puppet last run on mc1011 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:23:01] RECOVERY - puppet last run on mw1071 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:23:01] RECOVERY - puppet last run on ms-be1013 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:23:01] RECOVERY - puppet last run on dbstore1001 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:23:10] RECOVERY - puppet last run on analytics1039 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:23:11] RECOVERY - puppet last run on uranium is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:23:21] RECOVERY - puppet last run on ms-be2014 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:23:21] RECOVERY - puppet last run on mc1006 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:23:21] RECOVERY - puppet last run on es1006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:21] RECOVERY - puppet last run on logstash1003 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:23:21] RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:23:21] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:30] RECOVERY - puppet last run on ms-be2013 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:23:30] RECOVERY - puppet last run on ms-be2002 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:23:30] RECOVERY - puppet last run on neon is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:23:32] valhallasw`cloud: oh yeah, that was the *previous* change. 
I already restarted labcontrol1001 puppetmaster [15:23:41] PROBLEM - puppet last run on mw1103 is CRITICAL Puppet has 1 failures [15:23:41] RECOVERY - puppet last run on ms-be2010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:42] RECOVERY - puppet last run on ms-be1016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:50] RECOVERY - puppet last run on mc1004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:51] RECOVERY - puppet last run on lanthanum is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:51] RECOVERY - puppet last run on mc1009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:00] RECOVERY - puppet last run on es1009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:00] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:01] RECOVERY - puppet last run on wtp1017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:10] RECOVERY - puppet last run on elastic1028 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:10] PROBLEM - puppet last run on mw2111 is CRITICAL Puppet has 1 failures [15:24:10] RECOVERY - puppet last run on ms-be1005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:11] RECOVERY - puppet last run on mw1207 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:11] RECOVERY - puppet last run on elastic1026 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:11] RECOVERY - puppet last run on ms-fe1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:20] RECOVERY - puppet last run on ms-be1004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:20] RECOVERY - puppet last run on es1008 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:24:21] PROBLEM - puppet last run on mw2196 is CRITICAL Puppet has 1 failures [15:24:21] PROBLEM - puppet last run on mw2203 is CRITICAL Puppet has 1 failures [15:24:21] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:21] PROBLEM - puppet last run on mw2198 is CRITICAL Puppet has 1 failures [15:24:21] RECOVERY - puppet last run on es2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:22] RECOVERY - puppet last run on mw2067 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:22] RECOVERY - puppet last run on erbium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:23] PROBLEM - puppet last run on mw2070 is CRITICAL Puppet has 1 failures [15:24:30] YuviPanda: yeah, but self-hosted puppetmasters... 
[15:24:30] RECOVERY - puppet last run on californium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:30] RECOVERY - puppet last run on ms-be1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:41] RECOVERY - puppet last run on magnesium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:41] RECOVERY - puppet last run on ms-be1006 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:24:41] RECOVERY - puppet last run on ms-be2006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:41] RECOVERY - puppet last run on ms-be2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:42] PROBLEM - puppet last run on mw1182 is CRITICAL Puppet has 1 failures [15:24:51] PROBLEM - puppet last run on mw1124 is CRITICAL Puppet has 1 failures [15:24:51] RECOVERY - puppet last run on ms-be2004 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:24:51] PROBLEM - puppet last run on mw2061 is CRITICAL Puppet has 1 failures [15:24:52] PROBLEM - puppet last run on mw1060 is CRITICAL Puppet has 1 failures [15:24:53] PROBLEM - puppet last run on mw1214 is CRITICAL Puppet has 1 failures [15:25:00] PROBLEM - puppet last run on mw1138 is CRITICAL Puppet has 1 failures [15:25:02] RECOVERY - puppet last run on netmon1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:25:10] PROBLEM - puppet last run on mw1184 is CRITICAL Puppet has 1 failures [15:25:11] PROBLEM - puppet last run on mw2131 is CRITICAL Puppet has 1 failures [15:25:11] RECOVERY - puppet last run on db2009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:25:11] PROBLEM - puppet last run on mw1021 is CRITICAL Puppet has 1 failures [15:25:12] RECOVERY - puppet last run on lvs3001 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:25:16] 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1421168 (10BBlack) cp1065 downtimed and depooled in various places and software poweroff'd, can use that one. 
[15:25:20] RECOVERY - puppet last run on mc1003 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:25:21] RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:25:21] PROBLEM - puppet last run on mw2101 is CRITICAL Puppet has 1 failures [15:25:21] RECOVERY - puppet last run on ms-fe2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:25:30] PROBLEM - puppet last run on mw1078 is CRITICAL Puppet has 1 failures [15:25:40] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 1 failures [15:25:41] PROBLEM - puppet last run on mw2169 is CRITICAL Puppet has 1 failures [15:25:42] RECOVERY - puppet last run on iron is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:25:50] PROBLEM - puppet last run on mw1046 is CRITICAL Puppet has 1 failures [15:26:00] PROBLEM - puppet last run on mw1101 is CRITICAL Puppet has 1 failures [15:26:01] RECOVERY - puppet last run on mc1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:26:02] RECOVERY - puppet last run on db1002 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:26:11] PROBLEM - puppet last run on mw1054 is CRITICAL Puppet has 1 failures [15:26:12] RECOVERY - puppet last run on analytics1035 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:26:20] PROBLEM - puppet last run on mw2182 is CRITICAL Puppet has 1 failures [15:26:20] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 1 failures [15:26:20] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures [15:26:20] PROBLEM - puppet last run on mw1098 is CRITICAL Puppet has 1 failures [15:26:20] RECOVERY - puppet last run on db1023 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:26:21] PROBLEM - puppet last run on mw1252 is CRITICAL Puppet has 1 failures [15:26:30] PROBLEM - puppet last run on mw2161 is CRITICAL Puppet has 1 failures [15:26:31] RECOVERY - puppet last run on ms-fe1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:26:40] PROBLEM - puppet last run on mw2006 is CRITICAL Puppet has 1 failures [15:26:40] PROBLEM - puppet last run on mw1038 is CRITICAL Puppet has 1 failures [15:26:41] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:26:41] RECOVERY - puppet last run on wtp2008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:26:41] RECOVERY - puppet last run on ms-fe2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:26:50] RECOVERY - puppet last run on mw1060 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:27:21] RECOVERY - puppet last run on labvirt1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:27:30] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:27:31] PROBLEM - puppet last run on mw1125 is CRITICAL Puppet has 1 failures [15:27:41] RECOVERY - puppet last run on mw1046 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:27:43] (03PS12) 10Merlijn van Deen: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [15:27:51] RECOVERY - puppet last run on cp2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures 
[15:27:51] RECOVERY - puppet last run on mc1012 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:27:52] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:00] RECOVERY - puppet last run on mw2059 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:28:01] RECOVERY - puppet last run on ms-be3002 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:28:11] RECOVERY - puppet last run on mw1054 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:11] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:11] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:21] RECOVERY - puppet last run on analytics1030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:30] RECOVERY - puppet last run on silver is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:28:40] RECOVERY - puppet last run on gallium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:01] RECOVERY - puppet last run on labnet1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:02] RECOVERY - puppet last run on mc1014 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:29:22] RECOVERY - puppet last run on bast4001 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:29:30] RECOVERY - puppet last run on mc1005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:51] RECOVERY - puppet last run on mw2062 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:29:51] RECOVERY - puppet last run on ms-be2011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:30:00] RECOVERY - puppet last run on ms-be1012 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:30:21] RECOVERY - puppet last run on mw2006 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:30:21] RECOVERY - puppet last run on mc2013 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:30:22] RECOVERY - puppet last run on ms-be3001 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:30:32] RECOVERY - puppet last run on ms-be2005 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:30:32] RECOVERY - puppet last run on ms-be2001 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:30:50] RECOVERY - puppet last run on ms-be1008 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:31:01] RECOVERY - puppet last run on ms-be2012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:31:11] RECOVERY - puppet last run on ms-fe3002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:31:20] RECOVERY - puppet last run on mw1125 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:31:41] RECOVERY - puppet last run on db2037 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:31:50] RECOVERY - puppet last run on ms-be2008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:32:03] RECOVERY - puppet last run on mw1098 is OK Puppet is currently enabled, last run 1 
minute ago with 0 failures [15:32:22] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:32:23] RECOVERY - puppet last run on ms-be2007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:32:30] RECOVERY - puppet last run on mc1013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:32:31] RECOVERY - puppet last run on mc1007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:33:31] RECOVERY - puppet last run on ms-fe3001 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:33:41] RECOVERY - puppet last run on mc2016 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [15:35:50] RECOVERY - puppet last run on mw1091 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:35:50] RECOVERY - puppet last run on mw1016 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:36:20] RECOVERY - puppet last run on mw1064 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:36:40] RECOVERY - puppet last run on mw1021 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:36:51] RECOVERY - puppet last run on mw1154 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:36:56] boo [15:37:37] https://www.irccloud.com/pastebin/fj9JTrUa/ [15:37:41] paravoid: ^ gids are sane [15:37:50] that diff is ok that's a new project. [15:38:53] (03PS3) 10BBlack: HSTS: increase to 1y, do not allow applayer override [puppet] - 10https://gerrit.wikimedia.org/r/222301 (https://phabricator.wikimedia.org/T40516) [15:39:04] (03CR) 10BBlack: [C: 032 V: 032] HSTS: increase to 1y, do not allow applayer override [puppet] - 10https://gerrit.wikimedia.org/r/222301 (https://phabricator.wikimedia.org/T40516) (owner: 10BBlack) [15:41:51] RECOVERY - puppet last run on mw1252 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:42:12] RECOVERY - puppet last run on mw1124 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:43:31] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [15:44:01] RECOVERY - puppet last run on mw1038 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:44:34] (03PS4) 10ArielGlenn: rsync of phab dumps from iridium to dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/221658 (https://phabricator.wikimedia.org/T103028) [15:44:52] RECOVERY - puppet last run on mw2169 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:45:45] (03CR) 10ArielGlenn: [C: 032] rsync of phab dumps from iridium to dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/221658 (https://phabricator.wikimedia.org/T103028) (owner: 10ArielGlenn) [15:45:50] RECOVERY - puppet last run on mw2161 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:46] if dataset puppet whines, ignore please, it's me [15:53:54] paravoid: so if my understanding of exportfs is correct, I can do that manually atm and it'll let NFS work on instances that've been created so far. [15:54:08] * YuviPanda is being overly cautious with all the NFS stuff [15:58:02] YuviPanda: exportfs alone won't work, bind mounts aren't there [15:58:07] YuviPanda: sync-exports would, though [15:58:31] paravoid: that's for *new* projects, right?
the current projects already have bind mounts [15:58:36] paravoid: just need updated IPs [15:58:40] correct [15:58:54] but exportfs will also try to export the new projects as well [15:59:28] so sync-exports won't work because paths have changed [15:59:38] (03PS1) 10ArielGlenn: phab dumps rsync using ipv4 client addr [puppet] - 10https://gerrit.wikimedia.org/r/222325 [16:01:13] (03CR) 10ArielGlenn: [C: 032] phab dumps rsync using ipv4 client addr [puppet] - 10https://gerrit.wikimedia.org/r/222325 (owner: 10ArielGlenn) [16:01:41] apergos: why? [16:02:01] that single-line commit isn't great [16:02:13] 6operations, 7Database: review eqiad database server quantities / warranties / service(s) - https://phabricator.wikimedia.org/T103936#1421335 (10RobH) p:5High>3Normal [16:02:14] *commit message [16:02:22] because there's only a dynamic ipv6 addr [16:02:27] (03PS1) 10Cmjohnson: Fixing a dns typo for labnet1002.mgmt [dns] - 10https://gerrit.wikimedia.org/r/222326 [16:02:38] (03PS1) 10BBlack: Revert "depool cp1065 for thermal stuff: T103226" [puppet] - 10https://gerrit.wikimedia.org/r/222327 [16:02:41] which host? [16:02:44] iridium [16:02:50] (03CR) 10BBlack: [C: 032 V: 032] Revert "depool cp1065 for thermal stuff: T103226" [puppet] - 10https://gerrit.wikimedia.org/r/222327 (owner: 10BBlack) [16:03:03] probably a lot of hosts like that [16:03:11] internal network [16:03:19] internal network has nothing to do with that [16:03:31] just give it a static IP [16:03:48] people don't think about adding ipv6 for the internal hosts generally, or at least they haven't in the past [16:04:00] which people? :) [16:04:25] !log clean out exports.d in labstore1002, will get regenerated. backup in /root/exports.backup [16:04:26] whoever I've come behind and added one in the past [16:04:30] Logged the message, Master [16:04:34] $ grep -c 2620: wmnet [16:04:34] 128 [16:04:41] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: MediaWiki deployment shell access request - https://phabricator.wikimedia.org/T104546#1421346 (10RobH) 5Open>3stalled This is closely related to T104222 (setting up his user and giving him root via that task). The sudo request won't be covered until... [16:04:52] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: MediaWiki deployment shell access request - https://phabricator.wikimedia.org/T104546#1421348 (10RobH) [16:04:54] 10Ops-Access-Requests, 6operations: Grant dcausse root on the search cluster - https://phabricator.wikimedia.org/T104222#1421349 (10RobH) [16:05:00] just give it an ipv6 address [16:05:03] (03CR) 10Cmjohnson: [C: 032] Fixing a dns typo for labnet1002.mgmt [dns] - 10https://gerrit.wikimedia.org/r/222326 (owner: 10Cmjohnson) [16:05:04] fine by me [16:05:10] since it's >= trusty, this will also remove the dynamic IP [16:05:22] we haven't yet fixed the dynamic autoconfig IPs for all hosts [16:05:22] !log cp1065 undowntimed/repooled [16:05:27] (03PS13) 10Merlijn van Deen: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [16:05:28] Logged the message, Master [16:05:31] I'll bear it in mind [16:06:13] YuviPanda: ^ this correctly deploys (as in: no changes other than comments) on toolsbeta. Could you give it a quick glance-over? then we can deploy it on tools-mail and confirm it works correctly there [16:06:31] valhallasw`cloud: probably not atm, sorry - juggling a lot of things. 
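To recap the exportfs exchange above (around 15:53–15:59): projects whose /srv directories are already bind-mounted under /exp only need the kernel to re-read /etc/exports.d — `exportfs -ra` — to pick up new instance IPs, while a genuinely new project needs its bind mount created first, which is what sync-exports used to handle. A rough sketch of telling the two cases apart follows; the paths come from the discussion, and everything else is an assumption rather than the actual labstore tooling.

```
#!/usr/bin/env python
"""Sketch: list which /exp entries are live bind mounts, then re-export."""
import os
import subprocess

EXP = '/exp/project'  # per the discussion; /exp/others is ignored here


def mounted_paths():
    """Mount points currently known to the kernel, from /proc/mounts."""
    with open('/proc/mounts') as f:
        return {line.split()[1] for line in f}


def report():
    mounts = mounted_paths()
    for entry in sorted(os.listdir(EXP)):
        path = os.path.join(EXP, entry)
        state = 'bind-mounted' if path in mounts else 'no bind mount (new project?)'
        print('%s: %s' % (path, state))


def reexport():
    # Sufficient for new instances in already-bind-mounted projects:
    # re-reads /etc/exports and /etc/exports.d and applies the client lists.
    subprocess.check_call(['exportfs', '-ra'])


if __name__ == '__main__':
    report()
    reexport()
```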
[16:06:36] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: MediaWiki deployment shell access request - https://phabricator.wikimedia.org/T104546#1421350 (10RobH) p:5Triage>3Normal @dcausse: technically your manager should approve deployment rights on this task (since its a totally different subset of rights.... [16:06:41] NFS! NFS! NFS [16:06:47] valhallasw`cloud: https://gerrit.wikimedia.org/r/#/c/164386/ [16:06:56] valhallasw`cloud: "This change seems fine by itself, but would break, due to the fact that exim4-daemon-heavy is not (reliably) staying installed, it's replaced by Puppet for exim4-deamon-light which is the default for all non-mail server hosts. We need to fix that before deploying this." [16:07:15] valhallasw`cloud: sounds like you fixed that ;) [16:07:46] paravoid: *nod* [16:07:58] well, scfc did, I think [16:08:10] roughly half of that patch is his work [16:08:55] (03PS14) 10Merlijn van Deen: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [16:09:09] woo, so exports.d is 'clean' now [16:09:25] YuviPanda: can you +2 and +2 a revert if it fails? [16:09:31] that's good enough for me =p [16:10:15] paravoid: awww fuck sync-exports won't work at all - the entire structure of /exp and /srv has changed. [16:10:36] there's /exp/project/* and /exp/others/* [16:10:40] and I'm not sure which ones are active. [16:11:11] /srv/others/* has everything not toollabs and /srv/project has tools. [16:12:08] valhallasw`cloud: :P disabled puppet on the current host? [16:12:21] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: MediaWiki deployment shell access request - https://phabricator.wikimedia.org/T104546#1421358 (10Manybubbles) Approved. @dcausse should deploy! [16:13:15] valhallasw`cloud: alright, doing now. [16:13:21] thanks [16:13:29] (03PS15) 10Yuvipanda: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [16:13:45] (03CR) 10Yuvipanda: [C: 032 V: 032] Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [16:14:05] (03PS1) 10Yuvipanda: Revert "Tools: Simplify and fix mail setup" [puppet] - 10https://gerrit.wikimedia.org/r/222329 [16:14:21] valhallasw`cloud: ^ done, and revert prepared. [16:14:25] thanks! [16:14:31] testing now [16:14:34] valhallasw`cloud: ok! [16:15:05] paravoid: do you know why tools is special cased this way? [16:15:20] andrewbogott: looks like sync-exports also has to be rewritten before NFS is fixed. [16:15:22] different LV? [16:15:38] paravoid: does it mean it has to be on a different path? [16:15:49] sort of [16:15:57] YuviPanda: is that because the exports files that we create now are different? [16:16:07] it either needs that, or it's dependent on the other mount [16:16:09] andrewbogott: no, but the bind mounts are in different paths. [16:16:12] (03PS1) 10ArielGlenn: iridium: add ipv6 addr [puppet] - 10https://gerrit.wikimedia.org/r/222330 [16:16:21] ah, ok. Should be a pretty quick fix then [16:16:26] (03CR) 10Merlijn van Deen: "Mail server merged correctly, the only change being" [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [16:17:02] andrewbogott: maaaybe. they're only kind of consistently different - tools is in a different path and everything else is in a different path. 
[16:17:31] want me to do it? I can finish up with what I’m doing now pretty quick [16:17:34] paravoid: so, in /exp, I see mounts in both /exp/project and /exp/others. How do I find out which ones are actually being used by the mounts? [16:18:01] YuviPanda: success \o/ [16:18:07] YuviPanda: I don't understand the question tbh :) [16:18:07] by the mounts in the instances, I mean [16:18:15] (03CR) 10ArielGlenn: [C: 032] iridium: add ipv6 addr [puppet] - 10https://gerrit.wikimedia.org/r/222330 (owner: 10ArielGlenn) [16:18:26] (03CR) 10Merlijn van Deen: [C: 04-1] "Worked correctly \o/ so this can be abandoned" [puppet] - 10https://gerrit.wikimedia.org/r/222329 (owner: 10Yuvipanda) [16:18:42] paravoid: ok, so sync-exports currently assumes it is going to mount stuff from /srv/* to /exp/project/* [16:18:59] paravoid: except, I have to now modify it to mount stuff from /srv/others/* and /srv/projects/* to... somewhere. [16:19:17] paravoid: I'm not sure if that 'somewhere' is /exp/project/* or /exp/others/*, since currently there are mounts in both of those places. [16:19:37] hmm, I guess I can look at the old exports.d contents [16:20:03] ok, everything seems to be in /exp/projects [16:20:50] andrewbogott: sure if you want to - I'm not entirely sure atm what's the right thing to do though. [16:21:04] andrewbogott: but /etc/exports.d contents is correctly being generated by the daemon atm. [16:21:13] valhallasw`cloud: \o/ thanks! [16:25:06] andrewbogott: paravoid I'm going to start a straight up port of sync-exports into python now. and modify to fit before merging. [16:25:06] * YuviPanda does. [16:25:20] YuviPanda: ok then :) [16:27:29] so current sync-exports basically seems to be: if it is in /srv and not in /exp, create it in /exp/project. if in /exp/project and not in /srv, rm in /exp/project [16:29:06] that seems reasonable... [16:29:21] (03PS1) 10ArielGlenn: ipv6 addr for iridium [dns] - 10https://gerrit.wikimedia.org/r/222334 [16:29:40] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60576 bytes in 0.335 second response time [16:31:47] (03CR) 10ArielGlenn: [C: 032] ipv6 addr for iridium [dns] - 10https://gerrit.wikimedia.org/r/222334 (owner: 10ArielGlenn) [16:32:45] apergos: just PTR but no forward? [16:33:09] erm? [16:33:10] YuviPanda: http://www.nytimes.com/2015/07/05/magazine/what-happens-when-a-state-is-run-by-movie-stars.html [16:33:26] ori: hah, I know which one it is before clicking :D [16:33:38] "A.I.A.D.M.K." is a pretty bad-ass acronym [16:33:50] apergos: the IPv6 address for iridium. there seems to be no AAAA record [16:33:51] it sounds like something that has lasers [16:33:55] yes [16:34:03] left it out ugh [16:36:02] (03PS1) 10ArielGlenn: er, and the rest of ipv6 for iridum. [dns] - 10https://gerrit.wikimedia.org/r/222336 [16:36:10] ori: > Chennai in particular is a city whose self-image is genteel, cultured and intellectual [16:36:12] what BS :P [16:36:54] apergos: s/iridum/iridium/ [16:37:05] apergos: and yeah, in general better commit messages wouldn't hurt ;) [16:37:23] ugh [16:37:30] just all wins today on my part [16:37:34] heh [16:37:51] I know the feeling [16:38:47] me too atm [16:39:19] 6operations, 10RESTBase-Cassandra, 6Services: Alert the services team mailing list when Cassandra dies - https://phabricator.wikimedia.org/T104467#1421531 (10mobrovac) 5Open>3Resolved [16:40:26] !log Restarted logstash on logstash1001 due to OOM [16:40:32] Logged the message, Master [16:44:11] (03PS2) 10ArielGlenn: er, and the rest of ipv6 for iridium. 
[dns] - 10https://gerrit.wikimedia.org/r/222336 [16:47:23] (03CR) 10Dzahn: [C: 031] er, and the rest of ipv6 for iridium. [dns] - 10https://gerrit.wikimedia.org/r/222336 (owner: 10ArielGlenn) [16:48:28] (03CR) 10ArielGlenn: [C: 032] er, and the rest of ipv6 for iridium. [dns] - 10https://gerrit.wikimedia.org/r/222336 (owner: 10ArielGlenn) [16:49:11] EPL down? [16:50:12] 6operations, 3Discovery-Cirrus-Sprint: Import Elasticsearch 1.6.0 deb into wmf apt - https://phabricator.wikimedia.org/T102008#1421584 (10ksmith) [17:04:45] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 3Discovery-Wikidata-Query-Service-Sprint: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1421689 (10Deskana) [17:09:05] 6operations, 6Phabricator, 5Patch-For-Review: Automate nightly dump of Phabricator metadata - https://phabricator.wikimedia.org/T103028#1421722 (10ArielGlenn) Did a run of the rsync by hand; file showed up in the right place. Leaving this open til we know the cron runs ok. [17:12:19] 6operations, 6Phabricator, 5Patch-For-Review: Automate nightly dump of Phabricator metadata - https://phabricator.wikimedia.org/T103028#1421743 (10chasemp) 5Open>3Resolved >>! In T103028#1421722, @ArielGlenn wrote: > Did a run of the rsync by hand; file showed up in the right place. Leaving this open til... [17:17:56] 6operations: logstash1003 - RAID failed - https://phabricator.wikimedia.org/T104592#1421758 (10Dzahn) 3NEW [17:18:30] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [17:18:50] (03PS1) 10Yuvipanda: labstore: Do not use tempfils for exports [puppet] - 10https://gerrit.wikimedia.org/r/222341 [17:22:19] (03PS1) 10Yuvipanda: labstore: Rewrite sync-exports to python [puppet] - 10https://gerrit.wikimedia.org/r/222342 [17:22:22] paravoid: ^ [17:22:23] andrewbogott: ^ [17:22:30] not complete... [17:22:49] also I'm wondering if they need to be a different file at all. [17:23:10] can't I grant sudo permissions to the daemon to just do any calls to mount, umount and be done? [17:23:13] paravoid: and you are right, it's pretty simple :) [17:24:46] paravoid: haha, the new project never had NFS enabled, so it isn't actually required. [17:24:48] * YuviPanda gets rid of that. [17:26:28] aaaarggghh, I have created a local branch called origin/production somehow [17:26:30] * YuviPanda cries [17:30:13] (03PS1) 10Yuvipanda: labstore: Remove NFS from 'wildcat' project [puppet] - 10https://gerrit.wikimedia.org/r/222347 [17:30:55] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Remove NFS from 'wildcat' project [puppet] - 10https://gerrit.wikimedia.org/r/222347 (owner: 10Yuvipanda) [17:34:00] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:35:12] 6operations, 5Patch-For-Review: irc bots should send NOTICE not PRIVMSG - https://phabricator.wikimedia.org/T101575#1421832 (10valhallasw) Well, that was a short but rough test. ``` 19:30 ok, this is going to suck on IRCCloud 19:30 it's huge ``` ``` 19:30 WHAT? 19:3... [17:41:21] PROBLEM - puppet last run on mw2129 is CRITICAL puppet fail [17:45:51] andrewbogott: were there any problems when you ran exportfs manually last time? [17:46:03] YuviPanda: not that I noticed [17:47:55] Hallo. [17:48:05] Is the train deployment running already? [17:48:15] Ran earlier? Running later? 
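For reference, the old sync-exports behaviour summarised earlier (around 16:27) — "if it is in /srv and not in /exp, create it in /exp/project; if in /exp/project and not in /srv, remove it" — is a small reconciliation loop, which is what the Python port has to reproduce. The sketch below covers that loop only: it assumes one directory per project, ignores the /srv/others vs /srv/project split, and is not the actual code of the gerrit change 222342 linked earlier.

```
#!/usr/bin/env python
"""Sketch of the sync-exports reconciliation between /srv and /exp/project."""
import os
import subprocess

SRV = '/srv'
EXP = '/exp/project'


def sync_exports():
    srv_projects = set(os.listdir(SRV))
    exp_projects = set(os.listdir(EXP))

    # In /srv but not yet exported: create the bind mount under /exp/project.
    # mount/umount are the two calls that need root (sudo in the original script).
    for project in sorted(srv_projects - exp_projects):
        target = os.path.join(EXP, project)
        os.makedirs(target)
        subprocess.check_call(['mount', '--bind', os.path.join(SRV, project), target])

    # In /exp/project but no longer in /srv: unmount and remove.
    for project in sorted(exp_projects - srv_projects):
        target = os.path.join(EXP, project)
        subprocess.check_call(['umount', target])
        os.rmdir(target)


if __name__ == '__main__':
    sync_exports()
```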
[17:49:35] 6operations, 10vm-requests, 5Patch-For-Review: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1421880 (10Dzahn) I did that and set the boot order to disk during install. After shutdown i started it up again. Then i tried to connect to console.. no output [17:50:23] greg-g, kart_ ^ [17:55:02] 6operations, 10vm-requests, 5Patch-For-Review: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1421886 (10Dzahn) ``` [ganeti1003:~] $ sudo gnt-instance modify --hypervisor-parameters=boot_order=disk bromine.eqiad.wmnet Modified instance bromine.eqiad.wmnet - hv/boot_order -> d... [17:56:28] 6operations, 10vm-requests, 5Patch-For-Review: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1421888 (10Dzahn) all that said, it still worked and i could ssh to it with the "new_install" key and it's up and running :) [17:57:19] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1421897 (10Dzahn) [17:58:31] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:58:51] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1394369 (10Dzahn) a couple minutes later the screen with the "frozen" console came to life and works now: bromine login: Debian GNU/Linux 8 bromine ttyS0 [18:00:04] twentyafterfour greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150702T1800). Please do the needful. [18:00:31] (03CR) 10Dzahn: [C: 032] add new node bromine, add bz-static role [puppet] - 10https://gerrit.wikimedia.org/r/222203 (https://phabricator.wikimedia.org/T101734) (owner: 10Dzahn) [18:01:56] andrewbogott: so I'm going to run it now, and hope nothing explodes. that'll give all new instances NFS, but also remove NFS from any projects where it was working accidentally. [18:02:03] aka it's really going to reflect the NFS one now [18:02:08] ok [18:02:18] alright [18:02:30] (03PS5) 10Dzahn: add new node bromine, add bz-static role [puppet] - 10https://gerrit.wikimedia.org/r/222203 (https://phabricator.wikimedia.org/T101734) [18:02:43] !log running exportfs -ra on labstore1002 [18:02:49] Logged the message, Master [18:03:36] (03PS6) 10Dzahn: add new node bromine, add bz-static role [puppet] - 10https://gerrit.wikimedia.org/r/222203 (https://phabricator.wikimedia.org/T101734) [18:04:28] (03CR) 10Dzahn: [C: 032] add new node bromine, add bz-static role [puppet] - 10https://gerrit.wikimedia.org/r/222203 (https://phabricator.wikimedia.org/T101734) (owner: 10Dzahn) [18:04:51] (03PS1) 10Merlijn van Deen: [tools] New host: tools-mailrelay-02 [puppet] - 10https://gerrit.wikimedia.org/r/222358 [18:09:46] paravoid: after this train deploy my attempt at fixing the non-UTC log timestamps will be on all wikis. 
If you have suggestions on how to verify that it worked (or didn't) that would be awesome [18:10:20] (03PS2) 10Merlijn van Deen: [tools] New host: tools-mailrelay-02 [puppet] - 10https://gerrit.wikimedia.org/r/222358 (https://phabricator.wikimedia.org/T97574) [18:11:15] (03PS1) 10Yuvipanda: labstore: Enable /home for wikidata-query project [puppet] - 10https://gerrit.wikimedia.org/r/222361 [18:11:29] (03PS2) 10Yuvipanda: labstore: Enable /home for wikidata-query project [puppet] - 10https://gerrit.wikimedia.org/r/222361 [18:11:41] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Enable /home for wikidata-query project [puppet] - 10https://gerrit.wikimedia.org/r/222361 (owner: 10Yuvipanda) [18:11:58] (03PS1) 10Merlijn van Deen: Add PTR record for mailrelay-02.tools.wmflabs.org [dns] - 10https://gerrit.wikimedia.org/r/222362 (https://phabricator.wikimedia.org/T97574) [18:13:42] YuviPanda, bd808 - is the train running now? [18:14:15] aharoni: the window is open. Not sure if twentyafterfour is patching yet or not [18:14:58] ottomata: the new hadoop servers...going to be "analytics10xx"? [18:15:04] aharoni: bd808: I am just about ready to push it to all wikis [18:15:59] anything I should be aware of before the train embarks on its final journey of the week? [18:17:18] nothing special, I'm just looking forward to seeing some ContentTranslation bugs to be finally fixed in production. [18:17:22] cmjohnson1: yes, continuing from wherever we are [18:17:23] is good [18:19:13] andrewbogott: ok, that seems to work? [18:19:18] okay..thx [18:19:43] wheee [18:19:49] andrewbogott: exportfs is still manual but that can wait probably [18:20:22] YuviPanda: does exportfs need running per instance or just when new projects are added? [18:20:34] andrewbogott: per instance. new project additions require a bit more work :) [18:21:03] wait, exportfs needs a manual run for each new instance? [18:21:14] Then in what sense can that wait? Doesn't that mean things are exactly as broken as they were before? [18:21:40] andrewbogott: no, because before that we needed to hand edit the /etc/exports.d file as well [18:22:04] *shrug* from a user perspective it still requires someone to ask an op [18:22:22] indeed. by wait a bit I mean till tomorrow, since I'm totally exhausted and have to go. [18:22:26] ah, ok. [18:22:28] it's been a bit of a slog these 3 days. [18:22:39] andrewbogott: oh yeah, I don't intend to let this linger longer. just... not today :( [18:23:11] andrewbogott: replica.my.cnf is still broken because it seems to have depended on state left in /var/cache on labstore1001 which isn't around yet :( so I am in the process of rewriting *that* as well.
[18:23:16] hopefully all of those will be done by tomorrow [18:25:01] (03PS2) 10Yuvipanda: labstore: Do not use tempfils for exports [puppet] - 10https://gerrit.wikimedia.org/r/222341 [18:25:11] PROBLEM - check_puppetrun on boron is CRITICAL puppet fail [18:25:16] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Do not use tempfils for exports [puppet] - 10https://gerrit.wikimedia.org/r/222341 (owner: 10Yuvipanda) [18:25:46] ^^^ fixing [18:26:32] (03PS1) 10Ori.livneh: Tessera: base config.py.erb on Tessera's config.py [puppet] - 10https://gerrit.wikimedia.org/r/222365 [18:28:44] (03PS1) 1020after4: all wikis to 1.26wmf12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222366 [18:29:11] (03CR) 1020after4: [C: 032] all wikis to 1.26wmf12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222366 (owner: 1020after4) [18:29:17] (03Merged) 10jenkins-bot: all wikis to 1.26wmf12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222366 (owner: 1020after4) [18:29:45] aharoni: your wait is almost over [18:29:47] (03PS1) 10Alexandros Kosiaris: Fix typo [dns] - 10https://gerrit.wikimedia.org/r/222367 [18:30:10] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 254 seconds ago with 0 failures [18:30:34] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: all wikis to 1.26wmf12 [18:30:40] Logged the message, Master [18:30:48] (03CR) 10Alexandros Kosiaris: [C: 032] Fix typo [dns] - 10https://gerrit.wikimedia.org/r/222367 (owner: 10Alexandros Kosiaris) [18:33:00] 6operations, 10ops-codfw, 10hardware-requests, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1422016 (10RobH) We haven't migrated the entire approvals process into phabricator quite yet, so the purchase approvals are being handled via: https://rt.wiki... [18:34:10] PROBLEM - puppet last run on polonium is CRITICAL Puppet has 1 failures [18:34:56] andrewbogott: actually I think I might be able to hook exportfs in shortly! attempting to patch... [18:35:12] cool [18:35:19] twentyafterfour: yippee, I see stuff fixed. [18:35:21] Thank you. [18:40:44] 6operations, 10ops-codfw, 10hardware-requests, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1422055 (10RobH) a:5jcrespo>3RobH (stealing this since I've escalated it into purchase approvals.) [18:44:12] aharoni: you're welcome, I'm glad it's fixed [18:45:53] (03PS1) 10Ori.livneh: Add varnish stats reporter for ResourceLoader requests [puppet] - 10https://gerrit.wikimedia.org/r/222371 (https://phabricator.wikimedia.org/T104277) [18:46:07] bblack: ^ [18:48:49] (03CR) 10Ori.livneh: "The script can be tested by running /tmp/varnishrls on cp1066." [puppet] - 10https://gerrit.wikimedia.org/r/222371 (https://phabricator.wikimedia.org/T104277) (owner: 10Ori.livneh) [18:49:30] RECOVERY - puppet last run on polonium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:50:25] (03PS1) 10Yuvipanda: labstore: Run exportfs on every run of the daemon [puppet] - 10https://gerrit.wikimedia.org/r/222372 [18:50:41] andrewbogott: ^ [18:51:10] paravoid: ^ [18:51:55] YuviPanda: no, you need bind mounts as well [18:52:04] paravoid: only for new projects :) [18:52:14] all currently existing projects already have bind mounts [18:52:30] paravoid: and that's actually part of https://gerrit.wikimedia.org/r/#/c/222342/ - I'll move that inside tomorrow. 
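The change proposed just above, "labstore: Run exportfs on every run of the daemon" (222372), is only named in the log; the idea reduces to calling exportfs each time the daemon rewrites /etc/exports.d, so new instance IPs take effect without a manual step. A hedged sketch of how such a hook might look — the surrounding function names are assumptions, not the actual nfs-exports daemon code.

```
import logging
import subprocess


def reexport():
    """Ask the kernel NFS server to re-read /etc/exports.d after a rewrite.

    A failure is logged but must not kill the daemon; the next run retries.
    """
    try:
        subprocess.check_call(['/usr/sbin/exportfs', '-ra'])
    except subprocess.CalledProcessError:
        logging.exception('exportfs -ra failed; will retry on the next run')


# Assumed shape of the daemon loop:
#   write_exports_d(projects)   # regenerate /etc/exports.d/*.exports
#   reexport()                  # then have the kernel pick the changes up
```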
[18:52:44] the exportfs call would make things work for new instances in existing projects [18:52:58] yeah but you're writing exports right now for directories that don't exist [18:53:24] paravoid: /exp/projects? they exist. [18:53:31] I checked [18:53:41] and manually running exportfs -ra makes everything work atm. [18:54:10] !log Running sync-common on mw1111; fatal log showed it to be running 1.26wmf9 [18:54:16] Logged the message, Master [18:56:20] 6operations, 5Patch-For-Review: Move static-bugzilla from zirconium to ganeti - https://phabricator.wikimedia.org/T101734#1422103 (10Dzahn) [18:56:22] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1422101 (10Dzahn) 5Open>3Resolved there was a typo in DNS that lead to issues when trying to sign the puppet cert for initial run (and possibly some of the other issues) https://gerrit.wikimedia.org/r... [18:56:34] YuviPanda: sorry, where does sync-exports fit into this? I thought that happened between manage- and export- [18:56:47] andrewbogott: yeah, I'm merging sync-exports into manage [18:56:52] ah, ok. [18:57:03] 6operations, 10ops-codfw: Equip osm-cp200{1,2,3,4} with 2 1.2TB SSDs each - https://phabricator.wikimedia.org/T104610#1422104 (10akosiaris) 3NEW a:3Papaul [18:57:27] andrewbogott: I was rewriting sync-exports, realized that I have to copy paste half the code between manage and sync, and also there's no reason for sync to run separately since it only needs to use sudo for two commands. [18:57:46] (03PS2) 10Yuvipanda: labstore: Run exportfs on every run of the daemon [puppet] - 10https://gerrit.wikimedia.org/r/222372 [18:57:48] (03PS1) 10Yuvipanda: labstore: Restart nfs projects daemon when source changes [puppet] - 10https://gerrit.wikimedia.org/r/222375 [18:58:43] * YuviPanda is excited [18:58:47] there's light at the end of the tunnel! [18:59:31] (03CR) 10Yuvipanda: [C: 032] labstore: Restart nfs projects daemon when source changes [puppet] - 10https://gerrit.wikimedia.org/r/222375 (owner: 10Yuvipanda) [19:00:41] paravoid: mind if I merge and try it? we already verified that exportfs -ra makes NFS work on new instances just fine. [19:01:13] 6operations, 7HTTPS: audit and replace all fundraising certificates in sha1 to sha256 - https://phabricator.wikimedia.org/T104378#1422124 (10RobH) It seems civicrm is also using its own ssl certificate, is SHA1, and it expires in 7 days. We should fix that as well. [19:02:09] 6operations: improve redis master/slave monitoring - https://phabricator.wikimedia.org/T101584#1422127 (10mark) p:5Normal>3High [19:05:59] YuviPanda: yeah, have at. [19:06:42] andrewbogott: alright, wanna +1? :) [19:06:47] yeah, one second [19:06:56] (I was watching the metrics meeting until a few minutes ago) [19:07:04] ah, I see [19:07:40] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr-all@ list - https://phabricator.wikimedia.org/T104596#1422148 (10Krenair) [19:07:54] (03CR) 10Andrew Bogott: [C: 031] labstore: Run exportfs on every run of the daemon [puppet] - 10https://gerrit.wikimedia.org/r/222372 (owner: 10Yuvipanda) [19:07:55] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr-all@ list - https://phabricator.wikimedia.org/T104596#1422152 (10Krenair) Looks like this must be an alias in the private exim config. 
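For reference, a minimal sketch of what "run exportfs on every run of the daemon" amounts to, assuming a hypothetical export-manager loop; the file path, export path layout, project list and helper names below are illustrative stand-ins, not the actual labstore scripts in puppet:
```
#!/usr/bin/env python
# Minimal sketch only: illustrative stand-in for the labstore export manager.
# Paths, project list and helpers are hypothetical, not the real puppet code.
import subprocess
import time

EXPORTS_D = '/etc/exports.d/wmflabs-projects.exports'  # hypothetical path

def write_exports(projects):
    """Write one export line per project; real logic would read project config."""
    # export path layout is assumed from the /exp/projects mention above
    lines = ['/exp/project/%s *(rw,sync,no_subtree_check)' % p for p in projects]
    with open(EXPORTS_D, 'w') as f:
        f.write('\n'.join(lines) + '\n')

def refresh_exports():
    # Re-export everything so new instances are picked up without anyone
    # having to run exportfs by hand.
    subprocess.check_call(['/usr/sbin/exportfs', '-ra'])

if __name__ == '__main__':
    while True:
        write_exports(['tools', 'wikidata-query'])  # placeholder project list
        refresh_exports()
        time.sleep(60)
```
The key point from the discussion above is the `exportfs -ra` call: once the daemon issues it on every pass, new instances in existing projects start working without an op being asked, while brand-new projects still need bind mounts set up by hand.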
[19:08:23] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Run exportfs on every run of the daemon [puppet] - 10https://gerrit.wikimedia.org/r/222372 (owner: 10Yuvipanda) [19:10:00] andrewbogott: unrelated, but did you read https://wikitech.wikimedia.org/wiki/Help:Shared_storage? [19:10:26] YuviPanda: no, but I can now. [19:10:40] andrewbogott: please do and edit as appropriate. [19:10:52] andrewbogott: I'm using it to have project owners figure out which mounts they want. [19:11:21] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [19:12:33] CUSTOM - RAID on logstash1003 is CRITICAL 1 failed LD(s) (Degraded) [19:16:09] YuviPanda: do we want to say ‘no log files on nfs’ or is that a valid use case? [19:16:26] andrewbogott: yeah, no log files on NFS is a thing we should say [19:17:04] ok, done [19:18:06] ACKNOWLEDGEMENT - RAID on logstash1003 is CRITICAL 1 failed LD(s) (Degraded) daniel_zahn T104592 [19:18:40] 6operations: logstash1003 - RAID failed - https://phabricator.wikimedia.org/T104592#1422217 (10Dzahn) [19:19:14] what about priority for something like that? [19:20:33] 6operations: logstash1003 - RAID failed - https://phabricator.wikimedia.org/T104592#1421758 (10Dzahn) unsure about the appropriate priority level [19:21:22] 6operations, 10ops-eqiad: logstash1003 - RAID failed - https://phabricator.wikimedia.org/T104592#1422225 (10Dzahn) [19:23:11] andrewbogott: w00t, it works for newly created instances!!! [19:23:40] YuviPanda: great! [19:23:45] andrewbogott: what isn't supported atm is new *projects*, but I believe that can wait for next week [19:23:52] yep [19:24:01] Since that requires hand work regardless. [19:24:42] andrewbogott: yeah. [19:24:49] andrewbogott: I confirmed it with 'tools-boom' instance [19:25:10] CUSTOM - Tool Labs instance distribution on labcontrol1002 is CRITICAL master class instances not spread out enough [19:25:11] CUSTOM - puppet last run on labcontrol1002 is CRITICAL Puppet has 1 failures [19:25:11] CUSTOM - puppetmaster https on labcontrol1002 is CRITICAL: Connection refused [19:25:31] ^ all known? i realize it was renamed but all tickets look closed [19:26:05] mutante: chris is installing [19:26:22] ah, reinstall? ok [19:26:28] 6operations, 10Datasets-Archiving: Publish a full SVN dump - https://phabricator.wikimedia.org/T93179#1422250 (10valhallasw) [19:27:10] andrewbogott: is it still going to be a puppetmaster? [19:27:26] mutante: hm… actually, wait [19:27:33] puppet is trying to start apache there and fails [19:27:34] he’s reinstalling labnet1002, not labcontrol1002. 
[19:27:40] Man, ever since I renamed things I can’t keep track :) [19:27:47] this is ex-virt1000 [19:27:49] right [19:27:49] I will log in and see what’s happening [19:28:03] failed: Could not start Service[apache2]: Execution of '/etc/init.d/apache2 start' returned 1: [19:28:15] AH00526: Syntax error on line 16 of /etc/apache2/sites-enabled/50-puppetmaster-wikimedia-org.conf [19:28:31] andrewbogott: labcontrol1002 is not labnet1002 [19:28:43] cmjohnson1: yeah, I just figured that out :) [19:28:48] labcontrol1002 is old virt1000 iirc I took it down a couple of days ago [19:28:52] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:28:52] but it's back up [19:28:53] line 16 is SSLCertificateFile [19:29:04] mutante: yeah, looks fine to me [19:29:15] too many labsomethings [19:29:15] certainly not syntax error [19:29:21] syntax error though [19:29:48] andrewbogott: SSLCertificateFile: file '/var/lib/puppet/server/ssl/certs/labs-puppetmaster-eqiad.wikimedia.org.pem' does not exist or is empty [19:29:53] missing cert [19:29:58] hm [19:30:09] mutante: this only just now started complaining? [19:30:23] andrewbogott: 7d 2h [19:30:28] oh [19:30:35] well then [19:30:45] well, that is the puppetmaster part [19:30:53] the puppet fail is only since 21h [19:31:18] 6operations, 6Labs, 10Tool-Labs-tools-Other: Move geohack to production - https://phabricator.wikimedia.org/T102960#1422255 (10valhallasw) [19:31:21] puppet somehow ran 21h ago [19:31:37] puppetmaster must have another issue besides the cert then [19:31:57] or it was just shut down 7h ago [19:32:00] but the cert was still there [19:32:05] 7d, sorry [19:32:36] 21 hours ago is maybe jgage’s puppet changes [19:32:41] but, let me see why that file isn’t there [19:33:13] 6operations, 6Labs, 10Tool-Labs-tools-Other: Move geohack to production - https://phabricator.wikimedia.org/T102960#1422263 (10yuvipanda) I'll note that wdq-mm / ORES didn't do too badly during the big bad NFS outage. Those are puppetized, have monitoring, and do not depend on NFS in any form or way. We can... [19:34:52] 6operations, 6Phabricator, 5Patch-For-Review: Automate nightly dump of Phabricator metadata - https://phabricator.wikimedia.org/T103028#1422293 (10JAufrecht) I've accessed the fresh dump and confirmed that it has data through July 2. thank you. [19:35:32] (03PS1) 10Cmjohnson: Updating mac for labnet1002 in dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/222458 [19:36:25] 6operations, 6Phabricator, 5Patch-For-Review: Automate nightly dump of Phabricator metadata - https://phabricator.wikimedia.org/T103028#1422306 (10JAufrecht) I've accessed the fresh dump and confirmed that it has data through July 2. :) *Joel Aufrecht* Team Practices Group Wikimedia Foundation [19:36:42] (03CR) 10Cmjohnson: [C: 032] Updating mac for labnet1002 in dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/222458 (owner: 10Cmjohnson) [19:42:25] i icinga lying to me?! [19:42:29] is* [19:43:44] ottomata: about?
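A quick way to reproduce the diagnosis above (the "syntax error" that is really a missing SSLCertificateFile) is to confirm that every SSL file path referenced by the vhost actually exists before restarting Apache. A rough sketch follows; the vhost path is the one quoted in the log, but the checker script itself is illustrative and does not exist in puppet:
```
#!/usr/bin/env python
# Rough sketch: verify that SSL file paths referenced by an Apache vhost exist.
# The vhost path is taken from the log above; the script itself is illustrative.
import os
import re
import sys

VHOST = '/etc/apache2/sites-enabled/50-puppetmaster-wikimedia-org.conf'
DIRECTIVES = ('SSLCertificateFile', 'SSLCertificateKeyFile', 'SSLCertificateChainFile')

missing = []
with open(VHOST) as f:
    for lineno, line in enumerate(f, 1):
        m = re.match(r'\s*(\S+)\s+(\S+)', line)
        if m and m.group(1) in DIRECTIVES:
            path = m.group(2).strip('"')
            if not os.path.isfile(path) or os.path.getsize(path) == 0:
                missing.append((lineno, m.group(1), path))

for lineno, directive, path in missing:
    print('line %d: %s points at missing/empty %s' % (lineno, directive, path))
sys.exit(1 if missing else 0)
```
Apache reports the missing file as AH00526 "Syntax error on line 16", which is why it initially looked like a config typo rather than a missing certificate.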
[19:45:23] ottomata: news: analytics cluster (and analytics kafka) now also switched to ganglia_new | logstash1003 has a RAID fail, what should the priority be to replace disks [19:45:31] stuff breaking [19:45:32] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hosts=all&style=hostservicedetail&hoststatustypes=12&hostprops=2097162&servicestatustypes=28&serviceprops=2097162&nostatusheader [19:45:41] 7Puppet, 6Labs: Expose public hostname as Fact in puppet - https://phabricator.wikimedia.org/T101903#1422353 (10valhallasw) p:5Triage>3Low [19:46:20] ottomata: in general or just the kafka related things? [19:46:44] mutante: i think it might be ganglia things [19:46:52] kafka related, and also a check for active namenode on analytics1001 [19:46:59] but ja, analytics cluster (ganglia?) things [19:47:09] ja, have data for this in graphite, but not in ganglia [19:47:28] ottomata: is it checking ganglia? then it is, yea. so remember we needed the ACL on network gear to switch? we got that yesterday so we could do that [19:47:36] aye [19:47:37] (03PS1) 10Yuvipanda: tools: Dark launch new webserice-new webservices [puppet] - 10https://gerrit.wikimedia.org/r/222461 [19:47:45] but something isn't working after the switch i guess? [19:47:46] it looks ok in ganglia-web to me [19:47:49] (03PS2) 10Yuvipanda: tools: Dark launch new webserice-new webservices [puppet] - 10https://gerrit.wikimedia.org/r/222461 [19:47:54] but the aggregator changed [19:47:56] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Dark launch new webserice-new webservices [puppet] - 10https://gerrit.wikimedia.org/r/222461 (owner: 10Yuvipanda) [19:47:56] to carbon [19:48:15] like when you recently asked me for the new port on carbon [19:48:18] for something similar [19:48:30] and had to adjust that.. where was it [19:50:23] ottomata: so here they are as normal, i checked that they all came back, even the stat* hosts that were gone before [19:50:26] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Analytics%2520cluster%2520eqiad&tab=m&vn=&hide-hf=false [19:50:44] hm [19:51:04] wait, analytics1012 is NOT [19:51:18] ja that is in ganglia kafka cluster [19:51:21] eh, and that's normal [19:51:24] !log starting restbase1005 (died) [19:52:19] yea http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Analytics%2520Kafka%2520cluster%2520eqiad&tab=m&vn=&hide-hf=false [19:52:40] (03Abandoned) 10Yuvipanda: Revert "Tools: Simplify and fix mail setup" [puppet] - 10https://gerrit.wikimedia.org/r/222329 (owner: 10Yuvipanda) [19:53:47] (03PS2) 10Ori.livneh: Add varnish stats reporter for ResourceLoader requests [puppet] - 10https://gerrit.wikimedia.org/r/222371 (https://phabricator.wikimedia.org/T104277) [19:54:08] (03PS3) 10Ori.livneh: Add varnish stats reporter for ResourceLoader requests [puppet] - 10https://gerrit.wikimedia.org/r/222371 (https://phabricator.wikimedia.org/T104277) [19:55:54] hmmm [19:56:13] ottomata: so it asks uranium, in monitoring::ganglia .. hmm [19:56:25] uranium is still ganglia-web [19:56:33] is that ok? [19:56:48] i'm looking at analytics1001, and it is missing hadoop metrics...but this might be a jmxtrans problem, not a ganglia problem [19:58:12] hm yeah the kafka brokers are missing kafka metrics too [19:58:13] hm [19:58:20] i think this is not related to ganglia new stuff [19:58:32] ottomata: it looks like ACL [19:58:52] ottomata: it is configured to check uranium on port 8654 [19:59:03] i can reach that from elsewhere but not from analytics1012 [19:59:32] Ah!
[19:59:36] maybe the recent changes to let it talk to carbon [19:59:36] jmxtrans is trying to send to 239.192.1.45 [19:59:43] (03PS2) 10Merlijn van Deen: Tools: Only forward mail for project users [puppet] - 10https://gerrit.wikimedia.org/r/203667 (https://phabricator.wikimedia.org/T93526) (owner: 10Tim Landscheidt) [19:59:44] oh? [19:59:45] influence this [19:59:49] yeah that might be why [19:59:51] is that uranium? [19:59:53] i can't reach it either [20:00:03] no [20:00:03] hm [20:00:04] legoktm DerHexer: Dear anthropoid, the time has come. Please deploy Global merge testing (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150702T2000). [20:00:27] ottomata: try the same from a non-analytics prod host, telnet or nc [20:00:30] that works for me [20:00:40] mutante: [20:00:40] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/analytics/kafka.pp#L93 [20:00:50] L92 actually [20:00:52] what should that be? [20:01:09] also https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/analytics/hadoop.pp#L237 [20:01:10] RECOVERY - puppetmaster https on labcontrol1002 is OK: HTTP OK: Status line output matched 400 - 287 bytes in 1.592 second response time [20:01:40] RECOVERY - puppet last run on labcontrol1002 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:01:50] (03PS3) 10Merlijn van Deen: Tools: Only forward mail for project users [puppet] - 10https://gerrit.wikimedia.org/r/203667 (https://phabricator.wikimedia.org/T93526) (owner: 10Tim Landscheidt) [20:02:26] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1422476 (10RobH) [20:02:49] (03PS1) 10Legoktm: Temporarily enable $wgCentralAuthEnableUserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222464 [20:03:00] (03CR) 10Legoktm: [C: 032] Temporarily enable $wgCentralAuthEnableUserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222464 (owner: 10Legoktm) [20:03:07] (03Merged) 10jenkins-bot: Temporarily enable $wgCentralAuthEnableUserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222464 (owner: 10Legoktm) [20:04:02] ottomata: 8649 is still 8649 on uranium ..uhmm ... [20:04:11] mutante: not the port [20:04:12] the host. [20:04:14] udp 0 0 239.192.1.8:8649 0.0.0.0:* 999 874174749 - [20:04:19] is wrong [20:04:21] !log legoktm Synchronized wmf-config/CommonSettings.php: Temporarily enable $wgCentralAuthEnableUserMerge (duration: 00m 12s) [20:04:22] that [20:04:30] those were the gangalia aggregator multicast IPs [20:04:43] so [20:04:44] 239.192.1.8 [20:04:44] ? [20:04:45] for both? [20:04:57] they are different 'ganglia clusters', do they have different IPs to use? [20:06:36] mutante: ^^ [20:06:50] (03PS4) 10Merlijn van Deen: Tools: Only forward mail for project users [puppet] - 10https://gerrit.wikimedia.org/r/203667 (https://phabricator.wikimedia.org/T93526) (owner: 10Tim Landscheidt) [20:07:53] (03PS4) 10Andrew Bogott: puppetmaster: Enable autosigning puppet certs for labs [puppet] - 10https://gerrit.wikimedia.org/r/218380 (https://phabricator.wikimedia.org/T102504) (owner: 10Yuvipanda) [20:07:56] (03PS4) 10Andrew Bogott: Switch on salt auto_accept for labs. 
[puppet] - 10https://gerrit.wikimedia.org/r/220306 (https://phabricator.wikimedia.org/T102504) [20:08:56] (03CR) 10Merlijn van Deen: [C: 031] "Tested localuser on tools:" [puppet] - 10https://gerrit.wikimedia.org/r/203667 (https://phabricator.wikimedia.org/T93526) (owner: 10Tim Landscheidt) [20:10:33] (03PS1) 10Legoktm: Only enable global merge on meta, give local steward group rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222467 [20:10:54] (03CR) 10Legoktm: [C: 032] Only enable global merge on meta, give local steward group rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222467 (owner: 10Legoktm) [20:11:00] (03Merged) 10jenkins-bot: Only enable global merge on meta, give local steward group rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222467 (owner: 10Legoktm) [20:11:21] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0] [20:11:37] !log legoktm Synchronized wmf-config/: Only enable global merge on meta, give local steward group rights (duration: 00m 13s) [20:13:03] !log restarted keystone on labcontrol1001 [20:13:12] Logged the message, Master [20:13:50] PROBLEM - HHVM rendering on mw1051 is CRITICAL - Socket timeout after 10 seconds [20:14:35] ottomata: on carbon.eqiad.wmnet port 9694 for Kafka and port 9681 for Analytics .. i think .. sorry, i have to pick up kid from school [20:15:15] ok thanks will change [20:15:19] ottomata: so i'm sure these are the aggregator ports on carbon. just not sure about multicast address [20:15:38] the config files are /etc/ganglia/aggregators/1032.conf and 1045.conf [20:15:40] PROBLEM - Apache HTTP on mw1051 is CRITICAL - Socket timeout after 10 seconds [20:16:13] mutante: it doesn't need multicast if the aggregator isn't using multicast [20:16:31] it should be the same as those hosts are supposed to use for regular gmond [20:16:35] so it should be those [20:16:38] trying... [20:16:41] ottomata: then it should work :) great [20:16:56] i have to run but i will certainly check later if it's good in icinga [20:17:16] k thanks [20:17:52] (03PS1) 10Ottomata: Use new ganglia IPs and Ports for analytics clusters [puppet] - 10https://gerrit.wikimedia.org/r/222469 [20:19:30] (03CR) 10Ottomata: [C: 032] Use new ganglia IPs and Ports for analytics clusters [puppet] - 10https://gerrit.wikimedia.org/r/222469 (owner: 10Ottomata) [20:19:31] PROBLEM - HHVM queue size on mw1051 is CRITICAL 42.86% of data above the critical threshold [80.0] [20:19:41] PROBLEM - HHVM busy threads on mw1051 is CRITICAL 87.50% of data above the critical threshold [86.4] [20:22:46] (03PS5) 10Andrew Bogott: puppetmaster: Enable autosigning puppet certs for labs [puppet] - 10https://gerrit.wikimedia.org/r/218380 (https://phabricator.wikimedia.org/T102504) (owner: 10Yuvipanda) [20:23:23] (03PS5) 10Andrew Bogott: Switch on salt auto_accept for labs. [puppet] - 10https://gerrit.wikimedia.org/r/220306 (https://phabricator.wikimedia.org/T102504) [20:23:46] (03CR) 10Andrew Bogott: [C: 032] puppetmaster: Enable autosigning puppet certs for labs [puppet] - 10https://gerrit.wikimedia.org/r/218380 (https://phabricator.wikimedia.org/T102504) (owner: 10Yuvipanda) [20:28:21] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [20:29:54] mutante: btw, i think this is working.
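The "try telnet or nc from another host" check suggested above can be scripted; the sketch below probes the old and new aggregator endpoints quoted in the log. Note the hostnames and ports come from the conversation, the script is illustrative, and since gmond/jmxtrans actually send UDP, a TCP connect success is only a rough hint that the ACL allows the path:
```
#!/usr/bin/env python
# Sketch of the "try telnet/nc from a non-analytics host" check discussed above.
# Hosts/ports are the ones quoted in the log; nothing else is implied.
import socket

TARGETS = [
    ('uranium.wikimedia.org', 8654),  # old poll target used by monitoring::ganglia
    ('carbon.eqiad.wmnet', 9694),     # new aggregator port for Analytics Kafka
    ('carbon.eqiad.wmnet', 9681),     # new aggregator port for the Analytics cluster
]

for host, port in TARGETS:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(3)
    try:
        s.connect((host, port))
        print('OK   %s:%d reachable' % (host, port))
    except (socket.timeout, socket.error) as exc:
        print('FAIL %s:%d (%s)' % (host, port, exc))
    finally:
        s.close()
```
Run from analytics1012 versus any other prod host, a FAIL/OK split like the one described above points at the network ACL rather than at ganglia itself.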
[20:30:47] 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1422727 (10BBlack) I think there's a good chance redoing the thermal paste addressed the issue on cp1065. It's been roughly 4 hours since it was p... [20:31:02] AaronSchulz: around? [20:31:09] lots of [20:31:10] Query: UPDATE `archive` SET ar_user = '2211638',ar_user_text = 'YuviPanda' WHERE ar_user = '1404827' [20:31:10] Function: MergeUser::mergeDatabaseTables [20:31:10] Error: 2013 Lost connection to MySQL server during query (10.64.32.28) [20:31:39] ouch [20:31:40] I've seen those errors with TMH for months [20:31:59] not sure why, since it flushes the trx and ping() should apply...wonder if that's related [20:32:36] (03CR) 10Andrew Bogott: [C: 032] Switch on salt auto_accept for labs. [puppet] - 10https://gerrit.wikimedia.org/r/220306 (https://phabricator.wikimedia.org/T102504) (owner: 10Andrew Bogott) [20:34:07] legoktm: is that in a trx or not? [20:34:43] if it is then ping() doesn't apply, so a query after a long delay (like talking to another DB) will give that [20:34:50] any super long query might give that in any case [20:35:00] AaronSchulz: it is [20:37:01] (03PS1) 10Andrew Bogott: Fix an erb typo [puppet] - 10https://gerrit.wikimedia.org/r/222475 [20:37:43] AaronSchulz: what would you suggest doing here? [20:38:35] (03CR) 10Andrew Bogott: [C: 032] Fix an erb typo [puppet] - 10https://gerrit.wikimedia.org/r/222475 (owner: 10Andrew Bogott) [20:38:45] does this run in jobs or elsewhere? [20:39:04] jobs [20:39:43] and if I recall they are human initiated and not terrible frequent? [20:39:48] yep [20:39:57] bunch of Error: 1146 Table 'aswiki.cn_notice_log' doesn't exist (10.64.16.27) too, yippe [20:40:16] still could be some contention when runners pick them up, but that doesn't seem super likely [20:40:47] well job-job contention, that doesn't rule out contention with page deletes/restores [20:41:04] but they should mostly be touching their own database... [20:42:05] where is the job class [20:42:18] extensions/CentralAuth/includes/LocalRenameJob/LocalUserMergeJob.php [20:42:27] which calls UserMerge's MergeUser class [20:42:30] right [20:43:08] (03PS4) 10BBlack: Add varnish stats reporter for ResourceLoader requests [puppet] - 10https://gerrit.wikimedia.org/r/222371 (https://phabricator.wikimedia.org/T104277) (owner: 10Ori.livneh) [20:43:11] (03PS1) 10Legoktm: Disable global merge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222477 [20:43:15] // Can't batch/wait when in a transaction or when no batch key is given [20:43:26] (03CR) 10Legoktm: [C: 032] Disable global merge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222477 (owner: 10Legoktm) [20:43:31] legoktm: since jobs *do* use transactions, does that ever actually apply? [20:43:33] (03Merged) 10jenkins-bot: Disable global merge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222477 (owner: 10Legoktm) [20:43:55] (03CR) 10BBlack: [C: 031] "Sounds good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/222371 (https://phabricator.wikimedia.org/T104277) (owner: 10Ori.livneh) [20:44:39] AaronSchulz: the code can also be executed from the web through Special:UserMerge, but that's disabled on WMF wikis [20:45:21] sure, but I mean that it never uses batches [20:45:40] (03CR) 10Ori.livneh: [C: 032] Add varnish stats reporter for ResourceLoader requests [puppet] - 10https://gerrit.wikimedia.org/r/222371 (https://phabricator.wikimedia.org/T104277) (owner: 10Ori.livneh) [20:45:41] not all tables have a batchKey I think [20:46:01] $this->mergeEditcount(); would always trip DBO_TRX [20:46:12] AaronSchulz: also, you wrote this code :) https://gerrit.wikimedia.org/r/#/c/158310/ [20:47:12] were job runners on hhvm then? It would have worked before that. [20:47:34] this was last october [20:47:56] !log legoktm Synchronized wmf-config/CommonSettings.php: Disable global merge (duration: 00m 14s) [20:48:02] Logged the message, Master [20:49:45] legoktm: if merge() had a trx flag, then LocalUserMergeJob could just tell it to flush and batch [20:50:04] that seems reasonably easy to reason about [20:50:29] do you want to write a patch for that? I'm not really following you [20:51:01] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures [20:51:16] I mean that we have run() -> doRun() -> merge() -> mergeEditcount() and then mergeDatabaseTables is called later [20:51:32] the select()s in mergeEditcount trigger DBO_TRX logic that causes begin() [20:51:45] that means $db->trxLevel() is 1 [20:52:24] maybe I forgot about DBO_TRX when updating/merging sam's patch then [20:52:40] should mergeEditcount just call $db->commit() then? [20:52:41] (DBO_TRX for runners that is, which was still slightly novel) [20:53:17] isn't that flakey and also a problem for web requests (even if we turned that off)? [20:55:42] I...don't know. I don't really understand the transaction stuff that's going on here... [20:57:02] AaronSchulz: so uh, wanna write patches for this? :) [20:59:54] I suppose [21:03:57] 6operations, 10OTRS, 6Security, 7HTTPS, 5Patch-For-Review: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1422841 (10RobH) So I think I should be able to simply roll the above certificate replacement into place with minimal downtime. I'll run through said plan on this task... [21:05:06] Does anyone know of the proper channel to inform all otrs users for downtime? [21:05:23] Keegan: ^ [21:05:28] other than sending an email to the 9 publicly listed otrs lists [21:05:32] =] [21:05:48] reference: sha1 cert currently used for ticket.w.o [21:05:48] https://phabricator.wikimedia.org/T91504 [21:06:14] robh: Yeah, I can send a notice to all agents through OTRS [21:06:26] I just know if I don't schedule a proper window, it'll somehow break. Proper window means it will more likely go smoothly [21:06:31] robh: there's an otrs-admins list, and Keegan is an OTRS admin :P [21:06:37] Keegan: cool, I'll create the phab task for notification [21:06:53] think notice now for next tuesday is enough leadtime? [21:07:00] It is [21:07:50] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [21:10:46] robh: Okay, I've got my notice to the other OTRS admins drafted, will send when you have the phab task for the notification, then I'll email the agents, and we're all set. [21:10:50] Allegedly.
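The fix that eventually got synced ("USE_MULTI_COMMIT flag to enable query batching", logged later in the evening) boils down to the pattern described above: don't run one giant UPDATE inside an already-open DBO_TRX transaction, flush and work in small committed batches. A sketch of that shape follows; it is not the MediaWiki/UserMerge PHP implementation, the table and columns are from the quoted query, the batch key ar_id and the MySQLdb driver are assumptions:
```
# Sketch of the batching pattern discussed above, not the actual UserMerge code.
# Table/columns are from the quoted query; ar_id as batch key and the MySQLdb
# driver are assumptions -- any DB-API connection works the same way here.
import MySQLdb

BATCH = 500

def merge_archive_rows(conn, old_user, new_user, new_name):
    """Reattribute archive rows in small committed batches instead of one
    long-running UPDATE inside an outer transaction."""
    cur = conn.cursor()
    while True:
        # Bounding each UPDATE keeps every transaction short, so the
        # connection cannot time out mid-query ("Lost connection" above).
        cur.execute(
            "SELECT ar_id FROM archive WHERE ar_user = %s LIMIT %s",
            (old_user, BATCH))
        ids = [row[0] for row in cur.fetchall()]
        if not ids:
            break
        fmt = ','.join(['%s'] * len(ids))
        cur.execute(
            "UPDATE archive SET ar_user = %%s, ar_user_text = %%s "
            "WHERE ar_id IN (%s)" % fmt,
            [new_user, new_name] + ids)
        conn.commit()  # flush after every batch; this is the crucial part
```
The design point Aaron raises still applies: whoever calls the merge has to know whether an outer transaction is open, which is why a flag on merge() (rather than an unconditional commit inside mergeEditcount) is the less flaky option.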
[21:11:11] 6operations, 7HTTPS, 3Roadmap, 7notice: OTRS Maintenance Window - July 7th 17:00 UTC to 18:00 UTC - https://phabricator.wikimedia.org/T104634#1422863 (10RobH) 3NEW [21:11:14] ^ =] [21:11:20] (03PS1) 10Ori.livneh: add varnish::logging::rls to remaining 2layer varnishes [puppet] - 10https://gerrit.wikimedia.org/r/222485 [21:11:22] 6operations, 7HTTPS, 3Roadmap, 7notice: OTRS Maintenance Window - July 7th 17:00 UTC to 18:00 UTC - https://phabricator.wikimedia.org/T104634#1422873 (10RobH) a:3RobH [21:11:33] 6operations, 7HTTPS, 3Roadmap, 7notice: OTRS Maintenance Window - July 7th 17:00 UTC to 18:00 UTC - https://phabricator.wikimedia.org/T104634#1422863 (10RobH) p:5Triage>3Normal [21:12:15] (03CR) 10Ori.livneh: [C: 032 V: 032] add varnish::logging::rls to remaining 2layer varnishes [puppet] - 10https://gerrit.wikimedia.org/r/222485 (owner: 10Ori.livneh) [21:12:49] 6operations, 10OTRS, 7HTTPS, 3Roadmap, 7notice: OTRS Maintenance Window - July 7th 17:00 UTC to 18:00 UTC - https://phabricator.wikimedia.org/T104634#1422882 (10Krenair) [21:14:50] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [21:16:22] ok, its also on the deployments wiki page as well [21:16:35] Keegan: let me know if you need anything else from me, thank you for sending the notification =] [21:16:53] Will do, thanks for fixing it [21:17:09] overdue on my part, but very welcome [21:21:36] 6operations, 10OTRS, 6Security, 7HTTPS, 5Patch-For-Review: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1422922 (10RobH) p:5High>3Normal [21:22:43] !log legoktm Synchronized php-1.26wmf12/extensions/CentralNotice/: https://gerrit.wikimedia.org/r/222484 (duration: 00m 15s) [21:22:50] Logged the message, Master [21:24:11] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [21:37:53] !log restarted apache2 or iridium after applying hotfix for phabricator css issue [21:37:59] Logged the message, Master [21:39:18] twentyafterfour: thank you :D [21:48:11] 6operations, 10Wikimedia-Git-or-Gerrit, 7HTTPS: Chromium says "Your connection to gerrit.wikimedia.org is encrypted with obsolete cryptography" - https://phabricator.wikimedia.org/T104649#1423102 (10Krenair) [21:48:21] 6operations, 10Wikimedia-Git-or-Gerrit, 7HTTPS: Chromium says "Your connection to gerrit.wikimedia.org is encrypted with obsolete cryptography" - https://phabricator.wikimedia.org/T104649#1423105 (10polybuildr) [21:49:06] legoktm: no problem, it was annoying me as well [21:49:32] 6operations, 10Wikimedia-Git-or-Gerrit, 7HTTPS: Chromium says "Your connection to gerrit.wikimedia.org is encrypted with obsolete cryptography" - https://phabricator.wikimedia.org/T104649#1423094 (10polybuildr) Also, Firefox does not complain. [21:51:53] !log legoktm Synchronized php-1.26wmf12/extensions/Interwiki/Interwiki_body.php: Add missing global $wgInterwikiViewOnly declaration (duration: 00m 15s) [21:51:59] Logged the message, Master [21:53:43] win 23 [21:53:45] blerg [21:58:08] 6operations, 10Wikimedia-Git-or-Gerrit, 7HTTPS: Chromium says "Your connection to gerrit.wikimedia.org is encrypted with obsolete cryptography" - https://phabricator.wikimedia.org/T104649#1423128 (10BBlack) Even commercial Chrome complains about this, and it's a valid complaint. Our gerrit server runs Apach... 
[21:58:14] 6operations, 10OTRS, 7HTTPS, 3Roadmap, 7notice: OTRS Maintenance Window - July 7th 17:00 UTC to 18:00 UTC - https://phabricator.wikimedia.org/T104634#1423130 (10Keegan) All agents have been notified of the downtime via Admin Notification on OTRS. [22:00:22] 6operations, 10OTRS, 7HTTPS, 3Roadmap, 7notice: OTRS Maintenance Window - July 7th 17:00 UTC to 18:00 UTC - https://phabricator.wikimedia.org/T104634#1423135 (10jeremyb) > T91504 - replace sha1 cert with sha256 do you mean this one: {T73156} ? [22:00:26] A log message just hit hhvm.log on fluorine with a timestamp from 17:41:21 [22:00:52] any ideas what would make rsyslog buffer a message for 4 hours? [22:02:56] /a/mw-log/hhvm.log on fluorine is full of wacky timestamps [22:03:31] 20:59:06 > 17:38:26 > 21:32:30 > 21:57:15 [22:03:51] it's the JIT's branch prediction [22:03:51] those are consecutive entries [22:03:57] 6operations, 10RESTBase-Cassandra, 6Services, 7RESTBase-architecture: alternative Cassandra metrics reporting - https://phabricator.wikimedia.org/T104208#1423151 (10Eevans) >>! In T104208#1410038, @Eevans wrote: > One option, would be to implement a JMX-based collector that writes to Graphite, in Java. I... [22:03:58] (just kidding) [22:05:17] I logged into a couple of the servers with skewed lines and their clocks are fine [22:05:40] so is it something in the rsyslog forwarding from the MW hosts to fluorine or ...? [22:05:51] 6operations, 10OTRS, 7HTTPS, 3Roadmap, 7notice: OTRS Maintenance Window - July 7th 17:00 UTC to 18:00 UTC - https://phabricator.wikimedia.org/T104634#1422863 (10Keegan) >>! In T104634#1423135, @jeremyb wrote: > > or maybe there's not a ticket for OTRS sha1 specifically just OTRS SSL in general? This. [22:12:19] anomie: ping [22:13:15] (03PS8) 10BBlack: tlsproxy: multi-cert support, including ocsp [puppet] - 10https://gerrit.wikimedia.org/r/222067 (https://phabricator.wikimedia.org/T86654) [22:14:19] legoktm: i need you or anomie [22:14:26] matanya: hi? [22:14:30] (03CR) 10BBlack: [C: 04-1] "Correct now, but holding till Monday, as it's complex and requires some baby-sitting and post-cleanup of outdated OCSP filenames." [puppet] - 10https://gerrit.wikimedia.org/r/222067 (https://phabricator.wikimedia.org/T86654) (owner: 10BBlack) [22:14:33] matanya: Brad is probably off for the weekend [22:14:44] regarding the OS bug [22:14:54] i can't put the info in phab [22:15:07] there is a reason this stuff is OS'ed [22:15:18] right [22:15:23] Can i pm you some examples? [22:15:23] that's fine [22:22:31] PROBLEM - puppet last run on mw1152 is CRITICAL Puppet last ran 12 hours ago [22:31:50] !log legoktm Synchronized php-1.26wmf12/extensions/UserMerge/: Added USE_MULTI_COMMIT flag to enable query batching (duration: 00m 26s) [22:31:56] Logged the message, Master [22:34:03] !log legoktm Synchronized php-1.26wmf12/extensions/CentralAuth/: Made use of new USE_MULTI_COMMIT flag in user merge jobs (duration: 00m 18s) [22:34:09] Logged the message, Master [22:35:12] RECOVERY - puppet last run on mw1152 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [22:43:07] 6operations, 7Database: Test and fix db1047 BBU - https://phabricator.wikimedia.org/T103345#1423311 (10DarTar) [23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150702T2300).
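To quantify how "wacky" the hhvm.log timestamps are, something like the following would flag entries whose timestamp jumps backwards relative to the previous line. It is a rough sketch: it assumes each log line carries an HH:MM:SS timestamp near its start, which may not hold for every log format on fluorine, and the filename in the usage comment is just the path mentioned above:
```
#!/usr/bin/env python
# Rough sketch: flag log lines whose timestamp goes backwards relative to the
# previous entry. Assumes an HH:MM:SS timestamp near the start of each line;
# adjust the regex for the actual hhvm.log format. A single big backwards jump
# at midnight rollover is expected and can be ignored.
import re
import sys

TS = re.compile(r'(\d{2}):(\d{2}):(\d{2})')

def seconds(match):
    h, m, s = (int(g) for g in match.groups())
    return h * 3600 + m * 60 + s

prev = None
for lineno, line in enumerate(sys.stdin, 1):
    m = TS.search(line[:40])   # only look near the start of the line
    if not m:
        continue
    now = seconds(m)
    if prev is not None and now < prev - 300:   # more than 5 min backwards
        print('line %d: timestamp went backwards (%s)' % (lineno, m.group(0)))
    prev = now

# usage (hypothetical): python find_backwards.py < /a/mw-log/hhvm.log
```
A clean result per-host but backwards jumps on the aggregated file would point at the forwarding/relay path rather than at the MW hosts' clocks, which matches the observation above that the individual servers' clocks are fine.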
[23:03:15] No patches in the SWAT :) [23:04:02] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 3 below the confidence bounds [23:15:02] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 2 below the confidence bounds [23:16:36] 6operations, 10RESTBase, 6Services: Add services team as a contact alert for RESTBase back-end HTTP checks - https://phabricator.wikimedia.org/T104656#1423388 (10mobrovac) 3NEW [23:21:45] 6operations, 10OTRS, 7HTTPS, 3Roadmap, 7notice: OTRS Maintenance Window - July 7th 17:00 UTC to 18:00 UTC - https://phabricator.wikimedia.org/T104634#1423396 (10Dzahn) it's T91504 which is a blocker for T73156 [23:23:03] 6operations, 10Wikimedia-Git-or-Gerrit, 7HTTPS: Chromium says "Your connection to gerrit.wikimedia.org is encrypted with obsolete cryptography" - https://phabricator.wikimedia.org/T104649#1423399 (10Dzahn) [23:23:40] (03PS1) 10Ori.livneh: varnishlog: allow passing NULL parameter to VCL_Arg() [puppet] - 10https://gerrit.wikimedia.org/r/222507 [23:24:25] (03PS1) 10Mobrovac: Add the Services team to the contact list for RESTBase HTTP checks [puppet] - 10https://gerrit.wikimedia.org/r/222508 (https://phabricator.wikimedia.org/T104656) [23:26:02] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr-all@ list - https://phabricator.wikimedia.org/T104596#1423404 (10Dzahn) Hi, fr-all does not have indidivual members. It is just a combination of these: fr-all: fr-development, fr-online, fr-software-engineers, fr-tech-ops, fr-tech which of... [23:26:21] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr-all@ alias - https://phabricator.wikimedia.org/T104596#1423405 (10Dzahn) [23:26:58] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr-all@ alias - https://phabricator.wikimedia.org/T104596#1421841 (10Dzahn) >>! In T104596#1422152, @Krenair wrote: > Looks like this must be an alias in the private exim config, rather than a mailman thing. correct, not a mailman thing [23:27:51] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr-all@ alias - https://phabricator.wikimedia.org/T104596#1423418 (10Dzahn) a:3Dzahn [23:34:35] (03CR) 10Dzahn: [C: 032] Add the Services team to the contact list for RESTBase HTTP checks [puppet] - 10https://gerrit.wikimedia.org/r/222508 (https://phabricator.wikimedia.org/T104656) (owner: 10Mobrovac) [23:35:30] PROBLEM - puppet last run on db1062 is CRITICAL Puppet has 1 failures [23:37:49] 6operations, 10RESTBase, 6Services: Add services team as a contact alert for RESTBase back-end HTTP checks - https://phabricator.wikimedia.org/T104656#1423440 (10mobrovac) 5Open>3Resolved [23:37:52] 6operations, 5Patch-For-Review: Move static-bugzilla from zirconium to ganeti - https://phabricator.wikimedia.org/T101734#1423446 (10Dzahn) [23:37:54] 6operations, 10vm-requests: eqiad: (1) VM for static-bugzilla - https://phabricator.wikimedia.org/T103604#1423444 (10Dzahn) 5Resolved>3Open after it was installed and the initial puppet run i could not login with either SSH key nor root password from console [23:38:15] 6operations, 10RESTBase, 6Services: Add services team as a contact alert for RESTBase back-end HTTP checks - https://phabricator.wikimedia.org/T104656#1423388 (10mobrovac) Thank you, @Dzahn ! 
[23:40:05] 6operations, 10RESTBase, 6Services: Add services team as a contact alert for RESTBase back-end HTTP checks - https://phabricator.wikimedia.org/T104656#1423449 (10Dzahn) applied on neon: ``` - contact_groups admins + contact_groups admins,team-services host_name... [23:40:11] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr-all@ alias - https://phabricator.wikimedia.org/T104596#1423450 (10atgo) Jerry is the admin for Lisa Gruwell. What sub-list is Lisa on? [23:41:50] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr-all@ alias - https://phabricator.wikimedia.org/T104596#1423452 (10Dzahn) Lisa Gruwell is on fr-development and fr-online. [23:44:30] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr-all@ alias - https://phabricator.wikimedia.org/T104596#1423461 (10atgo) Let's do the same for Jerry. Thanks! [23:44:58] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr-all@ alias - https://phabricator.wikimedia.org/T104596#1423470 (10atgo) Actually.. what list is Kourosh on? [23:47:28] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr-all@ alias - https://phabricator.wikimedia.org/T104596#1423472 (10Dzahn) Is Kourosh = kkarimkhany ? Then just on fr-development. [23:48:17] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr-all@ alias - https://phabricator.wikimedia.org/T104596#1423473 (10atgo) Thanks @dzahn. Let's go for both online and development. Much appreciated! [23:51:17] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr-all@ alias - https://phabricator.wikimedia.org/T104596#1423477 (10Dzahn) 5Open>3Resolved done. he has been added. no problem. [23:51:29] RECOVERY - puppet last run on db1062 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:52:14] 6operations, 7Mail: Please add Jerry Kim (jkim@wikimedia.org) to fr aliases - https://phabricator.wikimedia.org/T104596#1423480 (10Dzahn)