[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151208T0000). Please do the needful. [00:00:06] RoanKattouw aude bblack Krinkle jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:59] * aude here [00:01:03] \o [00:02:12] (03CR) 10Catrope: [C: 032] Enable Flow user opt-in Beta Feature on two more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251523 (https://phabricator.wikimedia.org/T117991) (owner: 10Jforrester) [00:03:24] (Roan is taking SWAT.) [00:04:06] (03Merged) 10jenkins-bot: Enable Flow user opt-in Beta Feature on two more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251523 (https://phabricator.wikimedia.org/T117991) (owner: 10Jforrester) [00:07:23] (03PS4) 10BryanDavis: Enable Cards and RelatedArticles so it rides the train [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257434 (https://phabricator.wikimedia.org/T116676) (owner: 10Jdlrobson) [00:07:31] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable Flow opt-in beta feature on bswiki and urwiki (duration: 00m 28s) [00:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:10:02] 6operations, 6WMF-NDA-Requests: Give Neil Quinn access to WMF-NDA - https://phabricator.wikimedia.org/T119122#1860865 (10Neil_P._Quinn_WMF) @Aklapper, no, I haven't. I assume I do that by adding the project? [00:11:47] Config change done and scripts run [00:11:56] Apparently we had one prior opt-in on bswiki and two on urwiki [00:12:15] RoanKattouw: shoot i have one more config change that is supposed to be in swat.. 
[00:12:20] Go for it [00:12:34] As in, put it on the wiki page and ping me when done [00:12:34] 6operations, 6WMF-NDA-Requests: Give Neil Quinn access to WMF-NDA - https://phabricator.wikimedia.org/T119122#1860883 (10Dzahn) >>! In T119122#1860326, @Aklapper wrote: > Have you contacted Operations? No specific reason for Operations here. It's about having the permissions in phabricator. [00:14:16] (03PS1) 10Jdlrobson: Disable beta optin experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257503 (https://phabricator.wikimedia.org/T114038) [00:14:35] Oh RoanKattouw actually ignore me. [00:14:52] Seems like we might have disabled it - so we may be okay. [00:16:35] RoanKattouw: we'll stick to the plan :) [00:17:08] (03Abandoned) 10Jdlrobson: Disable beta optin experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257503 (https://phabricator.wikimedia.org/T114038) (owner: 10Jdlrobson) [00:18:30] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [00:19:52] 6operations, 6WMF-NDA-Requests: Give Neil Quinn access to WMF-NDA - https://phabricator.wikimedia.org/T119122#1860957 (10Dzahn) That said, the process above is for volunteers. I'm not sure we have anything for employees. 
[00:22:38] (03CR) 10Jdlrobson: [C: 031] Enable Cards and RelatedArticles so it rides the train [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257434 (https://phabricator.wikimedia.org/T116676) (owner: 10Jdlrobson) [00:24:20] RoanKattouw: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151208T0000 [00:25:13] (03CR) 10Catrope: [C: 032] Enable data access for beta meta-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256707 (owner: 10Aude) [00:25:54] (03Merged) 10jenkins-bot: Enable data access for beta meta-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256707 (owner: 10Aude) [00:26:14] (03CR) 10Catrope: [C: 032] Revert "wgHTCPRouting: use separate address for upload" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256730 (owner: 10BBlack) [00:26:55] (03Merged) 10jenkins-bot: Revert "wgHTCPRouting: use separate address for upload" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256730 (owner: 10BBlack) [00:27:31] RoanKattouw: o/ [00:27:34] (03CR) 10Catrope: [C: 032] Remove Browse experimental config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252660 (https://phabricator.wikimedia.org/T113686) (owner: 10Phuedx) [00:27:45] Hey Krinkle [00:28:18] (03Merged) 10jenkins-bot: Remove Browse experimental config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/252660 (https://phabricator.wikimedia.org/T113686) (owner: 10Phuedx) [00:29:41] (03CR) 10Catrope: [C: 032] Enable banners extension on mobile web beta only (enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251544 (https://phabricator.wikimedia.org/T101108) (owner: 10Jdlrobson) [00:30:03] 6operations, 10DBA, 6WMF-Legal, 7HTTPS, 5Patch-For-Review: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#1861019 (10Dzahn) [00:30:21] (03Merged) 10jenkins-bot: Enable banners extension on mobile web beta only (enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251544 
(https://phabricator.wikimedia.org/T101108) (owner: 10Jdlrobson) [00:31:48] !log catrope@tin Synchronized wmf-config/squid.php: SWAT: stop using separate address for upload in wgHTCPRouting (duration: 00m 29s) [00:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:34:36] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: enable banners on mobile web beta on enwiki; remove MobileFrontend Browse config (duration: 00m 27s) [00:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:37:01] !log catrope@tin Synchronized php-1.27.0-wmf.7/resources/ResourcesOOUI.php: SWAT: connect OOjs UI to MW's l10n system (duration: 00m 27s) [00:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:37:44] !log catrope@tin Synchronized php-1.27.0-wmf.7/resources/src/oojs-ui-local.js: SWAT: connect OOjs UI to MW's l10n system (duration: 00m 28s) [00:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:38:22] !log catrope@tin Synchronized php-1.27.0-wmf.7/extensions/MwEmbedSupport/: SWAT (duration: 00m 28s) [00:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:39:12] !log catrope@tin Synchronized php-1.27.0-wmf.7/extensions/MobileFrontend/: SWAT (duration: 00m 28s) [00:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:40:56] Krinkle jdlrobson bblack aude: That's SWAT all done [00:41:06] RoanKattouw: okay testing. Thanks! [00:41:23] RoanKattouw: thanks [00:42:02] (03PS2) 10GWicke: Update RESTBase configs for RESTBase v0.9.1 [puppet] - 10https://gerrit.wikimedia.org/r/257408 [00:42:04] it will take a few minutes for the changes to appear on beta, but confirm we didn't accidentally enable our stuff on the real meta-wiki :) [00:42:15] RoanKattouw: confirmed. fixed. 
[00:48:55] 6operations, 10Wikimedia-Mailing-lists, 6Wiktionary: wiktionary-l: assign new moderators - https://phabricator.wikimedia.org/T110969#1591088 (10Dzahn) [00:52:40] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail [00:54:04] 6operations: Check incoming requests to secure.wm.o - https://phabricator.wikimedia.org/T119274#1861105 (10Reedy) This was filed for T93531 [00:59:05] !log disabling puppet in production restbase cluster in preparation for testing https://gerrit.wikimedia.org/r/#/c/257408/ in staging [00:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:59:49] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000000.0] [01:00:07] 6operations: Remove secure.wikimedia.org - https://phabricator.wikimedia.org/T120790#1861112 (10Reedy) 3NEW [01:00:23] 6operations: Remove secure.wikimedia.org - https://phabricator.wikimedia.org/T120790#1861119 (10Reedy) [01:00:25] 6operations: Check incoming requests to secure.wm.o - https://phabricator.wikimedia.org/T119274#1861120 (10Reedy) [01:00:25] (03PS1) 10Ori.livneh: keystone: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257509 [01:00:49] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [5000000.0] [01:00:51] 6operations: Remove secure.wikimedia.org - https://phabricator.wikimedia.org/T120790#1861112 (10Reedy) [01:00:53] 6operations, 10SEO: secure.wikimedia.org entries still showing up in Google search results - https://phabricator.wikimedia.org/T93531#1861128 (10Reedy) [01:01:21] (03CR) 10jenkins-bot: [V: 04-1] keystone: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257509 (owner: 10Ori.livneh) [01:02:18] (03CR) 10GWicke: "I have now disabled puppet on all restbase nodes, so that we can test this in staging without affecting production." 
[puppet] - 10https://gerrit.wikimedia.org/r/257408 (owner: 10GWicke) [01:03:02] (03PS1) 10Reedy: Remove secure.wikimedia.org from apacheo [puppet] - 10https://gerrit.wikimedia.org/r/257510 (https://phabricator.wikimedia.org/T120790) [01:03:30] RoanKattouw: i noticed a follow up needed with one of the patches - guessing i'll have to wait till tomorrows swat window for that right? [01:04:21] (03CR) 10Ori.livneh: [C: 032] "My understanding from chatting with Gabriel is that this patch is a dependency for a VisualEditor feature which is slotted for deployment " [puppet] - 10https://gerrit.wikimedia.org/r/257408 (owner: 10GWicke) [01:04:43] (03PS1) 10Jdlrobson: Minerva should also be blacklisted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257511 (https://phabricator.wikimedia.org/T101108) [01:04:56] (03CR) 10Andrew Bogott: "We used redis replication when we migrated from pmtpa to eqiad -- now that Labs is single datacenter this feature isn't used (and the upst" [puppet] - 10https://gerrit.wikimedia.org/r/257509 (owner: 10Ori.livneh) [01:05:52] (03CR) 10Ori.livneh: "Andrew, gotcha. Would you prefer that I decom the redis instance altogether?" 
[puppet] - 10https://gerrit.wikimedia.org/r/257509 (owner: 10Ori.livneh) [01:05:55] jdlrobson: https://gerrit.wikimedia.org/r/#/c/257511/ is trivial enough that I'll deploy it right now if you want [01:06:07] if you could that would save me an early start tomorrow :) [01:06:14] and i would be much appreciative :) [01:07:14] (03PS4) 10Ori.livneh: gerrit: Map /tools/hooks/commit-msg to /r/tools/hooks/commit-msg [puppet] - 10https://gerrit.wikimedia.org/r/257396 (owner: 10Bartosz Dziewoński) [01:07:20] (03CR) 10Ori.livneh: [C: 032 V: 032] gerrit: Map /tools/hooks/commit-msg to /r/tools/hooks/commit-msg [puppet] - 10https://gerrit.wikimedia.org/r/257396 (owner: 10Bartosz Dziewoński) [01:07:47] (03CR) 10Andrew Bogott: "Yes, probably, so that that code doesn't remain in wait to trap another hopeful developer :)" [puppet] - 10https://gerrit.wikimedia.org/r/257509 (owner: 10Ori.livneh) [01:08:17] (03PS2) 10Reedy: Remove secure.wikimedia.org from apacheo [puppet] - 10https://gerrit.wikimedia.org/r/257510 (https://phabricator.wikimedia.org/T120790) [01:08:35] RoanKattouw: ^ [01:09:07] (03CR) 10Chad: [C: 04-1] "My bookmarks in my browser will break! :(" [puppet] - 10https://gerrit.wikimedia.org/r/257510 (https://phabricator.wikimedia.org/T120790) (owner: 10Reedy) [01:09:18] * Reedy kicks ostriches [01:09:51] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [5000000.0] [01:09:59] (03CR) 10Catrope: [C: 032] Minerva should also be blacklisted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257511 (https://phabricator.wikimedia.org/T101108) (owner: 10Jdlrobson) [01:10:18] (03PS1) 10Reedy: Remove secure.wikimedia.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/257513 (https://phabricator.wikimedia.org/T120790) [01:10:25] Reedy: You shouldn't kick ostriches. They're fast and very much taller than you! 
[01:10:27] (03Merged) 10jenkins-bot: Minerva should also be blacklisted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257511 (https://phabricator.wikimedia.org/T101108) (owner: 10Jdlrobson) [01:10:50] * Reedy wonders what apacheo is [01:10:57] Also a good question! [01:11:16] (03CR) 10Jforrester: "Cool URLs don't change… but secure.wm.org was never cool" [puppet] - 10https://gerrit.wikimedia.org/r/257510 (https://phabricator.wikimedia.org/T120790) (owner: 10Reedy) [01:11:28] (03PS2) 10Ori.livneh: role::labs::openstack::keystone::server: decom redis instance [puppet] - 10https://gerrit.wikimedia.org/r/257509 [01:11:34] andrewbogott: ^ [01:11:47] James_F: But if you were never cool...how do you know what cool was?! [01:11:54] secure.wm.o could've been too cool for you! [01:11:54] :) [01:12:07] ostriches: I worked for the author of the essay. Does that make me cool? [01:12:17] James_F: For the record, I never got invited to those sorts of parties growing up either :( [01:12:17] If it's good enough for a US Defence Contractor... [01:12:22] * James_F grins. [01:13:02] (03PS3) 10Reedy: Remove secure.wikimedia.org from apache [puppet] - 10https://gerrit.wikimedia.org/r/257510 (https://phabricator.wikimedia.org/T120790) [01:13:51] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [5000000.0] [01:14:32] (03CR) 10Andrew Bogott: [C: 031] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/257509 (owner: 10Ori.livneh) [01:15:12] Thanks RoanKattouw - should this patch be live now? 
[01:15:49] Oh sorry [01:15:50] PROBLEM - Restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [01:15:54] I was waiting for it to merge without noticing it already had [01:15:56] Deploying now [01:16:21] PROBLEM - Restbase root url on xenon is CRITICAL: Connection refused [01:16:35] this is me testing ^^ [01:16:47] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Also blacklist Minerva for WPB (duration: 00m 28s) [01:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:17:05] jdlrobson: Done [01:17:08] thanks RoanKattouw ! [01:17:13] andrewbogott: am finishing it up, https://etherpad.wikimedia.org/p/labs-incident-timeline [01:17:20] andrewbogott: finding exact times for the last bits [01:17:25] andrewbogott: but that's the ordering [01:17:44] ACKNOWLEDGEMENT - Restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) gwicke Config testing. [01:17:44] ACKNOWLEDGEMENT - Restbase root url on xenon is CRITICAL: Connection refused gwicke Config testing. 
[01:17:54] YuviPanda: thanks, will look [01:18:41] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [01:19:39] (03PS1) 10Ori.livneh: Fix-up for I81452131b: rewrite into absolute URL [puppet] - 10https://gerrit.wikimedia.org/r/257515 [01:19:52] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for I81452131b: rewrite into absolute URL [puppet] - 10https://gerrit.wikimedia.org/r/257515 (owner: 10Ori.livneh) [01:23:02] andrewbogott: don't have exact timelines for the three restarts but put in close enough numbers [01:23:13] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:23:13] 6operations, 5Patch-For-Review: Remove secure.wikimedia.org - https://phabricator.wikimedia.org/T120790#1861150 (10Legoktm) Uhh, lets not break links please? :/ [01:23:15] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [01:23:16] (03PS3) 10Ori.livneh: role::labs::openstack::keystone::server: decom redis instance [puppet] - 10https://gerrit.wikimedia.org/r/257509 [01:23:18] (03CR) 10Ori.livneh: [C: 032 V: 032] "catalog compiler: https://puppet-compiler.wmflabs.org/1456/" [puppet] - 10https://gerrit.wikimedia.org/r/257509 (owner: 10Ori.livneh) [01:24:04] thanks [01:25:13] 6operations, 5Patch-For-Review: Remove secure.wikimedia.org - https://phabricator.wikimedia.org/T120790#1861152 (10Reedy) What links? Where? See T119274 for amount of incoming links [01:32:43] (03PS1) 10Ori.livneh: ircyall: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257517 [01:33:33] YuviPanda: earlier today (19:30 ish UTC) I definitely verified that new instances still weren’t working, due to failure to communicate with designate [01:33:49] do you know for a fact that new instances were working correctly at any point between ‘everything back to normal’ and then? 
[01:34:18] andrewbogott: no, I had looked at instances in tools project and in contincloud that were in 'scheduling' that became active [01:34:20] and I logged into one [01:34:23] and i was like ok' [01:34:27] (03CR) 10Ori.livneh: [C: 032] ircyall: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257517 (owner: 10Ori.livneh) [01:35:11] hmmmmm [01:40:41] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: Puppet has 1 failures [01:41:03] YuviPanda: so, the timing is so close… it seems like the puppet run at 10:25 /must/ have caused the dns outage. [01:41:13] Do you know how long puppet had been broken before then? [01:42:19] andrewbogott: no, but I think I saw it broke about the time gabriel reported the breakage... [01:42:55] When was that? [01:43:13] yesterday around noon SF time [01:43:16] There were probably alerts, surely there’s an icinga log of things someplace [01:43:20] 22:55 - https://phabricator.wikimedia.org/T120586 gets filed [01:43:24] so the previous day [01:43:51] ok... [01:44:19] I must’ve been bit by a radioactive ellipsis a few weeks ago [01:45:59] (03PS1) 10Ori.livneh: wikimetrics: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257521 [01:48:43] (03PS1) 10Chad: Gerrit: redirect old gitweb project urls to Diffusion instead of Gitblit [puppet] - 10https://gerrit.wikimedia.org/r/257523 [01:49:17] (03CR) 10Ori.livneh: "> is it in the cards to keep using it after the migration is done? -- nope!" [debs/bloomd] - 10https://gerrit.wikimedia.org/r/257167 (https://phabricator.wikimedia.org/T120544) (owner: 10Ori.livneh) [01:50:06] (03CR) 10Ori.livneh: [C: 032 V: 032] Imported Upstream version 0.7.4 [debs/bloomd] - 10https://gerrit.wikimedia.org/r/257167 (https://phabricator.wikimedia.org/T120544) (owner: 10Ori.livneh) [01:56:41] btw, YuviPanda just because reverse dns is dumb for that IP I’d still expect everything to work [01:56:49] you can access that system and such right? 
[01:57:14] andrewbogott: no [01:57:36] andrewbogott: cna't ssh [01:57:37] is it getting the wrong ip? [01:57:38] *can't [01:58:28] andrewbogott: dig is giving me an IP, let me verify it with wikitech [01:59:08] andrewbogott: um, the IP is correct [01:59:12] andrewbogott: but instance isn't reachable [01:59:36] andrewbogott: I wonder if both of those instances have the same IP [01:59:54] andrewbogott: oh, hmm, only one of the instances is inaccessible (tools-worker-05) [01:59:57] another one is accessible... [02:00:00] wat [02:00:22] console access for tools-worker-05 says [02:00:24] tools-worker-05 login: [02:00:41] except I can't ssh in [02:00:43] * YuviPanda is very nervous [02:04:11] YuviPanda: root@labservices1001:/home/andrew/latestdesignate.txt has a dump of everything designate should know about (forward dns, not reverse) [02:04:52] latestdesignatereverse.txt has the reverse entries [02:05:35] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [02:08:34] andrewbogott: hmm so there are two records there [02:08:37] andrewbogott: with the same IP [02:08:53] andrewbogott: well, *4* records [02:09:01] ok, for starters, I think this is a bug but not the cause of the overlap: https://phabricator.wikimedia.org/T120792 [02:09:10] There should be two... [02:09:15] but, yes, 4 is bad [02:09:18] so I can delete the bad ones by hand [02:09:35] PROBLEM - puppet last run on mw2104 is CRITICAL: CRITICAL: puppet fail [02:09:37] I haven’t looked at the dates yet — which came first? 
[02:11:15] YuviPanda: ok, so the CI dns entry leaked [02:11:28] then tools reused the IP, duplicating things [02:11:32] that doesn’t upset/shock me a ton [02:11:44] since CI names are monotonic, I’ll just purge old ones [02:12:29] ok [02:17:42] (03CR) 10Ori.livneh: [C: 032] "Won't apply immediately because wikimetrics is using a self-hosted puppetmaster setup, but when it is merged the configured redis instance" [puppet] - 10https://gerrit.wikimedia.org/r/257521 (owner: 10Ori.livneh) [02:18:52] (03PS1) 10Ori.livneh: Fixup for I6fe4b05b922 [puppet] - 10https://gerrit.wikimedia.org/r/257529 [02:19:05] (03CR) 10Ori.livneh: [C: 032 V: 032] Fixup for I6fe4b05b922 [puppet] - 10https://gerrit.wikimedia.org/r/257529 (owner: 10Ori.livneh) [02:21:35] ori: don't do the tools instance, needs announcements [02:21:45] YuviPanda: yes of course not [02:21:59] i thought of helping with the patch tho [02:22:04] ori: +1 [02:22:08] ori: that'd be lovely :D [02:22:08] where is it actually running? [02:22:12] * YuviPanda is still in several roles [02:22:17] ori: tools-redis-01 and -02 [02:22:21] one is master one is slave [02:22:25] * ori nods [02:22:44] ori: nobody is hitting the slave, so you can practice with that if need be [02:23:15] nah, let's coordinate tomorrow. 
I asked because I have a good workflow now for diffing the config file against the default config that ships with the package [02:23:27] ori: ah, cool [02:23:45] ori: btw, I thought I should show you https://phabricator.wikimedia.org/T120697 [02:24:08] (I don't expect it to get deployed anytime soon, but will end up building it anyway :D) [02:24:15] yeah, i've suggested that before [02:24:23] good idea [02:24:34] once all the current stuff gets stabilized [02:24:46] ori: also, https://github.com/oreillymedia/thebe [02:25:18] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.7) (duration: 09m 56s) [02:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:26:02] ori: I suppose I can easily make it so that it can render it either by shelling out or by calling a service [02:26:13] * YuviPanda shall make this a fun side project at some point [02:26:30] I suspect I should show this to stephen at some point too [02:27:26] https://gist.github.com/atdt/4a334fd1716a8db111ce [02:28:16] RECOVERY - Restbase root url on xenon is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.028 second response time [02:28:42] ori: is that the diff? [02:28:47] yep [02:29:10] yeah, so maxmemory and the renames are important [02:29:17] and so is maxmemory-policy [02:30:05] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [02:32:51] andrewbogott: so... should it all be ok now? [02:33:10] I still can't reach tools-worker-05 [02:33:16] Deleting these records turns out to be incredibly hard [02:34:03] ouch [02:34:08] ok... [02:34:19] andrewbogott: the instance is expendable if needed, fwiw. 
[02:34:23] both of them [02:34:26] designate api v1 won’t delete ‘managed’ records [02:34:33] and v2 will except it doesn’t do anything, just returns success [02:34:40] fun [02:34:47] nah, that IP will just stay there waiting to bite us if we don’t clean these up [02:34:49] *thos [02:34:50] *those [02:34:57] ouch [02:34:59] ok [02:35:31] Hey YuviPanda, do you have a howto move a bot into a container? I'd move csbot and use the opportunity to note if anything is missing or unclear in the doc? [02:35:45] RECOVERY - puppet last run on mw2104 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [02:35:54] Coren: no, I think we shouldn't move more bots for a few more weeks. [02:36:09] Coren: PAWS is putting real users into the system, showing up bugs that I'm fixing. [02:36:29] (03PS1) 10Ori.livneh: toollabs: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257534 [02:36:33] Coren: I'll write up more docs once PAWS is all good [02:36:38] YuviPanda: ^ (for later, don't worry about it now) [02:36:56] Hm, okay. Csbot makes a good perl test case if nothing else. [02:36:58] thanks ori <3 [02:37:08] \o [02:37:23] Coren: yeah. http://imgur.com/gallery/Yodcq [02:38:05] Coren: we'll probably end up just using debian inside the containers [02:38:05] (03CR) 10jenkins-bot: [V: 04-1] toollabs: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257534 (owner: 10Ori.livneh) [02:39:00] YuviPanda: A little more heavy, but more general and flexible. Only this or also one or two specialized ones like pywikibot? [02:39:04] (03PS2) 10Ori.livneh: toollabs: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257534 [02:39:29] Coren: so I think we'll end up with a 'base' debian container [02:39:35] Coren: and all other ones will be based off it [02:39:42] YuviPanda: Sounds sane. 
[02:39:52] Coren: the current paws one is based off jessie as well [02:40:12] YuviPanda: Wave in my direction when you want me to throw csbot at it. It's very self-contained so a pretty good test. [02:40:44] Coren: are you able to test it locally? [02:41:01] Coren: also does it run continuously or on a cron? [02:41:46] YuviPanda: It's continuous, and I see no reason I couldn't test it locally - it was developed that way and ran on my own colo box for a long time. [02:42:04] Coren: ok, so I'd suggest putting it into a Docker container and seeing how it does locally [02:42:19] * Coren nods. [02:42:23] Coren: then I can review it and see how to adapt it, and you can also get some experience writing Dockerfiles [02:42:31] Sounds good! [02:42:33] Coren: once a Docker container is ready, it's about 5 mins to actually deploy it [02:43:01] Coren: does it need DB access? [02:43:15] YuviPanda: Nope; it does API only. [02:43:39] Coren: awesome. so play around with a dockerfile (inherit from debian:jessie!) and get it to run locally? [02:43:47] kk! [02:44:11] Coren: https://github.com/yuvipanda/paws/blob/master/singleuser/Dockerfile is the PAWS pywikibot one, if that's of any help [02:44:21] Coren: poke me if you run into anything! happy/excited to help [02:44:36] It probably will. Are we planning to have a method by which custom debs can be added though? [02:44:36] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [02:45:18] Obviously I can do so locally. [02:45:26] Coren: custom debs as in? [02:45:30] custom built debs? [02:45:42] or just debs you want to apt-get install? [02:45:44] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [02:45:48] Yeah; my bot needs a perl module that requires compilation. 
[02:46:19] Coren: yeah, so you can just do RUN apt-get install --yes [02:46:34] Coren: PAWS too requires python modules that require compilation [02:46:48] Ah, so it's a good example for that too. [02:46:52] Coren: yup! [02:47:26] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [02:47:57] Goodie. That's my volunteer hat, but I'll be doing this during downtime. [02:48:03] Coren: <3 cool! [02:50:37] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [02:53:25] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [02:53:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:04:10] (03PS1) 10Chad: gitblit: use mw/vendor as the icinga check instead of mw/core [puppet] - 10https://gerrit.wikimedia.org/r/257538 [03:04:38] ori: That'll probably actually make a difference so we don't have to hit mw/core every minute-ish. [03:12:26] !log Keeping puppet on restbase production cluster disabled until Marko & Filippo can deploy tomorrow morning. PLEASE DO NOT ENABLE! [03:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:00:03] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Dec 8 04:00:02 UTC 2015 (duration 1h 34m 44s) [04:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:25:42] (03CR) 10GWicke: "Updated Filippo, Mark & James per mail. Staging testing has gone well. 
Puppet is still disabled in preparation for a deploy during Europea" [puppet] - 10https://gerrit.wikimedia.org/r/257408 (owner: 10GWicke) [04:52:56] PROBLEM - cassandra CQL 10.64.32.178:9042 on restbase1008 is CRITICAL: Connection refused [04:55:36] ACKNOWLEDGEMENT - cassandra CQL 10.64.32.178:9042 on restbase1008 is CRITICAL: Connection refused gwicke Finished decommission in preparation for a conversion to a multi-instance setup. [05:12:25] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Puppet has 1 failures [05:14:45] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [05:16:46] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: puppet fail [05:26:34] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [05:36:06] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [05:42:34] RECOVERY - puppet last run on es2003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [05:45:56] !log restbase1008: disable cassandra with `systemctl mask cassandra` [05:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:49:12] gwicke: I didn't know about 'systemctl mask'. That's handy. [05:50:05] PROBLEM - cassandra service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [05:50:20] yeah, that's the "do-as-I-say" variant [05:50:49] "puppet-I-really-don't-want-you-to-start-this" [05:52:26] Heh, yes. 
[05:53:29] downside is that it breaks puppet [05:54:17] or at least makes puppet's brokenness more apparent [05:55:01] it is the world that is broken, puppet is correct a priori [05:55:48] heh, yeah [06:03:01] 6operations: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1861397 (10GWicke) 3NEW [06:03:37] 6operations, 10RESTBase: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1861404 (10GWicke) [06:30:26] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail [06:30:35] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:56] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:16] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:25] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:25] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:14] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:15] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:15] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:15] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:26] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:34] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:35] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:56:54] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:56:55] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:56:56] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently 
enabled, last run 53 seconds ago with 0 failures [06:57:05] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:57:45] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:46] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:46] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:57:46] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:56] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:05] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:05] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:38:25] (03CR) 10Alexandros Kosiaris: [C: 032] gitblit: use mw/vendor as the icinga check instead of mw/core [puppet] - 10https://gerrit.wikimedia.org/r/257538 (owner: 10Chad) [07:42:07] (03CR) 10Alexandros Kosiaris: [C: 032] Update hieradata for trebuchet module [puppet] - 10https://gerrit.wikimedia.org/r/256261 (https://phabricator.wikimedia.org/T119988) (owner: 10Thcipriani) [07:42:13] (03PS2) 10Alexandros Kosiaris: Update hieradata for trebuchet module [puppet] - 10https://gerrit.wikimedia.org/r/256261 (https://phabricator.wikimedia.org/T119988) (owner: 10Thcipriani) [07:45:10] (03CR) 10Alexandros Kosiaris: [C: 032] AQS: Configure Cassandra for AQS in BetaCluster [puppet] - 10https://gerrit.wikimedia.org/r/257406 (https://phabricator.wikimedia.org/T116206) (owner: 10Mobrovac) [07:45:16] (03PS4) 10Alexandros Kosiaris: AQS: Configure Cassandra for AQS in BetaCluster [puppet] - 10https://gerrit.wikimedia.org/r/257406 
(https://phabricator.wikimedia.org/T116206) (owner: 10Mobrovac) [08:03:57] 6operations, 10Wikimedia-Etherpad: Disable old Etherpad installation after migrating content to Etherpad Lite installion - https://phabricator.wikimedia.org/T47312#1861489 (10akosiaris) [08:46:31] 6operations, 10RESTBase: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1861535 (10GWicke) After testing 2.1.12 on cerium for a while I gradually proceeded to roll it out to the eqiad staging hosts, followed by restbase1007. That looked good after an hour, with significantly less compac... [09:11:27] mobrovac: I'm going to start with the restbase1008 reimage, unless you want to look at the rb deploy first? [09:15:13] !log reimage restbase1008 [09:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:19:54] kk godog [09:20:33] godog: i'd like to do the rb deploy after lunch [09:20:50] mobrovac: yup, works for me [09:20:55] cool [09:21:24] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [5000000.0] [09:21:40] (03CR) 10Filippo Giunchedi: [C: 04-1] "thanks Ori, still -1 for debian/rules and $(SCONS) but looks good generally" [debs/bloomd] - 10https://gerrit.wikimedia.org/r/257168 (owner: 10Ori.livneh) [09:21:44] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000000.0] [09:37:28] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [09:40:29] (03PS1) 10ArielGlenn: puppetize dumps monitor as a service [puppet] - 10https://gerrit.wikimedia.org/r/257560 (https://phabricator.wikimedia.org/T110888) [09:42:18] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [09:48:35] !log gallium: removing bunch of :i386 packages which were used for Android build but got removed ( 
https://gerrit.wikimedia.org/r/#/c/183790/3/modules/androidsdk/manifests/dependencies.pp,unified ) [09:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:53:07] (03PS1) 10Filippo Giunchedi: cassandra: add restbase1008-a instance [puppet] - 10https://gerrit.wikimedia.org/r/257562 [09:55:05] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase1008-a instance [puppet] - 10https://gerrit.wikimedia.org/r/257562 (owner: 10Filippo Giunchedi) [09:56:13] !log gallium: upgraded python-diamond and java. Restarting Jenkins [09:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:04:36] !log start cassandra on restbase1008, bootstrapping [10:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:07:18] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [10:07:38] PROBLEM - Restbase root url on restbase1008 is CRITICAL: Connection refused [10:08:08] PROBLEM - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is CRITICAL: Connection refused [10:08:37] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [10:08:53] mobrovac: I've depooled 1008 from pybal, let me know when I can repool it [10:11:54] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1861667 (10fgiunchedi) restbase1008-a bootstrapping ``` Receiving 818 files, 410.89 GB total. Already received 23 files, 608.11 MB total... 
[10:15:31] (03PS1) 10Hashar: contint: monitor Jenkins has a ZMQ publisher [puppet] - 10https://gerrit.wikimedia.org/r/257568 (https://phabricator.wikimedia.org/T120669) [10:16:19] (03CR) 10Hashar: [C: 031 V: 031] "Low timeout cause it should always respond quite fast and there is no need to have nrpe idle for 10 seconds if zmq is dead." [puppet] - 10https://gerrit.wikimedia.org/r/257568 (https://phabricator.wikimedia.org/T120669) (owner: 10Hashar) [10:24:01] Maybe it could be useful to add Cenarium to the gerrit whitelist, he makes useful changes: https://gerrit.wikimedia.org/r/#/c/257559/ and is sysop at enwiki, so what do you think? [10:25:53] 6operations, 10RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1861693 (10fgiunchedi) I don't see anything regarding this unscheduled upgrade in SAL, why is that? [10:56:23] 6operations, 6Analytics-Backlog, 6Discovery, 6WMDE-Analytics-Engineering, and 3 others: Add firewall exception to get to wdqs*:8888 from analytics cluster - https://phabricator.wikimedia.org/T120010#1861764 (10akosiaris) 5Open>3Resolved a:3akosiaris ACLs updated. Just tested it from `stat1003` and it... [11:02:54] !log upgrade diamond to 3.5-5 in codfw [11:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:06:10] (03CR) 10Addshore: [C: 04-1] "Per little chat with daniel this should be the size as recorded in the db, also what is displayed on LongPages." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257318 (owner: 10JanZerebecki) [11:14:38] 6operations, 6Analytics-Backlog, 6Discovery, 6WMDE-Analytics-Engineering, and 3 others: Add firewall exception to get to wdqs*:8888 from analytics cluster - https://phabricator.wikimedia.org/T120010#1861805 (10Addshore) Many thanks! 
:) [11:39:59] 7Puppet, 6operations, 6Labs, 10Labs-Infrastructure: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1861849 (10hashar) 3NEW [12:04:48] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [5000000.0] [12:18:38] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [12:39:44] (03PS3) 10Alexandros Kosiaris: Update WikimediaTemplates to support 5.0.1 [software/otrs] - 10https://gerrit.wikimedia.org/r/248916 [12:39:46] (03PS1) 10Alexandros Kosiaris: Drop the WikimediaEnableMultilines packages [software/otrs] - 10https://gerrit.wikimedia.org/r/257587 [13:00:10] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1861962 (10mark) >>! In T118154#1792791, @ArielGlenn wrote: > Initial thoughts: we have dataset1001 with ms1001 as a fallbac... [13:02:28] 6operations, 10Traffic, 5Patch-For-Review, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1861965 (10mark) >>! In T113151#1858288, @Joe wrote: > So testing this with a single apache in a pool, I can see that AcceptFilter http none... [13:02:39] godog: i'd be ready to do a deploy of RB now [13:03:06] godog: what's the status of RB on rb1008? it's got the tin version of the code? [13:25:44] hi, i've got a renaming problem: [13:25:50] https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Archives_cantonales_jurassiennes [13:26:24] task was started Nov 26th - hasn't completed correctly [13:27:11] initial logged error was [90e25e28] 2015-11-26 09:54:29: Fatal exception of type "JobQueueError" [13:28:43] i can't restart it myself, can a database admin help?
[13:30:11] (03PS2) 10coren: Add stable.toolserver.org to legacy redirects [puppet] - 10https://gerrit.wikimedia.org/r/257425 (https://phabricator.wikimedia.org/T120526) [13:30:31] MBq: that is in prod right ? [13:31:11] MBq: If noone is here, file a task at phabricator ;) (I don't know if a DB admin is here) [13:31:26] hahar: ? [13:31:49] akosiaris: utils/new_wmf_service.py is awesome. [13:31:51] hashar: ? [13:32:08] hashar: yeah it is [13:32:10] mobrovac: ^ But why it committed change :/ [13:32:20] MBq: hashar asks, if this is at production or beta cluster [13:32:58] found it ! [13:33:04] the stacktrace on the server side [13:33:08] kart_: what do you mean it committed the change? [13:33:17] MBq: have you filled a task in phabricator by any chance? [13:33:24] mobrovac: git commit. [13:33:33] hashar: not yet, should i? [13:33:33] ah https://phabricator.wikimedia.org/T119696 [13:33:41] mobrovac: I thought it will just generate diff. [13:33:56] so read script before running :D [13:33:59] (03CR) 10coren: [C: 032] "Well-contained change." [puppet] - 10https://gerrit.wikimedia.org/r/257425 (https://phabricator.wikimedia.org/T120526) (owner: 10coren) [13:34:05] kart_: ok, but you can do more changes and then use git add ... && git commit --amend [13:34:13] MBq: you can CC to https://phabricator.wikimedia.org/T119696 , I am going to slightly update it [13:34:20] kart_: you end up with the same result [13:34:35] mobrovac: yep. I was bit scared. 
[13:34:40] :) [13:35:40] hashar: thx a lot, sorry for doubling that request, didn't know of it [13:36:27] (03CR) 10BBlack: [C: 031] Improve handling of disableImages cookie [puppet] - 10https://gerrit.wikimedia.org/r/257491 (https://phabricator.wikimedia.org/T120151) (owner: 10Ori.livneh) [13:37:30] MBq: seems the issue in the code has been fixed [13:37:30] (03CR) 10BBlack: [C: 031] Improve handling of mobile variant cookies [puppet] - 10https://gerrit.wikimedia.org/r/257496 (https://phabricator.wikimedia.org/T119798) (owner: 10Ori.livneh) [13:37:45] MBq: no idea how to resume the work though :( [13:41:22] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1862041 (10ArielGlenn) I wil happily replace 3 if we want to do that now. I figured I would wait to replace snapshot1001 ti... [13:41:31] MBq: not much I can do sorry :-( [13:43:01] hashar: anyway - thanks for trying [13:45:56] (03CR) 10BBlack: [C: 04-1] "The other commit does more better :)" [puppet] - 10https://gerrit.wikimedia.org/r/257491 (https://phabricator.wikimedia.org/T120151) (owner: 10Ori.livneh) [13:49:34] (03PS2) 10BBlack: varnish: don't store hit-for-pass objects for logged-in users [puppet] - 10https://gerrit.wikimedia.org/r/257382 (owner: 10Faidon Liambotis) [13:50:04] (03CR) 10BBlack: [C: 031] varnish: don't store hit-for-pass objects for logged-in users [puppet] - 10https://gerrit.wikimedia.org/r/257382 (owner: 10Faidon Liambotis) [13:53:37] (03PS1) 10BBlack: varnish: remove hash_ignore_busy on pass [puppet] - 10https://gerrit.wikimedia.org/r/257591 [14:00:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [14:03:08] mobrovac: back, sorry lunch took longer than expected [14:03:18] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical 
threshold [1000.0] [14:03:27] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:03:52] hehe np godog :) [14:04:01] godog: so, ok for me to start the deployment? [14:04:31] mobrovac: yup, 1008 is still depooled in pybal and bootstrapping cassandra [14:04:39] kk gr8 [14:05:18] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:05:28] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:08:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:09:09] PROBLEM - puppet last run on restbase1001 is CRITICAL: CRITICAL: Puppet last ran 13 hours ago [14:09:19] !log restbase running puppet on rb1001 [14:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:11:20] RECOVERY - puppet last run on restbase1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:42] (03PS1) 10BBlack: varnish: move IMS check above common pass-checks [puppet] - 10https://gerrit.wikimedia.org/r/257595 [14:15:44] (03PS1) 10BBlack: varnish: fold stash_cookie into evaluate_cookie [puppet] - 10https://gerrit.wikimedia.org/r/257596 [14:16:54] !log restbase canary deploy of f47405a onto restbase1001 [14:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:59] !log upgrading openssl package on caches [14:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:20:11] 6operations, 10RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1862078 (10Eevans) [14:24:26] (03CR) 10Krinkle: [C: 031] varnish: move IMS check above common pass-checks [puppet] - 10https://gerrit.wikimedia.org/r/257595 (owner: 10BBlack) [14:32:57] 6operations, 10RESTBase-Cassandra: Update to Cassandra 
2.1.12 - https://phabricator.wikimedia.org/T120803#1862102 (10Eevans) >>! In T120803#1861535, @GWicke wrote: > After testing 2.1.12 on cerium for a while I gradually proceeded to roll it out to the eqiad staging hosts, followed by restbase1007. That looked... [14:35:42] (03PS1) 10Hashar: nodepool: drop domain from instance hostnames [puppet] - 10https://gerrit.wikimedia.org/r/257597 (https://phabricator.wikimedia.org/T120792) [14:39:07] (03CR) 10Andrew Bogott: [C: 032] nodepool: drop domain from instance hostnames [puppet] - 10https://gerrit.wikimedia.org/r/257597 (https://phabricator.wikimedia.org/T120792) (owner: 10Hashar) [14:39:25] (03PS1) 10Faidon Liambotis: librenms: add new cronjobs, remove stale settings [puppet] - 10https://gerrit.wikimedia.org/r/257598 [14:39:41] (03PS2) 10Faidon Liambotis: librenms: add new cronjobs, remove stale settings [puppet] - 10https://gerrit.wikimedia.org/r/257598 [14:40:43] (03CR) 10Faidon Liambotis: [C: 032] librenms: add new cronjobs, remove stale settings [puppet] - 10https://gerrit.wikimedia.org/r/257598 (owner: 10Faidon Liambotis) [14:43:22] (03PS1) 10Ottomata: Temporarily disable https etcd in labs eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/257599 [14:45:14] (03CR) 10Ottomata: [C: 032] Temporarily disable https etcd in labs eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/257599 (owner: 10Ottomata) [14:46:42] (03PS1) 10BBlack: sslcert: newly-regenerated dhparam contents [puppet] - 10https://gerrit.wikimedia.org/r/257601 [14:48:19] (03CR) 10BBlack: [C: 032] sslcert: newly-regenerated dhparam contents [puppet] - 10https://gerrit.wikimedia.org/r/257601 (owner: 10BBlack) [14:51:11] (03Abandoned) 10BBlack: varnish: fold stash_cookie into evaluate_cookie [puppet] - 10https://gerrit.wikimedia.org/r/257596 (owner: 10BBlack) [14:52:41] (03PS3) 10BBlack: Improve handling of mobile variant cookies [puppet] - 10https://gerrit.wikimedia.org/r/257496 (https://phabricator.wikimedia.org/T119798) (owner: 10Ori.livneh) 
[14:53:09] 6operations, 10RESTBase-Cassandra: Update to Cassandra 2.1.12 - https://phabricator.wikimedia.org/T120803#1862190 (10Eevans) It should also be pointed out that are now conducting range movements (a bootstrap), in a mixed version environment, a cardinal sin of Cassandra ops. ``` eevans@agenor:~/dev/src/git/med... [14:53:55] (03CR) 10BBlack: [C: 032] Improve handling of mobile variant cookies [puppet] - 10https://gerrit.wikimedia.org/r/257496 (https://phabricator.wikimedia.org/T119798) (owner: 10Ori.livneh) [14:55:12] (03PS3) 10BBlack: varnish: don't store hit-for-pass objects for logged-in users [puppet] - 10https://gerrit.wikimedia.org/r/257382 (owner: 10Faidon Liambotis) [14:56:31] bblack: so a possibility we did not think of yesterday is [14:56:47] mediawiki actually emitting cacheable objects for logged-in users [14:56:50] (03CR) 10BBlack: [C: 032] varnish: don't store hit-for-pass objects for logged-in users [puppet] - 10https://gerrit.wikimedia.org/r/257382 (owner: 10Faidon Liambotis) [14:57:06] (varied on by Cookie, so cacheable per logged-in user) [14:57:43] paravoid: either way they weren't getting any true shared caching out of it, and caching in varnish for one user can't be worth much when the browser caches too [14:58:11] something to improve, but I don't think we're going to regress over it [14:58:58] (we do have exceptions in place for /static/ and load.php for just that reason already, which remain in effect, btw) [15:02:59] 7Puppet, 6operations, 6Labs, 10Labs-Infrastructure: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1862208 (10Andrew) We moved the default classes out of the ldap node def and into hiera; this is probably a side-effect of that.... 
[15:03:07] (03PS2) 10BBlack: varnish: remove hash_ignore_busy on pass [puppet] - 10https://gerrit.wikimedia.org/r/257591 [15:04:36] (03CR) 10BBlack: [C: 032] varnish: remove hash_ignore_busy on pass [puppet] - 10https://gerrit.wikimedia.org/r/257591 (owner: 10BBlack) [15:05:38] (03PS2) 10BBlack: varnish: move IMS check above common pass-checks [puppet] - 10https://gerrit.wikimedia.org/r/257595 [15:07:07] (03CR) 10BBlack: [C: 032] varnish: move IMS check above common pass-checks [puppet] - 10https://gerrit.wikimedia.org/r/257595 (owner: 10BBlack) [15:09:40] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1862222 (10Eevans) [15:10:34] !log restbase canary deploy of 28bf071 onto restbase1001 [15:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:37] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1862228 (10ArielGlenn) At least one of the 'minion doesn't see incoming test ping after lapse of some hours' issue is apparently a problem with the master config option "ping_on_rotate" which l... [15:15:46] (03PS1) 10Hashar: ldap-yaml-enc: tolerate empty 'puppetClass' [puppet] - 10https://gerrit.wikimedia.org/r/257606 (https://phabricator.wikimedia.org/T120817) [15:16:04] (03PS1) 10DCausse: Add initial rescore profiles for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257607 (https://phabricator.wikimedia.org/T110648) [15:16:36] (03CR) 10Hashar: "I have no idea what I am doing. 
deployment-tin.deployment-prep.eqiad.wmflabs is broken on beta cluster, so maybe cherry pick this patch o" [puppet] - 10https://gerrit.wikimedia.org/r/257606 (https://phabricator.wikimedia.org/T120817) (owner: 10Hashar) [15:17:12] !log restbase start deploy of 28bf071 [15:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:45] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1862248 (10ArielGlenn) [15:20:59] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-ulsfo:ae2BR [15:21:30] (03PS2) 10coren: Replicas: include a restricted watchlist view [software] - 10https://gerrit.wikimedia.org/r/225218 (https://phabricator.wikimedia.org/T59617) [15:21:58] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:22:02] !log restbase enable puppet in prod [15:22:06] known ^^, on it [15:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:22] paravoid: is that you? 
^ [15:22:25] (03CR) 10coren: [C: 032] Replicas: include a restricted watchlist view [software] - 10https://gerrit.wikimedia.org/r/225218 (https://phabricator.wikimedia.org/T59617) (owner: 10coren) [15:22:53] (03PS2) 10DCausse: Add initial rescore profiles for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257607 (https://phabricator.wikimedia.org/T110648) [15:23:18] PROBLEM - Restbase root url on restbase1002 is CRITICAL: Connection refused [15:23:24] (03CR) 10Andrew Bogott: [C: 031] ldap-yaml-enc: tolerate empty 'puppetClass' [puppet] - 10https://gerrit.wikimedia.org/r/257606 (https://phabricator.wikimedia.org/T120817) (owner: 10Hashar) [15:23:45] mark: possibly, looking [15:24:59] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 [15:25:01] mark: yes, it was me and it's fixed [15:25:08] k [15:25:48] PROBLEM - puppet last run on restbase2004 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago [15:26:44] (03PS2) 10coren: maintain-replicas: match changed layout of mediawiki-config [software] - 10https://gerrit.wikimedia.org/r/249127 [15:27:47] RECOVERY - puppet last run on restbase2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:27:49] (03CR) 10JanZerebecki: "That size is the byte size of the PHP serializing the PHP data model. See https://phabricator.wikimedia.org/T120834 . So it is a different" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257318 (owner: 10JanZerebecki) [15:28:23] (03CR) 10coren: [C: 032 V: 032] "Tested and works." 
[software] - 10https://gerrit.wikimedia.org/r/249127 (owner: 10coren) [15:29:18] RECOVERY - Restbase root url on restbase1002 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.024 second response time [15:29:59] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [15:30:15] (03PS3) 10coren: Replicas: include a restricted watchlist view [software] - 10https://gerrit.wikimedia.org/r/225218 (https://phabricator.wikimedia.org/T59617) [15:31:28] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:31:53] (03CR) 10Rush: [C: 04-1] "this doesn't actually work out and I don't think is right also" [puppet] - 10https://gerrit.wikimedia.org/r/257606 (https://phabricator.wikimedia.org/T120817) (owner: 10Hashar) [15:32:53] what's going on with gerrit? [15:33:06] i'm in the middle of a deploy and the nodes can't fetch from gerrit [15:33:29] (03CR) 10coren: [V: 032] "Works (though not useful atm since the underlying table is not replicated)." [software] - 10https://gerrit.wikimedia.org/r/225218 (https://phabricator.wikimedia.org/T59617) (owner: 10coren) [15:33:36] fatal: remote error: Git repository not found [15:33:43] see [15:33:45] your repo is gone :d [15:33:56] maybe its really fetching from tin / mira? [15:34:03] no no [15:35:13] wth? https://gerrit.wikimedia.org/r/mediawiki/services/restbase/deploy/modules/restbase [15:35:18] waat? [15:35:19] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:35:37] told you [15:35:42] it is gone! ;-:} [15:36:54] mobrovac: I have no idea what that URL is really [15:37:00] me neither [15:37:13] from where does it come? [15:37:26] ohh [15:37:31] (03CR) 10Andrew Bogott: "It's harmless at worst, isn't it? Or do we want this to error out if puppetClass is undefined?" 
[puppet] - 10https://gerrit.wikimedia.org/r/257606 (https://phabricator.wikimedia.org/T120817) (owner: 10Hashar) [15:37:42] maybe that is the Phabricator URL redirect for diffusion [15:37:55] eg: https://phabricator.wikimedia.org/r/revision/mediawiki/services/restbase/deploy ---> https://phabricator.wikimedia.org/rGRBD [15:38:00] (that one doesn't work though) [15:39:48] (03PS1) 10Rush: labs: puppetmaster self should still apply 'role::labs::instance' [puppet] - 10https://gerrit.wikimedia.org/r/257612 (https://phabricator.wikimedia.org/T120817) [15:41:24] (03CR) 10Rush: "puppet vomits on no definition on empty array atm, and also I think this is a consequence of the migration from ldap to manifest for basic" [puppet] - 10https://gerrit.wikimedia.org/r/257606 (https://phabricator.wikimedia.org/T120817) (owner: 10Hashar) [15:45:41] 7Puppet, 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1862336 (10chasemp) p:5Triage>3High [15:45:49] (03CR) 10Paladox: "Oh but it doesen't work. How can this be fixed because ; is causing only the project to show not revision or branch or file to show." [puppet] - 10https://gerrit.wikimedia.org/r/257193 (owner: 10Paladox) [15:45:58] (03PS6) 10Paladox: Fix redirections in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/257193 [15:48:26] (03CR) 10Paladox: "@20after4 ok but the links aren't showing in gerrit. 
They are showing like https://phabricator.wikimedia.org/r/revision/operations/puppet" [puppet] - 10https://gerrit.wikimedia.org/r/257193 (owner: 10Paladox) [15:48:36] (03Abandoned) 10Hashar: ldap-yaml-enc: tolerate empty 'puppetClass' [puppet] - 10https://gerrit.wikimedia.org/r/257606 (https://phabricator.wikimedia.org/T120817) (owner: 10Hashar) [15:48:55] 7Puppet, 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1862351 (10hashar) a:3chasemp [15:50:06] (03CR) 10Hashar: "Nit: use dict.get() ? :}" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/257612 (https://phabricator.wikimedia.org/T120817) (owner: 10Rush) [15:50:33] (03CR) 10Paladox: "The problem is also the same for phabricator because of ; it isen't creating the full redirect because it was then https://phabricator.wik" [puppet] - 10https://gerrit.wikimedia.org/r/257193 (owner: 10Paladox) [15:52:17] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Seems generally correct, there are a couple classes that seem misplaced to me." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [15:52:35] Hi anomie ostriches thcipriani marktraceur Krenair ! Dunno who's doing the SWAT this morning... I have a CentralNotice patch that I'm about to add... :) [15:52:38] RECOVERY - Restbase root url on restbase1008 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.009 second response time [15:52:39] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [15:53:18] <_joe_> ottomata: so, sorry for being so laaate [15:53:27] <_joe_> but the patch seems generally ok [15:54:10] k thanks [15:56:04] Hi anomie ostriches thcipriani marktraceur Krenair K just added the patch. I linked the core submodule update (which I didn't +2 yet, though) [15:56:41] AndyRussG: yup. sounds right. 
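For reference, the `dict.get()` nit raised in the review of https://gerrit.wikimedia.org/r/257612 can be sketched roughly as below. This is a minimal illustration, not the code from ldap-yaml-enc.py: the helper name `puppet_classes` and the sample data are hypothetical, and only the `host_info['puppetClass']` lookup comes from the task title. It avoids the KeyError only; whether an empty class list is acceptable to puppet downstream is the separate concern raised in the same review thread.

```python
def puppet_classes(host_info):
    """Return the host's puppet classes, or an empty list if none are set.

    Hypothetical helper for illustration; not taken from ldap-yaml-enc.py.
    """
    # host_info['puppetClass'] raises KeyError when the key is absent.
    # dict.get() returns the supplied default instead of raising.
    return host_info.get('puppetClass', [])


# A host entry without the key no longer raises.
print(puppet_classes({}))
# A host entry with the key is returned unchanged.
print(puppet_classes({'puppetClass': ['role::labs::instance']}))
```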
[15:56:50] I thought we didn't need submodule updates anymore? [15:57:02] marktraceur: CentralNotice is speeeeeeecial 8p [15:57:06] (03CR) 10Addshore: [C: 031] "LGTM as is, and then we can work from this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257318 (owner: 10JanZerebecki) [15:57:08] Sounds right [15:57:21] Heh we should actually fix it one day...... [15:57:43] It's just got some minor JS tweaks, no i18n or anything [15:59:35] (03CR) 10Rush: labs: puppetmaster self should still apply 'role::labs::instance' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/257612 (https://phabricator.wikimedia.org/T120817) (owner: 10Rush) [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151208T1600). Please do the needful. [16:00:04] kart_ jzerebecki AndyRussG: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:20] Yessir jouncebot [16:00:29] here jouncebot [16:00:53] I can SWAT. jzerebecki ping for SWAT. [16:01:04] * jzerebecki pats jouncebot [16:01:09] pong [16:01:49] !log restbase end deploy of 28bf071 [16:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:28] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: puppet fail [16:02:30] Hi. thcipriani: can I add a quick patch to fix a namespace typo to the current SWAT? [16:02:32] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255102 (https://phabricator.wikimedia.org/T111562) (owner: 10KartikMistry) [16:02:49] Dereckson: sure. [16:03:12] (03Merged) 10jenkins-bot: CX: Use ContentTranslationRESTBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255102 (https://phabricator.wikimedia.org/T111562) (owner: 10KartikMistry) [16:03:14] Okay, added. 
[16:04:50] godog: k, finally done the deploy, all good, you can repool rb1008 at your leisure :) [16:06:35] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: CX: Use ContentTranslationRESTBase [[gerrit:255102]] (duration: 00m 30s) [16:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:40] ^ kart_ check please [16:06:55] 7Puppet, 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1862387 (10Andrew) I think the ENC is broken. It should be merging custom hiera settings with the default r... [16:07:25] mobrovac: ooh sweet [16:07:48] !log repool restbase1008 [16:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:08:03] thcipriani: checking. [16:09:59] twentyafterfour thcipriani : am here if you need any help with those patches that need to ride the train :) [16:10:24] thcipriani: publishing fine. So, good. [16:10:30] kart_: thanks. [16:10:55] jdlrobson: ok [16:11:02] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257318 (owner: 10JanZerebecki) [16:11:46] (03Merged) 10jenkins-bot: Wikidata: set maxSerializedEntitySize to 2500 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257318 (owner: 10JanZerebecki) [16:11:52] jdlrobson: twentyafterfour we can merge the change to tools-release pre-branch cut, then I'll merge in the wmf-config changes pre-scaping to the test wiki, sound right? [16:12:49] Sounds great. So specifically https://gerrit.wikimedia.org/r/257432 and https://gerrit.wikimedia.org/r/257434 ? 
[16:12:54] (03CR) 10Rush: "going to wait and talk to yuvi a bit about this as it may be ENC is just plain broken in other ways" [puppet] - 10https://gerrit.wikimedia.org/r/257612 (https://phabricator.wikimedia.org/T120817) (owner: 10Rush) [16:13:10] (i think the patch to enable it can wait - i want to get the all clear from my PM) [16:13:43] !log thcipriani@tin Synchronized wmf-config/Wikibase.php: SWAT: Wikidata: set maxSerializedEntitySize to 2500 [[gerrit:257318]] (duration: 00m 27s) [16:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:51] ^ jzerebecki check please [16:14:59] jdlrobson: yup those patches specifically. Looks like twentyafterfour is ahead of me on the tools-release one. [16:15:11] thcipriani: looks good. thx [16:15:16] thcipriani: right [16:15:18] jzerebecki: thank you. [16:16:07] AndyRussG: I'm going to come back to your patch, Dereckson 's patch should be a fairly quick one to get out and check. [16:16:25] thcipriani: sure! thx :) [16:16:43] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257487 (owner: 10Dereckson) [16:17:26] thcipriani: :) [16:17:43] (03Merged) 10jenkins-bot: Fix typo in namespaces configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257487 (owner: 10Dereckson) [16:18:32] (03PS4) 10Andrew Bogott: Switch everything to the new openldap ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/256346 (https://phabricator.wikimedia.org/T101299) [16:18:34] (03PS1) 10Andrew Bogott: Move labs instances to the new ldap servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/257622 [16:19:14] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to comands - https://phabricator.wikimedia.org/T120831#1862409 (10mark) [16:19:21] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Fix typo in namespaces configuration [[gerrit:257487]] (duration: 00m 28s) [16:19:25] ^ Dereckson check please [16:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:59] Comma disappeared, works. [16:20:02] Thanks for the deploy. [16:20:44] Dereckson: thanks for checking! [16:20:55] AndyRussG: kk, you're up. [16:21:06] thcipriani: cool, all set :) [16:21:34] (Did I mention I didn't +2 the core patch yet?) [16:21:46] Ah K I see u did... [16:21:47] AndyRussG: yup, just did it. [16:22:00] kewl [16:23:08] (03CR) 10Paladox: "@20after4 or @dzahn could we give this patch a try since the phabricator redirect has nothing to do with how it works in gerrit since gerr" [puppet] - 10https://gerrit.wikimedia.org/r/257193 (owner: 10Paladox) [16:23:22] moritzm: Out of an excess of caution, I split out the part of my ldap patch that updates labs instances and broke the test plan into two phases [16:28:27] RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:30:18] (03PS1) 10Paladox: Fix viewing raw in phabricator on a pc [puppet] - 10https://gerrit.wikimedia.org/r/257629 [16:31:43] !log thcipriani@tin Synchronized php-1.27.0-wmf.7/extensions/CentralNotice: SWAT: Update CentralNotice [[gerrit:257615]] (duration: 00m 28s) [16:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:53] ^ AndyRussG check please [16:33:13] (03PS1) 10BBlack: varnish: cache /api/rest_v1/ in backends [puppet] - 10https://gerrit.wikimedia.org/r/257630 (https://phabricator.wikimedia.org/T96847) [16:33:15] (03PS1) 10BBlack: varnish: 
security_audit backend explicitly tier-one-only [puppet] - 10https://gerrit.wikimedia.org/r/257631 (https://phabricator.wikimedia.org/T96847) [16:33:17] (03PS1) 10BBlack: varnish: return (pass) for CAL URLs [puppet] - 10https://gerrit.wikimedia.org/r/257632 (https://phabricator.wikimedia.org/T96847) [16:33:19] (03PS1) 10BBlack: cache_upload: remove unused "rendering" backend [puppet] - 10https://gerrit.wikimedia.org/r/257633 (https://phabricator.wikimedia.org/T96847) [16:33:21] (03PS1) 10BBlack: add backend_random to maps and upload clusters in conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/257634 (https://phabricator.wikimedia.org/T96847) [16:33:23] (03PS1) 10BBlack: add backend_random to maps and upload clusters config [puppet] - 10https://gerrit.wikimedia.org/r/257635 (https://phabricator.wikimedia.org/T96847) [16:33:25] (03PS1) 10BBlack: varnish: always use backend_random for pass/hfp [puppet] - 10https://gerrit.wikimedia.org/r/257636 (https://phabricator.wikimedia.org/T96847) [16:34:36] thcipriani: K (still waiting for RL cache rollover) [16:34:44] kk [16:34:46] (03CR) 10Muehlenhoff: [C: 031] Switch everything to the new openldap ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/256346 (https://phabricator.wikimedia.org/T101299) (owner: 10Andrew Bogott) [16:36:19] PROBLEM - puppet last run on pc1001 is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:47] thcipriani: looks good so far! [16:39:57] AndyRussG: that's good. [16:41:11] (03PS2) 10Andrew Bogott: Move labs instances to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/257622 [16:43:06] (03CR) 10Muehlenhoff: [C: 031] Move labs instances to the new ldap servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/257622 (owner: 10Andrew Bogott) [16:49:12] (03PS1) 10Papaul: Add production DNS entries for auth2001 Bug:T120263 [dns] - 10https://gerrit.wikimedia.org/r/257637 (https://phabricator.wikimedia.org/T120263) [16:50:34] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1862495 (10Papaul) auth2001 10.193.1.22 port ge-5/0/6 [16:53:20] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1862507 (10Papaul) [17:00:04] andrewbogott moritzm: Dear anthropoid, the time has come. Please deploy Labs ldap maintenance window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151208T1700). [17:00:28] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave] [17:00:54] !log changing opendj writability-mode to ‘internal-only’ on nembus and neptunium [17:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:01:18] yay having 3 waves [17:02:13] moritzm: ok, read-only change seems to have worked [17:02:28] RECOVERY - puppet last run on pc1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:02:45] k, I'll stop opendj on nembus and run the export [17:02:56] andrewbogott: let me know when it's test time? [17:03:24] chasemp: at least 20 mins from now, the import on seaborgium takes 11 minutes alone [17:03:25] chasemp, Coren, hashar, ostriches: We have to do a dump and import which will take 20 minutes or so. I will ping when we’re ready to test the new setup. 
[17:03:35] kk [17:03:35] {{ack}}, I'm about [17:03:42] andrewbogott: kk [17:04:17] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1862564 (10RobH) [17:04:28] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1849646 (10RobH) [17:05:05] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1849646 (10RobH) @Papaul: Switch port is updated (proper description, enabled, vlan set to private1-a-codfw). You should be good to continue with install_server updates and install the OS. [17:07:14] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1862572 (10RobH) Also when you update netboot.cfg for the partitioning recipie, I'd suggest using raid1-lvm-ext4-srv.cfg. # Automatic software RAID 1 with LVM partitioning # # * two disks, sda & sdb # * layout: #... [17:08:32] thcipriani: k, I'm done testing. All good :) [17:08:48] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1862579 (10Papaul) @Robh I was just about to ask that question. Thanks for the update [17:09:28] (03CR) 10RobH: [C: 032] Add production DNS entries for auth2001 Bug:T120263 [dns] - 10https://gerrit.wikimedia.org/r/257637 (https://phabricator.wikimedia.org/T120263) (owner: 10Papaul) [17:10:06] papaul: i just merged your dns change for auth2001 production dns live so you should be set for install_server updates [17:10:19] robh: thanks [17:10:46] welcome, thanks for handling it. 
Also there is an order from Dell for 8 misc systems inbound (shipped from Dell today) [17:11:03] so I'll update that task and assign to you for receiving, I have NOT put in an inbound shipment ticket with cyrusone for it though [17:11:09] now that you are back i'll let you handle that stuff again [17:11:16] (i only did the one shipment while you were out) [17:11:28] well, two, whatever, you know about those already =] [17:11:28] robh: i think we will get that by next monday [17:11:45] yep. we really need to use two of them though so I'll have your racking tasks in before they arrive there [17:12:03] robh: no problem [17:12:36] rephrase: I have tasks pending to use two of them, and those tasks ideally are resolved before the code freeze goes back into effect on dec 21st. [17:12:47] so i'm quite happy these shipped so fast =] [17:13:38] !log stopping slapd on seaborgium/serpens to refresh LDAP databases with latest export from nembus [17:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:14:27] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1862587 (10BBlack) This was merged in the SWAT that happened around 00:00-01:00 UTC Dec 8 (so about 17 hours before this post)... [17:14:53] godog: so no one gave us any puppetswat patches [17:15:00] that means everyone is thrilled with patch review =] [17:15:09] (i'm taking it this way because it's optimistic) [17:15:58] So in my last 7 puppet swat windows, only 3 have had patches =P [17:16:51] imports are running, will take 10-12 mins on seaborgium (serpens is already completed) (SAS vs. SATA disks for the ganeti VMs) [17:17:02] * twentyafterfour doesn't even know the process to get something in puppet swat. 
[17:17:14] twentyafterfour: add it to the deployments page [17:19:04] yep, thats pretty much it [17:19:16] i noticed the past deployment page updates didnt link to the puppet swat page but now they do as well [17:19:24] so at least folks can click through and see info on it [17:19:41] robh: heheh indeed! [17:19:49] but that page is outdated, im updating now [17:20:02] doesnt quite state 'just add to deployments' [17:21:24] (03PS1) 10Papaul: Add auth2001 partitioning entries Bug:T120263 [puppet] - 10https://gerrit.wikimedia.org/r/257643 (https://phabricator.wikimedia.org/T120263) [17:21:50] (03PS1) 10Ottomata: Update eventlogging topic parameter to use new format [puppet] - 10https://gerrit.wikimedia.org/r/257644 [17:22:08] (03PS3) 10Ori.livneh: import debian directory [debs/bloomd] - 10https://gerrit.wikimedia.org/r/257168 [17:26:34] (03CR) 10Ottomata: [C: 032] Update eventlogging topic parameter to use new format [puppet] - 10https://gerrit.wikimedia.org/r/257644 (owner: 10Ottomata) [17:26:48] 6operations, 10hardware-requests: EQIAD/CODFW: 2 hardware access request for monitoring - https://phabricator.wikimedia.org/T120842#1862601 (10akosiaris) 3NEW [17:27:15] andrewbogott: had to grab kid back home [17:27:36] hashar: nothing much has happened, still importing [17:28:48] import on seaborgium is completed, will restart the slapds and double-check that ldap replication works fine [17:29:28] (03PS5) 10Andrew Bogott: Switch everything to the new openldap ldap servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/256346 (https://phabricator.wikimedia.org/T101299) [17:29:43] (03PS2) 10Papaul: Add auth2001 partitioning entries Bug:T120263 [puppet] - 10https://gerrit.wikimedia.org/r/257643 (https://phabricator.wikimedia.org/T120263) [17:30:00] maybe 5 minutes, then we can merge/test [17:33:48] (03PS1) 10Cmjohnson: Removing dns entries for decom'd host mw1041 [dns] - 10https://gerrit.wikimedia.org/r/257646 [17:34:59] !log deleted centralauth.localuser row for User:ThanduxoloKennethTwani5092/enwiki because the account doesn't exist (T120655) [17:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:38:39] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/server-side-events-log consumer/mysql-m4-master-03 consumer/mysql-m4-master-02 consumer/mysql-m4-master-01 consumer/mysql-m4-master-00 consumer/client-side-events-log consumer/all-events-log processor/server-side-0 processor/client-side-11 processor/client-side-10 processor/client-side-09 processor/client-side [17:38:45] that's me [17:38:46] will ack [17:39:27] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1862634 (10Papaul) [17:39:48] ACKNOWLEDGEMENT - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/server-side-events-log consumer/mysql-m4-master-03 consumer/mysql-m4-master-02 consumer/mysql-m4-master-01 consumer/mysql-m4-master-00 consumer/client-side-events-log consumer/all-events-log processor/server-side-0 processor/client-side-11 processor/client-side-10 processor/client-side-09 processor/cli [17:39:49] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for decom'd host mw1041 [dns] - 10https://gerrit.wikimedia.org/r/257646 (owner: 10Cmjohnson) [17:39:59] ori: nuria, off the top of your head, how do you start individual eventlogging processes via 
this fancy upstart template stuff [17:39:59] ? [17:40:54] (03CR) 10Filippo Giunchedi: [C: 04-1] "some comments, the most important are the latter two" (0312 comments) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256007 (owner: 10Isart) [17:41:07] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: puppet fail [17:41:12] andrewbogott: there's a problem with the replication between serpens/seaborgium, don't merge yet [17:41:25] moritzm: ok! [17:41:48] ottomata: eventloggingctl start [17:42:27] no, a single instance [17:42:27] ori [17:42:37] i want to start one processor [17:42:38] start eventlogging/$role NAME="$name" CONFIG="$config" [17:42:49] $config is path to file? [17:43:32] start eventlogging/processor NAME=client-side-00 CONFIG=/etc/eventlogging.d/processors/client-side-00 [17:43:36] that should do it i think [17:43:38] been a while [17:44:27] that worked thanks [17:46:42] (03CR) 10Mschon: [C: 031] "looks ok for me" [dns] - 10https://gerrit.wikimedia.org/r/248504 (https://phabricator.wikimedia.org/T599) (owner: 10Dzahn) [17:49:07] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [17:53:03] andrewbogott, Coren: the import wasn't completed, since it tripped over a new service group without a member [17:53:13] %#$ [17:53:35] What group? I can add myself to it to skip over that issue at least. [17:53:38] tools.oubli-signature-bot [17:53:48] Coren: opendj is read-only currently [17:53:50] moritzm: can you remove it from the dump? [17:53:59] yeah, I'll do that [17:53:59] Or do we need to r/w, edit, r/o, start again? [17:54:19] or I can simply add coren's user to the LDIF? [17:54:21] andrewbogott: That'd cause odd issues; but he could add a member "by hand" maybe? [17:54:31] moritzm: That works. [17:54:32] moritzm: yes you can [17:55:01] Coren: what's your uid in LDAP? [17:55:13] how come a new empty service group showed up ? 
[17:55:13] 2138 [17:55:25] akosiaris: Someone removed themselves from it, I expect. [17:55:43] heh, the timing is impeccable [17:55:49] btw that poses an interesting problem [17:56:04] the last member of a service group will no longer be able to remove themselves from a service group [17:56:27] akosiaris: I'm not sure that's undesirable - but it'll need some UI to explain what happens. [17:56:38] Coren: my point exactly [17:57:32] that works, doing the same on seaborgium [17:57:47] akosiaris: but the last member can /delete/ a service group [17:57:59] which is what I would prefer anyway [17:58:07] is this a schema enforcement difference? [17:58:19] chasemp: yes. and a weird one.. in core schemas [17:58:24] I’m still inclined towards a policy of “Your tool has two maintainers or Ops punishes you relentlessly” [17:58:27] andrewbogott: ok then [17:58:28] chasemp: Yes, opendj allowed groups with no members. [17:58:55] akosiaris: but the UI still sucks - right now I think removing the last member will just fail without explanation [17:59:19] hence [17:59:19] https://phabricator.wikimedia.org/T120022 [17:59:19] andrewbogott: that's where I was going at right from the start [17:59:36] then again... the UI is not exactly forthcoming with info about what just went wrong anyway [17:59:43] yeah :( [18:00:04] godog robh: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches)Dependent on LDAP migration window completion (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151208T1800). [18:00:07] godog, robh - are you skipping puppet depl today? 
[18:00:29] Maybe I should change the default wikitech error message to “You did something subtly and unpredictably wrong” with a link to irc :( [18:00:30] yurik: there are no patches and the ldap migration needed to be done [18:00:47] fixed import on seaborgium running, but will need another 10-12 mins to complete [18:01:10] and it seems that's still ongoing so nope [18:01:13] robh, i wanted to git deploy (trebuchet) the maps service - should be about 15 min tops. [18:01:20] yurik: so short answer seems nope [18:01:44] it was pulled off earlier due to ldap and i added it back on in optimistic fervor [18:01:45] i guess i will wait for it to finish, even though it shouldn't really interfere in any way [18:02:03] yurik: You gotta warn people before you use words like "trebuchet" [18:02:06] yea i was convinced it's bad to cross-deploy in case both break ;] [18:02:07] 6operations, 10ops-eqiad, 5Patch-For-Review: Decommission cisco servers, Analytics1003, 1004 and 1010 - https://phabricator.wikimedia.org/T118572#1862735 (10Cmjohnson) [18:02:09] 6operations, 10ops-eqiad, 5Patch-For-Review: Wipe and remove from rack Analytics1003, 1004 and 1010 - https://phabricator.wikimedia.org/T118999#1862733 (10Cmjohnson) 5Open>3Resolved Done...removed entry from switch cfg/vlan [18:02:10] It causes some people to involuntarily twitch :p [18:02:16] ostriches, i hate it as much as any other dev in here ;) [18:02:18] yurik: so indeed, today seems a wash but we can do thursday window instead [18:02:21] (03CR) 10Nuria: Update eventlogging topic parameter to use new format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/257644 (owner: 10Ottomata) [18:02:33] yurik: scap3 ftw! 
:D [18:02:34] just add it on there so we can start reviewing in advance if you have it ready to go =] [18:02:39] robh, i usually deploy it myself, it's not part of the puppet depl [18:02:45] just wanted to take over your window ) [18:02:50] oh [18:02:59] well, i have nothing so if ldap finishes during the puppetswat window [18:03:07] gotcha, thx [18:03:11] and no one objects, you can take it (this is not me approving your merge, just giving up my window ;) [18:03:13] (03PS5) 10Jdlrobson: Enable Cards and RelatedArticles so it rides the train [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257434 (https://phabricator.wikimedia.org/T116676) [18:03:38] (03PS4) 10Jdlrobson: Enable RelatedArticles on all wikipedias in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257435 (https://phabricator.wikimedia.org/T116676) [18:06:46] import on seaborgium is at 53% [18:07:12] moritzm, could you ping me when you are done? [18:07:32] yurik: will do [18:07:36] thx [18:07:51] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:10:20] (03PS1) 10Cmjohnson: Adding labs dns entries for promethium [dns] - 10https://gerrit.wikimedia.org/r/257654 [18:14:30] (03CR) 10Cmjohnson: [C: 032] Adding labs dns entries for promethium [dns] - 10https://gerrit.wikimedia.org/r/257654 (owner: 10Cmjohnson) [18:14:38] PROBLEM - ganeti-noded running on ganeti1002 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 0 (root), command name ganeti-noded [18:14:54] import on seaborgium completed, will make a quick replication test and report back [18:15:15] kk [18:17:50] thanks moritzm, hope you ate an early dinner [18:18:47] andrewbogott: ok, replication confirmed working both ways, we're good to merge [18:18:48] RECOVERY - ganeti-noded running on ganeti1002 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded [18:19:08] moritzm: and those ganeti warnings aren’t something we should care about? 
[18:19:18] andrewbogott: no [18:19:18] (03CR) 10Andrew Bogott: [C: 032] Switch everything to the new openldap ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/256346 (https://phabricator.wikimedia.org/T101299) (owner: 10Andrew Bogott) [18:19:36] it was me trying to help moritz go faster by killing d-i-test from running [18:19:51] turns out it was a misguided attempt since it was running in another node [18:20:02] but it shouldn't be running anyway so no harm done [18:20:14] thanks. wasn't a notable i/o difference, though [18:20:43] yeah, I noticed right before saying "hey, here's what I did for ya!!!" [18:21:21] ok… Coren, chasemp, hashar, ostriches, time to start testing. [18:21:28] On it! [18:21:32] anyway, I did kill and set ADMIN_down on that vm which was probably spinning for a long time reinstalling endlessly itself [18:21:38] akosiaris: http://cdn.meme.am/instances/500x/54942287.jpg [18:21:42] Ping me if you need me to refresh puppet on a system where you don’t have root [18:21:42] https://etherpad.wikimedia.org/p/opendj-migration [18:21:46] and mark your tests as done when they’re done [18:22:04] ostriches: hahaha, nice... :-) [18:23:00] icinga/servermon works.. so all apache LDAP auths should work [18:23:04] in theory at least [18:23:09] and let's log manual puppet runs to avoid overlaps [18:23:17] e.g. plenty of services are on netmon1001 [18:23:27] PROBLEM - ganeti-noded running on ganeti1004 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 0 (root), command name ganeti-noded [18:23:42] !log running puppet on labcontrol1001 [18:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:49] andrewbogott: server = ldaps://ldap-eqiad.wikimedia.org ldaps://ldap-codfw.wikimedia.org [18:23:53] Is the correct one, right? [18:24:15] no, should be ldap-labs.eqiad.wikimedia.org [18:24:22] Hmmmmm, didn't think so [18:24:31] shinken uses ldap login? 
[18:24:31] * ostriches runs puppet a 2nd time, with feeling [18:24:52] ostriches: ldap-labs.eqiad.wikimedia.org and ldap-labs.codfw.wikimedia.org [18:24:53] those are the new ones [18:24:53] maybe my puppet patch missed those [18:25:07] !log running puppet on labcontrol1002 [18:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:27] RECOVERY - ganeti-noded running on ganeti1004 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded [18:25:42] Hmm, I thought we covered it for gerrit. [18:25:42] andrewbogott: puppet run on terbium also has URI ldap://ldap-eqiad.wikimedia.org:389 ldap://ldap-codfw.wikimedia.org:389 [18:26:12] ok, it must be that lots of other things pull in that second role [18:26:12] I’ll just merge the labs role change now [18:26:24] (03PS3) 10Andrew Bogott: Move labs instances to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/257622 [18:27:16] 1 minute... [18:27:38] (03CR) 10Andrew Bogott: [C: 032] Move labs instances to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/257622 (owner: 10Andrew Bogott) [18:27:51] wikitech account creation failed -- running puppet there but unsure if that would catch the change at this moment [18:28:46] ok, try now [18:28:54] chasemp, running puppet there? [18:29:01] How would that fix wikitech account creation? [18:29:12] you need to update the config to point to the new ldap server don't you? [18:29:14] I didn't know if the ldap setting came from puppet for wherever wikitech looks [18:29:24] Krenair: you’re right, that was my mistake [18:29:25] tldr I have no idea how wikitech is configured [18:29:26] no, it comes from mediawiki-config [18:29:28] Krenair: do you mind making a patch for that? 
[18:29:40] * Krenair facepalms [18:29:48] ok [18:29:57] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1862849 (10ori) [18:29:59] 7Blocked-on-Operations, 6operations, 6Reading-Admin, 10Traffic, and 3 others: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1862846 (10ori) 5Open>3Resolved a:3ori Fixed. [18:30:55] Krenair: I'll +2/sync-file it the second you have it up [18:31:27] 6operations, 10ops-eqiad, 6Labs: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#1862850 (10Cmjohnson) moved to b3 connected asw-b-eqiad ge-3/0/19 updated vlan to labs-b updated dns https://gerrit.wikimedia.org/r/#/c/257654/1 currently it's set to install lvm.cfg... [18:32:59] my network connection has decided to adopt a 5-second latency right now, just to keep me on my toes [18:33:00] you just need to change $wgLDAPServerNames in wikitech.php [18:33:14] I'm still waiting for git to pull changes from gerrit [18:34:23] andrewbogott: terbium got the new settings [18:34:38] Krenair: Imma just fix on tin, we'll follow up in gerrit after. [18:34:42] ok [18:34:43] great [18:34:47] Gerrit's hella slow with the restart, all the caches are cold [18:35:17] oh, right [18:35:20] meanwhile neon takes forever to run puppet so I'm waiting to see an update there atm [18:35:21] I can't pull because gerrit is down [18:35:37] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [18:36:28] so, shinken logs in with ldap? [18:36:29] !log demon@tin Synchronized wmf-config/wikitech.php: Point to new ldap servers (duration: 00m 27s) [18:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:37] Krenair: s/down/slow as hell to be unusable/ [18:36:51] "Authentication unavailable at this time." 
[18:36:54] Fannnntastic [18:37:24] Yeah, it took several minutes for a puppet run to get through on terbium - though it eventually worked. [18:37:39] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [18:38:14] wikitech should be working again since that !log [18:39:02] chasemp: ^ [18:39:20] ok diverted on icinga now I think it may have hardcoded ldap [18:39:21] thanks Krenair! [18:39:26] I didn't do anything [18:39:38] is everything being handled? [18:39:46] what is broken atm and/or how can I help? [18:39:56] I'm trying to figure out why gerrit's not happy and accepting logins. [18:40:02] I tried, but gerrit wasn't interested [18:40:29] You can SSH to gerrit and it identifies you that way [18:40:58] ldap-labs.codfw.wikimedia.org:636 [Root exception is java.net.ConnectException: Connection timed out] [18:41:38] paravoid: if you could look into icinga I can go back to looking into wikitech? [18:41:51] ostriches: port is wrong ... it's 389 not 636 [18:41:58] oh [18:42:11] ostriches: does gerrit support STARTTLS ? or just plain SSL ? [18:42:18] for LDAP connections I mean [18:42:21] That's...a good question. [18:42:43] paravoid: the complete checklist is here https://etherpad.wikimedia.org/p/opendj-migration [18:42:50] icinga is also complaining [18:42:53] most things are delegated, but you could help chase and/or ostriches troubleshoot the issues they’re seeing [18:43:04] qchris: yo, you about? [18:43:09] Internal Server Error [18:43:28] ostriches: Yes, but I guess I have to leave soonish. [18:43:30] What's up? [18:43:53] * qchris reads backlog [18:43:59] Swapped to new ldap servers, connections timing out [18:44:07] ostriches: does gerrit support STARTTLS ? or just plain SSL ? [18:44:07] Oh :-/ [18:44:21] icinga doesn't like ldaps:// vs ldap I think [18:44:29] No clue. Let me check the docs. 
[18:44:34] there is no ldaps:// anymore aiui [18:44:49] andrewbogott: User tools work fine from both labs and prod. [18:44:52] although I'd suggest we enable ldaps on the LDAP servers for now [18:44:55] yeah I removed to test and am resetting paravoid [18:45:05] until we track down the ldaps users and convert them to starttls [18:45:08] Coren: great, thanks! [18:45:11] I'll just do that [18:45:27] andrewbogott: Anything I can do while you guys are working on the gerrit thang? [18:45:29] chasemp: want to delegate your wikitech tests to coren? [18:45:31] is gerrit down or slow for folks? [18:45:31] akosiaris/moritzm: any objections? [18:45:38] fine with me [18:45:42] niedzielski, yes, it's having known issues [18:45:44] paravoid: no [18:45:47] andrewbogott: sure [18:45:52] Got 'em [18:45:58] Krenair: ah thanks :) [18:45:58] let's enable ldaps for now indeed [18:45:59] server = <% @ldap_hosts.each do |ldap_host| %>ldaps://<%= ldap_host %> <% end %> [18:46:07] That's probably it, we have ldaps:// in the erb [18:46:13] yes it is [18:46:24] Krenair: Was that patch pushed? [18:46:35] Coren, the mediawiki-config one for wikitech? [18:46:37] yes, but unless you instruct it to also do starttls it might be doing unencrypted LDAP [18:46:40] Yes [18:46:48] Coren, gerrit is having issues at the moment [18:46:51] come back to me gerrit! [18:46:51] so, let's enable ldaps for now [18:46:59] Oh, right, the whole thing is circular. [18:47:01] Duh [18:47:02] 6operations, 10ops-eqiad: Remove all out of warranty unused cp10xx's from A2 - https://phabricator.wikimedia.org/T120856#1862917 (10Cmjohnson) 3NEW a:3Cmjohnson [18:47:36] Coren: The patch was sync'd from tin, we'll clean up the discrepancy in git once we un-kill gerrit. [18:47:52] ostriches: Ah, ty [18:48:00] ostriches: After skimming the docs, I think Gerrit does not support STARTTLS. One has to use ldaps, although it's deprecated. 
[18:49:03] ok [18:49:05] !log enable ldaps:/// on serpens [18:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:12] !log enable ldaps:/// on seaborgium [18:49:16] akosiaris: did you remember to fix ferm too? [18:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:18] moritzm: did you catch the request for ldaps? [18:49:22] paravoid: argh [18:49:23] or, following along at least? [18:49:37] sure [18:49:46] paravoid: Ok, logins to gerrit work again. [18:49:52] seaborgium has a ferm rule for 636 now [18:49:52] 6operations, 10hardware-requests: migrate spares into google sheet tracking & determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1862946 (10Cmjohnson) [18:49:52] That was the "fix" [18:49:54] 6operations, 10ops-eqiad: Remove all out of warranty unused cp10xx's from A2 - https://phabricator.wikimedia.org/T120856#1862947 (10Cmjohnson) [18:50:01] ldaps:// works now for icinga [18:50:32] akosiaris: I'll add a ferm rule on serpens, or are you on it? [18:50:38] done [18:50:45] need to puppetize it though [18:51:04] gerrit is back \o/ [18:51:23] we can do that once gerrit is happy again, also found two further ldap indices which are needed [18:51:25] the bot is not [18:51:26] so https://gerrit.wikimedia.org/r/257657 [18:52:30] Gerrit should be dandy now. [18:52:37] A tad slow as caches warm, but up. [18:52:50] 6operations, 10hardware-requests: migrate spares into google sheet tracking & determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1862968 (10RobH) @mark just approved the decom of the old squids in eqiad systems. @cmjohnson will be linking in a task for that shortly. When... 
[18:52:53] paravoid: merged^ [18:53:00] alright [18:53:12] !log running puppet on serpens/seaborgium [18:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:53:39] I'd do one and then the other, but ok :) [18:53:46] Krenair, Coren: wikitech.php https://gerrit.wikimedia.org/r/#/c/257658/, went ahead and merged since it's already live. [18:53:46] ostriches: what about the bot? [18:53:46] also https://gerrit.wikimedia.org/r/#/c/257659/, will wait until current puppet runs are over [18:53:55] 6operations, 10hardware-requests: migrate spares into google sheet tracking & determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1862969 (10RobH) [18:54:04] paravoid: The bot I dunno, it doesn't survive gerrit reboots well. Needs someone in toollabs to kick it [18:54:07] 6operations, 10hardware-requests: migrate spares into google sheet tracking & determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1862970 (10Cmjohnson) tasked to remove is added to Blocked By: [18:54:09] (it's not actually ldap-dependent) [18:54:36] 6operations, 10hardware-requests: migrate spares into google sheet tracking & determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1862985 (10RobH) All of the spares data is now on the google sheet. [18:55:15] legoktm: Can you kick grrrit-wm? [18:55:18] moritzm: merged [18:55:28] doesn't YuviPanda have to do that now? [18:55:36] ostriches: I think YuviPanda might have to since its in kubernetes now [18:55:52] * ostriches pings YuviPanda a third time, for good measure [18:56:04] let me do it [18:56:33] these puppet runs wiped the ferm rules [18:56:48] since they're not part of https://gerrit.wikimedia.org/r/257657 [18:56:57] (the ones for ldaps) [18:57:03] they are [18:57:08] I still see them .. 
[18:57:10] &R_SERVICE(tcp, (389 636), $ALL_NETWORKS); [18:57:15] ah, sorry [18:57:23] I was still looking at the previously copied filename [18:57:29] :-) [18:58:41] brb [18:58:43] not sure about this [18:58:44] * Shinken <--- either not ldap or doesn't work for me? [18:59:52] chasemp: I am on it [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151208T1900). [19:00:06] since I am anyway messing with the thing these days [19:00:22] although that one is YuviPanda's child [19:00:23] k thanks [19:00:27] Assuming gerrit is now fine, I'm starting the new branch cut for 1.27.0-wmf.8 . jdlrobson Cards will be branched (just doublechecked the config). [19:00:30] ostriches: Not sure how badly you need to turn of ldaps in favor of STARTTLS, but I once had a Gerrit instance that used LDAP + STARTTLS at the reverse proxy, and gerrit used the reverse proxy's authentification information. [19:00:32] what? [19:00:36] did I do now [19:00:38] thcipriani: thanks for the update :) [19:00:54] hashar: are you still about? deployment puppet seems wonky in general with "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not parse for environment production: Syntax error at '<<'; expected '}' at /etc/puppet/manifests/role/eventlogging.pp:53 on node deployment-bastion.deployment-prep.eqiad.wmflabs" [19:00:55] ostriches: That's not smooth for groups, but it works for authentication with ldap + STARTTLS. [19:01:07] chasemp: shinken.wmflabs.org? 
it is 'guest/guest' password :P [19:01:23] chasemp: that smells like a bad merge [19:01:40] yeah, no ldap [19:01:52] ok confirmed them I'll remove from the page [19:02:05] tx [19:03:01] (03PS1) 10Cmjohnson: Adding dhcp entry for promethium bug: task T95185 [puppet] - 10https://gerrit.wikimedia.org/r/257663 [19:03:21] (03PS1) 10Muehlenhoff: Move DNS aliases to the openldap instances (once all tests are fine) [dns] - 10https://gerrit.wikimedia.org/r/257664 [19:03:39] chasemp: more or less family rush (showers, food thrown over the length of the table and other diapers related activities :D [19:03:41] qchris_away: Eh, I think we're fine for now, gerrit's not the only thing that needs ldaps:// at the moment. Does upgrading help? [19:04:03] 7Puppet, 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1863031 (10chasemp) >>! In T120817#1862387, @Andrew wrote: > I think the ENC is broken. It should be mergin... [19:04:10] hashar: Jenkins seems to accept logins just fine, I haven't tested wmf group membership tho. [19:04:18] (03PS2) 10Cmjohnson: Adding dhcp entry for promethium bug: task T95185 [puppet] - 10https://gerrit.wikimedia.org/r/257663 [19:04:22] ostriches: I am not too worried :-} [19:04:38] moritzm: ok, I would say that things are going well enough that we go forward rather than back. hashar, chasemp, Coren, ostriches, agreed? [19:04:40] ostriches: Jenkins has a ldap cache (much like Gerrit I believe). 
[19:04:53] seems nodepool is still able to create instances over the openstack api [19:04:55] andrewbogott: Wikitech seems to be mostly working to date [19:04:55] Gerrit flushes caches when it restarts :P [19:05:01] andrewbogott: So agree [19:05:09] Gerrit's ok, ya [19:05:13] andrewbogott: the only unturned stone atm is deployment-prep has unrelated puppet breakage so I can't test this atm [19:05:13] ostriches: yeah same for Jenkins. But I am not going to kill Jenkins :D [19:05:17] PROBLEM - Labs LDAP on seaborgium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:23] Coren: would you do a service group create/test/delete on wikitech? My access is so slow right now that I can’t get there [19:05:24] what ? [19:05:28] otherwise things look well....except that^ :) [19:05:44] hm [19:05:46] moritzm: > [19:05:48] andrewbogott: I got slowness issues as well atm. Looking if everything is right with Silver [19:05:49] ^ ? [19:06:00] maybe the new server [19:06:07] is epically slow [19:06:15] slapd is running on seaborgium? [19:06:15] chasemp: what is the breakage for beta ? [19:06:19] no, it's not overloaded or someting [19:06:21] yes it is [19:06:32] akosiaris@seaborgium:~$ sudo lsof -i -n -P |grep 389 |wc -l [19:06:32] 990 [19:06:34] maybe a limit ? [19:07:20] wikitech has stopped responding to me on any openstackmanager special page. Worked fine up to a couple minutes ago [19:07:40] hmm, maybe clients not properly closing their connections? [19:08:14] moritzm: this smells of max no of files [19:08:19] Every other page works fine though. 
[19:08:28] it's constantly on 990 connections [19:09:11] proc/20330/limits says it's soft limit 1024, hard limit 3096 [19:09:16] proc/20330/limits says it's soft limit 1024, hard limit 4096 [19:09:21] yup [19:11:16] (03CR) 10Cmjohnson: [C: 032] Adding dhcp entry for promethium bug: task T95185 [puppet] - 10https://gerrit.wikimedia.org/r/257663 (owner: 10Cmjohnson) [19:11:57] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:12:09] Whoops, my fault [19:12:09] what was the max # of conn on the previous server ? [19:12:20] maybe something was not in puppet [19:12:36] previous server was probably a different distro as well [19:12:55] so, fixing the ulimits moritzm? [19:13:16] yeah, let's just add ulimit -n 4096 to the default file? [19:14:15] moritzm: works for me [19:14:22] yes [19:14:35] if the old server still have ldap running, one might want to look at its limit [19:14:46] doing so on seaborgium [19:14:52] hashar: entirely different config [19:15:10] ok ok :) [19:15:50] I've done it manually on seaborgium and restarted slapd, will puppetise this next [19:16:23] alright [19:16:33] 300 connections and rising [19:16:50] RECOVERY - Labs LDAP on seaborgium is OK: LDAP OK - 0.006 seconds response time [19:16:53] And wikitech is well again - definitely the issue. [19:16:57] I suppose when we aren’t all testing at once and caches are warm the load won’t be as heavy [19:17:09] lets see if it borkes again at 990 [19:17:12] (03Restored) 10Yuvipanda: ldap-yaml-enc: tolerate empty 'puppetClass' [puppet] - 10https://gerrit.wikimedia.org/r/257606 (https://phabricator.wikimedia.org/T120817) (owner: 10Hashar) [19:17:13] borks* [19:17:25] 600 connections and rising [19:17:26] if we have too many misbehaving clients, we might consider idletimeout at a later point [19:17:27] Coren: Did you say you had terbium working? 
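The file-descriptor diagnosis above (connections pinned at 990 against a soft limit of 1024, fixed with `ulimit -n 4096`) can be sketched as a small check. This is only an illustration, not anything that was run during the incident: the function names and the 64-descriptor margin are assumptions, and the sample text mimics the layout of the `Max open files` row in `/proc/<pid>/limits`.

```python
def _to_int(value):
    """Convert a limits field to int; 'unlimited' becomes None."""
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: "Max open files  <soft>  <hard>  files"
            fields = line.split()
            return _to_int(fields[3]), _to_int(fields[4])
    raise ValueError("no 'Max open files' row found")


def near_limit(open_count, soft_limit, margin=64):
    """True when the open descriptor count is within `margin` of the soft limit.

    The margin is an assumed allowance for descriptors the daemon uses for
    things other than client connections (log files, databases, and so on).
    """
    return soft_limit is not None and open_count >= soft_limit - margin


# Sample shaped like the values quoted in the channel (hypothetical excerpt).
SAMPLE_LIMITS = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            1024                 4096                 files\n"
)

soft, hard = parse_max_open_files(SAMPLE_LIMITS)
print(soft, hard, near_limit(990, soft))
```

With the soft limit raised to 4096, the later reading of 1018 connections would no longer trip this check.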
[19:17:36] I think most of it it's just labs [19:17:38] I feel like in one of those movies where they just read a meter [19:17:45] ostriches: the userland ldap tools were working fine, yes. [19:18:02] akosiaris: As long as it doesn't get over 9000. :-) [19:18:05] demon@terbium ~$ ldaplist -l passwd [19:18:05] The search returned an error. [19:18:13] Coren: not for me :( [19:18:17] (03CR) 10Hashar: "Well please follow up on https://gerrit.wikimedia.org/r/#/c/257612/ which also add the puppet class 'role::labs::instance' by default? :-}" [puppet] - 10https://gerrit.wikimedia.org/r/257606 (https://phabricator.wikimedia.org/T120817) (owner: 10Hashar) [19:18:38] marc@terbium:~$ ldaplist -l passwd marc [19:18:38] [19:18:38] dn: uid=marc,ou=people,dc=wikimedia,dc=org [19:18:38] displayName: Marc A. Pelletier [19:18:39] uid: marc [19:18:40] [...] [19:18:45] Hmmm. [19:18:45] a lot of VMs are still connecting to neptunium btw [19:19:01] Hah. Enumeration issue [19:19:14] andrewbogott, moritzm: ^^ search with no index, maybe? [19:19:17] akosiaris: that’s hard to avoid — planning to fix that with a dns alias [19:19:25] ok [19:19:25] yes, that didn’t work before either [19:19:37] Coren: ah, also the limit probably ... 500 is the max IIRC [19:19:38] ldaplist -l passwd [19:19:41] we can raise it though [19:19:50] Ah, that'd do it :) [19:19:55] andrewbogott: that worked - we raised the limit >2year ago IIRC [19:20:19] with no args — I tried that a couple of hours ago and it failed then as well [19:20:22] BUT as a rule I raised the limits in opendj to permit it, so I would support raising it now [19:20:29] Coren: yeah, I think we may have just recently hit the new limit [19:20:29] I just used it...a day or two ago [19:20:40] (So +1 to raising it) [19:20:52] andrewbogott: I tested that before the switchover and it worked. 
[19:20:58] PROBLEM - Labs LDAP on serpens is CRITICAL: Could not bind to the LDAP server [19:21:03] so, that limit we did raise it in openldap and then I reverted it and did it only for the replication user since it was killing that one [19:21:14] Coren: ok, maybe I’m mistaken [19:21:29] nice to see the fallback to serpens worked fine [19:21:33] (03PS1) 10Muehlenhoff: Bump fd/connection limit for slapd [puppet] - 10https://gerrit.wikimedia.org/r/257672 [19:21:41] anyway, the new limit would need to be 7000 [19:21:42] Or, y’know, over 9000 [19:22:03] and you might want a monitoring probe for it if at all possible [19:22:14] making the same local change for serpens [19:22:51] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Bump fd/connection limit for slapd [puppet] - 10https://gerrit.wikimedia.org/r/257672 (owner: 10Muehlenhoff) [19:23:07] andrewbogott: wikitech passes all tests except account creation, but that seems like a different issue (viz. etherpad) [19:24:17] Coren: ‘Fatal exception of type PasswordError’? [19:24:23] * Coren nods. [19:24:48] slapd is running hot, 70-90% CPU [19:24:48] RECOVERY - Labs LDAP on serpens is OK: LDAP OK - 0.114 seconds response time [19:24:50] That message comes from mw proper, if I understand well. [19:25:03] seems like that has to be an ldap thing though [19:25:33] andrewbogott: From what I see, it appears to be a crypto thing in core. 
[19:26:00] except on wikitech the passwords aren’t stored in the mw db [19:26:01] but in ldap [19:26:08] -rw-r----- 1 root adm 575M Dec 8 19:26 syslog [19:26:09] heh [19:26:15] we're logging every single search to /var/log/syslog [19:27:14] root@seaborgium:/var/log# grep -c SRCH syslog [19:27:14] 770112 [19:27:16] fun :) [19:27:32] loglevel sync stats [19:27:38] I 'll remove the stats thing [19:27:39] let's drop the stats from debuglevel [19:27:42] andrewbogott: Aha, Good point, and I see the issue: [19:27:54] PHP Warning: ldap_start_tls(): Unable to start TLS: Can't contact LDAP server in /srv/mediawiki/php-1.27.0-wmf.7/extensions/LdapAuthentication/LdapAuthentication.php on line 619 [19:27:56] TLS [19:28:25] (03CR) 10Hashar: "Can you please rebase this change against tip of production branch? Some patchset is cherry picked on deployment-bastion and comes in con" [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [19:28:32] Unlike OpenStackManager (apparently), LdapAuthentication wants to do TLS [19:28:54] we do support TLS [19:29:49] paravoid: Wasn't that the gerrit issue? [19:29:55] no [19:29:59] (and that issue is also fixed) [19:30:05] start TLS != SSL [19:30:15] confusingly enough [19:30:41] it's doing an unencrypted connection and issuing the STARTTLS command to upgrade to an encrypted one [19:31:33] so how is OSM verifying the server's certificate? [19:31:44] akosiaris: Right, vs OpenStackManager that, afaik, uses ldaps: [19:31:55] we support ldaps now too [19:32:05] paravoid: Yes, and OSM works. [19:32:27] er ok [19:32:30] so how is LdapAuthentication verifying the server's certificate? [19:33:53] (03PS1) 10Alexandros Kosiaris: openldap: Parameterize loglevel [puppet] - 10https://gerrit.wikimedia.org/r/257675 [19:34:20] moritzm: ^ [19:34:24] on it [19:34:44] Coren: are you troubleshooting this? 
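The distinction drawn above between `ldaps://` and STARTTLS can be summarized in a short sketch: `ldaps://` wraps the whole connection in TLS from the first byte (conventionally port 636), while STARTTLS opens a plain `ldap://` connection (conventionally port 389) and then upgrades it. The helper below only classifies a URI; it does not open any connection. The function name and the `ldap.example.org` hostname are hypothetical.

```python
from urllib.parse import urlparse

# Conventional default ports for the two LDAP URI schemes.
DEFAULT_PORTS = {"ldap": 389, "ldaps": 636}


def describe_ldap_uri(uri, starttls=False):
    """Return (host, port, mode) for an LDAP URI.

    mode is one of:
      "implicit TLS" - ldaps://, TLS handshake before any LDAP traffic
      "STARTTLS"     - ldap://, plain connect, then upgrade to TLS
      "plaintext"    - ldap:// with no upgrade requested
    """
    parsed = urlparse(uri)
    scheme = parsed.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        raise ValueError(f"not an LDAP URI: {uri!r}")
    port = parsed.port or DEFAULT_PORTS[scheme]
    if scheme == "ldaps":
        mode = "implicit TLS"
    elif starttls:
        mode = "STARTTLS"
    else:
        mode = "plaintext"
    return parsed.hostname, port, mode


print(describe_ldap_uri("ldaps://ldap.example.org"))
print(describe_ldap_uri("ldap://ldap.example.org", starttls=True))
```

This is also why a firewall rule has to cover both 389 and 636 while clients of both kinds exist.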
[19:35:04] (03CR) 10Rush: [C: 031] "not even relevant to deployment prep now :)" [puppet] - 10https://gerrit.wikimedia.org/r/257606 (https://phabricator.wikimedia.org/T120817) (owner: 10Hashar) [19:35:05] paravoid: Hm, it uses ldap_start_tls() [19:35:10] paravoid: I'm looking at it now. [19:35:36] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/257675 (owner: 10Alexandros Kosiaris) [19:35:46] (03CR) 10Alexandros Kosiaris: [C: 032] openldap: Parameterize loglevel [puppet] - 10https://gerrit.wikimedia.org/r/257675 (owner: 10Alexandros Kosiaris) [19:35:47] from the php module [19:36:35] root@seaborgium:~# sudo lsof -i -n -P |grep slapd |wc -l [19:36:36] (03PS2) 10Yuvipanda: ldap-yaml-enc: tolerate empty 'puppetClass' [puppet] - 10https://gerrit.wikimedia.org/r/257606 (https://phabricator.wikimedia.org/T120817) (owner: 10Hashar) [19:36:36] 1018 [19:36:41] I think we are ok on that front [19:36:45] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap-yaml-enc: tolerate empty 'puppetClass' [puppet] - 10https://gerrit.wikimedia.org/r/257606 (https://phabricator.wikimedia.org/T120817) (owner: 10Hashar) [19:37:03] yep, seem so [19:37:36] Error: Failed to apply catalog: Parameter loglevel failed on Class[Openldap]: Invalid value "sync". Valid values are debug, info, notice, warning, err, alert, emerg, crit, verbose. at /etc/puppet/modules/role/manifests/openldap/labs.pp:29 [19:37:38] Coren, paravoid, note that login works on wikitech, just not new account creation [19:37:44] hm ? [19:37:51] so that’s weird. Why would it only use tls on creation? [19:38:00] did I use a reserved word or something ? [19:38:18] how sure are you that TLS is the issue? 
[19:38:23] (03PS1) 10Chad: Gerrit: move static assets to *.cache.* filenames [puppet] - 10https://gerrit.wikimedia.org/r/257676 [19:38:24] 6operations, 10ops-eqiad, 6Labs: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#1863190 (10Cmjohnson) a:5Cmjohnson>3Andrew OS is installed...no puppet certs. Assigning to @andrew [19:38:40] omg we have wikitech account creation broken in the middle of google code-in beginning ? :o [19:38:55] * Nemo_bis starts screaming and running in circles ;) [19:39:16] paravoid: the only thing in error.log on failed attempts is 'PHP Warning: ldap_start_tls(): Unable to start TLS: Can't contact LDAP server in /srv/mediawiki/php-1.27.0-wmf.7/extensions/LdapAuthentication/LdapAuthentication.php on line 619' [19:39:31] failed account creation attempts? [19:39:35] Yep. [19:39:36] akosiaris: alternatively we can set it to the numeric values` [19:39:37] akosiaris: alternatively we can set it to the numeric values= [19:39:39] akosiaris: alternatively we can set it to the numeric values? [19:39:47] not sure, why it complains [19:39:59] moritzm: it's not openldap complaining [19:40:00] it's puppet [19:40:08] loglevel is a metaparameter [19:40:08] https://docs.puppetlabs.com/references/latest/metaparameter.html#loglevel [19:40:13] sigh [19:40:15] ori: thanks! [19:40:17] paravoid: Ah. But not recently. [19:40:23] Coren: last log line is from 19:15 [19:40:26] I was googling for that [19:40:26] so no, that's not it [19:40:27] paravoid: That was on my original attempts, not anymore. [19:40:28] ok fixing [19:40:52] So apparently two issues. [19:41:01] * Coren tries again, from a blank slate. [19:41:04] apparently one issue now [19:41:13] the other one was the ldaps one probably [19:41:17] (which is fixed) [19:41:18] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: puppet fail [19:41:39] moritzm: what kind of password restrictions are in place on the new ldap servers? 
[19:41:39] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: puppet fail [19:41:50] Is it possible we’re just running up against those, and wikitech doesn’t know to enforce? [19:42:31] andrewbogott: I'm putting in a good password - that was my first guess too. [19:43:50] ignore the seaborging/pollux puppet fails, that's me. fixing it [19:44:00] (03PS1) 10Alexandros Kosiaris: openldap: rename loglevel parameter [puppet] - 10https://gerrit.wikimedia.org/r/257678 [19:44:11] andrewbogott: you mean the ACLs or pw enforcements done in slapd? it just stores the pw hashes [19:44:40] (03PS2) 10Alexandros Kosiaris: openldap: rename loglevel parameter [puppet] - 10https://gerrit.wikimedia.org/r/257678 [19:44:46] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] openldap: rename loglevel parameter [puppet] - 10https://gerrit.wikimedia.org/r/257678 (owner: 10Alexandros Kosiaris) [19:44:47] moritzm: the latter [19:44:53] try debugging instead of guessing, I'd say :) [19:45:03] paravoid: that’s terrible advice! [19:45:12] :P [19:45:30] Account creation straight in ldap works fine. [19:45:52] there are no such restrictions, userPassword is just another base64-encoded attribute from the perspective of slapd [19:46:07] looking at LdapAuthentication.php [19:46:16] do you guys have printDebug() going anywhere? [19:46:31] there are some overlays which can enforce restrictions, but we don't use these [19:46:41] (yet) [19:46:45] :) [19:47:08] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:47:10] actually we got a password policy [19:47:20] ####################################################################### [19:47:20] ## Password policy (default is to store passwords in plaintext) [19:47:20] # This policy will store passwords an unsalted SHA1 hashes. 
[19:47:20] # This is a requirement for Google Apps Directory Sync [19:47:21] overlay ppolicy [19:47:22] ppolicy_hash_cleartext [19:47:22] password-hash {SHA} [19:47:27] I just noticed that myself... [19:47:36] but that's about it [19:47:38] nothing else [19:47:47] that may explain it if MW tries to use SSHA [19:47:47] 6operations, 10DBA: Perform a rolling restart of all MySQL slaves (masters too for those services with low traffic) - https://phabricator.wikimedia.org/T120122#1863300 (10Cmjohnson) [19:47:48] for example no http://www.openldap.org/doc/admin24/overlays.html [19:47:49] 6operations, 10ops-eqiad: es1019 and its management interface are unresponsive - https://phabricator.wikimedia.org/T120689#1863297 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson powered off, removed power cables and powered on ---accessible now [19:48:13] the ldap log channel is not enabled, afaict [19:48:13] that password policy is horrible for prod anyway [19:48:40] well, applications should not be sending the password cleartext [19:49:01] it just does a SHA1 when a password is sent cleartext [19:49:04] ori: yes, I’m trying to dig up how to do that now. I had hashed-out lines in the config to easily turn it on but I think those were ‘cleaned up’ [19:49:21] (03PS1) 10Ori.livneh: Enable 'ldap' log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257679 [19:49:24] we can set it differently for those ldap servers though for OIT's that's a given [19:49:34] (03CR) 10Ori.livneh: [C: 032] Enable 'ldap' log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257679 (owner: 10Ori.livneh) [19:49:45] well, fine, be faster than me [19:50:06] (03Merged) 10jenkins-bot: Enable 'ldap' log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257679 (owner: 10Ori.livneh) [19:50:20] so, wikitech is the only issue ? [19:50:28] or we got something I am missing ? 
[19:50:30] * andrewbogott scans the list one more time [19:50:37] er, wikitech password setting that is [19:50:40] Coren: are you still troubleshooting? [19:50:41] !log swapping failed disk db1019 [19:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:51:08] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [19:51:20] !log ori@tin Synchronized wmf-config/InitialiseSettings.php: Enable ldap log channel (duration: 00m 28s) [19:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:51:25] akosiaris: yep, looks to me like wikitech is it. There’s a corresponding gerrit test that we can’t run until we have a new wikitech account. [19:51:29] akosiaris: the password policy is the other issue, anyone on top of this? [19:51:30] paravoid: I'm testing password change on existing account, was going back to it after I either confirmed or eleminated simple password handling, [19:52:07] paravoid: not sure what the issue is [19:52:17] what are you referring to ? [19:52:23] paravoid: And that also fails, though [19:52:28] can you do something to trigger the error? [19:52:31] that we have a password policy that forces unsalted passwords? [19:52:32] * Coren turns on debugging on wikitech [19:52:33] !log fixed ldap logging on seaborgium/serpens [19:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:52:47] 6operations, 10ops-eqiad: db1019 failing disk (degraded RAID) - https://phabricator.wikimedia.org/T120511#1863325 (10Cmjohnson) rebuilding root@db1019:~# megacli -PDList -aALL |grep "Firmware state:" Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state... 
[19:53:06] paravoid: ONLY if they are not already salted by the application setting them [19:53:13] aka we should fix the application [19:53:23] that password policy is for the corp ldap anyway [19:53:27] yes [19:53:29] does not belong to production [19:53:33] so let's get rid of it? [19:53:36] we can remove it or set it to something more sensible [19:53:43] ori: I thought you might like https://gerrit.wikimedia.org/r/#/c/257676/ :) [19:54:25] let's remove for now? we can add a more prod-aligned version later on [19:54:29] yeah [19:54:52] paravoid: so without it, in case an application sets a password, it will be stored in cleartext [19:55:08] 6operations, 10ops-eqiad: cp1037-1040 reclaim as spares - https://phabricator.wikimedia.org/T83553#1863329 (10Cmjohnson) Nothing left to do here...going to decommission them with the other old varnish [19:55:25] assuming the application does not already hash the password that is [19:55:46] which I find pretty probable it doesn't since all applications where dependant on opendj up to now doing it for them [19:56:46] ostriches: oo, nice [19:58:06] akosiaris: can we at least remove it as a test to see if that fixes wikitech? [19:58:51] wait [19:58:55] can you trigger the error one more time? [19:59:12] ori: sure, one second [19:59:16] paravoid: moritzm how about setting it to password-hash {SSHA} ? [19:59:43] that way we avoid unsalted passwords and cleartext passwords by mistake into the LDAP servers [19:59:55] akosiaris: works for me [20:00:03] ori: done [20:00:57] andrewbogott: I obviously have no problem removing that but somehow I doubt it's the source of the problem, it anyway works just fine for OIT ... [20:01:32] Aha. [20:01:39] (03PS41) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:01:43] It creates the account, but still gives the error. 
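For reference, the difference between the `{SHA}` and `{SSHA}` schemes discussed above can be sketched as follows. This follows the commonly documented userPassword layout (base64 of the SHA-1 digest, with the salt appended after the digest for `{SSHA}`); the function names and the 8-byte salt length are assumptions for illustration, and this is not the code slapd itself runs.

```python
import base64
import hashlib
import hmac
import os


def ldap_sha(password: bytes) -> str:
    """Unsalted {SHA}: identical passwords always give identical values."""
    digest = hashlib.sha1(password).digest()
    return "{SHA}" + base64.b64encode(digest).decode("ascii")


def ldap_ssha(password: bytes, salt: bytes = None) -> str:
    """Salted {SSHA}: base64(sha1(password + salt) + salt)."""
    if salt is None:
        salt = os.urandom(8)  # assumed salt length for this sketch
    digest = hashlib.sha1(password + salt).digest()
    return "{SSHA}" + base64.b64encode(digest + salt).decode("ascii")


def check_ssha(password: bytes, stored: str) -> bool:
    """Verify a password against a stored {SSHA} value."""
    raw = base64.b64decode(stored[len("{SSHA}"):])
    digest, salt = raw[:20], raw[20:]  # a SHA-1 digest is 20 bytes
    candidate = hashlib.sha1(password + salt).digest()
    return hmac.compare_digest(candidate, digest)


stored = ldap_ssha(b"correct horse")
print(ldap_sha(b"correct horse") == ldap_sha(b"correct horse"))  # same every time
print(ldap_ssha(b"correct horse") == stored)                     # differs per salt
print(check_ssha(b"correct horse", stored))
```

Either scheme only applies when the client sends the password in cleartext; a value the application has already hashed is stored as received.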
[20:02:01] lol [20:02:06] [in ldap] [20:02:06] Coren: yeah, I noticed a different error on my second attempt [20:02:20] andrewbogott: I got debugging turned on, and got logs now. [20:02:29] https://phabricator.wikimedia.org/P2384 is the debug log [20:02:32] so, "your account was created but please go away and never use it!" [20:02:45] * Coren turns it back off, and inspects the logs. [20:02:47] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [20:03:14] ori: ok, the andrewtestaccount101 was a ‘this account already exists’ [20:03:23] can you grab the log snippet for the ‘andrewtestaccount102’ attempt? [20:03:37] i dont see it in ldap.log [20:03:44] maybe it was before i enabled the log channel? [20:04:19] andrewbogott: I already turned it off - that log grows FAST. Maybe you're in the period though. [20:04:32] andrewbogott: where are you turning it on and off? [20:04:41] err, that was @Coren [20:04:57] 2015-12-08 19:59:37 silver labswiki ldap INFO: 2.1.0 Entering Connect [20:04:57] 2015-12-08 19:59:37 silver labswiki ldap INFO: 2.1.0 Using TLS or not using encryption. [20:05:01] heh [20:05:18] if it wasn't saying 2 lines below that it is using TLS i 'd be worried [20:05:19] ori: with the MW_DEBUG_LOCAL environment variable [20:05:21] yeah, not a very smart message [20:06:00] Coren: all that you're doing with that is preventing it from getting routed to fluorine [20:06:14] which is where it now goes, with the patch above merged [20:06:28] which is why the log file there is incomplete [20:06:30] ori: Ah, sorry - we were working in parallel at cross-purposes. :-) [20:06:35] yeah [20:07:14] ori: Well, I have the logfile locally for my attempt which I'm looking at now. 
[20:07:28] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [20:08:07] andrewbogott: for Andrewtestaccount it was " Failed to modify the user's password" [20:08:14] ori: that’s... [20:08:17] sorry, this is confusing [20:08:27] two things are happening, one is trying to create an account that already exists. [20:08:31] which /should/ fail [20:08:34] moritzm: bdb_equality_candidates: (aAAARecord) not indexed [20:08:35] and the other is failing to create a new account [20:08:41] another index to add to the mix ? [20:08:51] ori: Yeah, got the same thing for mine. [20:09:00] becuase we didn’t know that the ‘failures’ were actually creating busted accounts, there were a lot of ‘trying to create an account that already exists’ issues [20:09:05] which are red herrings [20:09:16] …probably [20:09:35] * andrewbogott retracts everything he just said [20:09:41] ? [20:09:44] I noticed this and a few others (roleOccupant and puppetVar), but they are fairly rare [20:09:54] So: in my logs: [20:10:08] moritzm: ok, so let's add all of those... a few more slapindexes wont hurt [20:10:10] but I'll prepare a patch, won't hurt to index them as well [20:10:21] aAAARecord would probably be looked up on PTR DNS requests [20:10:28] yup.. and it will hurt to not index them so let's do it [20:10:59] yep, I'll make another check for unindexed attributes and prepare a patch [20:11:48] PROBLEM - puppet last run on db2042 is CRITICAL: CRITICAL: puppet fail [20:14:00] (03PS1) 10Muehlenhoff: Extend LDAP indices for labs with puppetVar, roleOccupant and aAAARecord [puppet] - 10https://gerrit.wikimedia.org/r/257686 [20:15:30] ldap INFO: 2.1.0 Adding user [20:15:30] [...] 
[20:15:30] 2015-12-08 20:01:22 silver labswiki ldap INFO: 2.1.0 Failed to modify the user's password [20:15:30] 2015-12-08 20:01:22 silver labswiki connect INFO: Connected to database 0 at silver [20:15:30] 2015-12-08 20:01:22 silver labswiki memcached DEBUG: delete(global:account:3563d3129ee9d70a89d9dd6e520e97b9:lock) [20:15:31] 2015-12-08 20:01:22 silver labswiki wfDebug DEBUG: User::getBlockedStatus: checking... [20:15:32] 2015-12-08 20:01:22 silver labswiki Bug56269 WARNING: Exception thrown with an uncommited database transaction: [5f3deffc] /w/index.php?title=Special:UserLogin&action=submitlogin&type=signup PasswordError from line 2390 of /srv/mediawiki/php-1.27.0-wmf.7/includes/User.php: There was either an authentication database error or you are not allowed to update your external account. {"exception_id":"5f3deffc"} [20:15:33] [Exception PasswordError] (/srv/mediawiki/php-1.27.0-wmf.7/includes/User.php:2390) There was either an authentication database error or you are not allowed to update your external account. [20:16:20] So, from what I get, the accounts gets partially created, fails on 'modify the user's password', then breaks when trying to login automagically. [20:16:49] Coren: that’s with a fresh username? Not just a trying-to-create-existing-account? [20:16:59] andrewbogott: Yep, that's a fresh username. [20:17:48] so… I still want to blame ldap for rejecting the inital password setting [20:17:51] there is no rollback for removing the account if setting the password fails [20:17:54] So, afaict, it creates the account in LDAP and then fails to create the matching mediawiki user [20:17:57] Dec 8 20:16:25 seaborgium slapd[27255]: slap_global_control: unrecognized control: 1.3.6.1.4.1.4203.666.5.16 [20:18:01] correct [20:18:05] https://tools.ietf.org/html/draft-masarati-ldap-deref-00 [20:18:08] weird ... [20:18:18] what's trying to use LDAP dereferences ? 
[20:18:24] Coren: can you verify via an ldap call that the ldap user exists with the proper password? [20:19:11] progress [20:19:17] i disabled warning suppression around the ldap_modify call [20:19:24] seeing this in apache error log now: [20:19:34] andrewbogott: It exists, lemme see if I can auth against it. [20:19:36] [Tue Dec 08 20:14:19.114180 2015] [:error] [pid 9228] [client XXX:56645] PHP Warning: ldap_modify(): Modify: Insufficient access in /srv/mediawiki/php-1.27.0-wmf.7/extensions/LdapAuthentication/LdapAuthentication.php on line 206, referer: https://wikitech.wikimedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=signup&returnto=Main+Page [20:19:51] ah, that's interesting [20:19:53] akosiaris: maybe the ACL for parsing the Directory Managers membership? [20:20:03] so, insufficient access for setting the password [20:20:03] aha! The account has enough privilege to create the account, but not set password? [20:20:06] yes [20:20:11] lemme see [20:20:43] yes [20:20:45] (I have a couple of live-hacked edits on /srv/mediawiki/php-1.27.0-wmf.7/extensions/LdapAuthentication/LdapAuthentication.php on silver, btw) [20:20:45] that's it [20:20:51] ok submitting a patch [20:21:41] ima remove my tests from ldap in the meantime. [20:21:50] since they aren't in mediawiki [20:22:46] Coren: mine too while you’re in there? [20:23:02] andrewbogott: name(s)? [20:23:04] ‘andrewtestaccountxxx’ [20:23:09] and ‘andrewtestaccount’ [20:23:14] and Oritestaccount [20:23:39] I got andrewtestaccount but what are the 'xxx'? [20:23:48] oritestaccount dead [20:23:49] what's the task # for this (if there is one)? I ask because I want to file a bug for LdapAuthentication for swallowing useful debug data from ldap_* calls [20:24:19] bah internet. 
[20:24:22] if i have something in hiera that looks like this: [20:24:27] eqiad_primary: [20:24:27] kafka1001.eqiad.wmnet: [20:24:27] id: 1001 [20:24:27] kafka1002.eqiad.wmnet: [20:24:27] id: 1002 [20:24:52] and, in other places in puppet, i need a variable in a list that looks like this [20:25:00] ori: for the create account issue specifically? None (yet) - we've been working of an etherpad checklist atm [20:25:12] ottomata1: fwiw, we sometimes use the term "primary" to refer to just eqiad, e.g. $::mw_primary [20:25:27] $kafka_brokers = [kafka1001.eqiad.wmnet, kafka1002.eqiad.wmnet] [20:25:28] etc. [20:25:36] what is the best way to make that available? [20:25:50] i can't do it programmatically with hiera (right?) [20:25:56] previously I had a ::config class [20:26:12] and then that could be accessed via the fq var name, e.g. [20:26:23] $role::analytics::kafka::config::brokers_array [20:26:27] keys(hiera_hash(eqiad_primary)) ? [20:26:31] ottomata that's a hash [20:26:41] yes [20:26:55] used to do [20:26:56] $brokers_array = keys($brokers) [20:27:08] ottomata1: no i mean it still is a hash [20:27:12] what you pasted above [20:27:19] $kafka_brokers = [kafka1001.eqiad.wmnet, kafka1002.eqiad.wmnet] [20:27:19] ? 
[20:27:27] (10:24:27 PM) ottomata1: eqiad_primary: [20:27:28] (10:24:27 PM) ottomata1: kafka1001.eqiad.wmnet: [20:27:28] (10:24:27 PM) ottomata1: id: 1001 [20:27:28] (10:24:27 PM) ottomata1: kafka1002.eqiad.wmnet: [20:27:28] (10:24:27 PM) ottomata1: id: 1002 [20:27:29] no, the "something in hiera that looks like this" [20:27:30] that's a hash [20:27:53] yes [20:27:57] i know that [20:27:59] and a simple $myhash = hiera('something::something') will give a hash to you [20:28:03] (03PS2) 10Muehlenhoff: Move DNS aliases to the openldap instances (once all tests are fine) [dns] - 10https://gerrit.wikimedia.org/r/257664 [20:28:05] jaja [20:28:09] ok [20:28:10] Coren: andrewtestaccount101 and 102 &c [20:28:10] (there is no &c) [20:28:15] I am talking to much, tell me [20:28:18] too* [20:28:23] Coren: sorry, I’ll get them if I’m too late [20:28:29] but, it'd be nice if i didn't have to compute the values of the list of brokers every place I needed it [20:28:36] which is most places except for the kafka server [20:28:42] andrewbogott: Nope. They be dead now. [20:28:46] thanks [20:29:00] And… has anyone verified that account creation works now? [20:29:00] * andrewbogott hasn't [20:29:06] no it doesn't [20:29:09] it's blocked on me [20:29:28] No, just waiting for alex to get a minute to make the acl patch. [20:29:47] right now i can make a nice little useful variable [20:29:52] because I can do it in puppet [20:29:52] yeah I am looking at it ... I have to change the base ACLs ... [20:30:33] ok I think I got it, I just hope I don't break the OIT ones...
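The hash-versus-list exchange above comes down to taking the keys of the hiera hash, which is what `keys($brokers)` does in Puppet. A minimal Python sketch of the same idea, assuming the two-broker hash pasted in the channel; `eqiad_primary` here is a hand-written stand-in for the hiera data and `broker_names` is a hypothetical helper, not anything from the puppet repo.

```python
# Hand-written stand-in for the hiera hash pasted in the channel (assumption:
# the real data has the same shape, hostname -> {"id": <broker id>}).
eqiad_primary = {
    "kafka1001.eqiad.wmnet": {"id": 1001},
    "kafka1002.eqiad.wmnet": {"id": 1002},
}


def broker_names(brokers):
    """Return the broker hostnames (the hash keys) as a sorted list."""
    return sorted(brokers)


# Compute the list once and pass it around, rather than re-deriving it
# from the hash in every place that needs it.
kafka_brokers = broker_names(eqiad_primary)
print(kafka_brokers)
```

Computing the list once in a shared place is the point ottomata1 raises: consumers only need the names, not the per-broker settings.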
[20:31:26] !log ori@tin Synchronized php-1.27.0-wmf.7/extensions/MobileFrontend: I8a1fb69724b: Vary HTML output by NetSpeed designation (duration: 00m 28s) [20:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:32:46] 7Puppet, 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1863574 (10hashar) I have rebased puppet on deployment-puppetmaster and ran puppet agent on it to update the... [20:33:37] (03PS1) 10Alexandros Kosiaris: openldap: Update base acls [puppet] - 10https://gerrit.wikimedia.org/r/257690 [20:33:39] (03PS1) 10Alexandros Kosiaris: openldap: Allow to specify cleartext hashing scheme [puppet] - 10https://gerrit.wikimedia.org/r/257691 [20:34:32] !log restbase: canary deploy of 49fdc615b to restbase1001 [20:34:33] 7Puppet, 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1863581 (10chasemp) 5Open>3Resolved [20:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:34:47] (03CR) 10coren: [C: 031] "Pretty sure all this does is (correctly) allow password changes to that group." [puppet] - 10https://gerrit.wikimedia.org/r/257690 (owner: 10Alexandros Kosiaris) [20:35:18] Coren: I am reluctant still ... so I think I got a better approach overall [20:35:28] akosiaris: Oh? [20:35:32] mostly cause the OIT mirror don't have that entry [20:35:38] !log Jenkins: changing LDAP config from ldaps://ldap-eqiad.wikimedia.org:636 to ldaps://ldap-labs.eqiad.wikimedia.org:636 [20:35:41] and they share the same module [20:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:00] akosiaris: Wouldn't that just mean the OIT mirror can't be used to change passwords? 
[20:36:16] Coren: or it might just crash [20:36:30] akosiaris: Heh. Much faith? :-) What's your alternative approach? [20:36:33] when something tries to read a password and finds an ACL with an entry that does not exist [20:36:41] (03CR) 10Muehlenhoff: [C: 031] openldap: Allow to specify cleartext hashing scheme [puppet] - 10https://gerrit.wikimedia.org/r/257691 (owner: 10Alexandros Kosiaris) [20:37:10] oh, JFDI [20:37:29] !log restbase: start deploy of 49fdc615b to all nodes [20:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:37:39] (03CR) 10Muehlenhoff: [C: 031] "Good catch" [puppet] - 10https://gerrit.wikimedia.org/r/257690 (owner: 10Alexandros Kosiaris) [20:39:39] RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:41:03] so moritzm, I think it's safer to just change the ACL evaluation order [20:41:22] and put the extras before the base ones [20:41:24] (03PS2) 10BBlack: varnish: always use backend_random for pass/hfp [puppet] - 10https://gerrit.wikimedia.org/r/257636 (https://phabricator.wikimedia.org/T96847) [20:41:26] (03PS2) 10BBlack: add backend_random to maps and upload clusters config [puppet] - 10https://gerrit.wikimedia.org/r/257635 (https://phabricator.wikimedia.org/T96847) [20:41:28] (03PS2) 10BBlack: add backend_random to maps and upload clusters in conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/257634 (https://phabricator.wikimedia.org/T96847) [20:41:30] (03PS2) 10BBlack: cache_upload: remove unused "rendering" backend [puppet] - 10https://gerrit.wikimedia.org/r/257633 (https://phabricator.wikimedia.org/T96847) [20:41:32] (03PS2) 10BBlack: varnish: return (pass) for CAL URLs [puppet] - 10https://gerrit.wikimedia.org/r/257632 (https://phabricator.wikimedia.org/T96847) [20:41:33] already submitting a patch [20:41:34] (03PS2) 10BBlack: varnish: cache /api/rest_v1/ in backends [puppet] - 10https://gerrit.wikimedia.org/r/257630 
(https://phabricator.wikimedia.org/T96847) [20:41:36] (03PS2) 10BBlack: varnish: security_audit backend explicitly tier-one-only [puppet] - 10https://gerrit.wikimedia.org/r/257631 (https://phabricator.wikimedia.org/T96847) [20:41:38] (03PS2) 10Alexandros Kosiaris: openldap: Allow to specify cleartext hashing scheme [puppet] - 10https://gerrit.wikimedia.org/r/257691 [20:41:40] (03PS2) 10Alexandros Kosiaris: openldap: Prepend extra ACLs to base ACLs [puppet] - 10https://gerrit.wikimedia.org/r/257690 [20:41:49] moritzm: ^ [20:42:05] I'll have a look [20:42:21] pfff, no that wont work nicely either... [20:42:29] it will allow everyone to get the password [20:42:46] but at least now it's easier to fix the per domain acl [20:43:58] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me. In the original which added support for extra ACLs I mentioned the possible need for pre- and post ACLs later on. Turns " [puppet] - 10https://gerrit.wikimedia.org/r/257690 (owner: 10Alexandros Kosiaris) [20:45:01] oh, indeed [20:45:08] (03CR) 10coren: [C: 031] "With the caveats that the extra acls may need to be reexamined given the changed order." [puppet] - 10https://gerrit.wikimedia.org/r/257690 (owner: 10Alexandros Kosiaris) [20:45:59] in the mean time shall we already go ahead and flip the DNS aliases for ldap-eqiad and ldap-codfw? [20:46:11] moritzm: yes please [20:46:26] I think it makes sense to do so now. [20:47:11] could someone review https://gerrit.wikimedia.org/r/#/c/257664/ ? 
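The ordering concern in the ACL discussion above can be illustrated with a toy model. slapd tries access rules in order and, for a given target, the first matching `by` clause decides unless its control is `break`, which lets later rules for the same target be tried. The sketch below is a simplified Python model under those assumptions only: the rule contents, the `admins` subject and the `evaluate` helper are hypothetical, and this is not the content of the patch under review.

```python
NONE, WRITE = "none", "write"


def evaluate(rules, who, attr):
    """Toy first-match evaluation of ordered access rules.

    rules: list of (target_attr, [(subject, level, control), ...]).
    control is "stop" (this clause decides) or "break" (fall through to
    the next rule for the same target). Real slapd also accumulates
    privileges and supports richer subjects; that is ignored here.
    """
    for target, clauses in rules:
        if target != attr:
            continue
        for subject, level, control in clauses:
            if subject in (who, "*"):
                if control == "break":
                    break  # try the next rule that matches this target
                return level
        else:
            return NONE  # implicit "by * none stop" closes every rule
    return NONE


# Hypothetical base rule: only the entry owner may write the password.
base = [("userPassword", [("self", WRITE, "stop"), ("*", NONE, "stop")])]
# Hypothetical extra rule: an admin group may write; everyone else falls through.
extra = [("userPassword", [("admins", WRITE, "stop"), ("*", NONE, "break")])]

print(evaluate(base + extra, "admins", "userPassword"))  # extras appended
print(evaluate(extra + base, "admins", "userPassword"))  # extras prepended
```

With the extras appended, the base rule's catch-all clause answers first and the admin subject never reaches its own rule; prepending the extras with a `break` fall-through lets the base rule still apply to everyone else.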
[20:47:58] (03CR) 10Alexandros Kosiaris: [C: 031] Move DNS aliases to the openldap instances (once all tests are fine) [dns] - 10https://gerrit.wikimedia.org/r/257664 (owner: 10Muehlenhoff) [20:48:00] 6operations: Clean up some accidental restbase metrics - https://phabricator.wikimedia.org/T120870#1863673 (10GWicke) 3NEW a:3fgiunchedi [20:48:12] 6operations, 10ops-eqiad: Remove all out of warranty unused cp10xx's from A2 - https://phabricator.wikimedia.org/T120856#1863681 (10RobH) @cmjohnson: Please note we cannot really wipe a SSD, as they have to make use of trim for that fuction. (Writing zeros does nothing and actually degrades the SSD.) So, we... [20:48:17] (03CR) 10coren: [C: 031] "Yep" [dns] - 10https://gerrit.wikimedia.org/r/257664 (owner: 10Muehlenhoff) [20:48:38] RECOVERY - RAID on db1019 is OK: OK: optimal, 1 logical, 2 physical [20:49:13] (03CR) 10Andrew Bogott: [C: 031] Move DNS aliases to the openldap instances (once all tests are fine) [dns] - 10https://gerrit.wikimedia.org/r/257664 (owner: 10Muehlenhoff) [20:49:37] merging [20:49:52] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move DNS aliases to the openldap instances (once all tests are fine) [dns] - 10https://gerrit.wikimedia.org/r/257664 (owner: 10Muehlenhoff) [20:50:09] !log restbase: finished deploy of 49fdc615b to all nodes [20:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:51:53] !log moved dns aliases ldap-eqiad.wikimedia.org and ldap-codfw.wikimedia.org to seaborgium/serpens (these are for compat reasons, using ldap-labs.[eqiad|codfw].wikimedia.org is preferred [20:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:53:06] Coren: If I've got a CN or mail attribute but I dunno the uid, how can one look it up in ldaplist? [20:53:12] * ostriches feels like he's missing something stupid [20:53:50] With ldaplist, I don't think you can. ldapsearch might be better for that. [20:54:02] At least, not by mail. 
[20:54:29] but you can ldaplist -l passwd though. [20:54:45] Ah gotcha, didn't know cn would work [20:54:49] ostriches: https://tools.wmflabs.org/contact :-) [20:55:15] won't work for email either, but will work for cn (with wildcard matching) [20:56:00] "400 - Bad Request" [20:56:02] :( [20:56:04] With the CN I have [20:56:24] (03CR) 10Ottomata: "Done, thanks Hashar" [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [20:56:58] ostriches: :( yeah, it probably has some bugs :D [20:58:06] (03PS42) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:58:39] 6operations, 10ops-eqiad: db1019 failing disk (degraded RAID) - https://phabricator.wikimedia.org/T120511#1863730 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson fixed [20:58:47] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail [20:59:03] The user clearly exists in wikitech's database. So why can I not find them in ldap? [20:59:12] * ostriches finds something to kick [20:59:36] (03PS3) 10Alexandros Kosiaris: openldap: Allow to specify cleartext hashing scheme [puppet] - 10https://gerrit.wikimedia.org/r/257691 [20:59:37] ostriches: try searching for the wikitech username in ldap instead? [20:59:38] (03PS3) 10Alexandros Kosiaris: openldap: Prepend extra ACLs to base ACLs [puppet] - 10https://gerrit.wikimedia.org/r/257690 [20:59:50] I tried that, that should be the CN [21:00:05] ostriches: what/who are you looking for? [21:00:23] no, cn is the shell name [21:00:28] eh [21:00:34] uid is the shell name, you're right [21:00:38] I'm confusing cn and dn [21:00:41] cn = canonical name [21:01:06] or rather, I thought the cn was always the leftmost part of the dn, which is not true [21:01:13] dn = distinguished name (unique in the entire tree). [21:01:21] andrewbogott: Neil Quinn. I'm trying to add him to wmf ldap group. 
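The cn/uid/dn mix-up above is easy to check on sample data. In the standard LDAP schema `cn` is the commonName attribute, `uid` is the login name, and the dn is the entry's full distinguished name, whose leftmost component may use either attribute. A minimal Python sketch of the "list everything and grep" workflow, assuming made-up entries under `dc=example,dc=org`; `uids_by_cn` and `leftmost_rdn` are hypothetical helpers and do not call any LDAP server.

```python
from fnmatch import fnmatchcase

# Made-up directory entries (assumption: real entries carry at least
# these three fields; names and DNs here are placeholders).
entries = [
    {"dn": "uid=jdoe,ou=people,dc=example,dc=org", "uid": "jdoe", "cn": "Jane Doe"},
    {"dn": "uid=rroe,ou=people,dc=example,dc=org", "uid": "rroe", "cn": "Richard Roe"},
]


def uids_by_cn(entries, pattern):
    """Return the uid of every entry whose cn matches a shell-style wildcard."""
    wanted = pattern.lower()
    return [e["uid"] for e in entries if fnmatchcase(e["cn"].lower(), wanted)]


def leftmost_rdn(dn):
    """Split off the leftmost RDN as (attribute, value).

    Naive: does not handle escaped commas or multi-valued RDNs.
    """
    attr, _, value = dn.split(",", 1)[0].partition("=")
    return attr, value


print(uids_by_cn(entries, "jane*"))
print(leftmost_rdn(entries[0]["dn"]))
```

In these sample entries the leftmost part of the dn is the uid, not the cn, which is the distinction valhallasw corrects in the exchange above.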
[21:02:21] ostriches: Neil P. Quinn-WMF — neilpquinn-wmf [21:02:25] cn/uid [21:02:52] ldaplist couldn't find it with that CN. [21:02:55] That's what I had, hmm [21:03:16] valhallasw: tyvm [21:03:22] hm, this reminds me, moritzm can you raise the unindexed search limit to 10,000? [21:03:35] (03PS1) 10GWicke: Share the RESTBase config template between production and labs [puppet] - 10https://gerrit.wikimedia.org/r/257696 [21:03:46] heh, that's my workflow :p [21:03:51] ostriches: and the numeric uid is 12049, if that's what you needed [21:03:52] I just ldaplist -l passwd and grep :p [21:04:30] moritzm: ok I think I got a good one this time https://gerrit.wikimedia.org/r/#/c/257690/ [21:04:45] valhallasw: Nah, modify-ldap-group wants the text uid. [21:04:58] akosiaris: yep, already looking into it [21:06:24] yeah, stopping ACL processing will work [21:06:35] (with "break" I mean) [21:07:18] moritzm: ok, merging then [21:07:25] and let's see everything break!!! [21:07:31] * andrewbogott has high hopes [21:07:39] (03CR) 10Alexandros Kosiaris: [C: 032] openldap: Prepend extra ACLs to base ACLs [puppet] - 10https://gerrit.wikimedia.org/r/257690 (owner: 10Alexandros Kosiaris) [21:07:41] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/257690 (owner: 10Alexandros Kosiaris) [21:08:33] andrewbogott: the sizelimit is the same, not matter whether indexed or not, what do we need 10000 for? 
[21:08:57] moritzm: maybe I’m asking for the wrong thing [21:09:12] I want users to get a list of all user accounts [21:09:35] via ldaplist -l passwd [21:09:46] !log checking out wmf/1.27.0-wmf.8 on tin for train deploy [21:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:09:51] With opendj, at least, that failed with an error about exceeding the record limit for unindexed searches [21:09:55] and I raised it, and it worked [21:09:57] ok, so the default limit is 2048 [21:10:18] slapd does that as a protection against resource overconsumption by malicious users [21:10:28] we can just raise the limit for the ldaplist user [21:10:36] is it a dedicated one ? [21:10:43] or that proxyagent thing ? [21:10:46] but we can raise it for selected users or groups [21:10:50] btw... it seems it works!!! [21:11:00] proxyagent [21:11:03] (03CR) 10Mobrovac: [C: 04-1] "If we want to share the config between labs and prod, then we need to be able to parametrise the list of domains to be included in the con" [puppet] - 10https://gerrit.wikimedia.org/r/257696 (owner: 10GWicke) [21:11:08] andrewbogott: I'll make a patch [21:11:11] thanks [21:11:23] isn't proxyagent used pretty much everywhere ? [21:12:05] akosiaris: yep [21:12:05] ok done I think. I managed to not get my password [21:12:15] and still I think the ACLs will work [21:12:16] (03CR) 10GWicke: "Yeah, good point. Any ideas for how to best do that?" [puppet] - 10https://gerrit.wikimedia.org/r/257696 (owner: 10GWicke) [21:12:28] andrewbogott: Coren wanna try wikitech account creation ? [21:12:36] I just did, will try again [21:12:36] akosiaris: Trying now [21:13:08] same failure as before [21:13:24] Idem [21:13:28] hmm [21:13:29] oh [21:14:00] I checked via tcpdump on nembus and there's still a few queries against it (e.g. 
from piramido.editor-engagament.eqiad.wmflabs), some might be due to cached hostnames from the old alias, but some might actually hardcode them [21:14:16] Coren: once more please [21:15:00] I can haz suxess. [21:15:01] (03PS3) 10BBlack: varnish: always use backend_random for pass/hfp [puppet] - 10https://gerrit.wikimedia.org/r/257636 (https://phabricator.wikimedia.org/T96847) [21:15:03] (03PS3) 10BBlack: add backend_random to maps and upload clusters config [puppet] - 10https://gerrit.wikimedia.org/r/257635 (https://phabricator.wikimedia.org/T96847) [21:15:05] (03PS3) 10BBlack: add backend_random to maps and upload clusters in conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/257634 (https://phabricator.wikimedia.org/T96847) [21:15:07] (03PS3) 10BBlack: cache_upload: remove unused "rendering" backend [puppet] - 10https://gerrit.wikimedia.org/r/257633 (https://phabricator.wikimedia.org/T96847) [21:15:08] Coren: great! [21:15:09] (03PS3) 10BBlack: varnish: return (pass) for CAL URLs [puppet] - 10https://gerrit.wikimedia.org/r/257632 (https://phabricator.wikimedia.org/T96847) [21:15:11] (03PS1) 10BBlack: varnish: use same VCL files for text+mobile [puppet] - 10https://gerrit.wikimedia.org/r/257699 (https://phabricator.wikimedia.org/T109286) [21:15:15] ok I think that's solved [21:15:32] What was that last thing missing? [21:15:39] slapd restart [21:15:45] I thought puppet did it but nope [21:15:51] Ah! [21:15:55] * akosiaris facepalms [21:15:59] I'll doublecheck the list tomorrow, but maybe someone can send an announcement to the labs list to update possible hardcoded usages of nembus/neptuniun [21:16:06] moritzm: there is probably some replication traffic between nembus and neptunium [21:16:15] akosiaris: thanks! [21:16:18] yeah, that as well (but I filtered that one out) [21:16:19] yw [21:16:22] and monitoring from neon [21:16:29] I am happy we got it working [21:16:34] damn ACLs... [21:16:41] Coren, can you verify that your new user works on gerrit as well? 
[21:16:55] andrewbogott: Hm, kk - checking that [21:17:02] can't get on piramido.eqiad.wmflabs to check yet even as root hm [21:17:07] for most labs instances with a hardcoded name it'll work mostly fine (unless a new users logs in which isn't in the stale data) [21:17:36] (03CR) 10Mobrovac: "One obvious answer would be to have the domain list stored as an array in hiera, which can then be overwritten for deployment-prep. Howeve" [puppet] - 10https://gerrit.wikimedia.org/r/257696 (owner: 10GWicke) [21:18:05] other examle: agent3.security-tools.eqiad.wmflabs [21:18:24] andrewbogott: confirmed. [21:18:30] so I think we should leave opendj running until tomorrow or so [21:18:55] moritzm: sending an ‘all clear’ email, I’ll include that [21:19:05] andrewbogott: thanks [21:19:29] Coren: mind cleaning up my failed account creation tests again? Same naming pattern as before [21:19:36] Coren: You made a new user on wikitech during the testing, right? Did you try logging into gerrit with it yet? [21:19:45] ostriches: I have. Works. [21:19:50] andrewbogott: 101 up? [21:19:50] ok thx, will check off [21:19:58] Coren: yep [21:20:53] 103 104 deleted. Any others? [21:21:12] thcipriani: it seems an old wikidata branch was picked up, though nothing fatal for group0 [21:21:26] :* 'manage addresses' there isn't checked off [21:21:30] do we need to test it? [21:21:59] Coren, andrewbogott: ack on leaving the opendj instances running until tomorrow? [21:22:15] thcipriani: can you wait with starting scap for a submodule update? [21:22:22] moritzm: I can think of no harm in it, and it'll help stragglers with cached DNS, etc [21:22:34] I think terminating them now causes more trouble than waiting a bit for caches expiring or people fixing them [21:22:35] moritzm: at least until tomorrow [21:22:37] there’s no rush [21:22:51] Right, I agree fully. [21:22:53] I concur [21:22:56] probably more [21:22:59] jzerebecki: absolutely. 
In config.json it looks like wikidata is set to true for the branch name. [21:23:01] like a week or so [21:23:01] yeah, it we do it tomorrow we have a good mix of "data available" and "data not overly stale" [21:23:39] thcipriani: usually worked the last few times, perhaps I was too late in creating the new branch [21:23:45] I'll have a look at incoming queries throughout tomorrow, we can decide on these metrics [21:23:59] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10vm-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1863827 (10Nuria) Can we get an update on this? cc @joe We expect no support when it comes to uptime of piw... [21:24:00] whether we want to keep it running longer [21:24:20] jzerebecki: kk, not a big deal, poke me with the submodule update when you have it. [21:24:39] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [21:26:15] I'm off for dinner, will check back later [21:26:23] ok, email sent. moritzm, chasemp, Coren [21:26:23] oops [21:26:31] moritzm: thanks for setting this up! Have a good dinner [21:27:00] chasemp, Coren, hashar, ostriches, akosiaris, thanks for working on this [21:27:09] (03CR) 10GWicke: "Actually, encoding the list in hiera doesn't sound so bad to me. It's a config variable that differs between environments, so it seems tha" [puppet] - 10https://gerrit.wikimedia.org/r/257696 (owner: 10GWicke) [21:27:21] * andrewbogott is about to go afk but will be textable [21:28:53] andrewbogott: thanks as well [21:29:02] and everyone else obviously [21:29:08] thcipriani: https://gerrit.wikimedia.org/r/#/c/257740/ [21:29:12] and I am signing off .. 
c ya tomorrow guys [21:30:04] (03PS1) 10Alexandros Kosiaris: openldap: Notify slapd on acls.conf and indeces.conf changes [puppet] - 10https://gerrit.wikimedia.org/r/257741 [21:30:05] jzerebecki: thanks [21:30:08] ciao akosiaris [21:30:30] andrewbogott: no thanks to me, i just helped pinpoint the auth issue :P [21:30:55] thanks ori! [21:31:02] heh, np :P [21:31:58] I'm trying to access Google Webmaster Tools but the password I have is wrong. [21:32:01] Can someone help me with this? [21:32:23] andrewbogott: all kudos to you guys. I have merely monitored stuff and ended up having nothing much to do. Kudos! [21:34:11] chasemp: Can you help me, perhaps? :-) [21:34:35] (03CR) 10Mobrovac: "Good point re storage groups. If we parametrise that as well, and have only one for beta, then we're safe on that front." [puppet] - 10https://gerrit.wikimedia.org/r/257696 (owner: 10GWicke) [21:34:48] Deskana: what are we up to? [21:35:17] chasemp: I'm trying to access Google Webmaster Tools but the password I have for noc@ is wrong. [21:35:28] chasemp: I ssh'd into bast1001.wikimedia.org to check the password but it seems to be out of date. [21:35:38] (03PS1) 10GWicke: Reinstate separate labs config [puppet] - 10https://gerrit.wikimedia.org/r/257743 [21:36:58] (03CR) 10GWicke: "As a pragmatic fix for restbase in labs, I have submitted https://gerrit.wikimedia.org/r/#/c/257743/ as an alternative. It simply is a cop" [puppet] - 10https://gerrit.wikimedia.org/r/257696 (owner: 10GWicke) [21:37:42] Deskana: I'm looking at what I have here just a sec [21:38:25] chasemp: Thanks. :-) [21:39:38] Deskana: I wasn't aware bast1001 housed passwords in that way actually? [21:39:57] chasemp: Yeah, I have a file in there that I can read that has the password for noc in it. [21:40:29] so yes this is the "old" password I see [21:40:38] chasemp: Aha, how do I access the new password then? [21:41:09] I guess this is how it was shared to you from before? 
I'm looking at what's up here [21:41:21] I figured we would do a hangout and I could verify you are you but maybe this is more sane [21:41:37] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [21:41:57] Deskana: let's move to pm so we can talk logistics of secret sharing :) [21:42:05] chasemp: Sure thing [21:42:13] I must have a video of Deskana to broadcast to chasemp [21:42:16] mitm! [21:42:36] I make ppl sing random christmas songs with words replaced with "your honor" [21:42:44] ahah [21:44:19] (03CR) 10Mobrovac: "LGTM, minor comments in-lined." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/257743 (owner: 10GWicke) [21:45:04] (03CR) 10Mobrovac: ".. or that. That could even be preferred now that we have a rather-minimalistic config." [puppet] - 10https://gerrit.wikimedia.org/r/257696 (owner: 10GWicke) [21:57:05] thcipriani, are you done with the train? [21:57:31] yurik: nope, still going [21:57:46] I'm getting all the symlinks fixed up, will be syncing to testwiki shortly [21:58:10] thcipriani, will it break anything if i "git deploy" (via trebuchet) our maps service? [21:58:23] its a separate nodejs service [21:58:40] or i could wait until you are done [21:58:55] yurik: go for it, should be completely separate. 
[21:59:03] ok [22:05:07] (03PS1) 10Thcipriani: Add 1.27.0-wmf.8 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257756 [22:05:09] (03PS1) 10Thcipriani: group0 to php-1.27.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257757 [22:07:09] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [22:07:09] (03CR) 10Thcipriani: [C: 032] "Train deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257756 (owner: 10Thcipriani) [22:07:30] (03Merged) 10jenkins-bot: Add 1.27.0-wmf.8 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257756 (owner: 10Thcipriani) [22:10:43] (03Abandoned) 10Rush: labs: puppetmaster self should still apply 'role::labs::instance' [puppet] - 10https://gerrit.wikimedia.org/r/257612 (https://phabricator.wikimedia.org/T120817) (owner: 10Rush) [22:12:26] (03CR) 10GWicke: "This is a pragmatic alternative to https://gerrit.wikimedia.org/r/#/c/257696/." [puppet] - 10https://gerrit.wikimedia.org/r/257743 (owner: 10GWicke) [22:16:29] !log deployed latest kartotherian & tilerator [22:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:16:34] thcipriani, i'm done [22:17:02] yurik: cool, thanks for letting me know [22:20:00] (03CR) 10Thcipriani: [C: 032] "Train deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257434 (https://phabricator.wikimedia.org/T116676) (owner: 10Jdlrobson) [22:20:57] (03Merged) 10jenkins-bot: Enable Cards and RelatedArticles so it rides the train [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257434 (https://phabricator.wikimedia.org/T116676) (owner: 10Jdlrobson) [22:23:27] !log thcipriani@tin Started scap: testwiki to 1.27.0-wmf.8 and rebuild l10n cache [22:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:23:44] ^ train deploy finally started for testwiki (FYI) [22:24:28] !log thcipriani@tin scap failed: 
CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="cawikibooks" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.d9wYd5Axk0" ' returned non-zero exit status 1 (duration: 01m 01s) [22:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:25:39] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 3 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [22:25:42] ah, blerg. It's complaining about the Cards extension in wmf.7 [22:31:10] thcipriani: yeah. did we not get that bit of info to you? The wmf.8 branch for Cards needs to be added to the wmf.7 MW branch just to make l10nupdate happy [22:31:16] It's lame [22:31:56] ah, ok, just asking twentyafterfour in -releng about that. KK. adding the submodule for .7 [22:32:51] Sam used to have a trick for this that made it a smaller config change but that code got killed a few months ago [22:33:17] the old trick was to make a special extensions-list file just for the new branch [22:33:57] ostriches: I managed to push that patch without introducing the same horrible mistake the second time. Thanks for the help! [22:37:01] bd808: quick sanity check: https://gerrit.wikimedia.org/r/#/c/257768/ [22:37:31] lgtm [22:37:32] thcipriani: lgtm [22:37:40] nice, thanks. [22:38:00] we should solve that in a cleaner way [22:38:23] can't l10nupdate be made to skip it instead of aborting? [22:38:49] probably [22:39:19] I guess the fear there would be a bad new branch cut that missed an extension [22:39:46] it's probably the right thing that it blew up there, I think. [22:40:10] It used to be possible to make an extensions-list.1.27.0-wmf.8 list and only put the new extension in it [22:40:13] well it's shitty that the extension-list isn't in the branch with the extensions [22:40:44] what purpose does extension-list serve anyway? is it just there for l10nupdate? 
[22:41:01] or does it control which extensions get activated in production? [22:42:03] twentyafterfour: it looks like it is only for mergeMessageFileList.php [22:42:21] at least in core and multiwiki [22:43:52] so it could be moved into the branch instead of mediawiki-config? [22:44:07] yeah it looks like that would be possible [22:44:25] there is one in extensions/Wikidata that gets added in already [22:44:43] grep for wgExtensionEntryPointListFiles in mediawiki-config [22:45:45] !log thcipriani@tin Started scap: testwiki to 1.27.0-wmf.8 and rebuild l10n cache take II [22:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:46:04] ^ bd808 twentyafterfour thanks for the help! [22:46:36] bd808: yeah looks like it's just mergeMessageFileList, hmm ok [22:47:01] that script can also take a --extensions-dir argument [22:47:10] which makes the whole file not needed [22:47:32] I wonder why we don't use that? [22:49:54] It looks like you can only pass one --extensions-dir setting which wouldn't work. But that could probably be fixed up [22:50:15] (we need branch/extensions and branch/skins) [22:52:20] bd808: yeah that sounds like a better solution [22:52:57] I think I would have fixed this a long time ago if I had known it's only used for just that one thing, I thought it probably had deep rooted dependencies all over the codebase :-/ [22:53:10] that's what I get for making assumptions [22:53:46] but! it's success #2 of thcipriani deploying today [22:53:52] if only you had read all of the code in the whole system [22:54:03] (#1 was simply de-siloing) [22:54:05] go thcipriani go! [22:54:06] lol [22:54:40] Lemme tell you, after deploying for 3 hours and change, I feel like a success. [22:54:57] de-siloing ftw. [22:54:57] I bet you look it, too. [22:55:10] * ostriches gives thcipriani a participation trophy [22:55:42] :D [22:56:50] The first time I was going to run the train ostriches pretty much said "you'll figure it out". 
[22:57:08] Luckily Reedy took pitty on my and helped me make a checklist [22:57:25] bd808: http://cdn.meme.am/instances/500x/54942287.jpg [22:57:59] too perfect [22:58:16] I just screamed, "I need an adult!" in -releng until someone paid attention :) [22:58:17] ostriches: https://www.youtube.com/watch?v=NuMC73jdygI [22:58:49] Exactly! [23:03:19] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [23:11:55] bd808: `sudo -u reedy dodeploy` [23:16:33] (03PS2) 10EBernhardson: Turn off language detection user test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254071 (https://phabricator.wikimedia.org/T118292) [23:20:02] bd808: I added --extensions-dir [23:20:33] bd808: I dont think it ever have been properly used anywhere. And we eventually all forgot about it :-) [23:22:07] anyway have a good evening all ! [23:26:11] 6operations, 10hardware-requests: spare swift disks order - https://phabricator.wikimedia.org/T119698#1864094 (10RobH) [23:27:15] (03PS3) 10Ori.livneh: Reinstate separate labs config [puppet] - 10https://gerrit.wikimedia.org/r/257743 (owner: 10GWicke) [23:27:20] (03CR) 10Ori.livneh: [C: 032 V: 032] Reinstate separate labs config [puppet] - 10https://gerrit.wikimedia.org/r/257743 (owner: 10GWicke) [23:27:34] (03PS1) 10BBlack: text VCL: remove hiera mobile/text conditionals [puppet] - 10https://gerrit.wikimedia.org/r/257774 (https://phabricator.wikimedia.org/T109286) [23:29:01] 6operations, 10hardware-requests: spare swift disks order - https://phabricator.wikimedia.org/T119698#1864111 (10RobH) [23:30:27] (03PS2) 10BBlack: text VCL: remove hiera mobile/text conditionals [puppet] - 10https://gerrit.wikimedia.org/r/257774 (https://phabricator.wikimedia.org/T109286) [23:30:58] is someone scapping right now? 
[23:31:18] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: puppet fail [23:32:25] ori: yes, train is still running [23:32:30] nod [23:36:41] (03PS3) 10BBlack: text VCL: remove hiera mobile/text conditionals [puppet] - 10https://gerrit.wikimedia.org/r/257774 (https://phabricator.wikimedia.org/T109286) [23:36:52] bblack: that looks great, btw -- nice work [23:37:23] 6operations, 10ops-codfw: rack 8 new misc systems - https://phabricator.wikimedia.org/T120885#1864130 (10RobH) 3NEW a:3Papaul [23:39:26] ori: :) [23:41:58] !log thcipriani@tin Finished scap: testwiki to 1.27.0-wmf.8 and rebuild l10n cache take II (duration: 56m 13s) [23:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:42:20] thcipriani: can i sneak in a quick sync-dir? [23:42:28] ori: sure, go for it [23:42:31] thanks [23:43:20] !log ori@tin Synchronized php-1.27.0-wmf.8/extensions/MobileFrontend: I7b86a521: Ensure the parser cache varies on images disabled and 'light' images (duration: 00m 32s) [23:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:36] thcipriani: done, thanks [23:43:47] ori: yw [23:45:18] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [23:46:00] (03CR) 10Thcipriani: [C: 032] "Train deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257757 (owner: 10Thcipriani) [23:46:23] (03Merged) 10jenkins-bot: group0 to php-1.27.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257757 (owner: 10Thcipriani) [23:47:58] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: Train: group0 to 1.27.0-wmf.8 [23:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:48:36] and the deploy train is complete. 
[23:52:49] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: Puppet has 1 failures [23:53:15] thcipriani: :) thanks [23:55:12] all in a day's work :) [23:56:18] (03PS4) 10CSteipp: Set initial Staff password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222057 (https://phabricator.wikimedia.org/T104370) [23:58:29] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:59:58] (03CR) 10CSteipp: "Rebased and also used bawolff's check to make sure no one is using one of the most popular passwords. Planning to deploy this Thursday aft" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222057 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp)