[00:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170308T0000). Please do the needful.
[00:00:05] TabbyCat, James_F, and RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:34] I'm here
[00:01:32] I guess I should do the SWAT, too
[00:03:51] so, /etc/init.d/phd is just a symlink to /srv/phab/phabricator/scripts/__init_script__.php which just does a "require_once" of /srv/phab/phabricator/scripts/init/init-script.php. which just does a "require_once" of /srv/phab/phabricator/scripts/init/lib.php which does an @include_once of /srv/deployment/phabricator/deploy/libphutil/scripts/__init_script__.php which does require_once
[00:03:57] $root.'/src/__phutil_library_init__.php' ..
[00:03:59] hah
[00:05:02] (CR) Catrope: [C: 2] Disable wgCiteResponsiveReferences by default for back-compat [mediawiki-config] - https://gerrit.wikimedia.org/r/341708 (https://phabricator.wikimedia.org/T33597) (owner: Krinkle)
[00:05:37] (CR) Dzahn: [C: 2] Phabricator: Add a upstart init phd script [puppet] - https://gerrit.wikimedia.org/r/341630 (owner: Paladox)
[00:06:26] (CR) Dzahn: [C: 2] "just adding the file for now, it will not be used until the next change" [puppet] - https://gerrit.wikimedia.org/r/341630 (owner: Paladox)
[00:06:46] (Merged) jenkins-bot: Disable wgCiteResponsiveReferences by default for back-compat [mediawiki-config] - https://gerrit.wikimedia.org/r/341708 (https://phabricator.wikimedia.org/T33597) (owner: Krinkle)
[00:07:11] (CR) Dzahn: "16:06 < mutante> so, /etc/init.d/phd is just a symlink to /srv/phab/phabricator/scripts/__init_script__.php which just does a "require_on" [puppet] - https://gerrit.wikimedia.org/r/341630 (owner: Paladox)
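The include chain spelled out at 00:03:51 is easier to follow written down as data. A minimal sketch, assuming the chain is exactly as described in that message; the paths are copied from the log, while `CHAIN` and `describe` are illustrative names and nothing here touches the filesystem:

```python
# Each entry: (file, how it pulls in the NEXT entry), as stated in the
# 00:03:51 / 00:03:57 messages. The last entry has no successor.
CHAIN = [
    ("/etc/init.d/phd", "symlink"),
    ("/srv/phab/phabricator/scripts/__init_script__.php", "require_once"),
    ("/srv/phab/phabricator/scripts/init/init-script.php", "require_once"),
    ("/srv/phab/phabricator/scripts/init/lib.php", "@include_once"),
    ("/srv/deployment/phabricator/deploy/libphutil/scripts/__init_script__.php",
     "require_once"),
    ("$root.'/src/__phutil_library_init__.php'", None),
]


def describe(chain):
    """Return one 'source --mechanism--> target' string per hop."""
    hops = []
    for (path, how), (target, _) in zip(chain, chain[1:]):
        hops.append(f"{path} --{how}--> {target}")
    return hops


if __name__ == "__main__":
    for hop in describe(CHAIN):
        print(hop)
```

Printing the hops makes the point of the message visible: the init script is several layers of indirection down to the libphutil library init.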
[00:07:25] (CR) jenkins-bot: Disable wgCiteResponsiveReferences by default for back-compat [mediawiki-config] - https://gerrit.wikimedia.org/r/341708 (https://phabricator.wikimedia.org/T33597) (owner: Krinkle)
[00:07:36] (CR) Dzahn: "pre-requisite to converting this to base::service_unit, thanks" [puppet] - https://gerrit.wikimedia.org/r/341630 (owner: Paladox)
[00:08:37] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Disable wgCiteResponsiveReferences by default for back-compat (T33597) (duration: 00m 41s)
[00:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:43] T33597: Render references list in multiple columns based on the number of items - https://phabricator.wikimedia.org/T33597
[00:09:26] (CR) Dzahn: "upstart file has been added (but isn't used yet)" [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[00:09:27] !log mobrovac@tin Started deploy [trending-edits/deploy@88e2f74]: Deploy changes for T156666 T156680 T159486 T156411
[00:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:37] T159486: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486
[00:09:37] T156680: Allow API consumer to express a timeframe in hours - https://phabricator.wikimedia.org/T156680
[00:09:38] T156666: Updated and start watching timestamps are always the same - https://phabricator.wikimedia.org/T156666
[00:09:38] T156411: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411
[00:10:32] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[00:11:41] James_F: Your Cite config patch is deployed, please test
[00:11:47] * RoanKattouw forgot to do the mwdebug1001 dance, oops
[00:11:59] Oh he says it's a no-op, alright then
[00:13:00] !log Clear 2FA for "User:Steven Walling"; identity confirmed via facebook
[00:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:08] (CR) Dzahn: [C: -1] "see inline comments." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[00:14:10] (PS5) Catrope: Modify add/remove groups for I984157d5 [mediawiki-config] - https://gerrit.wikimedia.org/r/341382 (owner: MarcoAurelio)
[00:14:20] (CR) Catrope: [C: 2] Modify add/remove groups for I984157d5 [mediawiki-config] - https://gerrit.wikimedia.org/r/341382 (owner: MarcoAurelio)
[00:15:15] (CR) Dzahn: ""document the lack of init scripts", "In general, you can not control what phd start launches. If you want to launch additional daemons, u" [puppet] - https://gerrit.wikimedia.org/r/341630 (owner: Paladox)
[00:15:21] !log mobrovac@tin Started deploy [changeprop/deploy@99280e3]: Deploy for T159486
[00:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:27] T159486: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486
[00:15:29] (Merged) jenkins-bot: Modify add/remove groups for I984157d5 [mediawiki-config] - https://gerrit.wikimedia.org/r/341382 (owner: MarcoAurelio)
[00:15:32] !log mobrovac@tin Started deploy [cxserver/deploy@7e22281]: Deploy for T159486
[00:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:25] !log mobrovac@tin Finished deploy [trending-edits/deploy@88e2f74]: Deploy changes for T156666 T156680 T159486 T156411 (duration: 06m 58s)
[00:16:30] !log mobrovac@tin Finished deploy [changeprop/deploy@99280e3]: Deploy for T159486 (duration: 01m 09s)
[00:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:32] T156680: Allow API consumer to express a timeframe in hours - https://phabricator.wikimedia.org/T156680
[00:16:32] T156666: Updated and start watching timestamps are always the same - https://phabricator.wikimedia.org/T156666
[00:16:33] T156411: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411
[00:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:04] !log mobrovac@tin Started deploy [electron-render/deploy@51cff8a]: Deploy for T159486
[00:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:23] (CR) jenkins-bot: Modify add/remove groups for I984157d5 [mediawiki-config] - https://gerrit.wikimedia.org/r/341382 (owner: MarcoAurelio)
[00:17:38] !log mobrovac@tin Started deploy [graphoid/deploy@485ca11]: Deploy for T159486
[00:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:56] !log mobrovac@tin Finished deploy [cxserver/deploy@7e22281]: Deploy for T159486 (duration: 02m 24s)
[00:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:05] !log mobrovac@tin Started deploy [mathoid/deploy@83f80ee]: Deploy for T159486
[00:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:40] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Modify add/remove groups for flood group on wikitech (duration: 00m 42s)
[00:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:34] !log mobrovac@tin Finished deploy [electron-render/deploy@51cff8a]: Deploy for T159486 (duration: 03m 29s)
[00:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:41] T159486: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486
[00:21:21] (PS2) Dzahn: add 2030.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/341362 (https://phabricator.wikimedia.org/T158981)
[00:21:52] RECOVERY - puppet last run on aqs1004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[00:22:23] !log mobrovac@tin Finished deploy [graphoid/deploy@485ca11]: Deploy for T159486 (duration: 04m 45s)
[00:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:37] Operations, ArchCom-RfC, Performance-Team, Services, and 4 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3082527 (Krinkle) a:Krinkle>None
[00:22:40] (PS3) Dzahn: add 2030.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/341362 (https://phabricator.wikimedia.org/T158981)
[00:22:58] !log mobrovac@tin Finished deploy [mathoid/deploy@83f80ee]: Deploy for T159486 (duration: 04m 53s)
[00:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:03] !log mobrovac@tin Started deploy [electron-render/deploy@5ec5614]: Deploy for T159486
[00:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:23] !log mobrovac@tin Started deploy [mobileapps/deploy@d6202e4]: Deploy for T159486
[00:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:32] (CR) Dzahn: [C: 2] "this will redirect to https://meta.wikimedia.org/wiki/Strategy/Wikimedia_movement/2017 - Apache config has been added first" [dns] - https://gerrit.wikimedia.org/r/341362 (https://phabricator.wikimedia.org/T158981) (owner: Dzahn)
[00:26:02] PROBLEM - pdfrender on scb2004 is CRITICAL: connect to address 10.192.16.36 and port 5252: Connection refused
[00:26:44] !log catrope@tin Synchronized php-1.29.0-wmf.15/extensions/Echo/modules/ui/: Fix regression in Echo popup (duration: 00m 42s)
[00:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:15] !log mobrovac@tin Finished deploy [mobileapps/deploy@d6202e4]: Deploy for T159486 (duration: 03m 52s)
[00:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:20] T159486: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486
[00:27:50] !log mobrovac@tin Finished deploy [electron-render/deploy@5ec5614]: Deploy for T159486 (duration: 04m 46s)
[00:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:36] !log mobrovac@tin Started deploy [electron-render/deploy@5ec5614]: (no justification provided)
[00:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:00] RECOVERY - pdfrender on scb2004 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.075 second response time
[00:30:56] (CR) Dzahn: "we are far from using polygerrit... but if, and only if, upstream really does not just fix this and breaks URLs, which i hope they won't, " [puppet] - https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: Paladox)
[00:31:04] (PS23) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[00:31:35] (CR) Paladox: "> we are far from using polygerrit... but if, and only if, upstream" [puppet] - https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: Paladox)
[00:32:41] (CR) Dzahn: [C: -1] "aha, so it doesn't break when removing the "Before" line? that would make it nice and clean, but please test" [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[00:33:00] (CR) Paladox: Phabricator: Migrate to base::service_unit for phd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[00:33:10] PROBLEM - pdfrender on scb1003 is CRITICAL: connect to address 10.64.32.153 and port 5252: Connection refused
[00:33:19] (CR) Paladox: "> aha, so it doesn't break when removing the "Before" line? that" [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[00:33:45] !log mobrovac@tin Finished deploy [electron-render/deploy@5ec5614]: (no justification provided) (duration: 04m 08s)
[00:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:04] (CR) Dzahn: [C: 1] "ok, cool" [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[00:34:07] !log mobrovac@tin Started deploy [electron-render/deploy@5ec5614]: (no justification provided)
[00:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:07] !log mobrovac@tin Finished deploy [electron-render/deploy@5ec5614]: (no justification provided) (duration: 00m 59s)
[00:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:50] RECOVERY - tileratorui on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.077 second response time
[00:38:24] (CR) Paladox: "Will need to manually remove /etc/init.d/phd as we use upstart on iridium now (this is server wise as puppet doesn't remove files)" [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[00:39:28] (CR) Paladox: "thanks" [puppet] - https://gerrit.wikimedia.org/r/341630 (owner: Paladox)
[00:54:00] Operations, Wikimedia-Apache-configuration, Patch-For-Review: Create 2030.wikimedia.org redirect to Meta portal - https://phabricator.wikimedia.org/T158981#3082673 (Dzahn) @gpaumier see https://2030.wikimedia.org now :)
[01:01:00] RECOVERY - tilerator on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.077 second response time
[01:09:55] Operations, Wikimedia-Apache-configuration, Patch-For-Review: Create 2030.wikimedia.org redirect to Meta portal - https://phabricator.wikimedia.org/T158981#3082759 (Dzahn) Open>Resolved
[01:10:03] Operations, Wikimedia-Apache-configuration: Create 2030.wikimedia.org redirect to Meta portal - https://phabricator.wikimedia.org/T158981#3053534 (Dzahn)
[01:10:33] (PS1) Krinkle: webperf: Use trebuchet install of eventlogging for ve.py [puppet] - https://gerrit.wikimedia.org/r/341723 (https://phabricator.wikimedia.org/T131977)
[01:10:35] (PS1) Krinkle: webperf: Update navtiming.py to use eventlogging instead of zmq [puppet] - https://gerrit.wikimedia.org/r/341724
[01:10:54] (CR) Krinkle: [C: -1] "Untested" [puppet] - https://gerrit.wikimedia.org/r/341723 (https://phabricator.wikimedia.org/T131977) (owner: Krinkle)
[01:11:02] (CR) Krinkle: [C: -1] "Untested" [puppet] - https://gerrit.wikimedia.org/r/341724 (owner: Krinkle)
[01:12:13] (CR) jerkins-bot: [V: -1] webperf: Use trebuchet install of eventlogging for ve.py [puppet] - https://gerrit.wikimedia.org/r/341723 (https://phabricator.wikimedia.org/T131977) (owner: Krinkle)
[01:15:06] (PS2) Krinkle: webperf: Use trebuchet install of eventlogging for ve.py [puppet] - https://gerrit.wikimedia.org/r/341723 (https://phabricator.wikimedia.org/T131977)
[01:15:08] (PS2) Krinkle: webperf: Update navtiming.py to use eventlogging instead of zmq [puppet] - https://gerrit.wikimedia.org/r/341724
[01:18:14] (CR) Chad: [C: 1] "Thoughts on this?" [puppet] - https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: Alex Monk)
[01:19:54] (PS3) Chad: Read closed-labs as closed tag on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/332940 (https://phabricator.wikimedia.org/T115584) (owner: Alex Monk)
[01:20:10] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:20:28] Operations, Discovery, Traffic, Wikidata, and 2 others: LDF endpoint ordering is not stable between servers when paging - https://phabricator.wikimedia.org/T159574#3082822 (Smalyshev)
[01:27:38] (CR) Chad: [C: 2] Scap clean: abort if a branch is still in use [mediawiki-config] - https://gerrit.wikimedia.org/r/340250 (owner: Chad)
[01:29:59] (Merged) jenkins-bot: Scap clean: abort if a branch is still in use [mediawiki-config] - https://gerrit.wikimedia.org/r/340250 (owner: Chad)
[01:32:40] (CR) Chad: "*cough* We should move this to scap, T118772" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/341723 (https://phabricator.wikimedia.org/T131977) (owner: Krinkle)
[01:33:43] !log demon@tin Synchronized scap/plugins/clean.py: no-op (duration: 00m 41s)
[01:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:21] (CR) Krinkle: "RE: Scap. Yes, I'll leave that to Analytics. First step is to at least puppetize the install that is used, instead of the currently undocu" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/341723 (https://phabricator.wikimedia.org/T131977) (owner: Krinkle)
[01:35:37] (PS3) Krinkle: webperf: Use trebuchet install of eventlogging for ve.py [puppet] - https://gerrit.wikimedia.org/r/341723 (https://phabricator.wikimedia.org/T131977)
[01:35:47] (PS3) Krinkle: webperf: Update navtiming.py to use eventlogging instead of zmq [puppet] - https://gerrit.wikimedia.org/r/341724
[01:37:25] (CR) jenkins-bot: Scap clean: abort if a branch is still in use [mediawiki-config] - https://gerrit.wikimedia.org/r/340250 (owner: Chad)
[01:37:44] (CR) jerkins-bot: [V: -1] Read closed-labs as closed tag on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/332940 (https://phabricator.wikimedia.org/T115584) (owner: Alex Monk)
[01:46:38] (PS1) Chad: Move all ssl certs to the module and out of files/ [puppet] - https://gerrit.wikimedia.org/r/341729
[01:47:10] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[02:07:40] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:17:50] (CR) Krinkle: Add other WMF domains to foundationwiki CSP policy for Special:HideBanners (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/341331 (owner: Brian Wolff)
[02:18:51] (CR) Krinkle: Add other WMF domains to foundationwiki CSP policy for Special:HideBanners (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/341331 (owner: Brian Wolff)
[02:29:16] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.14) (duration: 07m 53s)
[02:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:40] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[02:46:04] !log disabling puppet on production authdns caches (testing dns lint related bits)
[02:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:27] (PS7) BBlack: authdns lint support for full puppetized config [puppet] - https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100)
[02:47:59] (CR) BBlack: [C: 2] authdns lint support for full puppetized config [puppet] - https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100) (owner: BBlack)
[02:51:48] (PS1) Krinkle: noc: Fix url to conftool (currently 404 Not Found) [mediawiki-config] - https://gerrit.wikimedia.org/r/341734
[02:55:10] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:57:41] (CR) Dzahn: [C: -1] "needs additional CNAME to point to google to verify ownership so they can enable Google apps. but:" [dns] - https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638) (owner: Dzahn)
[03:01:49] (PS2) BBlack: authdns: add 10/8 to geo map [puppet] - https://gerrit.wikimedia.org/r/341616
[03:01:51] (PS1) BBlack: authdns: move hiera() down into module [puppet] - https://gerrit.wikimedia.org/r/341735
[03:03:07] (CR) jerkins-bot: [V: -1] authdns: move hiera() down into module [puppet] - https://gerrit.wikimedia.org/r/341735 (owner: BBlack)
[03:03:33] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.15) (duration: 15m 08s)
[03:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:05:00] (PS2) BBlack: authdns: move hiera() down into module [puppet] - https://gerrit.wikimedia.org/r/341735
[03:05:02] (PS3) BBlack: authdns: add 10/8 to geo map [puppet] - https://gerrit.wikimedia.org/r/341616
[03:06:34] (CR) BBlack: [C: 2] authdns: move hiera() down into module [puppet] - https://gerrit.wikimedia.org/r/341735 (owner: BBlack)
[03:07:02] (CR) Dzahn: [C: -1] "well.. we can either use CNAME or a TXT record (or upload a file). i see we have "CNAME google.com." for wikimedia.org and we also hav" [dns] - https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638) (owner: Dzahn)
[03:09:22] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Mar 8 03:09:21 UTC 2017 (duration 5m 49s)
[03:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:11:01] (PS4) Dzahn: change MX records for wikimedia.ee from elkdata.ee to Google [dns] - https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638)
[03:13:44] (CR) Dzahn: [C: 1] "i think this should work now, can has review though?, see comment above" [dns] - https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638) (owner: Dzahn)
[03:14:25] (PS1) BBlack: Revert "authdns: move hiera() down into module" [puppet] - https://gerrit.wikimedia.org/r/341738
[03:14:44] (CR) BBlack: [V: 2 C: 2] Revert "authdns: move hiera() down into module" [puppet] - https://gerrit.wikimedia.org/r/341738 (owner: BBlack)
[03:15:03] (PS1) BBlack: Revert "authdns lint support for full puppetized config" [puppet] - https://gerrit.wikimedia.org/r/341739
[03:15:07] (PS2) BBlack: Revert "authdns lint support for full puppetized config" [puppet] - https://gerrit.wikimedia.org/r/341739
[03:15:13] (CR) BBlack: [V: 2 C: 2] Revert "authdns lint support for full puppetized config" [puppet] - https://gerrit.wikimedia.org/r/341739 (owner: BBlack)
[03:16:21] Operations, Domains, Traffic, Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3083052 (Dzahn) @Beetlebeard going with TXT record: https://gerrit.wikimedia.org/r/#/c/341359/4/templates/wikimedia.ee also comments at the bottom of h...
[03:17:11] !log authdns back to normal (puppet enabled, do normal things!)
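The `!log` entries up to this point report run times in the form `(duration: 15m 08s)`, and one entry omits the colon (`(duration 5m 49s)`). A minimal sketch, assuming only those two spellings occur, for turning such an entry into seconds; `DURATION_RE` and `duration_seconds` are illustrative names, not part of any Wikimedia tool:

```python
import re

# Matches "(duration: 00m 41s)" and "(duration 5m 49s)"; the colon is optional.
DURATION_RE = re.compile(r"\(duration:?\s+(\d+)m\s+(\d+)s\)")


def duration_seconds(log_line):
    """Return the duration in seconds, or None if the line carries none."""
    match = DURATION_RE.search(log_line)
    if match is None:
        return None
    minutes, seconds = (int(group) for group in match.groups())
    return minutes * 60 + seconds


if __name__ == "__main__":
    line = "!log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.15) (duration: 15m 08s)"
    print(duration_seconds(line))
```

Lines such as "Started deploy" carry no duration, so the function returns `None` for them instead of raising.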
[03:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:20:54] (PS1) Krinkle: noc: Remove old IE-fixes.css for IE6/IE7 [mediawiki-config] - https://gerrit.wikimedia.org/r/341740
[03:24:10] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[03:25:01] ger
[03:28:30] (CR) Krinkle: [C: 1] Gerrit: Remove reviewer counts cron, nobody is using it [puppet] - https://gerrit.wikimedia.org/r/341593 (owner: Chad)
[03:29:59] (CR) Krinkle: [C: 1] "Only one files/ directory left!" [puppet] - https://gerrit.wikimedia.org/r/341729 (owner: Chad)
[03:34:46] (PS1) Dzahn: phabricator: monitor PHD service only on active server [puppet] - https://gerrit.wikimedia.org/r/341747
[03:35:35] (CR) Krinkle: [C: 2] noc: Fix url to conftool (currently 404 Not Found) [mediawiki-config] - https://gerrit.wikimedia.org/r/341734 (owner: Krinkle)
[03:35:55] (PS2) Dzahn: phabricator: monitor PHD service only on active server [puppet] - https://gerrit.wikimedia.org/r/341747
[03:36:11] (PS5) Krinkle: Add db-codfw.php to noc.wikimedia.org visible config [mediawiki-config] - https://gerrit.wikimedia.org/r/340487 (owner: Jcrespo)
[03:36:18] (CR) Krinkle: [C: 2] Add db-codfw.php to noc.wikimedia.org visible config [mediawiki-config] - https://gerrit.wikimedia.org/r/340487 (owner: Jcrespo)
[03:36:29] (CR) Krinkle: [C: 2] noc: Remove old IE-fixes.css for IE6/IE7 [mediawiki-config] - https://gerrit.wikimedia.org/r/341740 (owner: Krinkle)
[03:37:21] RECOVERY - Check size of conntrack table on baham is OK: OK: nf_conntrack is 0 % full
[03:37:27] (Merged) jenkins-bot: noc: Fix url to conftool (currently 404 Not Found) [mediawiki-config] - https://gerrit.wikimedia.org/r/341734 (owner: Krinkle)
[03:37:36] (CR) jenkins-bot: noc: Fix url to conftool (currently 404 Not Found) [mediawiki-config] - https://gerrit.wikimedia.org/r/341734 (owner: Krinkle)
[03:37:40] RECOVERY - Check systemd state on baham is OK: OK - running: The system is fully operational
[03:38:20] RECOVERY - Check whether ferm is active by checking the default input chain on baham is OK: OK ferm input default policy is set
[03:39:02] (Merged) jenkins-bot: Add db-codfw.php to noc.wikimedia.org visible config [mediawiki-config] - https://gerrit.wikimedia.org/r/340487 (owner: Jcrespo)
[03:39:32] (Merged) jenkins-bot: noc: Remove old IE-fixes.css for IE6/IE7 [mediawiki-config] - https://gerrit.wikimedia.org/r/341740 (owner: Krinkle)
[03:39:39] (CR) jenkins-bot: Add db-codfw.php to noc.wikimedia.org visible config [mediawiki-config] - https://gerrit.wikimedia.org/r/340487 (owner: Jcrespo)
[03:40:29] !log krinkle@tin Synchronized docroot/noc/: Fix conftool link (I2f34be0a5), Remove IE6 css (Iae8a356e2), add db-codfw.php (I9f02dee3c) (duration: 00m 42s)
[03:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:40:37] (CR) Dzahn: "example: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=PHD+ get rid of ACKed but red stuff" [puppet] - https://gerrit.wikimedia.org/r/341747 (owner: Dzahn)
[03:41:40] (CR) jenkins-bot: noc: Remove old IE-fixes.css for IE6/IE7 [mediawiki-config] - https://gerrit.wikimedia.org/r/341740 (owner: Krinkle)
[03:43:02] (CR) Krinkle: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/332940 (https://phabricator.wikimedia.org/T115584) (owner: Alex Monk)
[03:45:14] (CR) jerkins-bot: [V: -1] Read closed-labs as closed tag on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/332940 (https://phabricator.wikimedia.org/T115584) (owner: Alex Monk)
[03:46:08] (PS2) Krinkle: Remove empty conditionals for wikis from flaggedrevs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/338129 (owner: Reedy)
[03:46:51] (CR) Krinkle: [C: 1] "Verified that there are no other conditions these would coincide with (sometimes an empty conditional can be used to avoid further matches" [mediawiki-config] - https://gerrit.wikimedia.org/r/338129 (owner: Reedy)
[03:47:00] (PS2) Krinkle: Add a few newlines to standardise spacing [mediawiki-config] - https://gerrit.wikimedia.org/r/338130 (owner: Reedy)
[03:47:02] (CR) Krinkle: [C: 2] Add a few newlines to standardise spacing [mediawiki-config] - https://gerrit.wikimedia.org/r/338130 (owner: Reedy)
[03:47:09] (CR) Krinkle: [C: 2] Add a few newlines to standardise spacing [mediawiki-config] - https://gerrit.wikimedia.org/r/338130 (owner: Reedy)
[03:51:37] (CR) Dzahn: "used by korma? but not anymore since korma is currently 502 Bad Gateway ? http://korma.wmflabs.org/" [puppet] - https://gerrit.wikimedia.org/r/341593 (owner: Chad)
[03:52:57] (CR) Krinkle: [C: -1] Read closed-labs as closed tag on labs (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/332940 (https://phabricator.wikimedia.org/T115584) (owner: Alex Monk)
[03:53:21] (CR) Krinkle: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/328182 (owner: Hashar)
[03:53:37] (CR) Dzahn: "https://phabricator.wikimedia.org/tag/analytics-tech-community-metrics/" [puppet] - https://gerrit.wikimedia.org/r/341593 (owner: Chad)
[03:54:32] (CR) jerkins-bot: [V: -1] rpc: raise exception instead of die [mediawiki-config] - https://gerrit.wikimedia.org/r/328182 (owner: Hashar)
[03:57:43] (CR) Dzahn: [C: 1] "well.. i guess korma has been closed per https://phabricator.wikimedia.org/T156253#2986278 which is a little sad because outsourcing but t" [puppet] - https://gerrit.wikimedia.org/r/341593 (owner: Chad)
[04:06:21] (CR) Dzahn: "+0.5" [puppet] - https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: Paladox)
[04:09:10] (CR) Dzahn: "given the ticket history, would like to hear hashar on this" [puppet] - https://gerrit.wikimedia.org/r/340496 (https://phabricator.wikimedia.org/T157785) (owner: Paladox)
[04:10:40] (CR) Dzahn: "@Muehlenhoff thoughts on this?" [puppet] - https://gerrit.wikimedia.org/r/324841 (owner: 20after4)
[04:15:48] (PS2) Dzahn: Enable base::firewall in role::test::system by default [puppet] - https://gerrit.wikimedia.org/r/341550 (owner: Muehlenhoff)
[04:17:25] (CR) Dzahn: [C: 2] "yep, these are all test::sytems and they all have base::firewall" [puppet] - https://gerrit.wikimedia.org/r/341550 (owner: Muehlenhoff)
[04:20:44] (CR) Dzahn: [C: 1] Remove Aaron from deployment group [puppet] - https://gerrit.wikimedia.org/r/340101 (owner: Muehlenhoff)
[04:23:15] (CR) Dzahn: [C: 1] Rename ferm service in role::labs::db::replica [puppet] - https://gerrit.wikimedia.org/r/328683 (owner: Muehlenhoff)
[04:26:20] PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:27:29] (CR) Dzahn: [C: 1] role::analytics_cluster::hadoop::standby: Enable base::firewall in the role [puppet] - https://gerrit.wikimedia.org/r/341292 (owner: Muehlenhoff)
[04:28:18] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/5684/" [puppet] - https://gerrit.wikimedia.org/r/341197 (https://phabricator.wikimedia.org/T159610) (owner: Florianschmidtwelzow)
[04:31:51] (CR) Dzahn: "there is a hardcoded "eqiad" in the wgetrc. can we set it so that eqiad is used on terbium but codfw is used on wasat, terbium's equivalen" [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[04:37:40] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:54:20] RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[05:06:40] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[05:53:50] PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:04:40] PROBLEM - puppet last run on d-i-test is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:10:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 74% free (5636 MB out of 7627 MB)
[06:15:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 73% free (5564 MB out of 7627 MB)
[06:20:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 72% free (5452 MB out of 7627 MB)
[06:20:50] RECOVERY - puppet last run on elastic1051 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:25:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 73% free (5540 MB out of 7627 MB)
[06:30:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 74% free (5585 MB out of 7627 MB)
[06:32:40] RECOVERY - puppet last run on d-i-test is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:35:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 74% free (5606 MB out of 7627 MB)
[06:36:20] PROBLEM - puppet last run on logstash1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:40:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 74% free (5625 MB out of 7627 MB)
[06:44:40] PROBLEM - puppet last run on dbproxy1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:44:46] (PS2) Ema: cache_maps: do not set cookies [puppet] - https://gerrit.wikimedia.org/r/341575
[06:45:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5645 MB out of 7627 MB)
[06:45:13] (CR) Ema: [V: 2 C: 2] cache_maps: do not set cookies [puppet] - https://gerrit.wikimedia.org/r/341575 (owner: Ema)
[06:50:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5650 MB out of 7627 MB)
[06:53:23] Operations, Traffic: Upgrade text and upload cache clusters to varnish 4.1.5 - https://phabricator.wikimedia.org/T159424#3083179 (ema) Open>Resolved a:ema Done, all cache hosts are now running 4.1.5.
[06:55:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5656 MB out of 7627 MB)
[06:55:15] Operations, ops-codfw, Traffic: baham (ns1) CPU-related issues - https://phabricator.wikimedia.org/T159870#3083182 (ema) p:Triage>High
[07:00:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5692 MB out of 7627 MB)
[07:03:22] Operations, ops-codfw, DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3083190 (Marostegui) Open>Resolved All good now, thanks ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive 1I:1:2...
[07:04:20] RECOVERY - puppet last run on logstash1004 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[07:04:42] Operations, ops-codfw, DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666#3083192 (Marostegui) Open>Resolved All good, thanks! ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive 1I:1:2 (p...
[07:05:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5694 MB out of 7627 MB)
[07:06:01] (CR) Papaul: [C: 2] phabricator: monitor PHD service only on active server [puppet] - https://gerrit.wikimedia.org/r/341747 (owner: Dzahn)
[07:07:50] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:08:22] (PS1) Marostegui: db-eqiad.php: Restore db1060 normal weight [mediawiki-config] - https://gerrit.wikimedia.org/r/341757 (https://phabricator.wikimedia.org/T158193)
[07:10:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5709 MB out of 7627 MB)
[07:10:39] (CR) Marostegui: [C: 2] db-eqiad.php: Restore db1060 normal weight [mediawiki-config] - https://gerrit.wikimedia.org/r/341757 (https://phabricator.wikimedia.org/T158193) (owner: Marostegui)
[07:11:50] RECOVERY - puppet last run on dbproxy1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[07:12:05] (Merged) jenkins-bot: db-eqiad.php: Restore db1060 normal weight [mediawiki-config] - https://gerrit.wikimedia.org/r/341757 (https://phabricator.wikimedia.org/T158193) (owner: Marostegui)
[07:12:14] (CR) jenkins-bot: db-eqiad.php: Restore db1060 normal weight [mediawiki-config] - https://gerrit.wikimedia.org/r/341757 (https://phabricator.wikimedia.org/T158193) (owner: Marostegui)
[07:13:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1060 original weight - T158193 (duration: 00m 47s)
[07:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:38] T158193: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193
[07:13:40] Operations, Traffic: Upgrade Pybal to 1.08
- https://phabricator.wikimedia.org/T110954#3083213 (10ema) 05Open>03Resolved a:03ema We're currently running 1.13.5, closing. [07:14:37] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3083216 (10Marostegui) 05Open>03Resolved Server's original weight has been restored. I will close this ticket, even though there... [07:15:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5712 MB out of 7627 MB) [07:19:00] !log Deploy alter table s6 revision table on db1061 - T159414 [07:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:06] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [07:20:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5715 MB out of 7627 MB) [07:25:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5719 MB out of 7627 MB) [07:27:11] !log Start pt-table-checksum on plwiki (s2) - T154485 [07:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:17] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485 [07:30:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5721 MB out of 7627 MB) [07:35:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5723 MB out of 7627 MB) [07:36:50] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:40:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5727 MB out of 7627 MB) [07:41:28] 06Operations, 10DBA, 13Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3083229 (10Marostegui) I saw no entries 
on tendril so I assumed the backups that started to run yesterday finished, so I wanted to check if they were ok on bacula, and I saw this: ``... [07:45:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5727 MB out of 7627 MB) [07:50:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5727 MB out of 7627 MB) [07:55:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5727 MB out of 7627 MB) [08:00:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5727 MB out of 7627 MB) [08:05:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5728 MB out of 7627 MB) [08:10:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5730 MB out of 7627 MB) [08:15:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5733 MB out of 7627 MB) [08:16:50] PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:20:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5735 MB out of 7627 MB) [08:21:08] (03PS1) 10Marostegui: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341759 (https://phabricator.wikimedia.org/T153743) [08:22:39] (03PS2) 10Marostegui: site.pp: Enable ROW binlog for db1070 [puppet] - 10https://gerrit.wikimedia.org/r/341007 (https://phabricator.wikimedia.org/T153743) [08:25:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5739 MB out of 7627 MB) [08:26:04] (03CR) 10Marostegui: [C: 032] site.pp: Enable ROW binlog for db1070 [puppet] - 10https://gerrit.wikimedia.org/r/341007 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [08:29:54] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341759 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [08:30:10] PROBLEM 
- check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5740 MB out of 7627 MB) [08:31:15] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341759 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [08:31:23] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341759 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [08:32:25] 06Operations, 10Traffic, 10netops: Set up LVS for current AuthDNS - https://phabricator.wikimedia.org/T101525#3083266 (10ema) p:05Normal>03High [08:32:26] (03PS1) 10Giuseppe Lavagetto: Handle SIGTERM, SIGINT in the threads [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/341760 [08:32:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1070 - T153743 (duration: 00m 41s) [08:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:51] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [08:35:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5751 MB out of 7627 MB) [08:36:15] !log Restart mysql on db1070 to change binlog to ROW - T153743 [08:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:55] !log upgrading apache on mw1161-mw1208 [08:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5757 MB out of 7627 MB) [08:40:16] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341761 [08:44:50] RECOVERY - puppet last run on ganeti1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [08:45:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5757 MB out of 7627 MB) [08:48:01] 
(03PS1) 10ArielGlenn: give dumps monitor script explicit arg for working dir [puppet] - 10https://gerrit.wikimedia.org/r/341762 [08:50:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5757 MB out of 7627 MB) [08:54:03] (03CR) 10ArielGlenn: [C: 032] give dumps monitor script explicit arg for working dir [puppet] - 10https://gerrit.wikimedia.org/r/341762 (owner: 10ArielGlenn) [08:55:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5767 MB out of 7627 MB) [08:58:45] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341761 (owner: 10Marostegui) [08:59:58] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341761 (owner: 10Marostegui) [09:00:07] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1070" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341761 (owner: 10Marostegui) [09:00:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5790 MB out of 7627 MB) [09:00:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1070 - T153743 (duration: 00m 41s) [09:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:05] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [09:05:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5795 MB out of 7627 MB) [09:05:53] (03PS1) 10ArielGlenn: run monitor.py relative to its working dir [dumps] - 10https://gerrit.wikimedia.org/r/341763 [09:06:30] (03CR) 10ArielGlenn: [C: 032] run monitor.py relative to its working dir [dumps] - 10https://gerrit.wikimedia.org/r/341763 (owner: 10ArielGlenn) [09:07:21] !log ariel@tin Started deploy [dumps/dumps@e30fbd0]: run monitor.py relative to cwd, to pick up default config files [09:07:23] !log ariel@tin Finished deploy 
[dumps/dumps@e30fbd0]: run monitor.py relative to cwd, to pick up default config files (duration: 00m 02s) [09:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:41] and that should have been wd [09:08:04] oh well. time to get caffeine in, 1 hour passed since thyroid meds [09:09:44] (03CR) 10Muehlenhoff: [C: 04-1] role::analytics_cluster::hadoop::standby: Enable base::firewall in the role [puppet] - 10https://gerrit.wikimedia.org/r/341292 (owner: 10Muehlenhoff) [09:10:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5796 MB out of 7627 MB) [09:15:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5797 MB out of 7627 MB) [09:20:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5798 MB out of 7627 MB) [09:22:11] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 15User-Elukey: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3083379 (10elukey) [09:22:29] (03PS1) 10Elukey: Allow analytics1041 to be reimaged with Debian Jessie [puppet] - 10https://gerrit.wikimedia.org/r/341765 (https://phabricator.wikimedia.org/T159530) [09:25:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5801 MB out of 7627 MB) [09:25:54] (03CR) 10Jcrespo: "Thanks for deploying: https://noc.wikimedia.org/conf/highlight.php?file=db-codfw.php :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340487 (owner: 10Jcrespo) [09:28:23] (03PS1) 10Jcrespo: Followup commit to I9f02dee3cea543234 (style fix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341769 [09:30:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5802 MB out of 7627 MB) [09:30:30] (03PS3) 10Jcrespo: Rename ferm service in role::labs::db::replica [puppet] - 10https://gerrit.wikimedia.org/r/328683 (owner: 10Muehlenhoff) [09:30:51] 
(03CR) 10Hashar: "That related to T148478 (random slowdown)" [puppet] - 10https://gerrit.wikimedia.org/r/341701 (owner: 10Chad) [09:31:38] (03PS3) 10Gehel: osm - waterline import script fix and adding logging [puppet] - 10https://gerrit.wikimedia.org/r/341566 (https://phabricator.wikimedia.org/T159631) [09:33:54] 06Operations, 10Electron-PDFs, 06Services: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3083419 (10akosiaris) [09:34:11] ACKNOWLEDGEMENT - pdfrender on scb1003 is CRITICAL: connect to address 10.64.32.153 and port 5252: Connection refused alexandros kosiaris T159922 [09:35:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5806 MB out of 7627 MB) [09:35:52] (03CR) 10Gehel: [C: 032] osm - waterline import script fix and adding logging [puppet] - 10https://gerrit.wikimedia.org/r/341566 (https://phabricator.wikimedia.org/T159631) (owner: 10Gehel) [09:36:15] 06Operations, 13Patch-For-Review, 06Services (doing), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3083433 (10akosiaris) I see all the deploys have happened, @mobrovac should I re-resolve this ? [09:37:40] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3083451 (10hashar) https://gerrit.wikimedia.org/r/#/c/341701/ lowered the heap from 28GB to 20GB Graph of memory usage can be seen above T148478#3024661 For the [[ https:/... [09:38:30] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:39:18] (03PS1) 10Gehel: maps - fix logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/341770 (https://phabricator.wikimedia.org/T159631) [09:39:32] !log Stop replication on db2033 - T159707 [09:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:38] T159707: Import x1 on dbstore2001 - https://phabricator.wikimedia.org/T159707 [09:40:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5816 MB out of 7627 MB) [09:42:20] (03CR) 10Gehel: [C: 032] maps - fix logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/341770 (https://phabricator.wikimedia.org/T159631) (owner: 10Gehel) [09:44:30] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:45:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5830 MB out of 7627 MB) [09:46:00] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:50:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5838 MB out of 7627 MB) [09:51:30] !log re-enabled waterline import on maps[12]001 - T159631 [09:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:36] T159631: Tasmania is covered with water at z10+ - https://phabricator.wikimedia.org/T159631 [09:53:38] (03CR) 10Elukey: [C: 032] Allow analytics1041 to be reimaged with Debian Jessie [puppet] - 10https://gerrit.wikimedia.org/r/341765 (https://phabricator.wikimedia.org/T159530) (owner: 10Elukey) [09:53:43] (03PS2) 10Elukey: Allow analytics1041 to be reimaged with Debian Jessie [puppet] - 10https://gerrit.wikimedia.org/r/341765 (https://phabricator.wikimedia.org/T159530) [09:53:49] (03CR) 10Elukey: [V: 032 C: 032] Allow analytics1041 to be reimaged with Debian Jessie [puppet] - 10https://gerrit.wikimedia.org/r/341765 (https://phabricator.wikimedia.org/T159530) (owner: 10Elukey) [09:55:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5841 MB out of 7627 MB) [09:56:00] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:58:50] PROBLEM - Hadoop NodeManager on analytics1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [10:00:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5842 MB out of 7627 MB) [10:00:53] 1041 is me, forgot to silence [10:01:08] I am prepping for reimage [10:05:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5842 MB out of 7627 MB) [10:10:02] gehel: moritzm: for deployment-prep using apt experimental packages [10:10:09] the trick is to use puppet.git /utils/hiera_lookup [10:10:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5843 MB out of 7627 MB) [10:10:12] eg: 
/var/lib/git/operations/puppet/utils/hiera_lookup -v --fqdn=deployment-salt02.deployment-prep.eqiad.wmflabs apt::use_experimental [10:10:24] (did that on deployment-puppetmaster01 which has the clone of puppet.git) [10:11:00] that shows all the http requests it does to the labs puppetmaster (for hiera settings in Horizon) and to wikitech for the various Hiera: subpages [10:11:43] eventually https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep has "apt::use_experimental": true . That page is project wide [10:12:04] done by Giuseppe on Feb 23rd https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=1572238&oldid=1368443 [10:12:08] hashar: Kool ! That helps a lot! [10:12:13] so beta cluster Jessie instances use experimental packages a [10:12:25] and iirc we have unattended upgrade enabled project wide [10:12:32] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3083512 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['ana... [10:12:42] !log reimage analytics1041 to Debian Jessie [10:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:46] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3083513 (10fgiunchedi) Note mwlog[12]001 need to be whitelisted to be reachable from analytics vlan, `rsync-http-https` term [10:13:20] hashar: do you have a similar trick to find where the apt class comes from? I can't understand where it is included... 
[10:13:25] moritzm: so essentially apt::use_experimental has been enabled project wide via wikitech page [[Hiera:deployment-prep]] [10:13:32] that is rather a mess [10:13:49] usually classes are applied in the Horizon interface [10:13:54] which can be done per instance [10:14:09] and also project wide (e.g. have all instances of the project use the foo::bar class) [10:14:25] Operations, Gerrit, Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3083515 (Gehel) The JVM makes no promises about releasing the memory to the OS. So it is possible that the memory has been freed from the application, but it still retained... [10:14:25] additionally there is a "classes" hiera list that can be used to add more classes [10:14:41] and hiera settings can be set either via: a) puppet.git /hieradata/ b) one of the wikitech pages c) in horizon [10:14:59] each time with options for project wide or per instance [10:15:09] Ok, I'll check those again... [10:15:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5843 MB out of 7627 MB) [10:15:29] (and apparently unattended upgrade is not enabled or broken on beta cluster) [10:15:40] hashar: I added some comments on T148478, but feel free to ping me if you want to have a more in depth discussion on how GC works [10:15:41] T148478: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478 [10:15:50] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=367.80 Read Requests/Sec=229.60 Write Requests/Sec=0.20 KBytes Read/Sec=29388.80 KBytes_Written/Sec=2.00 [10:16:59] gehel: ah neat. So to rephrase your comment: the JVM freed the memory from the app internally but still withheld it from the OS? [10:17:24] hashar: yep, that's how it looks to me... 
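As a side note on the Hiera discussion above: the places hashar lists (puppet.git /hieradata/, a wikitech Hiera: page, Horizon, each either project wide or per instance) can be pictured as a set of layers, and a verbose lookup as a report of which layers define a key. The following is only a rough sketch under stated assumptions: the layer names, the sample data and the `find_definitions` helper are hypothetical, and it deliberately does not model real Hiera precedence, which the log does not spell out. The only value taken from the log is "apt::use_experimental": true on Hiera:Deployment-prep.

```python
# Hypothetical model of where a hiera key may be defined, following the chat:
# puppet.git /hieradata/, a wikitech Hiera: page, or Horizon, each either
# project wide or per instance. Layer names and empty layers are invented.
LAYERS = [
    ("puppet.git hieradata (project)", {}),
    ("wikitech Hiera:Deployment-prep (project)", {"apt::use_experimental": True}),
    ("horizon (project)", {}),
    ("horizon (instance)", {}),
]


def find_definitions(key, layers=LAYERS):
    """Return (layer_name, value) for every layer that defines `key`.

    No precedence is applied: the point is to see all definitions at once.
    """
    return [(name, data[key]) for name, data in layers if key in data]


hits = find_definitions("apt::use_experimental")
for layer_name, value in hits:
    print(f"apt::use_experimental = {value!r} (set in {layer_name})")
```

Listing every definition, rather than returning the first match, is a design choice that fits the debugging case in the chat, where the question was where a setting comes from and not what its final value is.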
[10:17:52] it is usually better to consider the JVM as an OS within the OS to which you allocate a fixed chunk of memory [10:17:55] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083531 (10Marostegui) Shall I start copying labsdb1007 to dbstore1001 or are you "breaking" at the moment it @jcrespo? [10:18:37] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083532 (10jcrespo) We can start this now. I was about to do it. [10:19:14] gehel: I once read a nice article about how the JVM memory work. Was super interesting but I had a feeling I missed a couple PhD to really understand how it works ;-} [10:19:24] my conclusion was: don't mess up with! [10:20:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5843 MB out of 7627 MB) [10:22:29] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083581 (10Marostegui) >>! In T157359#3083532, @jcrespo wrote: > We can start this now. I was about to do it. Ah, go ahead if you like then. [10:25:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5844 MB out of 7627 MB) [10:26:51] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.50 Read Requests/Sec=0.40 Write Requests/Sec=65.70 KBytes Read/Sec=5.20 KBytes_Written/Sec=404.80 [10:29:21] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083645 (10jcrespo) I am with labsdb1004 now, please shutdown postgres and copy it. [10:30:02] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083647 (10Marostegui) Oki doki! 
[10:30:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5844 MB out of 7627 MB) [10:31:28] !log Shutdown postgresql on labsdb1007 for maintenance - T157359 [10:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:34] T157359: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359 [10:34:56] !log restarting labsdb1004's mariadb T159572 [10:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:02] T159572: labsdb1004 MySQL crash - https://phabricator.wikimedia.org/T159572 [10:35:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5844 MB out of 7627 MB) [10:35:37] hashar: The GC is black magic, but the basic principles are not that hard to understand... [10:35:38] 06Operations, 06Analytics-Kanban, 10netops, 15User-Elukey: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3083678 (10elukey) [10:37:46] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083686 (10Marostegui) Transfer started and the file will be located at: `dbstore1001:/srv/tmp/labsdb1007.tar.gz` [10:38:18] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3083687 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1041.eqiad.wmnet'] ``` and were **ALL** suc... 
[10:40:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5844 MB out of 7627 MB) [10:40:45] (03PS1) 10Marostegui: linux-host-entries: No more precise for labsdb1007 [puppet] - 10https://gerrit.wikimedia.org/r/341776 (https://phabricator.wikimedia.org/T157359) [10:42:42] 06Operations, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3083692 (10akosiaris) Looking at the RDB file size for that specific instance, it's currently 1.4G on rdb1007. It's clear that the configured limits of 500MB hard , 200MB per 60 secs soft... [10:44:03] (03CR) 10Jcrespo: [C: 032] "I need to deploy this, or this will cause problem every time we restart a mysql server with the socket on the right place. Please, Moritz," [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341551 (owner: 10Jcrespo) [10:45:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5845 MB out of 7627 MB) [10:46:29] (03PS1) 10Jcrespo: mariadb: Deploy I5d66ece339 (/var/run/mysqld unix permissions) [puppet] - 10https://gerrit.wikimedia.org/r/341777 (https://phabricator.wikimedia.org/T148507) [10:47:10] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Deploy I5d66ece339 (/var/run/mysqld unix permissions) [puppet] - 10https://gerrit.wikimedia.org/r/341777 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [10:50:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5845 MB out of 7627 MB) [10:50:50] (03CR) 10Jcrespo: [C: 031] "We can even do 1006 too here, or on a separate patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/341776 (https://phabricator.wikimedia.org/T157359) (owner: 10Marostegui) [10:51:34] (03PS2) 10Marostegui: linux-host-entries: No more precise for labsdb1007 [puppet] - 10https://gerrit.wikimedia.org/r/341776 (https://phabricator.wikimedia.org/T157359) [10:53:12] godog, There is no longer collection failures for dbs: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=4&fullscreen [10:53:35] but we are cheating by monitoring only 1 of the instances on multi-instance hosts [10:53:39] (03PS3) 10Marostegui: linux-host-entries: Remove precise: labsdb1006,7 [puppet] - 10https://gerrit.wikimedia.org/r/341776 (https://phabricator.wikimedia.org/T157359) [10:54:00] (03CR) 10Jcrespo: [C: 031] linux-host-entries: Remove precise: labsdb1006,7 [puppet] - 10https://gerrit.wikimedia.org/r/341776 (https://phabricator.wikimedia.org/T157359) (owner: 10Marostegui) [10:55:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5845 MB out of 7627 MB) [10:55:22] (03CR) 10Marostegui: [C: 032] linux-host-entries: Remove precise: labsdb1006,7 [puppet] - 10https://gerrit.wikimedia.org/r/341776 (https://phabricator.wikimedia.org/T157359) (owner: 10Marostegui) [10:55:29] (03PS4) 10Marostegui: linux-host-entries: Remove precise: labsdb1006,7 [puppet] - 10https://gerrit.wikimedia.org/r/341776 (https://phabricator.wikimedia.org/T157359) [10:57:27] jynus: nice! I see labsdb1004:9104 there tho? 
[10:57:53] godog, I restarted it briefly [10:57:59] not related to monitoring [10:58:17] there are two ways to proceed with this [10:58:33] either we make the mysql-exporter callable 7 times [10:58:44] by making everything needed a parameter [10:59:13] or we send an array of sockets to it, and let mysql-exporter figure it out [10:59:53] this is important because it is no longer a single, special host [11:00:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5845 MB out of 7627 MB) [11:00:22] we may have multiple, very important production hosts with multiple instances soon [11:00:39] and monitoring them would be a blocker [11:00:43] jynus: how many instances per host, do you think? more or less than 10? [11:00:54] 8 tops [11:01:05] 2-3 normally [11:02:26] yeah it seems we could have pairs of (unix socket, tcp port) or something like that and use create_resources with the exporter [11:02:34] no need for a port [11:02:39] only a socket [11:02:53] the listening port for the exporter [11:02:55] see current template [11:02:59] ah, true [11:03:02] sorry [11:03:15] which means [11:03:24] we may need to change the actual prometheus yaml? [11:03:28] to allow ports? [11:04:14] I saw your update, btw, on puppetdb [11:04:20] the yaml config you mean? yeah each instance would be on a different port, like we do e.g. for varnish [11:04:34] ah, so it is already on varnish? [11:04:57] then I can see it and "clone it", assuming it works similarly [11:05:06] yeah each cp machine runs two varnishes, frontend and backend, and we monitor them individually [11:05:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5845 MB out of 7627 MB) [11:05:43] the other thing that I will change, but it is not important, and I can do on my own [11:05:55] is to make prometheus mysql access passwordless [11:06:40] that'd be nice! based only on permissions on the unix socket? 
[11:06:44] no [11:07:16] UID == 'prometheus' + connecting from localhost = granted with prometheus grants [11:07:51] so there is no password to leak on multiple machines [11:08:06] (CR) Giuseppe Lavagetto: [C: -1] "TL;DR: good job, but document the code in transports/clustershell.py" (7 comments) [software/cumin] - https://gerrit.wikimedia.org/r/341310 (owner: Volans) [11:08:07] for prometheus and for nagios- which only connects from localhost [11:09:08] sweet, fewer passwords is always a good idea indeed [11:09:13] yep [11:09:18] but that is secondary [11:09:26] I probably don't need help with that [11:09:37] the important part is the multi-instance support [11:09:49] I will have a look at varnish [11:10:08] prometheus-varnish-exporter, I'd assume? [11:10:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5846 MB out of 7627 MB) [11:12:11] jynus: yeah, look at how we do multi-instance in puppet with the service file and so on [11:12:14] bbiab [11:13:11] Operations, HHVM: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3028990 (Nikerabbit) Just relaying that translatewiki.net has been affected by https://github.com/facebook/hhvm/issues/7567 until I disabled stat_cache. As far as I know WMF has stat_cache enabled. [11:15:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5846 MB out of 7627 MB) [11:16:05] “Request from (IPv6) via cp4017 frontend, Varnish XID 189009509 [11:16:06] Error: 503, Backend fetch failed at Wed, 08 Mar 2017 11:02:24 GMT” - seems like it was momentary, tho. 
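Going back to the multi-instance monitoring idea discussed above (one exporter per MySQL instance, each reading its own unix socket and listening on its own port, the same way the frontend and backend varnish are monitored individually), the pairing could be sketched roughly as follows. This is an illustration only: the host name, the socket paths and the `exporter_instances` helper are hypothetical, the base port 9104 is simply borrowed from the `labsdb1004:9104` target mentioned earlier in the log, and none of this is the actual puppet or prometheus configuration.

```python
def exporter_instances(host, sockets, base_port=9104):
    """Pair each MySQL unix socket with its own exporter listening port.

    Hypothetical helper: ports are assigned sequentially from `base_port`,
    which is an assumption made for this sketch, not an agreed convention.
    """
    return [
        {
            "socket": sock,                      # where the exporter reads MySQL
            "listen_port": base_port + i,        # where prometheus scrapes it
            "target": f"{host}:{base_port + i}", # entry for the scrape target list
        }
        for i, sock in enumerate(sockets)
    ]


# Invented example: a host running two instances (the chat mentions 2-3 as typical).
instances = exporter_instances(
    "db-example",
    ["/run/mysqld/mysqld.s1.sock", "/run/mysqld/mysqld.s2.sock"],
)
targets = [inst["target"] for inst in instances]
print(targets)
```

Keeping the socket and the listening port together in one record mirrors the "(unix socket, tcp port)" pairs godog suggests feeding to create_resources, so each instance can be declared and scraped independently.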
[11:16:09] when you come back- I am 100% on board with prometheus and its model, but I am not 100% sure how flexible the exporters are as a Go compiled blob [11:16:32] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083756 (10Marostegui) `labsdb1007:/etc` directory has been copied to `dbstore1001:/srv/tmp/labsdb1007_etc.tar.gz` [11:16:43] especially because AFAIK (I may be wrong here) it is not easy to add custom metrics outside of the official ones [11:17:57] Seems like the server is reeeeaaaallllyyyy slow now. [11:20:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5846 MB out of 7627 MB) [11:23:59] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:24:44] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3083762 (10jcrespo) Qualitatively, gerrit has been working ok for me lately, although it logged me out several times (I assume the service was restarted). [11:25:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5846 MB out of 7627 MB) [11:25:10] RECOVERY - Hadoop NodeManager on analytics1041 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:30:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5846 MB out of 7627 MB) [11:35:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5846 MB out of 7627 MB) [11:35:49] !log installing texlive-base security updates [11:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5846 MB out of 7627 MB) [11:42:06] (03CR) 10Volans: [C: 04-1] "Looks good but has a typo, see inline."
(033 comments) [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/341296 (owner: 10Giuseppe Lavagetto) [11:42:27] (03PS2) 10Urbanecm: Add HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341599 (https://phabricator.wikimedia.org/T150618) [11:44:22] (03PS3) 10Urbanecm: Add HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341599 (https://phabricator.wikimedia.org/T150618) [11:45:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5847 MB out of 7627 MB) [11:49:30] (03CR) 10Giuseppe Lavagetto: Add --strip (033 comments) [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/341296 (owner: 10Giuseppe Lavagetto) [11:50:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5847 MB out of 7627 MB) [11:50:34] (03CR) 10Volans: [C: 04-1] "LGTM, minor missing variable in log line" (033 comments) [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/341760 (owner: 10Giuseppe Lavagetto) [11:51:29] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[texlive-pictures],Package[texlive] [11:51:59] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [11:54:32] (03Draft1) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/341782 [11:55:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5847 MB out of 7627 MB) [11:56:42] (03PS2) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/341782 [11:57:09] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:57:19] PROBLEM - DPKG on wasat is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:57:19] PROBLEM - DPKG on mw2249 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:57:19] PROBLEM - DPKG on mw2247 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:57:19] PROBLEM - DPKG on mw2241 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:57:19] PROBLEM - DPKG on mw2221 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:57:20] PROBLEM - DPKG on mw2239 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:57:20] PROBLEM - DPKG on mw2228 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:57:21] PROBLEM - DPKG on mw2217 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:57:21] PROBLEM - DPKG on mw2220 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:57:22] PROBLEM - DPKG on mw2216 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:57:22] PROBLEM - DPKG on mw2243 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:57:23] PROBLEM - DPKG on mw2242 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:57:23] PROBLEM - DPKG on mw2225 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:57:24] PROBLEM - DPKG on mw2215 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:58:19] RECOVERY - DPKG on wasat is OK: All packages OK [11:58:19] RECOVERY - DPKG on mw2249 is OK: All packages OK [11:58:19] RECOVERY - DPKG on mw2247 is OK: All packages OK [11:58:19] RECOVERY - DPKG on mw2241 is OK: All packages OK [11:58:19] RECOVERY - DPKG on mw2239 is OK: All packages OK [11:58:20] RECOVERY - DPKG on mw2221 is OK: All packages OK [11:58:20] RECOVERY - DPKG on mw2217 is OK: All packages OK [11:58:21] RECOVERY - DPKG on mw2220 is OK: All packages OK [11:58:21] RECOVERY - DPKG on mw2216 is OK: All packages OK [11:58:22] RECOVERY - DPKG on mw2243 is OK: All packages OK [11:58:22] RECOVERY - DPKG on mw2242 is OK: All packages OK 
[11:58:23] RECOVERY - DPKG on mw2225 is OK: All packages OK [11:58:23] RECOVERY - DPKG on mw2215 is OK: All packages OK [11:58:24] RECOVERY - DPKG on mw2226 is OK: All packages OK [11:58:27] (03PS3) 10Gehel: WIP - logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/341782 [11:58:36] PROBLEM - puppet last run on mw2095 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[texlive-pictures],Package[texlive] [11:58:55] ^ that's the texlive update, these packages are fairly big [11:59:19] RECOVERY - DPKG on mw2228 is OK: All packages OK [11:59:19] RECOVERY - DPKG on mw2246 is OK: All packages OK [11:59:29] RECOVERY - DPKG on mw2236 is OK: All packages OK [11:59:59] PROBLEM - DPKG on mw1286 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:59:59] PROBLEM - DPKG on mw1305 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:59:59] PROBLEM - DPKG on mw1264 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:59:59] PROBLEM - DPKG on mw1281 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:59:59] PROBLEM - DPKG on mw1299 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:00] PROBLEM - DPKG on mw1279 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:00] PROBLEM - DPKG on mw1304 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:01] PROBLEM - DPKG on mw1290 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:01] PROBLEM - DPKG on mw1283 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:02] PROBLEM - DPKG on mw1272 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:02] PROBLEM - DPKG on mw1267 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:09] PROBLEM - DPKG on mw1298 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:09] PROBLEM - DPKG on mw1293 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:09] PROBLEM - 
check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5847 MB out of 7627 MB) [12:00:10] PROBLEM - DPKG on mw1282 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:10] PROBLEM - DPKG on mw1278 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:10] PROBLEM - DPKG on mw1261 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:10] PROBLEM - DPKG on mw1301 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:11] PROBLEM - DPKG on mw1284 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:19] PROBLEM - DPKG on mw1302 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:19] PROBLEM - DPKG on mw1263 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:19] PROBLEM - DPKG on mw1275 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:19] PROBLEM - DPKG on mw1295 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:19] PROBLEM - DPKG on mw1288 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:20] PROBLEM - DPKG on mw1262 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:20] PROBLEM - DPKG on mw1277 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:21] PROBLEM - DPKG on mw1285 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:21] PROBLEM - DPKG on mw1268 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:22] PROBLEM - DPKG on mw1266 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:29] PROBLEM - DPKG on mw1306 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:29] PROBLEM - DPKG on mw1274 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:29] PROBLEM - DPKG on mw1297 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:29] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [12:00:29] PROBLEM - DPKG on mw1273 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:39] PROBLEM - DPKG on mw1300 
is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:49] PROBLEM - DPKG on mw1269 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:49] PROBLEM - DPKG on mw1303 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:49] PROBLEM - DPKG on mw1280 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:49] PROBLEM - DPKG on mw1276 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:49] PROBLEM - DPKG on mw1270 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:50] PROBLEM - DPKG on mw1294 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:51] PROBLEM - DPKG on mw1265 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:51] PROBLEM - DPKG on mw1296 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:00:59] RECOVERY - DPKG on mw1286 is OK: All packages OK [12:00:59] RECOVERY - DPKG on mw1305 is OK: All packages OK [12:00:59] (03CR) 10Gehel: "@Giuseppe: I'd like your opinion on this before I refactor too much. Does this approach make sense to you?" [puppet] - 10https://gerrit.wikimedia.org/r/341782 (owner: 10Gehel) [12:00:59] RECOVERY - DPKG on mw1264 is OK: All packages OK [12:00:59] RECOVERY - DPKG on mw1281 is OK: All packages OK [12:00:59] RECOVERY - DPKG on mw1299 is OK: All packages OK [12:01:00] RECOVERY - DPKG on mw1279 is OK: All packages OK [12:01:00] RECOVERY - DPKG on mw1304 is OK: All packages OK [12:01:01] RECOVERY - DPKG on mw1283 is OK: All packages OK [12:01:01] RECOVERY - DPKG on mw1290 is OK: All packages OK [12:01:02] RECOVERY - DPKG on mw1272 is OK: All packages OK [12:01:02] RECOVERY - DPKG on mw1267 is OK: All packages OK [12:01:09] PROBLEM - puppet last run on mw1202 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Package[texlive-pictures],Package[texlive],Package[texlive-fonts-recommended] [12:01:09] RECOVERY - DPKG on mw1282 is OK: All packages OK [12:01:09] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[texlive-pictures],Package[texlive] [12:01:09] RECOVERY - DPKG on mw1278 is OK: All packages OK [12:01:09] RECOVERY - DPKG on mw1261 is OK: All packages OK [12:01:10] RECOVERY - DPKG on mw1301 is OK: All packages OK [12:01:10] RECOVERY - DPKG on mw1284 is OK: All packages OK [12:01:19] PROBLEM - puppet last run on mw1288 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[texlive-pictures],Package[texlive-fonts-recommended] [12:01:19] RECOVERY - DPKG on mw1302 is OK: All packages OK [12:01:19] RECOVERY - DPKG on mw1263 is OK: All packages OK [12:01:19] RECOVERY - DPKG on mw1275 is OK: All packages OK [12:01:19] RECOVERY - DPKG on mw1288 is OK: All packages OK [12:01:20] RECOVERY - DPKG on mw1262 is OK: All packages OK [12:01:20] RECOVERY - DPKG on mw1277 is OK: All packages OK [12:01:21] RECOVERY - DPKG on mw1285 is OK: All packages OK [12:01:21] RECOVERY - DPKG on mw1268 is OK: All packages OK [12:01:22] RECOVERY - DPKG on mw1266 is OK: All packages OK [12:01:29] RECOVERY - DPKG on mw1306 is OK: All packages OK [12:01:29] RECOVERY - DPKG on mw1274 is OK: All packages OK [12:01:29] RECOVERY - DPKG on mw1297 is OK: All packages OK [12:01:30] RECOVERY - DPKG on mw1273 is OK: All packages OK [12:01:39] RECOVERY - DPKG on mw1300 is OK: All packages OK [12:01:39] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[texlive-pictures],Package[texlive] [12:01:49] RECOVERY - DPKG on mw1269 is OK: All packages OK [12:01:49] RECOVERY - DPKG on mw1303 is OK: All packages OK [12:01:49] RECOVERY - DPKG on mw1280 is OK: All packages OK [12:01:49] RECOVERY - DPKG on mw1276 is OK: All packages OK [12:01:49] RECOVERY - DPKG on mw1270 is OK: All packages OK [12:01:50] RECOVERY - DPKG on mw1265 is OK: All packages OK [12:01:50] RECOVERY - DPKG on mw1294 is OK: All packages OK [12:01:51] RECOVERY - DPKG on mw1296 is OK: All packages OK [12:01:59] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [12:02:09] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[texlive-pictures] [12:02:09] RECOVERY - DPKG on mw1298 is OK: All packages OK [12:02:09] RECOVERY - DPKG on mw1293 is OK: All packages OK [12:02:09] PROBLEM - puppet last run on mw1264 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[texlive-pictures],Package[texlive-fonts-recommended] [12:02:09] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[texlive-pictures],Package[texlive-fonts-recommended] [12:02:10] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [12:02:19] RECOVERY - DPKG on mw1295 is OK: All packages OK [12:02:19] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[texlive-pictures],Package[texlive-fonts-recommended] [12:02:29] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Puppet has 2 failures. 
Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[texlive-pictures],Package[texlive-fonts-recommended] [12:02:29] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [12:02:29] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:02:59] PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[texlive-pictures],Package[texlive],Package[texlive-fonts-recommended] [12:02:59] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [12:03:29] RECOVERY - puppet last run on mw2250 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [12:03:30] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:04:29] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [12:04:39] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [12:04:59] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [12:05:09] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [12:05:09] RECOVERY - puppet last run on mw1264 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:05:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5847 MB out of 7627 MB) [12:06:09] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:07:29] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 40 
seconds ago with 0 failures [12:08:08] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3002535 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['labsdb1007.eqiad.wmnet'] ```... [12:10:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5847 MB out of 7627 MB) [12:11:19] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [12:11:30] RECOVERY - puppet last run on mw1286 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [12:12:29] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:15:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5848 MB out of 7627 MB) [12:15:18] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083856 (10jcrespo) The installer ask for verification to delete all partitions- this should be changed on the recipe (or use the db one, where this... [12:19:21] 06Operations, 10Traffic, 10netops: Set up LVS for current AuthDNS - https://phabricator.wikimedia.org/T101525#3083886 (10BBlack) Some further thoughts that haven't been captured here: 1. I think it makes sense at this juncture to pursue a combined authdns+recdns machine config. They're both low-load and ne... 
[12:20:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5848 MB out of 7627 MB) [12:25:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5848 MB out of 7627 MB) [12:25:10] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:25:11] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:29:10] (03PS1) 10Jcrespo: partman-labsdb-osm: Make partman auto-confirm partition/RAID/LVM [puppet] - 10https://gerrit.wikimedia.org/r/341786 (https://phabricator.wikimedia.org/T157359) [12:30:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5848 MB out of 7627 MB) [12:34:17] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083907 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labsdb1007.eqiad.wmnet'] ``` and were **ALL** successful. [12:35:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5848 MB out of 7627 MB) [12:35:23] !log add mwlog[12]001 to analytics-in4 term rsync-http-https - T123728 [12:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:30] T123728: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728 [12:40:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5849 MB out of 7627 MB) [12:40:33] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083935 (10jcrespo) It installed ok, but /dev/tank/data has only 500GB. How was that partitioned @akosiaris ? 
[12:45:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5849 MB out of 7627 MB) [12:48:27] 06Operations, 06Performance-Team, 10Thumbor: Investigate if we can graph the age of the Thumbor processes in Grafana - https://phabricator.wikimedia.org/T159352#3083943 (10Gilles) Looks like something like that in a cron might do the trick: http://dev.nuclearrooster.com/2011/05/11/sending-metrics-to-statsd-f... [12:50:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 77% free (5849 MB out of 7627 MB) [12:54:09] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:54:13] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083946 (10akosiaris) It's missing the rest of the disks in the md RAID5 (and RAID1 for /boot) for some reason. Maybe some difference between jessie... [12:55:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 78% free (5924 MB out of 7627 MB) [12:55:55] ACKNOWLEDGEMENT - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 5576 MB (15% inode=92%): /dev 32200 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 77166 MB (5% inode=99%): Jeff_Green noted investigating [12:55:56] ACKNOWLEDGEMENT - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 78% free (5924 MB out of 7627 MB) Jeff_Green noted investigating [12:56:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 23 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:01:49] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:01:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:02:36] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083951 (10akosiaris) Nope. not that. What's actually the culprit is rOPUP1b3633e where labsdb1007 gets matched incorrectly [13:03:15] (03PS2) 10Filippo Giunchedi: hieradata: make mwlog1001 primary log host [puppet] - 10https://gerrit.wikimedia.org/r/341570 (https://phabricator.wikimedia.org/T123728) [13:05:09] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: make mwlog1001 primary log host [puppet] - 10https://gerrit.wikimedia.org/r/341570 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [13:08:26] 06Operations, 07Puppet: PuppetDB is auto-deactivating hosts - https://phabricator.wikimedia.org/T159163#3083956 (10akosiaris) Yeah, let's disable it until we have a good reason for enabling it. It will probably need a restart of puppetdb which will be followed by a storm of puppet alerts btw [13:09:59] (03CR) 10Elukey: "Ready for a round of reviews if anybody has time." [puppet] - 10https://gerrit.wikimedia.org/r/333880 (owner: 10Elukey) [13:11:07] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083958 (10jcrespo) Thanks, I can amend that and try again. 
[13:11:30] !log make mwlog1001 the primary logging host, deprecate fluorine [13:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:33] (03PS1) 10Alexandros Kosiaris: Add private LVS IPs in network::subnets data [puppet] - 10https://gerrit.wikimedia.org/r/341787 [13:16:49] (03PS1) 10Jcrespo: autoinstall: Fix bug on 1b3633e4aa3f [puppet] - 10https://gerrit.wikimedia.org/r/341788 (https://phabricator.wikimedia.org/T157359) [13:20:37] ^this tells me that we should probably lock and unlock in some way hosts for reinstall [13:21:29] 06Operations, 07Puppet: PuppetDB is auto-deactivating hosts - https://phabricator.wikimedia.org/T159163#3083990 (10Joe) @akosiaris you are correct, but I think that's inevitable. [13:22:09] (03CR) 10Marostegui: [C: 031] autoinstall: Fix bug on 1b3633e4aa3f [puppet] - 10https://gerrit.wikimedia.org/r/341788 (https://phabricator.wikimedia.org/T157359) (owner: 10Jcrespo) [13:23:30] (03PS1) 10Filippo Giunchedi: site: use spare::system on fluorine [puppet] - 10https://gerrit.wikimedia.org/r/341789 (https://phabricator.wikimedia.org/T123728) [13:24:45] (03PS2) 10Marostegui: autoinstall: Fix bug on 1b3633e4aa3f [puppet] - 10https://gerrit.wikimedia.org/r/341788 (https://phabricator.wikimedia.org/T157359) (owner: 10Jcrespo) [13:25:05] 06Operations, 06Performance-Team, 10Thumbor: Investigate if we can graph the age of the Thumbor processes in Grafana - https://phabricator.wikimedia.org/T159352#3083992 (10Gilles) I believe this might work: ``` ps -eo comm,cmd:44,etimes= | grep "^thumbor" | awk -v HOSTNAME=$(hostname) '{print "thumbor."HOST... 
[13:25:31] (03CR) 10Alexandros Kosiaris: [C: 031] autoinstall: Fix bug on 1b3633e4aa3f [puppet] - 10https://gerrit.wikimedia.org/r/341788 (https://phabricator.wikimedia.org/T157359) (owner: 10Jcrespo) [13:25:55] (03PS2) 10Marostegui: partman-labsdb-osm: Make partman auto-confirm partition/RAID/LVM [puppet] - 10https://gerrit.wikimedia.org/r/341786 (https://phabricator.wikimedia.org/T157359) (owner: 10Jcrespo) [13:26:19] (03CR) 10Marostegui: [C: 032] autoinstall: Fix bug on 1b3633e4aa3f [puppet] - 10https://gerrit.wikimedia.org/r/341788 (https://phabricator.wikimedia.org/T157359) (owner: 10Jcrespo) [13:27:13] (03PS3) 10Marostegui: partman-labsdb-osm: Make partman auto-confirm partition/RAID/LVM [puppet] - 10https://gerrit.wikimedia.org/r/341786 (https://phabricator.wikimedia.org/T157359) (owner: 10Jcrespo) [13:28:39] (03CR) 10Marostegui: [C: 032] partman-labsdb-osm: Make partman auto-confirm partition/RAID/LVM [puppet] - 10https://gerrit.wikimedia.org/r/341786 (https://phabricator.wikimedia.org/T157359) (owner: 10Jcrespo) [13:30:29] 3 [13:30:49] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:31:45] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3083998 (10Marostegui) Merged both patches from @jcrespo and ran puppet on install1002, going to try to reimage the server again [13:33:09] 06Operations, 06Performance-Team, 10Thumbor: Graph the age of the Thumbor processes in Grafana - https://phabricator.wikimedia.org/T159352#3065071 (10Gilles) p:05Triage>03Normal [13:33:24] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3084004 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['labsdb1007.eqiad.wmnet']... 
[13:42:25] (03PS1) 10Gilles: Send thumbor process age to statsd via cron [puppet] - 10https://gerrit.wikimedia.org/r/341791 (https://phabricator.wikimedia.org/T159352) [13:42:36] !log Deploy alter table s6 revision table on db1023 - T159414 [13:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:41] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [13:43:37] (03CR) 10jerkins-bot: [V: 04-1] Send thumbor process age to statsd via cron [puppet] - 10https://gerrit.wikimedia.org/r/341791 (https://phabricator.wikimedia.org/T159352) (owner: 10Gilles) [13:46:37] (03PS2) 10Gilles: Send thumbor process age to statsd via cron [puppet] - 10https://gerrit.wikimedia.org/r/341791 (https://phabricator.wikimedia.org/T159352) [13:48:37] (03PS1) 10Alexandros Kosiaris: WIP: Document kubernetes pod IPs [puppet] - 10https://gerrit.wikimedia.org/r/341792 [13:51:20] jouncebot, next [13:51:20] In 0 hour(s) and 8 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170308T1400) [13:51:52] (03CR) 10Giuseppe Lavagetto: Handle SIGTERM, SIGINT in the threads (033 comments) [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/341760 (owner: 10Giuseppe Lavagetto) [13:52:07] who's doing the swat today?
[13:52:13] (03PS2) 10Giuseppe Lavagetto: Add --strip [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/341296 [13:52:15] (03PS2) 10Giuseppe Lavagetto: Handle SIGTERM, SIGINT in the threads [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/341760 [13:53:10] (03CR) 10Volans: [C: 031] "LGTM" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/341296 (owner: 10Giuseppe Lavagetto) [13:54:25] (03PS1) 10Alexandros Kosiaris: Assign the kubernetes pod IPs in DNS [dns] - 10https://gerrit.wikimedia.org/r/341794 [13:54:42] (03CR) 10Volans: [C: 031] "LGTM" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/341760 (owner: 10Giuseppe Lavagetto) [13:54:43] 06Operations, 06Operations-Software-Development, 15User-Joe: conftool service removal bugs - https://phabricator.wikimedia.org/T152977#3084082 (10Joe) This is now solved with the latest version of conftool [13:54:55] 06Operations, 06Operations-Software-Development, 15User-Joe: conftool service removal bugs - https://phabricator.wikimedia.org/T152977#3084084 (10Joe) 05Open>03Resolved [13:55:52] (03PS4) 10Urbanecm: Add HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341599 (https://phabricator.wikimedia.org/T150618) [13:56:33] <_joe_> akosiaris: just 500 IPs? [13:56:39] <_joe_> is that enough? [13:56:58] <_joe_> I mean can we grow the subnet without getting mad at a later time? [13:57:36] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3084091 (10elukey) Summary of today. I reimaged analytics1041 with the analytics-flex.cfg partman recipe, that does not men... [13:58:18] (03CR) 10Aklapper: "I'm sure this was not used by korma.wmflabs.org. 
Looks like this was requested in https://phabricator.wikimedia.org/T54329 independent fro" [puppet] - 10https://gerrit.wikimedia.org/r/341593 (owner: 10Chad) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170308T1400). [14:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:09] Present [14:00:51] _joe_: I 'll split it to 2 /24 [14:00:52] but yes [14:01:06] we can grow it very easily [14:01:15] it's anyway empty up to 10.64.79.255 [14:01:45] o/ [14:02:07] hashar: should I do eu swat today? [14:03:19] Urbanecm: I am reviewing your patches [14:03:24] zeljkof, okay [14:03:35] if hashar (or somebody else) wants to do swat today, let me know [14:05:23] ok, no reply so... [14:05:29] I can SWAT today! [14:05:43] zeljkof, okay! [14:06:00] hello [14:06:18] Urbanecm: all those logos on https://gerrit.wikimedia.org/r/#/c/341599/ are you grabbing them from commons ? [14:06:19] o/ [14:07:12] hashar: want to do swat? or should I continue? [14:07:34] hashar, I received 2 zips from srdjan_m, updated IS file and added them to correct location. Not sure where did they grab the files. [14:08:38] at least their sizes are consistent [14:08:46] (03PS2) 10Zfilipin: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341504 (https://phabricator.wikimedia.org/T159803) (owner: 10Urbanecm) [14:09:19] hashar: srdjan_m has just joined, you can ask yourself [14:09:33] (03CR) 10Alexandros Kosiaris: "hm, IPv6 ? maybe we should think about this too. Not sure however how well the calico policies support it. It supposedly is supported." 
[dns] - 10https://gerrit.wikimedia.org/r/341794 (owner: 10Alexandros Kosiaris) [14:09:37] (03PS1) 10Marostegui: db-eqiad.php: Add a few comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341797 (https://phabricator.wikimedia.org/T153743) [14:09:43] hashar, I checked the files randomly. [14:10:34] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341504 (https://phabricator.wikimedia.org/T159803) (owner: 10Urbanecm) [14:10:53] <_joe_> heh, ipv6 for pods networking? [14:10:53] Urbanecm: +2d 341504 [14:11:03] <_joe_> interesting idea [14:11:09] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 20 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:11:12] (03CR) 10Hashar: [C: 031] "Checked that sizes are consistent (width x height). Looked at all of them and they are legit logos." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341599 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [14:11:18] (03CR) 10Giuseppe Lavagetto: [C: 032] Add --strip [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/341296 (owner: 10Giuseppe Lavagetto) [14:11:26] (03CR) 10Giuseppe Lavagetto: [C: 032] Handle SIGTERM, SIGINT in the threads [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/341760 (owner: 10Giuseppe Lavagetto) [14:11:31] zeljkof, I noticed the message from wikibugs_ :) [14:12:03] me too :) just wanted to be explicit [14:12:04] (03Merged) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341504 (https://phabricator.wikimedia.org/T159803) (owner: 10Urbanecm) [14:12:13] (03CR) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341504 (https://phabricator.wikimedia.org/T159803) (owner: 10Urbanecm) [14:12:27] Ok [14:13:21] (03PS4) 10Zfilipin: Update logo for bswiki (Bosnian Wikipedia) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339326 
(https://phabricator.wikimedia.org/T158815) (owner: 10DatGuy) [14:13:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:16:09] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 15 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:16:33] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:341504|Add new throttle rule (T159803)]] (duration: 00m 41s) [14:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:39] T159803: Requesting temporary lift of IP cap on 2017-03-10 - https://phabricator.wikimedia.org/T159803 [14:16:52] Urbanecm: 341504 is deployed [14:17:11] zeljkof, Thank you [14:18:10] Urbanecm: reviewing 339326 [14:18:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 17 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:19:11] no image diff in gerrit :/ looking at it locally [14:20:13] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339326 (https://phabricator.wikimedia.org/T158815) (owner: 10DatGuy) [14:21:28] (03Merged) 10jenkins-bot: Update logo for bswiki (Bosnian Wikipedia) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339326 (https://phabricator.wikimedia.org/T158815) (owner: 10DatGuy) [14:21:37] (03CR) 10jenkins-bot: Update logo for bswiki (Bosnian Wikipedia) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339326 (https://phabricator.wikimedia.org/T158815) (owner: 10DatGuy) [14:22:08] zeljkof, ok. [14:23:09] Urbanecm: could you please resolve conflict in 341599? [14:23:20] looks like something is conflicting [14:23:33] zeljkof, working on it [14:23:33] (while I deploy 339326) [14:23:38] thanks! 
[14:24:30] yw [14:25:45] hashar: I went through the list at T150618 and manually checked if there was a "Wikipedia-logo-v2-xx.svg" available on Commons that matched the .png at /static/images/project-logos/xxwiki.png. When I ran into a match, I scaled the .svg to 204px for 1.5x, 270px for 2x, ran 'optipng -o7' on the .pngs, zipped them up and sent them to Urban. [14:25:46] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [14:25:52] Hope that answers your question. [14:25:57] srdjan_m: yeah they seem all fine :) [14:26:10] and thanks for optipng! [14:26:12] (03PS5) 10Urbanecm: Add HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341599 (https://phabricator.wikimedia.org/T150618) [14:26:38] np [14:26:51] zeljkof, PS5 is the rebase [14:26:56] thanks [14:26:58] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:339326|Update logo for bswiki (Bosnian Wikipedia) (T158815)]] (duration: 00m 41s) [14:26:59] yw [14:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:03] T158815: Update logo for bs.wikipedia - https://phabricator.wikimedia.org/T158815 [14:27:58] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:339326|Update logo for bswiki (Bosnian Wikipedia) (T158815)]] (duration: 00m 41s) [14:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:24] Urbanecm: 339326 is deployed, please check logos [14:30:01] zeljkof, working [14:36:10] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341599 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [14:37:20] hashar: a question about Image Cache Purges [14:37:26] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Image_Cache_Purges [14:37:38] (03Merged) 10jenkins-bot: Add HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341599 
(https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [14:37:53] the instructions are for just one image [14:38:11] how do I do it for multiple images? one by one? [14:38:16] I usually don't bother but yeah that can be done [14:38:24] script it ? [14:38:27] should I just ignore it? [14:38:34] yeah ignore it for now [14:38:37] ok [14:41:02] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:341599|Add HD logos for several projects (T150618)]] (duration: 00m 44s) [14:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:08] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [14:42:05] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:341599|Add HD logos for several projects (T150618)]] (duration: 00m 41s) [14:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:22] Urbanecm: 341599 deployed, please check all wikis ;) [14:42:40] no more commits to deploy [14:42:41] zeljkof, that would take long time :). But okay, going to do it. [14:42:51] !log EU SWAT finished [14:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:48] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3084272 (10Marostegui) Looks good now: ``` root@labsdb1007:~# pvs PV VG Fmt Attr PSize PFree /dev/md1 labsdb1007-vg lvm2 a... [14:44:03] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [14:45:07] zeljkof: some vim magic: %s%\(.*logos\/\)\(.\+\)\(wiki-.*\)%https://\2.wikipedia.org/\1\2\3 [14:45:46] zeljkof, logos works :) [14:46:08] Urbanecm: great! [14:46:13] hashar: ٩(×̯×)۶ what does that do? 
[14:46:18] zeljkof: I think they just get purged automatically [14:46:35] oh the vim regex was to generate the http urls to purge [14:48:08] and given they are almost all new files, no purges were needed [14:48:10] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3084282 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labsdb1007.eqiad.wmnet'] ``` and were **ALL** successful. [14:48:44] (03PS2) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) [14:49:57] (03CR) 10jerkins-bot: [V: 04-1] maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [14:50:24] (03CR) 10Gehel: [C: 04-1] maps - cleartables osm replication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [14:51:28] Thanks Urbanecm and zeljkof [14:51:50] DatGuy: thanks for deploying with #releng ;) [14:51:53] (03PS3) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) [14:52:03] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:54:45] "I am your pilot zeljko.f. Thanks for deploying with SWAT airlines." 
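(Editor's sketch, not part of the channel log.) The vim substitution hashar pasted at 14:45:07 turns each logo path into the URL that would be purged. A rough Python equivalent is below. It assumes the input is one repo-relative path per call and that files are named like `bswiki-2x.png`; those file names and the function name `purge_url` are illustrative assumptions, not something stated in the log.

```python
import re

# Same three groups as the vim pattern quoted in the log:
#   \1 = everything up to and including "logos/"
#   \2 = the language code in front of "wiki-"
#   \3 = "wiki-" plus the rest of the file name
LOGO_PATH = re.compile(r"(.*logos/)(.+)(wiki-.*)")


def purge_url(path: str) -> str:
    """Rewrite a logo path into a wikipedia.org URL; non-matching input is returned unchanged."""
    return LOGO_PATH.sub(r"https://\2.wikipedia.org/\1\2\3", path, count=1)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for p in ("static/images/project-logos/bswiki-1.5x.png",
              "static/images/project-logos/bswiki-2x.png"):
        print(purge_url(p))
```

As in the original pattern, only `<lang>.wikipedia.org` hosts are produced, so logos for other projects would need a different host mapping.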
[14:57:03] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.003 second response time [14:57:40] (03CR) 10jenkins-bot: Add HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341599 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [15:03:57] jouncebot: now [15:03:57] No deployments scheduled for the next 3 hour(s) and 56 minute(s) [15:04:38] (03PS3) 10Reedy: Remove empty conditionals for wikis from flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338129 [15:04:44] (03CR) 10Reedy: [C: 032] Remove empty conditionals for wikis from flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338129 (owner: 10Reedy) [15:05:04] (03PS3) 10Reedy: Add a few newlines to standardise spacing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338130 [15:05:09] (03CR) 10Reedy: [C: 032] Add a few newlines to standardise spacing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338130 (owner: 10Reedy) [15:05:15] DatGuy: :D [15:06:07] (03Merged) 10jenkins-bot: Remove empty conditionals for wikis from flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338129 (owner: 10Reedy) [15:06:23] (03Merged) 10jenkins-bot: Add a few newlines to standardise spacing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338130 (owner: 10Reedy) [15:06:32] (03PS2) 10Reedy: Remove EducationProgram config back compat hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341567 [15:06:37] (03CR) 10Reedy: [C: 032] Remove EducationProgram config back compat hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341567 (owner: 10Reedy) [15:06:50] (03CR) 10jenkins-bot: Remove empty conditionals for wikis from flaggedrevs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338129 (owner: 10Reedy) [15:07:34] 06Operations, 10ops-codfw, 15User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3084305 (10elukey) 
https://phabricator.wikimedia.org/T156023#3046855 as a reminder of the appservers status after the last rebalance. [15:08:34] !log rebooting mw225[123] as part of sanity check for T155180 [15:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:40] T155180: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180 [15:09:40] (03Merged) 10jenkins-bot: Remove EducationProgram config back compat hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341567 (owner: 10Reedy) [15:09:43] PROBLEM - Host mw2252 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:43] PROBLEM - Host mw2253 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:43] PROBLEM - Host mw2251 is DOWN: PING CRITICAL - Packet loss = 100% [15:10:13] RECOVERY - Host mw2253 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [15:10:13] RECOVERY - Host mw2251 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [15:10:23] RECOVERY - Host mw2252 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms [15:10:34] (03PS1) 10Giuseppe Lavagetto: profile::etcd::replication: add --strip option [puppet] - 10https://gerrit.wikimedia.org/r/341805 [15:15:53] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.073 second response time [15:19:58] !log rebooting mw22(5[4-9]|60) as part of sanity check for T155180 [15:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:03] T155180: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180 [15:21:33] PROBLEM - Host mw2257 is DOWN: PING CRITICAL - Packet loss = 100% [15:21:53] RECOVERY - Host mw2257 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms [15:29:01] 06Operations, 10Pywikibot-core, 10Traffic, 07HTTPS, and 2 others: Prepare pywikibot for http -> https switch in entity uri - https://phabricator.wikimedia.org/T159956#3084363 (10Lokal_Profil) [15:29:32] !log uploaded linux 4.9.13 for jessie-wikimedia/experimental to 
apt.wikimedia.org [15:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:16] 06Operations, 10Electron-PDFs, 06Services: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3083419 (10GWicke) The display assertion vaguely points towards xpra or Xorg. Smells like a race condition on service restart, possibly with the old Xor... [15:32:34] !log reedy@tin Synchronized wmf-config/flaggedrevs.php: Whitespace (duration: 00m 41s) [15:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:53] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3084373 (10Marostegui) So, for the record we saw something with the logical volume: ``` root@labsdb1007:~# lvs LV VG Attr... [15:33:39] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Remove EducationProgram config back compat (duration: 00m 41s) [15:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:18] 06Operations, 10Electron-PDFs, 06Services: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3083419 (10mobrovac) This is (unfortunately) a common scenario on service start-up. The current work-around is stopping the service, wait a bit and star... [15:34:25] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337#3084381 (10Papaul) [15:35:38] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337#3033850 (10Papaul) a:05Papaul>03fgiunchedi @fgiunchedi Installation complete. You good to take over. [15:36:47] (03PS1) 10Andrew Bogott: Upstart logrotate: Use copytruncate instead of delaycompress. 
[puppet] - 10https://gerrit.wikimedia.org/r/341808 (https://phabricator.wikimedia.org/T159141) [15:37:02] !log uploaded firmware-nonfree 20161130 for jessie-wikimedia/experimental to apt.wikimedia.org [15:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:24] 06Operations, 10DBA, 13Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3084404 (10Marostegui) Good news: the backups were scheduled to run today at 2AM and the first ones started around 45 minutes ago! :-) So I believe we are in a good shape here, let'... [15:42:21] (03PS1) 10Muehlenhoff: Add new meta package for Linux 4.9 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/341810 (https://phabricator.wikimedia.org/T154934) [15:44:29] (03CR) 10Muehlenhoff: [C: 032] Add new meta package for Linux 4.9 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/341810 (https://phabricator.wikimedia.org/T154934) (owner: 10Muehlenhoff) [15:50:34] (03PS1) 10Urbanecm: [throttle] Add new throttle rule+remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341812 (https://phabricator.wikimedia.org/T159957) [15:52:01] (03PS2) 10Urbanecm: [throttle] Add new throttle rule+remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341812 (https://phabricator.wikimedia.org/T159957) [15:54:04] (03CR) 10Gehel: [C: 031] "LGTM, will be merged during the upgrade of the production clusters" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340977 (owner: 10DCausse) [15:55:03] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:02:13] (03PS4) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) [16:02:42] (03CR) 10Gehel: [C: 04-1] maps - cleartables osm replication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [16:04:26] moritzm: Not to belabour the point, but when I hear "package X for Debian" I have flashbacks to trying to get something into the central repositories. Would it be acceptable to "simply" package 3d2png and distribute it locally somehow? [16:04:34] !log mobrovac@tin Started deploy [eventstreams/deploy@78e248c]: Deploy for T159486 [16:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:40] T159486: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486 [16:06:22] !log mobrovac@tin Finished deploy [eventstreams/deploy@78e248c]: Deploy for T159486 (duration: 01m 48s) [16:06:23] marktraceur: yeah, I was referring to preparing a package for apt.wikimedia.org [16:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:54] OK, that sounds less stressful and traumatic [16:07:14] moritzm: I'll set about brushing up on my Debian packaging. Here's hoping it's gotten easier, not harder [16:07:33] 06Operations, 10hardware-requests, 06Services (watching), 15User-mobrovac: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#3084487 (10mobrovac) [16:07:40] 06Operations, 13Patch-For-Review, 06Services (doing), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3084485 (10mobrovac) 05Open>03Resolved All of the services are now in sync. Resolving. 
[16:08:37] 06Operations, 06Services (done), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3084489 (10mobrovac) [16:08:50] (03PS1) 10Muehlenhoff: Add experimental repository to multatuli [puppet] - 10https://gerrit.wikimedia.org/r/341816 [16:09:45] marktraceur: it's much simpler these days, have a look at https://vincent.bernat.im/en/blog/2016-pragmatic-debian-packaging [16:10:02] Cheers! [16:10:13] moritzm: I'm hoping one of the scripts for node will work for me [16:13:27] (03CR) 10Muehlenhoff: [C: 032] Add experimental repository to multatuli [puppet] - 10https://gerrit.wikimedia.org/r/341816 (owner: 10Muehlenhoff) [16:13:45] !log Deploy alter table s6 revision table on dbstore1002 - T159414 [16:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:52] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [16:22:03] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:27:04] 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3084544 (10elukey) Yes I'd like to have the same version everywhere, we can coordinate with Traffic to roll out 0.9.... [16:27:53] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 893.07 seconds [16:28:47] ^ that is me, I will silence it [16:39:53] (03PS1) 10Alexandros Kosiaris: Change bacula retention policies [puppet] - 10https://gerrit.wikimedia.org/r/341817 [16:44:37] 06Operations, 10ops-codfw: wtp2019 has faulty memory - https://phabricator.wikimedia.org/T146009#3084589 (10Papaul) @joe is the system stay having memory issue so I can start the troubleshooting process and open a ticket with Dell? 
[16:49:37] (03PS2) 10Marostegui: db-eqiad.php: Add a few comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341797 (https://phabricator.wikimedia.org/T153743) [16:54:33] PROBLEM - puppet last run on db1070 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:39] (03PS2) 10Jcrespo: mariadb: Decouple parsercache role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341565 (https://phabricator.wikimedia.org/T150850) [17:17:13] 06Operations, 10Ops-Access-Requests: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3078358 (10RobH) I find it useful to paste in a checklist of the steps that have to be accomplished, just so folks can review and see what has happened: @... [17:17:13] PROBLEM - carbon-local-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [17:17:26] 06Operations, 10Ops-Access-Requests: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3084624 (10RobH) [17:18:13] RECOVERY - carbon-local-relay metric drops on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [17:21:37] !log installing Ubuntu imagemagick security updates (jessie already fixed) [17:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:33] RECOVERY - puppet last run on db1070 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:23:07] !log update RESTBase to 20e2c44c: staging [17:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:58] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Upstream: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#3084635 
(10MoritzMuehlenhoff) The update ha... [17:25:56] !log update RESTBase to 20e2c44c: canary on restbase1007 [17:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:23] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit emails are showing up as being sent late - https://phabricator.wikimedia.org/T159960#3084656 (10Paladox) [17:28:19] (03CR) 10Mobrovac: [C: 031] Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [17:28:57] !log update RESTBase to 20e2c44c [17:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:17] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit emails are showing up as being sent late - https://phabricator.wikimedia.org/T159960#3084643 (10Paladox) For example this change https://gerrit.wikimedia.org/r/341630 it was merged at 12:05am my time but my email is showing as it being... 
[17:33:25] (03CR) 10Jcrespo: [C: 032] mariadb: Decouple parsercache role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341565 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [17:38:20] (03PS7) 10Paladox: Gerrit: Use the mariadb plugin instead of mysql [puppet] - 10https://gerrit.wikimedia.org/r/336003 (https://phabricator.wikimedia.org/T145885) [17:38:25] (03PS8) 10Paladox: Gerrit: Use the mariadb plugin instead of mysql [puppet] - 10https://gerrit.wikimedia.org/r/336003 (https://phabricator.wikimedia.org/T145885) [17:38:52] (03PS9) 10Paladox: Gerrit: Use the mariadb plugin instead of mysql [puppet] - 10https://gerrit.wikimedia.org/r/336003 (https://phabricator.wikimedia.org/T145885) [17:46:10] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Add a few comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341797 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [17:46:18] 06Operations, 10Ops-Access-Requests: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3084701 (10Ottomata) [17:47:30] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit emails are showing up as being sent late - https://phabricator.wikimedia.org/T159960#3084643 (10Reedy) Might want to post the full email headers of the offending email(s) [17:50:30] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit emails are showing up as being sent late - https://phabricator.wikimedia.org/T159960#3084721 (10Paladox) @Reedy hi, how can i do that please? 
[17:51:14] (03Merged) 10jenkins-bot: db-eqiad.php: Add a few comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341797 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [17:52:09] (03Abandoned) 10Jcrespo: dbstore: configuration changes to make InnoDB the main storage [puppet] - 10https://gerrit.wikimedia.org/r/328671 (https://phabricator.wikimedia.org/T130128) (owner: 10Jcrespo) [17:52:23] PROBLEM - DPKG on labstore1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:52:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: db1070 ROW based replication comments - T153743 (duration: 00m 41s) [17:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:32] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [17:52:47] (03Abandoned) 10Jcrespo: Revert "eventlogging_sync: By pass ssl check on localhost" [puppet] - 10https://gerrit.wikimedia.org/r/325279 (owner: 10Jcrespo) [17:54:23] RECOVERY - DPKG on labstore1001 is OK: All packages OK [17:55:34] (03PS6) 10Subramanya Sastry: Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 [17:58:12] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit emails are showing up as being sent late - https://phabricator.wikimedia.org/T159960#3084782 (10Paladox) @Reedy https://phabricator.wikimedia.org/P5028 [18:03:16] (03Draft2) 10Reedy: Actually run generatecaptcha cronjob on 1st day of every month [puppet] - 10https://gerrit.wikimedia.org/r/341823 (https://phabricator.wikimedia.org/T159581) [18:03:52] jouncebot: next [18:03:52] In 0 hour(s) and 56 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170308T1900) [18:04:09] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit emails are showing up as being sent late - 
https://phabricator.wikimedia.org/T159960#3084794 (10Paladox) https://gerrit-review.googlesource.com/ is working in sending the email to my email at the correct times. [18:04:59] (03CR) 10Jcrespo: [C: 031] Actually run generatecaptcha cronjob on 1st day of every month [puppet] - 10https://gerrit.wikimedia.org/r/341823 (https://phabricator.wikimedia.org/T159581) (owner: 10Reedy) [18:05:42] (03CR) 10jenkins-bot: Add a few newlines to standardise spacing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338130 (owner: 10Reedy) [18:07:00] (03CR) 10jenkins-bot: Remove EducationProgram config back compat hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341567 (owner: 10Reedy) [18:08:11] (03CR) 10jenkins-bot: db-eqiad.php: Add a few comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341797 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [18:09:01] (03CR) 10Dzahn: [C: 032] Actually run generatecaptcha cronjob on 1st day of every month [puppet] - 10https://gerrit.wikimedia.org/r/341823 (https://phabricator.wikimedia.org/T159581) (owner: 10Reedy) [18:09:26] (03PS3) 10Dzahn: mediawiki::maintenance: Actually run generatecaptcha cronjob on 1st day of every month [puppet] - 10https://gerrit.wikimedia.org/r/341823 (https://phabricator.wikimedia.org/T159581) (owner: 10Reedy) [18:10:13] PROBLEM - carbon-local-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [18:11:13] RECOVERY - carbon-local-relay metric drops on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [18:12:00] (03PS1) 10Jcrespo: mariadb: Decouple mariadb::misc role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341825 (https://phabricator.wikimedia.org/T150850) [18:18:17] (03PS1) 10DCausse: Elastic 5.1.2 plugins [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/341826 [18:18:54] (03CR) 10Dzahn: [C: 032] Save logs of generate CAPTCHA cron to /var/log/mediawiki 
[puppet] - 10https://gerrit.wikimedia.org/r/341197 (https://phabricator.wikimedia.org/T159610) (owner: 10Florianschmidtwelzow) [18:18:59] (03PS2) 10Dzahn: Save logs of generate CAPTCHA cron to /var/log/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/341197 (https://phabricator.wikimedia.org/T159610) (owner: 10Florianschmidtwelzow) [18:23:38] 06Operations, 07Puppet, 13Patch-For-Review: GenerateFancyCaptchas cronjob should output to logfile - https://phabricator.wikimedia.org/T159610#3084856 (10Reedy) 05Open>03Resolved [18:26:33] PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:28:12] (03PS2) 10Jcrespo: mariadb: Decouple mariadb::misc role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341825 (https://phabricator.wikimedia.org/T150850) [18:28:51] (03Abandoned) 10DCausse: Upgrade to elastic 5.2.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340977 (owner: 10DCausse) [18:37:26] 06Operations, 10Electron-PDFs, 06Services: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3084898 (10GWicke) As an immediate work-around, maybe we could add a delay in the restart process? One way to do this might be to add a sleep in https:/... 
[18:39:19] (03PS3) 10Dzahn: phabricator: monitor PHD service only on active server [puppet] - 10https://gerrit.wikimedia.org/r/341747 [18:46:36] (03PS4) 10Volans: Add support for batch processing [software/cumin] - 10https://gerrit.wikimedia.org/r/341310 (https://phabricator.wikimedia.org/T159968) [18:47:10] (03CR) 10Volans: Add support for batch processing (036 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/341310 (https://phabricator.wikimedia.org/T159968) (owner: 10Volans) [18:47:14] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit emails are showing up as being sent late - https://phabricator.wikimedia.org/T159960#3084931 (10Paladox) gerrit test comment for emails (testing if it affects phabricator too) [18:47:44] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit emails are showing up as being sent late - https://phabricator.wikimedia.org/T159960#3084933 (10Paladox) Yep affects phabricator too. [18:48:16] ottomata hi, im wondering could you triage https://phabricator.wikimedia.org/T159960 please? 
(The email servers are slow at sending emails) [18:48:24] Noticed on phabricator and gerrit [18:48:37] please [18:48:50] mutante RainbowSprinkles twentyafterfour ^^ [18:49:41] 06Operations, 10Gerrit, 10Phabricator, 06Release-Engineering-Team, 07Upstream: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3084940 (10Paladox) [18:51:28] 06Operations, 10Ops-Access-Requests: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3084958 (10Ottomata) p:05Triage>03Normal [18:51:40] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3084643 (10Paladox) [18:54:33] RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:55:01] 06Operations, 10Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#3084977 (10Ottomata) @matmarex, perhaps, but what I'm reading is that folks think that MW should handle short URLs with query params consistently, which isn't operations. N... [18:56:02] 06Operations, 10Electron-PDFs, 06Services: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3084981 (10Ottomata) p:05Triage>03Normal [18:56:13] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3084982 (10Paladox) I found this in the log on phabricator test instantance in labs /var/log/exim4/mainlog... 
[18:57:04] 06Operations, 07Epic, 03Interactive-Sprint, 06Maps (Maps-data): Epic: backup vector tiles - https://phabricator.wikimedia.org/T159770#3084984 (10Ottomata) p:05Triage>03Normal [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170308T1900). Please do the needful. [19:00:23] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:01:20] (03PS2) 10Dzahn: typos: add 'criticial' [puppet] - 10https://gerrit.wikimedia.org/r/341434 [19:01:36] paladox: hi, yeah hm. not sure if this is an operations tag or not, but release engineering is probably right [19:01:49] 06Operations, 10Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#3085016 (10matmarex) MediaWiki already does it consistently and correctly, but it can't if it doesn't have access to the unmangled URL. Something in our server setup mangles... [19:01:56] Ok [19:02:11] No change to deploy for SWAT. [19:02:18] (03CR) 10Dzahn: "this is blocked by https://gerrit.wikimedia.org/r/#/c/341427/ which did not get merged" [puppet] - 10https://gerrit.wikimedia.org/r/341434 (owner: 10Dzahn) [19:02:40] ottomata it is unlikly to be the way phab and gerrit is configured i think. It may affect other softwear wikimedia uses but i havent seen any of those yet [19:02:50] 06Operations, 06Labs, 07Tracking: Cleanup tools nfs share on labstore1004/5 - https://phabricator.wikimedia.org/T156982#3085020 (10madhuvishy) Doing this again today: Truncating log/err/out files over 100M. List of files: ``` 101G /srv/tools/shared/tools/project/whymbot/enwikt.err 93G /srv/tools/shared/tool... [19:02:51] like could just be big email queue? 
[19:02:51] (03CR) 10jerkins-bot: [V: 04-1] typos: add 'criticial' [puppet] - 10https://gerrit.wikimedia.org/r/341434 (owner: 10Dzahn) [19:02:53] as i have mostly been looking at phabricator and gerrit [19:02:55] ya maybe if it is both [19:03:03] oh [19:03:12] I was thinking that too [19:03:46] ottomata took 8 hours for an email to be sent to me [19:03:47] from gerrit [19:04:06] paladox: I think that's rubbish [19:04:16] I think 8 hours sounds like the time difference between you and PST [19:04:30] oh [19:04:53] but if that was true then woulden't it have been more 12am which i think is around 4 or 5 pm pst [19:05:52] Meh, if nothing to swat, I'll push a couple of cherry picks [19:06:57] I'll also have a 3-8 minutes to test change for config in a few minutes. [19:07:55] (03PS1) 10Tarrow: remove elasticsearch plugin_dir setting [puppet] - 10https://gerrit.wikimedia.org/r/341831 [19:08:35] (03PS1) 10Dereckson: Reenable Collection on srn.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341832 (https://phabricator.wikimedia.org/T158467) [19:08:39] Reedy: this one ^ [19:09:09] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085033 (10Ottomata) p:05Triage>03Low Hard to tell if this is a phab/gerrit problem, or just a slow emai... [19:09:43] 06Operations, 10Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#3085035 (10Ottomata) Should this be merged into the parent task then? [19:10:11] Reedy: can you ping me when you're done with the cherry-picks? I'll be afk next 20 minutes. [19:10:41] Dereckson: Mine are a couple of extension ones [19:11:30] (03CR) 10Tarrow: "I think after the changes to this earlier this week this was wrongly left behind. 
This seems necessary for the role to work on labs (since" [puppet] - 10https://gerrit.wikimedia.org/r/341831 (owner: 10Tarrow) [19:14:03] PROBLEM - puppet last run on mc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:15:48] (03CR) 10Dzahn: "i guess we still use precise on grid..." [puppet] - 10https://gerrit.wikimedia.org/r/337207 (https://phabricator.wikimedia.org/T143349) (owner: 10Dzahn) [19:18:18] (03PS1) 10GWicke: Delay service shut-down to work around xpra race [puppet] - 10https://gerrit.wikimedia.org/r/341833 (https://phabricator.wikimedia.org/T159922) [19:18:53] (03Abandoned) 10Dzahn: labs_vagrant: drop precise support [puppet] - 10https://gerrit.wikimedia.org/r/337205 (https://phabricator.wikimedia.org/T143349) (owner: 10Dzahn) [19:18:58] (03PS2) 10GWicke: PDFRender: Delay service shut-down to work around xpra race [puppet] - 10https://gerrit.wikimedia.org/r/341833 (https://phabricator.wikimedia.org/T159922) [19:20:25] 06Operations, 10Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#3085106 (10matmarex) Maybe? I guess it's a special case of that. Do whatever makes it easier for you folks to get it fixed :) [19:22:40] (03CR) 10Dzahn: [C: 031] "i support this but need to clean out my queue and it doesn't look like it's gonna be merged soon" [puppet] - 10https://gerrit.wikimedia.org/r/337842 (https://phabricator.wikimedia.org/T114063) (owner: 10Hashar) [19:27:06] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:27:09] (03CR) 10Dzahn: "the (ACKed) alert about PHD not running on phab2001 is now gone from Icinga while it still exists on iridium, as intended" [puppet] - 10https://gerrit.wikimedia.org/r/341747 (owner: 10Dzahn) [19:28:12] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085148 (10demon) >>! In T159960#3084982, @Paladox wrote: > I found this in the log on phabricator test inst... [19:28:26] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [19:29:00] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085150 (10Paladox) Oh that's an ipv6 address the ipv4 one works. [19:30:05] (03CR) 10Mobrovac: [C: 031] "Works in BC, PCC also looking good - https://puppet-compiler.wmflabs.org/5692/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/341833 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [19:30:36] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085158 (10demon) Either way, we shouldn't be sending labs e-mails via the prod mailserver. I'm not worried... [19:31:25] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085161 (10demon) Also: I'm not seeing any delay on getting Phabricator e-mails. For example: replies to thi... 
[19:32:13] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085163 (10demon) Same with gerrit, getting all my e-mails. I see no production problem on our end [19:32:45] (03CR) 10Addshore: [C: 04-1] remove elasticsearch plugin_dir setting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341831 (owner: 10Tarrow) [19:33:21] (03PS1) 10Andrew Bogott: Raise query limit for nova-admin user. [puppet] - 10https://gerrit.wikimedia.org/r/341836 (https://phabricator.wikimedia.org/T149109) [19:34:03] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085192 (10demon) 05Open>03Invalid Last point: there's no back up of pending jobs on either gerrit or ph... [19:34:16] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085194 (10Paladox) Oh, what about merging changes in gerrit? [19:34:27] (03CR) 10Paladox: "test comment" [puppet] - 10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: 10Paladox) [19:35:42] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085199 (10Paladox) But how comes email's are showing up as late for me? gerri-review is working perfectly. 
[19:36:15] !log reedy@tin Synchronized php-1.29.0-wmf.14/extensions/ConfirmEdit: Maintenance script updates (duration: 00m 50s) [19:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:26] Dereckson: Done [19:36:33] thanks [19:36:43] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085206 (10demon) I'm pretty sure this is a local issue for you / your ISP, nobody else is seeing this issue. [19:37:00] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085207 (10Paladox) I did a comment on https://gerrit.wikimedia.org/r/#/c/340900/6 and am still waiting for... [19:37:01] (03PS7) 10Paladox: Gerrit: Add some apache rewrite rules for polygerrit [puppet] - 10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) [19:37:31] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085209 (10Paladox) I use yahoo mail's web ui. [19:37:43] (03CR) 10BryanDavis: [C: 031] Raise query limit for nova-admin user. [puppet] - 10https://gerrit.wikimedia.org/r/341836 (https://phabricator.wikimedia.org/T149109) (owner: 10Andrew Bogott) [19:38:06] (03PS2) 10Andrew Bogott: Raise query limit for nova-admin user. [puppet] - 10https://gerrit.wikimedia.org/r/341836 (https://phabricator.wikimedia.org/T149109) [19:38:15] (03PS3) 10Andrew Bogott: Raise query limit for nova-admin user. 
[puppet] - 10https://gerrit.wikimedia.org/r/341836 (https://phabricator.wikimedia.org/T149109) [19:38:47] (03CR) 10Dereckson: [C: 032] Reenable Collection on srn.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341832 (https://phabricator.wikimedia.org/T158467) (owner: 10Dereckson) [19:40:22] (03CR) 10Andrew Bogott: [C: 032] Raise query limit for nova-admin user. [puppet] - 10https://gerrit.wikimedia.org/r/341836 (https://phabricator.wikimedia.org/T149109) (owner: 10Andrew Bogott) [19:41:06] RECOVERY - puppet last run on mc1004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [19:41:52] (03CR) 10Paladox: "test comment" [puppet] - 10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: 10Paladox) [19:42:16] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:42:28] (03Merged) 10jenkins-bot: Reenable Collection on srn.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341832 (https://phabricator.wikimedia.org/T158467) (owner: 10Dereckson) [19:42:41] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: 10Paladox) [19:42:43] (03CR) 10jenkins-bot: Reenable Collection on srn.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341832 (https://phabricator.wikimedia.org/T158467) (owner: 10Dereckson) [19:43:00] (03CR) 10Paladox: "check" [puppet] - 10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: 10Paladox) [19:43:19] !log Upgraded nslcd and libnss-ldapd in labstore100[1,2,4,5] [19:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:45] (03CR) 10Dereckson: "If the new URL in polygerrit would be the definitive ones, perhaps it would be useful to force client browsers and search engine to adopt " [puppet] - 
10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: 10Paladox) [19:45:06] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085275 (10Paladox) I have now tested with my outlook account and same problem email taking too long to show... [19:45:34] uh oh [19:47:03] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085279 (10demon) Cannot replicate: I just confirmed a new e-mail within about 5 seconds. [19:47:29] 06Operations, 10Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#3085280 (10Ottomata) Ha, I'm not sure it'll help get it fixed, but it will help with ticket proliferation :) [19:47:43] 06Operations, 10Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#3085285 (10Ottomata) [19:47:46] 06Operations, 10Traffic, 10Wikimedia-Apache-configuration: Data passed to HHVM ($_SERVER variables) is a mixed bag of already-decoded and non-decoded nonsense - https://phabricator.wikimedia.org/T132629#3085283 (10Ottomata) [19:48:16] (03PS1) 10Dereckson: Fix portals submodule version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341842 [19:48:41] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late also affecting other service like phabricator - https://phabricator.wikimedia.org/T159960#3085288 (10Paladox) Yes that worked too, but after confirming the email and writing a comment here https://g... 
[19:49:23] (03CR) 10Dereckson: [C: 032] Fix portals submodule version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341842 (owner: 10Dereckson) [19:50:07] PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:51:13] (03Merged) 10jenkins-bot: Fix portals submodule version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341842 (owner: 10Dereckson) [19:51:37] For Collection on srn.wikipedia, I've on mwdebug1002 a good PDF document, all works fine, indeed. [19:54:29] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Reenable Collection on srn.wikipedia (T158467) (duration: 00m 46s) [19:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:34] T158467: Re-enable Collection on Sranan Wikipedia (srnwiki) - https://phabricator.wikimedia.org/T158467 [19:55:06] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [19:55:21] 06Operations, 10Mail: E-mail for people in different OIT LDAP object unit - https://phabricator.wikimedia.org/T159750#3085338 (10bbogaert) [19:58:15] 06Operations, 10Analytics, 10Analytics-Cluster: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3085345 (10Nuria) [19:58:18] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Move cloudera packages to a separate archive section - https://phabricator.wikimedia.org/T155726#3085344 (10Nuria) 05Open>03Resolved [19:58:38] 06Operations, 10Gerrit, 10Mail, 10Phabricator, and 2 others: Gerrit emails are showing up as being sent late - https://phabricator.wikimedia.org/T159960#3085349 (10Paladox) [19:58:49] 06Operations, 10Gerrit, 10Mail, 06Release-Engineering-Team: Gerrit emails are showing up as being sent late - https://phabricator.wikimedia.org/T159960#3084643 (10Paladox) [19:59:31] jouncebot: next [19:59:32] In 0 hour(s) and 0 
minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170308T2000) [19:59:37] * Dereckson is done. [20:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170308T2000). Please do the needful. [20:00:41] 06Operations, 10Analytics, 10Analytics-Cluster: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3085380 (10Nuria) [20:01:42] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Update Zookeeper heap usage configuration and set alarms - https://phabricator.wikimedia.org/T157968#3085393 (10Nuria) [20:03:23] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Update Zookeeper heap usage configuration and set alarms - https://phabricator.wikimedia.org/T157968#3021877 (10Nuria) 05Open>03Resolved [20:08:07] (03PS1) 1020after4: group1 wikis to 1.29.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341844 [20:08:09] (03CR) 1020after4: [C: 032] group1 wikis to 1.29.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341844 (owner: 1020after4) [20:09:40] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341844 (owner: 1020after4) [20:10:08] (03CR) 10jenkins-bot: Fix portals submodule version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341842 (owner: 10Dereckson) [20:10:10] (03CR) 10jenkins-bot: group1 wikis to 1.29.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341844 (owner: 1020after4) [20:10:16] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:19:16] RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:24:31] (03CR) 10Paladox: [C: 031] "recheck" [debs/gerrit] - 
10https://gerrit.wikimedia.org/r/333475 (owner: 10Paladox) [20:42:59] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3085505 (10jcrespo) > Hey, I am not saying it is going to work 100% sure- I am just suggesting to try it first, and then go the slow route, which is... [20:43:35] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3085507 (10ZMcCune) 05Resolved>03Open Per conversations on Wikimedia-l (https://lists.wikimedia.org/pipermail/wikimedia-l/2017-March/086629.html) and w... [20:44:16] PROBLEM - puppet last run on db1064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:49:39] (03PS1) 10Krinkle: Remove mentions of fluorine in old comments and descriptions [puppet] - 10https://gerrit.wikimedia.org/r/341847 (https://phabricator.wikimedia.org/T123728) [20:50:15] (03CR) 10Krinkle: "I found one more mention of fluorine in actual code (not comments). logging/mediawiki/errors.pp resolves fluorine before mwlog1001. Seems " [puppet] - 10https://gerrit.wikimedia.org/r/341789 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [20:59:15] twentyafterfour: did you finish the train deploy? [20:59:57] legoktm: no I'm trying to fix https://phabricator.wikimedia.org/T159881 real quick first [20:59:59] :-/ [21:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170308T2100). [21:00:18] https://gerrit.wikimedia.org/r/#/c/341849/ is the fix I hope [21:01:08] are you looking for someone to +2 it or ? 
[21:01:34] well RainbowSprinkles is looking at it for me but I'd welcome any CR+ [21:02:46] you haz +2 [21:03:23] thanks! [21:04:15] lol [21:11:22] !log arlolra@tin Started deploy [parsoid/deploy@0c22f72]: Updating Parsoid to dec47257 [21:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:16] RECOVERY - puppet last run on db1064 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [21:19:41] !log arlolra@tin Finished deploy [parsoid/deploy@0c22f72]: Updating Parsoid to dec47257 (duration: 08m 19s) [21:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:41] twentyafterfour: can you ping me once you're done then? I have a mw config change to follow up the parsoid deploy [21:23:37] legoktm: Could you merge and pull https://gerrit.wikimedia.org/r/#/c/341769/1 as well? (no-op) [21:23:54] sure [21:24:04] (03CR) 10Krinkle: [C: 031] Followup commit to I9f02dee3cea543234 (style fix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341769 (owner: 10Jcrespo) [21:24:31] (03CR) 10Krinkle: Upcoming mediawiki-core hardware expansion (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338996 (https://phabricator.wikimedia.org/T158580) (owner: 10Jcrespo) [21:27:19] !log Updated Parsoid to dec47257 (T59603) [21:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:25] T59603: Create a {{PAGELANGUAGE}} magic word - https://phabricator.wikimedia.org/T59603 [21:30:42] legoktm: syncing then I'm done [21:31:05] !log twentyafterfour@tin Synchronized php-1.29.0-wmf.15/extensions/CodeReview/backend/CodeCommentLinker.php: deploy https://gerrit.wikimedia.org/r/#/c/341857/ (duration: 00m 46s) [21:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:23] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.15 [21:31:27] done [21:31:31] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:38] thanks [21:32:44] (03PS3) 10Legoktm: Enable Linter on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335052 [21:32:46] (03PS2) 10Legoktm: Followup commit to I9f02dee3cea543234 (style fix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341769 (owner: 10Jcrespo) [21:32:58] (03CR) 10Legoktm: [C: 032] Enable Linter on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335052 (owner: 10Legoktm) [21:33:04] (03CR) 10Legoktm: [C: 032] Followup commit to I9f02dee3cea543234 (style fix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341769 (owner: 10Jcrespo) [21:34:10] (03Merged) 10jenkins-bot: Enable Linter on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335052 (owner: 10Legoktm) [21:34:25] (03Merged) 10jenkins-bot: Followup commit to I9f02dee3cea543234 (style fix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341769 (owner: 10Jcrespo) [21:34:26] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3085630 (10MaxSem) Can we use this opportunity to reimport the data from scratch to get rid of possible accumulated OSM replication errors? [21:37:43] legoktm: thx [21:38:14] !log mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=testwiki linter [21:38:15] np [21:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:36] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Enable Linter on testwiki - T148609 (1/2) (duration: 00m 44s) [21:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:43] T148609: Review and deploy Linter extension to Wikimedia wikis - https://phabricator.wikimedia.org/T148609 [21:40:21] (03CR) 10Krinkle: "Thanks, looks good. 
Next steps would presumably be minimising the "differences" sections at the end of the test data by using the browser_" [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [21:41:25] !log legoktm@tin Synchronized wmf-config/CommonSettings.php: Enable Linter on testwiki - T148609 (2/2) (duration: 00m 41s) [21:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:25] Notice: Undefined variable: wmgUseLinter in /srv/mediawiki/wmf-config/CommonSettings.php on line 2129 [21:43:41] !log arlolra@tin Started restart [parsoid/deploy@0c22f72]: (no justification provided) [21:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:24] twentyafterfour: uh, I definitely synced it in the right order [21:44:41] hmm only showed up 6 times so I guess it's an anomaly [21:44:53] doing the train now twentyafterfour ? [21:45:02] matanya: I'm done with the train [21:45:08] just monitoring logs [21:46:12] legoktm: Sometimes it doesn't work 100% [21:46:21] If it showed up a few times, and then went away, nothing to worry about [21:49:03] (03CR) 10jenkins-bot: Enable Linter on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335052 (owner: 10Legoktm) [21:49:07] (03CR) 10jenkins-bot: Followup commit to I9f02dee3cea543234 (style fix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341769 (owner: 10Jcrespo) [21:56:15] 06Operations, 06Labs: Remove linux kernel 3.16 from the jessie image on labs - https://phabricator.wikimedia.org/T159990#3085707 (10Paladox) [21:56:21] 06Operations, 06Labs: Remove linux kernel 3.16 from the jessie image on labs - https://phabricator.wikimedia.org/T159990#3085719 (10Paladox) p:05Triage>03High [21:58:06] 06Operations, 06Labs: Remove linux kernel 3.16 from the jessie image on labs - https://phabricator.wikimedia.org/T159990#3085707 (10MoritzMuehlenhoff) Why does in break the system if you upgrade the unused 3.16 
kernel? [21:58:16] 06Operations, 06Labs: Remove linux kernel 3.16 from the jessie image on labs - https://phabricator.wikimedia.org/T159990#3085725 (10yuvipanda) p:05High>03Normal @MoritzMuehlenhoff any idea what's the right thing to do here? [21:59:55] 06Operations, 06Labs: Remove linux kernel 3.16 from the jessie image on labs - https://phabricator.wikimedia.org/T159990#3085727 (10Paladox) >>! In T159990#3085723, @MoritzMuehlenhoff wrote: > Why does in break the system if you upgrade the unused 3.16 kernel? It seemed to have broke gerrit-test3. Since after... [22:10:45] !log resuming running refreshLinks.php on small wikis [22:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:23] (03CR) 10Dzahn: [C: 032] Remove mentions of fluorine in old comments and descriptions [puppet] - 10https://gerrit.wikimedia.org/r/341847 (https://phabricator.wikimedia.org/T123728) (owner: 10Krinkle) [22:12:20] (03CR) 10Dzahn: "thanks for the clean-up" [puppet] - 10https://gerrit.wikimedia.org/r/341847 (https://phabricator.wikimedia.org/T123728) (owner: 10Krinkle) [22:20:29] 06Operations, 07HHVM: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3028990 (10Reedy) Looks like we're getting a few fatals on beta cluster too... [22:22:01] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3085786 (10Dzahn) patch set 2 of the changes above has been deployed just now [22:23:46] 06Operations, 07HHVM: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3085787 (10Reedy) Looks pretty much the same what @Nikerabbit is seeing ``` root@deployment-mediawiki04:/var/log/hhvm# cat /var/log/hhvm/stacktrace.16002.log Host: deployment-mediawiki04 ProcessID: 16002 ThreadID: 1... 
[22:30:03] (03PS1) 10Reedy: hhvm.server.stat_cache = false [puppet] - 10https://gerrit.wikimedia.org/r/341916 (https://phabricator.wikimedia.org/T158176) [22:30:58] 06Operations, 07HHVM, 13Patch-For-Review: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3085800 (10Reedy) ^ FWIW, not necessarily suggesting we merge that for production, but going to cherry pick onto the deployment puppetmaster for the time being to stop beta breaking [22:35:08] (03PS1) 1020after4: Add config for elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) [22:36:30] (03CR) 10jerkins-bot: [V: 04-1] Add config for elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [22:39:32] 06Operations, 10MediaWiki-JobQueue, 10Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3085825 (10Legoktm) 05Open>03Resolved a:03Legoktm >>! In T159618#3080998, @Betacommand wrote: > The edit rate may have been the issue, but we should still utilize the to... [22:39:38] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3085828 (10Dzahn) [22:41:12] (03Draft1) 10Paladox: Gerrit: Add a new $gerrit_auth_email variable [labs/private] - 10https://gerrit.wikimedia.org/r/341918 [22:41:14] (03PS2) 10Paladox: Gerrit: Add a new $gerrit_auth_email variable [labs/private] - 10https://gerrit.wikimedia.org/r/341918 [22:42:10] (03CR) 10Chad: [C: 04-1] "That's what gerrit_email_key already does...." 
[labs/private] - 10https://gerrit.wikimedia.org/r/341918 (owner: 10Paladox) [22:42:39] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3085845 (10Dzahn) labsdb10071 has been reinstalled with jessie: count: 4 [22:43:06] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3085846 (10Dzahn) [22:43:36] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:43:49] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3085848 (10ZMcCune) Seems like changed photos are not displaying (https://annual.wikimedia.org/2016/fact-8.html). @Varnent - any ideas on cause here? [22:47:16] (03PS2) 1020after4: Add config for elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) [22:48:25] (03CR) 1020after4: [C: 031] "it should be safe to deploy this ahead of the corresponding code deployment." [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [22:48:42] (03CR) 10jerkins-bot: [V: 04-1] Add config for elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [22:48:46] (03CR) 10Paladox: "Please :)" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [22:51:44] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3085868 (10jrbs) Looks like the image "malasari-birdwatching.jpg" wasn't deployed or is mistyped. 
[22:52:39] 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3085869 (10Pchelolo) [22:54:37] 06Operations, 10Parsoid: Separate subdomain for parsoid visual diff test service on ruthenium - https://phabricator.wikimedia.org/T159995#3085873 (10ssastry) [22:55:30] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3085885 (10jrbs) >>! In T151798#3085868, @jrbs wrote: > Looks like the image "malasari-birdwatching.jpg" wasn't deployed or is mistyped. Oh, it looks like... [22:56:22] (03PS3) 1020after4: Add config for elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) [22:56:32] 06Operations: decom fluorine - https://phabricator.wikimedia.org/T159996#3085886 (10Dzahn) [22:56:54] 06Operations: decom fluorine - https://phabricator.wikimedia.org/T159996#3085886 (10Dzahn) [22:58:13] 06Operations, 10Parsoid, 10Traffic: Separate subdomain for parsoid visual diff test service on ruthenium - https://phabricator.wikimedia.org/T159995#3085900 (10Dzahn) [22:58:54] (03PS3) 10Paladox: Gerrit: Have a bogus valus for $gerrit_phab_cert [labs/private] - 10https://gerrit.wikimedia.org/r/341918 [22:59:03] 06Operations, 10Parsoid, 10Traffic: Separate subdomain for parsoid visual diff test service on ruthenium - https://phabricator.wikimedia.org/T159995#3085873 (10Dzahn) This will need a DNS change to add the new new name and then a misc-web varnish change to add a new director/backend. Adding 'traffic' and 'd... [22:59:31] 06Operations, 10DNS, 10Parsoid, 10Traffic: Separate subdomain for parsoid visual diff test service on ruthenium - https://phabricator.wikimedia.org/T159995#3085915 (10Dzahn) [23:00:20] (03CR) 10Paladox: "Can this be done through the main.pp file?" 
[puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:00:25] (03CR) 10Chad: "Spelling nit inline, otherwise lgtm" (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/341918 (owner: 10Paladox) [23:00:32] (03CR) 10Paladox: "As we need to configure this." [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:00:50] (03PS4) 10Paladox: Gerrit: Have a bogus valus for $gerrit_phab_cert [labs/private] - 10https://gerrit.wikimedia.org/r/341918 [23:01:04] (03CR) 10Paladox: Gerrit: Have a bogus valus for $gerrit_phab_cert (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/341918 (owner: 10Paladox) [23:02:22] 06Operations, 10DNS, 10Parsoid, 10Traffic: Separate subdomain for parsoid visual diff test service on ruthenium - https://phabricator.wikimedia.org/T159995#3085933 (10Dzahn) existing varnish config, `modules/role/manifests/cache/misc.pp` ``` 'parsoid-tests.wikimedia.org' => { 'director' => 'rutheni... [23:05:34] (03CR) 10Chad: [C: 032] Gerrit: Have a bogus valus for $gerrit_phab_cert [labs/private] - 10https://gerrit.wikimedia.org/r/341918 (owner: 10Paladox) [23:05:45] (03CR) 10Chad: [V: 032 C: 032] Gerrit: Have a bogus valus for $gerrit_phab_cert [labs/private] - 10https://gerrit.wikimedia.org/r/341918 (owner: 10Paladox) [23:05:59] 06Operations, 10DNS, 10Parsoid, 10Traffic: Separate subdomain for parsoid visual diff test service on ruthenium - https://phabricator.wikimedia.org/T159995#3085943 (10Dzahn) eh.. wait .. or does it just need a second domain name for the same port 8001 (which it is now using while both 8003 and 8010 listen... [23:06:27] (03CR) 10Paladox: "Need to press the submit thingy." 
[labs/private] - 10https://gerrit.wikimedia.org/r/341918 (owner: 10Paladox) [23:06:37] (03PS4) 1020after4: Add config for elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) [23:07:09] (03CR) 10Paladox: "Thanks :)" [labs/private] - 10https://gerrit.wikimedia.org/r/341918 (owner: 10Paladox) [23:08:23] (03CR) 1020after4: "https://puppet-compiler.wmflabs.org/5695/iridium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:09:11] 06Operations, 10hardware-requests: decom fluorine - https://phabricator.wikimedia.org/T159996#3085948 (10RobH) [23:11:36] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [23:12:52] 06Operations, 10hardware-requests: decom fluorine - https://phabricator.wikimedia.org/T159996#3085965 (10RobH) [23:13:01] 06Operations, 10DNS, 10Parsoid, 10Traffic: Separate subdomain for parsoid visual diff test service on ruthenium - https://phabricator.wikimedia.org/T159995#3085966 (10Dzahn) yes, the latter it is. 
should just be a second name, and both port 8001, nginx proxies from there [23:13:38] (03PS3) 10Paladox: Add mariadb-java-client [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) [23:16:10] (03PS1) 10Dzahn: add parsoid-vd-tests.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/341920 (https://phabricator.wikimedia.org/T159995) [23:17:37] (03PS5) 1020after4: Add config for elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) [23:17:39] (03CR) 10Paladox: Add config for elasticsearch cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:17:41] (03PS2) 10Dzahn: add parsoid-vd-tests.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/341920 (https://phabricator.wikimedia.org/T159995) [23:19:23] (03CR) 10Paladox: Add config for elasticsearch cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:19:30] (03CR) 1020after4: [C: 031] "https://puppet-compiler.wmflabs.org/5696/iridium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:21:06] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:22:01] (03PS6) 10Paladox: Add config for elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:22:35] (03CR) 10Paladox: "Ive added a phabricator_protocole hiera config so i can make it http for now in labs. As i will need to test https." [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:23:12] (03CR) 10Paladox: [C: 031] "The rest of my feedback can be done in a future patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:23:53] (03PS24) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) [23:24:06] (03PS1) 10Dzahn: parsoid: rename parsoid-tests to parsoid-rt-tests [dns] - 10https://gerrit.wikimedia.org/r/341923 (https://phabricator.wikimedia.org/T159995) [23:25:06] (03PS2) 10Dzahn: rename parsoid-tests to parsoid-rt-tests [dns] - 10https://gerrit.wikimedia.org/r/341923 (https://phabricator.wikimedia.org/T159995) [23:31:15] (03PS1) 10Dzahn: varnish/misc: add parsoid-vd-tests -> ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/341925 (https://phabricator.wikimedia.org/T159995) [23:31:17] (03PS1) 10Dzahn: varnish/misc: rename parsoid-tests to parsoid-rt-tests [puppet] - 10https://gerrit.wikimedia.org/r/341926 (https://phabricator.wikimedia.org/T159995) [23:32:08] 06Operations, 10DNS, 10Parsoid, 10Traffic, 13Patch-For-Review: Separate subdomain for parsoid visual diff test service on ruthenium - https://phabricator.wikimedia.org/T159995#3086018 (10ssastry) Followup request on IRC (for which dzahn has uploaded a patch): rename parsoid-tests to parsoid-rt-tests (for... [23:34:30] (03PS2) 10Dzahn: varnish/misc: rename parsoid-tests to parsoid-rt-tests [puppet] - 10https://gerrit.wikimedia.org/r/341926 (https://phabricator.wikimedia.org/T159995) [23:34:32] (03CR) 10Paladox: [C: 031] "Tested on the phabricator instance and this works :)" [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:37:49] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3086043 (10Dzahn) @jrbs The file was added as "2016/img/fact-cards/malasari-birdwatching-card.jpg" in https://gerrit.wikimedia.org/r/#/c/341756/ so s... 
[23:40:34] 06Operations, 10MediaWiki-JobQueue, 10Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073330 (10Betacommand) [23:41:44] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3086089 (10Dzahn) I see, they are different files/size with the same name, so needed in both locations. deployed [23:42:56] 06Operations, 03Interactive-Sprint, 06Maps (Tilerator): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2384405 (10Pnorman) Based on {T159631} we should switch to hourly diffs even if we don't change how often we update. [23:45:31] (03CR) 10Dzahn: "we should move all the hiera lookups but of course that should not block this change. we should move them all at once separately" [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:47:54] (03CR) 10Dzahn: "let's add a bit more explanation to the commit message besides that it's needed" [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:48:34] Looks like in tonights update to phabricator, phabricator gain badges [23:48:43] for users [23:49:05] See https://phab-01.wmflabs.org/people/badges/3/ [23:49:06] RECOVERY - puppet last run on mc1035 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [23:50:08] (03PS7) 10Dzahn: phabricator: Add config for elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:50:59] (03CR) 10Dzahn: [C: 032] phabricator: Add config for elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:51:05] (03PS8) 1020after4: Add config for elasticsearch cluster [puppet] - 
10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) [23:51:44] (03CR) 1020after4: "ah you beat me to it ;)" [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:52:00] (03CR) 10Dzahn: [C: 032] "thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:55:06] (03CR) 10Dzahn: "ran puppet on both iridium and phab2001, has been applied" [puppet] - 10https://gerrit.wikimedia.org/r/341917 (https://phabricator.wikimedia.org/T157156) (owner: 1020after4) [23:55:25] thanks mutante! [23:55:38] you're welcome [23:56:18] jouncebot: next [23:56:20] In 103 hour(s) and 3 minute(s): Holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T0700) [23:56:25] heh [23:57:05] (03CR) 10Paladox: "This can now be merged as we merged the upstart script :)" [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [23:57:11] (03PS25) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) [23:57:35] thanks jouncebot, but what holiday is that [23:57:50] Holi (Hindu festival) would surprise me as US holiday [23:58:08] 14th is Pi Day [23:59:11] * Platonides mumbles about the inconsistency of mm/dd/yy dates [23:59:45] i don't see any US holiday in March.. ehm.. copy/paste issue?