[00:00:04] Deploy window No deploys! (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191127T0000)
[00:01:32] (PS2) Mholloway: MachineVision: Show UploadWizard CTA on testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/553153 (https://phabricator.wikimedia.org/T234960)
[00:02:43] (CR) Mholloway: [C: +2] MachineVision: Show UploadWizard CTA on testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/553153 (https://phabricator.wikimedia.org/T234960) (owner: Mholloway)
[00:03:33] (Merged) jenkins-bot: MachineVision: Show UploadWizard CTA on testcommonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/553153 (https://phabricator.wikimedia.org/T234960) (owner: Mholloway)
[00:04:21] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:05:34] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision: Show UploadWizard CTA on testcommonswiki (T234960) (duration: 01m 00s)
[00:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:40] T234960: Add call to action on final step of Upload Wizard - https://phabricator.wikimedia.org/T234960
[00:06:03] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[00:19:34] Operations, Discovery-Search (Current work): Add Maryum to Puppet - https://phabricator.wikimedia.org/T239300 (Mstyles)
[00:20:43] James_F: Do we have structure/unit test in wmf-config to ensure all wikis are in
exactly one of s#, one of (small,medium,large) and one of (wikipedia,wiktionary,…, chapters, special) etc.?
[00:21:01] Krinkle: Yes, yes, no.
[00:21:10] I think we have a subset of it, but if not already complete, would be good to do as well, especially now that it is much harder to verify by hand in the separate YAML files.
[00:21:47] Krinkle: Also, TMH videojs slim-down just landed in master. Yay.
[00:23:16] https://github.com/wikimedia/operations-mediawiki-config/blob/53647cbfc3afd6b2213a2f56395ccb1f0a3fb1b6/tests/dblistTest.php#L80
[00:23:17] nice
[00:24:47] (PS1) Mstyles: add mstyles as group member [puppet] - https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239300)
[00:24:49] (CR) jerkins-bot: [V: -1] add mstyles as group member [puppet] - https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239300) (owner: Mstyles)
[00:25:02] Yup.
[00:27:38] Krinkle: I'm unconvinced by the value in the massive work to create InitialiseSettings.json, having talked it through with Roan.
[00:28:14] But I now don't think the static files should be merged into git, just the diff displayed in CI and as part of `scap sync config` perhaps.
[00:28:48] (PS1) Krinkle: tests: Remove obsolete logic for "-computed" dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/553220 (https://phabricator.wikimedia.org/T239162)
[00:30:06] James_F: Sounds good to me. The file is not just for diffing though, we also need to think about the wgConf and dynamic stuff etc, or spend some more time in making that go away so that it is simple enough to fit in your model. I'd like that actually.
[00:34:08] and maybe come up with a slightly different file structure so that we can read an arbitrary wiki's overrides without reading N files per wiki.
Because we need to be able to read many wgCanonicalServer values for N wikis when parsing wikitext and handling cookie logic in core, as well as for JobQueue processing sometimes and various Wikibase stuff. Maybe that means for some settings a dedicated YAML file that centralises it for quick consumption, or some clever APCU caching on top of it run-time lazied (basically re-creating IS.php but at run-time and not on disk, which might make it okay).
[00:34:15] Might also need a hook in core.
[00:34:23] So that we can do it without as much indirection.
[00:34:54] Meh. Maybe.
[00:35:13] James_F: btw, can you see if group1/2 lists can be ported to the new system?
[00:35:18] That'd resolve those FIXME.
[00:35:32] those are the last "computed" dblists we use in prod.
[00:35:39] Computational YAML files? That's rather against the spirit of staticness…
[00:35:47] They should have followed the group1-computed.dblist->group1.dblist model, but I forgot to follow up on that.
[00:35:59] I was thinking that if the first character was ! then it removed it, but…
[00:36:07] it slipped in a year or so ago and never got around to fixing that
[00:36:11] Yeah
[00:36:43] I'll add a unit test for wiki family
[00:36:47] and for deploy group
[00:36:55] Cool.
[00:38:00] (CR) Jforrester: "Wrong task?" [mediawiki-config] - https://gerrit.wikimedia.org/r/553220 (https://phabricator.wikimedia.org/T239162) (owner: Krinkle)
[00:40:31] (PS2) Krinkle: tests: Remove obsolete logic for "-computed" dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/553220 (https://phabricator.wikimedia.org/T223602)
[00:40:36] (CR) Krinkle: [C: +2] "Oops :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/553220 (https://phabricator.wikimedia.org/T223602) (owner: Krinkle)
[00:41:03] James_F: https://gist.github.com/Krinkle/9364aa55fe1057028f5c109a11d50745#file-rarrrrrr-L2
[00:41:05] of course...
[00:41:21] (Merged) jenkins-bot: tests: Remove obsolete logic for "-computed" dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/553220 (https://phabricator.wikimedia.org/T223602) (owner: Krinkle)
[00:41:35] What joy.
[00:41:41] That'll be a prime candidate for diffing config before/after
[00:41:48] * James_F nods.
[00:41:49] to see what (if anything) it is relying on.
[00:42:03] Have an exemption for special+something?
[00:42:07] Actually, 'special' is kind of special. I wonder if we ever configure both 'wikipedia' and 'special' in IS.php
[00:42:17] It definitely shouldn't be in both though.
[00:42:51] wmgULSPosition is set for both 'special' and 'wikimedia', but to the same value.
[00:43:14] Other than that, no clashes that I see.
[00:43:20] special/wikiPedia
[00:43:27] ... which we do make use of.
[00:43:46] which means it's non-deterministic which one they get
[00:43:52] gewikimedia is both special and wikimedia.
[00:44:03] ah, right that one
[00:44:17] The other eight are all special and wikipedia.
[00:44:30] wgLanguageCode also sets both wikimania+special, but gewikimedia already sets that directly so no difference
[00:44:52] Well, that has `'default' => '$lang'`
[00:44:57] So everything has a value.
[00:45:09] And there's no value for 'wikipedia'.
[00:45:12] (PS1) Catrope: GrowthExperiments: Align help panel new account enabling with homepage [mediawiki-config] - https://gerrit.wikimedia.org/r/553222 (https://phabricator.wikimedia.org/T232396)
[00:45:19] yeah but local << (one) matching tag << default
[00:45:28] oh fun, only two things use 'special'
[00:45:31] I'm impressed
[00:45:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:45:43] Yeah, we can probably scrap special except I imagine it's used in puppet somewhere horrible.
[00:45:48] Because all the bad dblists are.
[00:45:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:45:58] sure, I'm not thinking of removing the list
[00:46:04] I want to kill off an awful lot of our dblists.
[00:46:06] we definitely need 'special' to capture all the non-family wikis.
[00:46:17] but these wikis don't need to be in both.
[00:46:22] and looks like nothing settings-wise relies on that.
[00:46:31] Like the ones to decide which stage of hovercards we're deploying to in late 2016.
[00:48:04] yeah
[00:51:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:52:57] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=esams https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:55:33] (PS1) Krinkle: tests: Assert each wiki is in one 'deploy group' and one 'family' [mediawiki-config] - https://gerrit.wikimedia.org/r/553224
[00:55:51] (PS2) Krinkle: tests: Assert each wiki is in one 'deploy group' and one 'family' [mediawiki-config] - https://gerrit.wikimedia.org/r/553224
[00:56:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:58:07] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:59:59] (PS3) Jforrester: tests: Assert each wiki is in one 'deploy group' and one 'family' [mediawiki-config] - https://gerrit.wikimedia.org/r/553224 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:00:14] (CR) Jforrester: [C: +1] tests: Assert each wiki is in one 'deploy group' and one 'family' [mediawiki-config] - https://gerrit.wikimedia.org/r/553224 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:04:33] Operations, ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (RobH) >>! In T184066#5694288, @Papaul wrote: > qfx5100-spare1, psu 0 {#20156} to ps2-oe15-esams:17 > qfx5100-spare2, psu 0 {#20157} to ps2-oe15-esams:16 > qfx5100-spare1, psu 1 {#20159} to ps1-oe1...
[01:05:06] (PS2) BBlack: Depool ulsfo [dns] - https://gerrit.wikimedia.org/r/551252
[01:07:52] (PS1) Catrope: GrowthExperiments: Being "initiation test" for suggested edits [mediawiki-config] - https://gerrit.wikimedia.org/r/553225 (https://phabricator.wikimedia.org/T238888)
[01:08:07] (PS1) Krinkle: Make arbcom-*.wikipedia.org sites no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553226 (https://phabricator.wikimedia.org/T239301)
[01:09:00] (PS2) Krinkle: Make arbcom-*.wikipedia.org sites no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553226 (https://phabricator.wikimedia.org/T239301)
[01:10:24] (PS3) Jforrester: Make arbcom-*.wikipedia.org sites no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553226 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:10:30] (CR) Jforrester: [C: +1] Make arbcom-*.wikipedia.org sites no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553226 (https://phabricator.wikimedia.org/T239301) (owner:
Krinkle)
[01:11:01] (PS1) Krinkle: Make ge.wikimedia.org site no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553228 (https://phabricator.wikimedia.org/T239301)
[01:15:49] (PS4) Krinkle: Make special *.wikipedia.org sites no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553226 (https://phabricator.wikimedia.org/T239301)
[01:15:51] (PS2) Krinkle: Make ge.wikimedia.org site no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553228 (https://phabricator.wikimedia.org/T239301)
[01:17:26] (PS5) Jforrester: Make special *.wikipedia.org sites no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553226 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:17:31] (CR) Jforrester: [C: +1] Make special *.wikipedia.org sites no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553226 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:17:37] (PS3) Jforrester: Make ge.wikimedia.org site no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553228 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:17:58] (CR) Jforrester: [C: +1] Make ge.wikimedia.org site no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553228 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:18:42] Found one exception for the test wikis
[01:19:12] (PS6) Krinkle: Make special *.wikipedia.org sites no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553226 (https://phabricator.wikimedia.org/T239301)
[01:19:14] (PS4) Krinkle: Make ge.wikimedia.org site no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553228 (https://phabricator.wikimedia.org/T239301)
[01:19:16] (PS1) Krinkle: Set wgLanguageCode for test wikis explicitly to 'en' [mediawiki-config] - https://gerrit.wikimedia.org/r/553230
[01:19:40] (CR)
Krinkle: [C: +2] tests: Assert each wiki is in one 'deploy group' and one 'family' [mediawiki-config] - https://gerrit.wikimedia.org/r/553224 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:20:15] (PS2) Krinkle: Set wgLanguageCode for test wikis explicitly to 'en' [mediawiki-config] - https://gerrit.wikimedia.org/r/553230
[01:20:25] James_F: OK to deploy ^ one?
[01:20:37] (CR) jerkins-bot: [V: -1] Make ge.wikimedia.org site no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553228 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:20:41] (Merged) jenkins-bot: tests: Assert each wiki is in one 'deploy group' and one 'family' [mediawiki-config] - https://gerrit.wikimedia.org/r/553224 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:20:47] (PS7) Krinkle: Make special *.wikipedia.org sites no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553226 (https://phabricator.wikimedia.org/T239301)
[01:20:54] (PS5) Krinkle: Make ge.wikimedia.org site no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553228 (https://phabricator.wikimedia.org/T239301)
[01:21:09] Krinkle: Yes, go for it.
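Editor's aside: the check discussed above (each wiki in exactly one 'family' list, and likewise exactly one deploy group) can be sketched roughly as follows. This is illustrative Python only, not the dblistTest.php code from mediawiki-config; the function name and the sample data are assumptions made up for the example, apart from gewikimedia, which the log mentions as having been in both 'special' and 'wikimedia'.

```python
from collections import defaultdict


def find_membership_violations(all_wikis, lists_by_name):
    """Return {wiki: [names of lists containing it]} for every wiki that is
    not in exactly one of the given lists (hypothetical helper)."""
    membership = defaultdict(list)
    for list_name, wikis in lists_by_name.items():
        for wiki in wikis:
            membership[wiki].append(list_name)
    return {
        wiki: sorted(membership.get(wiki, []))
        for wiki in all_wikis
        if len(membership.get(wiki, [])) != 1
    }


# Made-up sample data; the real dblists are far larger.
all_wikis = ["enwiki", "gewikimedia", "orphanwiki"]
families = {
    "wikipedia": ["enwiki"],
    "wikimedia": ["gewikimedia"],
    "special": ["gewikimedia"],
}

violations = find_membership_violations(all_wikis, families)
print(violations)
```

A wiki listed twice shows up with both list names, and a wiki listed nowhere shows up with an empty list, so a test could simply assert that the returned dict is empty.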
[01:21:14] (CR) Krinkle: [C: +2] Set wgLanguageCode for test wikis explicitly to 'en' [mediawiki-config] - https://gerrit.wikimedia.org/r/553230 (owner: Krinkle)
[01:22:07] (Merged) jenkins-bot: Set wgLanguageCode for test wikis explicitly to 'en' [mediawiki-config] - https://gerrit.wikimedia.org/r/553230 (owner: Krinkle)
[01:23:56] * Krinkle testing
[01:27:49] (CR) Krinkle: [C: +2] Make special *.wikipedia.org sites no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553226 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:27:51] (CR) Krinkle: [C: +2] Make ge.wikimedia.org site no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553228 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:28:37] (Merged) jenkins-bot: Make special *.wikipedia.org sites no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553226 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:28:41] (Merged) jenkins-bot: Make ge.wikimedia.org site no longer "special" [mediawiki-config] - https://gerrit.wikimedia.org/r/553228 (https://phabricator.wikimedia.org/T239301) (owner: Krinkle)
[01:28:46] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 01m 03s)
[01:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:46] (PS3) Krinkle: Remove unused $hostName variable in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/546641
[01:30:48] (CR) Krinkle: [C: +2] Remove unused $hostName variable in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/546641 (owner: Krinkle)
[01:31:25] (Merged) jenkins-bot: Remove unused $hostName variable in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/546641 (owner: Krinkle)
[01:33:43] (CR) Krinkle: "me re-doing it from scratch due to
merge conflicts, and then a bunch of debug testing." [mediawiki-config] - https://gerrit.wikimedia.org/r/542658 (owner: Krinkle)
[01:34:31] (PS1) Krinkle: Revert "Fix 'the the' typo in vendor/perftools/xhgui-collector/external/header.php" [mediawiki-config] - https://gerrit.wikimedia.org/r/553231
[01:34:35] (CR) Krinkle: [C: +2] Revert "Fix 'the the' typo in vendor/perftools/xhgui-collector/external/header.php" [mediawiki-config] - https://gerrit.wikimedia.org/r/553231 (owner: Krinkle)
[01:35:32] (Merged) jenkins-bot: Revert "Fix 'the the' typo in vendor/perftools/xhgui-collector/external/header.php" [mediawiki-config] - https://gerrit.wikimedia.org/r/553231 (owner: Krinkle)
[01:38:54] (PS9) Krinkle: Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658
[01:41:23] * Krinkle testing this patch on deploy1001/mwdebug1001
[01:44:46] (CR) Krinkle: [C: -1] "PHP Fatal error: require_once(): Failed opening required '/srv/mediawiki/wmf-config/../vendor/autoload.php' (include_path='/srv/mediawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/542658 (owner: Krinkle)
[01:45:48] (PS10) Krinkle: Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658
[01:47:03] (CR) Krinkle: "Uncaught Error: Call to undefined function Xhgui_Saver_Mongo() in /srv/mediawiki/wmf-config/profiler.php:204" [mediawiki-config] - https://gerrit.wikimedia.org/r/542658 (owner: Krinkle)
[01:48:44] (PS11) Krinkle: Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658
[01:49:28] (CR) Krinkle: [C: +2] "Yay.
https://performance.wikimedia.org/xhgui/run/view?id=5dddd60b3f3dfac0333bdb46" [mediawiki-config] - https://gerrit.wikimedia.org/r/542658 (owner: Krinkle)
[01:51:27] (CR) Krinkle: [C: +2] Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658 (owner: Krinkle)
[01:52:18] (Merged) jenkins-bot: Convert frankenstein vendor/ into thin local lib/ [mediawiki-config] - https://gerrit.wikimedia.org/r/542658 (owner: Krinkle)
[01:55:07] !log krinkle@deploy1001 Synchronized lib/: 4108ff4e2 (1/3) (duration: 01m 01s)
[01:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:23] !log krinkle@deploy1001 Synchronized wmf-config/: 4108ff4e2 (2/3) (duration: 00m 59s)
[01:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:58:13] !log krinkle@deploy1001 Synchronized vendor: 4108ff4e2 (3/3) (duration: 01m 00s)
[01:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:04:26] (CR) Vgutierrez: ATS: enable reload for global Lua script (3 comments) [puppet] - https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) (owner: Ema)
[02:45:07] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:45:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:08:23] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/references/{title} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:09:59] RECOVERY - mobileapps endpoints health on scb2005 is
OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:34:03] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 56.61 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:42:41] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 104.5 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:56:49] (PS4) Vgutierrez: acme-chief: parallelize gdnsd-sync [puppet] - https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006) (owner: BBlack)
[03:58:10] (CR) Vgutierrez: acme-chief: parallelize gdnsd-sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006) (owner: BBlack)
[04:10:03] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:33:01] (PS5) Vgutierrez: acme-chief: parallelize gdnsd-sync [puppet] - https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006) (owner: BBlack)
[04:33:37] (CR) jerkins-bot: [V: -1] acme-chief: parallelize gdnsd-sync [puppet] - https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006) (owner: BBlack)
[04:34:30] (PS6) Vgutierrez: acme-chief: parallelize gdnsd-sync [puppet] - https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006) (owner: BBlack)
[04:38:44] (PS1) BBlack: dnsbox: set up rec/auth machine-local dependency [puppet] - https://gerrit.wikimedia.org/r/553236 (https://phabricator.wikimedia.org/T98006)
[04:41:37] I'm out of willingness to risk staying up later fixing my own mistakes, so that patch will wait for early tomorrow :)
[05:17:01] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7083 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[05:18:47] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.03333 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[05:28:09] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 54.48 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:37:45] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6417 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[05:38:33] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 80.68 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:39:29] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[05:40:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3311 for schema change', diff saved to https://phabricator.wikimedia.org/P9758 and previous config saved to
/var/cache/conftool/dbconfig/20191127-054056-marostegui.json
[05:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2125 T239042', diff saved to https://phabricator.wikimedia.org/P9759 and previous config saved to /var/cache/conftool/dbconfig/20191127-054809-marostegui.json
[05:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:15] T239042: db2125 crashed - https://phabricator.wikimedia.org/T239042
[05:48:27] Operations, ops-codfw, DBA: db2125 crashed - https://phabricator.wikimedia.org/T239042 (Marostegui) Open→Resolved All the wikis have had their main tables checked and there are no apparent data drifts, so I am going to repool this host and consider this fixed for now. If it happens again, we...
[05:48:30] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui)
[05:54:27] (PS1) Marostegui: db2135: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/553240 (https://phabricator.wikimedia.org/T238183)
[05:56:35] (PS1) Marostegui: db-eqiad,db-codfw.php: Add db2135 to config [mediawiki-config] - https://gerrit.wikimedia.org/r/553241 (https://phabricator.wikimedia.org/T238183)
[05:56:38] (CR) Marostegui: [C: +2] db2135: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/553240 (https://phabricator.wikimedia.org/T238183) (owner: Marostegui)
[05:58:44] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Add db2135 to config [mediawiki-config] - https://gerrit.wikimedia.org/r/553241 (https://phabricator.wikimedia.org/T238183) (owner: Marostegui)
[05:59:33] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Add db2135 to config [mediawiki-config] - https://gerrit.wikimedia.org/r/553241 (https://phabricator.wikimedia.org/T238183) (owner: Marostegui)
[06:01:06] !log marostegui@deploy1001 Synchronized
wmf-config/db-eqiad.php: Add db2135 to the config T238183 (duration: 01m 11s)
[06:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:12] T238183: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183
[06:02:12] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Add db2135 to the config T238183 (duration: 00m 59s)
[06:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:40] Operations, MediaWiki-REST-API, serviceops, wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (Legoktm) @tstarling is the gpg key that you used to sign that release available anywhere? https://www.mediawiki.org/keys/keys.txt stil...
[06:05:33] !log Promote db2135 to codfw m5 master T238183
[06:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:09] (CR) Marostegui: [C: +1] "We'd need to check replication on a codfw slave to see if we have to restart it across all the hosts." [puppet] - https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: Jbond)
[06:16:08] (CR) Giuseppe Lavagetto: [V: +2 C: +2] envoy-tls: proxy /stats from the admin interface. [docker-images/production-images] - https://gerrit.wikimedia.org/r/553045 (owner: Giuseppe Lavagetto)
[06:22:20] Operations, ops-codfw, DBA: (codfw):rack/setup/install db213[2-5] - https://phabricator.wikimedia.org/T237702 (Marostegui)
[06:23:51] (CR) Giuseppe Lavagetto: [C: +2] Update blubberoid to workaround in telemetry collection [deployment-charts] - https://gerrit.wikimedia.org/r/553155 (owner: Giuseppe Lavagetto)
[06:25:30] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' .
[06:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:53] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:49:10] (PS1) Marostegui: mariadb: Set db2062 to spare [puppet] - https://gerrit.wikimedia.org/r/553243 (https://phabricator.wikimedia.org/T238726)
[06:50:19] !log Stop MySQL on db2062 - T238726
[06:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:24] T238726: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726
[06:50:40] (CR) Marostegui: [C: +2] mariadb: Set db2062 to spare [puppet] - https://gerrit.wikimedia.org/r/553243 (https://phabricator.wikimedia.org/T238726) (owner: Marostegui)
[06:52:02] !log Remove db2062 from tendril and zarcillo - T238726
[06:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:04] (PS1) Elukey: Apply profile::analytics::client::limits to stat100[6,7] and notebooks [puppet] - https://gerrit.wikimedia.org/r/553244 (https://phabricator.wikimedia.org/T212824)
[07:03:09] (CR) Elukey: [C: +2] Apply profile::analytics::client::limits to stat100[6,7] and notebooks [puppet] - https://gerrit.wikimedia.org/r/553244 (https://phabricator.wikimedia.org/T212824) (owner: Elukey)
[07:03:38] !log Compress tables on db1102:3314
[07:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:53] Operations, Traffic: cp3063 crashed - https://phabricator.wikimedia.org/T239310 (Vgutierrez)
[07:04:30] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3063.esams.wmnet
[07:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:59] Operations, Traffic: servers freeze across the caching cluster -
https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[07:09:29] RECOVERY - Host cp3063 is UP: PING WARNING - Packet loss = 28%, RTA = 83.37 ms
[07:12:53] Operations, Analytics, Analytics-Kanban, Product-Analytics, and 2 others: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (elukey) >>! In T212824#5695358, @JAllemandou wrote: > @elukey: We should apply the same treatment for stat1007 :) Done! Only stat1...
[07:15:29] Operations, Traffic: cp3063 crashed - https://phabricator.wikimedia.org/T239310 (Vgutierrez) Open→Resolved a:Vgutierrez As "expected" nothing on SEL or logs, this is yet another occurrence of T238305
[07:15:32] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[07:15:47] !log repooling cp3063 - T239310
[07:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:53] T239310: cp3063 crashed - https://phabricator.wikimedia.org/T239310
[07:25:48] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 55.02 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:29:30] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 70.52 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:32:51] (PS1) Giuseppe Lavagetto: prometheus::k8s: drop envoy metrics about the admin interface [puppet] - https://gerrit.wikimedia.org/r/553246
[07:34:38] (PS1) Giuseppe Lavagetto: blubberoid: add TLS termination to the production clusters [deployment-charts] - https://gerrit.wikimedia.org/r/553247
[07:37:30] (CR) Giuseppe Lavagetto: [C: +2] blubberoid: add TLS
termination to the production clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/553247 (owner: 10Giuseppe Lavagetto) [07:41:46] !log oblivian@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [07:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:25] 10Operations, 10observability, 10User-fgiunchedi, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10MoritzMuehlenhoff) JFTR, Diamond has been removed from Debian as part of the Python 2 removal: https://packages.qa.debian.org/d/diamond/... [07:49:10] !log roll restart of eventstreams on scb2* - T239220 [07:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:15] T239220: page-links-change EventStream doesn't appear to be outputting events - https://phabricator.wikimedia.org/T239220 [07:51:46] !log oblivian@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [07:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:59] !log oblivian@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [07:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:31] !log oblivian@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' . 
[07:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:10] (03CR) 10Muehlenhoff: [C: 03+1] "Go for it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: 10Jbond) [08:01:31] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [08:03:00] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [08:04:24] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 54.41 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:06:13] (03CR) 10Muehlenhoff: systemd::slice::all_users: add Debian Buster support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [08:08:01] 10Operations, 10User-fgiunchedi: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10fgiunchedi) [08:08:03] (03CR) 10Elukey: systemd::slice::all_users: add Debian Buster support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [08:09:20] !log swift eqiad-prod: more weight to ms-be105[7-9] - T237438 [08:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:25] T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 [08:09:40] (03CR) 10Muehlenhoff: systemd::slice::all_users: add Debian Buster support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [08:11:14] RECOVERY - Varnish 
traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 73.36 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:15:01] (03CR) 10Muehlenhoff: Switch Ganeti servers in esams/ulsfo to Buster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553046 (https://phabricator.wikimedia.org/T236216) (owner: 10Muehlenhoff) [08:17:10] (03PS5) 10Muehlenhoff: Switch Ganeti servers in esams/ulsfo to Buster [puppet] - 10https://gerrit.wikimedia.org/r/553046 (https://phabricator.wikimedia.org/T236216) [08:18:04] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 57.1 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:21:06] I'm going to silence the codfw traffic drop for a week or so, tracking is at T239039 [08:21:07] T239039: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey - https://phabricator.wikimedia.org/T239039 [08:21:32] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 77.19 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:24:06] !log silence codfw varnish traffic drop until dec 9th - T239039 [08:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:34] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: stop relaying to statsd/statsite [puppet] - 10https://gerrit.wikimedia.org/r/535188 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [08:29:00] (03PS6) 10Muehlenhoff: Switch Ganeti servers in esams/ulsfo to Buster [puppet] - 10https://gerrit.wikimedia.org/r/553046 
(https://phabricator.wikimedia.org/T236216) [08:33:15] godog: thanks! [08:38:13] (03CR) 10Muehlenhoff: [C: 03+2] Switch Ganeti servers in esams/ulsfo to Buster [puppet] - 10https://gerrit.wikimedia.org/r/553046 (https://phabricator.wikimedia.org/T236216) (owner: 10Muehlenhoff) [08:41:21] (03CR) 10Arturo Borrero Gonzalez: systemd::slice::all_users: add Debian Buster support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [08:42:08] (03CR) 10Muehlenhoff: "Ack, Arturo's suggestion sounds good to me" [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [08:43:36] (03CR) 10Elukey: systemd::slice::all_users: add Debian Buster support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [08:44:06] (03PS2) 10Filippo Giunchedi: swift: stop relaying to statsd/statsite [puppet] - 10https://gerrit.wikimedia.org/r/535188 (https://phabricator.wikimedia.org/T205870) [08:44:32] (03PS3) 10DannyS712: Remove `wgImportSources` settings for closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552361 (https://phabricator.wikimedia.org/T231178) [08:47:54] TIL gerrit will let you merge an effectively empty commit [08:48:48] (03PS5) 10DannyS712: Remove `move-rootuserpages` from user on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552374 (https://phabricator.wikimedia.org/T238842) [08:52:40] (03PS2) 10Kosta Harlan: GrowthExperiments: Begin "initiation test" for suggested edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553225 (https://phabricator.wikimedia.org/T238888) (owner: 10Catrope) [08:55:27] (03PS6) 10Elukey: systemd::slice::all_users: add Debian Buster support [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) [09:01:16] !log Stop replication on 1124:3318 to reimport wikidatawiki.page table 
on labsdb1010 - T238399 [09:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:21] T238399: Reimport wikidatawiki.{pagelinks,page} on labsdb1010 - https://phabricator.wikimedia.org/T238399 [09:01:32] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [09:02:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/19655/" [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [09:02:19] !log reimage mw1317.eqiad.wmnet - T239054 [09:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:24] T239054: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 [09:07:14] (03CR) 10Elukey: [C: 03+2] systemd::slice::all_users: add Debian Buster support [puppet] - 10https://gerrit.wikimedia.org/r/553142 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [09:08:07] godog: ready to merge? [09:08:30] elukey: sure, go for it [09:08:39] ah wait is it the empty commit? [09:08:44] it is [09:09:03] ack [09:10:06] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [09:11:55] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw1317.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911270911_jiji_16360_mw1317_... 
[09:12:41] (03PS1) 10Elukey: role::statistics::explorer::gpu: add systemd user limits [puppet] - 10https://gerrit.wikimedia.org/r/553288 (https://phabricator.wikimedia.org/T212824) [09:14:41] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw1322.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911270914_jiji_16885_mw1322_... [09:14:43] (03CR) 10Elukey: [C: 03+2] role::statistics::explorer::gpu: add systemd user limits [puppet] - 10https://gerrit.wikimedia.org/r/553288 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [09:15:14] (03PS7) 10Ema: ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) [09:15:50] (03CR) 10Ema: ATS: enable reload for global Lua script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [09:16:31] (03CR) 10Vgutierrez: [C: 03+1] ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [09:17:56] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, and 2 others: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10elukey) [09:19:22] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, and 2 others: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10elukey) I am inclined to close this task, and re-open if more alarms will appear. We'll likely have to tune limits but I hope that t... 
[09:25:00] !log cp3050: re-enable request coalescing after performance experiment T238494 [09:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:05] T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [09:25:33] (03PS2) 10Muehlenhoff: Add IDP device data to backups [puppet] - 10https://gerrit.wikimedia.org/r/553058 (https://phabricator.wikimedia.org/T233936) [09:29:55] !log installing php-imagick security updates [09:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:57] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [09:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:06] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:10] (03CR) 10Volans: acme-chief: parallelize gdnsd-sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [09:33:44] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [09:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:19] (03PS7) 10Vgutierrez: acme-chief: parallelize gdnsd-sync [puppet] - 10https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [09:34:48] (03CR) 10Vgutierrez: "thanks for the review volans" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [09:35:50] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:31] (03PS1) 10Giuseppe Lavagetto: lvs::configuration: add blubberoid TLS service [puppet] - 10https://gerrit.wikimedia.org/r/553294 [09:41:03] !log installing symfony security 
updates [09:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:00] (03CR) 10Volans: [C: 03+1] "I did not test it but LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [09:50:50] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) >>! In T236277#5695092, @Marostegui wrote: > @jbond maybe it is a good idea to disable puppet on all databases before merging the change and then trying... [09:51:31] (03PS3) 10Jbond: puppet_ca: update puppet ca with a new certificate valid for 10 years [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) [09:51:55] (03CR) 10jerkins-bot: [V: 04-1] puppet_ca: update puppet ca with a new certificate valid for 10 years [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: 10Jbond) [09:52:59] heads up, I am deliberately causing memory pressure on stat1007 to see how the new systemd limits behave, any alarm is due to me [09:53:05] (if any, I hope not) [09:55:36] nothing exploded \o/ [09:57:13] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [09:58:46] (03PS1) 10Muehlenhoff: Update IDP rsync to also account for future additions to the devices directory [puppet] - 10https://gerrit.wikimedia.org/r/553296 [10:00:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/553058 (https://phabricator.wikimedia.org/T233936) (owner: 10Muehlenhoff) [10:01:52] (03CR) 10Jbond: Update IDP rsync to also account for future additions to the devices directory (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553296 (owner: 10Muehlenhoff) [10:09:53] (03PS2) 10Muehlenhoff: Update IDP rsync to also account for future additions to the devices directory [puppet] - 
10https://gerrit.wikimedia.org/r/553296 [10:09:59] (03CR) 10Muehlenhoff: Update IDP rsync to also account for future additions to the devices directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553296 (owner: 10Muehlenhoff) [10:12:25] (03CR) 10Volans: [C: 03+1] "CI output looks good. Some possible caveats:" [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [10:14:56] (03CR) 10Muehlenhoff: [C: 03+2] Add IDP device data to backups [puppet] - 10https://gerrit.wikimedia.org/r/553058 (https://phabricator.wikimedia.org/T233936) (owner: 10Muehlenhoff) [10:16:07] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::k8s: drop envoy metrics about the admin interface [puppet] - 10https://gerrit.wikimedia.org/r/553246 (owner: 10Giuseppe Lavagetto) [10:21:22] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1317.eqiad.wmnet'] ` and were **ALL** successful. [10:21:48] (03PS1) 10ArielGlenn: skip comment lines in dblists [dumps] - 10https://gerrit.wikimedia.org/r/553302 [10:24:19] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1322.eqiad.wmnet'] ` and were **ALL** successful. 
[10:27:20] No backups: 1 (idp2001), Fresh: 94 jobs [10:28:17] (03PS8) 10Muehlenhoff: Enable ldap-corp1001/2001 as additional LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/539150 [10:30:49] (03CR) 10ArielGlenn: [C: 03+2] skip comment lines in dblists [dumps] - 10https://gerrit.wikimedia.org/r/553302 (owner: 10ArielGlenn) [10:31:11] $ check_bacula.py idp2001.wikimedia.org-Monthly-1st-Thu-production-idp [10:31:13] No jobs found for idp2001.wikimedia.org-Monthly-1st-Thu-production-idp [10:31:55] !log ariel@deploy1001 Started deploy [dumps/dumps@e0b0e76]: skip comment lines in dblist files [10:31:59] !log ariel@deploy1001 Finished deploy [dumps/dumps@e0b0e76]: skip comment lines in dblist files (duration: 00m 03s) [10:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:29] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:33:23] $ ./check_bacula.py idp2001.wikimedia.org-Monthly-1st-Thu-production-idp [10:33:25] 2019-11-27 10:32:35: level: F, status: T, bytes: 2128, files: 2, errors: 0, duration(s): 0:00:00 [10:33:29] ^CC moritzm [10:33:52] thx! idp1001 might also soon show up in monitoring, BTW. 
puppet just ran there [10:34:09] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:35:11] (03PS8) 10Ema: ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) [10:35:42] 10Operations, 10DC-Ops: HP SSD Failure Firmware Fix - https://phabricator.wikimedia.org/T239211 (10jcrespo) Please note those are for SAS disks, I beleive we have more SATA ones, which are affected by https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00048133ja_jp [10:36:37] Hi, https://lists.wikimedia.org/mailman/private/2019-November/thread.html says _no such list_ to me. However, at least one email has been sent to that archive. Any guesses what might be wrong? [10:37:16] (03CR) 10jerkins-bot: [V: 04-1] ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [10:39:00] Urbanecm: https://lists.wikimedia.org/mailman/private/LIST_NAME_HERE/2019-November/thread.html [10:39:35] (03PS9) 10Ema: ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) [10:40:12] rxy: ehh, https://lists.wikimedia.org/mailman/private/google-code-in-mentors links to wrong address, while https://lists.wikimedia.org/mailman/private/google-code-in-mentors/ works... 
[10:40:41] (03CR) 10Muehlenhoff: [C: 03+2] Enable ldap-corp1001/2001 as additional LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/539150 (owner: 10Muehlenhoff) [10:40:50] (03PS10) 10Ema: ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) [10:41:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nova: add nova config for the placement service [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [10:43:25] !log reimage mw1347,mw1337,mw1327 - T239054 [10:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:30] T239054: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 [10:43:32] (03PS1) 10Filippo Giunchedi: install_server: use GPT for mw-raid1 [puppet] - 10https://gerrit.wikimedia.org/r/553306 (https://phabricator.wikimedia.org/T156955) [10:45:47] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw1347.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271045_jiji_35912_mw1347_... [10:46:18] 10Operations, 10Patch-For-Review, 10User-jbond: Add U2F/FIDO as second factor for CAS - https://phabricator.wikimedia.org/T233937 (10jbond) @Volans just asked if there is a way to register multiple u2f devices to the same account. Of the top of my head im not sure how to achive that but placing a not here a... [10:47:48] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw1337.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271047_jiji_36247_mw1337_... 
[10:49:24] 10Operations: Deprecate msdos partition scheme in favor of GPT - https://phabricator.wikimedia.org/T239321 (10fgiunchedi) [10:49:49] (03PS2) 10Filippo Giunchedi: install_server: use GPT for mw-raid1 [puppet] - 10https://gerrit.wikimedia.org/r/553306 (https://phabricator.wikimedia.org/T239321) [10:49:52] (03PS1) 10Alexandros Kosiaris: package_builder: Remove tests/ directory [puppet] - 10https://gerrit.wikimedia.org/r/553307 [10:49:54] (03PS1) 10Alexandros Kosiaris: package_builder: Add the bullseye distribution [puppet] - 10https://gerrit.wikimedia.org/r/553308 [10:49:56] (03PS1) 10Alexandros Kosiaris: package_builder: Add ordering rules for new hooks [puppet] - 10https://gerrit.wikimedia.org/r/553309 [10:50:09] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw1327.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271049_jiji_36621_mw1327_... [10:50:44] effie _joe_ got https://gerrit.wikimedia.org/r/c/operations/puppet/+/553306 out to move to GPT, please let me know what you think! [10:51:40] (03PS1) 10Jbond: puppetboard: remove trailing slash from proxied url [puppet] - 10https://gerrit.wikimedia.org/r/553310 [10:51:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] install_server: use GPT for mw-raid1 [puppet] - 10https://gerrit.wikimedia.org/r/553306 (https://phabricator.wikimedia.org/T239321) (owner: 10Filippo Giunchedi) [10:52:13] godog: tx :) [10:52:24] (03CR) 10Volans: [C: 03+1] "Let's try it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/553310 (owner: 10Jbond) [10:52:59] effie: yw, if that looks good I'm happy to help testing it [10:53:03] (03CR) 10Jbond: [C: 03+2] puppetboard: remove trailing slash from proxied url [puppet] - 10https://gerrit.wikimedia.org/r/553310 (owner: 10Jbond) [10:54:06] godog: if that means to adopt a mediawiki servers, I am all in :p [10:54:32] effie: heheh more like foster care for a day, but yeah [10:54:49] oh sure, we like that as well [10:55:06] (03CR) 10Effie Mouzeli: [C: 03+1] install_server: use GPT for mw-raid1 [puppet] - 10https://gerrit.wikimedia.org/r/553306 (https://phabricator.wikimedia.org/T239321) (owner: 10Filippo Giunchedi) [10:55:34] (03PS2) 10ArielGlenn: properly handle mtime lookup for dumps log exception checker [puppet] - 10https://gerrit.wikimedia.org/r/552707 [10:56:38] (03PS3) 10Filippo Giunchedi: install_server: use GPT for mw-raid1 [puppet] - 10https://gerrit.wikimedia.org/r/553306 (https://phabricator.wikimedia.org/T239321) [10:56:41] kk, will merge and I'm happy to test a reimage [10:58:42] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: use GPT for mw-raid1 [puppet] - 10https://gerrit.wikimedia.org/r/553306 (https://phabricator.wikimedia.org/T239321) (owner: 10Filippo Giunchedi) [11:00:39] effie: which host should I take for reimage ? 
[11:01:24] (03CR) 10ArielGlenn: [C: 03+2] properly handle mtime lookup for dumps log exception checker [puppet] - 10https://gerrit.wikimedia.org/r/552707 (owner: 10ArielGlenn) [11:02:28] godog: we have a great selection of hosts to choose from [11:02:44] godog: you can do mw2290.codfw.wmnet [11:03:10] 10Operations, 10User-jbond: Investigate how automated tasks can authenticate against CAS - https://phabricator.wikimedia.org/T239323 (10jbond) [11:03:48] (03CR) 10Ema: [C: 03+2] ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552955 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [11:04:52] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [11:04:54] effie: kk, will do! what's the command line you are using atm ? [11:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:46] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10mark) >>! In T184066#5695891, @RobH wrote: >>>! In T184066#5694288, @Papaul wrote: >> qfx5100-spare1, psu 0 {#20156} to ps2-oe15-esams:17 >> qfx5100-spare2, psu 0 {#20157} to ps2-oe15-esams:16 >>... [11:05:54] godog: wmf-auto-reimage-host -a -p T239054 --no-downtime [11:05:54] T239054: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 [11:06:30] effie: ok! and the downtime separatedly ? or for a single host I can ditch --no-downtime ? [11:06:35] effie: seems to miss quote some options there [11:06:46] also why no-downtime? 
[11:06:48] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [11:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:55] *quite [11:06:57] !log cp1075: depool to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552955/ and test tslua reloads T233274 [11:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:02] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:07:02] T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 [11:07:03] I remember we had some issues with downtimes [11:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:11] ok switching channels [11:07:14] -sre [11:09:13] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:12] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [11:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:21] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:21] PROBLEM - traffic_server backend process restarted on cp1075 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqiad+prometheus/ops&var-instance=cp1075&var-layer=backend [11:14:47] ACKNOWLEDGEMENT - traffic_server backend process restarted on cp1075 is CRITICAL: 2 ge 2 Ema https://phabricator.wikimedia.org/P9762 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqiad+prometheus/ops&var-instance=cp1075&var-layer=backend [11:17:45] 10Operations, 10serviceops: Reimage all mediawiki servers - 
https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts: ` mw2290.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271117_filippo_42756_m... [11:18:05] PROBLEM - Check systemd state on ldap-corp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:31] looking at ldap-corp1001 [11:19:15] RECOVERY - Check systemd state on ldap-corp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:03] !log reimage mw2289.codfw.wmnet [11:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:31] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw2289.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271120_jiji_43318_mw2289_... [11:22:32] jouncebot: now [11:22:32] For the next 12 hour(s) and 37 minute(s): No deploys! 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191127T0000) [11:22:39] lovely [11:26:27] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/553296 (owner: 10Muehlenhoff) [11:26:46] (03PS1) 10Filippo Giunchedi: Revert "install_server: use GPT for mw-raid1" [puppet] - 10https://gerrit.wikimedia.org/r/553317 [11:27:25] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "install_server: use GPT for mw-raid1" [puppet] - 10https://gerrit.wikimedia.org/r/553317 (owner: 10Filippo Giunchedi) [11:27:58] !log jiji@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1337.eqiad.wmnet,service=apache2 [11:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:07] !log jiji@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1337.eqiad.wmnet,service=nginx [11:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:18] !log jiji@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1347.eqiad.wmnet,service=apache2 [11:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:25] !log jiji@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1327.eqiad.wmnet,service=apache2 [11:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:31] !log jiji@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1347.eqiad.wmnet,service=nginx [11:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:39] !log jiji@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1327.eqiad.wmnet,service=nginx [11:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/553308 (owner: 10Alexandros Kosiaris) [11:31:22] 10Operations, 10serviceops: Reimage all mediawiki servers - 
https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2290.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2290.codfw.wmnet'] ` [11:31:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:33:24] (03PS1) 10Jbond: Revert "puppetboard: remove trailing slash from proxied url" [puppet] - 10https://gerrit.wikimedia.org/r/553318 [11:34:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] package_builder: Remove tests/ directory [puppet] - 10https://gerrit.wikimedia.org/r/553307 (owner: 10Alexandros Kosiaris) [11:34:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] package_builder: Add the bullseye distribution [puppet] - 10https://gerrit.wikimedia.org/r/553308 (owner: 10Alexandros Kosiaris) [11:35:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] package_builder: Add ordering rules for new hooks [puppet] - 10https://gerrit.wikimedia.org/r/553309 (owner: 10Alexandros Kosiaris) [11:35:10] (03CR) 10Jbond: [C: 03+2] Revert "puppetboard: remove trailing slash from proxied url" [puppet] - 10https://gerrit.wikimedia.org/r/553318 (owner: 10Jbond) [11:36:09] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:38:10] 10Operations, 10Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) I have tested on snapshot1008, which mounts only the buster nfs share, that the dump_lock.py script with multiple instances works as it should; this is the locking mechanism fo... 
[11:41:11] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:43:36] 10Operations, 10User-jbond: Investigate how automated tasks can authenticate against CAS - https://phabricator.wikimedia.org/T239323 (10jbond) p:05Triage→03Normal [11:44:02] 10Operations: Deprecate msdos partition scheme in favor of GPT - https://phabricator.wikimedia.org/T239321 (10jbond) p:05Triage→03Normal [11:44:11] 10Operations, 10Wikimedia-Mailing-lists: Request for creation: Wiki Loves Africa Mailing List - https://phabricator.wikimedia.org/T239240 (10jbond) p:05Triage→03Normal [11:46:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:46:30] (03PS1) 10Muehlenhoff: Switch ldap-corp.codfw.wikimedia.org to ldap-corp2001 [dns] - 10https://gerrit.wikimedia.org/r/553323 (https://phabricator.wikimedia.org/T224557) [11:47:21] (03PS1) 10ArielGlenn: dumpsdata1001 will install with buster now [puppet] - 10https://gerrit.wikimedia.org/r/553324 (https://phabricator.wikimedia.org/T224563) [11:47:47] !log deployed security patch for T237667 [11:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:00] (03CR) 10ArielGlenn: [C: 03+2] dumpsdata1001 will install with buster now [puppet] - 10https://gerrit.wikimedia.org/r/553324 (https://phabricator.wikimedia.org/T224563) (owner: 10ArielGlenn) [11:50:46] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` 
['mw1337.eqiad.wmnet'] ` and were **ALL** successful. [11:51:41] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1347.eqiad.wmnet'] ` and were **ALL** successful. [11:52:14] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10jbond) 05Open→03Resolved Hi Fuzzy, i have now added you for both domains, please note that theses domains where not configured in search console until just... [11:56:18] 10Operations, 10GLOW, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T238868 (10jbond) 05Open→03Resolved a:03jbond I have added the sat site and given you permissions, please note it could take a day or two for that site to populate with... [11:56:51] (03PS3) 10Muehlenhoff: Update IDP rsync to also account for future additions to the devices directory [puppet] - 10https://gerrit.wikimedia.org/r/553296 [11:57:15] 10Operations, 10Analytics, 10Analytics-Kanban, 10SRE-Access-Requests: Add system user analytics-privatedata to the anaytics-privatedata-users group - https://phabricator.wikimedia.org/T238306 (10jbond) 05Open→03Resolved I think this is complete but please reopen if there are still outstanding tasks [11:58:58] (03CR) 10Muehlenhoff: [C: 03+2] Update IDP rsync to also account for future additions to the devices directory [puppet] - 10https://gerrit.wikimedia.org/r/553296 (owner: 10Muehlenhoff) [12:00:32] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1327.eqiad.wmnet'] ` and were **ALL** successful. 
[12:02:08] (03CR) 10BBlack: [C: 03+2] dnsbox: set up rec/auth machine-local dependency [puppet] - 10https://gerrit.wikimedia.org/r/553236 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [12:03:42] (03CR) 10BBlack: [C: 03+2] acme-chief: parallelize gdnsd-sync [puppet] - 10https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [12:04:19] (03PS1) 10Muehlenhoff: Switch idp CNAME to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/553330 [12:05:02] (03PS1) 10Muehlenhoff: Flip idp1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/553331 [12:05:13] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2289.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2289.codfw.wmnet'] ` [12:05:51] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw2289.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271205_jiji_55597_mw2289_... [12:05:54] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2289.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2289.codfw.wmnet'] ` [12:06:17] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw2289.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271206_jiji_55698_mw2289_... 
[12:07:10] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts: ` mw2290.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271207_filippo_55841_m... [12:07:16] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2290.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2290.codfw.wmnet'] ` [12:07:29] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts: ` mw2290.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271207_filippo_55893_m... [12:07:32] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2290.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2290.codfw.wmnet'] ` [12:07:43] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts: ` mw2290.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271207_filippo_55940_m... [12:08:49] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Fuzzy) Thanks a lot, John! 
[12:18:16] (03PS1) 10BBlack: dnsbox: add dns4001 to the set of authservers [puppet] - 10https://gerrit.wikimedia.org/r/553332 (https://phabricator.wikimedia.org/T98006) [12:18:35] !log reimaged dumpsdata1001 to buster and forgot to use the dang script but it is all ok anyhow :-P [12:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:13] (03CR) 10BBlack: [C: 03+2] dnsbox: add dns4001 to the set of authservers [puppet] - 10https://gerrit.wikimedia.org/r/553332 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [12:20:19] 10Operations, 10Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) Aaaaand dumpsdata1001 is reimaged. All the data is still there, available to snapshot hosts. [12:20:28] 10Operations, 10Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) [12:22:44] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [12:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:17] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [12:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:08] (03PS1) 10BBlack: Revert "test commit" [dns] - 10https://gerrit.wikimedia.org/r/553334 [12:24:52] (03CR) 10BBlack: [C: 03+2] Revert "test commit" [dns] - 10https://gerrit.wikimedia.org/r/553334 (owner: 10BBlack) [12:24:52] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:19] !log jiji@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1337.eqiad.wmnet,service=apache2 [12:26:20] !log jiji@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1347.eqiad.wmnet,service=apache2 [12:26:21] !log jiji@puppetmaster1001 conftool action : set/pooled=yes; selector: 
name=mw1327.eqiad.wmnet,service=apache2 [12:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:29] !log jiji@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1337.eqiad.wmnet,service=nginx [12:26:30] !log jiji@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1347.eqiad.wmnet,service=nginx [12:26:31] !log jiji@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1327.eqiad.wmnet,service=nginx [12:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:57] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:46] (03CR) 10Jbond: "looks good a few minor nits, see inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [12:34:19] (03PS1) 10Filippo Giunchedi: prometheus: alert on exporter's 'up' metrics [puppet] - 10https://gerrit.wikimedia.org/r/553335 (https://phabricator.wikimedia.org/T187708) [12:34:47] (03PS1) 10BBlack: authdns: use system user flag for user/group [puppet] - 10https://gerrit.wikimedia.org/r/553336 [12:36:03] (03CR) 10jerkins-bot: [V: 04-1] prometheus: alert on exporter's 'up' metrics [puppet] - 10https://gerrit.wikimedia.org/r/553335 (https://phabricator.wikimedia.org/T187708) (owner: 10Filippo Giunchedi) [12:37:14] (03CR) 10Jbond: [C: 04-1] "Hi Maryum," (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239300) (owner: 10Mstyles) [12:38:24] 
(03CR) 10BBlack: [C: 03+2] authdns: use system user flag for user/group [puppet] - 10https://gerrit.wikimedia.org/r/553336 (owner: 10BBlack) [12:39:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/553330 (owner: 10Muehlenhoff) [12:40:12] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/553331 (owner: 10Muehlenhoff) [12:44:46] 10Operations, 10Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) 05Open→03Resolved Closing, any followup issues can get their own tasks. [12:44:48] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10ArielGlenn) [12:47:44] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2289.codfw.wmnet'] ` and were **ALL** successful. [12:47:52] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2290.codfw.wmnet'] ` and were **ALL** successful. 
[12:48:23] (03CR) 10Muehlenhoff: authdns: add ferm rules for 5353 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553135 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [12:50:25] !log jiji@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=codfw,service=apache2,cluster=api_appserver,name=mw2289.codfw.wmnet [12:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:56] !log jiji@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=codfw,service=nginx,cluster=api_appserver,name=mw2289.codfw.wmnet [12:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:05] !log jiji@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,service=nginx,cluster=api_appserver,name=mw2289.codfw.wmnet [12:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:11] !log jiji@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,service=nginx,cluster=api_appserver,name=mw2289.codfw.wmnet [12:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:19] (03PS2) 10Giuseppe Lavagetto: lvs::configuration: add blubberoid TLS service [puppet] - 10https://gerrit.wikimedia.org/r/553294 [12:51:25] (03PS3) 10Jbond: CI - Python3: Fix minor flake8 issues in python3 files [puppet] - 10https://gerrit.wikimedia.org/r/551534 [12:51:27] (03PS20) 10Jbond: CI - python3: Add support for python3 syntax checks [puppet] - 10https://gerrit.wikimedia.org/r/510613 [12:51:43] <_joe_> vgutierrez: would you care taking a look? ^^ [12:52:04] <_joe_> at my patch [12:52:15] <_joe_> I decided puppet won over me. [12:53:03] tomorrow? 
:) [12:53:22] <_joe_> oh sorry [12:53:28] <_joe_> I forget it's super late already [12:53:36] <_joe_> don't worry, I'll find another pair of eyes [12:53:39] <_joe_> apologies [12:54:04] (03CR) 10jerkins-bot: [V: 04-1] CI - python3: Add support for python3 syntax checks [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [12:54:05] no problem [12:54:22] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10ArielGlenn) [12:54:45] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10ArielGlenn) [12:55:21] markup should really ignore a trailing blank after an x and make it into a checked box anyways ( [x] and [x ]) [12:56:22] (03PS21) 10Jbond: CI - python3: Add support for python3 syntax checks [puppet] - 10https://gerrit.wikimedia.org/r/510613 [12:58:38] (03CR) 10jerkins-bot: [V: 04-1] CI - python3: Add support for python3 syntax checks [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [13:01:53] PROBLEM - Disk space on an-tool1007 is CRITICAL: DISK CRITICAL - free space: / 604 MB (3% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-tool1007&var-datasource=eqiad+prometheus/ops [13:03:37] (03PS4) 10Jbond: CI - Python3: Fix minor flake8 issues in python3 files [puppet] - 10https://gerrit.wikimedia.org/r/551534 [13:04:03] (03PS22) 10Jbond: CI - python3: Add support for python3 syntax checks [puppet] - 10https://gerrit.wikimedia.org/r/510613 [13:06:27] (03CR) 10jerkins-bot: [V: 04-1] CI - python3: Add support for python3 syntax checks [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [13:07:40] (03PS5) 10Jbond: CI - Python3: Fix minor flake8 issues in python3 files [puppet] - 10https://gerrit.wikimedia.org/r/551534 [13:08:01] (03PS23) 10Jbond: CI - python3: Add support for python3 syntax checks [puppet] - 
10https://gerrit.wikimedia.org/r/510613 [13:13:00] (03CR) 10Jbond: [C: 03+2] CI - Python3: Fix minor flake8 issues in python3 files [puppet] - 10https://gerrit.wikimedia.org/r/551534 (owner: 10Jbond) [13:13:11] (03CR) 10Jbond: [C: 03+2] CI - python3: Add support for python3 syntax checks [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [13:13:48] !log reimage mw1348, mw1338, mw1328 [13:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:19] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw1348.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271315_jiji_70711_mw1348_... [13:16:39] (03CR) 10Jbond: "> Patch Set 19: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [13:17:07] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw1338.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271316_jiji_70851_mw1338_... [13:17:19] (03PS2) 10Muehlenhoff: Switch idp CNAME to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/553330 [13:17:31] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw1328.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271317_jiji_70890_mw1328_... 
[13:19:45] 10Operations, 10Puppet, 10User-jbond: Python3 style guid - https://phabricator.wikimedia.org/T239334 (10jbond) [13:21:13] 10Operations, 10Puppet, 10User-jbond: Python3 style guid - https://phabricator.wikimedia.org/T239334 (10jbond) [13:21:24] 10Operations, 10Puppet, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10jbond) [13:21:59] !log reimage mw2288, mw2287, mw2286 [13:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1105:3311 after schema change', diff saved to https://phabricator.wikimedia.org/P9765 and previous config saved to /var/cache/conftool/dbconfig/20191127-132220-marostegui.json [13:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:28] (03CR) 10Muehlenhoff: [C: 03+2] Switch idp CNAME to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/553330 (owner: 10Muehlenhoff) [13:22:51] 10Operations, 10Puppet, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10ArielGlenn) [13:22:59] (03PS2) 10Muehlenhoff: Flip idp1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/553331 [13:23:32] (03CR) 10jerkins-bot: [V: 04-1] Flip idp1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/553331 (owner: 10Muehlenhoff) [13:23:59] 10Operations, 10Puppet, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10jbond) [13:24:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1089 for schema change', diff saved to https://phabricator.wikimedia.org/P9766 and previous config saved to /var/cache/conftool/dbconfig/20191127-132359-marostegui.json [13:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:39] (03CR) 10Phamhi: [C: 03+2] labmon: add compatibility in buster (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552107 
(https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [13:24:54] (03PS5) 10Phamhi: labmon: add compatibility in buster [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) [13:24:55] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw2288.codfw.wmnet', 'mw2287.codfw.wmnet', 'mw2286.codfw.wmnet'] ` The log can be found in `/var/log/... [13:25:26] (03PS1) 10Ladsgroup: mediawiki: Disable mostrevisions for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/553339 (https://phabricator.wikimedia.org/T239072) [13:25:28] (03CR) 10jerkins-bot: [V: 04-1] labmon: add compatibility in buster [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [13:25:59] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Disable mostrevisions for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/553339 (https://phabricator.wikimedia.org/T239072) (owner: 10Ladsgroup) [13:26:06] 10Operations, 10serviceops: Reimage mwdebug1002 and mw1317 - https://phabricator.wikimedia.org/T236806 (10jijiki) 05Open→03Resolved [13:26:46] RECOVERY - traffic_server backend process restarted on cp1075 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqiad+prometheus/ops&var-instance=cp1075&var-layer=backend [13:28:16] !log cp1075: ats-{tls,backend} restarted to apply tslua reload changes T233274 [13:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:21] T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 [13:29:48] (03PS1) 10Jbond: taskgen: fix variable rename [puppet] - 10https://gerrit.wikimedia.org/r/553340 [13:30:33] is puppet broken on master? 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/553339 [13:30:51] (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/553339 (https://phabricator.wikimedia.org/T239072) (owner: 10Ladsgroup) [13:31:40] (03PS2) 10Jbond: taskgen: fix variable rename [puppet] - 10https://gerrit.wikimedia.org/r/553340 [13:32:23] Amir1: looks like jbond42's patch above might fix it [13:33:04] okay, let me know once it's done, thanks! [13:35:32] (03CR) 10Jbond: [C: 03+2] taskgen: fix variable rename [puppet] - 10https://gerrit.wikimedia.org/r/553340 (owner: 10Jbond) [13:36:48] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/553340 (owner: 10Jbond) [13:36:49] (03PS3) 10Jbond: Flip idp1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/553331 (owner: 10Muehlenhoff) [13:38:25] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:35] (03PS2) 10Ema: mediawiki: Disable mostrevisions for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/553339 (https://phabricator.wikimedia.org/T239072) (owner: 10Ladsgroup) [13:39:08] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:30] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:36] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:03] ema: thanks! 
[13:41:58] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:41:58] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:41:58] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:08] Amir1: thank jbond42! :) [13:42:25] !log cp1075: repool with tslua reloads enabled T233274 [13:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:30] T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 [13:42:49] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:42:51] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [13:42:51] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [13:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:55] effie: that's the problem with running them in parallel ^^^ [13:44:21] will it continue or should I stop the exec? 
[13:44:23] that's why I was suggesting to use 2/3 tmux with sequential instead, and starting them a few minutes apart [13:44:30] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [13:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:41] it will continue but the failed one will not be downtimed in icinga [13:44:49] sorry Amir1, I broke it and fixed it; rebasing your changes should solve the issue [13:44:49] so either you do it manually or expect some spam [13:44:55] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:59] oh I can live with that, I will downtime them [13:45:14] jbond42: it's all good, we all break things from time to time [13:45:19] thank you for your work! [13:45:26] :) thanks [13:46:02] 10Operations, 10Puppet, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10jbond) p:05Triage→03Normal [13:46:24] volans: thank you [13:47:33] (03PS1) 10Ladsgroup: Set all of testwikidatawiki to read from the new term store for items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553342 (https://phabricator.wikimedia.org/T225057) [13:49:39] Can I quickly deploy this? 
^ [13:50:21] It's for test wiki only and I'm around until Friday [13:52:33] (03PS6) 10Phamhi: labmon: add compatibility in buster [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) [13:53:26] I quickly deploy this then :D [13:53:40] (03CR) 10Ladsgroup: [C: 03+2] "testwikidata only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553342 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [13:54:34] (03Merged) 10jenkins-bot: Set all of testwikidatawiki to read from the new term store for items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553342 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [13:54:48] 10Operations, 10Dumps-Generation, 10SDC General, 10Wikidata, 10Structured-Data-Backlog (Current Work): Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn) Do we have a meeting scheduled to talk about capacity needs? [13:57:57] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set all of testwikidatawiki to read from the new term store for items (T225057) (duration: 00m 56s) [13:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:02] T225057: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225057 [13:58:48] (03PS1) 10Ema: ATS: use ts.debug instead of error in global Lua scripts [puppet] - 10https://gerrit.wikimedia.org/r/553344 (https://phabricator.wikimedia.org/T233274) [14:00:54] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime [14:00:54] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:11] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata 
for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:01:25] !log temporarily stop cas on idp1001 for some failover tests [14:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:29] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:07:03] 10Operations, 10Traffic, 10serviceops: Investigate the remaining usage of X-Real-IP - https://phabricator.wikimedia.org/T239340 (10akosiaris) [14:07:09] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2288.codfw.wmnet', 'mw2287.codfw.wmnet', 'mw2286.codfw.wmnet'] ` and were **ALL** successful. [14:07:13] 10Operations, 10Traffic, 10serviceops: Investigate the remaining usage of X-Real-IP - https://phabricator.wikimedia.org/T239340 (10akosiaris) p:05Triage→03Low [14:08:17] (03CR) 10Ema: [C: 03+2] ATS: use ts.debug instead of error in global Lua scripts [puppet] - 10https://gerrit.wikimedia.org/r/553344 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [14:08:45] PROBLEM - Check systemd state on idp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:52] (03PS2) 10Alexandros Kosiaris: otrs: Switch from X-Real-IP to X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/552514 (https://phabricator.wikimedia.org/T239340) [14:09:55] (03PS2) 10Alexandros Kosiaris: Switch from X-Real-IP to X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/552515 (https://phabricator.wikimedia.org/T239340) [14:10:42] (03CR) 10Alexandros Kosiaris: [C: 04-1] "@vgutierrez thanks for the pointer." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552515 (https://phabricator.wikimedia.org/T239340) (owner: 10Alexandros Kosiaris) [14:11:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Setting -1 for now as it probably is worth it to split this into useful corresponding commits" [puppet] - 10https://gerrit.wikimedia.org/r/552515 (https://phabricator.wikimedia.org/T239340) (owner: 10Alexandros Kosiaris) [14:11:07] 10Operations, 10Wikimedia-Mailing-lists: Create mailing list for project GLOW - https://phabricator.wikimedia.org/T238607 (10jbond) 05Open→03Resolved a:03jbond @Moushira the mailing listr has now been created you should be able to view the [[ https://lists.wikimedia.org/mailman/listinfo/glow | info page... [14:11:42] (03CR) 10Alexandros Kosiaris: [C: 04-1] "FWIW the first use case the makes sense already is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552514/2" [puppet] - 10https://gerrit.wikimedia.org/r/552515 (https://phabricator.wikimedia.org/T239340) (owner: 10Alexandros Kosiaris) [14:12:31] (03CR) 10Alexandros Kosiaris: [C: 04-1] "@Ariel, @Effie. 
Task filed as https://phabricator.wikimedia.org/T239340" [puppet] - 10https://gerrit.wikimedia.org/r/552515 (https://phabricator.wikimedia.org/T239340) (owner: 10Alexandros Kosiaris) [14:14:13] !log reimage mw2285, mw2286, mw2283 [14:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:05] !log reimage mw2285, mw2284, mw2283 [14:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:26] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw2285.codfw.wmnet', 'mw2284.codfw.wmnet', 'mw2283.codfw.wmnet'] ` The log can be found in `/var/log/... [14:20:53] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1338.eqiad.wmnet'] ` and were **ALL** successful. [14:23:14] (03CR) 10Marostegui: [C: 03+2] mediawiki: Disable mostrevisions for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/553339 (https://phabricator.wikimedia.org/T239072) (owner: 10Ladsgroup) [14:28:11] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1328.eqiad.wmnet'] ` and were **ALL** successful. [14:28:39] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1348.eqiad.wmnet'] ` and were **ALL** successful. 
[14:29:17] (03PS1) 10Ema: ATS: ensure Lua conf Icinga checks produce output [puppet] - 10https://gerrit.wikimedia.org/r/553345 (https://phabricator.wikimedia.org/T233274) [14:31:57] (03CR) 10Ema: [C: 03+2] ATS: ensure Lua conf Icinga checks produce output [puppet] - 10https://gerrit.wikimedia.org/r/553345 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [14:32:38] 10Operations, 10Wikimedia-Mailing-lists: Request for creation: Wiki Loves Africa Mailing List - https://phabricator.wikimedia.org/T239240 (10jbond) @Elitre, @Trizek-WMF or @Johan aroe one of you able to validate this request please, thanks John [14:32:59] (03PS1) 10Muehlenhoff: Absent sync systemd timer for the primary IDP [puppet] - 10https://gerrit.wikimedia.org/r/553346 [14:33:27] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [14:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/553346 (owner: 10Muehlenhoff) [14:35:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:38] (03PS1) 10Jbond: profile::idp: set loglevel to WARN [puppet] - 10https://gerrit.wikimedia.org/r/553348 [14:43:28] !log reimage mw1346, mw1336, mw1326 [14:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:30] (03PS2) 10Muehlenhoff: Absent sync systemd timer for the primary IDP [puppet] - 10https://gerrit.wikimedia.org/r/553346 [14:44:31] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw1346.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271443_jiji_96389_mw1346_... 
[14:46:55] (03CR) 10Muehlenhoff: [C: 03+2] Absent sync systemd timer for the primary IDP [puppet] - 10https://gerrit.wikimedia.org/r/553346 (owner: 10Muehlenhoff) [14:49:09] I filed T239344 about mobileapps flapping on scb2005. [14:49:10] T239344: Mobileapps flapping on scb2005 since 2019-11-26 0:00 UTC - https://phabricator.wikimedia.org/T239344 [14:49:28] (03PS22) 10Andrew Bogott: nova: add nova config for the placement service, enable on eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [14:49:49] (03PS1) 10BBlack: authdns: restrict 5353 to production [puppet] - 10https://gerrit.wikimedia.org/r/553349 (https://phabricator.wikimedia.org/T98006) [14:50:41] !log Create nova_cell0 database on m5 master - T239170 [14:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:46] T239170: Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 [14:51:10] (03PS1) 10Marostegui: production-m5.sql: Add access to nova_cell0 for nova user [puppet] - 10https://gerrit.wikimedia.org/r/553350 (https://phabricator.wikimedia.org/T239170) [14:53:29] (03PS4) 10Ottomata: Add LVS for eventgate-logging-external using TLS port [puppet] - 10https://gerrit.wikimedia.org/r/550922 (https://phabricator.wikimedia.org/T236386) [14:54:12] (03CR) 10Andrew Bogott: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/553350 (https://phabricator.wikimedia.org/T239170) (owner: 10Marostegui) [14:54:20] (03PS2) 10Ottomata: Add discovery for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/550923 (https://phabricator.wikimedia.org/T236386) [14:54:23] (03PS2) 10Ottomata: Add discovery entries for eventgate-logging-external [dns] - 10https://gerrit.wikimedia.org/r/550915 (https://phabricator.wikimedia.org/T236386) [14:55:17] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:55:47] (03CR) 10Marostegui: [C: 03+2] production-m5.sql: Add access to nova_cell0 for nova user [puppet] - 10https://gerrit.wikimedia.org/r/553350 (https://phabricator.wikimedia.org/T239170) (owner: 10Marostegui) [14:56:20] !log Add new grants for nova_cell0 database on m5 - T239170 [14:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:25] T239170: Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 [14:56:51] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:59:52] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This needs to be a bit more refined, I'll work on it as soon as I can, but basically we should not hardcode a LVS server here." 
[puppet] - 10https://gerrit.wikimedia.org/r/553097 (https://phabricator.wikimedia.org/T238751) (owner: 10Alaa Sarhan) [14:59:54] (03CR) 10Aklapper: [C: 03+1] Phabricator: Rename Priority field value "Normal" to "Medium" [puppet] - 10https://gerrit.wikimedia.org/r/552626 (https://phabricator.wikimedia.org/T228757) (owner: 10Aklapper) [15:00:09] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir1001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikimania.com has 86393 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [15:00:21] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir2001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikimania.com has 86380 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [15:00:29] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir2002 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikimania.com has 86372 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [15:00:37] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir1002 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikimania.com has 86364 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [15:00:55] (03CR) 10Jbond: [C: 03+2] puppetdb6: remove old puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/547219 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [15:01:08] (03PS7) 10Jbond: puppetdb6: remove old puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/547219 (https://phabricator.wikimedia.org/T235655) [15:01:10] (03CR) 10Aklapper: [C: 03+1] "@Mukunda: Wondering if this is all to be done, or if you can remember / imagine anything else needed? (Plus who could give a final +2...)" [puppet] - 10https://gerrit.wikimedia.org/r/552626 (https://phabricator.wikimedia.org/T228757) (owner: 10Aklapper) [15:02:12] (03CR) 1020after4: "@aklapper: I think that should do it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/552626 (https://phabricator.wikimedia.org/T228757) (owner: 10Aklapper) [15:02:38] !log remove trapperkeeper-webserver-jetty9-clojure debs from apt.wikimedia.org/buster-wikimedia (these were needed to unbreak TLS on Puppetdb in Buster, but an update landed in Buster 10.2, which replaces our custom hotfix) [15:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:00] !log cp-ats: rolling ats-{tls,backend} restart to enable lua reload T233274 [15:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:05] T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 [15:05:06] (03CR) 10BBlack: [C: 03+2] authdns: restrict 5353 to production [puppet] - 10https://gerrit.wikimedia.org/r/553349 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [15:05:43] (03PS1) 10Marostegui: production-m5.sql.erb: Change nova_cell0 to nova_cell0_eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/553351 (https://phabricator.wikimedia.org/T239170) [15:06:00] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [15:06:02] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:14] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw1336.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271505_jiji_103277_mw1336... 
[15:07:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add LVS for eventgate-logging-external using TLS port [puppet] - 10https://gerrit.wikimedia.org/r/550922 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:07:51] (03CR) 10Andrew Bogott: "thanks, sorry for the last-minute rename" [puppet] - 10https://gerrit.wikimedia.org/r/553351 (https://phabricator.wikimedia.org/T239170) (owner: 10Marostegui) [15:07:55] (03PS3) 10Giuseppe Lavagetto: lvs::configuration: add blubberoid TLS service [puppet] - 10https://gerrit.wikimedia.org/r/553294 [15:07:59] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` mw1326.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911271507_jiji_103657_mw1326... [15:08:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] lvs::configuration: add blubberoid TLS service [puppet] - 10https://gerrit.wikimedia.org/r/553294 (owner: 10Giuseppe Lavagetto) [15:08:44] (03CR) 10Marostegui: [C: 03+2] production-m5.sql.erb: Change nova_cell0 to nova_cell0_eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/553351 (https://phabricator.wikimedia.org/T239170) (owner: 10Marostegui) [15:09:02] _joe_: can I merge your changes? 
[15:09:12] <_joe_> marostegui: can I merge yours please [15:09:22] <_joe_> mine need visual verification of the results [15:09:24] _joe_: yep, anytime [15:09:28] _joe_: mine are a noop [15:11:20] !log downgrading trapperkeeper-webserver-jetty9-clojure packages on puppetdb hosts to the version shipped in Buster 10.2 [15:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:44] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [15:14:46] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:50] (03PS1) 10Ottomata: Fix type 'evengate' -> 'eventgate' in conftool-data eqiad [puppet] - 10https://gerrit.wikimedia.org/r/553352 (https://phabricator.wikimedia.org/T236386) [15:19:01] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Fix type 'evengate' -> 'eventgate' in conftool-data eqiad [puppet] - 10https://gerrit.wikimedia.org/r/553352 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:21:11] !log oblivian@cumin1001 conftool action : set/weight=10:pooled=yes; selector: service=eventgate-logging-external [15:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:54] 10Operations, 10Gerrit-Privilege-Requests, 10Release-Engineering-Team, 10Wikidata, 10Wikidata-Query-Service: Add dcausse to wikidata-query-deploy - https://phabricator.wikimedia.org/T239341 (10dcausse) [15:24:26] !log installing freetype bugfix updates from Buster 10.2 point release [15:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:42] <_joe_> !log restarting lvs2006 for addition of eventgate-logging-external,blubberoid-https [15:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:51] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: 
/{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:27:33] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:27:41] !log Add grants for dump (10.64.0.95,10.64.16.31) for nova_cell0_eqiad database on db1117:3325 and db2078:3325 - T239170 [15:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:47] T239170: Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 [15:28:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/553348 (owner: 10Jbond) [15:28:32] (03CR) 10Jbond: [C: 03+2] profile::idp: set loglevel to WARN [puppet] - 10https://gerrit.wikimedia.org/r/553348 (owner: 10Jbond) [15:29:10] !log Add grants for dump (10.192.0.114,10.192.16.96) for nova_cell0_eqiad database on db1117:3325 and db2078:3325 - T239170 [15:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:22] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [15:29:25] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:06] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [15:30:08] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:05] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-logging-external_43192: Servers kubernetes2001.codfw.wmnet, kubernetes2003.codfw.wmnet, 
kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:36:48] <_joe_> known ^^ [15:37:07] !log Logging retroactively for the record: drop user 'nova'@'%' from m5 - T239170 [15:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:13] T239170: Create a new nova database on m5 named 'nova_cell0' - https://phabricator.wikimedia.org/T239170 [15:38:21] 10Operations, 10ops-codfw, 10ops-eqdfw: scs-[a1-c8]-codfw redundancy power test - https://phabricator.wikimedia.org/T239345 (10Papaul) [15:39:07] 10Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (10MoritzMuehlenhoff) [15:40:05] (03PS1) 10Ottomata: Use https:// for eventgate-logging-external ProxyFetch LVS check [puppet] - 10https://gerrit.wikimedia.org/r/553355 (https://phabricator.wikimedia.org/T236386) [15:41:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Use https:// for eventgate-logging-external ProxyFetch LVS check [puppet] - 10https://gerrit.wikimedia.org/r/553355 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:42:20] !log migrate db entries of archive Media to backup1001 T238048 [15:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:25] T238048: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 [15:43:21] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) p:05Triage→03High ` root@db1135.eqiad.wmnet[bacula9]> UPDATE Media SET StorageId = 11 WHERE StorageId = 4; Query OK, 2 rows affected (0.00 sec) R... 
[15:44:03] <_joe_> !log restarting pybal again on lvs2006 [15:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:11] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:45:19] PROBLEM - Nginx local proxy to apache on mw2288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1888 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:45:23] PROBLEM - PHP7 rendering on mw2288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1888 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:45:55] PROBLEM - Apache HTTP on mw2288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1888 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:47:14] <_joe_> !log restarting pybal on lvs2003 [15:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:58] !log testing redundancy power on scs-a1-codfw [15:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:26] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Fuzzy) Hi @jbond, It seems I don't have permission for the "REQUEST INDEXING" tool. 
Thanks [15:52:26] !log disabling puppet on dbprov1001 to test bacula restore T238048 [15:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:32] T238048: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 [15:53:29] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10jbond) hi Fuzzy i have increased your permissions are you able to do this now? [15:53:37] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:55:17] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:55:23] <_joe_> @log restarting pybal on lvs1016 [15:55:28] <_joe_> !log restarting pybal on lvs1016 [15:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:08] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1346.eqiad.wmnet'] ` and were **ALL** successful. 
[15:56:25] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.50:43192, 10.2.2.31:4666]) https://wikitech.wikimedia.org/wiki/PyBal [15:56:26] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [15:56:29] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:48] (03PS1) 10CDanis: SpecialContributions: max concurrency 3 (instead of 10) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553360 (https://phabricator.wikimedia.org/T234450) [15:57:43] <_joe_> !log restarting pybal on lvs1015 [15:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:34] 10Operations, 10ops-codfw, 10ops-eqdfw: scs-[a1-c8]-codfw redundancy power test - https://phabricator.wikimedia.org/T239345 (10Papaul) Test on scs-a1-codfw Test1: turn off power switch on psu1 and turn on power switch on psu2 device is still on Test2: turn off power switch on psu2 and turn on power switch on... [15:58:46] 10Operations, 10ops-codfw, 10ops-eqdfw: scs-[a1-c1]-codfw redundancy power test - https://phabricator.wikimedia.org/T239345 (10Papaul) [15:59:11] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Fuzzy) Yep, now it works :) Thanks again! 
[15:59:52] (03CR) 10Marostegui: [C: 03+1] SpecialContributions: max concurrency 3 (instead of 10) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553360 (https://phabricator.wikimedia.org/T234450) (owner: 10CDanis) [16:00:29] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:06:56] !log cp3050: set proxy.config.http.server_session_sharing.match to "ip" T238494 [16:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:01] T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [16:08:29] ema: lol, nice find! [16:08:36] +1 "ip" beats "both" for us [16:08:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] SpecialContributions: max concurrency 3 (instead of 10) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553360 (https://phabricator.wikimedia.org/T234450) (owner: 10CDanis) [16:10:03] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): labmon1001 graphite instance archiver keeps archiving the same instances - https://phabricator.wikimedia.org/T120377 (10Bstorm) 05Open→03Invalid We can't even tell what this relates to anymore. [16:10:50] bblack: silly but effective! 
cp3050 is now establishing ~20 new conns per second vs ~70 on cp3052 [16:12:06] (03PS1) 10Cwhite: logstash,hiera: add logstash performance tunables and tune batch size [puppet] - 10https://gerrit.wikimedia.org/r/553361 (https://phabricator.wikimedia.org/T215904) [16:12:23] who knew that being the only major site on the internet that uses 3,876 distinct public hostnames for the same basic technical service would make life difficult all the damn time in surprising new ways? [16:13:52] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1336.eqiad.wmnet'] ` and were **ALL** successful. [16:14:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1089 after schema change', diff saved to https://phabricator.wikimedia.org/P9767 and previous config saved to /var/cache/conftool/dbconfig/20191127-161450-marostegui.json [16:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1080 for schema change', diff saved to https://phabricator.wikimedia.org/P9768 and previous config saved to /var/cache/conftool/dbconfig/20191127-161525-marostegui.json [16:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:40] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:17:42] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:18:11] 10Operations, 10Toolforge, 10observability, 10User-fgiunchedi, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Tool Labs / Tool Forge - https://phabricator.wikimedia.org/T210991 (10Bstorm) Since diamond is being dropped from Debian, let's review this soon. [16:18:15] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2285.codfw.wmnet', 'mw2284.codfw.wmnet', 'mw2283.codfw.wmnet'] ` and were **ALL** successful. [16:20:59] 10Operations, 10Wikimedia-General-or-Unknown, 10Availability: Consider using Cassandra/restbase in place of external store - https://phabricator.wikimedia.org/T100705 (10ArielGlenn) I don't really want to revive this ticket but I do want to know if it's seriously on the roadmap or indefinitely deferred/rejec... [16:21:39] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1326.eqiad.wmnet'] ` and were **ALL** successful. [16:22:42] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (10Andrew) 05Open→03Resolved a:03Andrew now the yaml backend is the default. 
[16:23:09] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): labnet/ labtestnet2001 - disk space - nova-api.log needs rotation - https://phabricator.wikimedia.org/T153279 (10Andrew) 05Open→03Invalid this hasn't been an issue lately. [16:23:46] 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Migrate references from $instance.eqiad.wmflabs to $instance.$project.eqiad.wmflabs - https://phabricator.wikimedia.org/T153608 (10Andrew) a:03Andrew [16:29:23] jouncebot: next [16:29:23] In 7 hour(s) and 30 minute(s): WMF Holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191128T0000) [16:29:51] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) [16:29:56] (03CR) 10CDanis: [C: 03+2] SpecialContributions: max concurrency 3 (instead of 10) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553360 (https://phabricator.wikimedia.org/T234450) (owner: 10CDanis) [16:31:09] 10Puppet, 10Toolforge, 10Goal, 10cloud-services-team (Kanban): Fully puppetize Grid Engine - https://phabricator.wikimedia.org/T88711 (10Andrew) a:03Bstorm this can be closed, can't it? 
[16:31:38] 10Operations, 10ops-codfw, 10ops-eqdfw: scs-[a1-c1]-codfw redundancy power test - https://phabricator.wikimedia.org/T239345 (10Papaul) p:05Triage→03Normal [16:31:47] nice window announcement, jouncebot [16:31:49] (03PS1) 10Filippo Giunchedi: install_server: standard recipe and raid1/raid10 [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) [16:31:51] (03PS1) 10Filippo Giunchedi: install_server: standard partman example [puppet] - 10https://gerrit.wikimedia.org/r/553364 [16:32:06] :) [16:32:22] !log cdanis@deploy1001 Synchronized wmf-config/PoolCounterSettings.php: dd4c76d3d SpecialContributions: max concurrency 3 (instead of 10) T234450 (duration: 01m 17s) [16:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:01] 10Puppet, 10Toolforge, 10Goal, 10cloud-services-team (Kanban): Fully puppetize Grid Engine - https://phabricator.wikimedia.org/T88711 (10Bstorm) 05Open→03Resolved I'm closing it. Deleting a node needs work, but that exists as another ticket. [16:33:37] (03PS1) 10Jbond: tlsproxy::localssl: add parameter type checking [puppet] - 10https://gerrit.wikimedia.org/r/553365 [16:33:39] (03PS1) 10Jbond: tlsproxy::localssl: allow users to specify an upstream ip address [puppet] - 10https://gerrit.wikimedia.org/r/553366 [16:33:41] (03PS1) 10Jbond: apereo_cas: ensure we log the correct client ip address and not nginx's [puppet] - 10https://gerrit.wikimedia.org/r/553367 [16:34:49] heh [16:35:10] 10Operations, 10ops-codfw: codfw: rack/setup/install mc203[7,8,9].codfw.wmnet - https://phabricator.wikimedia.org/T239249 (10Papaul) [16:35:20] I still have this gut reaction when I see a tlsproxy::localssl patch to go run and check that whatever someone's doing for some other use of it doesn't break our traffic nodes' use of it [16:35:29] but we don't use it anymore, which is pretty awesome! 
:) [16:35:33] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::localssl: add parameter type checking [puppet] - 10https://gerrit.wikimedia.org/r/553365 (owner: 10Jbond) [16:35:48] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::localssl: allow users to specify an upstream ip address [puppet] - 10https://gerrit.wikimedia.org/r/553366 (owner: 10Jbond) [16:40:41] (03PS2) 10Jbond: tlsproxy::localssl: add parameter type checking [puppet] - 10https://gerrit.wikimedia.org/r/553365 [16:40:57] lol :) [16:42:23] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on dbstore1003 - https://phabricator.wikimedia.org/T239217 (10elukey) Thanks a lot! [16:45:46] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Investigate use of Puppet "environments" for per-project Puppet manifests - https://phabricator.wikimedia.org/T170370 (10Andrew) [16:46:16] (03PS3) 10Giuseppe Lavagetto: Add discovery for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/550923 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [16:46:30] 10Operations, 10Patch-For-Review: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955 (10fgiunchedi) [16:47:52] (03PS3) 10Jbond: tlsproxy::localssl: add parameter type checking [puppet] - 10https://gerrit.wikimedia.org/r/553365 [16:47:54] (03PS2) 10Jbond: tlsproxy::localssl: allow users to specify an upstream ip address [puppet] - 10https://gerrit.wikimedia.org/r/553366 [16:48:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add discovery for eventgate-logging-external [puppet] - 10https://gerrit.wikimedia.org/r/550923 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [16:49:50] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::localssl: add parameter type checking [puppet] - 10https://gerrit.wikimedia.org/r/553365 (owner: 10Jbond) [16:49:55] (03PS2) 10Jbond: apereo_cas: ensure we log the correct client ip address and not nginx's [puppet] - 10https://gerrit.wikimedia.org/r/553367 [16:50:00] (03CR) 10jerkins-bot: [V: 
04-1] tlsproxy::localssl: allow users to specify an upstream ip address [puppet] - 10https://gerrit.wikimedia.org/r/553366 (owner: 10Jbond) [16:50:46] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-logging-external [16:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:07] 10Operations, 10Data-Services, 10cloud-services-team (Kanban): Switch labstore servers to default SSH configuration - https://phabricator.wikimedia.org/T177914 (10Bstorm) The commments are still present in puppet, but I wouldn't be surprised if the workaround described here doesn't even work anymore. [16:51:44] 10Operations, 10DC-Ops: HP SSD Failure Firmware Fix - https://phabricator.wikimedia.org/T239211 (10wiki_willy) a:03wiki_willy [16:52:21] (03PS5) 10Elukey: role::dumps::distribution::server: add kerberos [puppet] - 10https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) [16:52:29] (03PS4) 10Jbond: tlsproxy::localssl: add parameter type checking [puppet] - 10https://gerrit.wikimedia.org/r/553365 [16:52:47] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:53:45] (03PS23) 10Andrew Bogott: nova: add nova config for the placement service, enable on eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) [16:54:18] (03CR) 10Elukey: [C: 03+2] role::dumps::distribution::server: add kerberos [puppet] - 10https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [16:54:24] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::localssl: 
add parameter type checking [puppet] - 10https://gerrit.wikimedia.org/r/553365 (owner: 10Jbond) [16:54:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add discovery entries for eventgate-logging-external [dns] - 10https://gerrit.wikimedia.org/r/550915 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [16:54:52] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10jbond) 05Resolved→03Open no problem [16:55:01] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:55:23] (03PS2) 10Elukey: role::dumps::distribution::server: add analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/550816 (https://phabricator.wikimedia.org/T234229) [16:57:22] !log disabling puppet on clouvirt* and cloudcontrol* while merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552894/ [16:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:33] (03CR) 10Andrew Bogott: [C: 03+2] nova: add nova config for the placement service, enable on eqiad1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552894 (https://phabricator.wikimedia.org/T239161) (owner: 10Andrew Bogott) [16:58:04] (03CR) 10Elukey: [C: 03+2] role::dumps::distribution::server: add analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/550816 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [16:59:09] 10Operations, 10serviceops, 10Kubernetes, 10Release Pipeline (Blubber): Move blubberoid to use TLS only. 
- https://phabricator.wikimedia.org/T236017 (10Joe) p:05Triage→03Normal [16:59:39] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10Bstorm) [17:00:47] (03PS1) 10Giuseppe Lavagetto: trafficserver: use https discovery url for blubberoid [puppet] - 10https://gerrit.wikimedia.org/r/553369 (https://phabricator.wikimedia.org/T236017) [17:02:32] 10Operations, 10serviceops, 10Kubernetes, 10Patch-For-Review, 10Release Pipeline (Blubber): Move blubberoid to use TLS only. - https://phabricator.wikimedia.org/T236017 (10Joe) Once the patch I created is merged, we will be able to remove the HTTP endpoint as soon as we're varnish-be-free. [17:05:57] 10Operations, 10DC-Ops, 10procurement: HP SSD Failure Firmware Fix - https://phabricator.wikimedia.org/T239211 (10wiki_willy) Email sent over to Dasher/HP to see if they can cross-check if any of our previously purchased drives are impacted. Thanks, Willy [17:06:31] 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops: HP SSD Failure Firmware Fix - https://phabricator.wikimedia.org/T239211 (10wiki_willy) [17:07:07] 10Operations, 10Data-Services, 10cloud-services-team (Kanban): Undo special tools-home and tools-project share definitions for NFS - https://phabricator.wikimedia.org/T161834 (10Bstorm) At this point, I have expanded rather than reduced this tendency (unfortunately) to accommodate maps because they are on di... [17:08:12] PROBLEM - Check systemd state on labstore1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:29] 10Operations: Add Daimona to #mediawiki_security - https://phabricator.wikimedia.org/T239093 (10RobH) So, the channel in question is pretty much a legacy channel, left over before the standardization of #wikimedia-subteam/project standard of naming. 
I'm not sure who should be allowed in, as the access list curr... [17:14:06] 10Operations: Add Daimona to #mediawiki_security - https://phabricator.wikimedia.org/T239093 (10jbond) @Daimona sorry for the lag on this ticket it is still on my radar im just trying to work out the correct procedure to get you authorised as the #mediawiki_security channel is a bit strange due to legacy issues.... [17:14:20] (03CR) 10Dzahn: [C: 03+2] Phabricator: Rename Priority field value "Normal" to "Medium" [puppet] - 10https://gerrit.wikimedia.org/r/552626 (https://phabricator.wikimedia.org/T228757) (owner: 10Aklapper) [17:14:49] (03PS1) 10ArielGlenn: comment out sagres.c3sl.ufpr.br from dumps mirrors list [puppet] - 10https://gerrit.wikimedia.org/r/553371 [17:16:41] (03CR) 10Bstorm: [C: 03+1] comment out sagres.c3sl.ufpr.br from dumps mirrors list [puppet] - 10https://gerrit.wikimedia.org/r/553371 (owner: 10ArielGlenn) [17:18:39] (03CR) 10Dzahn: "oh yea, i forgot about it being jessie and that issue.." [puppet] - 10https://gerrit.wikimedia.org/r/552947 (owner: 10Dzahn) [17:19:27] RECOVERY - Check systemd state on labstore1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:33] (03CR) 10Elukey: [C: 03+1] comment out sagres.c3sl.ufpr.br from dumps mirrors list [puppet] - 10https://gerrit.wikimedia.org/r/553371 (owner: 10ArielGlenn) [17:21:18] RECOVERY - Nginx local proxy to apache on mw2288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.236 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:21:34] RECOVERY - Apache HTTP on mw2288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:22:14] RECOVERY - PHP7 rendering on mw2288 is OK: HTTP OK: HTTP/1.1 200 OK - 75376 bytes in 0.559 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:22:18] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Aklapper) @jbond: Did you intentionally reopen this task? [17:22:42] (03CR) 10ArielGlenn: [C: 03+2] comment out sagres.c3sl.ufpr.br from dumps mirrors list [puppet] - 10https://gerrit.wikimedia.org/r/553371 (owner: 10ArielGlenn) [17:22:50] (03CR) 10Dzahn: [C: 03+1] "just needs to go after https://gerrit.wikimedia.org/r/c/operations/puppet/+/553183" [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [17:23:13] 10Operations, 10Puppet, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10CDanis) [17:23:59] (03PS2) 10Dzahn: Always set AIRFLOW_HOME when running airflow [puppet] - 10https://gerrit.wikimedia.org/r/553183 (owner: 10EBernhardson) [17:25:27] 10Operations, 10Puppet, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10jbond) we should potentially also refresh discussions on [[ https://phabricator.wikimedia.org/T211750 | black ]]. potentially with a CI hook automaticity reformat the code via pre-co... [17:28:20] (03CR) 10Dzahn: "@Aklappper It worked. I raised the priority of the linked ticket to.... NORMAL." [puppet] - 10https://gerrit.wikimedia.org/r/552626 (https://phabricator.wikimedia.org/T228757) (owner: 10Aklapper) [17:29:15] (03CR) 10Dzahn: "arr..disregard that comment. 
may need restart" [puppet] - 10https://gerrit.wikimedia.org/r/552626 (https://phabricator.wikimedia.org/T228757) (owner: 10Aklapper) [17:30:18] (03PS3) 10Krinkle: Gerrit: Make 'eclipse' and 'elegant' themes colorblind-friendly [puppet] - 10https://gerrit.wikimedia.org/r/536687 (https://phabricator.wikimedia.org/T232893) [17:30:22] 10Operations: Add Daimona to #mediawiki_security - https://phabricator.wikimedia.org/T239093 (10Daimona) Thanks! >>! In T239093#5697868, @RobH wrote: > * We need to know the user's IRC cloak, if they have one. I do have a cloak, `wikipedia/Daimona-Eaytoy`, as shown in the info you posted. > ** @diamona turns... [17:30:48] mutante: Not sure if anyone cares or whether it needs review - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536687/ but would be nice to land :) [17:31:07] (if not, should I ask RelEng to +1?) [17:32:21] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [17:32:29] Krinkle: i added myself to reviewers in Gerrit UI. that will remind me to do it in a bit, just in the middle of other ones [17:32:37] cool [17:33:43] 10Operations, 10Puppet, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10BBlack) I will float the opinion that while I may have many opinions on code style, bikeshedding between reasonable options for a shared standard is a waste. If there's a standard up... [17:33:52] 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops: HP SSD Failure Firmware Fix - https://phabricator.wikimedia.org/T239211 (10wiki_willy) 05Open→03Resolved Confirmed by Dasher (email below), that we're not impacted by the critical firmware update bulletins. Resolving task. Hi Willy, Thank you for bri... 
[17:36:20] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6375 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:37:06] 10Operations, 10Wikimedia-Mailing-lists: Request for creation: Wiki Loves Africa Mailing List - https://phabricator.wikimedia.org/T239240 (10Johan) I'm not really sure what you're looking for from us, @jbond but if you just want a sanity check: makes sense to me and the requesters are known and trusted members... [17:39:22] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.03333 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:46:16] (03CR) 10Dzahn: "As pointed out by Mukunda: The change did not take effect because it was already by edits in the UI." [puppet] - 10https://gerrit.wikimedia.org/r/552626 (https://phabricator.wikimedia.org/T228757) (owner: 10Aklapper) [17:47:07] (03CR) 10Dzahn: ".already overridden by changes in the UI." 
[puppet] - 10https://gerrit.wikimedia.org/r/552626 (https://phabricator.wikimedia.org/T228757) (owner: 10Aklapper) [17:47:58] (03CR) 10Dzahn: [C: 03+2] Always set AIRFLOW_HOME when running airflow [puppet] - 10https://gerrit.wikimedia.org/r/553183 (owner: 10EBernhardson) [17:50:51] (03PS1) 10Ssingh: Edit Project Config [software/censorship-monitoring] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/553381 [17:51:14] ha [17:52:11] (03Abandoned) 10Ssingh: Edit Project Config [software/censorship-monitoring] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/553381 (owner: 10Ssingh) [17:53:28] (03CR) 10Dzahn: [C: 03+1] airflow: Run webserver and scheduler processes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [17:53:54] (03PS16) 10Dzahn: airflow: Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [17:54:27] (03CR) 10jerkins-bot: [V: 04-1] airflow: Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [17:56:13] (03PS17) 10Dzahn: airflow: Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [18:03:27] (03CR) 10CRusnov: backends: add Netbox backend (032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [18:03:39] 10Operations, 10Puppet, 10User-ArielGlenn, 10User-jbond: Python3 style guide - https://phabricator.wikimedia.org/T239334 (10Volans) I think the best way is to have it easily integrated in some form in the local workflow in our dev envs, so that when you tests locally they passes and when you commit you com... 
[18:17:21] (03PS1) 10Dzahn: airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 [18:17:32] (03PS1) 10Ssingh: First commit of censorship monitoring project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/553385 [18:19:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1003/19666/logstash1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/553361 (https://phabricator.wikimedia.org/T215904) (owner: 10Cwhite) [18:19:32] (03CR) 10jerkins-bot: [V: 04-1] airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 (owner: 10Dzahn) [18:19:48] (03CR) 10Dzahn: airflow: Run webserver and scheduler processes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [18:24:01] (03CR) 10Dzahn: [C: 03+2] airflow: Run webserver and scheduler processes [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [18:38:47] 10Operations, 10ops-codfw, 10ops-eqdfw: scs-[a1-c1]-codfw redundancy power test - https://phabricator.wikimedia.org/T239345 (10Papaul) Looks like redundancy power is working on both scs's in codfw. What thing to do in esams is to try to change the outlet in which the scs is plugged in on ps2-oe16-esams and p... 
[18:39:42] (03CR) 10Dzahn: "The systemd units exist now, though the services fail to start for now:" [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [18:41:20] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10thcipriani) [18:42:30] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10thcipriani) Lowered the memory request as that seems to be out-of-line with the usage of most Ganeti VMs, hopefully 16G would work for a Ganeti VM? [18:43:30] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10thcipriani) >>! In T239151#5693254, @hashar wrote: > About "access to gerrit sql", would it be sufficient to do a database dump from production and load that in a MySQL server local to the t... [18:44:38] (03CR) 10Dzahn: "airflow-webserver: PermissionError: [Errno 13] Permission denied: '/etc/airflow/unittests.cfg'" [puppet] - 10https://gerrit.wikimedia.org/r/544990 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [18:48:58] (03PS2) 10Dzahn: airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) [18:49:20] (03CR) 10jerkins-bot: [V: 04-1] airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [18:49:22] mutante: i just dug through the airflow code to double check, it looks like we will never read the unittests.cfg file, but some empty file must exist there or it will attempt to write a default version [18:49:51] ebernhardson: the file does not exist but i think it wants to create it and the dir is [18:49:54] root:airflow 440 [18:50:04] mutante: right, we should just have puppet put an empty file 
there [18:50:07] i guess we want to ensure airflow:airflow and 755 ? [18:50:26] or 644 since puppet adds the +1 on dirs anyways [18:50:30] mutante: it doesn't really need to write to that path, that is just their "easy first run" that will magic config files into places if you don't have them already [18:52:04] hmm. how does it know it's the first run [18:53:44] mutante: it doesn't, it tries to write any time the file doesn't exist [18:54:06] mutante: basically it assumes its not the first run, because on the first run it writes the files out [18:54:37] basically this: https://github.com/apache/airflow/blob/master/airflow/configuration.py#L512 [18:55:05] ebernhardson: i tried "touch unittests.cfg" and starting it again but that doesnt do it. the file cant be empty i guess [18:55:43] mutante: hmm, it shouldn't ever actually read that file. Looking at logs [18:55:57] ValueError: Unable to configure handler 'processor' [18:56:07] maybe that's an unrelated thing [18:56:26] yea, sounds like i might have a config issue somewhere. looking [18:56:36] let me try what it puts in that file if it can write to the dir? [18:56:40] ok [18:56:48] mutante: sure [18:58:53] !log an-airflow1001: cd /etc/ ; chown airflow airflow; systemctl start airflow-webserver to let airflow write unittests.cfg [18:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:15] ebernhardson: unittests.cfg now has content. config error is different and unchanged [19:00:24] mutante: hmm, for some reason /var/log/airflow is a symlink to `dir`. Because i wrote the definitions for logdir and piddir wrong ... 
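The "write it whenever it is missing" behaviour ebernhardson describes above can be sketched roughly as follows. This is not Airflow's actual code: `ensure_default_config` and `DEFAULT_TEST_CONFIG` are hypothetical names used only for illustration. The point is that there is no "first run" flag. The only check is whether the file exists, so a missing file in a directory the service user cannot write to fails with PermissionError on every start.

```python
import os
import tempfile

# Hypothetical placeholder contents; not Airflow's real default config.
DEFAULT_TEST_CONFIG = "# placeholder defaults\n"


def ensure_default_config(path, default_text=DEFAULT_TEST_CONFIG):
    """Write default_text to path only if no file exists there.

    Returns True if the file was written, False if it already existed.
    Raises PermissionError if the file is missing and the directory is
    not writable by the current user.
    """
    if os.path.isfile(path):
        # An existing file, even an empty one, is left untouched.
        return False
    with open(path, "w") as handle:
        handle.write(default_text)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as config_dir:
        cfg = os.path.join(config_dir, "unittests.cfg")
        print(ensure_default_config(cfg))  # file is missing, so it is written
        print(ensure_default_config(cfg))  # file exists now, so nothing is written
```

Under this model, having puppet ship an empty file at that path would be enough to skip the write, which matches the suggestion made in the channel.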
[19:00:33] !log an-airflow1001: cd /etc/ ; chown airflow airflow; systemctl start airflow-webserver to let airflow write unittests.cfg (it tries to write this on first start and did not have permissions to do so) T236180 [19:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:39] T236180: Deploy search platform airflow service - https://phabricator.wikimedia.org/T236180 [19:00:48] ebernhardson: ah [19:00:51] mutante: it should have `ensure => 'directory'` but currently has `ensure => 'dir'` [19:01:07] ah:) ok, easy to fix [19:02:56] (03PS1) 10Dzahn: airflow: fix ensure => directory [puppet] - 10https://gerrit.wikimedia.org/r/553392 [19:03:19] (03PS2) 10Dzahn: airflow: fix ensure => directory [puppet] - 10https://gerrit.wikimedia.org/r/553392 (https://phabricator.wikimedia.org/T236180) [19:03:42] (03CR) 10Dzahn: [C: 03+2] airflow: fix ensure => directory [puppet] - 10https://gerrit.wikimedia.org/r/553392 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [19:06:04] Notice: /Stage[main]/Profile::Analytics::Search::Airflow/File[/var/log/airflow]/ensure: ensure changed 'link' to 'directory' [19:06:15] Notice: /Stage[main]/Profile::Analytics::Search::Airflow/Systemd::Service[airflow-webserver]/Service[airflow-webserver]/ensure: ensure changed 'stopped' to 'running' [19:06:52] ebernhardson: one issue fixed! but new issue [19:07:05] ebernhardson: No module named 'MySQLdb' [19:07:23] but progress.other issue is gone [19:07:24] mutante: ahh, of course. Shipping a fix [19:08:05] !log ebernhardson@deploy1001 Started deploy [search/airflow@57f4caa]: Install mysqlclient to airflow instance [19:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:45] !log ebernhardson@deploy1001 Finished deploy [search/airflow@57f4caa]: Install mysqlclient to airflow instance (duration: 00m 40s) [19:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:07] err, hmm. 
Seems this one has platform dependencies: libmysqlclient.so.20: [19:09:14] ImportError: libmysqlclient.so.20: cannot open shared object file [19:09:19] looking around [19:15:03] had to get off a train, be back soon [19:15:55] ack [19:16:20] [an-airflow1001:~] $ apt-cache search libmysqlclient.so [19:16:20] default-libmysqlclient-dev - MySQL database development files (metapackage) [19:17:08] yea i found that, wasn't sure if it also provides .20, seems worth testing at least. [19:18:01] the meta package would pull all these: [19:18:02] libgmp-dev libgmpxx4ldbl libgnutls-openssl27 libgnutls28-dev libgnutlsxx28 libidn2-dev libmariadb-dev libmariadb-dev-compat libmariadb3 [19:18:05] libp11-kit-dev libtasn1-6-dev nettle-dev [19:18:31] but could be just "libmariadb-dev-*" [19:18:33] stat1007 has that installed but doesn't get the .20 symlink. Alternatively i think python3-mysqldb would do it [19:21:28] ebernhardson: i tried installing python3-mysqldb but that did not change it, so i removed it again (with --purge) [19:21:36] verified the default-libmysqlclient-dev isn't enough on stat1007 [19:21:57] mutante: lets try that one more time without mysql installed to the virtualenv as well [19:22:01] sec i'll revert the last patch [19:22:11] it probably didn't look at the system version [19:23:31] !log ebernhardson@deploy1001 Started deploy [search/airflow@f3bad9d]: revert adding mysqlclient python package [19:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:13] !log ebernhardson@deploy1001 Finished deploy [search/airflow@f3bad9d]: revert adding mysqlclient python package (duration: 00m 42s) [19:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:25] mutante: verified undeployed, can try the deb again [19:27:55] !log an-airflow1001 - apt-get install python3-mysqldb - start airflow-webserver [19:27:56] mutante: hmm, the virtualenv isn't seeing the system package, lemme see why... 
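One way to check whether a virtualenv can see system-wide packages (such as the python3-mysqldb deb) is to read its `pyvenv.cfg`. The sketch below assumes an environment laid out like those created by Python's standard `venv` module, which records an `include-system-site-packages` key there. The log does not say how the search/airflow virtualenv is built, so treat this as illustrative only: environments created by other tools may record the setting differently. `venv_includes_system_site_packages` is a hypothetical helper name.

```python
import os
import sys


def venv_includes_system_site_packages(venv_dir):
    """Return the include-system-site-packages flag from pyvenv.cfg.

    Returns True or False when the flag is present, or None when
    venv_dir has no pyvenv.cfg or the file does not contain the key.
    """
    cfg_path = os.path.join(venv_dir, "pyvenv.cfg")
    if not os.path.isfile(cfg_path):
        return None
    with open(cfg_path) as handle:
        for line in handle:
            key, sep, value = line.partition("=")
            if sep and key.strip() == "include-system-site-packages":
                return value.strip().lower() == "true"
    return None


if __name__ == "__main__":
    # sys.prefix points at the virtualenv root when run inside one.
    print(venv_includes_system_site_packages(sys.prefix))
```

A result of False would be consistent with the symptom above: the deb is installed, but the interpreter inside the virtualenv never looks at the system site-packages directory.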
[19:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:06] 10Operations, 10Traffic: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) Please note that ganeti4002 and ganeti4003 are showing as 'staged' in netbox but not in puppetdb, and throwing report errors on https://netbox.wikimedia.org/extras/reports/puppetdb.PuppetDB/ "missin... [19:28:07] ebernhardson: ack [19:29:17] (03PS1) 10Dzahn: airflow: require_package python3-mysqldb [puppet] - 10https://gerrit.wikimedia.org/r/553397 (https://phabricator.wikimedia.org/T236180) [19:30:24] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) [19:30:41] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) changed the IP from 10.64.40.100 to 10.64.40.118 [19:30:57] (03PS1) 10Jgreen: change frdb1003's IP from 10.64.40.100 to 10.64.40.118 [dns] - 10https://gerrit.wikimedia.org/r/553399 (https://phabricator.wikimedia.org/T239139) [19:31:25] !log ebernhardson@deploy1001 Started deploy [search/airflow@45b7790]: Allow airflow virtualenv to import system site packages to facilitate libmysqlclient [19:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:00] (03CR) 10Jgreen: [C: 03+2] change frdb1003's IP from 10.64.40.100 to 10.64.40.118 [dns] - 10https://gerrit.wikimedia.org/r/553399 (https://phabricator.wikimedia.org/T239139) (owner: 10Jgreen) [19:32:10] !log ebernhardson@deploy1001 Finished deploy [search/airflow@45b7790]: Allow airflow virtualenv to import system site packages to facilitate libmysqlclient (duration: 00m 45s) [19:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool 
db1080 after schema change', diff saved to https://phabricator.wikimedia.org/P9772 and previous config saved to /var/cache/conftool/dbconfig/20191127-193227-marostegui.json [19:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:45] closer, (_mysql_exceptions.OperationalError) (1698, "Access denied for user 'search_airflow'@'2620:0:861:106:10:64:36:119'") [19:33:17] ebernhardson: alright! i'll make the package install official [19:33:37] ebernhardson: afraid that step needs DBA [19:33:38] ahh, it has the wrong hostname. That ip is itself, it should have an-coord1001 iirc. [19:33:42] or not :) [19:34:00] mutante: user and db already created, or at least so i was told :) [19:34:03] (03CR) 10Dzahn: [C: 03+2] airflow: require_package python3-mysqldb [puppet] - 10https://gerrit.wikimedia.org/r/553397 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [19:34:28] alright [19:35:26] ebernhardson: not dbproxy* ? [19:35:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 for schema change', diff saved to https://phabricator.wikimedia.org/P9773 and previous config saved to /var/cache/conftool/dbconfig/20191127-193528-marostegui.json [19:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:58] mutante: hmm, maybe? I'm not famliar with the proxies, not sure if they are available on the analytics mysql db [19:36:28] i dont know about analytics, i just know non-analytics stuff uses them nowadays instead of directly connecting to a db* [19:37:03] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: (ESCALATED Need By: 11/26/19) rack/setup/install frdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T239139 (10Jgreen) [19:37:08] marostegui: are there dbproxies for analytics db ^? [19:37:48] mutante: for m5? no [19:37:52] mutante: also i can't actually look in the config file, which host does sql_alchemy_conn in /etc/airflow/airflow.cfg say? 
It should be an-coord1001.eqiad.wmnet if i'm following puppet correctly [19:39:09] oh, actually i'm probably misreading the mysql connect error, its not saying it was connecting to that ip, mariadb uses name+ip as a username [19:39:26] ok this will probably require following up on the ticket to figure out why the mysql credentials aren't right [19:39:31] mutante: there are no proxies for either m5 or m4 [19:39:52] marostegui: ok, thank you! [19:39:54] ebernhardson: 372 broker_url = sqla+mysql://airflow:airflow@localhost:3306/airflow [19:39:57] ? [19:40:23] result_backend = db+mysql://airflow:airflow@localhost:3306/airflow [19:40:26] mutante: nope, thats unused. sql_alchemy_conn = [19:40:42] mutante: both unused, i guess i could trim the defaults from the config file for pieces we don't use [19:40:50] ebernhardson: yes, that is @an-coord1001.eqiad.wmnet/search_airflow [19:41:05] user name and db name both search_airflow [19:41:27] needs GRANT on an-coord1001 for connection from that IP (v6?)? [19:41:34] mutante: likely yes [19:44:55] there is modules/profile/manifests/mariabdb/grants/ but not sure if an-coord is in that. i see only cloudinfra, core and production [19:47:11] mutante: if it was it would have been around nov 4, 2am. Might be manual, not seeing a patch at same time. 
Thats when elu.key posted the user created: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/544989/ [19:47:28] yea, i guess this might be manual [19:47:47] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:52:22] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received daniel_zahn https://phabricator.wikimedia.org/T239344 https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:54:03] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:08:46] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2215.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911272008_dzahn_166465_mw22... [20:13:40] (03PS1) 10Ssingh: Replace the string "CAIDA" with "IODA" to maintain consistency [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/553407 [20:20:46] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2224.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911272020_dzahn_168724_mw22... 
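On the "Access denied" error from earlier in this exchange: the message names the account as 'user'@'host', where the host part is the address the client connected from, not the server it connected to. That is why the IP in the error turned out to be an-airflow1001 itself. A minimal sketch that splits such a message into its two parts, using only the error text quoted in the channel (`parse_access_denied` is a hypothetical helper, not part of any MySQL client library):

```python
import re

# Matches the 'user'@'host' pair in a MySQL/MariaDB access-denied message.
ACCESS_DENIED = re.compile(
    r"Access denied for user '(?P<user>[^']*)'@'(?P<host>[^']*)'"
)


def parse_access_denied(message):
    """Return (user, client_host) from an access-denied message, or None."""
    match = ACCESS_DENIED.search(message)
    if match is None:
        return None
    return match.group("user"), match.group("host")


if __name__ == "__main__":
    error = (
        "(_mysql_exceptions.OperationalError) (1698, \"Access denied for user "
        "'search_airflow'@'2620:0:861:106:10:64:36:119'\")"
    )
    print(parse_access_denied(error))
```

The extracted pair is what a grant on the database server would have to match, which fits the conclusion above that the grant on an-coord1001 needs to cover connections from that (IPv6) client address.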
[20:21:14] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10Bstorm) [20:21:59] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10Dzahn) [20:24:37] (03PS3) 10Dzahn: airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) [20:25:11] (03CR) 10jerkins-bot: [V: 04-1] airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [20:29:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:03] (03PS4) 10Dzahn: airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) [20:32:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:25] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Fuzzy) 05Open→03Resolved [20:40:23] (03CR) 10Zoranzoki21: "BTW, this is empty patchset?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/553231 (owner: 10Krinkle) [20:40:38] (03PS1) 10Dzahn: airflow: remove config settings for Celery Executor and Flower [puppet] - 10https://gerrit.wikimedia.org/r/553413 [20:40:57] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:16] (03CR) 10Dzahn: "can't compile yet because host is new and facts need syncing" [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [20:51:42] (03PS2) 10Dzahn: otrs: add envoy for TLS termination behind ATS [puppet] - 10https://gerrit.wikimedia.org/r/552947 [20:57:19] PROBLEM - Check systemd state on an-tool1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:51] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:00:52] ACKNOWLEDGEMENT - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T239365 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:00:55] 10Operations, 10ops-eqiad: Degraded RAID on an-worker1089 - https://phabricator.wikimedia.org/T239365 (10ops-monitoring-bot) [21:09:26] 10Operations, 10DBA, 10MediaWiki-General: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (10daniel) [21:10:40] (03CR) 10CDanis: [C: 03+1] prometheus: alert on exporter's 'up' metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553335 (https://phabricator.wikimedia.org/T187708) (owner: 10Filippo Giunchedi) [21:10:53] !log restarting acme-chief service on acmechief1001 (daemon appears to be stuck on a lock and nonfunctional for days...) [21:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:05] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2215.codfw.wmnet'] ` and were **ALL** successful. 
[21:19:21] (03PS3) 10Dzahn: otrs: add envoy for TLS termination behind ATS [puppet] - 10https://gerrit.wikimedia.org/r/552947 [21:21:15] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10Andrew) [21:23:10] (03PS1) 10Bstorm: toolforge-k8s: simplify calico upgrades and distribute calicoctl [puppet] - 10https://gerrit.wikimedia.org/r/553418 [21:26:40] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2224.codfw.wmnet'] ` and were **ALL** successful. [21:27:05] (03PS2) 10Bstorm: toolforge-k8s: simplify calico upgrades and distribute calicoctl [puppet] - 10https://gerrit.wikimedia.org/r/553418 [21:27:39] (03CR) 10Dzahn: [C: 03+2] Gerrit: Make 'eclipse' and 'elegant' themes colorblind-friendly [puppet] - 10https://gerrit.wikimedia.org/r/536687 (https://phabricator.wikimedia.org/T232893) (owner: 10Krinkle) [21:30:04] Re-deploying temp 500 limit fix for T234450 to wmf.5 (already on wmf.8) [21:30:58] !log mw2215 rebooting [21:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:57] RECOVERY - HTTPS non-canonical-redirect-2 on ncredir1001 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 581285 seconds left:Certificate *.wikimania.com valid until 2020-01-26 14:00:33 +0000 (expires in 59 days) https://wikitech.wikimedia.org/wiki/Ncredir [21:31:57] RECOVERY - HTTPS non-canonical-redirect-2 on ncredir1002 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 581285 seconds left:Certificate *.wikimania.com valid until 2020-01-26 14:00:33 +0000 (expires in 59 days) https://wikitech.wikimedia.org/wiki/Ncredir [21:31:57] RECOVERY - HTTPS non-canonical-redirect-2 on ncredir2002 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 581285 seconds left:Certificate 
*.wikimania.com valid until 2020-01-26 14:00:33 +0000 (expires in 59 days) https://wikitech.wikimedia.org/wiki/Ncredir [21:31:57] RECOVERY - HTTPS non-canonical-redirect-2 on ncredir2001 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 581285 seconds left:Certificate *.wikimania.com valid until 2020-01-26 14:00:33 +0000 (expires in 59 days) https://wikitech.wikimedia.org/wiki/Ncredir [21:32:57] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2225.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911272132_dzahn_183058_mw22... [21:34:26] (03PS3) 10Bstorm: toolforge-k8s: simplify calico upgrades and distribute calicoctl [puppet] - 10https://gerrit.wikimedia.org/r/553418 [21:34:50] mutante: just got this on a scap sync-file: "mw2215.codfw.wmnet returned [255]: ssh: connect to host mw2215.codfw.wmnet port 22: Connection timed out". Do I need to re-run? [21:35:46] sbassett: i am running "scap pull" on it right now. 
that should do it [21:35:58] mutante: ok, tx [21:35:59] !log mw2215 scap pull [21:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:15] it's because we are reinstalling appservers [21:36:35] 21:35:32 Copying from deployment.codfw.wmnet to mw2215.codfw.wmnet [21:36:40] 21:36:24 Finished rsync common (duration: 00m 51s) [21:37:10] (03PS4) 10Bstorm: toolforge-k8s: simplify calico upgrades and distribute calicoctl [puppet] - 10https://gerrit.wikimedia.org/r/553418 [21:38:11] (03PS5) 10Bstorm: toolforge-k8s: simplify calico upgrades and distribute calicoctl [puppet] - 10https://gerrit.wikimedia.org/r/553418 [21:40:04] (03CR) 10Bstorm: "There, that does what I want: no changes to the calico file until hiera changes: https://puppet-compiler.wmflabs.org/compiler1001/19672/to" [puppet] - 10https://gerrit.wikimedia.org/r/553418 (owner: 10Bstorm) [21:45:08] (03CR) 10Bstorm: "The fix for https://github.com/projectcalico/calico/issues/2322 is backported to 3.8.1, which would be the first phase of calico upgrades." 
[puppet] - 10https://gerrit.wikimedia.org/r/553418 (owner: 10Bstorm) [21:47:15] (03CR) 10Bstorm: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/553418 (owner: 10Bstorm) [21:51:42] PROBLEM - Wikitech and wt-static content in sync on labweb1001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (201440s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:51:42] PROBLEM - Wikitech and wt-static content in sync on labweb1002 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (201440s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:54:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:20] 10Operations, 10ops-eqiad: Degraded RAID on an-worker1089 - https://phabricator.wikimedia.org/T239365 (10wiki_willy) a:03Jclark-ctr @Jclark-ctr - looks like the system is under warranty until October 2021, so we'll be able to RMA it. Thanks, Willy [21:57:43] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2217.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911272155_dzahn_187651_mw22... 
[21:59:57] (03CR) 10Cwhite: [C: 03+2] logstash,hiera: add logstash performance tunables and tune batch size [puppet] - 10https://gerrit.wikimedia.org/r/553361 (https://phabricator.wikimedia.org/T215904) (owner: 10Cwhite) [22:02:22] PROBLEM - Wikitech and wt-static content in sync on cloudweb2001-dev is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (202663s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [22:06:21] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10Dzahn) [22:10:13] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2226.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911272209_dzahn_191211_mw22... [22:14:16] PROBLEM - mediawiki-installation DSH group on cloudweb2001-dev is CRITICAL: Host cloudweb2001-dev is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:15:52] 10Operations, 10serviceops, 10HHVM, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10awight) [22:18:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:04] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:36] (03PS1) 10Dzahn: SSL: add certificate for OTRS/ticket.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/553421 [22:26:09] (03CR) 10jerkins-bot: [V: 04-1] SSL: add certificate for OTRS/ticket.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/553421 (owner: 10Dzahn) [22:26:24] (03PS2) 10Dzahn: SSL: add certificate for 
OTRS/ticket.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/553421 [22:27:17] (03CR) 10Dzahn: "Added certificate in private repo." [puppet] - 10https://gerrit.wikimedia.org/r/553421 (owner: 10Dzahn) [22:29:15] (03PS1) 10Dzahn: add fake SSL key for ticket.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/553422 [22:29:34] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake SSL key for ticket.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/553422 (owner: 10Dzahn) [22:31:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:48] (03CR) 10Dzahn: "certificate for this added in private repo." [puppet] - 10https://gerrit.wikimedia.org/r/552947 (owner: 10Dzahn) [22:33:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:01] (03PS1) 10Dzahn: varnish/ATS: rename director for OTRS from mendelevium to otrs [puppet] - 10https://gerrit.wikimedia.org/r/553423 [22:37:42] (03PS4) 10Dzahn: otrs: add envoy for TLS termination behind ATS [puppet] - 10https://gerrit.wikimedia.org/r/552947 (https://phabricator.wikimedia.org/T210411) [22:37:51] (03PS1) 10Dzahn: ATS: switch OTRS to use TLS and discovery record [puppet] - 10https://gerrit.wikimedia.org/r/553424 (https://phabricator.wikimedia.org/T210411) [22:39:18] (03CR) 10Dzahn: "to use the config from Change-Id: Ib7659491090762 hence port 1443 because of jessie and envoy and privileged ports" [puppet] - 10https://gerrit.wikimedia.org/r/553424 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [22:43:22] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2225.codfw.wmnet'] ` and were **ALL** successful. 
[22:47:25] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2243.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911272246_dzahn_198389_mw22... [23:03:38] (03PS12) 10Dzahn: rsync: readd incoming and outgoing chmod [puppet] - 10https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [23:04:02] (03CR) 10jerkins-bot: [V: 04-1] rsync: readd incoming and outgoing chmod [puppet] - 10https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [23:04:35] (03CR) 10Dzahn: [C: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [23:05:57] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2244.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911272305_dzahn_202800_mw22... [23:07:31] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2217.codfw.wmnet'] ` and were **ALL** successful. 
[23:09:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:02] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10Dzahn) [23:15:54] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2218.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911272314_dzahn_205154_mw22... [23:20:58] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2226.codfw.wmnet'] ` and were **ALL** successful. [23:21:50] (03CR) 10Urbanecm: [C: 03+1] Beta labs: Remove unused GrowthExperiments config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552501 (owner: 10Kosta Harlan) [23:27:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:18] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:46] PROBLEM - PHP opcache health on mw2215 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:46:14] RECOVERY - PHP opcache health on mw2215 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:53:24] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2243.codfw.wmnet'] ` and were **ALL** successful.