[00:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200312T0000). [00:00:24] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.23/extensions/WikimediaEditorTasks: Revert 'Fix revert counting for non-language-specific counters' (T247479) (duration: 01m 07s) [00:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:29] T247479: All edits in Wikidata are tagged as "suggested edits" - https://phabricator.wikimedia.org/T247479 [00:00:37] I just got through mutante [00:01:01] great, thank you [00:01:03] AmandaNP: great [00:01:05] bradv: better? [00:01:16] afterhours: good? [00:01:28] yay! [00:01:33] si :) [00:01:37] 10/10 service. Would recommend. [00:01:53] where can I leave a review on Yelp [00:02:05] for this excellent service [00:02:21] mholloway: lemme know when you're done [00:02:34] <3 [00:06:36] (03PS2) 10C. Scott Ananian: add hiera keys for parsoid-php on deployment-parsoid11 [puppet] - 10https://gerrit.wikimedia.org/r/576493 (https://phabricator.wikimedia.org/T246854) (owner: 10Dzahn) [00:06:53] (03CR) 10C. Scott Ananian: "Everything seems to be working in beta now, so it should be safe to merge back from Horizon." 
[puppet] - 10https://gerrit.wikimedia.org/r/576493 (https://phabricator.wikimedia.org/T246854) (owner: 10Dzahn) [00:09:09] ebernhardson: thanks, one more quick deploy change then i should be good for now [00:10:02] kk [00:12:02] (03PS1) 10Mholloway: Revert "Enabling depicts count" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579094 [00:12:25] (03CR) 10Mholloway: [V: 03+2 C: 03+2] Revert "Enabling depicts count" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579094 (owner: 10Mholloway) [00:13:56] 10Operations, 10Commons, 10Thumbor: Getting constant 429 for a thumbnail - https://phabricator.wikimedia.org/T247473 (10AntiCompositeNumber) Running the PDF through GhostScript locally generates an error on the second page: ` $ gs -sDEVICE=jpeg -dJPEG=90 -r150 -DBATCH -dNOPAUSE -dSAFER -sOutputFile=Mimořádné... [00:14:57] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable depicts counter due to code revert (T244974) (duration: 01m 07s) [00:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:02] T244974: Connect image tag contributions to Suggested edits profile stats - https://phabricator.wikimedia.org/T244974 [00:16:16] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable depicts counter due to code revert (T244974), take 2 (duration: 01m 07s) [00:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:36] (03PS1) 10Bstorm: dumps-distribution: move all traffic to labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/579095 (https://phabricator.wikimedia.org/T224583) [00:16:53] ebernhardson: ok, all yours [00:17:21] mholloway: thanks! 
[00:17:46] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10Bstorm) a:03Bstorm [00:18:46] 10Operations, 10Commons, 10Thumbor: Getting constant 429 for a thumbnail - https://phabricator.wikimedia.org/T247473 (10AntiCompositeNumber) ` $ gs -sDEVICE=jpeg -dJPEG=90 -sOutputFile=%stdout -dFirstPage=2 -dLastPage=2 -r150 -dBATCH -dNOPAUSE -dSAFER -q -f Mimořádné_opatření_-_zákaz_vývozu_desinfekce_rukou.... [00:20:08] (03CR) 10Bstorm: "It seems that when I change the other values, it just breaks various jobs. So I'm just aiming to change the active Cloud VPS NFS server. " [puppet] - 10https://gerrit.wikimedia.org/r/579095 (https://phabricator.wikimedia.org/T224583) (owner: 10Bstorm) [00:20:50] (03PS2) 10C. Scott Ananian: Update linter whitelist w/ parsoid11's IP address [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579018 [00:21:54] (03CR) 10Dzahn: [C: 03+2] switch webproxy CNAMEs to new install servers [dns] - 10https://gerrit.wikimedia.org/r/569680 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [00:22:00] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10Bstorm) @elukey Are there any special considerations for Kerberos/HDFS/Hadoop stuff if I aim to upgrade this... [00:23:22] !log switching webproxy.eqiad.wmnet / webproxy.codfw.wmnet to install[12]003 (squids on buster) [00:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:29] 10Operations, 10Commons, 10Thumbor: Thumbnailing page 2 of c:File:Mimořádné opatření - zákaz vývozu desinfekce rukou.pdf generates a non-fatal Ghostscript error that is piped to imagemagick - https://phabricator.wikimedia.org/T247473 (10AntiCompositeNumber) [00:23:55] (03PS3) 10C. 
Scott Ananian: Update linter whitelist w/ parsoid11's IP address [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579018 (https://phabricator.wikimedia.org/T246833) [00:24:22] mholloway: BTW, please don't ever V+2 patches; it breaks CI for everyone, and bypasses all the security and CI checks. [00:24:34] James_F: to fix the last db expression in prod (group1) I think buildDBLists.php might be a good place to handle that. Especilly with the composer checkclean step in place (we wouldn't even need a separate DBlist structure test to verify that it is up to date). [00:24:59] I'm not sure how to store it though. What do you think about e.g. renaming group1.dblist to group1-expression.dblist or some such? [00:25:03] open to ideas :) [00:25:08] Krinkle: Hmm. [00:25:27] Krinkle: We could just move the expression into the YAML files? [00:25:35] Right now we don't have negation, but we could add that. [00:25:59] aye, yeah, that could work. [00:26:07] I see that we do have yaml files to represent the tag itself [00:26:10] Yes. [00:26:44] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/578653 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [00:26:45] Or we could set group0 and group2 explcitly and have group1 set in base.yaml. [00:29:27] (03PS4) 10Jforrester: Update linter whitelist w/ parsoid11's IP address [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579018 (https://phabricator.wikimedia.org/T246833) (owner: 10C. Scott Ananian) [00:29:44] Hm.. right. I wodner if we could set them all explicitly actually, with inheritance that is. [00:29:45] (03CR) 10Jforrester: [C: 03+1] "Good to deploy whenever." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579018 (https://phabricator.wikimedia.org/T246833) (owner: 10C. Scott Ananian) [00:30:10] (03PS3) 10Jforrester: parsoidphp is dead, long live parsoid (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579021 (owner: 10C. 
Scott Ananian) [00:30:25] Although I guess that doesn't work right now, given there isn't a way to shadow a wikiTag [00:30:37] like if wikipedia.yaml is in group2, I can't exclude hewiki [00:30:50] (03Abandoned) 10Jforrester: ProductionServices: Stop defining the 'parsoid' JS service, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579020 (https://phabricator.wikimedia.org/T229015) (owner: 10Jforrester) [00:31:01] which I think is a sane design for inheritence, that's good. Just makes the expression part harder [00:32:49] James_F: OK, sorry about that. [00:42:56] (03PS1) 10Krinkle: test: Fixup swapped test case names in dblistTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579101 [00:42:58] (03PS1) 10Krinkle: tests: Remove excemption of group2 in "web-used expression lists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579102 [00:43:00] (03PS1) 10Krinkle: multiversion: Simplify path inclusions in buildDBLists.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579103 [00:43:02] (03PS1) 10Krinkle: multiversion: Structure buildDBLists.php to prevent labs info in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579104 [00:43:04] (03PS1) 10Krinkle: multiversion: Let buildDBLists.php expand expression dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579105 (https://phabricator.wikimedia.org/T169821) [00:43:59] (03CR) 10jerkins-bot: [V: 04-1] test: Fixup swapped test case names in dblistTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579101 (owner: 10Krinkle) [00:44:09] (03PS2) 10Krinkle: test: Fixup swapped test case names in dblistTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579101 [00:44:11] (03PS2) 10Krinkle: tests: Remove excemption of group2 in "web-used expression lists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579102 [00:44:13] (03PS2) 10Krinkle: multiversion: Simplify path inclusions in buildDBLists.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579103 
[00:44:15] (03PS2) 10Krinkle: multiversion: Structure buildDBLists.php to prevent labs info in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579104 [00:44:17] (03PS2) 10Krinkle: multiversion: Let buildDBLists.php expand expression dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579105 (https://phabricator.wikimedia.org/T169821) [00:44:41] (03CR) 10Jforrester: [C: 03+1] "Ha, oops." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579101 (owner: 10Krinkle) [00:44:59] (03CR) 10Jforrester: [C: 03+1] tests: Remove excemption of group2 in "web-used expression lists" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579102 (owner: 10Krinkle) [00:45:14] (03PS3) 10Krinkle: tests: Fixup swapped test case names in dblistTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579101 [00:45:17] (03CR) 10Krinkle: [C: 03+2] tests: Fixup swapped test case names in dblistTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579101 (owner: 10Krinkle) [00:45:51] (03PS3) 10Krinkle: tests: Remove exemption of group2 in "web-used expression lists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579102 [00:45:56] (03PS4) 10Krinkle: tests: Remove exemption of group2 in "web-used expression lists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579102 [00:46:02] (03CR) 10Krinkle: [C: 03+2] "Oops. 
Thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579102 (owner: 10Krinkle) [00:46:15] (03CR) 10Jforrester: [C: 03+1] multiversion: Simplify path inclusions in buildDBLists.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579103 (owner: 10Krinkle) [00:46:38] (03PS1) 10Dzahn: installserver: allow configuring squid as running/stopped in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/579106 [00:46:40] (03Merged) 10jenkins-bot: tests: Fixup swapped test case names in dblistTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579101 (owner: 10Krinkle) [00:47:10] (03Merged) 10jenkins-bot: tests: Remove exemption of group2 in "web-used expression lists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579102 (owner: 10Krinkle) [00:47:24] (03CR) 10Krinkle: [C: 03+1] Drop the 'pp_stage0' and 'pp_stage1' dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579083 (owner: 10Jforrester) [00:48:42] (03CR) 10Jforrester: "Maybe call the "expressions" foo.dbexpression?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579105 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [00:49:11] (03CR) 10Dzahn: "watched logs on all 4 servers and confirmed incoming requests on new servers.. while old ones stopped having traffic" [dns] - 10https://gerrit.wikimedia.org/r/569680 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [00:49:14] (03CR) 10Krinkle: [C: 04-1] "Ah perfect. yes, that avoids any clashes or misuse. Would make my patch simpler as well." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/579105 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [00:49:55] (03CR) 10jerkins-bot: [V: 04-1] installserver: allow configuring squid as running/stopped in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/579106 (owner: 10Dzahn) [00:51:20] James_F: per https://phabricator.wikimedia.org/T169821#5962310 - looks like group1.dblist currently costs as much as 10 dblists to process [00:51:28] 0.5ms vs 0.05ms [00:51:36] probably more, as its part of the avg [00:52:17] Ouch. [00:52:45] !log ebernhardson@deploy1001 Synchronized php-1.35.0-wmf.23/extensions/CirrusSearch/includes/Maintenance/Reindexer.php: (no justification provided) (duration: 01m 12s) [00:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:15] !log wmf.23 cirrussearch: wait around for counts to match before giving up [00:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:46] (03PS2) 10Dzahn: installserver: allow configuring squid as running/stopped in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/579106 (https://phabricator.wikimedia.org/T224576) [00:56:22] !log ebernhardson@deploy1001 Synchronized php-1.35.0-wmf.22/extensions/CirrusSearch/includes/Maintenance/Reindexer.php: wait around for counts to match up in reindexer before giving up (duration: 01m 08s) [00:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:39] (03CR) 10jerkins-bot: [V: 04-1] installserver: allow configuring squid as running/stopped in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/579106 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [01:00:03] (03PS3) 10Dzahn: installserver: allow configuring squid as absent in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/579106 (https://phabricator.wikimedia.org/T224576) [01:00:57] (03CR) 10Krinkle: multiversion: Simplify path inclusions in buildDBLists.php (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/579103 (owner: 10Krinkle) [01:01:06] (03PS3) 10Krinkle: multiversion: Simplify path inclusions in buildDBLists.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579103 [01:01:08] (03PS3) 10Krinkle: multiversion: Structure buildDBLists.php to prevent labs info in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579104 [01:01:10] (03PS3) 10Krinkle: multiversion: Let buildDBLists.php expand expression dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579105 (https://phabricator.wikimedia.org/T169821) [01:01:29] (03CR) 10jerkins-bot: [V: 04-1] installserver: allow configuring squid as absent in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/579106 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [01:01:49] (03CR) 10Krinkle: multiversion: Simplify path inclusions in buildDBLists.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579103 (owner: 10Krinkle) [01:03:03] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@4e2ea09]: resolve deadlock in bulk_daemon [01:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:03:40] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/21391/" [puppet] - 10https://gerrit.wikimedia.org/r/579106 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [01:04:30] (03PS4) 10Dzahn: installserver: allow configuring squid as absent in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/579106 (https://phabricator.wikimedia.org/T224576) [01:13:08] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@4e2ea09]: resolve deadlock in bulk_daemon (duration: 10m 05s) [01:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:51:32] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.22/extensions/WikimediaEditorTasks: Revert 'Fix revert counting for non-language-specific counters' (T247479) (duration: 01m 08s) [01:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:38] T247479: All edits on Wikidata and Commons tagged as "suggested edits" between approx. 23:40 and 00:00 UTC on 2020-03-11 - https://phabricator.wikimedia.org/T247479 [02:03:52] Krinkle: Found the PHP74 blocker mistake. It's in WikimediaBadges. What do you notice about https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaBadges/+/master/extension.json#26 ? :-) [02:04:39] James_F: 😐 [02:04:44] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaBadges/+/547288/ [02:04:50] It's intentionally "empty". [02:05:10] https://starecat.com/content/wp-content/uploads/thats-not-how-this-works-thats-not-how-any-of-this-works.jpg [02:05:40] ... right [02:05:49] Can we pass null instead? [02:05:57] Or do we have to do something horrid? [02:06:15] Afaik there is no need for it to be set at all [02:06:31] Yeah, if default isn't set it just isn't set? [02:06:47] skinStyles and styles are both optional [02:06:56] a module can even be just a deprecated alias to another dependency [02:07:03] they can evaluate to an empty array indeed [02:07:06] yeah should be fine [02:07:21] > self::tryForKey( $this->skinScripts, $context->getSkin(), 'default' ) [02:07:35] returns an array, which can indeed be empty. fallback=default does not have to be found [02:07:44] and I'm sure we omit it in various cases already [02:07:57] Krinkle: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaBadges/+/579118 [02:08:12] * Krinkle signs off for the day [02:08:13] LGTM [02:08:21] LGTM == C+2? ;-) [02:08:30] Thanks! 
[02:08:36] o/ [03:43:29] (03PS1) 10Andrew Bogott: Designate: add wmcs-populate-domains script [puppet] - 10https://gerrit.wikimedia.org/r/579121 (https://phabricator.wikimedia.org/T247374) [03:44:21] (03CR) 10jerkins-bot: [V: 04-1] Designate: add wmcs-populate-domains script [puppet] - 10https://gerrit.wikimedia.org/r/579121 (https://phabricator.wikimedia.org/T247374) (owner: 10Andrew Bogott) [03:46:42] (03PS2) 10Andrew Bogott: Designate: add wmcs-populate-domains script [puppet] - 10https://gerrit.wikimedia.org/r/579121 (https://phabricator.wikimedia.org/T247374) [04:09:44] 10Operations, 10Traffic, 10Security: HTTP MediaWiki API GET requests to Wikimedia wikis should not be redirected to HTTPS when they have a session cookie or Authorization header - https://phabricator.wikimedia.org/T247490 (10Tgr) [04:15:24] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-anaphora] - 10https://gerrit.wikimedia.org/r/578705 (https://phabricator.wikimedia.org/T234181) (owner: 10KartikMistry) [04:24:58] PROBLEM - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 146.4 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [04:25:12] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 133.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [04:28:55] (03PS3) 10KartikMistry: cxserver: Add sectionmapping config for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/578935 (https://phabricator.wikimedia.org/T243430) [04:31:38] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is 
CRITICAL: 133.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [04:36:39] (03PS4) 10KartikMistry: cxserver: Add sectionmapping config for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/578935 (https://phabricator.wikimedia.org/T243430) [05:15:35] 10Operations, 10procurement: Request SRE assistance in specs for analysis server - https://phabricator.wikimedia.org/T247492 (10Dsharpe) [05:16:22] 10Operations, 10procurement: Request SRE assistance in specs for Security team's proposed analysis server - https://phabricator.wikimedia.org/T247492 (10Dsharpe) [05:18:22] 10Operations, 10procurement: Request SRE assistance: specs for Security team's proposed analysis server - https://phabricator.wikimedia.org/T247492 (10Dsharpe) [05:29:54] 10Operations: Request SRE assistance: specs for Security team's proposed analysis server - https://phabricator.wikimedia.org/T247492 (10Dsharpe) [05:54:42] 10Operations, 10WMNO-Sámi, 10Wikimedia-Mailing-lists: Create mailing list for WMNO Sámi project - https://phabricator.wikimedia.org/T182093 (10Vgutierrez) p:05Triage→03Medium a:03Vgutierrez [06:00:12] * kart_ is updating cxserver now.. [06:01:31] (03CR) 10KartikMistry: [C: 03+2] cxserver: Add sectionmapping config for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/578935 (https://phabricator.wikimedia.org/T243430) (owner: 10KartikMistry) [06:01:51] (03Merged) 10jenkins-bot: cxserver: Add sectionmapping config for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/578935 (https://phabricator.wikimedia.org/T243430) (owner: 10KartikMistry) [06:03:40] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . 
[06:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:22] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . [06:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:01] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [06:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:02] !log Updated cxserver to 2020-03-12-041806-production and added sectionmapping db config (T246316, T243430, T202276) [06:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:09] T202276: Package apertium-pol-szl (Polish-Silesian) - https://phabricator.wikimedia.org/T202276 [06:14:09] T246316: Filter out unsupported languages for cxserver - https://phabricator.wikimedia.org/T246316 [06:14:09] T243430: Basic service for mapping sections - https://phabricator.wikimedia.org/T243430 [06:16:04] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [06:18:48] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [06:19:24] OK. Seems problematic. 
[06:23:50] RECOVERY - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [06:24:48] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [06:30:33] 10Operations, 10WMNO-Sámi, 10Wikimedia-Mailing-lists: Create mailing list for WMNO Sámi project - https://phabricator.wikimedia.org/T182093 (10Vgutierrez) 05Open→03Resolved @jhsoby-WMNO the list has been created and you should have received an email with the details. [06:43:36] Seems I forgot to update charts, akosiaris [06:48:57] (03PS1) 10KartikMistry: Added cxserver charts 0.0.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/579128 [06:59:20] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [07:01:32] RECOVERY - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [07:15:57] (03PS1) 10Elukey: Replace install[12]002 with install[12]003 in firewall rules [homer/public] - 10https://gerrit.wikimedia.org/r/579131 
[07:20:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] Added cxserver charts 0.0.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/579128 (owner: 10KartikMistry) [07:21:14] kart_: yeah, we have an action item to fix that as well so people don't have to do that manually [07:21:17] (03Merged) 10jenkins-bot: Added cxserver charts 0.0.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/579128 (owner: 10KartikMistry) [07:21:25] kart_: but you should be good to deploy it now [07:21:36] akosiaris: cool. Thanks! [07:22:51] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [07:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:09] (03CR) 10Elukey: [C: 04-1] "Scope is too broad, I'll split the changes." [homer/public] - 10https://gerrit.wikimedia.org/r/579131 (owner: 10Elukey) [07:23:17] (03Abandoned) 10Elukey: Replace install[12]002 with install[12]003 in firewall rules [homer/public] - 10https://gerrit.wikimedia.org/r/579131 (owner: 10Elukey) [07:24:38] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . [07:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:45] (03CR) 10Jcrespo: "Hi, Guozr.im," [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [07:25:28] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [07:26:10] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [07:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:27] (03PS1) 10Elukey: Update analytics-in[4,6] filter rules with new webproxy IPs [homer/public] - 10https://gerrit.wikimedia.org/r/579140 [07:27:37] akosiaris: done! 
[07:28:02] !log Updated cxserver charts to 0.0.13 [07:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:08] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [07:28:35] (03CR) 10Giuseppe Lavagetto: "we already have such a script, IIRC?" [puppet] - 10https://gerrit.wikimedia.org/r/578988 (owner: 10Jbond) [07:28:35] kart_: yay [07:28:58] akosiaris: had to Google to find 'helm' :) [07:29:30] but, yeah. I should keep re-reading README.md [07:37:32] (03PS2) 10Elukey: Update analytics-in[4,6] filter rules with new webproxy IPs [homer/public] - 10https://gerrit.wikimedia.org/r/579140 [07:40:17] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10elukey) @Bstorm the only thing to keep in mind is that Buster ships with java 11 only, and hadoop for the mom... [07:44:26] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10MoritzMuehlenhoff) profile::java::analytics uses the Java 8 forward port on Buster, so that part should be fi... [07:49:15] (03CR) 10Muehlenhoff: "Why is that needed? The squid instances running are not doing any harm and the install1002/2002 servers will be decommed soon anyway?" [puppet] - 10https://gerrit.wikimedia.org/r/579106 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [07:56:00] (03PS3) 10Elukey: Update analytics-in[4,6] filter rules with new webproxy IPs [homer/public] - 10https://gerrit.wikimedia.org/r/579140 [07:56:42] akosiaris: o/ available for a quick review? --^ [07:58:36] elukey: should you be deleting the old ones in the same patch? 
[07:59:09] don't you want to do it in 2 distinct patches spaced at least a few minutes (if not hours) apart? [07:59:53] <3 for not forgetting IPv6 btw [08:06:21] akosiaris: I already have alarms related to webproxy failures for 1002/2002 in the analytics vlan, this is why I am swapping :) [08:07:14] but I can add/remove in case, np [08:07:18] huh? [08:07:26] * akosiaris looking [08:09:02] while you're at it, there's a similar change upcoming for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569684/, does that also need changes on the routers to allow it? [08:09:11] oh I know why [08:09:19] webproxy.eqiad.wmnet is an alias for install1003.wikimedia.org. [08:09:26] ok, that could have been coordinated better [08:09:26] yeah [08:09:39] but in that case, you are correct ofc. Lemme +1 [08:09:53] (my "yeah" was not for the coordination but for the alias) [08:09:54] :) [08:09:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update analytics-in[4,6] filter rules with new webproxy IPs [homer/public] - 10https://gerrit.wikimedia.org/r/579140 (owner: 10Elukey) [08:10:04] moritzm: already took care of those as well in --^ [08:10:31] akosiaris: there is also some follow up to do, there are some rules in other filters to block traffic from the outside to install public ips [08:10:39] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: debian-glue-backports not enabling backports on buster - https://phabricator.wikimedia.org/T247316 (10hashar) @ema I would advise to NOT rely on backports. The suite gets removed before the release reaches EOL. That causes some packages to no... 
[08:10:41] ack, thx
[08:11:21] * akosiaris looking at the dhcpd change as well
[08:12:23] (03CR) 10Elukey: [C: 03+2] Update analytics-in[4,6] filter rules with new webproxy IPs [homer/public] - 10https://gerrit.wikimedia.org/r/579140 (owner: 10Elukey)
[08:12:46] elukey: yes you are right, it will require changes for tftp as well
[08:12:49] !log push new install/webproxy terms for analytics-in4/6 to cr1/cr2-eqiad
[08:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:10] term tftp {
[08:13:10] from {
[08:13:11] destination-address {
[08:13:11] /* install2002 */
[08:13:11] 208.80.153.53/32;
[08:13:12] /* install1002 */
[08:13:13] 208.80.154.22/32;
[08:13:14] }
[08:13:15] and all
[08:13:46] but that can indeed be done in 2 steps and verify before merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569684/
[08:17:42] ack, thx
[08:18:47] (03PS1) 10Addshore: Write to the new terms store up to Q 87.5 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579213 (https://phabricator.wikimedia.org/T219123)
[08:20:25] jouncebot: now
[08:20:25] No deployments scheduled for the next 2 hour(s) and 39 minute(s)
[08:20:51] If noone objects I'm going to add another 500k wikidata items to the "write" secion of our new term store (so near the end)!
[08:21:13] (03CR) 10Addshore: [C: 03+2] Write to the new terms store up to Q 87.5 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579213 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore)
[08:22:12] (03Merged) 10jenkins-bot: Write to the new terms store up to Q 87.5 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579213 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore)
[08:24:09] Krinkle: I see undeployed config things?
[08:24:24] but maybe only tests: ..
[08:24:28] * addshore digs
[08:24:58] I guess they are safe!
[08:26:33] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Write to new term store up to Q87.5 million, was 87 (T219123) (duration: 01m 12s) [08:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:38] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [08:27:45] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Write to new term store up to Q87.5 million, was 87 (T219123) cache bust (duration: 01m 08s) [08:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Sigh, my bad, sorry. Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/579016 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [08:36:02] !log start "rebuild" of Q87 -> 87.5 million for T219123 [08:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:07] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [08:36:52] (03CR) 10Ema: "recheck" [software/atskafka] - 10https://gerrit.wikimedia.org/r/578532 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [08:44:15] <_joe_> addshore: can I pick up deploying things? [08:44:25] yup! fire away! 
[08:45:26] (03PS6) 10Giuseppe Lavagetto: Switch ores to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578496 (https://phabricator.wikimedia.org/T244843) [08:45:50] (03PS1) 10Muehlenhoff: Remove some stray Puppet references to cp1008, cp1071-cp1074 [puppet] - 10https://gerrit.wikimedia.org/r/579223 (https://phabricator.wikimedia.org/T229586) [08:46:56] <_joe_> akosiaris: I'm moving ores, then restbase to go through envoy [08:47:06] <_joe_> restbase will also add TLS IIRC [08:47:25] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Switch ores to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578496 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [08:47:41] _joe_: https://phabricator.wikimedia.org/T238658#5961899 btw [08:48:03] I 'll try and dig into watch going on there [08:48:23] (03Merged) 10jenkins-bot: Switch ores to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578496 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [08:48:35] <_joe_> akosiaris: possibly is some unconfigured timeout in envoy [08:48:49] <_joe_> I kinda remember 15s being some timeout [08:49:17] <_joe_> I thought I added support for a timeout [08:49:27] <_joe_> in the docker image and the chart [08:50:44] (03CR) 10Ema: [C: 03+1] Remove some stray Puppet references to cp1008, cp1071-cp1074 [puppet] - 10https://gerrit.wikimedia.org/r/579223 (https://phabricator.wikimedia.org/T229586) (owner: 10Muehlenhoff) [08:53:28] (03CR) 10Muehlenhoff: [C: 03+2] Remove some stray Puppet references to cp1008, cp1071-cp1074 [puppet] - 10https://gerrit.wikimedia.org/r/579223 (https://phabricator.wikimedia.org/T229586) (owner: 10Muehlenhoff) [08:54:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10MoritzMuehlenhoff) [08:55:52] !log oblivian@deploy1001 Synchronized 
wmf-config/ProductionServices.php: switch ores to use envoy (duration: 01m 08s) [08:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:49] (03PS1) 10Muehlenhoff: Remove prod DNS entries for cp1008, cp1071-cp1074 [dns] - 10https://gerrit.wikimedia.org/r/579226 (https://phabricator.wikimedia.org/T229586) [09:02:05] _joe_: also, let me know when I can switchover eventgate-main [09:04:24] (03CR) 10Ema: [C: 03+1] Remove prod DNS entries for cp1008, cp1071-cp1074 [dns] - 10https://gerrit.wikimedia.org/r/579226 (https://phabricator.wikimedia.org/T229586) (owner: 10Muehlenhoff) [09:06:34] (03CR) 10Muehlenhoff: [C: 03+2] Remove prod DNS entries for cp1008, cp1071-cp1074 [dns] - 10https://gerrit.wikimedia.org/r/579226 (https://phabricator.wikimedia.org/T229586) (owner: 10Muehlenhoff) [09:09:52] 10Operations, 10Analytics, 10Traffic: Test atskafka deployment - https://phabricator.wikimedia.org/T247497 (10ema) [09:09:58] 10Operations, 10Analytics, 10Traffic: Test atskafka deployment - https://phabricator.wikimedia.org/T247497 (10ema) p:05Triage→03Medium [09:10:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10MoritzMuehlenhoff) [09:11:34] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (10) node(s) change every puppet run: logstash2028.codfw.wmnet, logstash2027.codfw.wmnet, cescout1001.eqiad.wmnet, logstash2026.codfw.wmnet, logstash2029.codfw.wmnet, cloudvirt2003-dev.codfw.wmnet, ganeti1022.eqiad.wmnet, ganeti1020.eqiad.wmnet, ganeti1021.eqiad.wmnet, ganeti1019.eqiad.wmnet https://wikitech [09:11:34] ki/Puppet%23check_puppet_run_changes [09:11:39] (03PS1) 10Elukey: admin: simplify and document some analytics posix groups [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) [09:13:20] (03PS2) 10Elukey: admin: 
simplify and document some analytics posix groups [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) [09:14:21] (03PS4) 10Ema: Use json.Marshal [software/atskafka] - 10https://gerrit.wikimedia.org/r/578532 (https://phabricator.wikimedia.org/T237993) [09:14:23] (03PS3) 10Ema: Handle rdkafka statistics [software/atskafka] - 10https://gerrit.wikimedia.org/r/578549 (https://phabricator.wikimedia.org/T237993) [09:17:51] _joe_: https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/route/route_components.proto#envoy-api-field-route-routeaction-timeout [09:17:54] got it I think. [09:18:07] it's exactly what otto describes. Need to figure now how to set it [09:19:02] I like how the refs are for the API but not the config [09:23:42] <_joe_> akosiaris: it's in our production config [09:23:48] <_joe_> even in the container I hope [09:24:23] <_joe_> - name: UPSTREAM_TIMEOUT [09:24:25] <_joe_> value: {{ .Values.tls.upstream_timeout }} [09:24:32] <_joe_> yes we do [09:24:47] <_joe_> maybe otto didn't backport my changes adding that [09:28:31] (03CR) 10Vgutierrez: [C: 03+1] Use json.Marshal [software/atskafka] - 10https://gerrit.wikimedia.org/r/578532 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [09:34:12] (03PS1) 10Hashar: package_builder: do not set webproxy on WMCS [puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) [09:36:14] (03CR) 10Hashar: "Another way would be to simply check whether an http_proxy has been disabled instead of differentiating based on @realm." 
[puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) (owner: 10Hashar) [09:48:00] (03PS1) 10Alexandros Kosiaris: eventstreams/envoy: Disable upstream_timout [deployment-charts] - 10https://gerrit.wikimedia.org/r/579232 (https://phabricator.wikimedia.org/T238658) [09:48:06] (03PS1) 10Hashar: beta: pull out deployment-parsoid11 [puppet] - 10https://gerrit.wikimedia.org/r/579233 (https://phabricator.wikimedia.org/T246854) [09:49:19] (03PS1) 10Elukey: profile::kerberos: extend ticket lifetime from 10h to 48h [puppet] - 10https://gerrit.wikimedia.org/r/579234 [09:50:28] (03PS2) 10Alexandros Kosiaris: eventstreams/envoy: Disable upstream_timout [deployment-charts] - 10https://gerrit.wikimedia.org/r/579232 (https://phabricator.wikimedia.org/T238658) [09:50:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Great! Thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/579121 (https://phabricator.wikimedia.org/T247374) (owner: 10Andrew Bogott) [09:53:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams/envoy: Disable upstream_timout [deployment-charts] - 10https://gerrit.wikimedia.org/r/579232 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [09:58:11] (03PS2) 10Giuseppe Lavagetto: mediawiki::webserver: split tls proxy config out of profile [puppet] - 10https://gerrit.wikimedia.org/r/576852 [09:58:17] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [09:58:18] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [09:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:03] (03CR) 10Muehlenhoff: [C: 03+1] "One nit (feel free to ignore!), looks good to me. 
There's no added risk here, in case of a stolen laptop we'd actively remove the user and" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579234 (owner: 10Elukey) [10:00:56] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::webserver: split tls proxy config out of profile [puppet] - 10https://gerrit.wikimedia.org/r/576852 (owner: 10Giuseppe Lavagetto) [10:01:25] (03CR) 10Hashar: "That has fixed https://integration.wikimedia.org/ci/job/beta-scap-eqiad/" [puppet] - 10https://gerrit.wikimedia.org/r/579233 (https://phabricator.wikimedia.org/T246854) (owner: 10Hashar) [10:01:49] Error: UPGRADE FAILED: v1.Deployment.Spec: v1.DeploymentSpec.Template: v1.PodTemplateSpec.Spec: v1.PodSpec.Containers: []v1.Container: v1.Container.Ports: []v1.ContainerPort: Name: ImagePullPolicy: Image: Env: []v1.EnvVar: v1.EnvVar.Value: ReadString: expects " or n, but found 0, error found in #10 byte of ...|,"value":0}],"image"|..., bigger context ...|alue":"9361"},{"name":"UPSTREAM_TIMEOUT","value":0}],"image":"docker [10:01:49] -registry.wikimedia.org/envoy-tl|... [10:01:52] sigh... 
interesting [10:05:28] (03PS1) 10Alexandros Kosiaris: _tls_helpers: Double quote UPSTREAM_TIMEOUT value [deployment-charts] - 10https://gerrit.wikimedia.org/r/579237 (https://phabricator.wikimedia.org/T238658) [10:07:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] _tls_helpers: Double quote UPSTREAM_TIMEOUT value [deployment-charts] - 10https://gerrit.wikimedia.org/r/579237 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [10:07:52] (03Merged) 10jenkins-bot: _tls_helpers: Double quote UPSTREAM_TIMEOUT value [deployment-charts] - 10https://gerrit.wikimedia.org/r/579237 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [10:10:23] (03CR) 10Hnowlan: [C: 03+2] changeprop: Nutcracker sidecar fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/579016 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [10:11:11] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/578988 (owner: 10Jbond) [10:11:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM the openstack part. Please collect a +1 from someone related to the archiva module." [puppet] - 10https://gerrit.wikimedia.org/r/578609 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [10:12:32] (03PS2) 10Hnowlan: changeprop: Nutcracker sidecar fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/579016 (https://phabricator.wikimedia.org/T213193) [10:13:19] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [10:13:19] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
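Note on the UPGRADE FAILED error above: the Kubernetes API rejected the manifest because the `value` of an env var must be a string, and the template rendered `UPSTREAM_TIMEOUT` as the bare number 0. A minimal sketch of the idea behind the "double quote" patch, written in Python rather than as the actual Helm template; the helper name `env_entries` is hypothetical and is not part of the deployment-charts repo:

```python
import json


def env_entries(values):
    """Build a Kubernetes-style env list, forcing every value to a string.

    `values` is a plain dict such as {"UPSTREAM_TIMEOUT": 0}. The API
    expects EnvVar.value to be a string, so numbers are converted here
    instead of being emitted as bare JSON/YAML numbers.
    """
    return [{"name": name, "value": str(value)} for name, value in values.items()]


if __name__ == "__main__":
    # Hypothetical input mirroring the failing case from the log.
    rendered = env_entries({"UPSTREAM_TIMEOUT": 0})
    # The value is serialized as "0" (quoted), not 0.
    print(json.dumps(rendered))
```

In a Helm template the same effect is usually obtained by quoting the rendered value, which is what the patch title suggests was done.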
[10:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:34] take #2 [10:15:08] (03CR) 10Ema: [C: 03+2] Use json.Marshal [software/atskafka] - 10https://gerrit.wikimedia.org/r/578532 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [10:21:03] _joe_: dammit, it didn't work [10:21:24] (03CR) 10Elukey: profile::kerberos: extend ticket lifetime from 10h to 48h (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579234 (owner: 10Elukey) [10:21:27] <_joe_> akosiaris: try setting a timeout of 86400? [10:21:42] <_joe_> I'm not sure timeout 0 is interpreted as no timeout in envoy [10:22:03] 10Operations, 10Commons, 10Thumbor: Thumbnailing page 2 of c:File:Mimořádné opatření - zákaz vývozu desinfekce rukou.pdf generates a non-fatal Ghostscript error that is piped to imagemagick - https://phabricator.wikimedia.org/T247473 (10Urbanecm) @AntiCompositeNumber This is the traceback ` Traceback (most... [10:22:30] _joe_: https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/route/route_components.proto#envoy-api-field-route-routeaction-timeout [10:22:35] A value of 0 will disable the route’s timeout. [10:22:35] (03PS2) 10Elukey: profile::kerberos: extend ticket lifetime from 10h to 48h [puppet] - 10https://gerrit.wikimedia.org/r/579234 [10:23:04] <_joe_> akosiaris: uhm can we try though? [10:23:09] <_joe_> if that's not the issue [10:23:11] yeah, sure [10:23:17] <_joe_> I mean it's still 15 seconds? 
[10:23:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/578406 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [10:23:31] yes, it's still 16s [10:24:58] <_joe_> I'll quickly try codfw [10:25:08] I am working on codfw [10:25:21] cause eqiad is diverged by andrew's tests and I don't want to mess with that [10:26:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add support for redirecting to toolforge.org [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [10:28:09] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [10:28:09] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [10:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:31] (03CR) 10Elukey: [C: 03+1] "Code looks sane, modulo the fact that I am not an expert in go or the confluent lib :)" (031 comment) [software/atskafka] - 10https://gerrit.wikimedia.org/r/578549 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [10:29:56] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [10:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:17] (03PS1) 10Volans: sre.dns.netbox: fix CWD for runuser execution [cookbooks] - 10https://gerrit.wikimedia.org/r/579239 (https://phabricator.wikimedia.org/T233183) [10:30:36] _joe_: yeah, neither did 86400 work [10:30:37] (03CR) 10Ema: [C: 03+2] Handle rdkafka statistics (031 comment) [software/atskafka] - 10https://gerrit.wikimedia.org/r/578549 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [10:30:44] btw... 
how are those env vars used? [10:31:01] I don't see refs to UPSTREAM_TIMEOUT in envoy.yaml [10:31:08] <_joe_> akosiaris: uhm [10:31:09] (03PS2) 10Volans: sre.dns.netbox: fix CWD for runuser execution [cookbooks] - 10https://gerrit.wikimedia.org/r/579239 (https://phabricator.wikimedia.org/T233183) [10:31:14] <_joe_> maybe it wasn't added? [10:31:30] <_joe_> akosiaris: wait [10:31:38] <_joe_> what version of the envoy image are you using? [10:31:57] image_version: 1.11.2-1 [10:32:02] <_joe_> ok [10:32:04] <_joe_> lol [10:32:34] <_joe_> 1.12.2-1 is what should be used [10:32:39] <_joe_> but look out for things missing [10:32:46] <_joe_> let me do it [10:32:51] <_joe_> and sorry for leading you astray [10:33:06] oh wow... 4 revisions back [10:33:18] <_joe_> yeah man [10:33:26] <_joe_> and security releases too [10:33:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/579239 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:34:03] (03PS1) 10Hnowlan: changeprop: Release new version of changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/579240 (https://phabricator.wikimedia.org/T213193) [10:34:05] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: fix CWD for runuser execution [cookbooks] - 10https://gerrit.wikimedia.org/r/579239 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:36:01] (03Merged) 10jenkins-bot: sre.dns.netbox: fix CWD for runuser execution [cookbooks] - 10https://gerrit.wikimedia.org/r/579239 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:39:16] !log volans@cumin2001 START - Cookbook sre.dns.netbox [10:39:17] !log volans@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:24] (03PS1) 10Volans: sre.dns.netbox: add missing -- to runuser commands [cookbooks] - 
10https://gerrit.wikimedia.org/r/579242 (https://phabricator.wikimedia.org/T233183) [10:42:50] (03PS1) 10Jbond: demo [puppet] - 10https://gerrit.wikimedia.org/r/579243 [10:43:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/579242 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:43:35] (03PS1) 10Giuseppe Lavagetto: eventstreams: use latest envoy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/579244 (https://phabricator.wikimedia.org/T238658) [10:43:58] <_joe_> akosiaris: ^^ the helpers were all up to date [10:44:01] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: add missing -- to runuser commands [cookbooks] - 10https://gerrit.wikimedia.org/r/579242 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:44:05] (03CR) 10jerkins-bot: [V: 04-1] demo [puppet] - 10https://gerrit.wikimedia.org/r/579243 (owner: 10Jbond) [10:44:07] <_joe_> so we were setting the env var but not using it [10:44:53] _joe_: versions are fully compatible? [10:45:23] <_joe_> akosiaris: well define compatible, but yes [10:45:28] that minor version number bump scared me a bit and was worried they might have differences [10:45:32] ok let me try that [10:45:45] (03Merged) 10jenkins-bot: sre.dns.netbox: add missing -- to runuser commands [cookbooks] - 10https://gerrit.wikimedia.org/r/579242 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:45:48] <_joe_> akosiaris: the new version is compatible with the chart [10:45:55] <_joe_> the old one, not really :P [10:46:12] why not 1.12.2-1 btw? [10:46:24] but rather 1.12.1-3? [10:46:39] <_joe_> uh I mispasted [10:46:41] <_joe_> sigh [10:46:44] <_joe_> lemme update the patch [10:46:47] :D [10:46:52] at least I noticed it [10:46:52] 10Operations, 10WMNO-Sámi, 10Wikimedia-Mailing-lists: Create mailing list for WMNO Sámi project - https://phabricator.wikimedia.org/T182093 (10jhsoby-WMNO) Thank you very much, @Vgutierrez! 
[10:48:08] (03PS2) 10Giuseppe Lavagetto: eventstreams: use latest envoy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/579244 (https://phabricator.wikimedia.org/T238658) [10:49:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] package_builder: do not set webproxy on WMCS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) (owner: 10Hashar) [10:50:51] (03CR) 10Hashar: "Cool, thanks for confirming. I will add a http_proxy parameter :]" [puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) (owner: 10Hashar) [10:52:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams: use latest envoy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/579244 (https://phabricator.wikimedia.org/T238658) (owner: 10Giuseppe Lavagetto) [10:52:23] (03PS2) 10Jbond: demo [puppet] - 10https://gerrit.wikimedia.org/r/579243 [10:52:26] (03Merged) 10jenkins-bot: eventstreams: use latest envoy image [deployment-charts] - 10https://gerrit.wikimedia.org/r/579244 (https://phabricator.wikimedia.org/T238658) (owner: 10Giuseppe Lavagetto) [10:54:06] (03CR) 10jerkins-bot: [V: 04-1] demo [puppet] - 10https://gerrit.wikimedia.org/r/579243 (owner: 10Jbond) [10:57:20] elukey: [10:57:24] ignore that [10:57:43] (03PS1) 10Ema: atskafka: add puppet module [puppet] - 10https://gerrit.wikimedia.org/r/579247 (https://phabricator.wikimedia.org/T247497) [10:58:07] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [10:58:07] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
[10:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:15] (03CR) 10Elukey: [C: 03+2] profile::kerberos: extend ticket lifetime from 10h to 48h [puppet] - 10https://gerrit.wikimedia.org/r/579234 (owner: 10Elukey) [10:58:44] (03CR) 10jerkins-bot: [V: 04-1] atskafka: add puppet module [puppet] - 10https://gerrit.wikimedia.org/r/579247 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [10:59:19] !log volans@cumin2001 START - Cookbook sre.dns.netbox [10:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:31] (03PS2) 10Ema: atskafka: add puppet module [puppet] - 10https://gerrit.wikimedia.org/r/579247 (https://phabricator.wikimedia.org/T247497) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200312T1100). [11:00:04] CFisch_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] _joe_: eventstreams-canary-75577b4cc4-mdxkz 1/2 Error 3 73s [11:00:16] sigh [11:00:21] that upgrade failed [11:02:11] !log volans@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:34] error initializing configuration '/etc/envoy/envoy.yaml': Unable to parse JSON as proto (INVALID_ARGUMENT [11:02:37] sigh [11:02:54] Invalid data type for duration, value is 0 for type type.googleapis.com/google.protobuf.Duration [11:02:57] so... the docs lie? [11:03:00] <_joe_> oh I think I know what it is [11:03:03] <_joe_> yes [11:03:12] seriously? [11:03:20] <_joe_> didn't you try with staging first? 
[11:03:24] the docs say "put 0 there to disable it", but then nope? [11:03:36] <_joe_> try "0s" ? but try on staging please [11:03:43] _joe_: no but we are good, kubernetes refused to proceed [11:03:50] it tried to spawn the new pods ofc [11:04:00] <_joe_> no I know [11:04:04] but all are in CrashLoopBackOff [11:04:08] <_joe_> but "0s" will be a valid value [11:04:14] seriously? [11:04:19] <_joe_> but it might not have the effect we hope fore [11:04:32] it will a nice and useless rabbithole then [11:04:38] let me try that 0s on staging then [11:05:51] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [11:05:51] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [11:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:56] (03PS3) 10Jbond: analytics users: test using yaml pointers to simplify group structres [puppet] - 10https://gerrit.wikimedia.org/r/579243 [11:09:24] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [11:09:24] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
[11:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:56] !log roll restart of krb-kdc on krb1001/krb2001 to pick up new ticket lifetime settings (10h -> 48h) [11:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:18] _joe_: yup, "0s" worked [11:10:19] sigh [11:11:15] let's release 1 more chart version [11:11:45] <_joe_> akosiaris: sorry, but also - told ya [11:11:57] <_joe_> the interaction between protobufs and yaml generates monsters [11:13:06] (03PS1) 10Alexandros Kosiaris: eventstreams: Pass 0 seconds as "0s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/579248 (https://phabricator.wikimedia.org/T238658) [11:13:32] (03PS4) 10Jbond: analytics users: test using yaml pointers to simplify group structres [puppet] - 10https://gerrit.wikimedia.org/r/579243 [11:18:05] (03PS5) 10Jbond: analytics users: test using yaml pointers to simplify group members [puppet] - 10https://gerrit.wikimedia.org/r/579243 [11:19:31] (03CR) 10Hnowlan: [C: 03+2] changeprop: Release new version of changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/579240 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [11:21:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams: Pass 0 seconds as "0s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/579248 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [11:21:30] (03Merged) 10jenkins-bot: eventstreams: Pass 0 seconds as "0s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/579248 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [11:23:25] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [11:23:25] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
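Note on the "0" versus "0s" confusion above: the route timeout is a protobuf Duration, and when a Duration is written as JSON or YAML text it is a string of seconds ending in "s", which is why a bare 0 was rejected while "0s" parsed. A minimal sketch of a formatter, assuming the timeout is held as a number of seconds; the function name `to_proto_duration` is hypothetical and not taken from the chart or from Envoy:

```python
from decimal import Decimal


def to_proto_duration(seconds):
    """Format a non-negative number of seconds as a protobuf Duration string.

    Examples of the textual form: "0s", "86400s", "1.5s".
    At most nine fractional digits (nanoseconds) are kept.
    """
    if isinstance(seconds, bool) or not isinstance(seconds, (int, float)):
        raise TypeError("seconds must be an int or float")
    if seconds < 0:
        raise ValueError("a timeout cannot be negative")
    # Go through str() so that floats like 1.5 do not pick up binary noise.
    quantized = Decimal(str(seconds)).quantize(Decimal("0.000000001"))
    text = format(quantized, "f")
    # Drop trailing zeros and a dangling decimal point: "0.000000000" -> "0".
    text = text.rstrip("0").rstrip(".")
    return text + "s"


if __name__ == "__main__":
    for value in (0, 86400, 1.5):
        print(value, "->", to_proto_duration(value))
```

This only covers the syntax of the value. As the rest of the log shows, a syntactically valid "0s" or "86400s" did not by itself remove the roughly 15s cutoff.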
[11:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:32] Is anyone around for SWAT? I just realised that it moved UTC with the time change [11:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:42] (03PS6) 10Jbond: analytics users: test using yaml pointers to simplify group members [puppet] - 10https://gerrit.wikimedia.org/r/579243 [11:25:16] (03PS1) 10Volans: dns: add prefix to metadata [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/579249 (https://phabricator.wikimedia.org/T233183) [11:25:16] I agreed to baby sit the one patch there; anyone have any objection to me doing it now? [11:25:29] (03PS2) 10KartikMistry: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/554849 (https://phabricator.wikimedia.org/T236080) [11:26:00] (03PS1) 10Arturo Borrero Gonzalez: Revert "toolforge: support canonical redirects in urlproxy" [puppet] - 10https://gerrit.wikimedia.org/r/579250 [11:26:38] (03PS1) 10Volans: sre.dns.netbox: properly read metadata line [cookbooks] - 10https://gerrit.wikimedia.org/r/579251 (https://phabricator.wikimedia.org/T233183) [11:28:14] (03CR) 10jerkins-bot: [V: 04-1] sre.dns.netbox: properly read metadata line [cookbooks] - 10https://gerrit.wikimedia.org/r/579251 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [11:28:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "toolforge: support canonical redirects in urlproxy" [puppet] - 10https://gerrit.wikimedia.org/r/579250 (owner: 10Arturo Borrero Gonzalez) [11:29:19] (03CR) 10jerkins-bot: [V: 04-1] apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/554849 (https://phabricator.wikimedia.org/T236080) (owner: 10KartikMistry) [11:29:42] (03PS2) 10Volans: sre.dns.netbox: properly read metadata line [cookbooks] - 10https://gerrit.wikimedia.org/r/579251 
(https://phabricator.wikimedia.org/T233183) [11:30:30] (03CR) 10Volans: [C: 03+2] "self-merging to unblock testing. I will adjust according to any post-merge review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/579249 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [11:31:12] I'm assuming that it's good to go ahead; I just +2'd the backport. Just waiting for Jenkins; do shout if the SWAT window has been taken up with some some other work though [11:32:06] _joe_: didn't work. :-(. Requests still timeout after 15 or 16s [11:32:22] <_joe_> akosiaris: but it did work in staging? [11:33:07] 0s did work in staging and codfw as a configuration parameter but it did not have the result we hoped for [11:33:13] <_joe_> oh ok [11:33:29] I can see in envoy.yaml [11:33:31] route: [11:33:31] cluster: local_service [11:33:31] timeout: 0s [11:33:35] <_joe_> ok [11:33:36] so it must be something else? [11:33:44] <_joe_> try 86400s [11:34:16] akosiaris: Sounds like you're all doing k8s stuff; don't want to tread on your toes. Is it still ok to do a SWAT? [11:34:17] (03CR) 10Volans: [C: 03+2] "self-merging to unblock testing. I will adjust according to any post-merge review." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/579251 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [11:34:27] <_joe_> tarrow: go ahead [11:34:31] thanks :) [11:36:05] (03Merged) 10jenkins-bot: sre.dns.netbox: properly read metadata line [cookbooks] - 10https://gerrit.wikimedia.org/r/579251 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [11:37:11] (03PS7) 10Jbond: analytics users: test using yaml pointers to simplify group members [puppet] - 10https://gerrit.wikimedia.org/r/579243 [11:37:58] !log volans@cumin2001 START - Cookbook sre.dns.netbox [11:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:02] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [11:38:02] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [11:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:53] !log volans@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:28] _joe_: nope, didn't work either. I can see - match: {prefix: /} [11:41:29] route: [11:41:29] cluster: local_service [11:41:29] timeout: 86400s [11:41:36] and yet it will kill it in 16s [11:41:45] rolling back [11:41:48] <_joe_> ok this makes no sense [11:42:04] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [11:42:04] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
[11:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:11] <_joe_> let me make a few tests locally [11:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:12] yes it doesn't [11:45:24] (03PS1) 10Volans: sre.dns.netbox: exit early if no changes [cookbooks] - 10https://gerrit.wikimedia.org/r/579254 (https://phabricator.wikimedia.org/T233183) [11:45:59] Anyone know what I should do about unstaged changes to an extension? Shouldn't security updates normally be comitted so I can rebase over them? [11:46:24] <_joe_> akosiaris: I'm going to run some tests locally [11:47:02] (03CR) 10jerkins-bot: [V: 04-1] sre.dns.netbox: exit early if no changes [cookbooks] - 10https://gerrit.wikimedia.org/r/579254 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [11:47:10] (03PS2) 10Volans: sre.dns.netbox: exit early if no changes [cookbooks] - 10https://gerrit.wikimedia.org/r/579254 (https://phabricator.wikimedia.org/T233183) [11:47:43] Lucas_WMDE: don't suppose I can come crying to you for advice? As a normal SWAT deployer for this window? [11:47:58] (03PS3) 10Volans: sre.dns.netbox: exit early if no changes [cookbooks] - 10https://gerrit.wikimedia.org/r/579254 (https://phabricator.wikimedia.org/T233183) [11:48:39] or maybe Amir1 ? [11:48:51] what's up? [11:49:29] I'm being overly cautious; I was going to do a SWAT deploy but there are unstaged changes in .23 [11:49:37] not in my extension [11:50:22] It's pretty small; I guess it's a security thing but in the past I've always found they've been commited [11:50:47] can you show me in private? 
[11:50:50] somehow [11:50:51] tarrow: I think those were already there when I SWATted yesterday [11:50:56] `git rebase` should work [11:51:06] (I’m in a meeting, can’t do the SWAT myself, sorry) [11:51:11] cool, no problem [11:52:25] Yeah, I'm just used to working on a clean repo and it not being makes me slightly more cautious [11:52:51] yeah [11:53:08] I’m also used to `git rebase` not working when `git status` reports uncommitted stuff, but apparently for submodules it works and does the right thing [11:53:36] cool, so just `git rebase` as normal works fine? [11:53:40] I think so, yeah [11:54:07] great; I'm doing that now then :) [11:54:11] ok, good luck :) [11:55:16] pulling to mwdebug [11:56:48] (03PS8) 10Jbond: analytics users: test using yaml pointers to simplify group members [puppet] - 10https://gerrit.wikimedia.org/r/579243 [11:58:45] looks good; syncing [11:59:27] (03CR) 10Volans: [C: 03+2] "self-merging to unblock testing. I will adjust according to any post-merge review." [cookbooks] - 10https://gerrit.wikimedia.org/r/579254 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [12:00:10] !log tarrow@deploy1001 Synchronized php-1.35.0-wmf.23/extensions/TwoColConflict: SWAT: [[gerrit:579221|Detect whether an edit came from VisualEditor (T245722)]] (duration: 01m 10s) [12:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:16] T245722: Record "should-have" metrics for TwoColConflict - https://phabricator.wikimedia.org/T245722 [12:00:36] (03PS9) 10Jbond: analytics users: test using yaml pointers to simplify group members [puppet] - 10https://gerrit.wikimedia.org/r/579243 [12:00:46] !log EU SWAT done [12:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:06] (03Merged) 10jenkins-bot: sre.dns.netbox: exit early if no changes [cookbooks] - 10https://gerrit.wikimedia.org/r/579254 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [12:18:44] (03PS2) 10Hashar: 
package_builder: do not set webproxy by default [puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) [12:18:51] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) (owner: 10Hashar) [12:19:11] (03Abandoned) 10Jbond: analytics users: test using yaml pointers to simplify group members [puppet] - 10https://gerrit.wikimedia.org/r/579243 (owner: 10Jbond) [12:19:21] (03CR) 10Jbond: [C: 03+2] decom elnath.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/579032 (owner: 10Dzahn) [12:19:32] (03PS2) 10Jbond: decom elnath.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/579032 (owner: 10Dzahn) [12:22:30] (03CR) 10Jbond: [C: 03+2] puppetmaster: stop allowing elnath [puppet] - 10https://gerrit.wikimedia.org/r/579034 (https://phabricator.wikimedia.org/T188544) (owner: 10Dzahn) [12:29:15] !log volans@cumin2001 START - Cookbook sre.dns.netbox [12:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:16] !log volans@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:17] (03PS1) 10Hnowlan: changeprop: Release new version of changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/579257 (https://phabricator.wikimedia.org/T213193) [12:34:53] (03Abandoned) 10Hnowlan: changeprop: Release new version of changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/579240 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [12:36:06] (03PS3) 10KartikMistry: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/554849 (https://phabricator.wikimedia.org/T236080) [12:39:42] (03CR) 10jerkins-bot: [V: 04-1] apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/554849 
(https://phabricator.wikimedia.org/T236080) (owner: 10KartikMistry) [12:42:52] (03CR) 10Hashar: "I have added a http_proxy parameter everywhere which defaults to undef." [puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) (owner: 10Hashar) [12:51:41] (03PS1) 10Jbond: vault: add certificate file for vault [puppet] - 10https://gerrit.wikimedia.org/r/579258 [12:53:29] (03PS1) 10Arturo Borrero Gonzalez: openstack: l3_agent: drop routing_source_ip hack [puppet] - 10https://gerrit.wikimedia.org/r/579259 (https://phabricator.wikimedia.org/T247505) [12:54:32] (03CR) 10Hnowlan: [C: 03+2] changeprop: Release new version of changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/579257 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [12:54:35] (03CR) 10jerkins-bot: [V: 04-1] openstack: l3_agent: drop routing_source_ip hack [puppet] - 10https://gerrit.wikimedia.org/r/579259 (https://phabricator.wikimedia.org/T247505) (owner: 10Arturo Borrero Gonzalez) [12:55:02] (03Merged) 10jenkins-bot: changeprop: Release new version of changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/579257 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [12:56:17] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [12:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] brennen and hashar: Time to snap out of that daydream and deploy Mediawiki train - American+European Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200312T1300). 
[13:17:31] (03PS1) 10Vgutierrez: Release 8.0.6-1wm3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/579262 (https://phabricator.wikimedia.org/T245616) [13:19:15] 10Operations, 10Puppet, 10User-jbond: configure and Test vaults capabilities as an ondemand CA - https://phabricator.wikimedia.org/T247509 (10jbond) p:05Triage→03Medium [13:21:11] (03CR) 10Ottomata: "I'm not sure what the current status of the mysql research password file, but it is possible that users that will now be only in analytics" [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [13:21:13] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [13:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:17] 10Operations, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) a:05Jgreen→03None [13:22:45] (03PS3) 10Andrew Bogott: Designate: add wmcs-populate-domains script [puppet] - 10https://gerrit.wikimedia.org/r/579121 (https://phabricator.wikimedia.org/T247374) [13:24:32] (03PS4) 10Andrew Bogott: Designate: add wmcs-populate-domains script [puppet] - 10https://gerrit.wikimedia.org/r/579121 (https://phabricator.wikimedia.org/T247374) [13:25:24] 10Operations, 10Traffic, 10observability: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10CDanis) As of this morning: 2003/2004 presently identical. Skew still increased on 1003/1004: {P10691} [13:26:07] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [13:26:07] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[13:26:11] (03CR) 10Andrew Bogott: [C: 04-2] "This patchset is for illustrative purposes only; the actual changes are already merged (albeit in a different order)" [puppet] - 10https://gerrit.wikimedia.org/r/578522 (owner: 10Andrew Bogott) [13:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:19] (03CR) 10Andrew Bogott: [C: 04-2] "This patchset is for illustrative purposes only; the actual changes are already merged (albeit in a different order)" [puppet] - 10https://gerrit.wikimedia.org/r/578523 (owner: 10Andrew Bogott) [13:26:26] (03CR) 10Andrew Bogott: [C: 04-2] "This patchset is for illustrative purposes only; the actual changes are already merged (albeit in a different order)" [puppet] - 10https://gerrit.wikimedia.org/r/578524 (owner: 10Andrew Bogott) [13:26:38] (03PS5) 10Andrew Bogott: Designate: add wmcs-populate-domains script [puppet] - 10https://gerrit.wikimedia.org/r/579121 (https://phabricator.wikimedia.org/T247374) [13:31:24] (03CR) 10Andrew Bogott: [C: 03+2] Designate: add wmcs-populate-domains script [puppet] - 10https://gerrit.wikimedia.org/r/579121 (https://phabricator.wikimedia.org/T247374) (owner: 10Andrew Bogott) [13:32:18] (03CR) 10C. Scott Ananian: "yay for fixes!" [puppet] - 10https://gerrit.wikimedia.org/r/579233 (https://phabricator.wikimedia.org/T246854) (owner: 10Hashar) [13:34:16] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] changeprop: Nutcracker sidecar fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/579016 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [13:34:21] (03PS1) 10Andrew Bogott: wmcs-populate-domains: add to Pike [puppet] - 10https://gerrit.wikimedia.org/r/579267 [13:35:22] !log mvolz@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'zotero' for release 'staging' . 
[13:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:43] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-populate-domains: add to Pike [puppet] - 10https://gerrit.wikimedia.org/r/579267 (owner: 10Andrew Bogott) [13:36:45] (03PS3) 10Hnowlan: changeprop: Nutcracker sidecar fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/579016 (https://phabricator.wikimedia.org/T213193) [13:42:43] (03PS1) 10Jbond: puppet agent: add parameter to change certificate_revocation behaviour [puppet] - 10https://gerrit.wikimedia.org/r/579268 (https://phabricator.wikimedia.org/T247509) [13:42:45] (03PS1) 10Jbond: puppetmaster: add parameter to control ssl_verify_depth [puppet] - 10https://gerrit.wikimedia.org/r/579269 (https://phabricator.wikimedia.org/T247509) [13:43:12] (03PS2) 10Jbond: puppetmaster: add parameter to control ssl_verify_depth [puppet] - 10https://gerrit.wikimedia.org/r/579269 (https://phabricator.wikimedia.org/T247509) [13:44:05] (03PS3) 10Jbond: puppetmaster: add parameter to control ssl_verify_depth [puppet] - 10https://gerrit.wikimedia.org/r/579269 (https://phabricator.wikimedia.org/T247509) [13:44:16] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/579269 (https://phabricator.wikimedia.org/T247509) (owner: 10Jbond) [13:45:31] (03CR) 10jerkins-bot: [V: 04-1] puppet agent: add parameter to change certificate_revocation behaviour [puppet] - 10https://gerrit.wikimedia.org/r/579268 (https://phabricator.wikimedia.org/T247509) (owner: 10Jbond) [13:45:44] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: add parameter to control ssl_verify_depth [puppet] - 10https://gerrit.wikimedia.org/r/579269 (https://phabricator.wikimedia.org/T247509) (owner: 10Jbond) [13:45:48] PROBLEM - BGP status on pfw3-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:46:00] euh? [13:46:23] that's fundraising [13:46:32] Jeff_Green: ^ is that you? 
[13:47:00] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] changeprop: Nutcracker sidecar fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/579016 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [13:47:16] (03Merged) 10jenkins-bot: changeprop: Nutcracker sidecar fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/579016 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [13:48:37] mark: https://librenms.wikimedia.org/eventlog two ports on fasw-c-codfw (which have description pay-lvs) went down at the same time [13:49:43] (03PS1) 10Hnowlan: changeprop: Release version with fixes for nutcracker behaviour. [deployment-charts] - 10https://gerrit.wikimedia.org/r/579270 (https://phabricator.wikimedia.org/T213193) [13:51:42] mark yes, I think so--I'm rebuilding pay-lvs2001 & 2002 [13:52:01] ok - please schedule downtime :) [13:52:07] I did [13:52:19] hmm i suppose [13:52:29] I'm not sure what that's coming from, separate monitoring on pfw3-codfw? [13:52:35] this is an alert on BGP for the router which you maybe weren't expecting [13:52:36] yes [13:52:40] yup, that's the monitoring on the router side [13:53:22] ok let's see if I can figure out how to downtime that check... [13:53:49] (03PS4) 10Jbond: puppetmaster: add parameter to control ssl_verify_depth [puppet] - 10https://gerrit.wikimedia.org/r/579269 (https://phabricator.wikimedia.org/T247509) [13:54:00] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/579269 (https://phabricator.wikimedia.org/T247509) (owner: 10Jbond) [13:54:25] ok got it I think! 
[13:55:59] (03PS2) 10Jbond: puppet agent: add parameter to change certificate_revocation behaviour [puppet] - 10https://gerrit.wikimedia.org/r/579268 (https://phabricator.wikimedia.org/T247509) [13:57:28] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/579268 (https://phabricator.wikimedia.org/T247509) (owner: 10Jbond) [14:03:08] 10Operations, 10DBA, 10OTRS, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) In the end, this will happen Thursday 19th March 09:00 AM UTC [14:03:19] 10Operations, 10DBA, 10OTRS, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) [14:03:57] (03CR) 10Hnowlan: [C: 03+2] changeprop: Release version with fixes for nutcracker behaviour. [deployment-charts] - 10https://gerrit.wikimedia.org/r/579270 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [14:04:15] (03Merged) 10jenkins-bot: changeprop: Release version with fixes for nutcracker behaviour. [deployment-charts] - 10https://gerrit.wikimedia.org/r/579270 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [14:05:21] (03CR) 10Elukey: "> I'm not sure what the current status of the mysql research password" [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [14:05:32] (03PS1) 10Volans: sre.dns.netbox: fix metadata detection in output [cookbooks] - 10https://gerrit.wikimedia.org/r/579271 (https://phabricator.wikimedia.org/T233183) [14:07:10] (03CR) 10jerkins-bot: [V: 04-1] sre.dns.netbox: fix metadata detection in output [cookbooks] - 10https://gerrit.wikimedia.org/r/579271 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [14:07:56] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[14:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:03] (03CR) 10Jbond: "cloud pcc: https://puppet-compiler.wmflabs.org/compiler1001/21402/" [puppet] - 10https://gerrit.wikimedia.org/r/579269 (https://phabricator.wikimedia.org/T247509) (owner: 10Jbond) [14:12:34] (03CR) 10Muehlenhoff: "JFTR; in modules/openldap/files/cross-validate-accounts.py we have a check for the "ops" membership and subset permissions, a similar one " [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [14:12:39] 10Operations, 10Citoid, 10Wikimedia-Logstash, 10observability, and 3 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (10Mvolz) p:05Medium→03High [14:13:27] (03PS2) 10Volans: sre.dns.netbox: fix metadata detection in output [cookbooks] - 10https://gerrit.wikimedia.org/r/579271 (https://phabricator.wikimedia.org/T233183) [14:13:43] 10Operations, 10Citoid: Use log level warn in citoid whenever zotero is unresponsive - https://phabricator.wikimedia.org/T243504 (10Mvolz) p:05Triage→03High [14:18:07] (03CR) 10Ottomata: "> We could assume it as acceptable risk, and just point people to the right file in case a question is raised?" [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [14:18:40] (03CR) 10Elukey: "> > We could assume it as acceptable risk, and just point people to" [puppet] - 10https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [14:20:26] (03CR) 10Volans: [C: 03+2] "self-merging to unblock testing. I will adjust according to any post-merge review." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/579271 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [14:21:15] 10Operations, 10Analytics, 10Traffic, 10Patch-For-Review: Test atskafka deployment - https://phabricator.wikimedia.org/T247497 (10elukey) I had a chat with Ema, reporting what we discussed in here. The test will run on a couple of cp hosts, so I proposed to create 2 new topics called `atskafka_test_webrequ... [14:22:01] (03Merged) 10jenkins-bot: sre.dns.netbox: fix metadata detection in output [cookbooks] - 10https://gerrit.wikimedia.org/r/579271 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [14:23:08] 10Operations, 10Cloud-VPS (Project-requests): Request creation of SRE VPS project - https://phabricator.wikimedia.org/T247517 (10jbond) p:05Triage→03Medium [14:23:29] !log volans@cumin2001 START - Cookbook sre.dns.netbox [14:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:24] !log volans@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:26:27] 10Operations, 10Cloud-VPS (Project-requests): Request creation of SRE VPS project - https://phabricator.wikimedia.org/T247517 (10CDanis) [14:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:31] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10Mvolz) >>! In T243444#5947944, @akosiaris wrote: > I guess we aren't gonna find the source of this specific event. @mvolz do you feel we are at least better prepared logging wise f... 
[14:27:42] 10Operations, 10Citoid: Use log level warn in citoid whenever zotero is unresponsive - https://phabricator.wikimedia.org/T243504 (10Mvolz) 05Open→03Resolved [14:27:44] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10Mvolz) [14:28:16] (03PS3) 10Hashar: package_builder: do not set webproxy by default [puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) [14:28:31] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) (owner: 10Hashar) [14:31:12] (03PS8) 10Ssingh: First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (https://phabricator.wikimedia.org/T247273) [14:31:22] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 152.5 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [14:36:05] 10Operations, 10Analytics, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Gehel) And we have a [[ https://docs.google.com/document/d/1H-iaH5Tktye5rIcLic38FkVGqRBZzXwFhYwsL3nXH78/edit# | first... 
[14:37:15] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt and production DNS for ganeti2019 to ganeti2024 [dns] - 10https://gerrit.wikimedia.org/r/579053 (owner: 10Papaul) [14:37:25] (03PS2) 10Papaul: DNS: Add mgmt and production DNS for ganeti2019 to ganeti2024 [dns] - 10https://gerrit.wikimedia.org/r/579053 [14:37:28] (03CR) 10Papaul: [V: 03+2 C: 03+2] DNS: Add mgmt and production DNS for ganeti2019 to ganeti2024 [dns] - 10https://gerrit.wikimedia.org/r/579053 (owner: 10Papaul) [14:37:35] (03CR) 10Muehlenhoff: "debian/ looks fine, two comments inline" (032 comments) [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [14:40:01] (03CR) 10Ema: [C: 03+1] Release 8.0.6-1wm3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/579262 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [14:40:46] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ganeti2019.codfw.wmnet ` The log can be found in... [14:42:34] (03PS9) 10Ssingh: First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (https://phabricator.wikimedia.org/T247273) [14:44:04] PROBLEM - Kerberos Kpropd daemon on krb2001 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/sbin/kpropd https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [14:45:18] (03CR) 10Hashar: "I have no idea why the puppet compiler is unable to fulfill loookup(http_proxy) for boiron.eqiad.wmnet :(" [puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) (owner: 10Hashar) [14:46:43] (03CR) 10Ssingh: "Thanks for the help and quick review; I have updated the files." 
[software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [14:46:53] 10Operations, 10Traffic, 10observability: some Prometheus not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10Aklapper) [14:47:47] 10Operations, 10Traffic, 10observability: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10CDanis) [14:47:53] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.6-1wm3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/579262 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [14:48:08] (03PS1) 10Ema: Define a test pipeline with Blubber [software/atskafka] - 10https://gerrit.wikimedia.org/r/579283 (https://phabricator.wikimedia.org/T237993) [14:48:20] 10Operations, 10Traffic, 10observability: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10CDanis) @Aklapper: "After extensive research, it has been determined that the correct plural of 'Prometheus' is 'Prometheis'." — https://prometheus.io/docs/introductio... 
[14:49:02] RECOVERY - Kerberos Kpropd daemon on krb2001 is OK: PROCS OK: 1 process with args /usr/sbin/kpropd https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [14:51:08] !log restart kpropd daemon on krb2001 [14:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:26] not sure why it stopped, logs don't say [14:53:39] !log uploading envoyproxy_1.13.1-1 (upgrade from 1.12.2) [14:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:26] (03PS1) 10Cmjohnson: update mac address for htmldumper1001 [puppet] - 10https://gerrit.wikimedia.org/r/579286 (https://phabricator.wikimedia.org/T245567) [15:03:48] (03CR) 10Cmjohnson: [C: 03+2] update mac address for htmldumper1001 [puppet] - 10https://gerrit.wikimedia.org/r/579286 (https://phabricator.wikimedia.org/T245567) (owner: 10Cmjohnson) [15:07:34] (03PS3) 10Jbond: utils: add puppet csr generation script [puppet] - 10https://gerrit.wikimedia.org/r/578988 [15:09:25] (03PS1) 10Muehlenhoff: Remove obsolete kafka[12]00[1-3] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/579287 [15:09:59] (03PS1) 10Cmjohnson: updating dns/asset tag names to reflect correct servers htmldumper/fran1001 [dns] - 10https://gerrit.wikimedia.org/r/579288 (https://phabricator.wikimedia.org/T245567) [15:10:24] (03PS2) 10Cmjohnson: updating dns/asset tag names to reflect correct servers htmldumper/fran1001 [dns] - 10https://gerrit.wikimedia.org/r/579288 (https://phabricator.wikimedia.org/T245567) [15:10:52] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10akosiaris) 05Open→03Invalid OK, I 've pinged @Pchelolo at https://github.com/wikimedia/service-template-node/issues/128, maybe he can help. But otherwise, I am inclined to clos... 
[15:10:54] (03CR) 10Cmjohnson: [C: 03+2] updating dns/asset tag names to reflect correct servers htmldumper/fran1001 [dns] - 10https://gerrit.wikimedia.org/r/579288 (https://phabricator.wikimedia.org/T245567) (owner: 10Cmjohnson) [15:12:23] (03CR) 10Muehlenhoff: [C: 03+1] "+1 on the debian/ part. I skimmed over the other bits a few months ago, feel free to merge from my PoV" [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [15:13:33] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` htmldumper1001.eqiad.wmnet ` The log can be foun... [15:14:42] (03PS1) 10Andrew Bogott: OpenStack/Queens/Stretch, avoid upgrading systemd [puppet] - 10https://gerrit.wikimedia.org/r/579289 (https://phabricator.wikimedia.org/T247013) [15:14:48] o/ I'm seeing a lot (over a million) of strange reports in Logstash that I haven't seen before. They all contain `proxy:unix:/run/php/fpm-www.sock|fcgi://localhost/200`. Is this normal? 
https://logstash.wikimedia.org/goto/54574d319c2b223a7e11b9d822a618a9 [15:15:13] (03PS1) 10Hashar: python-build: add gcc etc + libssl-dev [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/579290 (https://phabricator.wikimedia.org/T215458) [15:18:14] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 65.08 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [15:18:33] (03PS1) 10Elukey: Assign role::statistics::explorer to stat1008 [puppet] - 10https://gerrit.wikimedia.org/r/579291 [15:19:45] (03CR) 10Elukey: [C: 03+2] Assign role::statistics::explorer to stat1008 [puppet] - 10https://gerrit.wikimedia.org/r/579291 (owner: 10Elukey) [15:20:35] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['htmldumper1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['htmldumper1001.eqiad.wmnet'] ` [15:21:00] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [15:21:23] niedzielski: if I understand correctly, that's the apache2 logs you're looking at -- we've started streaming them from a subset of appservers to Logstash, to improve troubleshooting during incidents (https://phabricator.wikimedia.org/T244472) [15:21:35] (03CR) 10CRusnov: [C: 03+1] "Looks good!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/579290 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [15:22:38] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [15:23:31] rlazarus: thank you! [15:23:56] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` htmldumper1001.eqiad.wmnet ` The log can be foun... [15:25:06] niedzielski: sure thing! https://wikitech.wikimedia.org/wiki/Apache_log_format has the field-by-field layout of those log entries if you want to know more [15:25:43] 👍 [15:28:01] (03CR) 10Jforrester: "This doesn't fix anything. This just stops scap pointing out that this is broken." [puppet] - 10https://gerrit.wikimedia.org/r/579233 (https://phabricator.wikimedia.org/T246854) (owner: 10Hashar) [15:28:57] (03PS1) 10CRusnov: import_mgmt_dns: Update frack mgmt prefix length for codfw [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/579294 [15:29:35] (03CR) 10CRusnov: "This change is ready for review." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/579294 (owner: 10CRusnov) [15:29:41] (03PS1) 10Cmjohnson: Add mgmt/production dns for kafka-jumbo100[789] [dns] - 10https://gerrit.wikimedia.org/r/579295 (https://phabricator.wikimedia.org/T244506) [15:30:56] (03CR) 10jerkins-bot: [V: 04-1] Add mgmt/production dns for kafka-jumbo100[789] [dns] - 10https://gerrit.wikimedia.org/r/579295 (https://phabricator.wikimedia.org/T244506) (owner: 10Cmjohnson) [15:32:34] (03CR) 1020after4: [C: 03+1] python-build: add gcc etc + libssl-dev [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/579290 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [15:35:49] (03PS2) 10Cmjohnson: Add mgmt/production dns for kafka-jumbo100[789] [dns] - 10https://gerrit.wikimedia.org/r/579295 (https://phabricator.wikimedia.org/T244506) [15:37:03] (03PS1) 10Ema: Install dependencies from Debian [software/atskafka] - 10https://gerrit.wikimedia.org/r/579298 (https://phabricator.wikimedia.org/T237993) [15:37:17] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/579294 (owner: 10CRusnov) [15:37:52] (03CR) 10jerkins-bot: [V: 04-1] Install dependencies from Debian [software/atskafka] - 10https://gerrit.wikimedia.org/r/579298 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [15:37:59] (03CR) 10CRusnov: [C: 03+2] import_mgmt_dns: Update frack mgmt prefix length for codfw [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/579294 (owner: 10CRusnov) [15:38:06] (03CR) 10Cmjohnson: [C: 03+2] Add mgmt/production dns for kafka-jumbo100[789] [dns] - 10https://gerrit.wikimedia.org/r/579295 (https://phabricator.wikimedia.org/T244506) (owner: 10Cmjohnson) [15:39:52] (03PS1) 10Alexandros Kosiaris: ganeti: Consolidate partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/579299 [15:41:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti: Consolidate partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/579299 
(owner: 10Alexandros Kosiaris) [15:42:57] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10Bstorm) In that case it might be worth saving some future work and moving along to buster on these. [15:47:50] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2019.codfw.wmnet'] ` Of which those **FAILED**: ` ['ganeti2019.codfw.wmnet'] ` [15:48:48] (03PS1) 10Elukey: Add stat1008 to the list of Analytics statistics servers [puppet] - 10https://gerrit.wikimedia.org/r/579303 [15:49:13] (03PS1) 10Papaul: Partman: Add new ganeti2019 to ganeti2024 [puppet] - 10https://gerrit.wikimedia.org/r/579304 (https://phabricator.wikimedia.org/T244783) [15:51:36] (03CR) 10Papaul: [C: 03+2] Partman: Add new ganeti2019 to ganeti2024 [puppet] - 10https://gerrit.wikimedia.org/r/579304 (https://phabricator.wikimedia.org/T244783) (owner: 10Papaul) [15:51:39] (03CR) 10Elukey: "Brooke, I added you for the last xmldumps mountpoint, I promise I will not add more! (the host was just racked :)" [puppet] - 10https://gerrit.wikimedia.org/r/579303 (owner: 10Elukey) [15:54:03] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10MoritzMuehlenhoff) @elukey: I don't even think we need additional changes? E.g. an-launcher1001 is on Buster... [15:54:08] (03CR) 10C. Scott Ananian: "> This doesn't fix anything. 
This just stops scap pointing out that" [puppet] - 10https://gerrit.wikimedia.org/r/579233 (https://phabricator.wikimedia.org/T246854) (owner: 10Hashar) [15:55:37] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ganeti2019.codfw.wmnet ` The log can be found in... [15:56:11] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10elukey) >>! In T224583#5964337, @MoritzMuehlenhoff wrote: > @elukey: I don't even think we need additional ch... [15:57:28] (03CR) 10Bstorm: [C: 03+1] "I'll be ready to exportfs" [puppet] - 10https://gerrit.wikimedia.org/r/579303 (owner: 10Elukey) [16:00:05] godog and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200312T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:00:45] (03CR) 10Elukey: [C: 03+2] Add stat1008 to the list of Analytics statistics servers [puppet] - 10https://gerrit.wikimedia.org/r/579303 (owner: 10Elukey) [16:03:25] (03CR) 10Jforrester: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/579233 (https://phabricator.wikimedia.org/T246854) (owner: 10Hashar) [16:03:44] (03PS2) 10Ema: Install dependencies from Debian [software/atskafka] - 10https://gerrit.wikimedia.org/r/579298 (https://phabricator.wikimedia.org/T237993) [16:07:02] (03CR) 10Muehlenhoff: [C: 03+1] "Much better :-)" [software/atskafka] - 10https://gerrit.wikimedia.org/r/579298 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:07:32] in frack, we are seeing a chunk of icinga alerts related to passive checks going awol. 
appears that it might be related to T196336. is it possible for someone to take a look and see if we can get a restart or such to fix it up? [16:07:32] T196336: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 [16:07:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [16:07:49] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:39] (03PS1) 10Vgutierrez: admin: Add holger to mw-log-readers group [puppet] - 10https://gerrit.wikimedia.org/r/579309 (https://phabricator.wikimedia.org/T247470) [16:10:40] (03PS1) 10CDanis: include the username in logfiles [software/spicerack] - 10https://gerrit.wikimedia.org/r/579310 [16:12:07] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` htmldumper1001.eqiad.wmnet ` The log can be foun... 
[16:13:02] (03CR) 10Volans: [C: 03+1] "LGTM plus a proposal inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/579310 (owner: 10CDanis) [16:13:33] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:01] (03PS1) 10Ottomata: eventstreams - default client_ip_connection_limit: 2 per worker [deployment-charts] - 10https://gerrit.wikimedia.org/r/579312 (https://phabricator.wikimedia.org/T238658) [16:14:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/579309 (https://phabricator.wikimedia.org/T247470) (owner: 10Vgutierrez) [16:14:11] (03CR) 10CDanis: include the username in logfiles (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/579310 (owner: 10CDanis) [16:15:54] (03PS1) 10Ottomata: eventstreams - bump image version to 2020-03-12-155244-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/579313 (https://phabricator.wikimedia.org/T238658) [16:15:55] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:23] (03CR) 10Ottomata: [C: 03+2] eventstreams - default client_ip_connection_limit: 2 per worker [deployment-charts] - 10https://gerrit.wikimedia.org/r/579312 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [16:16:38] (03CR) 10Ottomata: [C: 03+2] eventstreams - bump image version to 2020-03-12-155244-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/579313 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [16:17:16] (03CR) 10Vgutierrez: [C: 03+2] admin: Add holger to mw-log-readers group [puppet] - 10https://gerrit.wikimedia.org/r/579309 (https://phabricator.wikimedia.org/T247470) (owner: 10Vgutierrez) [16:17:47] (03PS2) 10Bstorm: dumps-distribution: move all traffic to labstore1006 [puppet] - 
10https://gerrit.wikimedia.org/r/579095 (https://phabricator.wikimedia.org/T224583) [16:20:35] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [16:20:35] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [16:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:32] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwlog1001.eqiad.wmnet for holger - https://phabricator.wikimedia.org/T247470 (10Vgutierrez) 05Open→03Resolved p:05Triage→03Medium a:03Vgutierrez @holger.knust you should be able to access mwlog1001 now [16:21:41] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2019.codfw.wmnet'] ` and were **ALL** successful. [16:21:57] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ganeti2020.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-re... 
[16:22:03] (03CR) 10Volans: [C: 03+1] "reply inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/579310 (owner: 10CDanis) [16:24:01] (03CR) 10Ema: [C: 03+2] Install dependencies from Debian [software/atskafka] - 10https://gerrit.wikimedia.org/r/579298 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:26:56] (03PS2) 10CDanis: include the username in logfiles [software/spicerack] - 10https://gerrit.wikimedia.org/r/579310 [16:27:03] (03CR) 10CDanis: include the username in logfiles (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/579310 (owner: 10CDanis) [16:27:44] (03CR) 10Volans: [C: 03+1] "LGTM!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/579310 (owner: 10CDanis) [16:28:33] !log restarting icinga, acting up on command file (frack awol and downtimes) [16:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:29] (03CR) 10Guozr.im: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [16:36:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [16:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:57] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:23] 10Operations, 10observability: Icinga latency is skyrocketing and commands ignored - https://phabricator.wikimedia.org/T247538 (10Volans) p:05Triage→03High [16:39:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:33] 10Operations, 10observability: Icinga latency is skyrocketing and commands ignored - https://phabricator.wikimedia.org/T247538 (10CDanis) The main icinga daemon process, which AIUI is basically a single-threaded event loop, is pretty 
constantly near-or-at 100% CPU utilization, btw. [16:40:35] 10Operations, 10fundraising-tech-ops, 10observability: Icinga latency is skyrocketing and commands ignored - https://phabricator.wikimedia.org/T247538 (10Dwisehaupt) [16:41:50] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:00] (03PS1) 10Elukey: profile::statistics::gpu: add stat1008 to the list [puppet] - 10https://gerrit.wikimedia.org/r/579322 [16:44:47] (03CR) 10CDanis: [C: 03+2] include the username in logfiles [software/spicerack] - 10https://gerrit.wikimedia.org/r/579310 (owner: 10CDanis) [16:46:30] (03PS1) 10Alexandros Kosiaris: eventgate-analytics: Bump envoy to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/579323 (https://phabricator.wikimedia.org/T247484) [16:47:04] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2020.codfw.wmnet'] ` and were **ALL** successful. 
[16:47:12] (03CR) 10Elukey: [C: 03+2] profile::statistics::gpu: add stat1008 to the list [puppet] - 10https://gerrit.wikimedia.org/r/579322 (owner: 10Elukey) [16:49:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventgate-analytics: Bump envoy to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/579323 (https://phabricator.wikimedia.org/T247484) (owner: 10Alexandros Kosiaris) [16:51:05] (03Merged) 10jenkins-bot: include the username in logfiles [software/spicerack] - 10https://gerrit.wikimedia.org/r/579310 (owner: 10CDanis) [16:51:47] (03PS1) 10Alexandros Kosiaris: eventstreams: Drop canaries for a while [deployment-charts] - 10https://gerrit.wikimedia.org/r/579324 (https://phabricator.wikimedia.org/T247484) [16:52:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams: Drop canaries for a while [deployment-charts] - 10https://gerrit.wikimedia.org/r/579324 (https://phabricator.wikimedia.org/T247484) (owner: 10Alexandros Kosiaris) [16:52:31] (03CR) 10Hashar: package_builder: do not set webproxy by default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) (owner: 10Hashar) [16:52:41] (03Merged) 10jenkins-bot: eventstreams: Drop canaries for a while [deployment-charts] - 10https://gerrit.wikimedia.org/r/579324 (https://phabricator.wikimedia.org/T247484) (owner: 10Alexandros Kosiaris) [16:52:58] (03PS1) 10Ladsgroup: Set term store to WRITE_BOTH for all of Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579326 (https://phabricator.wikimedia.org/T219123) [16:53:11] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (10Cmjohnson) [16:53:25] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. 
- https://phabricator.wikimedia.org/T245567 (10Cmjohnson) all set, resolving [16:53:36] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (10Cmjohnson) 05Open→03Resolved [16:53:56] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [16:53:56] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [16:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:43] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10Cmjohnson) [16:56:58] (03PS4) 10Hashar: package_builder: do not set webproxy by default [puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) [17:00:05] halfak and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200312T1700). [17:01:28] RECOVERY - BGP status on pfw3-codfw is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:01:37] (03CR) 10Muehlenhoff: "Isn't there some proxy in Cloud VPS which could simply be used instead of cluttering conditionals all over the classes?" 
[puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) (owner: 10Hashar) [17:01:47] (03PS1) 10Elukey: aptrepo: add miopengemm to the whitelist of the amd-rocm271 component [puppet] - 10https://gerrit.wikimedia.org/r/579328 [17:02:26] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) (owner: 10Hashar) [17:03:15] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [17:03:16] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [17:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:25] (03CR) 10Elukey: [C: 03+2] aptrepo: add miopengemm to the whitelist of the amd-rocm271 component [puppet] - 10https://gerrit.wikimedia.org/r/579328 (owner: 10Elukey) [17:05:28] (03PS1) 10Herron: icinga: switch check_ping packet count to 1 [puppet] - 10https://gerrit.wikimedia.org/r/579329 (https://phabricator.wikimedia.org/T247538) [17:05:57] 10Operations, 10fundraising-tech-ops, 10observability, 10Patch-For-Review: Icinga latency is skyrocketing and commands ignored - https://phabricator.wikimedia.org/T247538 (10herron) https://gerrit.wikimedia.org/r/579329 seems like low hanging fruit that could help reduce load [17:06:55] !log volans@cumin1001 START - Cookbook sre.dns.netbox [17:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:46] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) Strategy for analyzing backups changed a bit, I now export the PK used on ranges to verify it is the same as on the live db... 
[17:11:50] (03CR) 10Herron: [C: 03+2] elasticsearch: add 'disktype' param to configure node.attr.disktype [puppet] - 10https://gerrit.wikimedia.org/r/578653 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [17:11:53] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid1001 and druid1007 - https://phabricator.wikimedia.org/T245569 (10RobH) [17:12:15] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid1001 and druid1007 - https://phabricator.wikimedia.org/T245569 (10RobH) [17:12:35] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10RobH) [17:12:44] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/compiler1002/351/boron.eqiad.wmnet/ looks better on boiron.eqiad.wmnet \o/" [puppet] - 10https://gerrit.wikimedia.org/r/579231 (https://phabricator.wikimedia.org/T247496) (owner: 10Hashar) [17:12:46] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:55] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Logstash: add SSD tier to ELK7 cluster - https://phabricator.wikimedia.org/T247376 (10herron) [17:16:36] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ganeti2021.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-re... 
[17:17:36] !log execute modprinc -maxlife 2d krbtgt/WIKIMEDIA via kadmin.local on krb1001 (will be propagated to 2001 automatically) [17:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:26] (03PS1) 10Volans: sre.dns.netbox: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/579336 (https://phabricator.wikimedia.org/T233183) [17:25:03] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) [17:26:26] I'm deploying this UBN right now https://phabricator.wikimedia.org/T247466 [17:26:30] brennen: ^ [17:26:46] (03PS1) 10Herron: ELK7: require disktype "hdd" for new indices [puppet] - 10https://gerrit.wikimedia.org/r/579338 (https://phabricator.wikimedia.org/T247376) [17:26:56] Thanks, Amir1. [17:27:49] Amir1: thank you, much appreciated. [17:28:09] (03CR) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on four wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [17:28:17] I hope it fixes the issue [17:28:21] not super though [17:31:02] (03CR) 10Hashar: [V: 03+1 C: 03+1] "I gave it a try locally and that is sufficient to build pycrypto (a dependency of paramiko which is the direct dependency)." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/579290 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [17:31:13] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [17:31:53] (03CR) 10Jforrester: "This won't work with an entry in wgBetaFeaturesWhitelist." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [17:32:28] (03PS1) 10Hnowlan: changeprop: Remove HTTP liveness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/579339 (https://phabricator.wikimedia.org/T213193) [17:32:38] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [17:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:17] (03PS2) 10Hnowlan: changeprop: Remove HTTP liveness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/579339 (https://phabricator.wikimedia.org/T213193) [17:33:52] (03PS1) 10Herron: logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) [17:34:02] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) [17:35:01] (03CR) 10Herron: [C: 04-2] "not to be merged until all existing/new indices have "index.routing.allocation.require.disktype" : "hdd"" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [17:35:04] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [17:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:58] (03CR) 10C. 
Scott Ananian: "> > Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/579233 (https://phabricator.wikimedia.org/T246854) (owner: 10Hashar) [17:36:05] (03CR) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on four wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [17:37:20] (03CR) 10CRusnov: [C: 03+1] "looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/579336 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [17:37:22] (03PS3) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) [17:37:55] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:38:47] (03CR) 10Jforrester: "It's normal for the beta feature to go live *as a beta feature* on the Beta Cluster first for final validation, BTW. This patch precludes " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [17:38:53] (03PS1) 10Elukey: profile::kerberos: set ticket_lifetime to krb.conf [puppet] - 10https://gerrit.wikimedia.org/r/579342 [17:39:10] (03PS1) 10Alexandros Kosiaris: envoy: use non TLS enabled eventgate-analytics [puppet] - 10https://gerrit.wikimedia.org/r/579343 (https://phabricator.wikimedia.org/T247484) [17:39:19] (03CR) 10Jforrester: Enable DiscussionTools as a beta feature on four wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [17:39:59] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:40:06] (03PS4) 10Krinkle: multiversion: Simplify path inclusions in buildDBLists.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579103 [17:40:08] (03PS4) 10Krinkle: multiversion: Structure buildDBLists.php to prevent labs info in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579104 [17:40:10] (03PS4) 10Krinkle: multiversion: Let buildDBLists.php expand expression dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579105 (https://phabricator.wikimedia.org/T169821) [17:40:54] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2021.codfw.wmnet'] ` and were **ALL** successful. [17:43:22] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/579336 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [17:45:10] (03Merged) 10jenkins-bot: sre.dns.netbox: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/579336 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [17:45:35] (03CR) 10Hnowlan: [C: 03+2] changeprop: Remove HTTP liveness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/579339 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [17:45:43] (03PS1) 10Giuseppe Lavagetto: services_proxy: use http endpoint for eventgate-analytics [puppet] - 10https://gerrit.wikimedia.org/r/579345 (https://phabricator.wikimedia.org/T247484) [17:46:07] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To 
https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [17:46:13] (03Merged) 10jenkins-bot: changeprop: Remove HTTP liveness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/579339 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [17:48:18] (03PS2) 10Alexandros Kosiaris: envoy: use non TLS enabled eventgate-analytics [puppet] - 10https://gerrit.wikimedia.org/r/579343 (https://phabricator.wikimedia.org/T247484) [17:48:41] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [17:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:48] !log increase via 'kadmin.local modprinc -maxlife 2d $user' all max ticket lifetimes of Kerberos User principals on the krb1001's KDC (changes will be propagated to codfw automatically) [17:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:06] (03CR) 10Bartosz Dziewoński: "> It's normal for the beta feature to go live *as a beta feature* on the Beta Cluster first for final validation, BTW. 
This patch preclude" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [17:49:13] (03CR) 10Elukey: [C: 03+2] profile::kerberos: set ticket_lifetime to krb.conf [puppet] - 10https://gerrit.wikimedia.org/r/579342 (owner: 10Elukey) [17:50:19] (03PS1) 10Herron: logstash: curator: set ignore_empty_list true for replica job [puppet] - 10https://gerrit.wikimedia.org/r/579346 (https://phabricator.wikimedia.org/T247014) [17:52:04] (03PS4) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) [17:52:06] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools as a beta feature on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579347 (https://phabricator.wikimedia.org/T245794) [17:52:23] (03CR) 10Bartosz Dziewoński: "https://gerrit.wikimedia.org/r/579347" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [17:53:08] James_F: remind me what is the process for config changes that only affect the Beta Cluster. can you just merge https://gerrit.wikimedia.org/r/579347 ? [17:53:32] (03PS2) 10Giuseppe Lavagetto: services_proxy: use http endpoint for eventgate-analytics [puppet] - 10https://gerrit.wikimedia.org/r/579345 (https://phabricator.wikimedia.org/T247484) [17:53:44] MatmaRex: They're a deploy, just a quiet one. [17:53:54] I'll sling it out for you. 
[17:54:11] (03PS3) 10Alexandros Kosiaris: envoy: use non TLS enabled eventgate-analytics [puppet] - 10https://gerrit.wikimedia.org/r/579343 (https://phabricator.wikimedia.org/T247484) [17:54:25] (03CR) 10Jforrester: [C: 03+2] Enable DiscussionTools as a beta feature on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579347 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [17:54:27] (03PS5) 10Krinkle: multiversion: Let buildDBLists.php expand expression dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579105 (https://phabricator.wikimedia.org/T169821) [17:54:30] James_F: bonus optimisation thanks to you ^ [17:54:42] James_F: The good news is that I found where's the problem [17:54:55] the bad news is that I don't know how to fix it properly(TM) [17:55:05] James_F: thanks [17:55:23] Amir1: Fun. [17:55:28] :D [17:55:33] (03Merged) 10jenkins-bot: Enable DiscussionTools as a beta feature on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579347 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [17:55:37] (hold on, is james having three conversations in this channel simultaneously) [17:55:39] (03PS1) 10Gergő Tisza: Switch kowiki to use ORES for suggested edits topics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579348 [17:55:44] MatmaRex: Now four. [17:55:44] MatmaRex: typically. [17:56:03] :D [17:56:04] And several more in other channels/media. 
[17:56:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] envoy: use non TLS enabled eventgate-analytics [puppet] - 10https://gerrit.wikimedia.org/r/579343 (https://phabricator.wikimedia.org/T247484) (owner: 10Alexandros Kosiaris) [17:56:57] MatmaRex: Done; will be live on Beta Cluster Soon™ [17:57:17] (03Abandoned) 10Alexandros Kosiaris: services_proxy: use http endpoint for eventgate-analytics [puppet] - 10https://gerrit.wikimedia.org/r/579345 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [17:57:24] Krinkle: Does it have a measurable impact? [17:58:00] James_F: thanks. the actual deployment happens automatically, right? [17:58:19] MatmaRex: Yes. [17:58:23] James_F: not having to expand group1 definitely does have big impact. the fewer branches in readDblistFile, maybe a little. It's called about 20-30 times so probably not a whole lot, but it is one extra fewer if conditions :) [17:58:24] Amir1: i take it https://gerrit.wikimedia.org/r/579302 won't stem the logspam on that one? [17:59:09] James_F: I'm mainly happy to see the code paths not overlap between dblist and dbexpr, it's defense in depth. [17:59:16] Krinkle: Always good to simplify, yes. [17:59:37] anytime you read a dblist it literally cannot use expressions which is much easier to reason about [18:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200312T1800) [18:00:04] ebernhardson: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:02:13] (03PS1) 10Alexandros Kosiaris: Revert "envoy: use non TLS enabled eventgate-analytics" [puppet] - 10https://gerrit.wikimedia.org/r/579351 [18:02:15] brennen: it still will, It needs to be fixed another way [18:02:18] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Revert "envoy: use non TLS enabled eventgate-analytics" [puppet] - 10https://gerrit.wikimedia.org/r/579351 (owner: 10Alexandros Kosiaris) [18:02:40] (sorry for late response, I was washing my hands) [18:03:01] it takes really long time these days [18:03:13] I have something for SWAT [18:03:21] once you all are done [18:04:20] i can deploy my own i suppose [18:04:23] its a simple config change [18:05:24] (03CR) 10Dzahn: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/579106 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [18:06:48] (03CR) 10Herron: [C: 03+1] Remove obsolete kafka[12]00[1-3] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/579287 (owner: 10Muehlenhoff) [18:08:11] (well, its a simple config change but sadly it lives in WikimediaEvents...so it will be 30 minutes :) [18:08:16] (03PS1) 10Hnowlan: changeprop: use tcpsocket for readiness and liveness [deployment-charts] - 10https://gerrit.wikimedia.org/r/579352 (https://phabricator.wikimedia.org/T213193) [18:10:15] any advice on whether T247546 merits much concern would be appreciated. 
[18:10:15] T247546: DataUpdateAdapter::doUpdate: Cannot execute query - https://phabricator.wikimedia.org/T247546 [18:10:24] (03PS2) 10Dzahn: doc: add envoy for TLS termination on doc1001 [puppet] - 10https://gerrit.wikimedia.org/r/572378 (https://phabricator.wikimedia.org/T210411) [18:10:50] (03Restored) 10Giuseppe Lavagetto: services_proxy: use http endpoint for eventgate-analytics [puppet] - 10https://gerrit.wikimedia.org/r/579345 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [18:12:04] (03CR) 10Dzahn: [C: 03+2] doc: add envoy for TLS termination on doc1001 [puppet] - 10https://gerrit.wikimedia.org/r/572378 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [18:12:24] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.04296 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:12:39] ebernhardson: can we do the mediawiki-config changes while you are waiting to merge? [18:13:07] (03PS3) 10Giuseppe Lavagetto: services_proxy: use http endpoint for eventgate-analytics [puppet] - 10https://gerrit.wikimedia.org/r/579345 (https://phabricator.wikimedia.org/T247484) [18:13:17] change, anyway, not sure what Amir1's patch is [18:13:24] tgr: yea [18:13:54] brennen: that one is tracked. Let me find it [18:14:18] Amir1: cool, thanks. 
[18:14:57] (03CR) 10Gergő Tisza: [C: 03+2] Switch kowiki to use ORES for suggested edits topics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579348 (owner: 10Gergő Tisza) [18:15:04] (03CR) 10Holger Knust: [C: 03+2] changeprop: use tcpsocket for readiness and liveness [deployment-charts] - 10https://gerrit.wikimedia.org/r/579352 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [18:15:08] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) And this is an example of running on the above dump (although I had to remove certain tables that of course, are not append... [18:15:49] (03Merged) 10jenkins-bot: changeprop: use tcpsocket for readiness and liveness [deployment-charts] - 10https://gerrit.wikimedia.org/r/579352 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [18:17:47] (03PS2) 10Gergő Tisza: Switch kowiki to use ORES for suggested edits topics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579348 [18:18:02] (03CR) 10Gergő Tisza: [C: 03+2] Switch kowiki to use ORES for suggested edits topics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579348 (owner: 10Gergő Tisza) [18:18:20] brennen: https://phabricator.wikimedia.org/T244115 [18:19:04] and https://phabricator.wikimedia.org/T246898 [18:19:12] * Amir1 tgr: Mine is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/579326 [18:19:17] (03Merged) 10jenkins-bot: Switch kowiki to use ORES for suggested edits topics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579348 (owner: 10Gergő Tisza) [18:19:57] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [18:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:00] Amir1: ta [18:21:05] (03CR) 10Jforrester: [C: 03+1] "I'm approving the new Beta Feature. 
Good to roll out when the team want it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [18:21:45] Amir1: re: T229731, I believe Releng doesn't want us to do a submodule bump on ext sec patches. Or least that was my most recent understanding. [18:22:25] Amir1: want me to deploy that as well? [18:22:41] brennen: I wish I could fix it but it's mostly because we are at middle of migration [18:22:46] sbassett: We don't? [18:22:47] tgr: yes please, thanks [18:23:14] sbassett: that's weird, several other extensions have it except this one causing lots of confusion [18:23:19] (03PS2) 10Gergő Tisza: Set term store to WRITE_BOTH for all of Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579326 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [18:23:20] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:579348|Switch kowiki to use ORES for suggested edits topics]] (duration: 01m 08s) [18:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:48] (03CR) 10Gergő Tisza: [C: 03+2] Set term store to WRITE_BOTH for all of Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579326 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [18:23:52] James_F Amir1: let me search the Phabs to confirm or disprove my memories [18:24:08] Sure. :-) [18:24:43] (03Merged) 10jenkins-bot: Set term store to WRITE_BOTH for all of Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579326 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [18:24:51] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ganeti2022.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-re... 
[18:26:45] Amir1: it's on mwdebug1002 [18:26:59] tgr: it's untestable [18:27:39] (03PS1) 10Dzahn: add certificate for envoy TLS termination on doc1001 [puppet] - 10https://gerrit.wikimedia.org/r/579360 (https://phabricator.wikimedia.org/T210411) [18:27:49] James_F Amir1: So this was the most recent conversation I remember, particularly the last comment from thcipriani: https://phabricator.wikimedia.org/T229285. There was debate on this throughout the task, I believe, hence my still-confused comment at the end of the task. [18:28:34] (03CR) 10Dzahn: [C: 03+2] add certificate for envoy TLS termination on doc1001 [puppet] - 10https://gerrit.wikimedia.org/r/579360 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [18:29:11] sbassett: Hmm, yeah. [18:29:17] 10Operations, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Performance-Team (Radar), 10Wikimedia-Incident: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10CDanis) I just want to point out something... [18:29:23] sbassett: Amir1 the current state of the repo is normal. I explain in detail on that task, but the tl;dr is that committing those submodule changes makes it harder to pull in backports. [18:29:43] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:579326|Set term store to WRITE_BOTH for all of Wikidata (T219123)]] (duration: 01m 07s) [18:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:49] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [18:30:07] (03PS1) 10Jcrespo: mariadb-backups: Update rowinfo format to include index name [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/579365 (https://phabricator.wikimedia.org/T244884) [18:30:25] thcipriani: Ok, thanks for the clarification. [18:30:31] hmm, can this be documented? 
[18:30:39] * thcipriani replies to email as well [18:30:42] It would be great [18:30:58] Amir1: yes, I agree and can do that :) [18:31:28] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: re-sync InitialiseSettings.php (duration: 01m 08s) [18:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:41] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Update rowinfo format to include index name [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/579365 (https://phabricator.wikimedia.org/T244884) (owner: 10Jcrespo) [18:31:55] Amir1: ebernhardson we are done [18:32:01] Thanks! [18:32:06] great, going next [18:32:34] Amir1: It's not explicitly listed as a step under the official doc, but maybe we should add a note about why we don't do this? https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Security_patches [18:33:28] (03PS2) 10Dzahn: add certificate for envoy TLS termination on doc1001 [puppet] - 10https://gerrit.wikimedia.org/r/579360 (https://phabricator.wikimedia.org/T210411) [18:33:33] sbassett: I think it should be documented in https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers so I, a SWAT deployer, don't get confused when deploying a backport [18:34:18] (03Abandoned) 10Gergő Tisza: Switch kowiki and viwiki to use ORES for suggested edits topics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576892 (https://phabricator.wikimedia.org/T246171) (owner: 10Kosta Harlan) [18:34:21] !log ebernhardson@deploy1001 Synchronized php-1.35.0-wmf.23/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: cirrus: Start Glent m0 AB test (duration: 01m 07s) [18:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:19] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet 
for hosts: ` ganeti2023.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-re... [18:36:00] all done [18:37:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:37:26] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.001245 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:39:48] PROBLEM - Check whether ferm is active by checking the default input chain on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:39:52] PROBLEM - dhclient process on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:40:00] PROBLEM - DPKG on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:40:08] PROBLEM - MD RAID on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:40:47] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [18:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:22] PROBLEM - Check systemd state on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:41:22] PROBLEM - configured eth on htmldumper1001 
is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:41:46] PROBLEM - Disk space on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=htmldumper1001&var-datasource=eqiad+prometheus/ops [18:41:52] PROBLEM - Check size of conntrack table on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:43:14] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:32] (03CR) 10Jforrester: [C: 03+1] "Sure." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579103 (owner: 10Krinkle) [18:44:36] (03CR) 10Jforrester: [C: 03+1] multiversion: Structure buildDBLists.php to prevent labs info in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579104 (owner: 10Krinkle) [18:45:06] (03CR) 10Jforrester: [C: 03+1] multiversion: Let buildDBLists.php expand expression dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579105 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [18:45:22] PROBLEM - puppet last run on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:47:58] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2022.codfw.wmnet'] ` and were **ALL** successful. [18:48:11] Amir1: note added: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#git. 
Hopefully that works. [18:48:46] PROBLEM - Check the NTP synchronisation status of timesyncd on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [18:50:18] sbassett: thanks [18:51:07] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [18:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:22] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [18:52:34] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [18:53:04] PROBLEM - IPMI Sensor Status on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:53:34] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:19] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2023.codfw.wmnet'] ` and were **ALL** successful. [18:59:48] (03PS1) 10Dzahn: squid: remove obsolete hierarchy_stoplist config directive [puppet] - 10https://gerrit.wikimedia.org/r/579373 (https://phabricator.wikimedia.org/T224576) [19:00:04] brennen and hashar: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - American+European Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200312T1900). 
[19:02:04] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` ganeti2024.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-re... [19:02:21] (03PS2) 10Dzahn: squid: remove obsolete hierarchy_stoplist config directive [puppet] - 10https://gerrit.wikimedia.org/r/579373 (https://phabricator.wikimedia.org/T224576) [19:03:40] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/579106 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [19:09:03] wdoran: not sure if y'all were planning another log triage session, but i'm about to roll the train to all wikis. [19:09:25] log eyeballs always welcome. [19:09:31] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:09:34] (03CR) 10Dzahn: "thanks, it did not occur to me to search this separate repo outside puppet." [homer/public] - 10https://gerrit.wikimedia.org/r/579131 (owner: 10Elukey) [19:13:54] before i roll train forward, anyone know what was up with that spike of fatals around 19:00 UTC? [19:16:00] No. [19:16:27] righto. [19:17:00] well. things seem... fine now, i suppose. 
[19:17:29] it started earlier than 1900 [19:17:36] it started 1 minute after <+logmsgbot> !log ebernhardson@deploy1001 Synchronized php-1.35.0-wmf.23/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: cirrus: [19:17:48] and then took until about 1900 to notice it was over [19:18:02] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:15] brennen: New blocker just added. [19:18:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:18:54] hrmm [19:19:18] well something seems to be up. [19:20:03] third spike after 2 others which were both 12 minutes long [19:20:07] started after deploy [19:20:11] https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&from=now-6h&to=now [19:20:11] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10Bstorm) I'm going through the process on virtual machines before I proceed on labstore1007 so at least I'm no... [19:20:24] see the blue dotted line ..matches exactly [19:20:25] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:44] were there 3 separate scap syncs? [19:24:43] Wikibase\Repo\Content\DataUpdateAdapter::doUpdate: Commit failed on server(s) 10.64.48.172 ? 
[19:25:10] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti2024.codfw.wmnet'] ` and were **ALL** successful. [19:26:24] hrm, deploy at 18:34 looks pretty far away from that code [19:26:32] mutante: i think that one's tracked at T246898 [19:26:33] T246898: Wikibase\Repo\Content\DataUpdateAdapter::doUpdate: Commit failed on server(s) 10.64.48.172: Cannot execute query from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate while transaction status is ERROR - https://phabricator.wikimedia.org/T246898 [19:27:14] not sure why the spike of it, though. [19:28:28] brennen: Nikki and Holger were just pairing with Timo on this, let me ping them here [19:28:46] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:29:10] ok.. so that recovered [19:29:14] but that was 3 times in a row [19:29:40] yeah, if the pattern holds we should expect another one in... ~15 min? [19:29:45] still looks bad on 24h view [19:30:33] lots of errors from wikidata it seems [19:30:47] pile of database errors in logstash, all from wmf.23 AFAICS [19:30:49] Joy. [19:30:54] is it time to roll this back to group0? [19:30:58] since the 18:28 deploy [19:31:14] Amir1 / addshore / tarrow might have a view. [19:31:26] yes, seeing wikidata as well [19:32:26] I think I know what's going on [19:32:28] let me check [19:32:33] hrm, yeah, this did start at like 18:3x [19:32:39] what is that pattern though.. 11 minutes ..then recover then 11 minutes again [19:32:50] mutante: Job runner stuff? [19:33:04] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/579326 [19:33:10] hmm. makes sense i guess. 
and thanks Amir [19:33:27] So it's a load issue? [19:33:29] James_F: no it's not job runner, it's people bringing down wikidata [19:33:36] fun stuff, let me show it to you [19:33:51] James_F: it's a load issue because we are in the middle of a migration [19:33:57] so everything is write both [19:34:02] Yeah, the endless wb_terms migration. [19:34:17] We should just make the damn DB read-only for a few hours, migrate, and be done. [19:34:29] * addshore reads up [19:34:36] The world will not end if Wikidata is read-only for a bit. [19:34:40] https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&from=now-12h&to=now&fullscreen&panelId=10 [19:34:42] Nothing wbterms related should stop a rollback [19:35:09] * addshore reads up more [19:35:13] addshore: More the other way around; if we rollback the wb_terms stuff can we continue with the train? [19:35:17] * addshore just landed in an airport [19:35:21] T247553 [19:35:24] Sorry, addshore. [19:35:26] the period is because the maxlag reaches 5, all bots stop, the maxlag recovers, and again and again [19:35:45] Can we artificially inflate maxlag to slow them down more aggressively? [19:35:52] addshore: don't worry, I got this, it's setting all to write_both that caused it [19:36:02] Gotcha [19:36:33] James_F: the good news is that we are almost done [19:36:53] Next week the DB will be entirely down? ;-) [19:36:57] (03PS1) 10Ladsgroup: Revert "Set term store to WRITE_BOTH for all of Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579378 [19:37:50] James_F: haha, I hope not but seriously we will soon stop LOTS of writes [19:37:57] That'd be great. [19:38:17] Though I guess it'll mean that maxlag will drop, so edit rate will shoot up? [19:38:28] * addshore continues through security. (Ping me if needed) [19:39:24] Amir1: so it's https://phabricator.wikimedia.org/T246898 ? 
[19:39:35] James_F: it just changes the pattern of edits [19:39:48] that's what i saw in exception.log so confirmed [19:40:11] mutante: yup, quick question. Is there logstash I can see? [19:40:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:41:00] James_F: This would probably help: https://phabricator.wikimedia.org/T247459 [19:41:20] Amir1: you can ssh to mwlog1001.eqiad.wmnet and look at /srv/mw-log/exception.log [19:41:24] i see your user on it [19:41:24] Amir1: Oh, yes. [19:41:28] but I'm not sure when I can get to properly implement it (basically a site-wide rate limit) [19:41:58] (03CR) 10Ladsgroup: [C: 03+2] Revert "Set term store to WRITE_BOTH for all of Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579378 (owner: 10Ladsgroup) [19:42:09] I just deploy this to reduce the load [19:42:38] * James_F nods. [19:42:44] Amir1: logstash requires one of nda/ops/wmf so that should also work because you are in nda [19:42:57] (03Merged) 10jenkins-bot: Revert "Set term store to WRITE_BOTH for all of Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579378 (owner: 10Ladsgroup) [19:43:17] mutante: yup, I've been there for four years now [19:44:50] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100% [19:44:56] Amir1: so you wanted to see that specific error still? mwlog1001 is ok> [19:44:58] Amir1: re: logstash, mediawiki-new-errors dashboard is good [19:44:59] ? 
[19:45:08] PROBLEM - Host kafka-jumbo1006 is DOWN: PING CRITICAL - Packet loss = 100% [19:45:11] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert "Set term store to WRITE_BOTH for all of Wikidata" (duration: 01m 08s) [19:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:16] https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops [19:45:26] This seems good [19:46:37] yea, that's the one we had earlier [19:46:39] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert "Set term store to WRITE_BOTH for all of Wikidata", take II (duration: 01m 06s) [19:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:10] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/579269 (https://phabricator.wikimedia.org/T247509) (owner: 10Jbond) [19:50:52] * brennen twiddles thumbs, watches error graphs. [19:51:55] the 19:45 deploy made it go down [19:52:05] looks good so far to me [19:52:38] though still slightly elevated from baseline [19:52:52] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/579346 (https://phabricator.wikimedia.org/T247014) (owner: 10Herron) [19:52:56] yeah, let's give it 'til the hour? 
[19:53:06] sounds good [19:57:21] Amir1: we could use lock manager, to stop the post edit data updates stepping on each other [19:57:49] addshore: yup, I had this idea a while back [19:59:08] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 87 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [20:00:46] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 63 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [20:04:09] error rate seems... pretty well back to normal if a touch on the higher side than baseline? [20:04:26] Amir1, James_F, mutante: comfortable with .23 moving forward at this point? [20:04:59] brennen: Yes. [20:05:57] yes [20:06:08] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 84 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [20:06:29] thanks, rolling. [20:06:57] (03PS1) 10Brennen Bearnes: all wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579385 [20:06:59] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579385 (owner: 10Brennen Bearnes) [20:07:10] (03CR) 10Cwhite: "I looks like the parameter is not used anywhere. Is this CS to just set up the plumbing first before adding it to a template?" 
[puppet] - 10https://gerrit.wikimedia.org/r/579268 (https://phabricator.wikimedia.org/T247509) (owner: 10Jbond) [20:07:57] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579385 (owner: 10Brennen Bearnes) [20:09:01] uhhh kafka jumbo1006 eh? [20:09:01] looking [20:10:32] !log revoking puppet cert for doc.discovery.wmnet, re-creating with doc.wikimedia.org as SAN [20:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:03] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.23 [20:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:47] ...aaaaand probably reverting. [20:12:11] a lot of Memcached::setMulti() errors [20:12:14] (03PS3) 10Jbond: puppet agent: add parameter to change certificate_revocation behaviour [puppet] - 10https://gerrit.wikimedia.org/r/579268 (https://phabricator.wikimedia.org/T247509) [20:12:16] Eurgh, that spike. [20:12:49] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/579268 (https://phabricator.wikimedia.org/T247509) (owner: 10Jbond) [20:13:35] rolling back. [20:15:00] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert "all wikis to 1.35.0-wmf.23" [20:15:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:24] Amir1: I think we will need it :) [20:16:52] (03PS1) 10Dzahn: ssl: update cert for doc.discovery.wmnet to include doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/579390 (https://phabricator.wikimedia.org/T210411) [20:17:56] (03PS1) 10Brennen Bearnes: Revert "all wikis to 1.35.0-wmf.23" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579391 [20:17:58] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "all wikis to 1.35.0-wmf.23" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579391 (owner: 10Brennen Bearnes) [20:19:34] (03Merged) 10jenkins-bot: Revert "all wikis to 1.35.0-wmf.23" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579391 (owner: 10Brennen Bearnes) [20:19:44] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 95 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [20:20:47] addshore: I'll create a ticket [20:21:55] What's the error? 
[20:22:32] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 79 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [20:23:01] FYI, these under-replicated errors are because kafka-jumbo1006 is offline [20:23:40] 10Operations, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10Ottomata) [20:23:54] 10Operations, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10Ottomata) p:05Triage→03High [20:27:24] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 6.846e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:27:41] 10Operations, 10DC-Ops, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10Ottomata) [20:27:56] 10Operations, 10DC-Ops, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10Ottomata) Possibly related to / caused by {T244506} ? 
[20:31:17] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/579268 (https://phabricator.wikimedia.org/T247509) (owner: 10Jbond) [20:33:44] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash7-codfw,logstash7-eqiad} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={udp_localhost-err,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from [20:33:44] gId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [20:36:33] (03CR) 10Dzahn: Add mgmt/production dns for kafka-jumbo100[789] (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/579295 (https://phabricator.wikimedia.org/T244506) (owner: 10Cmjohnson) [20:41:47] (03PS1) 10Cmjohnson: Fix typo for kafka-jumbo1008 [dns] - 10https://gerrit.wikimedia.org/r/579394 (https://phabricator.wikimedia.org/T244506) [20:42:25] (03CR) 10Cmjohnson: [C: 03+2] Fix typo for kafka-jumbo1008 [dns] - 10https://gerrit.wikimedia.org/r/579394 (https://phabricator.wikimedia.org/T244506) (owner: 10Cmjohnson) [20:49:01] !log kafka-jumbo1006 - stopping kafka and powercycling - T247561 [20:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:07] T247561: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 [20:49:39] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10Papaul) [20:49:55] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10Papaul) ` [edit interfaces interface-range ganeti] member ge-5/0/25 { ... } + member ge-1/0/0; [edit interfaces interface-range disabled] - member ge-1/0/0; [edit i... 
[20:50:02] 10Operations, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Performance-Team (Radar), 10Wikimedia-Incident: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10Krinkle) To answer that we would need to kn... [20:50:58] 10Operations, 10ops-codfw, 10DC-Ops: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10Papaul) 05Open→03Resolved @akosiaris This is done. [20:53:15] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Papaul) [21:00:30] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 225 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:01:39] 10Operations, 10Analytics, 10DC-Ops, 10netops: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561 (10Ottomata) [21:01:49] (03CR) 10Herron: [C: 03+2] logstash: curator: set ignore_empty_list true for replica job [puppet] - 10https://gerrit.wikimedia.org/r/579346 (https://phabricator.wikimedia.org/T247014) (owner: 10Herron) [21:02:00] (03PS2) 10Herron: logstash: curator: set ignore_empty_list true for replica job [puppet] - 10https://gerrit.wikimedia.org/r/579346 (https://phabricator.wikimedia.org/T247014) [21:02:22] (03PS2) 10Dzahn: ssl: update cert for doc.discovery.wmnet to include doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/579390 (https://phabricator.wikimedia.org/T210411) [21:04:19] (03PS1) 10Mstyles: kibana: refactor kibana role to kibana profile [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) [21:06:09] !log remove one file for legal compliance [21:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[21:06:43] (03CR) 10Krinkle: [C: 03+2] multiversion: Simplify path inclusions in buildDBLists.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579103 (owner: 10Krinkle) [21:07:00] (03CR) 10jerkins-bot: [V: 04-1] kibana: refactor kibana role to kibana profile [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [21:07:34] (03Merged) 10jenkins-bot: multiversion: Simplify path inclusions in buildDBLists.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579103 (owner: 10Krinkle) [21:08:45] (03PS5) 10Krinkle: multiversion: Structure buildDBLists.php to prevent labs info in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579104 [21:08:48] (03CR) 10Krinkle: [C: 03+2] multiversion: Structure buildDBLists.php to prevent labs info in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579104 (owner: 10Krinkle) [21:08:52] (03PS6) 10Krinkle: multiversion: Let buildDBLists.php expand expression dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579105 (https://phabricator.wikimedia.org/T169821) [21:09:45] (03Merged) 10jenkins-bot: multiversion: Structure buildDBLists.php to prevent labs info in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579104 (owner: 10Krinkle) [21:12:25] (03Abandoned) 10Mstyles: kibana: add kibana to relforge [puppet] - 10https://gerrit.wikimedia.org/r/577031 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [21:13:20] (03CR) 10Mstyles: "plz" [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [21:14:01] (03PS2) 10Mstyles: kibana: refactor kibana role to kibana profile [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) [21:16:42] (03CR) 10jerkins-bot: [V: 04-1] kibana: refactor kibana role to kibana profile [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) 
[21:17:58] (03CR) 10Dzahn: [C: 03+2] ssl: update cert for doc.discovery.wmnet to include doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/579390 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [21:19:55] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [21:19:55] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [21:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:58] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [21:25:18] (03PS3) 10Mstyles: kibana: refactor kibana role to kibana profile [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) [21:25:34] PROBLEM - Long running screen/tmux on htmldumper1001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.206: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [21:25:53] (03PS1) 10Dzahn: ATS: switch doc.wikimedia.org to https to backend [puppet] - 10https://gerrit.wikimedia.org/r/579407 (https://phabricator.wikimedia.org/T210411) [21:30:08] PROBLEM - Long running screen/tmux on netbox1001 is CRITICAL: CRIT: Long running tmux process. (user: crusnov PID: 24298, 1734411s 1728000s). 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [21:32:02] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Dwisehaupt) 05Open→03Resolved [21:32:24] ho ho [21:32:28] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [21:32:48] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [21:32:53] chaomodus: it worked:) (pinging people from icinga text) [21:33:28] yah [21:34:16] once added the username for that [21:47:28] !log doc1001 - had to manually run "/usr/local/sbin/build-envoy-config -c /etc/envoy/" to get envoy tls_terminator_443 listener into the config or envoy would not listen on 443 (T210411) [21:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:34] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [21:51:13] 10Operations, 10DC-Ops, 10decommission: decommission WMF6144 (old pay-lvs2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247571 (10Jgreen) [21:52:40] 10Operations, 10DC-Ops, 10decommission: decommission WMF6149 (old pay-lvs2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247572 (10Jgreen) [21:52:43] (03CR) 10Dzahn: "< mutante> !log doc1001 - had to manually run "/usr/local/sbin/build-envoy-config -c /etc/envoy/" to get envoy tls_terminator_443 listener" [puppet] - 10https://gerrit.wikimedia.org/r/572378 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [21:53:01] (03CR) 10Dzahn: [C: 03+1] "@doc1001:/etc/envoy/listeners.d# curl -H "Host: 
doc.wikimedia.org" https://doc.discovery.wmnet/dir.php" [puppet] - 10https://gerrit.wikimedia.org/r/579407 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [21:53:12] (03PS2) 10Dzahn: ATS: switch doc.wikimedia.org to https to backend [puppet] - 10https://gerrit.wikimedia.org/r/579407 (https://phabricator.wikimedia.org/T210411) [21:54:09] (03CR) 10Dzahn: [C: 03+2] ATS: switch doc.wikimedia.org to https to backend [puppet] - 10https://gerrit.wikimedia.org/r/579407 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [21:58:06] (03PS1) 10Ottomata: Repackage eventgate and eventstreams charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/579419 [21:59:54] (03CR) 10Ottomata: [C: 03+2] Repackage eventgate and eventstreams charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/579419 (owner: 10Ottomata) [22:02:39] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [22:02:39] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [22:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:17] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) https://doc.wikimedia.org has been switched from http://doc1001.eqiad.wmnet to https://doc.discovery.wmnet [22:04:29] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [22:05:59] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [22:05:59] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[22:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:17] (03CR) 10Bstorm: [C: 03+2] dumps-distribution: move all traffic to labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/579095 (https://phabricator.wikimedia.org/T224583) (owner: 10Bstorm) [22:07:54] !log moving all nfs traffic off labstore1007 and to labstore1006 for upgrades T224583 [22:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:59] T224583: Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 [22:09:56] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [22:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:26] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [22:15:26] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[22:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:23] (03PS1) 10Herron: logstash: add curator job to require disktype hdd after 7 days [puppet] - 10https://gerrit.wikimedia.org/r/579422 (https://phabricator.wikimedia.org/T247376) [22:22:57] (03PS2) 10Herron: ELk7: add curator job to require disktype hdd after 7 days [puppet] - 10https://gerrit.wikimedia.org/r/579422 (https://phabricator.wikimedia.org/T247376) [22:28:54] !log mforns@deploy1001 Started deploy [analytics/refinery@906bd1e]: deploying refinery together with refinery-source v0.0.118 [22:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:41] (03CR) 10Krinkle: [C: 03+2] multiversion: Let buildDBLists.php expand expression dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579105 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:39:39] (03Merged) 10jenkins-bot: multiversion: Let buildDBLists.php expand expression dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579105 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:41:14] !log mforns@deploy1001 Finished deploy [analytics/refinery@906bd1e]: deploying refinery together with refinery-source v0.0.118 (duration: 12m 20s) [22:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:09] !log krinkle@deploy1001 Synchronized dblists/: I403a9890a9 (duration: 01m 09s) [22:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:32] !log krinkle@deploy1001 Synchronized multiversion/: I403a9890a9 (duration: 01m 07s) [22:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:17] 10Operations: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (10colewhite) It looks like most of the issues stems from CSP blocking mixed-content. 
idp.wikimedia.org is redirecting to http per [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/576921 | this changeset. ]] Can the idp red... [23:00:05] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200312T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:16:21] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Dzahn) a:03Dzahn [23:21:17] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw216[0-6].codfw.wmnet [23:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:44] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2178.codfw.wmnet [23:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:49] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw215[89].codfw.wmnet [23:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:09] (03PS4) 10Dzahn: decom 15 codfw appservers from rack C3 [puppet] - 10https://gerrit.wikimedia.org/r/579073 (https://phabricator.wikimedia.org/T247018) [23:40:49] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw215[89].codfw.wmnet [23:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:48] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw216[0-9].codfw.wmnet [23:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:09] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw217[1-2].codfw.wmnet [23:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:33] 10Operations, 10ops-codfw, 
10serviceops, 10Patch-For-Review: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Dzahn) mw2158 through mw2172 are permanently depooled (state=inactive) now. That's exactly 15 servers from the middle of C3.... [23:53:59] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: codfw: decom at least 15 appservers(mw2158 through mw2172) in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Papaul) [23:58:22] !log reload prometheus@ops on prometheus1004 [23:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log