[00:02:34] Operations, Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (Dzahn) Yep, that fixed it. logstash1029 now all green in Icinga. thanks
[00:04:14] (PS1) CDanis: cf: add support for description (& prefix-match) [cookbooks] - https://gerrit.wikimedia.org/r/578646
[00:04:16] (PS1) CDanis: cf: read api_token from a config file [cookbooks] - https://gerrit.wikimedia.org/r/578647
[00:05:58] (PS2) CDanis: cf: print out prefix descriptions (& prefix-match) [cookbooks] - https://gerrit.wikimedia.org/r/578646
[00:06:00] (PS2) CDanis: cf: read api_token from a config file [cookbooks] - https://gerrit.wikimedia.org/r/578647
[00:09:00] (CR) Bstorm: [C: +1] "It seems like a cool idea to me." (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: BryanDavis)
[00:10:13] (PS1) Krinkle: tests: Rename tests/noc/ files to match class name [mediawiki-config] - https://gerrit.wikimedia.org/r/578648
[00:10:15] (PS1) Krinkle: tests: Remove remaining inline require_once statements to bootstrap.php [mediawiki-config] - https://gerrit.wikimedia.org/r/578649
[00:10:17] (PS1) Krinkle: tests: Avoid duplicate data providers in WmfConfigServicesTest [mediawiki-config] - https://gerrit.wikimedia.org/r/578650
[00:10:19] (PS1) Krinkle: tests: Move TestServices.php fixture to data/ [mediawiki-config] - https://gerrit.wikimedia.org/r/578651
[00:10:21] (PS1) Krinkle: tests: Remove outdated FIXME comment from TestServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/578652
[00:10:33] (CR) Volans: [C: +1] "LGTM, two nits inline."
(2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/578646 (owner: CDanis)
[00:14:21] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw23[66-76].codfw.wmnet
[00:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:43] (PS3) CDanis: cf: print out prefix descriptions (& prefix-match) [cookbooks] - https://gerrit.wikimedia.org/r/578646
[00:14:45] (PS3) CDanis: cf: read api_token from a config file [cookbooks] - https://gerrit.wikimedia.org/r/578647
[00:15:17] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw236[68].codfw.wmnet
[00:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:34] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw237[0246].codfw.wmnet
[00:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:01] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw2375.codfw.wmnet
[00:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:23] Operations, Wikimedia-Logstash, observability: Logstash: add SSD tier to ELK7 cluster - https://phabricator.wikimedia.org/T247376 (herron) p:Triage→Medium
[00:16:30] (CR) CDanis: cf: print out prefix descriptions (& prefix-match) (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/578646 (owner: CDanis)
[00:16:49] (PS1) Herron: elasticsearch: add 'disktype' param to configure node.attr.disktype [puppet] - https://gerrit.wikimedia.org/r/578653 (https://phabricator.wikimedia.org/T247376)
[00:18:19] Operations, serviceops: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn) mw2350 through mw2376 are all pooled in production and set to "Active" in netbox now.
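The three pooling selectors logged between 00:14:21 and 00:15:34 differ in a way that is easy to miss. A minimal sketch follows, assuming the selector value is interpreted as a regular expression matched against the full host name (an assumption made for illustration; the log itself does not say how conftool evaluates it). Under that assumption, `[66-76]` is a character class rather than a numeric range, which may be why the narrower selectors followed. The host list is hypothetical and only covers mw2366 through mw2376.

```python
import re

# Selectors copied from the !log lines; treating them as regular
# expressions is an assumption made for this illustration.
SELECTORS = [
    "mw23[66-76].codfw.wmnet",
    "mw236[68].codfw.wmnet",
    "mw237[0246].codfw.wmnet",
]

# Hypothetical candidate hosts: mw2366 .. mw2376.
HOSTS = [f"mw{n}.codfw.wmnet" for n in range(2366, 2377)]


def matching_hosts(selector, hosts):
    """Return the hosts whose full name matches the selector regex."""
    pattern = re.compile(selector)
    return [h for h in hosts if pattern.fullmatch(h)]


for selector in SELECTORS:
    print(selector, "->", matching_hosts(selector, HOSTS))
```

In this sketch the first selector matches none of the listed hosts, because `[66-76]` stands for a single character that is either 6 or 7, while the second and third each match the explicitly enumerated hosts.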
[00:18:22] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/578647 (owner: CDanis)
[00:18:45] Operations, serviceops: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn)
[00:18:52] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/578646 (owner: CDanis)
[00:18:52] Operations, ops-codfw, Wikimedia-Logstash, Patch-For-Review: (Need by: TBD) rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (herron) Open→Resolved >>! In T240882#5937405, @Papaul wrote: > @herron is it possible to create another task to track th...
[00:24:58] Operations, Wikimedia-Logstash, observability, Patch-For-Review: Logstash: add SSD tier to ELK7 cluster - https://phabricator.wikimedia.org/T247376 (herron)
[00:26:47] (PS2) Krinkle: tests: Remove remaining inline require_once statements to bootstrap.php [mediawiki-config] - https://gerrit.wikimedia.org/r/578649
[00:26:49] (PS2) Krinkle: tests: Avoid duplicate data providers in WmfConfigServicesTest [mediawiki-config] - https://gerrit.wikimedia.org/r/578650
[00:26:56] (PS2) Krinkle: tests: Move TestServices.php fixture to data/ [mediawiki-config] - https://gerrit.wikimedia.org/r/578651
[00:26:58] (PS2) Krinkle: tests: Remove outdated FIXME comment from TestServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/578652
[00:27:00] (PS1) Krinkle: tests: Add structure test to disallow loading unused dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/578654 (https://phabricator.wikimedia.org/T169821)
[00:27:16] (CR) Krinkle: "Yay, five unused dblists:" [mediawiki-config] - https://gerrit.wikimedia.org/r/578654 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[00:28:01] Operations, Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet -
https://phabricator.wikimedia.org/T240881 (herron) Open→Resolved Thanks @Cmjohnson! Will resolve this and track service setup in T247376
[00:29:58] (CR) CDanis: [C: -2] "not to be merged until after a spicerack release with I98adc429f7." [cookbooks] - https://gerrit.wikimedia.org/r/578647 (owner: CDanis)
[00:30:02] (CR) jerkins-bot: [V: -1] tests: Add structure test to disallow loading unused dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/578654 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[00:30:24] (PS2) Krinkle: tests: Add structure test to disallow loading unused dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/578654 (https://phabricator.wikimedia.org/T169821)
[00:30:28] (CR) Krinkle: [C: +2] tests: Rename tests/noc/ files to match class name [mediawiki-config] - https://gerrit.wikimedia.org/r/578648 (owner: Krinkle)
[00:30:30] (PS1) Krinkle: MWConfigCacheGenerator: Stop loading five unused dblists on web reqs [mediawiki-config] - https://gerrit.wikimedia.org/r/578655 (https://phabricator.wikimedia.org/T169821)
[00:30:36] (CR) CDanis: [C: +2] cf: print out prefix descriptions (& prefix-match) [cookbooks] - https://gerrit.wikimedia.org/r/578646 (owner: CDanis)
[00:31:59] (Merged) jenkins-bot: tests: Rename tests/noc/ files to match class name [mediawiki-config] - https://gerrit.wikimedia.org/r/578648 (owner: Krinkle)
[00:32:14] (PS2) Krinkle: MWConfigCacheGenerator: Stop loading five unused dblists on web reqs [mediawiki-config] - https://gerrit.wikimedia.org/r/578655 (https://phabricator.wikimedia.org/T169821)
[00:32:16] (PS3) Krinkle: tests: Add structure test to disallow loading unused dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/578654 (https://phabricator.wikimedia.org/T169821)
[00:33:26] (CR) Jforrester: "Yay."
[mediawiki-config] - https://gerrit.wikimedia.org/r/578654 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[00:33:28] (PS3) Krinkle: tests: Move TestServices.php fixture to data/ [mediawiki-config] - https://gerrit.wikimedia.org/r/578651
[00:33:31] (CR) Krinkle: [C: +2] tests: Move TestServices.php fixture to data/ [mediawiki-config] - https://gerrit.wikimedia.org/r/578651 (owner: Krinkle)
[00:33:33] (Merged) jenkins-bot: cf: print out prefix descriptions (& prefix-match) [cookbooks] - https://gerrit.wikimedia.org/r/578646 (owner: CDanis)
[00:33:36] (PS3) Krinkle: tests: Remove outdated FIXME comment from TestServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/578652
[00:33:47] (CR) Krinkle: [C: +2] tests: Remove outdated FIXME comment from TestServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/578652 (owner: Krinkle)
[00:33:50] (CR) Jforrester: [C: +1] MWConfigCacheGenerator: Stop loading five unused dblists on web reqs [mediawiki-config] - https://gerrit.wikimedia.org/r/578655 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[00:34:40] (Merged) jenkins-bot: tests: Move TestServices.php fixture to data/ [mediawiki-config] - https://gerrit.wikimedia.org/r/578651 (owner: Krinkle)
[00:35:03] (Merged) jenkins-bot: tests: Remove outdated FIXME comment from TestServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/578652 (owner: Krinkle)
[00:35:44] James_F: thx, I'm leaving the other three for your review. These two seemed trivial enough though, I've rebased them out of the stack.
[00:36:12] * James_F nods.
[00:36:19] Review the require_once one now.
[00:37:27] (CR) Jforrester: [C: +1] "Lovely."
[mediawiki-config] - https://gerrit.wikimedia.org/r/578649 (owner: Krinkle)
[00:37:31] (PS1) CDanis: allow spicerack_config_dir to be set from config.yaml [software/spicerack] - https://gerrit.wikimedia.org/r/578656
[00:38:19] (CR) Krinkle: [C: +2] tests: Remove remaining inline require_once statements to bootstrap.php [mediawiki-config] - https://gerrit.wikimedia.org/r/578649 (owner: Krinkle)
[00:38:36] (CR) Jforrester: [C: +1] tests: Avoid duplicate data providers in WmfConfigServicesTest (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/578650 (owner: Krinkle)
[00:46:52] (PS5) Jforrester: wdqs-internal: switch to use envoy [mediawiki-config] - https://gerrit.wikimedia.org/r/578495 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[00:46:54] (PS5) Jforrester: Switch ores to use envoy [mediawiki-config] - https://gerrit.wikimedia.org/r/578496 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[00:46:56] (PS5) Jforrester: Switch restbase to use envoy [mediawiki-config] - https://gerrit.wikimedia.org/r/578497 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[00:46:58] (PS1) Jforrester: tests: Fix testVariantUrlsAreLocalhost logic reversal [mediawiki-config] - https://gerrit.wikimedia.org/r/578657
[00:53:06] (PS3) Krinkle: tests: Remove remaining inline require_once statements to bootstrap.php [mediawiki-config] - https://gerrit.wikimedia.org/r/578649
[00:53:08] (CR) Krinkle: tests: Remove remaining inline require_once statements to bootstrap.php [mediawiki-config] - https://gerrit.wikimedia.org/r/578649 (owner: Krinkle)
[00:53:10] (CR) Krinkle: [C: +2] tests: Remove remaining inline require_once statements to bootstrap.php [mediawiki-config] - https://gerrit.wikimedia.org/r/578649 (owner: Krinkle)
[00:53:16] (PS3) Krinkle: tests: Avoid duplicate data providers in WmfConfigServicesTest
[mediawiki-config] - https://gerrit.wikimedia.org/r/578650
[00:53:21] (CR) Krinkle: [C: +2] tests: Avoid duplicate data providers in WmfConfigServicesTest [mediawiki-config] - https://gerrit.wikimedia.org/r/578650 (owner: Krinkle)
[00:53:52] Krinkle: Could you sling out https://gerrit.wikimedia.org/r/578657 whilst you're at it?
[00:54:04] (Merged) jenkins-bot: tests: Remove remaining inline require_once statements to bootstrap.php [mediawiki-config] - https://gerrit.wikimedia.org/r/578649 (owner: Krinkle)
[00:54:22] (Merged) jenkins-bot: tests: Avoid duplicate data providers in WmfConfigServicesTest [mediawiki-config] - https://gerrit.wikimedia.org/r/578650 (owner: Krinkle)
[00:57:19] James_F: reviewing now
[00:57:24] the double negative makes me look a few times
[00:59:02] Yeah, I reversed the negation a few times in writing it and ended up screwing up.
[00:59:05] Oops.
[00:59:34] (CR) Krinkle: [C: +2] tests: Fix testVariantUrlsAreLocalhost logic reversal (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/578657 (owner: Jforrester)
[01:00:03] (PS2) Krinkle: tests: Fix testVariantUrlsAreLocalhost logic reversal [mediawiki-config] - https://gerrit.wikimedia.org/r/578657 (owner: Jforrester)
[01:00:09] (CR) Krinkle: [C: +2] tests: Fix testVariantUrlsAreLocalhost logic reversal [mediawiki-config] - https://gerrit.wikimedia.org/r/578657 (owner: Jforrester)
[01:00:21] (CR) Jforrester: tests: Fix testVariantUrlsAreLocalhost logic reversal (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/578657 (owner: Jforrester)
[01:01:11] (Merged) jenkins-bot: tests: Fix testVariantUrlsAreLocalhost logic reversal [mediawiki-config] - https://gerrit.wikimedia.org/r/578657 (owner: Jforrester)
[01:42:56] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities -
https://phabricator.wikimedia.org/T244843 (Krinkle)
[01:55:03] (PS1) CDanis: spicerack: add auth secret for cf cookbook [puppet] - https://gerrit.wikimedia.org/r/578665
[01:55:39] (CR) CDanis: "will make necessary changes in private and labs-private after a positive review" [puppet] - https://gerrit.wikimedia.org/r/578665 (owner: CDanis)
[02:11:42] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 128.5 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[02:12:36] PROBLEM - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[02:12:44] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 146.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[02:40:28] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[03:03:49] Zuul is struggling to cope with the recent stack of patches to https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/skins/Vector - 34 sets of related changes that are processed as dependencies.
Can someone take a look at https://integration.wikimedia.org/zuul/ / is there any way to abort the jobs?
[03:20:13] DannyS712: define struggling? The status page being busy is generally not an issue. It just looks messy there.
[03:20:52] to me, it looked like it was struggling, taking almost an hour for the commits, but it seems to have gone down
[03:24:14] DannyS712: Each patch individually depends on its parent, but each one is also processed separately.
[03:24:29] To avoid overload it is strictly limited how many are processed at once, so the delay is intentional.
[03:24:59] should be fine to just let be :) If nothing else happens it'll go through it eventually. If something else starts happening, it means there are resources free. It's not ideal, but no solution is :)
[03:25:13] okay, thanks for the explanation
[03:48:56] (PS1) KartikMistry: Add apertium-recursive package [debs/contenttranslation/apertium-recursive] - https://gerrit.wikimedia.org/r/578704 (https://phabricator.wikimedia.org/T234181)
[03:52:26] RECOVERY - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 70.77 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[03:53:16] RECOVERY - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[04:10:26] (PS1) KartikMistry: Add apertium-anaphora package [debs/contenttranslation/apertium-anaphora] - https://gerrit.wikimedia.org/r/578705 (https://phabricator.wikimedia.org/T234181)
[04:20:27] (PS2) KartikMistry: Add apertium-anaphora package
[debs/contenttranslation/apertium-anaphora] - https://gerrit.wikimedia.org/r/578705 (https://phabricator.wikimedia.org/T234181)
[04:21:29] (PS2) KartikMistry: Add apertium-recursive package [debs/contenttranslation/apertium-recursive] - https://gerrit.wikimedia.org/r/578704 (https://phabricator.wikimedia.org/T234181)
[04:23:45] (PS1) KartikMistry: apertium-en-ca: Update to new upstream 1.0.1 [debs/contenttranslation/apertium-en-ca] - https://gerrit.wikimedia.org/r/578707 (https://phabricator.wikimedia.org/T233700)
[04:31:56] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 64.54 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[05:21:16] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[05:22:11] (PS2) KartikMistry: apertium-en-ca: Update to new upstream 1.0.1 [debs/contenttranslation/apertium-en-ca] - https://gerrit.wikimedia.org/r/578707 (https://phabricator.wikimedia.org/T233700)
[05:22:20] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[05:39:22] (CR) jerkins-bot: [V: -1] apertium-en-ca: Update to new upstream 1.0.1 [debs/contenttranslation/apertium-en-ca] - https://gerrit.wikimedia.org/r/578707 (https://phabricator.wikimedia.org/T233700) (owner:
KartikMistry)
[06:16:26] (PS3) KartikMistry: apertium-en-ca: Update to new upstream 1.0.1 [debs/contenttranslation/apertium-en-ca] - https://gerrit.wikimedia.org/r/578707 (https://phabricator.wikimedia.org/T233700)
[06:43:19] (CR) KartikMistry: "recheck" [debs/contenttranslation/apertium-en-ca] - https://gerrit.wikimedia.org/r/578707 (https://phabricator.wikimedia.org/T233700) (owner: KartikMistry)
[07:11:40] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[07:12:48] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[07:21:36] (PS6) Giuseppe Lavagetto: wdqs-internal: switch to use envoy [mediawiki-config] - https://gerrit.wikimedia.org/r/578495 (https://phabricator.wikimedia.org/T244843)
[07:28:01] (CR) jerkins-bot: [V: -1] apertium-en-ca: Update to new upstream 1.0.1 [debs/contenttranslation/apertium-en-ca] - https://gerrit.wikimedia.org/r/578707 (https://phabricator.wikimedia.org/T233700) (owner: KartikMistry)
[07:37:22] (PS1) KartikMistry: apertium-nno-nob: Update to new upstream release 1.3.0 [debs/contenttranslation/apertium-nno-nob] - https://gerrit.wikimedia.org/r/578761 (https://phabricator.wikimedia.org/T239779)
[07:38:23] !log fixcopyrightwiki_p views from labs hosts T246055
[07:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:29] T246055: Drop DB tables for now-deleted fixcopyrightwiki from production - https://phabricator.wikimedia.org/T246055
[07:44:29] (PS1) KartikMistry: apertium-swe-dan: Update to new upstream release 0.8.1 [debs/contenttranslation/apertium-swe-dan] -
https://gerrit.wikimedia.org/r/578762 (https://phabricator.wikimedia.org/T239779)
[07:53:04] (PS1) KartikMistry: apertium-swe-nor: Update to new upstream release 0.3.1 [debs/contenttranslation/apertium-swe-nor] - https://gerrit.wikimedia.org/r/578764 (https://phabricator.wikimedia.org/T239779)
[07:56:55] <_joe_> jouncebot: next
[07:56:55] In 1 hour(s) and 3 minute(s): es3 read only database deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200311T0900)
[07:57:03] <_joe_> ok I have time
[07:58:35] (CR) Giuseppe Lavagetto: [C: +2] wdqs-internal: switch to use envoy [mediawiki-config] - https://gerrit.wikimedia.org/r/578495 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[07:59:38] (Merged) jenkins-bot: wdqs-internal: switch to use envoy [mediawiki-config] - https://gerrit.wikimedia.org/r/578495 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[07:59:54] (PS1) KartikMistry: apertium-dan-nor: Update to new upstream release 1.4.1 [debs/contenttranslation/apertium-dan-nor] - https://gerrit.wikimedia.org/r/578765 (https://phabricator.wikimedia.org/T239779)
[08:02:02] (PS1) Marostegui: db-eqiad,db-codfw.php: Set es3 as RO [mediawiki-config] - https://gerrit.wikimedia.org/r/578766 (https://phabricator.wikimedia.org/T246072)
[08:04:04] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/578621 (owner: CDanis)
[08:04:46] (CR) Marostegui: "Thanks for the patch Guozrm.im.
Once you are ready for people (possibly jcrespo or myself) to review this change, make sure to add us as "" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im)
[08:08:45] (CR) Muehlenhoff: [C: +1] "Looks good (my previous test missed to specify the 8080 port)" [dns] - https://gerrit.wikimedia.org/r/569680 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[08:09:59] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: switch wdqs-internal to use envoy (duration: 01m 21s)
[08:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:00] Operations, serviceops, Service-Architecture: Monitor envoy status where it's installed - https://phabricator.wikimedia.org/T247387 (Joe)
[08:16:36] Operations, MediaWiki-General, serviceops, Service-Architecture: Create a grafana dashboard to monitor services proxied via envoy - https://phabricator.wikimedia.org/T247388 (Joe)
[08:19:24] Operations, MediaWiki-General, serviceops, Service-Architecture: Use envoy for TLS termination on the appservers - https://phabricator.wikimedia.org/T247389 (Joe)
[08:21:37] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe) >>! In T244843#5957717, @Ottomata wrote: > @Joe @akosiaris all deploym...
[08:26:25] Puppet, SRE-tools, Python3-Porting: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (Aklapper)
[08:33:54] !log installing libvpx security updates
[08:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:10] Operations, SRE-tools, Traffic, Continuous-Integration-Config, and 5 others: Integrate automated DNS snippets into CI - https://phabricator.wikimedia.org/T243362 (hashar)
[08:39:22] (PS1) Giuseppe Lavagetto: profile::envoy: check that envoy is running [puppet] - https://gerrit.wikimedia.org/r/578770 (https://phabricator.wikimedia.org/T247387)
[08:40:17] !log installing libidn security updates
[08:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:16] (CR) jerkins-bot: [V: -1] profile::envoy: check that envoy is running [puppet] - https://gerrit.wikimedia.org/r/578770 (https://phabricator.wikimedia.org/T247387) (owner: Giuseppe Lavagetto)
[08:47:38] (PS2) Giuseppe Lavagetto: profile::envoy: check that envoy is running [puppet] - https://gerrit.wikimedia.org/r/578770 (https://phabricator.wikimedia.org/T247387)
[08:50:10] (CR) jerkins-bot: [V: -1] profile::envoy: check that envoy is running [puppet] - https://gerrit.wikimedia.org/r/578770 (https://phabricator.wikimedia.org/T247387) (owner: Giuseppe Lavagetto)
[08:53:19] <_joe_> jerkins you're mean to me
[08:54:05] Operations, SRE-tools, Traffic, Continuous-Integration-Config, and 5 others: Integrate automated DNS snippets into CI - https://phabricator.wikimedia.org/T243362 (hashar) p:Triage→Medium
[08:56:20] (PS1) Elukey: Follow up after Analytics client host refactoring [puppet] - https://gerrit.wikimedia.org/r/578783 (https://phabricator.wikimedia.org/T243934)
[08:59:59] (PS2) Muehlenhoff: Create roles for the initial setup of a server [puppet] - https://gerrit.wikimedia.org/r/575485
[09:00:04]
marostegui and jynus: #bothumor My software never has bugs. It just develops random features. Rise for es3 read only database deployment. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200311T0900).
[09:00:14] Morning jouncebot and jynus, ready?
[09:01:47] !log restarting Apache on puppetboard, people.wikimedia.org, webperf*, bromine, miscweb* to pick up libidn security updates
[09:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:17] jynus: when ready, take a look at https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/578766/
[09:02:21] ok
[09:02:59] (CR) Jcrespo: [C: +1] db-eqiad,db-codfw.php: Set es3 as RO [mediawiki-config] - https://gerrit.wikimedia.org/r/578766 (https://phabricator.wikimedia.org/T246072) (owner: Marostegui)
[09:03:07] yay!
[09:03:09] (PS2) Elukey: Follow up after Analytics client host refactoring [puppet] - https://gerrit.wikimedia.org/r/578783 (https://phabricator.wikimedia.org/T243934)
[09:03:18] (CR) Ema: [C: +1] Create roles for the initial setup of a server [puppet] - https://gerrit.wikimedia.org/r/575485 (owner: Muehlenhoff)
[09:03:23] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Set es3 as RO [mediawiki-config] - https://gerrit.wikimedia.org/r/578766 (https://phabricator.wikimedia.org/T246072) (owner: Marostegui)
[09:04:35] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Set es3 as RO [mediawiki-config] - https://gerrit.wikimedia.org/r/578766 (https://phabricator.wikimedia.org/T246072) (owner: Marostegui)
[09:06:04] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Set es3 as RO - T246072 (duration: 01m 08s)
[09:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:09] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072
[09:06:19] did you remember to rebase? :-P
[09:06:31] haha yes!
[09:06:43] checking logs before going for eqiad
[09:06:49] nothing yet
[09:07:31] let's go for eqiad?
[09:07:35] +1
[09:07:56] deploying
[09:09:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Set es3 as RO - T246072 (duration: 01m 08s)
[09:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:08] monitoring
[09:09:42] binlogs looking clean
[09:09:52] monitoring enwiki
[09:10:17] still a cluster25
[09:10:20] DB://cluster25/223340966
[09:10:30] also DB://cluster25/223340965
[09:10:58] no more wikiuser connections from what I can see
[09:11:21] no more in the last 100
[09:11:28] last 200
[09:12:06] nothing in binlogs
[09:12:08] apart from heartbeat
[09:12:24] DB://cluster25/223340967 may be the last one
[09:12:28] on enwiki
[09:12:43] old_id 956757167
[09:12:49] let's go for read_only on es1017?
[09:12:54] no more in the last 300
[09:13:15] if you see no more binlog activity on other wikis +1
[09:13:30] yeah, there is nothing on binlogs
[09:13:31] does it need a puppet patch too?
[09:13:36] so it doesn't restart?
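The "no more in the last 100 / 200 / 300" checks above amount to scanning recent blob addresses for ones that still point at the cluster being retired. A minimal sketch of that idea follows, assuming addresses have the `DB://<cluster>/<id>` shape quoted in the conversation. The function names, the sample list, and the `cluster26` entry are hypothetical and only serve the illustration; this is not the query that was actually run.

```python
from typing import List, NamedTuple


class BlobAddress(NamedTuple):
    cluster: str
    blob_id: int


def parse_blob_address(url: str) -> BlobAddress:
    """Split an address such as 'DB://cluster25/223340966' into its parts."""
    scheme, sep, rest = url.partition("://")
    if scheme != "DB" or not sep:
        raise ValueError(f"not a DB:// address: {url!r}")
    cluster, _, blob_id = rest.partition("/")
    if not cluster or not blob_id.isdigit():
        raise ValueError(f"malformed address: {url!r}")
    return BlobAddress(cluster, int(blob_id))


def writes_to(cluster: str, recent_urls: List[str]) -> List[str]:
    """Return the addresses in recent_urls that still point at `cluster`."""
    return [u for u in recent_urls if parse_blob_address(u).cluster == cluster]


# Hypothetical sample of recent addresses, newest last.
recent = [
    "DB://cluster25/223340965",
    "DB://cluster25/223340966",
    "DB://cluster26/1001",  # made-up address on another cluster
]
print(writes_to("cluster25", recent))
```

An empty result over a growing window of recent addresses would correspond to the "no more in the last N" observations in the log.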
[09:13:45] yeah, that is yet to come
[09:13:50] to make them standalone
[09:17:58] let me know when/if you set it
[09:18:37] yeah, doing it now
[09:18:53] !log Set es1017 (es3 master) in read only on mysql T246072
[09:18:55] (PS1) Marostegui: es3 hosts: Change them to standalone [puppet] - https://gerrit.wikimedia.org/r/578816 (https://phabricator.wikimedia.org/T246072)
[09:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:58] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072
[09:19:05] done
[09:19:13] checking for errors
[09:20:15] so far so good
[09:21:00] es1017 CRIT: read_only: "True", expected "False":
[09:21:14] as expected
[09:21:16] (CR) Elukey: [C: +2] Follow up after Analytics client host refactoring [puppet] - https://gerrit.wikimedia.org/r/578783 (https://phabricator.wikimedia.org/T243934) (owner: Elukey)
[09:21:19] yeah
[09:21:27] should be gone once we merge that above patch
[09:21:42] checking
[09:22:24] everything there is correct, let me see if everything to be done is included
[09:22:27] next is to kill pt-heartbeat
[09:22:31] and grab positions and all that
[09:22:36] before resetting replication
[09:22:58] (CR) Jcrespo: [C: +1] es3 hosts: Change them to standalone [puppet] - https://gerrit.wikimedia.org/r/578816 (https://phabricator.wikimedia.org/T246072) (owner: Marostegui)
[09:23:20] (CR) Marostegui: [C: +2] es3 hosts: Change them to standalone [puppet] - https://gerrit.wikimedia.org/r/578816 (https://phabricator.wikimedia.org/T246072) (owner: Marostegui)
[09:23:54] elukey: your change can be merged?
[09:24:17] sure!
[09:24:19] thanks
[09:24:21] ok, merging
[09:25:36] pt heartbeat stopped on the master
[09:25:39] grabbing positions
[09:27:19] jynus: please confirm: es1017 es1014 es1019 es2017 es2018 es2019: stop slave; reset slave all;
[09:28:56] correct- well, except on the 1 master that will not be needed
[09:29:10] Excellent, will run it
[09:29:23] !log Disconnect replication on all es3 hosts T246072
[09:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:28] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072
[09:29:31] es1017 should not have replication active, but worth checking anyway
[09:29:49] it didn't yeah
[09:29:54] all done and replication threads gone
[09:30:44] (CR) Marostegui: [C: +1] "volans: this is good to go" [software/spicerack] - https://gerrit.wikimedia.org/r/576297 (https://phabricator.wikimedia.org/T226704) (owner: Volans)
[09:30:58] jynus: we need to update zarcillo no?
[09:31:01] Operations, Traffic: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (MoritzMuehlenhoff) Resolved→Open I ran the OS upgrade tracking script and noticed that ncredir5002 is still on Buster, reopening
[09:31:03] only zarcillo/tendril tunings, right?
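The replication disconnect agreed on at 09:27 to 09:29 can be summarised as a small per-host plan. This is only a sketch of the reasoning in the conversation, not the commands as they were run: the host list and the `stop slave; reset slave all;` statements come from the log, while the function name and the idea of issuing a `SHOW SLAVE STATUS` check on the master (reflecting "worth checking anyway") are assumptions made for illustration.

```python
# Hosts and master as named in the conversation above.
ES3_HOSTS = ["es1017", "es1014", "es1019", "es2017", "es2018", "es2019"]
ES3_MASTER = "es1017"

# Statements quoted in the log for the replicas.
DISCONNECT = ["STOP SLAVE;", "RESET SLAVE ALL;"]
# Assumed read-only check for the master, which should have no replication set up.
VERIFY_ONLY = ["SHOW SLAVE STATUS;"]


def disconnect_plan(hosts, master):
    """Map each host to the statements to run: disconnect replicas, only verify the master."""
    return {h: list(VERIFY_ONLY if h == master else DISCONNECT) for h in hosts}


for host, statements in disconnect_plan(ES3_HOSTS, ES3_MASTER).items():
    print(host, " ".join(statements))
```

Keeping the master out of the disconnect list mirrors the remark that the statements "will not be needed" there.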
[09:31:23] just setting standalone to es3 sections table [09:31:27] to 1 [09:31:35] yeah [09:31:49] and hiding it from tendril, if desired [09:31:57] yeah, I will do that as it was done with es2 [09:32:21] es1017 on icinga: Version 10.1.36-MariaDB, Uptime 46118584s, read_only: True [09:32:21] done [09:32:25] nice [09:32:51] there was a bit of a slowdown in rows read before the change: [09:33:09] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=es1&var-shard=es2&var-shard=es3&var-shard=es4&var-shard=es5&var-role=All&fullscreen&panelId=8&from=1583897583818&to=1583919183818 [09:33:40] but maybe it is just someone stopping batch processes out of precaution or something [09:34:10] yeah, maybe [09:34:16] let me check grafana got it [09:34:35] I have updated zarcillo.sections and tendril.shards [09:34:37] (03PS1) 10Elukey: statistics::rsync::eventlogging: run crons as root [puppet] - 10https://gerrit.wikimedia.org/r/578845 [09:34:47] oh, it takes 1 hour or so to run the script [09:34:50] let me force it [09:35:03] or 30 minutes [09:35:39] ok [09:35:50] (03CR) 10jerkins-bot: [V: 04-1] statistics::rsync::eventlogging: run crons as root [puppet] - 10https://gerrit.wikimedia.org/r/578845 (owner: 10Elukey) [09:36:22] I updated it manually, it should show up on grafana any time now [09:36:53] ^that means running: mysqld_exporter_config.py eqiad /srv/prometheus/ops/targets [09:37:01] on prometheus1003/4 [09:37:05] FYI [09:37:15] it is done automatically anyway [09:37:33] yep, it worked [09:38:10] there was a spike in master reads at 9:11 [09:38:22] I am guessing just one of us checking data being added or something [09:38:42] yeah [09:38:43] that was me [09:38:47] checkingsome tables on the master [09:39:02] it was actually not that large [09:39:13] but compared to the low read rate, it stood out [09:39:33] reads seem up back again [09:39:47] 0 errors on log [09:40:48] yeah, it was definitely me 
[09:41:00] checking some tables [09:42:38] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): CloudVPS: introduce filtering for neutron BGP addresses - https://phabricator.wikimedia.org/T246887 (10aborrero) p:05Triage→03High This filter might be blocking legit traffic. I believe we need to explicitly allow traffic with... [09:43:03] !log Finish es3 maintenance window T246072 [09:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:08] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 [09:43:43] (03PS1) 10Elukey: Use the analytics user to check the hdfs mountpoint where needed [puppet] - 10https://gerrit.wikimedia.org/r/578859 [09:46:56] !log depool and reimage ncredir5002 with buster - T243391 [09:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:01] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [09:48:54] (03CR) 10Gehel: "LGTM, minor question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578653 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [09:50:30] (03PS2) 10Elukey: statistics::rsync::eventlogging: run crons as root [puppet] - 10https://gerrit.wikimedia.org/r/578845 [09:50:32] (03PS2) 10Elukey: Use the analytics user to check the hdfs mountpoint where needed [puppet] - 10https://gerrit.wikimedia.org/r/578859 [09:51:39] (03CR) 10jerkins-bot: [V: 04-1] statistics::rsync::eventlogging: run crons as root [puppet] - 10https://gerrit.wikimedia.org/r/578845 (owner: 10Elukey) [09:52:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=ncredir site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:53:10] !log installing postgresql-9.6 security updates on maps* [09:53:13] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:38] ^^ thats's expected from ncredir5002 being reimaged [09:58:13] (03PS3) 10Elukey: statistics::rsync::eventlogging: run crons as root [puppet] - 10https://gerrit.wikimedia.org/r/578845 [09:58:15] (03PS3) 10Elukey: Use the analytics user to check the hdfs mountpoint where needed [puppet] - 10https://gerrit.wikimedia.org/r/578859 [09:58:56] (03PS2) 10Alexandros Kosiaris: eventstreams: Switch to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/578555 (https://phabricator.wikimedia.org/T238658) [09:59:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams: Switch to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/578555 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [10:02:08] (03CR) 10Elukey: [C: 03+2] statistics::rsync::eventlogging: run crons as root [puppet] - 10https://gerrit.wikimedia.org/r/578845 (owner: 10Elukey) [10:02:24] (03CR) 10Elukey: [C: 03+2] Use the analytics user to check the hdfs mountpoint where needed [puppet] - 10https://gerrit.wikimedia.org/r/578859 (owner: 10Elukey) [10:04:04] 10Operations: Integrate Stretch 9.12 point update - https://phabricator.wikimedia.org/T244695 (10MoritzMuehlenhoff) [10:20:23] (03CR) 10Arturo Borrero Gonzalez: "Please see my comment here https://phabricator.wikimedia.org/T247236#5955523 before merging this patch." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [10:29:46] (03PS2) 10Volans: mysql: update CORE_SECTIONS for external storage [software/spicerack] - 10https://gerrit.wikimedia.org/r/576297 (https://phabricator.wikimedia.org/T226704) [10:30:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:31:38] (03CR) 10Jcrespo: "Unrelated to this patch, but this reminds me we have to "solve" the s10 problem." [software/spicerack] - 10https://gerrit.wikimedia.org/r/576297 (https://phabricator.wikimedia.org/T226704) (owner: 10Volans) [10:33:28] (03CR) 10Volans: "LGTM, see caveat inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578665 (owner: 10CDanis) [10:34:11] (03CR) 10Hnowlan: [C: 03+2] changeprop: enable nutcracker [deployment-charts] - 10https://gerrit.wikimedia.org/r/578567 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [10:34:51] !log restarting Apache on graphite*. kibana, netmon* to pick up libidn security updates [10:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:55] (03CR) 10Jbond: [C: 03+1] switch webproxy CNAMEs to new install servers [dns] - 10https://gerrit.wikimedia.org/r/569680 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [10:38:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly give weight to es3 old masters - T246072', diff saved to https://phabricator.wikimedia.org/P10684 and previous config saved to /var/cache/conftool/dbconfig/20200311-103802-marostegui.json [10:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:07] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 [10:38:24] (03PS1) 10Alexandros Kosiaris: eventstreams: Move new LVS TLS IP to production state [puppet] - 10https://gerrit.wikimedia.org/r/578896 (https://phabricator.wikimedia.org/T238658) [10:41:49] zuul seems completely overwhelmed by mediawiki/skins/Vector jobs [10:42:15] the job queue of zuul is at ~8k [10:42:24] !log pool ncredir5002 - T243391 [10:42:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:29] T243391: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 [10:43:09] liw: maybe you're able to have a look? (zuul above ^^^) [10:44:12] 10Operations, 10Traffic: Upgrade ncredir cluster to buster - https://phabricator.wikimedia.org/T243391 (10Vgutierrez) 05Open→03Resolved `vgutierrez@cumin1001:~$ sudo -i cumin 'A:ncredir' 'cat /etc/debian_version' 10 hosts will be targeted: ncredir[2001-2002].codfw.wmnet,ncredir[1001-1002].eqiad.wmnet,ncred... [10:44:25] jesus christ that zuul queue [10:47:33] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] eventstreams: Move new LVS TLS IP to production state [puppet] - 10https://gerrit.wikimedia.org/r/578896 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [10:52:10] (03PS3) 10Giuseppe Lavagetto: profile::envoy: check that envoy is running [puppet] - 10https://gerrit.wikimedia.org/r/578770 (https://phabricator.wikimedia.org/T247387) [10:52:50] (03PS2) 10Volans: Allow to override Spicerack's instance parameters [software/spicerack] - 10https://gerrit.wikimedia.org/r/578656 (owner: 10CDanis) [10:52:52] (03PS1) 10Volans: mypy: remove unused type ignore comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/578918 [10:53:02] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10aborrero) Reviewing this setup again, I just noticed this static route: ` route 185.15.57.0/29 next-hop 208.80.153.190; ` That CIDR is both the floati... 
[10:53:32] <_joe_> sigh [10:56:45] <_joe_> it looks like zuul merger is not doing its job [10:59:12] <_joe_> https://grafana.wikimedia.org/d/000000321/zuul?orgId=1&from=1583880217482&to=1583924304958&fullscreen&panelId=25 seems very wrong [10:59:46] !log Remove Mostrevisions from mwmaint1002 T239072 [10:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:51] T239072: SpecialMostRevisions::reallyDoQuery takes lots of hours to run on wikidata - https://phabricator.wikimedia.org/T239072 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200311T1100). [11:00:04] addshore and CFisch_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:03:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give normal 100 weight to es3 old masters - T246072', diff saved to https://phabricator.wikimedia.org/P10685 and previous config saved to /var/cache/conftool/dbconfig/20200311-110334-marostegui.json [11:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:40] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 [11:03:49] (03PS2) 10Hnowlan: changeprop: enable nutcracker [deployment-charts] - 10https://gerrit.wikimedia.org/r/578567 (https://phabricator.wikimedia.org/T213193) [11:08:01] Isn't the bot announcement a bit early? Or is it new to war an hour before? [11:08:56] CFisch_WMDE: it's based on SF time [11:09:05] CFisch_WMDE: maybe that's triggered cause the US is already in summer time [11:09:11] and they already made the summer time transitiion ahead of CET [11:09:26] yeah.. 
Europe will catch up at the end of March [11:09:26] Hey all, this 8k long Zuul queue for mediawiki/skins/Vector jobs is **all** WIP patches from a volunteer. If there's anyway to cancel those to clear the queue, I think that would be fine. [11:10:59] <_joe_> jan_drewniak: I have no idea how to do that [11:11:04] !log restarting exim on MXes to pick up libidn security updates [11:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:19] jouncebot: next [11:11:19] In 1 hour(s) and 48 minute(s): Mediawiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200311T1300) [11:11:20] (03PS2) 10Ema: Debianization [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) [11:11:24] jouncebot: now [11:11:24] For the next 0 hour(s) and 48 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200311T1100) [11:11:29] oh! 
that is now [11:11:33] _joe_ in theory, removing them from the whitelist will prevent it in the future [11:11:34] <_joe_> yes [11:11:55] <_joe_> yeah I think we're more worried of making ci work now [11:12:18] <_joe_> and I'll have to document myself on zuul if no one from release engineering is around [11:12:22] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] changeprop: enable nutcracker [deployment-charts] - 10https://gerrit.wikimedia.org/r/578567 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [11:13:27] (03Merged) 10jenkins-bot: changeprop: enable nutcracker [deployment-charts] - 10https://gerrit.wikimedia.org/r/578567 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [11:13:51] _joe_: we might try https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Very_high_queue_of_merger:merge_functions [11:14:17] not sure if in the same state, probably not, checking [11:14:36] (03PS1) 10Alexandros Kosiaris: eventstreams: Swith caching layer to TLS enabled eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/578921 (https://phabricator.wikimedia.org/T238658) [11:14:56] CFisch_WMDE: are you swatting? [11:15:00] a restart should work AFAIUI from the docs, but will loose all existing queued jobs [11:15:14] addshore: nope don't have the time right now [11:15:23] can you? do you? 
[11:15:24] <_joe_> volans: yeah I was thinking the same [11:15:33] <_joe_> it's ok to lose them at this point IMHO [11:15:44] +1 for me [11:15:53] fwiw, +1 from me as well [11:15:59] agreed [11:16:02] it's not the first time this has happened [11:16:04] <_joe_> ok, doing so [11:16:08] not all of the queued jobs are vector - https://gerrit.wikimedia.org/r/#/q/topic:hooks+status:open [11:16:12] CFisch_WMDE: I'll do mine toward the end of the window [11:16:14] got a 15 min daily now [11:16:34] <_joe_> !log restarting zuul and zuul-merger on contint1001, they're stuck [11:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:48] CFisch_WMDE: I can give you a ping when I do it (I can do yours too) [11:17:01] cool, thx [11:18:18] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:18:28] (03PS2) 10Alexandros Kosiaris: eventstreams: Swith caching layer to TLS enabled eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/578921 (https://phabricator.wikimedia.org/T238658) [11:18:34] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/578918 (owner: 10Volans) [11:19:08] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/576297 (https://phabricator.wikimedia.org/T226704) (owner: 10Volans) [11:19:52] seems to be back fine and the queue is gone [11:20:07] addshore: are you doing the SWAT? 
[11:20:20] now how to notify people to hit recheck I dunno, and ofc without doing it on the whole set of patches that triggered this all at once : [11:20:58] I should laso get a -1 soon, to verify all works also in the negative path [11:21:24] (03PS1) 10Elukey: role::analytics_test_cluster::client: use the analytics keytab [puppet] - 10https://gerrit.wikimedia.org/r/578922 [11:21:39] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [11:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:58] (03PS1) 10Jbond: check_puppet_run_changes: chgeck for >= and not just > [puppet] - 10https://gerrit.wikimedia.org/r/578923 [11:22:06] <_joe_> volans: I guess people will recheck their changes [11:22:57] (03CR) 10Muehlenhoff: [C: 03+1] "dh-golang is probably more future-proof for future rebuilds of the package, but that should work unless you are planning to also upload th" [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [11:23:15] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::client: use the analytics keytab [puppet] - 10https://gerrit.wikimedia.org/r/578922 (owner: 10Elukey) [11:23:29] Lucas_WMDE: in daily but will do swat after [11:23:41] (03CR) 10jerkins-bot: [V: 04-1] Debianization [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [11:23:46] Did you re +2 my patch? :) If so ty [11:24:02] no because I wasn’t sure what was happening :) [11:24:04] I can do it [11:24:14] If pos that would be great! 
[11:24:15] Ty [11:25:21] !log restarting slapd on serpens/seaborgium to pick up libidn security updates [11:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:05] (03CR) 10jerkins-bot: [V: 04-1] mysql: update CORE_SECTIONS for external storage [software/spicerack] - 10https://gerrit.wikimedia.org/r/576297 (https://phabricator.wikimedia.org/T226704) (owner: 10Volans) [11:26:36] (03CR) 10Volans: [C: 03+2] "Merging as it make other patches fail CI" [software/spicerack] - 10https://gerrit.wikimedia.org/r/578918 (owner: 10Volans) [11:28:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams: Swith caching layer to TLS enabled eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/578921 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [11:31:23] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/578923 (owner: 10Jbond) [11:31:27] Lucas_WMDE: So you're swating now? ;-) [11:31:30] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` stat1008.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage... [11:32:09] I guess? :D [11:32:36] (03Merged) 10jenkins-bot: mypy: remove unused type ignore comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/578918 (owner: 10Volans) [11:32:57] (03PS3) 10Volans: mysql: update CORE_SECTIONS for external storage [software/spicerack] - 10https://gerrit.wikimedia.org/r/576297 (https://phabricator.wikimedia.org/T226704) [11:32:58] CFisch_WMDE: do you want your config change to be merged first? [11:33:05] Zuul predicts 9 more minutes for the backport [11:33:56] Yeah no worries. 
I just wanted to make sure what's happening ;-) [11:34:11] ok [11:40:20] (03CR) 10Volans: [C: 03+2] mysql: update CORE_SECTIONS for external storage [software/spicerack] - 10https://gerrit.wikimedia.org/r/576297 (https://phabricator.wikimedia.org/T226704) (owner: 10Volans) [11:41:55] (03PS4) 10Giuseppe Lavagetto: profile::envoy: check that envoy is running [puppet] - 10https://gerrit.wikimedia.org/r/578770 (https://phabricator.wikimedia.org/T247387) [11:42:10] PROBLEM - traffic_server backend process restarted on cp2004 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2004&var-layer=backend [11:42:50] <_joe_> vgutierrez, ema I've seen several cases of this alert in the last few days, is there anything I need to do if you're not around? [11:45:44] (03Merged) 10jenkins-bot: mysql: update CORE_SECTIONS for external storage [software/spicerack] - 10https://gerrit.wikimedia.org/r/576297 (https://phabricator.wikimedia.org/T226704) (owner: 10Volans) [11:46:03] (03PS3) 10Volans: Allow to override Spicerack's instance parameters [software/spicerack] - 10https://gerrit.wikimedia.org/r/578656 (owner: 10CDanis) [11:46:24] _joe_: so traffic_manager handles auto-restarts of traffic_server [11:46:45] _joe_: that alert is there to let us notice about an unexpected issue with traffic_server [11:46:51] * vgutierrez checking cp2004 [11:47:31] vgutierrez: it looks like another instance of T242952 [11:47:32] T242952: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 [11:47:45] :/ [11:48:19] ema: will you handle the restart to clear the alert or should do I? 
[11:48:48] vgutierrez: please go ahead [11:48:50] ack [11:48:59] !log restarting ats-backend on cp2004 [11:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:41] <_joe_> so we need to restart it, or that's just needed to clear the alert? [11:51:42] RECOVERY - traffic_server backend process restarted on cp2004 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2004&var-layer=backend [11:52:00] _joe_: just to clear the alert, no restart needed for functional reasons [11:52:27] addshore: backport was merged, are you deploying it / will you deploy it? [11:52:33] <_joe_> ok, when we'll have alertmanager we will be able to alert on the restart rate instead [11:52:41] (I didn’t read all the messages here so I might have missed something, sorry) [11:52:46] Lucas_WMDE: if you could that would be great, I'm in another call now [11:52:49] ok [11:52:53] is it testable [11:52:55] ? [11:53:18] volans, I'm afraid I don't know what to do about Zuul, sorry (I'll ask thcipriani to teach me how to kick) [11:54:11] _joe_: if we're not around you can take a look at the journal of either trafficserver-tls.service or trafficserver.service (depending on which ats crashed) and if the logs say "crash upon Lua reload" please add a note to T242952 [11:54:11] T242952: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 [11:54:16] Yup, right now wbsearchentities should not show data types on test [11:54:21] ok [11:54:26] Lucas_WMDE: with the patch it will show them again [11:54:27] it’s on mwdebug1001 now, checking [11:54:28] Shoyld [11:54:32] <_joe_> ema: ack! [11:54:41] _joe_: or just mention the thing on #-traffic. Thanks! 
:) [11:55:18] yup, seem to do the right thing [11:55:20] syncing [11:56:11] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578520 (https://phabricator.wikimedia.org/T247292) (owner: 10WMDE-Fisch) [11:56:19] +2ing the other one because it seems to be beta-only [11:56:30] so I hope I can get that done before the end of the hour [11:56:52] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.23/extensions/WikibaseCirrusSearch/: SWAT: [[gerrit:578805|Wrap property EntitySearchHelper in PropertyDataTypeSearchHelper]] (duration: 01m 05s) [11:56:52] thx [11:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:15] (03Merged) 10jenkins-bot: Don't use TwoColConflict as beta feature on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578520 (https://phabricator.wikimedia.org/T247292) (owner: 10WMDE-Fisch) [11:57:53] (03PS1) 10Volans: dns: fine tune snippet generation script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/578925 (https://phabricator.wikimedia.org/T233183) [11:59:29] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: SWAT (prod no-op): [[gerrit:578520|Don't use TwoColConflict as beta feature on labs (T247292)]] (duration: 01m 09s) [11:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:34] T247292: Don't use beta feature mode on CI and beta clusters - https://phabricator.wikimedia.org/T247292 [11:59:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::envoy: check that envoy is running [puppet] - 10https://gerrit.wikimedia.org/r/578770 (https://phabricator.wikimedia.org/T247387) (owner: 10Giuseppe Lavagetto) [12:00:42] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: SWAT (prod no-op): [[gerrit:578520|Don't use TwoColConflict as beta feature on labs (T247292)]], take II (duration: 01m 07s) [12:00:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:55] !log EU SWAT done [12:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:57] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10aborrero) Wait, I think I may have a solution for this. [12:14:21] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): CloudVPS: introduce filtering for neutron BGP addresses - https://phabricator.wikimedia.org/T246887 (10aborrero) I think I may have a solution for this, stay tuned. [12:20:27] (03PS1) 10Alexandros Kosiaris: eventstreams: Increase CPU limits by 25% [deployment-charts] - 10https://gerrit.wikimedia.org/r/578929 (https://phabricator.wikimedia.org/T238658) [12:21:24] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] eventstreams: Increase CPU limits by 25% [deployment-charts] - 10https://gerrit.wikimedia.org/r/578929 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [12:22:17] cmjohnson1: o/ - stat1008 seems stuck from the serial console [12:23:36] elukey: I can powercycle it and get it unstuck. is it just raid10 all disks? [12:23:43] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [12:23:43] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[12:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:01] cmjohnson1: not sure, I can see where it gets stuck into, might still be in d-i [12:24:38] I just powered it down [12:24:57] volans: Zuul doesn’t seem to have started any new builds in a while, dashboard is almost empty (pinging you since you were working with Zuul earlier) [12:25:33] Lucas_WMDE: zuul was restarted, and all new CRs that I've seen have had CI run as usual AFAICT [12:25:52] I’m missing a gate-and-submit on https://gerrit.wikimedia.org/r/578336 for example [12:25:59] +2ed 7 minutes ago [12:26:05] restart was longer ago than that, right? [12:26:15] unless there was another one I missed [12:26:41] yes was way before, see SAL [12:26:47] interesting https://integration.wikimedia.org/zuul/ is totally empty [12:26:55] yeah [12:27:12] I got some successful builds during the end of the SWAT window, but then it seems to have stopped again [12:27:57] (03PS1) 10CDanis: spicerack: add auth faux-secrets for cf cookbook [labs/private] - 10https://gerrit.wikimedia.org/r/578934 [12:28:14] I'm sorry I've no idea of the internals of Zuul, we just used the hammer approach and restarted it because of the huge queue [12:28:51] hm, anyone around who does know the internals? 
:) [12:29:01] cmjohnson1: if you are busy I can attempt another wmf-reimage with serial console on to see where it gets stuck [12:29:24] (03PS1) 10KartikMistry: WIP: cxserver: Add sectionmapping config for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/578935 (https://phabricator.wikimedia.org/T243430) [12:30:09] Lucas_WMDE: RelEng (cc liw and hashar in closeby TZ) [12:30:52] Looks like CI is stuck and 'recheck' has no effect, volans [12:30:56] (03CR) 10Elukey: [C: 03+2] Add xmldumps to stat100[4,5] [puppet] - 10https://gerrit.wikimedia.org/r/577278 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [12:31:47] the weird part is that it worked fine after the restart [12:32:35] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [12:32:36] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [12:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:42] (03CR) 10Volans: [C: 03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/578934 (owner: 10CDanis) [12:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:51] (03CR) 10CDanis: [C: 03+2] Allow to override Spicerack's instance parameters [software/spicerack] - 10https://gerrit.wikimedia.org/r/578656 (owner: 10CDanis) [12:36:34] (03CR) 10CDanis: [V: 03+2 C: 03+2] spicerack: add auth faux-secrets for cf cookbook [labs/private] - 10https://gerrit.wikimedia.org/r/578934 (owner: 10CDanis) [12:38:35] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [12:38:35] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[12:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:38] akosiaris: I see you're a gerrit admin, do you think maybe could be this the issue now? [12:39:38] Elukey: please try, maybe I’m missing something. [12:39:41] https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Gearman_deadlock [12:40:57] volans: I honestly have no idea. Never done that before. [12:41:03] but I can try I guess? [12:41:22] I can't tell if it's gearman or jenkins in this case [12:41:28] the queues are all zeros AFAICT [12:42:00] so I untick and tick a box [12:42:08] lol... ok let's try it [12:43:05] !log disconnect+connect jenkins from gearman server. [12:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:33] apparently so [12:44:12] volans: done [12:44:34] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` stat1008.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... [12:44:38] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['stat1008.eqiad.wmnet'] ` Of which those **FAILED**: ` ['stat1008.eqiad.wmnet'] ` [12:44:49] well, maybe not. I 've hit save, but I am still stuck at "waiting for integration.wikimedia.org" [12:44:58] ah, finally done [12:45:10] what on earth is that jenkins doing. 60+ secs to save the tick of a box? 
[12:45:15] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` stat1008.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... [12:45:17] lol [12:45:58] it doesn't seem to have helped.. [12:46:05] Lucas_WMDE: removed your +2 and tried a recheck [12:46:07] let's see [12:46:20] tried some rechecks and re-+2 [12:46:24] I don't see any sign if that having helped either [12:46:34] * thcipriani blinks [12:46:48] we had an issue a few weeks ago where a stuck thread in zuul did this [12:46:50] * thcipriani checks [12:49:42] thcipriani: thanks! I 'll admit the entire team is a bit at a loss and doesn't know how to help with this [12:51:11] akosiaris: ack. I'm getting a threaddump from zuul via sending SIGUSR2 [12:51:51] (03PS4) 10CDanis: cf: read api_token & account_id from a config file [cookbooks] - 10https://gerrit.wikimedia.org/r/578647 [12:52:21] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10elukey) This seems to be where d-i gets stuck: ` Error while setting up RAID │ │ An unexpected error occurred while setting up a preseeded... [12:53:17] 10Puppet, 10SRE-tools, 10Python3-Porting: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10MoritzMuehlenhoff) >>! In T247364#5958525, @MoritzMuehlenhoff wrote: > Note that there's a number of scripts blocked by OS runtime dependencies, e.g. various LDA... [12:53:31] hrm, there is a thread that is suck on gerrit, there are other threads that look fine. I think I'd actually like to try restarting gerrit to see what happens. [12:53:51] hashar speculated this may fix zuul at one point and I'd like to see if he's correct about that. 
[12:54:31] !log restarting gerrit to try to fix thread deadlock on zuul (cf: T246973 ) [12:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:36] T246973: CI / Zuul not processing changes - https://phabricator.wikimedia.org/T246973 [12:57:04] (03CR) 10Thcipriani: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/462748 (https://phabricator.wikimedia.org/T205313) (owner: 10Thcipriani) [12:57:36] (03PS1) 10Hnowlan: changeprop: Change redis connect string to use host/port rather than file path [deployment-charts] - 10https://gerrit.wikimedia.org/r/578946 (https://phabricator.wikimedia.org/T213193) [12:57:54] well. I did receive my recheck event seemingly [12:58:44] but the zuul debug logs seem to be flailing a bit [12:59:29] (03CR) 10Ppchelko: [C: 03+2] changeprop: Change redis connect string to use host/port rather than file path [deployment-charts] - 10https://gerrit.wikimedia.org/r/578946 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [12:59:46] (03Merged) 10jenkins-bot: changeprop: Change redis connect string to use host/port rather than file path [deployment-charts] - 10https://gerrit.wikimedia.org/r/578946 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [13:00:04] brennen and hashar: #bothumor My software never has bugs. It just develops random features. Rise for Mediawiki train - American+European Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200311T1300). 
[13:01:10] afaict changes are processing again [13:01:54] !log restarting gerrit unstuck the zuul server (T246973) [13:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:00] T246973: CI / Zuul not processing changes - https://phabricator.wikimedia.org/T246973 [13:02:14] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['stat1008.eqiad.wmnet'] ` Of which those **FAILED**: ` ['stat1008.eqiad.wmnet'] ` [13:02:33] 10Operations, 10SRE-OnFire: write up impact estimation procedure - https://phabricator.wikimedia.org/T246739 (10CDanis) [13:06:29] (03CR) 10Holger Knust: "@Alex, I was wondering if you had any feedback on Petr's comment. Are you OK with going the 2 chart route or do we need to combine the two" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [13:10:41] 10Operations, 10Android-app-Bugs, 10Traffic, 10Wikipedia-Android-App-Backlog, and 4 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (10JoeWalsh) a:03JoeWalsh [13:11:00] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10elukey) From the d-i shell I can see: ` ~ # fdisk -l Disk /dev/sda: 7.3 TiB, 7999376588800 bytes, 15623782400 sectors Disk model: PERC H730P Adp Units: sectors of 1 * 512 = 512 byt... [13:11:03] thanks thcipriani and volans [13:16:18] cmjohnson1: is it possible that stat1008 has a raid-10 hw config and partman tries to deploy the same sw config? 
(finding only one /dev/, so failing) [13:20:37] elukey, cmjohnson1: stat1008 is currently in fact configured to a partman recipes which uses RAID10 software raid with 10 disks [13:24:22] (03CR) 10CDanis: [C: 03+2] typos: add preform (but not preformat) [puppet] - 10https://gerrit.wikimedia.org/r/578621 (owner: 10CDanis) [13:28:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-swe-nor: Update to new upstream release 0.3.1 [debs/contenttranslation/apertium-swe-nor] - 10https://gerrit.wikimedia.org/r/578764 (https://phabricator.wikimedia.org/T239779) (owner: 10KartikMistry) [13:28:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-dan-nor: Update to new upstream release 1.4.1 [debs/contenttranslation/apertium-dan-nor] - 10https://gerrit.wikimedia.org/r/578765 (https://phabricator.wikimedia.org/T239779) (owner: 10KartikMistry) [13:29:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-swe-dan: Update to new upstream release 0.8.1 [debs/contenttranslation/apertium-swe-dan] - 10https://gerrit.wikimedia.org/r/578762 (https://phabricator.wikimedia.org/T239779) (owner: 10KartikMistry) [13:29:04] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-nno-nob: Update to new upstream release 1.3.0 [debs/contenttranslation/apertium-nno-nob] - 10https://gerrit.wikimedia.org/r/578761 (https://phabricator.wikimedia.org/T239779) (owner: 10KartikMistry) [13:30:15] (03PS2) 10KartikMistry: WIP: cxserver: Add sectionmapping config for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/578935 (https://phabricator.wikimedia.org/T243430) [13:30:18] 10Operations, 10netops, 10Patch-For-Review: can aggregated netflow data include the router it was sampled from? - https://phabricator.wikimedia.org/T246186 (10elukey) If the cardinality of the three new dimensions are not too big we could definitely add them to Druid. One note - adding the new info to pmacc... 
[13:30:30] 10Operations, 10netops, 10Patch-For-Review, 10User-Elukey: can aggregated netflow data include the router it was sampled from? - https://phabricator.wikimedia.org/T246186 (10elukey) [13:30:59] (03CR) 10CDanis: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/21388/cumin1001.eqiad.wmnet/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578665 (owner: 10CDanis) [13:32:27] (03PS1) 10Muehlenhoff: Fix user search in modify-mfa [puppet] - 10https://gerrit.wikimedia.org/r/578952 [13:32:45] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:35:38] 10Operations, 10serviceops, 10Service-Architecture: Monitor envoy status where it's installed - https://phabricator.wikimedia.org/T247387 (10Joe) p:05Triage→03High [13:36:23] Elukey I did do a h/w raid 10 [13:37:39] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:41:06] cmjohnson1: ah ok, what is the best way to proceed? Should we use hw raid and change the raid config on netboot.cfg, or move to a sw raid only? Not sure what is the best practice :) [13:43:53] we can move to sw raid to match the others.....fixing it now [13:46:36] thanks! [13:51:15] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: debian-glue-backports not enabling backports on buster - https://phabricator.wikimedia.org/T247316 (10ema) 05Open→03Resolved a:03ema >>! In T247316#5956625, @gerritbot wrote: > Change 578543 **merged** by Ema: > [operations/puppet@produc... 
[13:52:41] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.32 [software/spicerack] - 10https://gerrit.wikimedia.org/r/578953 [13:55:05] (03PS1) 10Jgreen: nsca_frack.cfg.erb add fundraising-database-analytics hostgroup for frdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/578954 [13:57:55] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` stat1008.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage... [13:58:10] elukey: fixed the raid and re-imaging now [13:59:05] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.32 [software/spicerack] - 10https://gerrit.wikimedia.org/r/578953 (owner: 10Volans) [13:59:59] (03CR) 10Jgreen: [C: 03+2] nsca_frack.cfg.erb add fundraising-database-analytics hostgroup for frdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/578954 (owner: 10Jgreen) [14:00:07] (03PS5) 10CDanis: cf: read api_token & account_id from a config file [cookbooks] - 10https://gerrit.wikimedia.org/r/578647 [14:02:03] (03CR) 10Nuria: "Seems like this should be ready to merge?" [puppet] - 10https://gerrit.wikimedia.org/r/577469 (https://phabricator.wikimedia.org/T247034) (owner: 10DCausse) [14:02:29] (03PS4) 10Elukey: [wdqs] purge wdqs query logs [puppet] - 10https://gerrit.wikimedia.org/r/577469 (https://phabricator.wikimedia.org/T247034) (owner: 10DCausse) [14:03:10] (03PS5) 10Elukey: profile::analytics::refinery::job::data_purge: purge wdqs query logs [puppet] - 10https://gerrit.wikimedia.org/r/577469 (https://phabricator.wikimedia.org/T247034) (owner: 10DCausse) [14:03:12] (03PS6) 10CDanis: cf: read api_token & account_id from a config file [cookbooks] - 10https://gerrit.wikimedia.org/r/578647 [14:04:05] (03CR) 10Volans: [C: 03+1] "LGTM, I think we should also make puppet fail if a host has no roles at this point." 
[puppet] - 10https://gerrit.wikimedia.org/r/575485 (owner: 10Muehlenhoff) [14:04:56] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/578647 (owner: 10CDanis) [14:05:25] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.32 [software/spicerack] - 10https://gerrit.wikimedia.org/r/578953 (owner: 10Volans) [14:07:11] (03PS1) 10Volans: Upstream release v0.0.32 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/578955 [14:07:47] (03PS3) 10Elukey: WIP - Introduce profile::mariadb::misc::analytics [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) [14:08:46] !log T239779 upload apertium-dan-nor_1.4.1-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main [14:08:46] !log T239779 upload apertium-nno-nob_1.3.0-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main [14:08:46] !log T239779 upload apertium-swe-dan_0.8.1-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main [14:08:46] !log T239779 upload apertium-swe-nor_0.3.1-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main [14:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:52] T239779: Update apertium-nno-nob, apertium-swe-dan, apertium-swe-nor and apertium-dan-nor packages - https://phabricator.wikimedia.org/T239779 [14:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:25] (03PS1) 10Giuseppe Lavagetto: envoy: check for runtime variables set for a long time [puppet] - 10https://gerrit.wikimedia.org/r/578956 (https://phabricator.wikimedia.org/T247387) [14:11:20] (03CR) 10Muehlenhoff: "Volans: wrt make puppet fail; good idea, we can add this once this is merged and put into day-to-day racking practice." 
[puppet] - 10https://gerrit.wikimedia.org/r/575485 (owner: 10Muehlenhoff) [14:12:14] (03CR) 10jerkins-bot: [V: 04-1] envoy: check for runtime variables set for a long time [puppet] - 10https://gerrit.wikimedia.org/r/578956 (https://phabricator.wikimedia.org/T247387) (owner: 10Giuseppe Lavagetto) [14:13:34] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.32 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/578955 (owner: 10Volans) [14:15:05] (03PS4) 10Elukey: WIP - Introduce profile::mariadb::misc::analytics [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) [14:16:39] (03PS1) 10Alexandros Kosiaris: Revert "Revert "ProductionServices:switch eventgate-analytics to use envoy"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578957 [14:17:50] (03CR) 10Jbond: [C: 03+1] Fix user search in modify-mfa [puppet] - 10https://gerrit.wikimedia.org/r/578952 (owner: 10Muehlenhoff) [14:18:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "Revert "ProductionServices:switch eventgate-analytics to use envoy"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578957 (owner: 10Alexandros Kosiaris) [14:18:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [14:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:41] (03CR) 10Jbond: [C: 03+2] check_puppet_run_changes: chgeck for >= and not just > [puppet] - 10https://gerrit.wikimedia.org/r/578923 (owner: 10Jbond) [14:19:11] (03Merged) 10jenkins-bot: Upstream release v0.0.32 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/578955 (owner: 10Volans) [14:20:43] (03Merged) 10jenkins-bot: Revert "Revert "ProductionServices:switch eventgate-analytics to use envoy"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578957 (owner: 10Alexandros Kosiaris) [14:20:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:20:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:13] !log akosiaris@deploy1001 Started scap: wmf-config/ProductionServices.php [14:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:53] !log switch mediawiki to talk to eventgate-analytics via envoy [14:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:51] !log uploaded spicerack_0.0.32-1_amd64.deb to apt.wikimedia.org stretch-wikimedia [14:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:55] !log akosiaris@deploy1001 sync aborted: wmf-config/ProductionServices.php (duration: 02m 42s) [14:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:28] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [14:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:57] (03PS2) 10Giuseppe Lavagetto: envoy: check for runtime variables set for a long time [puppet] - 10https://gerrit.wikimedia.org/r/578956 (https://phabricator.wikimedia.org/T247387) [14:25:25] !log akosiaris@deploy1001 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 01m 11s) [14:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:50] * akosiaris monitoring for errors [14:28:22] <_joe_> akosiaris: no errors I see for now [14:29:28] nor do I [14:29:38] what I do see is lower cpu usage on eventgate-analytics [14:29:54] <_joe_> on the service itself? [14:29:59] major garbage collections are going down as well [14:29:59] yes [14:30:23] that's the only thing that I notice. And the only thing I hoped to notice [14:30:38] oh ho ho [14:30:42] because connection reuse? [14:30:48] wait, you mean there's a cost to establishing new TLS sessions?
[14:30:52] <_joe_> possibly [14:30:57] well it wasn't TLS before [14:30:59] <_joe_> cdanis: the service was on http [14:31:04] oh [14:31:05] <_joe_> we're switching it to use tls now [14:31:06] but now it is TLS but with connection reuse [14:31:14] <_joe_> but it's going envoy2envoy [14:31:19] fascinating [14:31:21] pretty COoOol [14:31:28] looks very smooth! [14:31:36] <_joe_> so both mediawiki and eventgate see http connections [14:32:02] hm oh right eventgate still has to do the same # of http connection open&closes, right? [14:32:07] nothing has changed there. [14:32:11] _joe_: you sir, are democratizing service meshes [14:32:16] <_joe_> ottomata: nope [14:32:20] someone post that bash please :P [14:32:25] <_joe_> akosiaris: only with your help, sir [14:33:47] <_joe_> now if only we solved T246083 [14:33:47] T246083: Envoy can't connect to servers using TLS 1.3 (but can serve TLS 1.3 to clients) - https://phabricator.wikimedia.org/T246083 [14:33:51] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::data_purge: purge wdqs query logs [puppet] - 10https://gerrit.wikimedia.org/r/577469 (https://phabricator.wikimedia.org/T247034) (owner: 10DCausse) [14:34:12] <_joe_> we'll test again with 1.13.1 that rlazarus is building now [14:34:54] <_joe_> ottomata: wanna dare eventgate-main too? [14:35:02] <_joe_> that gets more requests, correct? [14:35:06] nope, less [14:35:10] 1 to 7 IIRC [14:35:16] <_joe_> so less to main?
[14:35:18] <_joe_> wow [14:35:56] some 12k to -analytics and some 1.5k to -main [14:36:05] we did the difficult thing first :P [14:36:28] <_joe_> sum(irate(envoy_http_downstream_rq_time_count{envoy_http_conn_manager_prefix="eventgate-analytics_egress",cluster="api_appserver"}[2m])) tells me about 1k to -analytics [14:36:41] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=now-30m&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&fullscreen&panelId=11 [14:36:42] <_joe_> from mediawiki itself [14:36:43] says otherwise? [14:36:50] or is that all events? [14:36:52] <_joe_> probably coming from other sources too [14:36:57] <_joe_> ottomata: ^^ [14:37:14] <_joe_> we do around 150 req/s from appservers, and 1k req/s from the apis [14:37:27] <_joe_> so I suppose we're getting messages from other sources too [14:37:28] 10Operations, 10Traffic, 10observability: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10CDanis) 25h later, there's additional skew between 1003 and 1004: {P10686} [14:38:24] ya analytics is busier than main [14:38:31] it has api requests + cirrussearch requests [14:38:55] all from mediawiki [14:39:11] <_joe_> ottomata: so you think we have 10 requests/second? [14:39:27] it also has sparql requests from WDQS [14:39:49] what's funny is https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=now-30m&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&fullscreen&panelId=28 .. The pod is no longer throttled after that [14:40:01] _joe_: ?
[14:40:06] <_joe_> but still, it's 11k vs the numbers I see in envoy which are closer to 1.5k [14:40:19] <_joe_> in terms of requests/s [14:40:23] (03PS2) 10Herron: elasticsearch: add 'disktype' param to configure node.attr.disktype [puppet] - 10https://gerrit.wikimedia.org/r/578653 (https://phabricator.wikimedia.org/T247376) [14:41:07] !log installed spicerack to 0.0.32-1 on cumin[12]001 [14:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:16] interestingly https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=now-30m&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&fullscreen&panelId=43 says a lot more _joe_ [14:41:25] _joe_: you mean envoy is reporting fewer req/s than eventgate is? [14:41:46] <_joe_> ottomata: yes [14:41:48] sure, it's kafka messages, not requests, so not directly comparable, but looking at the ratios between the topics [14:41:53] the api still is the big thing? [14:42:07] <_joe_> yes [14:42:08] (03CR) 10RLazarus: [C: 03+1] "This will alert if a runtime variable exists, even if it only has a base value set in the bootstrap config and isn't overridden at runtime" [puppet] - 10https://gerrit.wikimedia.org/r/578956 (https://phabricator.wikimedia.org/T247387) (owner: 10Giuseppe Lavagetto) [14:42:15] this one says the most: [14:42:16] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=eqiad.mediawiki.api-request&from=1583851323219&to=1583937723219 [14:42:20] <_joe_> ottomata: do we send multiple messages in an http request?
[14:42:25] since it is reported by kafka brokers, not by kafka client producer [14:43:33] <_joe_> sure [14:43:50] (03CR) 10Giuseppe Lavagetto: "> This will alert if a runtime variable exists, even if it only has a" [puppet] - 10https://gerrit.wikimedia.org/r/578956 (https://phabricator.wikimedia.org/T247387) (owner: 10Giuseppe Lavagetto) [14:44:00] those message numbers to add up to what https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=now-30m&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-analytics&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&fullscreen&panelId=43 says [14:44:07] s/to add up/do add up/ [14:44:16] <_joe_> ok [14:44:24] <_joe_> now try to plot what I indicated before [14:44:29] <_joe_> let me sum by cluster too [14:44:31] sum(irate(envoy_http_downstream_rq_time_count{envoy_http_conn_manager_prefix="eventgate-analytics_egress",cluster="api_appserver"}[2m])) [14:44:35] that's the query? [14:44:36] (03PS1) 10Hnowlan: changeprop: New release [deployment-charts] - 10https://gerrit.wikimedia.org/r/578964 (https://phabricator.wikimedia.org/T213193) [14:44:59] _joe_: it is possible to do so, but eventbus ext. doesn't buffer them, it only sends whatever it gets from the hooks (or in this case monolog via wfDebugLog) that are fired during the http request. [14:45:20] <_joe_> sorry I'm an idiot [14:45:25] <_joe_> it was 10k not 1k [14:45:27] <_joe_> ahah [14:45:29] "parse error at char 4: unexpected character: '\\ufeff'" how on earth did I get a BOM by copypasting from IRC to Grafana? [14:45:56] <_joe_> ok no discrepancy [14:45:58] <_joe_> sorry [14:46:26] _joe_: no worries, we'll just take back your "service mesh democratizer" badge.
[14:46:41] (03CR) 10Herron: elasticsearch: add 'disktype' param to configure node.attr.disktype (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578653 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [14:47:29] <_joe_> akosiaris: go ahead and migrate eventgate-main too! [14:47:44] feeling adventurous, aren't we? [14:47:45] (03CR) 10Ppchelko: [C: 03+2] changeprop: New release [deployment-charts] - 10https://gerrit.wikimedia.org/r/578964 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [14:47:46] hhehehh [14:47:52] let it be for a day I 'd say [14:48:08] (03Merged) 10jenkins-bot: changeprop: New release [deployment-charts] - 10https://gerrit.wikimedia.org/r/578964 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [14:48:47] 10Operations, 10ops-eqiad, 10Patch-For-Review: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (10BPirkle) Sorry, typo'd the phab task number in gerrit. My patch was supposed to be attached to T237852, not to this task. [14:49:32] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[14:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:14] (03PS1) 10Volans: sre.hosts.decommission: cache the Ipmi instance [cookbooks] - 10https://gerrit.wikimedia.org/r/578966 [14:59:29] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Add check for changes applied at all runs - https://phabricator.wikimedia.org/T242910 (10jbond) 05Open→03Resolved a:03jbond This is in production now, reopen if any issues [15:02:23] jbond42: fwiw I see that check in state UNKNOWN ("An unknown error occurred") on puppetdb1002 for 4h now https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetdb1002&service=Ensure+hosts+are+not+performing+a+change+on+every+puppet+run [15:03:30] cdanis: thanks as you are here can you review ^ :) [15:05:23] (03CR) 10RLazarus: [C: 03+1] "> Not with the current version of envoy. We definie a series of runtime variables but we leave them to the default value, and they don't a" [puppet] - 10https://gerrit.wikimedia.org/r/578956 (https://phabricator.wikimedia.org/T247387) (owner: 10Giuseppe Lavagetto) [15:05:33] cdanis: for clarity the current issue is fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/578923 this is just clean up [15:05:42] ack, looking now [15:05:47] thx [15:06:54] (03CR) 10CDanis: [C: 03+1] check_puppet_run_changes.py: Ensure we exit OK if warning > 1 [puppet] - 10https://gerrit.wikimedia.org/r/578967 (owner: 10Jbond) [15:08:42] (03CR) 10Jbond: [C: 03+2] check_puppet_run_changes.py: Ensure we exit OK if warning > 1 [puppet] - 10https://gerrit.wikimedia.org/r/578967 (owner: 10Jbond) [15:15:32] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10serviceops: WanObjectCache::getWithSetCallback seems not to set objects when fetching data is slow - https://phabricator.wikimedia.org/T244877 (10aaron) >>! In T244877#5873326, @Ladsgroup wrote: > Apparently this is a built-in and not-properly docume... 
[15:18:42] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:20:15] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Dwisehaupt) [15:20:48] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [15:21:52] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [15:22:21] jbond42: cool, WARNING now :) thanks! [15:23:54] np, cheers :) [15:31:10] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 58.68 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:33:03] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10elukey) From the cumin log I see the following: ` 2020-03-11 14:16:20 [INFO] (cmjohnson) wmf-auto-reimage::print_line: Started first puppet run (sit back, relax, and enjoy the wait... 
[15:36:05] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10Volans) @elukey you can see the failures here: https://puppetboard.wikimedia.org/node/stat1008.eqiad.wmnet [15:36:06] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 59.39 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:38:32] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 79.21 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:50:31] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10aborrero) [15:57:43] (03CR) 10Ema: "recheck" [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:07:49] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10elukey) The partition layout don't look what I expected: ` elukey@stat1008:~$ df -h Filesystem Size Used Avail Use% Mounted on udev 63G 0 63G... 
[16:12:24] (03CR) 10CDanis: [C: 03+2] "actually this is fine" [cookbooks] - 10https://gerrit.wikimedia.org/r/578647 (owner: 10CDanis) [16:14:58] (03Merged) 10jenkins-bot: cf: read api_token & account_id from a config file [cookbooks] - 10https://gerrit.wikimedia.org/r/578647 (owner: 10CDanis) [16:15:30] (03PS3) 10Ema: Debianization [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) [16:20:03] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/578653 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [16:21:00] (03CR) 10jerkins-bot: [V: 04-1] Debianization [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:24:40] (03PS4) 10Ema: Debianization [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) [16:25:07] !log restarting Zuul to clear queues (in collab with James F) [16:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:40] liw: LGTM. Thanks. CC Reedy. 
[16:25:59] (03CR) 10Ema: [C: 03+2] Debianization [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:26:07] (03PS2) 10Ema: Add basic testing [software/atskafka] - 10https://gerrit.wikimedia.org/r/578507 (https://phabricator.wikimedia.org/T237993) [16:26:38] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:26:46] (03CR) 10jerkins-bot: [V: 04-1] Add basic testing [software/atskafka] - 10https://gerrit.wikimedia.org/r/578507 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:27:51] (03PS3) 10Ema: Add basic testing [software/atskafka] - 10https://gerrit.wikimedia.org/r/578507 (https://phabricator.wikimedia.org/T237993) [16:29:08] (03CR) 10Ema: [C: 03+2] Add basic testing [software/atskafka] - 10https://gerrit.wikimedia.org/r/578507 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:29:17] (03PS3) 10Ema: Use json.Marshal [software/atskafka] - 10https://gerrit.wikimedia.org/r/578532 (https://phabricator.wikimedia.org/T237993) [16:29:38] (03CR) 10Jforrester: "Hurrah." [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:29:50] (03PS2) 10Jforrester: Handle rdkafka statistics [software/atskafka] - 10https://gerrit.wikimedia.org/r/578549 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:35:47] (03CR) 10Muehlenhoff: [C: 03+2] Fix user search in modify-mfa [puppet] - 10https://gerrit.wikimedia.org/r/578952 (owner: 10Muehlenhoff) [16:36:20] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[16:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:34] (03PS1) 10Jbond: utils: add puppet csr generation script [puppet] - 10https://gerrit.wikimedia.org/r/578988 [16:50:52] (03CR) 10jerkins-bot: [V: 04-1] utils: add puppet csr generation script [puppet] - 10https://gerrit.wikimedia.org/r/578988 (owner: 10Jbond) [16:52:12] (03PS1) 10Cmjohnson: Add htmldumper1001 to dhcpd file and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/578989 (https://phabricator.wikimedia.org/T245567) [16:52:19] (03PS2) 10Jbond: utils: add puppet csr generation script [puppet] - 10https://gerrit.wikimedia.org/r/578988 [16:53:06] !log removed cas-2020-03-09.log and cas-2020-03-10.log on idp2001 (huge logs due to some debug log level for tracking down a performance issue) [16:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:06] (03PS1) 10Cmjohnson: Add htmldumper1001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/578990 (https://phabricator.wikimedia.org/T245567) [16:54:40] (03CR) 10Cmjohnson: [C: 03+2] Add htmldumper1001 to dhcpd file and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/578989 (https://phabricator.wikimedia.org/T245567) (owner: 10Cmjohnson) [16:55:22] (03CR) 10RobH: [C: 03+1] Create roles for the initial setup of a server [puppet] - 10https://gerrit.wikimedia.org/r/575485 (owner: 10Muehlenhoff) [16:55:42] (03PS1) 10Volans: spicerack: align config to the latest release [puppet] - 10https://gerrit.wikimedia.org/r/578991 [16:55:50] (03CR) 10Cmjohnson: [C: 03+2] Add htmldumper1001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/578990 (https://phabricator.wikimedia.org/T245567) (owner: 10Cmjohnson) [16:56:59] elukey: did everything turn out okay with stat1008? 
[16:58:14] (03CR) 10CDanis: [C: 03+1] spicerack: align config to the latest release [puppet] - 10https://gerrit.wikimedia.org/r/578991 (owner: 10Volans) [16:58:50] (03CR) 10Guozr.im: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [16:59:31] (03CR) 10Volans: [C: 03+2] spicerack: align config to the latest release [puppet] - 10https://gerrit.wikimedia.org/r/578991 (owner: 10Volans) [17:02:31] (03CR) 10Dzahn: [C: 03+1] "I just talked with dcops about this (avoiding servers with OS but without roles) and we agreed to lways add new servers in site.pp with ro" [puppet] - 10https://gerrit.wikimedia.org/r/575485 (owner: 10Muehlenhoff) [17:04:00] (03CR) 10Dzahn: [C: 03+1] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/575485 (owner: 10Muehlenhoff) [17:06:06] (03CR) 10Muehlenhoff: "@dzahn: role::spare is mis-named and has the issue that role::spare includes base::firewall, which some roles don't use and we can't easil" [puppet] - 10https://gerrit.wikimedia.org/r/575485 (owner: 10Muehlenhoff) [17:08:35] (03PS3) 10Muehlenhoff: Create roles for the initial setup of a server [puppet] - 10https://gerrit.wikimedia.org/r/575485 [17:10:28] (03PS6) 10Jcrespo: wmfbackups: Add new simple script to analyze dump row ids [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/577224 (https://phabricator.wikimedia.org/T244884) [17:10:30] (03PS1) 10Jcrespo: mariadb-backups: Add script to perform external storage incrementals [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578992 (https://phabricator.wikimedia.org/T244884) [17:10:53] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Add script to perform external storage incrementals [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578992 (https://phabricator.wikimedia.org/T244884) (owner: 10Jcrespo) [17:13:06] (03CR) 10Muehlenhoff: [C: 03+2] Create roles for the initial setup of a server 
[puppet] - 10https://gerrit.wikimedia.org/r/575485 (owner: 10Muehlenhoff) [17:14:45] (03PS2) 10Jcrespo: mariadb-backups: Add script to perform external storage incrementals [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578992 (https://phabricator.wikimedia.org/T244884) [17:16:20] cmjohnson1: I updated the task, the only thing that seems off is the final size of the /srv partition, I expected more TBs.. I have to check if it is the partman recipe or something else [17:16:58] okay, that may be the recipe, the h/w is set correctly or it would not have installed. [17:17:43] I used the same partman recipe as the other stat servers but they do not have the same capacity as the new one. [17:19:59] (03PS1) 10Dzahn: use new role(insetup) on a few hosts in setup [puppet] - 10https://gerrit.wikimedia.org/r/578996 [17:20:25] (03PS1) 10CDanis: sre.network.cf: Use a requests.Session object [cookbooks] - 10https://gerrit.wikimedia.org/r/578998 [17:20:37] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) setup/install sretest100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T245754 (10Cmjohnson) [17:20:59] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/578996 (owner: 10Dzahn) [17:21:02] (03PS1) 10Cmjohnson: Add production dns for sretest100[12] [dns] - 10https://gerrit.wikimedia.org/r/578999 (https://phabricator.wikimedia.org/T245754) [17:21:44] (03CR) 10Cmjohnson: [C: 03+2] Add production dns for sretest100[12] [dns] - 10https://gerrit.wikimedia.org/r/578999 (https://phabricator.wikimedia.org/T245754) (owner: 10Cmjohnson) [17:22:00] (03CR) 10Dzahn: [C: 03+2] use new role(insetup) on a few hosts in setup [puppet] - 10https://gerrit.wikimedia.org/r/578996 (owner: 10Dzahn) [17:24:36] (03PS2) 10Dzahn: use new role(insetup) on a few hosts in setup [puppet] - 10https://gerrit.wikimedia.org/r/578996 (https://phabricator.wikimedia.org/T245567) [17:25:37] cmjohnson1: the other stat boxes have a lvs volume of ~7.something TB, they should 
be similar in hw specs to 1008 no? [17:26:09] (03CR) 10Volans: "LGTM, just one doubt inline." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/578998 (owner: 10CDanis) [17:26:26] (03PS1) 10C. Scott Ananian: Remove $wgVisualEditorRestbaseParsoidVariant, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579004 (https://phabricator.wikimedia.org/T229074) [17:26:56] cmjohnson1: ah wait you mean that 1008 might have less disk space available [17:27:28] (03CR) 10Volans: [C: 03+1] "+1 for me as I said in the related patch." [puppet] - 10https://gerrit.wikimedia.org/r/534005 (owner: 10Dzahn) [17:28:23] in theory no, I see 8x1.8T disks [17:28:36] (03CR) 10Dzahn: "puppet without issues on parse2001, wdqs1011. htmldumper1001 though is not reachable for me." [puppet] - 10https://gerrit.wikimedia.org/r/578996 (https://phabricator.wikimedia.org/T245567) (owner: 10Dzahn) [17:28:53] ah but partman may use only 4 of them, right [17:29:43] so the correct recipe is raid10-8dev [17:30:32] you can specify the number of disks in netboot.cfg [17:31:08] yep yep [17:31:14] (03CR) 10Dzahn: "used here https://gerrit.wikimedia.org/r/c/operations/puppet/+/578996" [puppet] - 10https://gerrit.wikimedia.org/r/575485 (owner: 10Muehlenhoff) [17:31:22] currently all stat* hosts are set to 4dev, if those actually have 8 disks, it needs to use raid10-8dev.cfg [17:31:40] moritzm: sending a code change in a sec [17:31:44] k [17:31:54] (03PS1) 10C. 
Scott Ananian: Remove no-longer-necessary $wmgParsoidVariant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579005 (https://phabricator.wikimedia.org/T229015) [17:32:24] elukey was just about to tell you that it's 8 disks on the new one not 4 [17:32:27] (03PS1) 10Elukey: Use raid10-8dev for stat100x hosts [puppet] - 10https://gerrit.wikimedia.org/r/579006 (https://phabricator.wikimedia.org/T246472) [17:33:52] cmjohnson1, moritzm --^ [17:34:07] (03PS1) 10Dzahn: site: add missing parse2010 to regex [puppet] - 10https://gerrit.wikimedia.org/r/579007 (https://phabricator.wikimedia.org/T243112) [17:34:44] (03CR) 10Dzahn: "re: 3 servers without role." [puppet] - 10https://gerrit.wikimedia.org/r/579007 (https://phabricator.wikimedia.org/T243112) (owner: 10Dzahn) [17:35:04] (03CR) 10Muehlenhoff: Use raid10-8dev for stat100x hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579006 (https://phabricator.wikimedia.org/T246472) (owner: 10Elukey) [17:35:20] (03PS1) 10Cmjohnson: Add sretest* hosts to netboot.cfg and dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/579008 (https://phabricator.wikimedia.org/T245754) [17:36:00] (03CR) 10Muehlenhoff: [C: 03+1] "netboot.cfg part looks good" [puppet] - 10https://gerrit.wikimedia.org/r/579008 (https://phabricator.wikimedia.org/T245754) (owner: 10Cmjohnson) [17:37:30] moritzm: ah right my bad, thanks :) [17:37:45] 1.8T vs 3.6T [17:38:11] (03PS1) 10Cmjohnson: Add srestest100[12] role spare in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/579009 (https://phabricator.wikimedia.org/T245754) [17:39:00] mortizm can you check the site.pp ^^ [17:39:01] cmjohnson1: could you please use role(insetup) instead of spare now [17:39:08] it literally just changed on our end [17:39:23] thanks for adding to site.pp [17:39:40] mutante sure thing! thanks [17:39:41] brandnew role for the purpose [17:39:57] to make it more obvious what is in setup and what is actual spare. [17:40:28] mutante: please confirm this is correct? 
[17:40:31] https://www.irccloud.com/pastebin/RLuB4qYF/ [17:40:32] either way the goal is we don't have any without a role.. just adding one to the last remaining 3 [17:40:38] (03PS2) 10Elukey: Use raid10-8dev for stat1008 [puppet] - 10https://gerrit.wikimedia.org/r/579006 (https://phabricator.wikimedia.org/T246472) [17:41:08] cmjohnson1: yes, that is correct. here is what i merged a few minutes ago https://gerrit.wikimedia.org/r/c/operations/puppet/+/578996/2/manifests/site.pp [17:41:55] thanks [17:42:16] had no issues on these, it just adds the firewall. the only special case would be if for whatever reason the service_owner requested to not have firewall. but that should be rare [17:42:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/579006 (https://phabricator.wikimedia.org/T246472) (owner: 10Elukey) [17:42:42] (03PS2) 10Cmjohnson: Add srestest100[12] role(insetup) in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/579009 (https://phabricator.wikimedia.org/T245754) [17:42:54] then there is "insetup_noferm" for that if ever needed [17:43:16] cool [17:43:21] (03CR) 10Elukey: [C: 03+2] Use raid10-8dev for stat1008 [puppet] - 10https://gerrit.wikimedia.org/r/579006 (https://phabricator.wikimedia.org/T246472) (owner: 10Elukey) [17:43:25] (03PS1) 10Dzahn: site: add role(insetup) to wdqs200[78] [puppet] - 10https://gerrit.wikimedia.org/r/579010 (https://phabricator.wikimedia.org/T242301) [17:45:06] (03PS2) 10Cmjohnson: Add sretest* hosts to netboot.cfg and dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/579008 (https://phabricator.wikimedia.org/T245754) [17:47:33] (03CR) 10Dzahn: [C: 03+1] Add srestest100[12] role(insetup) in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/579009 (https://phabricator.wikimedia.org/T245754) (owner: 10Cmjohnson) [17:47:48] (03CR) 10DannyS712: "Recheck" [puppet] - 10https://gerrit.wikimedia.org/r/578609 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [17:47:55] (03CR) 
10DannyS712: "Recheck" [puppet] - 10https://gerrit.wikimedia.org/r/578611 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [17:48:47] (03CR) 10QEDK: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/578611 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [17:49:42] (03CR) 10DannyS712: "> > Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/578611 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [17:50:31] (03CR) 10Dzahn: [C: 03+1] "this is more a "i like this feature" than a "i have tested the code" +1, but thanks for doing it :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/578966 (owner: 10Volans) [17:52:45] (03CR) 10QEDK: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/578611 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [17:53:05] (03CR) 10Elukey: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/579006 (https://phabricator.wikimedia.org/T246472) (owner: 10Elukey) [17:53:20] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [17:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:15] I am checking https://integration.wikimedia.org/zuul/ and I see some queue for ops puppet, is CI working? [17:57:49] (03CR) 10C. Scott Ananian: "ping? what's the current status of this?" [puppet] - 10https://gerrit.wikimedia.org/r/576493 (https://phabricator.wikimedia.org/T246854) (owner: 10Dzahn) [18:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200311T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:04:12] (03PS1) 10Hnowlan: changeprop: Nutcracker sidecar fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/579016 (https://phabricator.wikimedia.org/T213193) [18:05:14] (03PS1) 10Dzahn: site: let new ganeti nodes and logstash1008 use role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/579017 (https://phabricator.wikimedia.org/T228924) [18:06:10] elukey: I triggered a restart of CI Jenkins half an hour ago and it only just took, sorry. [18:06:28] elukey: Should be back up soon. [18:06:35] James_F: thanks for the info! [18:06:43] (It's back.) [18:07:26] (03CR) 10Dzahn: "from comments on ticket we need to reduce the "apc_shm_size" for this to work (per commit message that was so far just a copy/paste from p" [puppet] - 10https://gerrit.wikimedia.org/r/576493 (https://phabricator.wikimedia.org/T246854) (owner: 10Dzahn) [18:08:28] (03CR) 10Dzahn: "i'd say we should keep using Horizon hiera until we have it working and then copy from there over here and remove the things from the web " [puppet] - 10https://gerrit.wikimedia.org/r/576493 (https://phabricator.wikimedia.org/T246854) (owner: 10Dzahn) [18:09:10] (03PS1) 10C. Scott Ananian: WIP: update linter whitelist. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579018 [18:13:55] (03PS1) 10Dzahn: site: let new logstash machines use role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/579019 (https://phabricator.wikimedia.org/T240881) [18:14:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) setup/install sretest100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T245754 (10RobH) a:05RobH→03Cmjohnson [18:14:44] (03CR) 10Jforrester: [C: 03+2] Remove $wgVisualEditorRestbaseParsoidVariant, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579004 (https://phabricator.wikimedia.org/T229074) (owner: 10C. Scott Ananian) [18:14:50] I'll do a little light deploying. 
[18:15:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` stat1008.eqiad.wmnet ` The log can be found in `/var/l... [18:15:53] (03Merged) 10jenkins-bot: Remove $wgVisualEditorRestbaseParsoidVariant, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579004 (https://phabricator.wikimedia.org/T229074) (owner: 10C. Scott Ananian) [18:16:59] (03CR) 10Jforrester: [C: 03+2] Remove no-longer-necessary $wmgParsoidVariant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579005 (https://phabricator.wikimedia.org/T229015) (owner: 10C. Scott Ananian) [18:17:02] (03CR) 10Dzahn: [C: 03+2] site: add missing parse2010 to regex [puppet] - 10https://gerrit.wikimedia.org/r/579007 (https://phabricator.wikimedia.org/T243112) (owner: 10Dzahn) [18:17:10] (03PS1) 10Jforrester: ProductionServices: Stop defining the 'parsoid' JS service, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579020 (https://phabricator.wikimedia.org/T229015) [18:17:37] (03PS2) 10Dzahn: site: add missing parse2010 to regex [puppet] - 10https://gerrit.wikimedia.org/r/579007 (https://phabricator.wikimedia.org/T243112) [18:18:03] (03Merged) 10jenkins-bot: Remove no-longer-necessary $wmgParsoidVariant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579005 (https://phabricator.wikimedia.org/T229015) (owner: 10C. 
Scott Ananian) [18:20:52] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: cache the Ipmi instance [cookbooks] - 10https://gerrit.wikimedia.org/r/578966 (owner: 10Volans) [18:21:16] mutante: I'm merging it, I know you have decoms to do very soon, you'll probably be its first tester ;) [18:21:19] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Stop using wmgParsoidVariant, no longer varied T229015 (duration: 01m 08s) [18:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:24] let me know if you get any issue [18:21:25] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [18:21:38] and feel free to revert in the worst case scenario this is somehow blocing [18:21:41] *blocking [18:21:50] volans: yea, sounds good! [18:22:02] (03CR) 10EBernhardson: [C: 03+1] logstash: add ES 7 compatible logstash template [puppet] - 10https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T247014) (owner: 10Herron) [18:22:03] when do you plan to decom? 
[18:22:15] today [18:22:17] if it's more than 30m I'll skip the force puppet run [18:22:23] 30m from now [18:22:26] oh, in minutes, heh [18:22:45] yea, maybe 60 min [18:22:53] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wmgParsoidVariant, no longer read T229015 (duration: 01m 07s) [18:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:58] ok, then you should get the updated cookbook already [18:23:09] yea, sounds good, dont worry about forcing puppet [18:23:20] k [18:24:02] (03PS2) 10Dzahn: site: add role(insetup) to wdqs200[78] [puppet] - 10https://gerrit.wikimedia.org/r/579010 (https://phabricator.wikimedia.org/T242301) [18:24:18] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 07s) [18:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:34] volans: after that next merge above there won't be any servers not in puppet anymore [18:24:41] \o/ [18:24:48] (03CR) 10Hashar: "recheck" [debs/contenttranslation/apertium-recursive] - 10https://gerrit.wikimedia.org/r/578704 (https://phabricator.wikimedia.org/T234181) (owner: 10KartikMistry) [18:25:11] !log temporary disabled puppet on A:dns-auth to deploy g/578506 T233183 [18:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:16] T233183: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 [18:25:39] (03CR) 10Dzahn: [C: 03+2] site: add role(insetup) to wdqs200[78] [puppet] - 10https://gerrit.wikimedia.org/r/579010 (https://phabricator.wikimedia.org/T242301) (owner: 10Dzahn) [18:26:41] (03PS1) 10C. Scott Ananian: parsoidphp is dead, long live parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579021 [18:27:46] (03CR) 10Jforrester: "This is undeployable." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/579021 (owner: 10C. Scott Ananian) [18:28:00] (03CR) 10Dzahn: "per Moritz we had only 3 servers currently remaining that were up but not in site.pp. These were parse2010 and wdqs2007/2008. These were " [puppet] - 10https://gerrit.wikimedia.org/r/534005 (owner: 10Dzahn) [18:28:17] (03CR) 10Volans: [C: 03+2] dns: add the Netbox driven DNS zonefile snippets [puppet] - 10https://gerrit.wikimedia.org/r/578506 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [18:29:04] (03PS2) 10Dzahn: site: let new ganeti nodes and logstash1008 use role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/579017 (https://phabricator.wikimedia.org/T228924) [18:29:44] (03CR) 10Krinkle: ":) See also https://wikitech.wikimedia.org/wiki/SWAT_deploys#Guidelines "Forbidden patches" (T187761)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579021 (owner: 10C. Scott Ananian) [18:30:43] (03CR) 10Dzahn: "adding CCs to let you know about the new roles" [puppet] - 10https://gerrit.wikimedia.org/r/579017 (https://phabricator.wikimedia.org/T228924) (owner: 10Dzahn) [18:31:31] (03CR) 10Dzahn: [C: 03+2] site: let new ganeti nodes and logstash1008 use role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/579017 (https://phabricator.wikimedia.org/T228924) (owner: 10Dzahn) [18:32:06] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [18:32:22] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [18:33:17] (03PS2) 10Dzahn: site: let new logstash machines use role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/579019 (https://phabricator.wikimedia.org/T240881) [18:33:39] (03PS3) 10Dzahn: site: let new logstash machines use role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/579019 (https://phabricator.wikimedia.org/T240881) [18:33:40] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [18:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:14] (03CR) 10Dzahn: "fyi, we are now using these new roles added in https://gerrit.wikimedia.org/r/c/operations/puppet/+/575485" [puppet] - 10https://gerrit.wikimedia.org/r/579019 (https://phabricator.wikimedia.org/T240881) (owner: 10Dzahn) [18:34:25] (03CR) 10Dzahn: "fyi, we are now using these new roles added in https://gerrit.wikimedia.org/r/c/operations/puppet/+/575485" [puppet] - 10https://gerrit.wikimedia.org/r/579017 (https://phabricator.wikimedia.org/T228924) (owner: 10Dzahn) [18:34:52] (03CR) 10Dzahn: [C: 03+2] site: let new logstash machines use role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/579019 (https://phabricator.wikimedia.org/T240881) (owner: 10Dzahn) [18:36:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['stat1008.eqiad.wmnet'] ` and were **ALL** successful. [18:44:23] \o/ [18:46:14] elukey: :) and using new role(insetup) now, until it gets the prod role whenever you want to switch [18:47:00] nice! 
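Note on the stat1008 sizing thread above: the host was reported to have 8x1.8T disks, while the 4-device partman recipe would only use four of them. A rough sketch of the arithmetic, assuming classic RAID10 with two-way mirroring (usable space is half the raw space of the member disks). The function name and the two-copy assumption are illustrative and are not taken from the partman recipes themselves.

```python
def raid10_usable_tb(num_disks: int, disk_size_tb: float) -> float:
    """Estimate usable capacity of a RAID10 array.

    Assumption (not from the log): two-way mirroring over an even
    number of equally sized member disks, so half the raw space is usable.
    """
    if num_disks < 4 or num_disks % 2 != 0:
        raise ValueError("this sketch expects an even number of disks, at least 4")
    if disk_size_tb <= 0:
        raise ValueError("disk size must be positive")
    return num_disks * disk_size_tb / 2


if __name__ == "__main__":
    # Compare a 4-device layout with an 8-device layout on 1.8T disks.
    for members in (4, 8):
        usable = raid10_usable_tb(members, 1.8)
        print(f"{members} x 1.8T in RAID10: about {usable:.1f}T usable")
```

Under these assumptions four members give roughly 3.6T and eight give roughly 7.2T, which is in the range of the "~7.something TB" mentioned for the other stat boxes.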
[18:47:25] /dev/md0 vg0 lvm2 a-- 7.27t 1.45t [18:47:28] much better now [18:48:31] great [18:48:42] (03PS1) 10Dzahn: site: let new dbproxy machines use role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/579024 (https://phabricator.wikimedia.org/T202367) [18:49:05] (03CR) 10Dzahn: "fyi" [puppet] - 10https://gerrit.wikimedia.org/r/579024 (https://phabricator.wikimedia.org/T202367) (owner: 10Dzahn) [18:50:24] (03CR) 10Dzahn: [C: 03+2] site: let new dbproxy machines use role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/579024 (https://phabricator.wikimedia.org/T202367) (owner: 10Dzahn) [18:50:36] (03PS2) 10Dzahn: site: let new dbproxy machines use role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/579024 (https://phabricator.wikimedia.org/T202367) [18:51:36] (03PS3) 10Cmjohnson: Add sretest* hosts to netboot.cfg and dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/579008 (https://phabricator.wikimedia.org/T245754) [18:53:49] (03CR) 10Cmjohnson: [C: 03+2] Add sretest* hosts to netboot.cfg and dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/579008 (https://phabricator.wikimedia.org/T245754) (owner: 10Cmjohnson) [18:54:09] (03PS3) 10Cmjohnson: Add srestest100[12] role(insetup) in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/579009 (https://phabricator.wikimedia.org/T245754) [18:56:43] (03CR) 10Cmjohnson: [C: 03+2] Add srestest100[12] role(insetup) in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/579009 (https://phabricator.wikimedia.org/T245754) (owner: 10Cmjohnson) [18:57:34] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (10Cmjohnson) [18:57:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) setup/install sretest100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T245754 (10Cmjohnson) [19:00:04] brennen and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. 
Rise for Mediawiki train - American+European Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200311T1900). [19:00:12] (03CR) 10EBernhardson: [C: 03+1] elasticsearch: add 'disktype' param to configure node.attr.disktype [puppet] - 10https://gerrit.wikimedia.org/r/578653 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [19:03:05] 10Operations, 10serviceops: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 (10Dzahn) [19:03:35] wdoran: K.rinkle mentioned CPT might need a slight delay on group1 rollout of train to finish some triage - any word on that? [19:04:53] (03PS1) 10Dzahn: add new codfw parsoid servers into production [puppet] - 10https://gerrit.wikimedia.org/r/579026 (https://phabricator.wikimedia.org/T247441) [19:05:57] earlier question cc: anomie [19:07:00] (03PS3) 10Dzahn: move 20 new codfw parsoid servers into production [puppet] - 10https://gerrit.wikimedia.org/r/579026 (https://phabricator.wikimedia.org/T247441) [19:07:11] brennen: I don't know anything about a delay to finish triage. If there is a need, it's for someone else. [19:07:58] anomie: thx. [19:08:13] (pfffioouu 20 servers for Parsoid, Visual Editor has some success now) [19:09:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/534005 (owner: 10Dzahn) [19:09:22] brennen: To be clear, by "someone else" I only mean "not me". The someone else might still be on CPT. [19:09:38] yeah, i'm clarifying. ta. [19:10:29] hey brennen: yep, we would like a slight delay so that we can be better set up to triage for errors using Krinkle's process as part of handing off that process from Krinkle to us. [19:11:41] slight delay in the sense of "in today's train rollout", or slight delay in the sense of some modification to existing train schedule in general? 
[19:12:03] (sorry, i appear to be a touch out of the loop) [19:14:03] it's the bug about the bug reporting bug quitting when there are too many bug messages [19:14:33] https://phabricator.wikimedia.org/T237109 [19:15:26] wdoran: could use some clarity on this soonish; i'm currently burning train window. [19:16:18] brennen: Sorry, currently in a meeting. We would need a delay for this train but we should also revisit generally how we integrate this triage into the train process so that we're not blocking [19:17:03] 10Operations, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) [x] enabled console redirection after boot [19:18:29] (03PS1) 10Jforrester: [trwiki] Restore pre-unblocking celebration logo versions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579028 (https://phabricator.wikimedia.org/T247445) [19:19:26] wdoran: Please file UBN train blocker tasks if you want the train blocked. Word of mouth on IRC isn't a good process. [19:20:28] James_F: right will do, I've actually already reached out to Tyler about this, I was not expecting this to be a discussion on IRC [19:20:41] Filed T247446 [19:20:42] T247446: CPT want the wmf.23 train blocked for group1 - https://phabricator.wikimedia.org/T247446 [19:21:24] wdoran: Sure. [19:21:49] James_F: thank you [19:21:56] wdoran: i don't think there's a huge issue holding 'til, say, after the 14:00 UTC services window. if someone can let me know on ticket when you're good to proceed that'd be appreciated. [19:22:41] and a ticket for figuring out general changes that might be needed would also be good, when you've got time to file it. [19:23:38] brennen: ok thanks [19:28:32] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. 
- https://phabricator.wikimedia.org/T245567 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` htmldumper1001.eqiad.wmnet ` The log can be foun... [19:32:56] (03CR) 10Jforrester: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/576493 (https://phabricator.wikimedia.org/T246854) (owner: 10Dzahn) [19:35:45] (03CR) 10Herron: [C: 03+2] logstash: add ES 7 compatible logstash template [puppet] - 10https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T247014) (owner: 10Herron) [19:36:41] (03PS1) 10Dzahn: decom elnath.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/579032 [19:38:56] (03CR) 10Herron: [C: 03+1] decom elnath.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/579032 (owner: 10Dzahn) [19:45:02] (03PS1) 10Dzahn: puppetmaster: stop allowing elnath [puppet] - 10https://gerrit.wikimedia.org/r/579034 [19:46:03] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10Cmjohnson) [19:46:16] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10Cmjohnson) 05Open→03Resolved [19:47:50] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) setup/install sretest100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T245754 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` sretest1001.eqiad.wmnet ` The log can be found in `/var/log/w... [19:47:55] (03CR) 10C. Scott Ananian: "> :) See also https://wikitech.wikimedia.org/wiki/SWAT_deploys#Guidelines" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579021 (owner: 10C. Scott Ananian) [19:48:21] (03PS1) 10C. 
Scott Ananian: Parsoid10 is dead, long live parsoid11 [puppet] - 10https://gerrit.wikimedia.org/r/579035 (https://phabricator.wikimedia.org/T246833) [19:48:52] (03CR) 10jerkins-bot: [V: 04-1] Parsoid10 is dead, long live parsoid11 [puppet] - 10https://gerrit.wikimedia.org/r/579035 (https://phabricator.wikimedia.org/T246833) (owner: 10C. Scott Ananian) [19:49:35] 10Operations, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) [19:50:15] (03PS2) 10Volans: sre.hosts.decommission: cache the Ipmi instance [cookbooks] - 10https://gerrit.wikimedia.org/r/578966 [19:50:34] mutante: sorry didn't notice there was a merge conflict [19:51:36] volans: ah, no problem. got distracted with other things anyways :) [19:51:46] will be merged in few min [19:52:04] cool [19:53:07] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [19:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:58] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) setup/install sretest100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T245754 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` sretest1002.eqiad.wmnet ` The log can be found in `/var/log/w... [19:54:32] (03PS3) 10Volans: sre.dns.netbox: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/578531 (https://phabricator.wikimedia.org/T233183) [19:56:26] (03PS1) 10Dzahn: devtools (cloud): rm beta hostnames from scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/579040 [19:56:28] (03CR) 10C. Scott Ananian: "In retrospect this patch should have been split in two, in case the removal of $wmgParsoidVariant was sync'ed before CommonSettings.php wa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579005 (https://phabricator.wikimedia.org/T229015) (owner: 10C. Scott Ananian) [19:57:00] (03PS2) 10C. 
Scott Ananian: parsoidphp is dead, long live parsoid (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579021 [19:57:02] (03PS1) 10C. Scott Ananian: parsoidphp is dead, long live parsoid (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579042 [19:57:04] (03PS1) 10C. Scott Ananian: parsoidphp is dead, long live parsoid (part 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579043 [19:57:57] (03CR) 10C. Scott Ananian: "ok, fixed the deploy-order issue." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579021 (owner: 10C. Scott Ananian) [19:58:10] mutante: all yours [19:58:20] volans: perfect, thanks [19:58:23] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/578531 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [19:59:13] (03CR) 10Paladox: [C: 03+1] devtools (cloud): rm beta hostnames from scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/579040 (owner: 10Dzahn) [20:00:04] halfak and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200311T2000). 
[20:00:20] (03Merged) 10jenkins-bot: sre.dns.netbox: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/578531 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [20:00:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [20:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:40] (03CR) 10Dzahn: [C: 03+2] devtools (cloud): rm beta hostnames from scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/579040 (owner: 10Dzahn) [20:02:51] (03PS2) 10Dzahn: devtools (cloud): rm beta hostnames from scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/579040 [20:03:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [20:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:08] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) setup/install sretest100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T245754 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['sretest1001.eqiad.wmnet'] ` and were **ALL** successful. [20:09:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:06] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) setup/install sretest100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T245754 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['sretest1002.eqiad.wmnet'] ` and were **ALL** successful. [20:13:08] (03CR) 10C. 
Scott Ananian: [C: 03+1] devtools (cloud): rm beta hostnames from scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/579040 (owner: 10Dzahn) [20:13:31] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) setup/install sretest100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T245754 (10Cmjohnson) [20:13:52] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) setup/install sretest100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T245754 (10Cmjohnson) 05Open→03Resolved resolving this task, all DC ops work has been completed [20:13:54] 10Operations: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10Cmjohnson) [20:16:30] (03PS2) 10C. Scott Ananian: Parsoid10 is dead, long live parsoid11 [puppet] - 10https://gerrit.wikimedia.org/r/579035 (https://phabricator.wikimedia.org/T246833) [20:20:02] (03CR) 10Dzahn: [C: 03+1] "want it merged now?" [puppet] - 10https://gerrit.wikimedia.org/r/579035 (https://phabricator.wikimedia.org/T246833) (owner: 10C. Scott Ananian) [20:20:03] going to go ahead with train to group1 [20:22:14] (03PS1) 10Brennen Bearnes: group1 wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579050 [20:22:16] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579050 (owner: 10Brennen Bearnes) [20:23:32] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579050 (owner: 10Brennen Bearnes) [20:24:10] (03CR) 10Dzahn: [C: 03+2] puppet/site: fail if no role has been assigned to a node [puppet] - 10https://gerrit.wikimedia.org/r/534005 (owner: 10Dzahn) [20:24:54] (03CR) 10Dzahn: [C: 03+2] Parsoid10 is dead, long live parsoid11 [puppet] - 10https://gerrit.wikimedia.org/r/579035 (https://phabricator.wikimedia.org/T246833) (owner: 10C. 
Scott Ananian) [20:25:45] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.23 [20:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:49] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.23 (duration: 01m 03s) [20:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:23] (03PS4) 10Dzahn: puppet/site: fail if no role has been assigned to a node [puppet] - 10https://gerrit.wikimedia.org/r/534005 [20:38:17] (03CR) 10Dzahn: [C: 03+2] puppet/site: fail if no role has been assigned to a node [puppet] - 10https://gerrit.wikimedia.org/r/534005 (owner: 10Dzahn) [20:38:35] ^ puppet will now fail if there is no role assigned to a machine [20:38:49] we just removed the last 3 machines that had none. so it should be no alerts [20:39:07] still watching for puppet failures in the next few minutes [20:42:54] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10Jclark-ctr) replaced drive slot 17, removed 2nd failed drive in slot 18 [20:45:52] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for ganeti2019 to ganeti2024 [dns] - 10https://gerrit.wikimedia.org/r/579053 [20:48:33] (03PS6) 10Bstorm: Add rate limiting to profile::toolforge::mailrelay with warn action [puppet] - 10https://gerrit.wikimedia.org/r/379239 (https://phabricator.wikimedia.org/T175964) (owner: 10Herron) [20:49:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10Jclark-ctr) Removed failed drive from slot 18 [20:49:43] (03CR) 10Bstorm: "Resurrecting this by changing it to the profile, in case we still want it." 
[puppet] - 10https://gerrit.wikimedia.org/r/379239 (https://phabricator.wikimedia.org/T175964) (owner: 10Herron) [20:51:58] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10JHedden) 05Open→03Resolved This host had 2 spares configured, the disk in this task happened to be one of them. @Jclark-ctr removed the failed drive an... [20:52:47] !log cdanis@cumin2001 START - Cookbook sre.hosts.decommission [20:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:19] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10JHedden) Cleaned up the RAID config with `hpssacli ctrl slot=0 array b remove spares=2I:1:18` [20:53:37] !log cdanis@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [20:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:00] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10JHedden) 05Open→03Resolved [20:59:04] (03PS1) 10CDanis: remove references to grafana1001 [puppet] - 10https://gerrit.wikimedia.org/r/579055 (https://phabricator.wikimedia.org/T242992) [20:59:48] volans: ^ that was also a run of the decom book right there [21:01:02] (03CR) 10Dzahn: [C: 03+1] remove references to grafana1001 [puppet] - 10https://gerrit.wikimedia.org/r/579055 (https://phabricator.wikimedia.org/T242992) (owner: 10CDanis) [21:02:24] (03PS2) 10CDanis: sre.network.cf: Use a requests.Session object [cookbooks] - 10https://gerrit.wikimedia.org/r/578998 [21:02:26] (03PS1) 10CDanis: cf: remove on_demand output; will always be True for us [cookbooks] - 10https://gerrit.wikimedia.org/r/579056 [21:03:37] (03CR) 10CDanis: sre.network.cf: Use a requests.Session object (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/578998 (owner: 10CDanis) [21:03:51] (03CR) 10CDanis: [C: 03+2] remove references to grafana1001 [puppet] - 10https://gerrit.wikimedia.org/r/579055 (https://phabricator.wikimedia.org/T242992) (owner: 10CDanis) [21:04:06] (03PS2) 10CDanis: remove references to grafana1001 [puppet] - 10https://gerrit.wikimedia.org/r/579055 (https://phabricator.wikimedia.org/T242992) [21:04:38] (03PS1) 10Dzahn: remove grafana1001.eqiad.wmnet. [dns] - 10https://gerrit.wikimedia.org/r/579057 (https://phabricator.wikimedia.org/T242992) [21:07:27] (03CR) 10Volans: "LGTM, nit in the commit message" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/579056 (owner: 10CDanis) [21:07:33] (03CR) 10Volans: [C: 03+1] cf: remove on_demand output; will always be True for us [cookbooks] - 10https://gerrit.wikimedia.org/r/579056 (owner: 10CDanis) [21:08:17] (03PS2) 10CDanis: sre.network.cf: remove on_demand output; will always be True for us [cookbooks] - 10https://gerrit.wikimedia.org/r/579056 [21:08:20] (03PS3) 10CDanis: sre.network.cf: Use a requests.Session object [cookbooks] - 10https://gerrit.wikimedia.org/r/578998 [21:08:44] (03CR) 10Bstorm: [C: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/568856 (owner: 10Legoktm) [21:08:53] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/578998 (owner: 10CDanis) [21:11:58] (03CR) 10CDanis: [C: 03+2] sre.network.cf: Use a requests.Session object [cookbooks] - 10https://gerrit.wikimedia.org/r/578998 (owner: 10CDanis) [21:12:04] (03CR) 10CDanis: [C: 03+2] sre.network.cf: remove on_demand output; will always be True for us [cookbooks] - 10https://gerrit.wikimedia.org/r/579056 (owner: 10CDanis) [21:13:57] (03PS1) 10Volans: sre.dns.netbox: fix bug in dry-run mode [cookbooks] - 10https://gerrit.wikimedia.org/r/579058 (https://phabricator.wikimedia.org/T233183) [21:14:22] (03Merged) 10jenkins-bot: sre.network.cf: remove on_demand output; will always be True for us [cookbooks] - 
10https://gerrit.wikimedia.org/r/579056 (owner: 10CDanis) [21:14:24] (03Merged) 10jenkins-bot: sre.network.cf: Use a requests.Session object [cookbooks] - 10https://gerrit.wikimedia.org/r/578998 (owner: 10CDanis) [21:14:47] (03PS1) 10Papaul: DHCP: Add MAC address and role insetup for ganeti2019 to ganeti2024 [puppet] - 10https://gerrit.wikimedia.org/r/579059 (https://phabricator.wikimedia.org/T244783) [21:15:40] (03CR) 10Bstorm: [C: 03+2] Run Python tests using pytest, not nose [puppet] - 10https://gerrit.wikimedia.org/r/568856 (owner: 10Legoktm) [21:15:59] (03CR) 10jerkins-bot: [V: 04-1] DHCP: Add MAC address and role insetup for ganeti2019 to ganeti2024 [puppet] - 10https://gerrit.wikimedia.org/r/579059 (https://phabricator.wikimedia.org/T244783) (owner: 10Papaul) [21:20:55] (03PS2) 10Papaul: DHCP: Add MAC address and role insetup for ganeti2019 to ganeti2024 [puppet] - 10https://gerrit.wikimedia.org/r/579059 (https://phabricator.wikimedia.org/T244783) [21:20:57] (03CR) 10CRusnov: [C: 03+1] "LGTM very simple change" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/578925 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [21:22:17] (03CR) 10jerkins-bot: [V: 04-1] DHCP: Add MAC address and role insetup for ganeti2019 to ganeti2024 [puppet] - 10https://gerrit.wikimedia.org/r/579059 (https://phabricator.wikimedia.org/T244783) (owner: 10Papaul) [21:23:12] (03CR) 10Volans: [C: 03+2] dns: fine tune snippet generation script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/578925 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [21:27:25] (03PS3) 10Dzahn: DHCP: Add MAC address and role insetup for ganeti2019 to ganeti2024 [puppet] - 10https://gerrit.wikimedia.org/r/579059 (https://phabricator.wikimedia.org/T244783) (owner: 10Papaul) [21:27:48] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@2726268]: Downgrade kafka_python to 1.4.3 [21:27:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:06] (03PS3) 10Jforrester: Re-apply "PageImages: Add NS_CATEGORY for Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450885 (https://phabricator.wikimedia.org/T198716) [21:29:43] James_F: basically i saw you did a revert of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/450885 but strangely the revert is empty, so I'm not sure if there's another patch I should be looking at with context. [21:30:04] Jdlrobson: Context is that it didn't work, as confirmed by you on the task. :-) [21:30:21] So can these patches be abandoned? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/450884 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/450885 ? [21:31:16] (03CR) 10Dzahn: [C: 03+2] DHCP: Add MAC address and role insetup for ganeti2019 to ganeti2024 [puppet] - 10https://gerrit.wikimedia.org/r/579059 (https://phabricator.wikimedia.org/T244783) (owner: 10Papaul) [21:31:23] (03PS4) 10Dzahn: DHCP: Add MAC address and role insetup for ganeti2019 to ganeti2024 [puppet] - 10https://gerrit.wikimedia.org/r/579059 (https://phabricator.wikimedia.org/T244783) (owner: 10Papaul) [21:31:25] I think this is a sensible default personally [21:31:54] (03PS2) 10Volans: sre.dns.netbox: fix bug in dry-run mode [cookbooks] - 10https://gerrit.wikimedia.org/r/579058 (https://phabricator.wikimedia.org/T233183) [21:33:34] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@2726268]: Downgrade kafka_python to 1.4.3 (duration: 05m 45s) [21:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:28] (03PS1) 10Jdlrobson: Enable PageImages on Commons categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579067 (https://phabricator.wikimedia.org/T198716) [21:40:52] (03PS2) 10Dzahn: puppetmaster: stop allowing elnath [puppet] - 10https://gerrit.wikimedia.org/r/579034 (https://phabricator.wikimedia.org/T188544) [21:41:23] Jdlrobson: 
The task is not complete, so no, but I need to fix them up. [21:42:19] !log stop all mjolnir-kafka-bulk-daemons in eqiad except 1 to assist debugging [21:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:17] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10Jclark-ctr) [21:45:11] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10Jclark-ctr) racked host, cabled, updated netbox; handing off to Chris for BIOS configuration host rack unit switchport kafka-jumbo1007 c6 33 35 kafka-jum... [21:45:27] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10Jclark-ctr) a:05Jclark-ctr→03Christopher [21:45:29] (03CR) 10Bstorm: [C: 03+1] Run Python tests using pytest, not nose [puppet] - 10https://gerrit.wikimedia.org/r/568856 (owner: 10Legoktm) [21:45:46] (03PS3) 10Bstorm: Run Python tests using pytest, not nose [puppet] - 10https://gerrit.wikimedia.org/r/568856 (owner: 10Legoktm) [21:51:00] (03CR) 10Bstorm: [C: 03+2] Run Python tests using pytest, not nose [puppet] - 10https://gerrit.wikimedia.org/r/568856 (owner: 10Legoktm) [21:52:49] (03Abandoned) 10Bstorm: nfs-exportd: remove unused parameters from Project [puppet] - 10https://gerrit.wikimedia.org/r/469773 (owner: 10Faidon Liambotis) [21:55:23] (03CR) 10Bstorm: [C: 03+2] maintain-replicas is no more, it's maintain-views now [puppet] - 10https://gerrit.wikimedia.org/r/566581 (owner: 10Reedy) [21:56:28] (03CR) 10CDanis: [C: 03+2] remove grafana1001.eqiad.wmnet. [dns] - 10https://gerrit.wikimedia.org/r/579057 (https://phabricator.wikimedia.org/T242992) (owner: 10Dzahn) [22:04:21] (03PS1) 10C. 
Scott Ananian: Ensure puppet works on beta cluster by allowing envoy to be absent [puppet] - 10https://gerrit.wikimedia.org/r/579070 (https://phabricator.wikimedia.org/T247147) [22:05:19] (03CR) 10Jforrester: [C: 03+1] Ensure puppet works on beta cluster by allowing envoy to be absent [puppet] - 10https://gerrit.wikimedia.org/r/579070 (https://phabricator.wikimedia.org/T247147) (owner: 10C. Scott Ananian) [22:05:54] 10Operations, 10SRE-Access-Requests: Requesting access to mwlog1001.eqiad.wmnet for holger - https://phabricator.wikimedia.org/T247470 (10holger.knust) [22:07:04] (03PS1) 10Dzahn: decom codfw appservers [puppet] - 10https://gerrit.wikimedia.org/r/579073 [22:07:35] 10Operations, 10SRE-Access-Requests: Requesting access to mwlog1001.eqiad.wmnet for holger - https://phabricator.wikimedia.org/T247470 (10WDoranWMF) As @holger.knust 's manager I approve this! [22:08:07] (03CR) 10Dzahn: [C: 03+2] Ensure puppet works on beta cluster by allowing envoy to be absent [puppet] - 10https://gerrit.wikimedia.org/r/579070 (https://phabricator.wikimedia.org/T247147) (owner: 10C. 
Scott Ananian) [22:11:13] bstorm_: requesting puppet-merge lock :) [22:12:13] (03CR) 10Jforrester: [C: 03+2] [trwiki] Restore pre-unblocking celebration logo versions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579028 (https://phabricator.wikimedia.org/T247445) (owner: 10Jforrester) [22:13:29] (03Merged) 10jenkins-bot: [trwiki] Restore pre-unblocking celebration logo versions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579028 (https://phabricator.wikimedia.org/T247445) (owner: 10Jforrester) [22:15:42] !log jforrester@deploy1001 Synchronized static/images/project-logos/: [trwiki] Restore pre-unblocking celebration logo versions T247445 (duration: 01m 09s) [22:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:48] T247445: Change Turkish Wikipedia logo - https://phabricator.wikimedia.org/T247445 [22:16:09] !log Purged trwiki logos from ATS/Varnish for T247445 [22:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:21] (03PS2) 10Dzahn: decom 15 codfw appservers [puppet] - 10https://gerrit.wikimedia.org/r/579073 (https://phabricator.wikimedia.org/T247018) [22:19:40] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [22:20:02] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [22:20:19] (03CR) 10CRusnov: [C: 03+1] "Looks good :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/579058 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [22:21:08] (03PS3) 10Volans: sre.dns.netbox: fix bug in dry-run mode [cookbooks] - 10https://gerrit.wikimedia.org/r/579058 (https://phabricator.wikimedia.org/T233183) [22:21:18] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: fix bug in dry-run mode [cookbooks] - 10https://gerrit.wikimedia.org/r/579058 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [22:21:26] merging multiple on puppetmaster [22:21:36] both labs only [22:22:08] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [22:22:30] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [22:23:11] (03Merged) 10jenkins-bot: sre.dns.netbox: fix bug in dry-run mode [cookbooks] - 10https://gerrit.wikimedia.org/r/579058 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [22:26:12] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw216[789].codfw.wmnet [22:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:05] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw217[012].codfw.wmnet [22:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:02] !log depooled mw2167 through mw2172 - rack C3 (T247018) [22:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:07] T247018: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 [22:31:02] 10Operations, 10ops-eqiad, 10DC-Ops: () rack/setup/install cloudcontrol1005 - 
https://phabricator.wikimedia.org/T247471 (10RobH) [22:31:24] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (10RobH) [22:31:55] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (10RobH) [22:32:08] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (10RobH) [22:35:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:38:57] (03PS3) 10Dzahn: decom 15 codfw appservers from rack C3 [puppet] - 10https://gerrit.wikimedia.org/r/579073 (https://phabricator.wikimedia.org/T247018) [22:39:16] !log volans@cumin2001 START - Cookbook sre.dns.netbox [22:39:17] !log volans@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [22:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:31] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10Papaul) [22:48:22] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579005 (https://phabricator.wikimedia.org/T229015) (owner: 10C. Scott Ananian) [22:53:32] Hello everyone, I get constant 429 with https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Mimo%C5%99%C3%A1dn%C3%A9_opat%C5%99en%C3%AD_-_z%C3%A1kaz_v%C3%BDvozu_desinfekce_rukou.pdf/page2-636px-Mimo%C5%99%C3%A1dn%C3%A9_opat%C5%99en%C3%AD_-_z%C3%A1kaz_v%C3%BDvozu_desinfekce_rukou.pdf.jpg, even when I try from my server with a dedicated public IP. What is happening? 
[23:00:00] Urbanecm: not just you. Probably worth a phab task. It kind of looks to me from the response like thumbor is barfing on fulfilling the request and varnish gave up for a while. [23:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200311T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:16] bd808: thanks, will file [23:00:31] 10Operations, 10Commons: Getting constant 429 for a thumbnail - https://phabricator.wikimedia.org/T247473 (10Urbanecm) [23:00:33] bd808: T247473's live [23:00:34] T247473: Getting constant 429 for a thumbnail - https://phabricator.wikimedia.org/T247473 [23:02:48] 10Operations, 10Commons: Getting constant 429 for a thumbnail - https://phabricator.wikimedia.org/T247473 (10Urbanecm) Many of the latest PDFs by Janbery seem to fail (always the last page), cf. https://commons.wikimedia.org/wiki/Special:Contributions/Janbery. 
[23:07:41] (03PS3) 10Krinkle: MWConfigCacheGenerator: Stop loading five unused dblists on web reqs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578655 (https://phabricator.wikimedia.org/T169821) [23:08:30] (03CR) 10Krinkle: [C: 03+2] MWConfigCacheGenerator: Stop loading five unused dblists on web reqs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578655 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [23:09:24] (03Merged) 10jenkins-bot: MWConfigCacheGenerator: Stop loading five unused dblists on web reqs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578655 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [23:11:21] * Krinkle testing on mwdebug1002 [23:12:36] 10Operations, 10Commons: Getting constant 429 for a thumbnail - https://phabricator.wikimedia.org/T247473 (10Urbanecm) [23:17:22] (03PS4) 10Krinkle: tests: Add structure test to disallow loading unused dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578654 (https://phabricator.wikimedia.org/T169821) [23:17:25] (03CR) 10Krinkle: [C: 03+2] tests: Add structure test to disallow loading unused dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578654 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [23:18:14] !log krinkle@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: I91b3a18317af (duration: 01m 08s) [23:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:24] (03Merged) 10jenkins-bot: tests: Add structure test to disallow loading unused dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578654 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [23:25:35] (03PS1) 10Jforrester: Drop the 'pp_stage0' and 'pp_stage1' dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579083 [23:25:37] (03PS1) 10Jforrester: Drop the 'top6-wikipedia' dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579084 [23:26:06] 10Operations, 10Commons: 
Getting constant 429 for a thumbnail - https://phabricator.wikimedia.org/T247473 (10Urbanecm) Hmm seems it's actually two issues... a) something throws 429 on a server-side error b) our thumbnailing logic fails to generate a thumbnail. https://logstash.wikimedia.org/goto/8516bf501bea... [23:27:40] (03PS1) 10Jforrester: Follow-up 3d40ef44c: Fix Parsoid box's name [puppet] - 10https://gerrit.wikimedia.org/r/579085 [23:27:48] 10Operations, 10Commons, 10Thumbor: Getting constant 429 for a thumbnail - https://phabricator.wikimedia.org/T247473 (10Urbanecm) (T239510, T223357 might be relevant) [23:29:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need by: 2020-04-01) rack/setup/install cloudcontrol1005 - https://phabricator.wikimedia.org/T247471 (10bd808) [23:30:05] Krinkle: Jforrester: i've got a bugfix backport and config change to deploy but will wait until y'all are done with what you're up to [23:33:15] * Krinkle isn't deploying right now [23:35:41] * James_F isn't either. [23:35:49] mholloway: Go ahead. [23:36:21] great, thanks! 
[23:38:14] (03CR) 10Dzahn: [C: 03+2] Follow-up 3d40ef44c: Fix Parsoid box's name [puppet] - 10https://gerrit.wikimedia.org/r/579085 (owner: 10Jforrester) [23:40:12] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.23/extensions/WikimediaEditorTasks: Fix revert counting for non-language-specific counters (duration: 01m 11s) [23:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:27] (03PS8) 10Dzahn: switch webproxy CNAMEs to new install servers [dns] - 10https://gerrit.wikimedia.org/r/569680 (https://phabricator.wikimedia.org/T224576) [23:41:30] jouncebot: next [23:41:30] In 0 hour(s) and 18 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200312T0000) [23:42:59] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.22/extensions/WikimediaEditorTasks: Fix revert counting for non-language-specific counters (duration: 01m 08s) [23:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:49] (03PS6) 10Mholloway: Enabling depicts count [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575611 (owner: 10Sharvaniharan) [23:44:44] So I & another Canadian can't access enwiki. Stuck in Toronto Cloudflare https://usercontent.irccloud-cdn.com/file/Kp4vrokG/screenshot.117.jpg [23:45:15] (03CR) 10Mholloway: [C: 03+2] Enabling depicts count [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575611 (owner: 10Sharvaniharan) [23:45:18] XioNoX: ^^ [23:45:33] AmandaNP: https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue (though it might not be needed) [23:46:15] (03Merged) 10jenkins-bot: Enabling depicts count [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575611 (owner: 10Sharvaniharan) [23:46:32] Ya the screenshot is a tracert [23:46:53] I'm having the same issue [23:47:16] AmandaNP: bradv: thank you, any traceroutes you can provide via a pastebin site would be helpful. 
I'll escalate to Cloudflare [23:47:16] same [23:47:37] k, i'll go put it in pastebin [23:48:39] cdanis: is pm ok since it has my provider information? [23:48:48] or should I just redact? [23:48:59] AmandaNP: please PM :) [23:49:59] done [23:51:22] !log cdanis@cumin1001 START - Cookbook sre.network.cf [23:51:22] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [23:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:30] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: WikimediaEditorTasks: Enable depicts counter (T244974) (duration: 01m 07s) [23:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:35] T244974: Connect image tag contributions to Suggested edits profile stats - https://phabricator.wikimedia.org/T244974 [23:52:41] adding a late patch to SWAT and deploying [23:52:46] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: WikimediaEditorTasks: Enable depicts counter (T244974) (Simon says) (duration: 01m 07s) [23:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:14] (after you're done :) [23:56:09] AmandaNP: any better yet? [23:56:33] nope [23:58:02] down for you too AmandaNP ? [23:58:14] afterhours: are you in Canada? [23:58:18] yes [23:58:37] it's being worked on right now [23:59:01] afterhours: give it a few minutes please [23:59:19] no worries :)