[00:06:49] (PS1) Dzahn: httpd: drop the ServerAdmin line completely [puppet] - https://gerrit.wikimedia.org/r/651649 (https://phabricator.wikimedia.org/T251005)
[00:12:26] (CR) Dzahn: "@joe What do you think?" [puppet] - https://gerrit.wikimedia.org/r/651649 (https://phabricator.wikimedia.org/T251005) (owner: Dzahn)
[00:12:32] (CR) Dzahn: "simple one, not bothering to figure out types for graphoid and no default values. just getting rid of hiera." [puppet] - https://gerrit.wikimedia.org/r/650636 (owner: Dzahn)
[00:27:53] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/27239/" [puppet] - https://gerrit.wikimedia.org/r/650636 (owner: Dzahn)
[00:29:14] (PS2) Dzahn: puppet_compiler: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650631
[00:30:14] (CR) Dzahn: "noop on scb2001" [puppet] - https://gerrit.wikimedia.org/r/650636 (owner: Dzahn)
[00:31:12] (PS3) Dzahn: puppet_compiler: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650631 (https://phabricator.wikimedia.org/T209953)
[00:33:24] (PS2) Dzahn: mariadb::maintenance: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650637 (https://phabricator.wikimedia.org/T209953)
[00:33:43] (CR) Faidon Liambotis: [C: +2] "Turns out that this is caused by an apt bug (+ reprepro possibly misusing the method API). MR with a fix submitted, cf.
https://salsa.debi" [puppet] - https://gerrit.wikimedia.org/r/651300 (owner: Legoktm)
[00:34:22] (PS2) Dzahn: swap: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650635 (https://phabricator.wikimedia.org/T209953)
[00:34:41] (PS3) Dzahn: swap: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650635 (https://phabricator.wikimedia.org/T209953)
[00:34:43] (PS2) Dzahn: prometheus:node_exporter: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650632 (https://phabricator.wikimedia.org/T209953)
[00:35:00] (PS2) Dzahn: pybaltest: convert to role/profile, hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650634 (https://phabricator.wikimedia.org/T209953)
[00:36:12] (CR) jerkins-bot: [V: -1] swap: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650635 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[00:37:31] (CR) Urbanecm: varnish: ratelimit vscode-phabricator plugin (1 comment) [puppet] - https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) (owner: Jbond)
[00:38:10] (PS2) Bstorm: wikireplicas: set up VM haproxy layer [puppet] - https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376)
[00:39:28] (CR) Bstorm: wikireplicas: set up VM haproxy layer (7 comments) [puppet] - https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) (owner: Bstorm)
[00:40:34] RECOVERY - Logstash rate of ingestion percent change compared to yesterday #o11y on alert1001 is OK: (C)210 ge (W)150 ge 123.8 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[00:40:44] (CR) Bstorm: wikireplicas: set up VM haproxy layer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) (owner: Bstorm)
[01:00:40] (PS3) Bstorm: wikireplicas: set up VM haproxy layer [puppet] -
https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376)
[01:03:35] (CR) Bstorm: wikireplicas: set up VM haproxy layer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) (owner: Bstorm)
[01:09:42] (PS4) Bstorm: wikireplicas: set up VM haproxy layer [puppet] - https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376)
[01:10:32] (CR) Ladsgroup: [C: +1] "straightforward" [puppet] - https://gerrit.wikimedia.org/r/650632 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[01:10:58] (CR) Ladsgroup: [C: +1] "straightforward" [puppet] - https://gerrit.wikimedia.org/r/650631 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[01:11:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:12:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:21:17] (CR) Bstorm: "Drat, I mistook the way this would handle the scope of the profile parameters according to PCC. I need to change it a bit. Feel free to co" [puppet] - https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) (owner: Bstorm)
[01:34:25] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Dzahn) > the occasional review to cleanup old puppet code there is only https://gerrit.wikimedia.org/r/c/operations/puppet/+/626460/ DHCP and an entry check-microcode.py: blacklist_mds = ['helium']....
[01:37:54] Operations, Commons: Improve mwmaint servers (e.g.
mwmain1001) userland to process server side uploads - https://phabricator.wikimedia.org/T159661 (Dzahn) Well.. @Dereckson was also the creator of the ticket so since it is now not assigned to him anymore I doubt we will find out if he considers it to be...
[01:39:52] Operations, Commons: Improve mwmaint servers (e.g. mwmain1001) userland to process server side uploads - https://phabricator.wikimedia.org/T159661 (Dzahn) Open→Resolved a:Dzahn @Dereckson I'm being bold and say mwmaint* are nowadays more ready for uploads because we set the proxies and a wget...
[01:45:49] (CR) Dzahn: [C: +2] puppet_compiler: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650631 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[01:45:51] (CR) Dzahn: [C: +2] prometheus:node_exporter: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650632 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[02:08:28] (PS1) CDanis: WIP: klaxon [puppet] - https://gerrit.wikimedia.org/r/651665
[02:09:09] (CR) jerkins-bot: [V: -1] WIP: klaxon [puppet] - https://gerrit.wikimedia.org/r/651665 (owner: CDanis)
[02:14:40] (PS2) CDanis: WIP: klaxon [puppet] - https://gerrit.wikimedia.org/r/651665
[02:15:57] (PS1) Andrew Bogott: Neutron: update stein l3 packages [puppet] - https://gerrit.wikimedia.org/r/651668 (https://phabricator.wikimedia.org/T261134)
[02:15:59] (PS1) Andrew Bogott: Neutron: apply our local dmz hacks for Stein [puppet] - https://gerrit.wikimedia.org/r/651669 (https://phabricator.wikimedia.org/T261134)
[02:16:35] (PS1) CDanis: dummy secrets for Klaxon [labs/private] - https://gerrit.wikimedia.org/r/651670
[02:16:37] (CR) jerkins-bot: [V: -1] Neutron: apply our local dmz hacks for Stein [puppet] - https://gerrit.wikimedia.org/r/651669 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[02:16:58] (PS2) Andrew Bogott: Neutron: update stein l3 python files [puppet]
- https://gerrit.wikimedia.org/r/651668 (https://phabricator.wikimedia.org/T261134)
[02:17:13] (PS2) Andrew Bogott: Neutron: apply our local dmz hacks for Stein [puppet] - https://gerrit.wikimedia.org/r/651669 (https://phabricator.wikimedia.org/T261134)
[02:17:15] (PS2) CDanis: faux secrets for Klaxon [labs/private] - https://gerrit.wikimedia.org/r/651670 (https://phabricator.wikimedia.org/T270324)
[02:17:32] (CR) CDanis: [V: +2 C: +2] faux secrets for Klaxon [labs/private] - https://gerrit.wikimedia.org/r/651670 (https://phabricator.wikimedia.org/T270324) (owner: CDanis)
[02:19:31] (PS3) CDanis: WIP: klaxon [puppet] - https://gerrit.wikimedia.org/r/651665
[02:22:06] (PS3) Andrew Bogott: Neutron: apply our local dmz hacks for Stein [puppet] - https://gerrit.wikimedia.org/r/651669 (https://phabricator.wikimedia.org/T261134)
[02:23:05] (PS4) CDanis: WIP: klaxon [puppet] - https://gerrit.wikimedia.org/r/651665
[02:25:01] (PS5) CDanis: WIP: klaxon [puppet] - https://gerrit.wikimedia.org/r/651665
[02:31:49] (PS6) CDanis: WIP: klaxon [puppet] - https://gerrit.wikimedia.org/r/651665
[02:49:29] (PS7) CDanis: WIP: klaxon [puppet] - https://gerrit.wikimedia.org/r/651665
[02:57:50] (PS8) CDanis: Puppetization of klaxon, a webapp for trusted users to page SRE [puppet] - https://gerrit.wikimedia.org/r/651665 (https://phabricator.wikimedia.org/T270324)
[02:59:48] (CR) CDanis: "PCC at https://puppet-compiler.wmflabs.org/compiler1001/27249/" [puppet] - https://gerrit.wikimedia.org/r/651665 (https://phabricator.wikimedia.org/T270324) (owner: CDanis)
[03:01:29] (PS1) CDanis: add CNAME for klaxon [dns] - https://gerrit.wikimedia.org/r/651674
[03:30:29] dsaez
[04:05:12] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 20.06 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts
https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:07:10] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[04:08:30] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:08:48] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[04:23:15] /disconnect
[04:39:00] (PS1) Ladsgroup: Fix typo in autoreview right of eliminators in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/651682
[04:52:11] (CR) Ryan Kemper: [C: +2] cirrus: bump es shard size alert thresholds [puppet] - https://gerrit.wikimedia.org/r/650021 (https://phabricator.wikimedia.org/T265908) (owner: Ryan Kemper)
[04:59:20] Operations, Discovery-Search (Current work): Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083 (RKemper) Shard limit temporarily increased: https://gerrit.wikimedia.org/r/c/operations/puppet/+/650021 Will re-index after the new year
[04:59:58] (PS1) Andrew Bogott: Nova: add Stein manifests [puppet] - https://gerrit.wikimedia.org/r/651683 (https://phabricator.wikimedia.org/T261134)
[05:00:00] (PS1) Andrew Bogott: Glance: add Stein versions of manifests [puppet] - https://gerrit.wikimedia.org/r/651684 (https://phabricator.wikimedia.org/T261134)
[05:00:02]
(PS1) Andrew Bogott: Cinder: add Stein version of service manifest [puppet] - https://gerrit.wikimedia.org/r/651685 (https://phabricator.wikimedia.org/T261134)
[05:00:05] (PS1) Andrew Bogott: Keystone: Add Stein service manifests [puppet] - https://gerrit.wikimedia.org/r/651686 (https://phabricator.wikimedia.org/T261134)
[05:00:06] (PS1) Andrew Bogott: Neutron service manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651687 (https://phabricator.wikimedia.org/T261134)
[05:00:08] (PS1) Andrew Bogott: Barbican: add service manifest for Stein [puppet] - https://gerrit.wikimedia.org/r/651688 (https://phabricator.wikimedia.org/T261134)
[05:00:11] (PS1) Andrew Bogott: Add OpenStack client package manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651689 (https://phabricator.wikimedia.org/T261134)
[05:02:11] (CR) jerkins-bot: [V: -1] Neutron service manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651687 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[05:03:04] (PS1) Andrew Bogott: OpenStack codfw1dev -> Stein [puppet] - https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134)
[05:05:50] (PS2) Andrew Bogott: Neutron service manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651687 (https://phabricator.wikimedia.org/T261134)
[05:05:52] (PS2) Andrew Bogott: Barbican: add service manifest for Stein [puppet] - https://gerrit.wikimedia.org/r/651688 (https://phabricator.wikimedia.org/T261134)
[05:05:54] (PS2) Andrew Bogott: Add OpenStack client package manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651689 (https://phabricator.wikimedia.org/T261134)
[05:05:56] (PS2) Andrew Bogott: OpenStack codfw1dev -> Stein [puppet] - https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134)
[05:11:41] (PS2) Andrew Bogott: Keystone: Add Stein service manifests [puppet] -
https://gerrit.wikimedia.org/r/651686 (https://phabricator.wikimedia.org/T261134)
[05:11:43] (PS3) Andrew Bogott: Neutron service manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651687 (https://phabricator.wikimedia.org/T261134)
[05:11:45] (PS3) Andrew Bogott: Barbican: add service manifest for Stein [puppet] - https://gerrit.wikimedia.org/r/651688 (https://phabricator.wikimedia.org/T261134)
[05:11:47] (PS3) Andrew Bogott: Add OpenStack client package manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651689 (https://phabricator.wikimedia.org/T261134)
[05:11:49] (PS3) Andrew Bogott: OpenStack codfw1dev -> Stein [puppet] - https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134)
[05:12:58] (CR) jerkins-bot: [V: -1] Keystone: Add Stein service manifests [puppet] - https://gerrit.wikimedia.org/r/651686 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[05:13:46] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[05:15:50] (PS2) Andrew Bogott: Cinder: add Stein version of service manifest [puppet] - https://gerrit.wikimedia.org/r/651685 (https://phabricator.wikimedia.org/T261134)
[05:16:05] (PS3) Andrew Bogott: Keystone: Add Stein service manifests [puppet] - https://gerrit.wikimedia.org/r/651686 (https://phabricator.wikimedia.org/T261134)
[05:16:07] (PS4) Andrew Bogott: Neutron service manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651687 (https://phabricator.wikimedia.org/T261134)
[05:16:09] (PS4) Andrew Bogott: Barbican: add service manifest for Stein [puppet] - https://gerrit.wikimedia.org/r/651688 (https://phabricator.wikimedia.org/T261134)
[05:16:11] (PS4) Andrew Bogott: Add
OpenStack client package manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651689 (https://phabricator.wikimedia.org/T261134)
[05:16:13] (PS4) Andrew Bogott: OpenStack codfw1dev -> Stein [puppet] - https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134)
[05:17:08] (CR) jerkins-bot: [V: -1] Keystone: Add Stein service manifests [puppet] - https://gerrit.wikimedia.org/r/651686 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[05:22:08] (PS5) Andrew Bogott: Neutron service manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651687 (https://phabricator.wikimedia.org/T261134)
[05:22:11] (PS5) Andrew Bogott: Barbican: add service manifest for Stein [puppet] - https://gerrit.wikimedia.org/r/651688 (https://phabricator.wikimedia.org/T261134)
[05:22:13] (PS5) Andrew Bogott: Add OpenStack client package manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651689 (https://phabricator.wikimedia.org/T261134)
[05:22:15] (PS5) Andrew Bogott: OpenStack codfw1dev -> Stein [puppet] - https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134)
[05:24:21] (PS6) Andrew Bogott: OpenStack codfw1dev -> Stein [puppet] - https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134)
[05:27:13] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, observability, User-DannyS712: Beta cluster logstash down - https://phabricator.wikimedia.org/T268200 (colewhite) >>! In T268200#6709451, @DannyS712 wrote: > > I'm still not seeing any mediawiki debug logs though (`type: "med...
[05:29:21] (PS7) Andrew Bogott: OpenStack codfw1dev -> Stein [puppet] - https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134)
[05:34:14] (CR) Ladsgroup: [C: +2] Fix typo in autoreview right of eliminators in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/651682 (owner: Ladsgroup)
[05:35:39] (PS6) Andrew Bogott: Barbican: add service manifest for Stein [puppet] - https://gerrit.wikimedia.org/r/651688 (https://phabricator.wikimedia.org/T261134)
[05:35:41] (PS6) Andrew Bogott: Add OpenStack client package manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651689 (https://phabricator.wikimedia.org/T261134)
[05:35:43] (PS8) Andrew Bogott: OpenStack codfw1dev -> Stein [puppet] - https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134)
[05:36:04] (Merged) jenkins-bot: Fix typo in autoreview right of eliminators in fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/651682 (owner: Ladsgroup)
[05:36:44] (CR) jerkins-bot: [V: -1] Barbican: add service manifest for Stein [puppet] - https://gerrit.wikimedia.org/r/651688 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[05:38:22] (PS7) Andrew Bogott: Barbican: add service manifest for Stein [puppet] - https://gerrit.wikimedia.org/r/651688 (https://phabricator.wikimedia.org/T261134)
[05:38:24] (PS7) Andrew Bogott: Add OpenStack client package manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651689 (https://phabricator.wikimedia.org/T261134)
[05:38:26] (PS9) Andrew Bogott: OpenStack codfw1dev -> Stein [puppet] - https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134)
[05:50:28] (PS2) Andrew Bogott: Glance: add Stein versions of manifests [puppet] - https://gerrit.wikimedia.org/r/651684 (https://phabricator.wikimedia.org/T261134)
[05:50:30] (PS3) Andrew Bogott: Cinder: add Stein version of service manifest [puppet]
- https://gerrit.wikimedia.org/r/651685 (https://phabricator.wikimedia.org/T261134)
[05:50:32] (PS4) Andrew Bogott: Keystone: Add Stein service manifests [puppet] - https://gerrit.wikimedia.org/r/651686 (https://phabricator.wikimedia.org/T261134)
[05:50:34] (PS6) Andrew Bogott: Neutron service manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651687 (https://phabricator.wikimedia.org/T261134)
[05:50:36] (PS8) Andrew Bogott: Barbican: add service manifest for Stein [puppet] - https://gerrit.wikimedia.org/r/651688 (https://phabricator.wikimedia.org/T261134)
[05:50:38] (PS8) Andrew Bogott: Add OpenStack client package manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651689 (https://phabricator.wikimedia.org/T261134)
[05:50:40] (PS10) Andrew Bogott: OpenStack codfw1dev -> Stein [puppet] - https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134)
[05:50:42] (PS1) Andrew Bogott: Nova: Move some hard-coded rocky includes into version-specific manifests [puppet] - https://gerrit.wikimedia.org/r/651692
[05:52:12] (CR) jerkins-bot: [V: -1] Keystone: Add Stein service manifests [puppet] - https://gerrit.wikimedia.org/r/651686 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[05:54:53] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:651682|Fix typo in autoreview right of eliminators in fawiki]] (duration: 00m 57s)
[05:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:48] (PS1) Marostegui: install_server: Do not reimage db1154 [puppet] - https://gerrit.wikimedia.org/r/651694 (https://phabricator.wikimedia.org/T268742)
[06:08:53] (CR) Marostegui: [C: +2] install_server: Do not reimage db1154 [puppet] - https://gerrit.wikimedia.org/r/651694 (https://phabricator.wikimedia.org/T268742) (owner: Marostegui)
[06:31:41] Operations, Beta-Cluster-Infrastructure,
Release-Engineering-Team, observability, User-DannyS712: Beta cluster logstash down - https://phabricator.wikimedia.org/T268200 (DannyS712) >>! In T268200#6709792, @colewhite wrote: >>>! In T268200#6709451, @DannyS712 wrote: >> >> I'm still not seeing...
[06:32:24] Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui) Thank you - that's very useful!
[07:13:02] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[07:27:34] (PS2) Elukey: Set a more restrictive umask for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/651249 (https://phabricator.wikimedia.org/T270629)
[07:28:51] (CR) Ladsgroup: "PCC is still the same https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27260/console" [puppet] - https://gerrit.wikimedia.org/r/642649 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[07:30:14] (CR) Elukey: [C: +2] Set a more restrictive umask for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/651249 (https://phabricator.wikimedia.org/T270629) (owner: Elukey)
[07:31:21] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/27261/ it is dbmonitor1001.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/642649 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[07:31:47] (CR) Ladsgroup: [C: +1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/642649 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[07:58:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:02:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:08:06] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 14.51 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:10:00] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[08:11:24] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:13:18] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:01:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:03:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:08:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:12:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:54:49] !log upgraded python3-wmflib to 0.0.5 on cumin1001
[09:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:59] !log gerrit: removed old gerrit directory /srv/var-lib-gerrit2-cobalt.wikimedia.org/.gerritcodereview/ (was some tmp dirs for Gerrit jars )
[10:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:29:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:31:39] (PS1) Jbond: profile::pybaltest: move hiera keys to profile namespace [puppet] - https://gerrit.wikimedia.org/r/651756 (https://phabricator.wikimedia.org/T247956)
[10:32:34] (PS2) Jbond: profile::pybaltest: move hiera keys to profile namespace [puppet] - https://gerrit.wikimedia.org/r/651756 (https://phabricator.wikimedia.org/T247956)
[10:33:01] (CR) Jbond: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27262/console" [puppet] - https://gerrit.wikimedia.org/r/651756 (https://phabricator.wikimedia.org/T247956) (owner: Jbond)
[10:33:57] jbond42: pcc success but no hosts?
[10:34:07] (CR) jerkins-bot: [V: -1] profile::pybaltest: move hiera keys to profile namespace [puppet] - https://gerrit.wikimedia.org/r/651756 (https://phabricator.wikimedia.org/T247956) (owner: Jbond)
[10:34:37] volans: yes the hosts selection returned no hosts, it's on my list to change the exit status when that happens :)
[10:34:52] ack, just checking if it was a known issue :)
[10:34:59] thanks a lot for all the efforts you made there
[10:35:19] volans: it's known by me :S, let me create a task for it quickly so it doesn't get lost though
[10:35:49] usually "known by jbond42" has a high probability of being fixed quickly ;)
[10:36:03] :D thanks
[10:36:18] (PS1) Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - https://gerrit.wikimedia.org/r/651757
[10:39:55] Operations, puppet-compiler, User-jbond: PCC: change the exist code when no hosts are found - https://phabricator.wikimedia.org/T270757 (jbond) p:Triage→Medium
[10:42:22] (PS3) Jbond: profile::pybaltest: move hiera keys to profile namespace [puppet] - https://gerrit.wikimedia.org/r/651756 (https://phabricator.wikimedia.org/T247956)
[10:43:52] (CR) jerkins-bot: [V:
-1] profile::pybaltest: move hiera keys to profile namespace [puppet] - https://gerrit.wikimedia.org/r/651756 (https://phabricator.wikimedia.org/T247956) (owner: Jbond)
[10:44:58] yes
[10:47:12] if you start talking with jerkins-bot it's time to go on vacation jbond42 :-P
[10:47:19] (PS4) Jbond: profile::pybaltest: move hiera keys to profile namespace [puppet] - https://gerrit.wikimedia.org/r/651756 (https://phabricator.wikimedia.org/T247956)
[10:47:21] lol
[10:48:06] (CR) Jbond: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27264/console" [puppet] - https://gerrit.wikimedia.org/r/651756 (https://phabricator.wikimedia.org/T247956) (owner: Jbond)
[10:50:37] (PS5) Jbond: profile::pybaltest: move hiera keys to profile namespace [puppet] - https://gerrit.wikimedia.org/r/651756 (https://phabricator.wikimedia.org/T247956)
[10:51:26] (CR) Jbond: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27265/console" [puppet] - https://gerrit.wikimedia.org/r/651756 (https://phabricator.wikimedia.org/T247956) (owner: Jbond)
[10:52:01] (CR) Jbond: [C: +2] pybaltest: convert to role/profile, hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/650634 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[10:52:17] (CR) Jbond: [V: +1 C: +2] profile::pybaltest: move hiera keys to profile namespace [puppet] - https://gerrit.wikimedia.org/r/651756 (https://phabricator.wikimedia.org/T247956) (owner: Jbond)
[10:53:09] (CR) Jbond: "looked good, i sent a follow up PS to change the hiera key so it was namespaced in profile and merged both, thanks" [puppet] - https://gerrit.wikimedia.org/r/650634 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[11:03:41] (PS2) Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - https://gerrit.wikimedia.org/r/651757
[11:27:37] (CR) Arturo
Borrero Gonzalez: [C: +1] Neutron: update stein l3 python files [puppet] - https://gerrit.wikimedia.org/r/651668 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[11:28:28] (PS1) Giuseppe Lavagetto: Retry when failing to fetch image metadata from the registry [docker-images/docker-report] - https://gerrit.wikimedia.org/r/651760
[11:28:45] (CR) Arturo Borrero Gonzalez: [C: +1] Neutron: apply our local dmz hacks for Stein [puppet] - https://gerrit.wikimedia.org/r/651669 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[11:30:09] (CR) jerkins-bot: [V: -1] Retry when failing to fetch image metadata from the registry [docker-images/docker-report] - https://gerrit.wikimedia.org/r/651760 (owner: Giuseppe Lavagetto)
[11:30:57] (CR) Arturo Borrero Gonzalez: [C: +1] Nova: add Stein manifests [puppet] - https://gerrit.wikimedia.org/r/651683 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[11:33:03] (PS1) David Caro: wmcs.backup: Add command to backup all assigned vms [puppet] - https://gerrit.wikimedia.org/r/651761
[11:33:39] (CR) Arturo Borrero Gonzalez: [C: -1] Glance: add Stein versions of manifests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/651684 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[11:36:22] (CR) Arturo Borrero Gonzalez: wikireplicas: set up VM haproxy layer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) (owner: Bstorm)
[11:43:11] (CR) Arturo Borrero Gonzalez: [C: +1] Cinder: add Stein version of service manifest (3 comments) [puppet] - https://gerrit.wikimedia.org/r/651685 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[11:44:53] (CR) Arturo Borrero Gonzalez: [C: +1] Neutron service manifests for Stein [puppet] - https://gerrit.wikimedia.org/r/651687 (https://phabricator.wikimedia.org/T261134) (owner:
10Andrew Bogott) [11:46:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Barbican: add service manifest for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651688 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [11:47:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add OpenStack client package manifests for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651689 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [11:51:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Nova: Move some hard-coded rocky includes into version-specific manifests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651692 (owner: 10Andrew Bogott) [11:51:28] (03PS4) 10Jbond: varnish: ratelimit vscode-phabricator plugin [puppet] - 10https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) [11:55:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.backup: Add command to backup all assigned vms [puppet] - 10https://gerrit.wikimedia.org/r/651761 (owner: 10David Caro) [12:18:01] (03Abandoned) 10Jbond: CI - black: update python3 files with black [puppet] - 10https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T211750) (owner: 10Jbond) [12:18:11] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/651764 [12:18:34] (03Abandoned) 10Jbond: CI - black: run black over python2 files [puppet] - 10https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T211750) (owner: 10Jbond) [12:23:28] (03PS1) 10Volans: interactive: migrate from spicerack to wmflib [cookbooks] - 10https://gerrit.wikimedia.org/r/651765 (https://phabricator.wikimedia.org/T257905) [12:27:33] (03PS1) 10Jbond: README: add local hacking instructions [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651766 [12:28:42] (03PS2) 10Volans: interactive: migrate from spicerack to wmflib [cookbooks] - 10https://gerrit.wikimedia.org/r/651765 
(https://phabricator.wikimedia.org/T257905) [12:28:44] (03PS1) 10Volans: tests: fix deprecated pytest argument [cookbooks] - 10https://gerrit.wikimedia.org/r/651767 [12:42:46] (03PS2) 10Jbond: README: add local hacking instructions [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651766 [12:44:01] (03CR) 10Jbond: [C: 03+2] README: add local hacking instructions [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651766 (owner: 10Jbond) [13:03:07] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 111 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:04:45] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 33 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:12:29] PROBLEM - SSH on logstash2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:13:17] PROBLEM - Check systemd state on logstash2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:57] RECOVERY - SSH on logstash2005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:14:43] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7ff34bbd74e0: Failed to establish a new connection: [Errno 111] Connection [13:14:43] ://wikitech.wikimedia.org/wiki/Search%23Administration [13:20:50] (03PS1) 10Jbond: cli: add ability to force removale of old job directories [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651770 [13:21:24] (03CR) 10jerkins-bot: [V: 04-1] cli: add ability to force removale of old job directories [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651770 (owner: 10Jbond) [13:25:14] (03PS2) 10Jbond: cli: add ability to force removale of old job directories [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651770 [13:30:48] (03CR) 10Jbond: [C: 03+2] cli: add ability to force removale of old job directories [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651770 (owner: 10Jbond) [13:37:48] (03PS9) 10CDanis: Puppetization of klaxon, a webapp for trusted users to page SRE [puppet] - 10https://gerrit.wikimedia.org/r/651665 (https://phabricator.wikimedia.org/T270324) [13:39:01] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2005 is OK: OK - elasticsearch status production-logstash-codfw: task_max_waiting_in_queue_millis: 0, relocating_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, active_primary_shards: 456, timed_out: False, cluster_name: production-logstash-codfw, active_shards: 862, status: green, 
unassigned_shards: 0, number_of_nodes: 6, active_shards_perce [13:39:01] .0, number_of_in_flight_fetch: 0, initializing_shards: 0, number_of_data_nodes: 3 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:39:11] RECOVERY - Check systemd state on logstash2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:38] (03PS10) 10CDanis: Puppetization of klaxon, a webapp for trusted users to page SRE [puppet] - 10https://gerrit.wikimedia.org/r/651665 (https://phabricator.wikimedia.org/T270324) [13:53:33] (03PS1) 10Jbond: controller: raise an error if no hosts are found [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651771 [13:54:14] (03CR) 10jerkins-bot: [V: 04-1] controller: raise an error if no hosts are found [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651771 (owner: 10Jbond) [13:59:16] (03PS2) 10Jbond: controller: raise an error if no hosts are found [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651771 [13:59:57] (03CR) 10jerkins-bot: [V: 04-1] controller: raise an error if no hosts are found [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651771 (owner: 10Jbond) [14:01:46] (03PS3) 10Jbond: controller: raise an error if no hosts are found [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651771 [14:02:27] (03CR) 10jerkins-bot: [V: 04-1] controller: raise an error if no hosts are found [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651771 (owner: 10Jbond) [14:15:03] (03CR) 10CDanis: "updated PCC after a few tweaks https://puppet-compiler.wmflabs.org/compiler1002/27266/" [puppet] - 10https://gerrit.wikimedia.org/r/651665 (https://phabricator.wikimedia.org/T270324) (owner: 10CDanis) [14:16:06] (03PS4) 10Jbond: controller: raise an error if no hosts are found [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651771 [14:17:49] (03CR) 10Jbond: [C: 03+2] controller: raise an error if no 
hosts are found [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651771 (owner: 10Jbond) [14:18:33] (03Merged) 10jenkins-bot: controller: raise an error if no hosts are found [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651771 (owner: 10Jbond) [14:20:29] (03PS1) 10Jbond: 1.1.0: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651775 [14:27:14] (03CR) 10Jbond: [C: 03+2] 1.1.0: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651775 (owner: 10Jbond) [14:29:27] PROBLEM - MD RAID on an-coord1002 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:29:28] ACKNOWLEDGEMENT - MD RAID on an-coord1002 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T270768 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:29:33] 10Operations, 10ops-eqiad: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (10ops-monitoring-bot) [14:33:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27268/console" [puppet] - 10https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) (owner: 10Jbond) [14:36:13] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM overall, the only real recommendation would be to use ssl_ciphersuite in the apache configuration to determine the TLS settings." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/651665 (https://phabricator.wikimedia.org/T270324) (owner: 10CDanis) [14:36:14] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (10Volans) p:05Triage→03Medium [14:45:19] (03PS11) 10CDanis: Puppetization of klaxon, a webapp for trusted users to page SRE [puppet] - 10https://gerrit.wikimedia.org/r/651665 (https://phabricator.wikimedia.org/T270324) [14:46:47] (03CR) 10jerkins-bot: [V: 04-1] Puppetization of klaxon, a webapp for trusted users to page SRE [puppet] - 10https://gerrit.wikimedia.org/r/651665 (https://phabricator.wikimedia.org/T270324) (owner: 10CDanis) [14:48:56] (03PS12) 10CDanis: Puppetization of klaxon, a webapp for trusted users to page SRE [puppet] - 10https://gerrit.wikimedia.org/r/651665 (https://phabricator.wikimedia.org/T270324) [14:51:27] (03CR) 10David Caro: [C: 03+2] [wmcs][backup] Remove all temp files after usage [puppet] - 10https://gerrit.wikimedia.org/r/650542 (https://phabricator.wikimedia.org/T270478) (owner: 10David Caro) [14:58:59] (03PS13) 10CDanis: Puppetization of klaxon, a webapp for trusted users to page SRE [puppet] - 10https://gerrit.wikimedia.org/r/651665 (https://phabricator.wikimedia.org/T270324) [15:00:14] (03CR) 10CDanis: Puppetization of klaxon, a webapp for trusted users to page SRE (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/651665 (https://phabricator.wikimedia.org/T270324) (owner: 10CDanis) [15:05:33] (03CR) 10Bstorm: wikireplicas: set up VM haproxy layer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) (owner: 10Bstorm) [15:09:50] (03PS1) 10David Caro: wmcs.backup: add a command to remove non-handled backups [puppet] - 10https://gerrit.wikimedia.org/r/651776 [15:14:04] (03CR) 10CDanis: [C: 03+2] add CNAME for klaxon [dns] - 10https://gerrit.wikimedia.org/r/651674 (owner: 10CDanis) [15:14:29] (03CR) 10Giuseppe Lavagetto: 
[C: 03+1] Puppetization of klaxon, a webapp for trusted users to page SRE [puppet] - 10https://gerrit.wikimedia.org/r/651665 (https://phabricator.wikimedia.org/T270324) (owner: 10CDanis) [15:15:54] !log disabling puppet on alert1001 for klaxon rollout [15:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:06] (03CR) 10CDanis: [C: 03+2] Puppetization of klaxon, a webapp for trusted users to page SRE [puppet] - 10https://gerrit.wikimedia.org/r/651665 (https://phabricator.wikimedia.org/T270324) (owner: 10CDanis) [15:16:34] klaxon, very self-explanatory :) [15:16:52] 🎺 🚨 🔔 [15:21:59] (03PS1) 10Bstorm: wikireplicas: set up VM haproxy layer [puppet] - 10https://gerrit.wikimedia.org/r/651778 (https://phabricator.wikimedia.org/T267376) [15:23:31] (03CR) 10Bstorm: "This is the minimum I think is needed to spin up the VM. I'll fix the other one and make it the refactor. The general notion would be to u" [puppet] - 10https://gerrit.wikimedia.org/r/651778 (https://phabricator.wikimedia.org/T267376) (owner: 10Bstorm) [15:24:50] (03PS5) 10Bstorm: cloud haproxy: refactor the various haproxy setups [puppet] - 10https://gerrit.wikimedia.org/r/651301 (https://phabricator.wikimedia.org/T267376) [15:26:09] (03PS3) 10Andrew Bogott: Glance: add Stein versions of manifests [puppet] - 10https://gerrit.wikimedia.org/r/651684 (https://phabricator.wikimedia.org/T261134) [15:26:12] (03PS4) 10Andrew Bogott: Cinder: add Stein version of service manifest [puppet] - 10https://gerrit.wikimedia.org/r/651685 (https://phabricator.wikimedia.org/T261134) [15:26:14] (03PS5) 10Andrew Bogott: Keystone: Add Stein service manifests [puppet] - 10https://gerrit.wikimedia.org/r/651686 (https://phabricator.wikimedia.org/T261134) [15:26:16] (03PS7) 10Andrew Bogott: Neutron service manifests for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651687 (https://phabricator.wikimedia.org/T261134) [15:26:18] (03PS9) 10Andrew Bogott: Barbican: add service manifest for Stein 
[puppet] - 10https://gerrit.wikimedia.org/r/651688 (https://phabricator.wikimedia.org/T261134) [15:26:20] (03PS9) 10Andrew Bogott: Add OpenStack client package manifests for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651689 (https://phabricator.wikimedia.org/T261134) [15:26:22] (03PS11) 10Andrew Bogott: OpenStack codfw1dev -> Stein [puppet] - 10https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134) [15:26:28] (03PS1) 10CDanis: klaxon: tweak deploy path [puppet] - 10https://gerrit.wikimedia.org/r/651780 [15:27:28] (03PS2) 10CDanis: klaxon: tweak deploy path & fix dependency [puppet] - 10https://gerrit.wikimedia.org/r/651780 [15:27:51] (03CR) 10jerkins-bot: [V: 04-1] Keystone: Add Stein service manifests [puppet] - 10https://gerrit.wikimedia.org/r/651686 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [15:31:30] (03CR) 10CDanis: [C: 03+2] klaxon: tweak deploy path & fix dependency [puppet] - 10https://gerrit.wikimedia.org/r/651780 (owner: 10CDanis) [15:31:32] (03PS6) 10Bstorm: cloud haproxy: refactor the various haproxy setups [puppet] - 10https://gerrit.wikimedia.org/r/651301 [15:33:38] cdanis: klaxon's cert mentions icigna, I guess you'll need to run acme-chief to generate a new cert for that page? [15:34:15] PROBLEM - Device not healthy -SMART- on an-coord1002 is CRITICAL: cluster=analytics device=sda instance=an-coord1002 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1002&var-datasource=eqiad+prometheus/ops [15:34:44] tabbycat: yeah, not actually live yet, troubleshooting a few things on the backup alert host :) [15:34:54] I'll let you know! 
should just be a few more minutes [15:35:07] np, just saw the NET::ERR_CERT_COMMON_NAME_INVALID [15:35:18] I guess it just needs a little tweak [15:35:26] just a puppet run [15:35:43] but the apache config being generated is wrong in other ways, so until I fix that you'd be getting a different error page :D [15:37:56] (03PS1) 10CDanis: Klaxon: correct pass port number to apache [puppet] - 10https://gerrit.wikimedia.org/r/651783 [15:38:07] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:39:46] (03CR) 10CDanis: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27277/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/651783 (owner: 10CDanis) [15:45:25] (03PS1) 10CDanis: Klaxon: needs gunicorn3, not gunicorn [puppet] - 10https://gerrit.wikimedia.org/r/651784 [15:45:39] only two dumb mistakes, not bad [15:46:24] (03CR) 10CDanis: [C: 03+2] Klaxon: needs gunicorn3, not gunicorn [puppet] - 10https://gerrit.wikimedia.org/r/651784 (owner: 10CDanis) [15:49:07] (03PS1) 10Ottomata: Failover analytics-hive.eqiad.wmnet to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/651786 (https://phabricator.wikimedia.org/T268028) [15:50:34] make that three ;) [15:51:02] (03CR) 10Ottomata: [C: 03+2] Failover analytics-hive.eqiad.wmnet to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/651786 (https://phabricator.wikimedia.org/T268028) (owner: 10Ottomata) [15:52:31] (03PS1) 10CDanis: IDP configuration for Klaxon [puppet] - 10https://gerrit.wikimedia.org/r/651787 (https://phabricator.wikimedia.org/T270324) [15:53:31] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10observability, 10User-DannyS712: Beta cluster logstash down - https://phabricator.wikimedia.org/T268200 
(10colewhite) >>! In T268200#6709830, @DannyS712 wrote: > Indeed! But, looking at the errors dashboard (https://logstash-beta.... [15:53:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27281/console" [puppet] - 10https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [15:53:47] jbond42: do you have a minute to glance at https://gerrit.wikimedia.org/r/651787 ? the mod_auth_cas config is already merged -- https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/templates/idp/client/httpd-klaxon.erb [15:54:02] cdanis: looking [15:54:56] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:55:52] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/651787 (https://phabricator.wikimedia.org/T270324) (owner: 10CDanis) [15:56:24] ty! [15:56:28] (03CR) 10CDanis: [C: 03+2] IDP configuration for Klaxon [puppet] - 10https://gerrit.wikimedia.org/r/651787 (https://phabricator.wikimedia.org/T270324) (owner: 10CDanis) [15:57:27] cdanis: i think you will need to manually restart tomcat on the active instance for the config to take effect (although it may monitor the dir, i can't remember); however as everyone is mostly out i think it's safe to do [15:58:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Glance: add Stein versions of manifests [puppet] - 10https://gerrit.wikimedia.org/r/651684 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [15:58:11] (03PS1) 10Jbond: pcc: add more info to the status message [puppet] - 10https://gerrit.wikimedia.org/r/651788 [15:58:17] jbond42: looks like it Just Worked!
:) [15:58:20] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (10Ottomata) This node should now be in standby mode and should be safe to take offline at any time. As it is in standby, I believe it should be fine to wait until after th... [15:58:35] cdanis: ok cool good to know [15:58:45] in that case you have time to take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/651788 ;) [15:58:48] https://imgur.com/IWUvjiM.png [15:58:49] (03CR) 10jerkins-bot: [V: 04-1] pcc: add more info to the status message [puppet] - 10https://gerrit.wikimedia.org/r/651788 (owner: 10Jbond) [15:58:53] ahaha yes sorry :) [15:59:13] only joking i only just pushed it, its not critical and CI just went red :( [15:59:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.backup: add a command to remove non-handled backups [puppet] - 10https://gerrit.wikimedia.org/r/651776 (owner: 10David Caro) [15:59:39] I'll review it once you make CI happy; will also look at your other blocking patches today :) [15:59:49] ack thanks [16:01:19] (03PS2) 10Jbond: pcc: add more info to the status message [puppet] - 10https://gerrit.wikimedia.org/r/651788 [16:01:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] [wmcs][backup] Add command to remove/print dangling snapshots [puppet] - 10https://gerrit.wikimedia.org/r/650535 (https://phabricator.wikimedia.org/T270478) (owner: 10David Caro) [16:02:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge-k8s: AdmissionsConfiguration is GA after 1.17 [puppet] - 10https://gerrit.wikimedia.org/r/639883 (https://phabricator.wikimedia.org/T263284) (owner: 10Bstorm) [16:02:21] (03PS3) 10Jbond: pcc: add more info to the status message [puppet] - 10https://gerrit.wikimedia.org/r/651788 (https://phabricator.wikimedia.org/T270757) [16:02:26] (03CR) 10Bstorm: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27275/" [puppet] - 10https://gerrit.wikimedia.org/r/651301 (owner: 
10Bstorm) [16:04:20] volans: tabbycat: https://klaxon.wikimedia.org/ 🎉 [16:05:17] cdanis: awesome \o/ [16:05:39] now let's hope it does not sound like a real klaxon at 4AM [16:08:43] cdanis: nice layout [16:08:57] thank you! [16:09:26] it started as a weekend project where I wanted to learn how to write "good" HTML/CSS and do layouts well. I'm still not sure exactly why I had that itch, but I did :) [16:09:31] cdanis: the source code button doesn't like to any code I can steal though. It just stays on home page. [16:09:42] oh no [16:09:48] I'm waiting on GitHub for education to come through to get a course [16:10:02] (03CR) 10Bstorm: "It's there in the change catalog, so that makes me think this is correct:" [puppet] - 10https://gerrit.wikimedia.org/r/651301 (owner: 10Bstorm) [16:10:20] Then go crazy (re)writing docs that don't exist or are nonsense [16:10:42] (c) All Rights Reserved cdanis and their cats. [16:11:07] (03PS1) 10CDanis: fix source code link [software/klaxon] - 10https://gerrit.wikimedia.org/r/651789 [16:12:26] cdanis: yoooo congrats [16:12:50] (03CR) 10CDanis: [V: 03+2 C: 03+2] fix source code link [software/klaxon] - 10https://gerrit.wikimedia.org/r/651789 (owner: 10CDanis) [16:13:25] selfmerge abuse!!!111 :P [16:13:55] it's the V+2 that is the most reprehensible part of that change tbh, I really need to set up CI in that repo 😅 [16:15:21] (03PS1) 10Jbond: (DO NOT MERGE) testing CI [puppet] - 10https://gerrit.wikimedia.org/r/651790 [16:15:39] n00b q: so klaxon is replacing icinga or both are for different purposes? 
[16:15:44] Let me know when it is live and I'll make sure it looks fine here [16:15:51] tabbycat: klaxon is paging [16:15:53] (03CR) 10jerkins-bot: [V: 04-1] (DO NOT MERGE) testing CI [puppet] - 10https://gerrit.wikimedia.org/r/651790 (owner: 10Jbond) [16:16:09] Icinga would trigger the page but klaxon sends it I think [16:16:15] tabbycat: Klaxon is what real users should use if they have found an issue that needs urgent SRE attention (e.g. they believe their account credentials are compromised) [16:16:51] (or something under-monitored is causing widespread issues for users) [16:17:00] I *hope* not every LDAP account has access to that cdanis [16:17:13] no -- only wmf, wmde, nda [16:17:25] good :) [16:17:36] they just don't know it yet :D [16:17:52] I was planning funny things [16:18:02] tabbycat: can you imagine? [16:18:08] That would be interesting [16:18:29] RhinosF1: I'm used to receiving calls at late night hours but it still kinda bugs me [16:18:39] I would note that if no one has thought you need to have a backup way to authenticate if ldap fails [16:18:44] I think it'd be reasonable for any code deployer to have access to it, but that would require some more work and probably some more discussion [16:18:56] RhinosF1: yeah, in such an event, we should already be getting approximately a bajillion automated alerts :) [16:19:02] True [16:19:11] WMF are good at detecting issues [16:19:13] anyway I'll write up a little FAQ on wikitech today [16:19:32] this is for things that are urgent but would otherwise fall through the cracks :) [16:19:47] Ah [16:19:59] cdanis: super minor nit. a malicious user could update the readonly input if they wished. however that would mean someone in wmf, nda or wmde was going out of their way to be a pain which seems highly unlikely [16:19:59] I assume for very urgent things phones are in the officewiki page [16:20:15] jbond42: oh true, I should read that from the header again! [16:20:40] tabbycat: indeed!
although this is still designed for that purpose, and is probably faster than phoning individuals [16:21:20] jbond42: oh, no, it already does :) [16:21:29] cdanis: ahh cool :) [16:21:40] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/klaxon/+/refs/heads/master/klaxon/__init__.py#112 [16:21:57] cdanis: ack; or if everything else fails, old-style Telegram :D [16:22:29] cdanis: cool thx, also TIL you can use functions directly in f-strings :) [16:22:49] any expression, as long as you can get the quoting right :D [16:22:56] awesome [16:26:53] cdanis: FYI you can also get those details from `request.environ` [16:27:02] https://puppetboard.wikimedia.org/debug/ gives the output of [16:27:08] return '<br>'.join(['{}={}'.format(k,v) for k,v in request.environ.items()])
[16:27:12] hah! nice [16:27:18] didn't have to put that in the app config at all [16:27:30] oh, you meant username [16:27:50] cdanis: yes instead of parsing the headers [16:28:21] e.g. request.environ['HTTP_X_CAS_UID'] [16:28:54] HTTP_X_CAS_MAIL is a nice one actually [16:29:18] i think flask just parses the headers and makes the environment variables so not sure if it saves much but yes there are other useful things in there [16:29:29] jbond42: oh that /debug endpoint is great, thank you [16:29:41] that debug page doesn't exist :S [16:30:04] i think i just hacked that into the puppetboard app and i need to make a proper page somewhere :D [16:30:26] ahaha [16:30:35] well, whatever it is is very useful [16:31:20] yes definitely, and i really should make a proper one somewhere [16:31:39] yeah it makes me want to add a debug endpoint to klaxon :D [16:31:55] @app.route('/debug/') [16:31:55] def debug(): return '<br>'.join(['{}={}'.format(k,v) for k,v in request.environ.items()])
[16:32:09] go for it, it at least makes more sense there than puppetboard [16:32:58] https://phabricator.wikimedia.org/P13629 [16:42:54] (03PS1) 10CDanis: klaxon: git::clone ensure=>latest [puppet] - 10https://gerrit.wikimedia.org/r/651796 [16:43:25] (03CR) 10jerkins-bot: [V: 04-1] klaxon: git::clone ensure=>latest [puppet] - 10https://gerrit.wikimedia.org/r/651796 (owner: 10CDanis) [16:44:20] 10Puppet, 10puppet-compiler: Puppet checks for invalid class names - https://phabricator.wikimedia.org/T175979 (10jbond) 05Open→03Resolved a:03jbond I have tested this out with a [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/651790/ | change ]] and it seems things have progressed since this tic... [16:44:37] (03PS2) 10CDanis: klaxon: git::clone ensure=>latest [puppet] - 10https://gerrit.wikimedia.org/r/651796 [16:44:41] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:30] (03PS1) 10CDanis: Some wordsmithing, & remove a long-done TODO [software/klaxon] - 10https://gerrit.wikimedia.org/r/651797 [16:47:46] (03CR) 10CDanis: [C: 03+2] klaxon: git::clone ensure=>latest [puppet] - 10https://gerrit.wikimedia.org/r/651796 (owner: 10CDanis) [16:51:36] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:41] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:41] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 59.2 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:55:17] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need
By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) [16:55:29] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [16:55:51] (03PS1) 10Jbond: nodegen: add cumin support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) [16:55:53] at first look it seems just an artifact of a spike that happened ~30m ago [16:55:59] cdanis: do you agree? (eqsin) [16:56:33] (03CR) 10jerkins-bot: [V: 04-1] nodegen: add cumin support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [16:57:16] volans: taking a look but the traffic drop alert is that 95% of the time [16:57:26] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) [16:57:39] yeah, looks like we had some scraper or other traffic burst in eqsin [16:58:02] yep [16:58:11] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:24] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) These are racked, need bios setup [16:58:51] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 80.06 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:59:03] (03CR) 10Volans: "Couple of 
comments inline, thanks for adding the support for it" (033 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [16:59:45] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [16:59:55] (03PS1) 10Volans: elasticsearch_cluster: fix call to @retry [software/spicerack] - 10https://gerrit.wikimedia.org/r/651802 [16:59:57] (03PS1) 10Volans: tests: fix deprecated pytest argument [software/spicerack] - 10https://gerrit.wikimedia.org/r/651803 [16:59:59] (03PS1) 10Volans: dnsdisc: improve test coverage [software/spicerack] - 10https://gerrit.wikimedia.org/r/651804 [17:00:01] (03PS1) 10Volans: Use newly migrated code from wmflib [software/spicerack] - 10https://gerrit.wikimedia.org/r/651805 (https://phabricator.wikimedia.org/T257905) [17:01:42] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr These are racked and netbox updated but not connected to the switches yet. @Jclark-ctr could you please cable these and... [17:02:42] (03PS1) 10CDanis: klaxon: restart on pulling changes from git [puppet] - 10https://gerrit.wikimedia.org/r/651806 [17:04:50] (03CR) 10RLazarus: [C: 03+1] "As long as we're here, maybe "your root account was compromised"? Or "privileged" or some other more specific thing? I'm worried about som" [software/klaxon] - 10https://gerrit.wikimedia.org/r/651797 (owner: 10CDanis) [17:07:40] (03CR) 10CDanis: "> Patch Set 1: Code-Review+1" [software/klaxon] - 10https://gerrit.wikimedia.org/r/651797 (owner: 10CDanis) [17:08:44] (03CR) 10RLazarus: [C: 03+1] "> I might go with "shell account"? 
It's also not quite right (we'd care about the compromise of an account with no shell access but with " [software/klaxon] - 10https://gerrit.wikimedia.org/r/651797 (owner: 10CDanis) [17:10:28] 10Operations, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Bstorm) 05Resolved→03Open This isn't really a duplicate. It was the overall tracking ticket for the individual system... [17:10:30] (03PS2) 10CDanis: Some wordsmithing, & remove a long-done TODO [software/klaxon] - 10https://gerrit.wikimedia.org/r/651797 [17:10:49] (03CR) 10CDanis: [V: 03+2 C: 03+2] Some wordsmithing, & remove a long-done TODO [software/klaxon] - 10https://gerrit.wikimedia.org/r/651797 (owner: 10CDanis) [17:11:07] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Bstorm) I was just looking into this. If the cross-over cables ordered will work... 
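Editor's sketch for the `/debug/` endpoint discussed at 16:26-16:32: the idea is to dump the WSGI environ, which is what Flask exposes as `request.environ`, and to read the CAS user from it instead of parsing headers by hand. This is a minimal sketch under assumptions, not the code in klaxon or puppetboard: it uses plain WSGI from the standard library rather than Flask, the names `format_environ`, `debug_app` and `cas_uid` are hypothetical, the `<br>` separator is an assumption, and values are HTML-escaped as an added precaution. The only fact relied on is the WSGI convention that a request header such as `X-CAS-UID` appears in the environ as `HTTP_X_CAS_UID`.

```python
from html import escape
from wsgiref.util import setup_testing_defaults


def format_environ(environ):
    """Render a WSGI environ mapping as 'KEY=value' pairs joined by <br>."""
    # Any expression is allowed inside an f-string, including function calls.
    # Values are escaped so header contents cannot inject markup into the page.
    return "<br>".join(
        f"{escape(str(key))}={escape(str(environ[key]))}" for key in sorted(environ)
    )


def cas_uid(environ, default="unknown"):
    """Return the CAS user id taken from the (assumed) X-CAS-UID request header."""
    return environ.get("HTTP_X_CAS_UID", default)


def debug_app(environ, start_response):
    """Hypothetical WSGI debug endpoint that echoes the request environ."""
    body = format_environ(environ).encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


if __name__ == "__main__":
    # Build a fake request environ and print the rendered debug page.
    env = {}
    setup_testing_defaults(env)
    env["HTTP_X_CAS_UID"] = "example-user"
    print(format_environ(env))
    print(cas_uid(env))
```

An endpoint like this exposes every request header and server variable, so it would presumably sit behind the same authentication as the rest of the app.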
[17:12:04] 10Operations, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Bstorm) [17:14:28] 10Operations, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Bstorm) [17:15:02] (03PS2) 10Jbond: nodegen: add cumin support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) [17:15:13] (03CR) 10Jbond: "Thanks for the very quick review" (033 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [17:15:25] 10Operations, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Bstorm) All moves were successful for the the traffic interfaces. The DRBD interface is still replicating, but it will he... [17:16:10] 10Operations, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Bstorm) I think this just needs a long enough cable between the two if the connectors are sfp+, per T266192#6710545 [17:22:28] (03CR) 10Volans: "> Patch Set 1:" (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [17:24:18] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (10elukey) Please ping Analytics before shutting down the host since there is a database running on it, so I'd prefer to do things gracefully and stop replication from an-co... 
[17:24:38] (03CR) 10CDanis: "I'm not convinced that this works, but I think it probably does?" [puppet] - 10https://gerrit.wikimedia.org/r/651806 (owner: 10CDanis) [17:30:54] (03CR) 10CDanis: [C: 03+1] "looks good! just one nit, optional" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651788 (https://phabricator.wikimedia.org/T270757) (owner: 10Jbond) [17:31:49] (03CR) 10CDanis: [C: 03+1] "looks good, one typo fix -- thanks!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) (owner: 10Jbond) [17:32:40] (03CR) 10Jbond: [C: 03+1] "LGTM see inline comment the more specific Exec is a little less intuitive and the more general Git::clone might do something unexpected. " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651806 (owner: 10CDanis) [17:36:19] (03PS2) 10CDanis: klaxon: restart on pulling changes from git [puppet] - 10https://gerrit.wikimedia.org/r/651806 [17:37:01] (03CR) 10Volans: pcc: add more info to the status message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651788 (https://phabricator.wikimedia.org/T270757) (owner: 10Jbond) [17:40:42] (03CR) 10CDanis: [C: 03+2] "thanks!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651806 (owner: 10CDanis) [17:43:15] are we sure it will not restart on every git fetch/pull on every puppet run? 
[17:43:34] I've never tested the subscribe on the git::clone [17:44:54] 10Operations, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Andrew) a:05Andrew→03Bstorm Reassigning to Brooke for drbd things [17:45:55] (03CR) 10Jbond: pcc: add more info to the status message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651788 (https://phabricator.wikimedia.org/T270757) (owner: 10Jbond) [17:47:00] (03CR) 10Jbond: klaxon: restart on pulling changes from git (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651806 (owner: 10CDanis) [17:47:29] volans: there was some discussion on the change, i think the thinking is let's just test it and see [17:48:09] (03CR) 10Andrew Bogott: [C: 03+2] Neutron: update stein l3 python files [puppet] - 10https://gerrit.wikimedia.org/r/651668 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [17:48:13] ack, my only fear is that it will restart at each puppet run [17:48:18] because puppet :D [17:48:36] yes, could be, let's test and see :) [17:48:45] (03CR) 10Andrew Bogott: [C: 03+2] haproxy: add a note about the $logging param [puppet] - 10https://gerrit.wikimedia.org/r/651246 (owner: 10Andrew Bogott) [17:49:02] (03PS4) 10Jbond: pcc: add more info to the status message [puppet] - 10https://gerrit.wikimedia.org/r/651788 (https://phabricator.wikimedia.org/T270757) [17:49:10] (03CR) 10Bstorm: "Will merge this after the break since this is a volume mount change 😊" [puppet] - 10https://gerrit.wikimedia.org/r/639883 (https://phabricator.wikimedia.org/T263284) (owner: 10Bstorm) [17:49:28] (03CR) 10jerkins-bot: [V: 04-1] pcc: add more info to the status message [puppet] - 10https://gerrit.wikimedia.org/r/651788 (https://phabricator.wikimedia.org/T270757) (owner: 10Jbond) [17:49:37] (03CR) 10Andrew Bogott: [C: 03+2] Neutron: apply our local dmz hacks for Stein [puppet] - 
10https://gerrit.wikimedia.org/r/651669 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [17:49:53] (03PS2) 10Andrew Bogott: Nova: add Stein manifests [puppet] - 10https://gerrit.wikimedia.org/r/651683 (https://phabricator.wikimedia.org/T261134) [17:50:03] (03PS5) 10Jbond: varnish: ratelimit vscode-phabricator plugin [puppet] - 10https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) [17:50:39] (03CR) 10Andrew Bogott: [C: 03+2] Nova: add Stein manifests [puppet] - 10https://gerrit.wikimedia.org/r/651683 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [17:51:11] (03CR) 10Volans: pcc: add more info to the status message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651788 (https://phabricator.wikimedia.org/T270757) (owner: 10Jbond) [17:51:28] (03PS4) 10Andrew Bogott: Neutron: apply our local dmz hacks for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651669 (https://phabricator.wikimedia.org/T261134) [17:51:30] (03PS2) 10Andrew Bogott: Nova: Move some hard-coded rocky includes into version-specific manifests [puppet] - 10https://gerrit.wikimedia.org/r/651692 [17:51:32] (03PS4) 10Andrew Bogott: Glance: add Stein versions of manifests [puppet] - 10https://gerrit.wikimedia.org/r/651684 (https://phabricator.wikimedia.org/T261134) [17:51:34] (03PS5) 10Andrew Bogott: Cinder: add Stein version of service manifest [puppet] - 10https://gerrit.wikimedia.org/r/651685 (https://phabricator.wikimedia.org/T261134) [17:51:36] (03PS6) 10Andrew Bogott: Keystone: Add Stein service manifests [puppet] - 10https://gerrit.wikimedia.org/r/651686 (https://phabricator.wikimedia.org/T261134) [17:51:38] (03PS8) 10Andrew Bogott: Neutron service manifests for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651687 (https://phabricator.wikimedia.org/T261134) [17:51:40] (03PS10) 10Andrew Bogott: Barbican: add service manifest for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651688 
(https://phabricator.wikimedia.org/T261134) [17:51:42] (03PS10) 10Andrew Bogott: Add OpenStack client package manifests for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651689 (https://phabricator.wikimedia.org/T261134) [17:51:44] (03PS12) 10Andrew Bogott: OpenStack codfw1dev -> Stein [puppet] - 10https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134) [17:52:58] (03CR) 10jerkins-bot: [V: 04-1] Keystone: Add Stein service manifests [puppet] - 10https://gerrit.wikimedia.org/r/651686 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [17:53:16] (03CR) 10Andrew Bogott: [C: 03+2] Nova: Move some hard-coded rocky includes into version-specific manifests [puppet] - 10https://gerrit.wikimedia.org/r/651692 (owner: 10Andrew Bogott) [17:54:06] (03CR) 10Andrew Bogott: [C: 03+2] Nova: Move some hard-coded rocky includes into version-specific manifests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651692 (owner: 10Andrew Bogott) [17:54:27] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) [18:00:43] (03PS2) 10Bstorm: wikireplicas: set up VM haproxy layer [puppet] - 10https://gerrit.wikimedia.org/r/651778 (https://phabricator.wikimedia.org/T267376) [18:00:51] (03CR) 10Jbond: pcc: add more info to the status message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651788 (https://phabricator.wikimedia.org/T270757) (owner: 10Jbond) [18:01:21] (03PS5) 10Jbond: pcc: add more info to the status message [puppet] - 10https://gerrit.wikimedia.org/r/651788 (https://phabricator.wikimedia.org/T270757) [18:03:52] (03CR) 10Andrew Bogott: [C: 03+2] Glance: add Stein versions of manifests [puppet] - 10https://gerrit.wikimedia.org/r/651684 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [18:04:10] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: add Stein version of service manifest [puppet] - 
10https://gerrit.wikimedia.org/r/651685 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [18:04:15] (03CR) 10Bstorm: [C: 03+2] "Merging this to facilitate tinkering while the other patch will be aimed at making things better in puppet." [puppet] - 10https://gerrit.wikimedia.org/r/651778 (https://phabricator.wikimedia.org/T267376) (owner: 10Bstorm) [18:05:11] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] "overriding linter complaint; as far as I know including network::constants is the only good way to do this." [puppet] - 10https://gerrit.wikimedia.org/r/651686 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [18:05:47] (03CR) 10Andrew Bogott: [C: 03+2] Neutron service manifests for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651687 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [18:06:05] (03CR) 10Andrew Bogott: [C: 03+2] Add OpenStack client package manifests for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651689 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [18:06:40] (03CR) 10Andrew Bogott: [C: 03+2] Barbican: add service manifest for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651688 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [18:09:17] (03PS7) 10Bstorm: cloud haproxy: refactor the various haproxy setups [puppet] - 10https://gerrit.wikimedia.org/r/651301 [18:10:37] (03PS13) 10Andrew Bogott: OpenStack codfw1dev -> Stein [puppet] - 10https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134) [18:10:39] (03PS1) 10Andrew Bogott: OpenStack Keystone: move ldap-common-rocky-fixed.py into files/rocky [puppet] - 10https://gerrit.wikimedia.org/r/651812 (https://phabricator.wikimedia.org/T261134) [18:11:53] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Keystone: move ldap-common-rocky-fixed.py into files/rocky [puppet] - 10https://gerrit.wikimedia.org/r/651812 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) 
[18:12:28] thanks cdanis! [18:12:36] 🍻 [18:18:59] cdanis: so Riccardo is allowed to go on holidays? [18:19:09] I am not sure if it is wise [18:19:10] elukey: the only one stopping him is himself [18:19:16] :D [18:19:30] happy holidays folks :) [18:19:43] to you too! enjoy [18:20:09] We can celebrate with some Lambrusco 9_9 [18:20:58] Happy Xmas everyone! Thanks for all your work! [18:25:15] (03CR) 10Elukey: "This is a great start, I added some comments to let you keep working on it, thanks a lot! \o/" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [18:44:58] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) >>! In T260445#6706896, @Cmjohnson wrote: > @elukey that will work, I will add 1 to B4 and 2 to C2. Thanks! Followed up with Chris on IRC,... [18:49:47] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:51:17] cmjohnson1: ^^^ FYI I guess you were working on it [18:52:24] (03PS2) 10Dzahn: gerrit: split replica hosts into sepearate role/profile [puppet] - 10https://gerrit.wikimedia.org/r/649752 [18:53:31] (03CR) 10RLazarus: [C: 03+1] dnsdisc: improve test coverage [software/spicerack] - 10https://gerrit.wikimedia.org/r/651804 (owner: 10Volans) [18:53:43] 10Puppet, 10Beta-Cluster-Infrastructure, 10Developer Productivity, 10Patch-For-Review: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (10nskaggs) [18:53:56] (03CR) 10jerkins-bot: [V: 04-1] gerrit: split replica hosts into sepearate role/profile [puppet] - 10https://gerrit.wikimedia.org/r/649752 (owner: 10Dzahn) [18:57:22] (03PS3) 10Dzahn: gerrit: split replica hosts into sepearate role/profile [puppet] - 
10https://gerrit.wikimedia.org/r/649752 [18:58:01] (03PS4) 10Dzahn: gerrit: split replica hosts into separate role/profile [puppet] - 10https://gerrit.wikimedia.org/r/649752 [19:04:51] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack codfw1dev -> Stein [puppet] - 10https://gerrit.wikimedia.org/r/651691 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [19:08:40] (03PS1) 10Andrew Bogott: Add some resource files for Cinder/Stein [puppet] - 10https://gerrit.wikimedia.org/r/651820 (https://phabricator.wikimedia.org/T261134) [19:11:13] (03CR) 10Andrew Bogott: [C: 03+2] Add some resource files for Cinder/Stein [puppet] - 10https://gerrit.wikimedia.org/r/651820 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [19:12:26] (03PS1) 10Dzahn: gerrit: drop is_replica and replica_hosts after splitting roles [puppet] - 10https://gerrit.wikimedia.org/r/651821 [19:14:28] (03CR) 10jerkins-bot: [V: 04-1] gerrit: drop is_replica and replica_hosts after splitting roles [puppet] - 10https://gerrit.wikimedia.org/r/651821 (owner: 10Dzahn) [19:15:13] (03CR) 10Dzahn: "> All of that will be gone eventually when we use two different roles though." 
[puppet] - 10https://gerrit.wikimedia.org/r/643919 (owner: 10Hashar) [19:16:39] (03PS1) 10Andrew Bogott: Glance: remove glance-registry service from Stein deploys [puppet] - 10https://gerrit.wikimedia.org/r/651822 (https://phabricator.wikimedia.org/T261134) [19:17:18] (03CR) 10jerkins-bot: [V: 04-1] Glance: remove glance-registry service from Stein deploys [puppet] - 10https://gerrit.wikimedia.org/r/651822 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [19:22:38] (03PS2) 10Andrew Bogott: Glance: remove glance-registry service from Stein deploys [puppet] - 10https://gerrit.wikimedia.org/r/651822 (https://phabricator.wikimedia.org/T261134) [19:23:52] (03CR) 10Andrew Bogott: [C: 03+2] Glance: remove glance-registry service from Stein deploys [puppet] - 10https://gerrit.wikimedia.org/r/651822 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [19:29:13] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:39:46] (03PS1) 10Andrew Bogott: Cinder: add a puppetized api init script in Stein [puppet] - 10https://gerrit.wikimedia.org/r/651825 (https://phabricator.wikimedia.org/T261134) [19:40:23] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: add a puppetized api init script in Stein [puppet] - 10https://gerrit.wikimedia.org/r/651825 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [19:53:41] (03PS1) 10Andrew Bogott: Nova-api: add a puppetized api init script in Stein [puppet] - 10https://gerrit.wikimedia.org/r/651829 (https://phabricator.wikimedia.org/T261134) [19:55:13] (03CR) 10jerkins-bot: [V: 04-1] Nova-api: add a puppetized api init script in Stein [puppet] - 10https://gerrit.wikimedia.org/r/651829 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [19:56:47] (03PS2) 10Andrew Bogott: Nova-api: add a puppetized api init script in Stein [puppet] - 
10https://gerrit.wikimedia.org/r/651829 (https://phabricator.wikimedia.org/T261134) [19:58:27] (03PS3) 10Andrew Bogott: Nova-api: add a puppetized api init script in Stein [puppet] - 10https://gerrit.wikimedia.org/r/651829 (https://phabricator.wikimedia.org/T261134) [20:00:09] (03PS4) 10Andrew Bogott: Nova-api: add a puppetized api init script in Stein [puppet] - 10https://gerrit.wikimedia.org/r/651829 (https://phabricator.wikimedia.org/T261134) [20:00:12] (03CR) 10jerkins-bot: [V: 04-1] Nova-api: add a puppetized api init script in Stein [puppet] - 10https://gerrit.wikimedia.org/r/651829 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [20:01:42] (03CR) 10jerkins-bot: [V: 04-1] Nova-api: add a puppetized api init script in Stein [puppet] - 10https://gerrit.wikimedia.org/r/651829 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [20:02:32] (03PS5) 10Andrew Bogott: Nova-api: add a puppetized api init script in Stein [puppet] - 10https://gerrit.wikimedia.org/r/651829 (https://phabricator.wikimedia.org/T261134) [20:04:05] (03CR) 10jerkins-bot: [V: 04-1] Nova-api: add a puppetized api init script in Stein [puppet] - 10https://gerrit.wikimedia.org/r/651829 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [20:05:35] (03PS6) 10Andrew Bogott: Nova-api: add a puppetized api init script in Stein [puppet] - 10https://gerrit.wikimedia.org/r/651829 (https://phabricator.wikimedia.org/T261134) [20:07:16] (03CR) 10Andrew Bogott: [C: 03+2] Nova-api: add a puppetized api init script in Stein [puppet] - 10https://gerrit.wikimedia.org/r/651829 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [20:13:53] (03PS1) 10CDanis: link to FAQ in footer; tweak layout [software/klaxon] - 10https://gerrit.wikimedia.org/r/651832 [20:14:25] (03PS1) 10Nskaggs: wmcs: Add project NFS for wikilink project [puppet] - 10https://gerrit.wikimedia.org/r/651833 (https://phabricator.wikimedia.org/T264107) [20:16:53] (03PS1) 10Dzahn: 
etcd::v3: hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/651834 (https://phabricator.wikimedia.org/T209953) [20:18:29] (03CR) 10jerkins-bot: [V: 04-1] etcd::v3: hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/651834 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:27:07] (03CR) 10Bstorm: [C: 03+2] wmcs: Add project NFS for wikilink project [puppet] - 10https://gerrit.wikimedia.org/r/651833 (https://phabricator.wikimedia.org/T264107) (owner: 10Nskaggs) [20:30:23] 10Operations: launch Klaxon: manual paging app for trusted users to escalate urgent issues to SRE - https://phabricator.wikimedia.org/T270324 (10CDanis) Now live: https://klaxon.wikimedia.org/ and successfully tested today! Also wrote some docs + a FAQ for users at https://wikitech.wikimedia.org/wiki/Klaxon Al... [20:30:57] 10Operations: enable Python CI in operations/software/klaxon - https://phabricator.wikimedia.org/T270790 (10CDanis) [20:32:05] 10Operations, 10SRE-OnFire: launch Klaxon: manual paging app for trusted users to escalate urgent issues to SRE - https://phabricator.wikimedia.org/T270324 (10CDanis) [20:32:16] 10Operations, 10SRE-OnFire: launch Klaxon: manual paging app for trusted users to escalate urgent issues to SRE - https://phabricator.wikimedia.org/T270324 (10CDanis) 05Open→03Resolved Thanks to #observability for initial feedback, and to @Joe @jbond and especially @RLazarus for code reviews! 
[20:36:28] (03CR) 10RLazarus: [C: 03+1] link to FAQ in footer; tweak layout [software/klaxon] - 10https://gerrit.wikimedia.org/r/651832 (owner: 10CDanis) [20:37:40] 10Operations, 10SRE-OnFire: launch Klaxon: manual paging app for trusted users to escalate urgent issues to SRE - https://phabricator.wikimedia.org/T270324 (10CDanis) [20:38:02] (03CR) 10CDanis: [V: 03+2 C: 03+2] link to FAQ in footer; tweak layout [software/klaxon] - 10https://gerrit.wikimedia.org/r/651832 (owner: 10CDanis) [20:38:31] (03CR) 10Razzi: Add cookbook for rebooting druid nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [20:40:31] (03CR) 10Legoktm: "recheck" [software/klaxon] - 10https://gerrit.wikimedia.org/r/651832 (owner: 10CDanis) [20:50:46] (03PS1) 10CDanis: specify a minimum version of responses [software/klaxon] - 10https://gerrit.wikimedia.org/r/651836 [20:51:46] (03CR) 10jerkins-bot: [V: 04-1] specify a minimum version of responses [software/klaxon] - 10https://gerrit.wikimedia.org/r/651836 (owner: 10CDanis) [20:58:59] (03PS2) 10CDanis: specify a minimum version of responses [software/klaxon] - 10https://gerrit.wikimedia.org/r/651836 [21:00:52] (03PS1) 10Dzahn: openldap: hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/651838 [21:01:14] 10Operations, 10SRE-OnFire: launch Klaxon: manual paging app for trusted users to escalate urgent issues to SRE - https://phabricator.wikimedia.org/T270324 (10Legoktm) [21:01:19] 10Operations, 10Patch-For-Review: enable Python CI in operations/software/klaxon - https://phabricator.wikimedia.org/T270790 (10Legoktm) 05Open→03Resolved a:05CDanis→03Legoktm [21:10:24] (03PS3) 10CDanis: specify a minimum version of responses [software/klaxon] - 10https://gerrit.wikimedia.org/r/651836 [21:11:43] (03PS4) 10CDanis: specify a minimum version of responses [software/klaxon] - 10https://gerrit.wikimedia.org/r/651836 [21:12:43] (03PS5) 10CDanis: specify a minimum version 
of responses; relax requests version [software/klaxon] - 10https://gerrit.wikimedia.org/r/651836 [21:14:20] (03PS6) 10CDanis: specify a minimum version of responses; relax requests version [software/klaxon] - 10https://gerrit.wikimedia.org/r/651836 [21:14:30] (03PS1) 10Bstorm: wikireplicas proxies: fix a couple errors in the config [puppet] - 10https://gerrit.wikimedia.org/r/651839 [21:15:47] (03PS1) 10Andrew Bogott: wmfkeystonehooks: Update the monkeypatch that renames project IDs for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651840 (https://phabricator.wikimedia.org/T261134) [21:16:28] (03CR) 10jerkins-bot: [V: 04-1] wmfkeystonehooks: Update the monkeypatch that renames project IDs for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651840 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [21:16:38] (03CR) 10Bstorm: [C: 03+2] wikireplicas proxies: fix a couple errors in the config [puppet] - 10https://gerrit.wikimedia.org/r/651839 (owner: 10Bstorm) [21:16:49] (03CR) 10CDanis: [C: 03+2] specify a minimum version of responses; relax requests version [software/klaxon] - 10https://gerrit.wikimedia.org/r/651836 (owner: 10CDanis) [21:18:15] (03Merged) 10jenkins-bot: specify a minimum version of responses; relax requests version [software/klaxon] - 10https://gerrit.wikimedia.org/r/651836 (owner: 10CDanis) [21:20:36] (03PS2) 10Andrew Bogott: wmfkeystonehooks: Update the monkeypatch that renames project IDs for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651840 (https://phabricator.wikimedia.org/T261134) [21:21:38] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: Update the monkeypatch that renames project IDs for Stein [puppet] - 10https://gerrit.wikimedia.org/r/651840 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [21:30:13] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [21:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:09] (03PS1) 10Bstorm: wikireplicas 
proxy: another round of fixes [puppet] - 10https://gerrit.wikimedia.org/r/651843 [21:33:17] (03CR) 10Bstorm: [C: 03+2] wikireplicas proxy: another round of fixes [puppet] - 10https://gerrit.wikimedia.org/r/651843 (owner: 10Bstorm) [21:33:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:26] (03PS1) 10Bstorm: wikireplicas proxy: Make port map available to cloud [puppet] - 10https://gerrit.wikimedia.org/r/651845 (https://phabricator.wikimedia.org/T267376) [21:48:19] (03CR) 10Bstorm: [C: 03+2] wikireplicas proxy: Make port map available to cloud [puppet] - 10https://gerrit.wikimedia.org/r/651845 (https://phabricator.wikimedia.org/T267376) (owner: 10Bstorm) [21:54:17] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:57:18] (03CR) 10Bstorm: "Since puppetdb doesn't work on the PCC (or in cloud in general), I'll likely need to add a safeguard for that here for the sake of checkin" [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [21:59:59] (03PS8) 10Bstorm: cloud haproxy: refactor the various haproxy setups [puppet] - 10https://gerrit.wikimedia.org/r/651301 [22:01:30] (03CR) 10jerkins-bot: [V: 04-1] cloud haproxy: refactor the various haproxy setups [puppet] - 10https://gerrit.wikimedia.org/r/651301 (owner: 10Bstorm) [22:01:32] (03PS9) 10Bstorm: cloud haproxy: refactor the various haproxy setups [puppet] - 10https://gerrit.wikimedia.org/r/651301 [22:12:04] (03PS11) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [22:13:22] (03PS12) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 
10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [22:14:42] (03PS1) 10Legoktm: Switch to pytest and use tox-wikimedia [software/klaxon] - 10https://gerrit.wikimedia.org/r/651846 [22:17:24] (03PS13) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [22:20:19] (03PS1) 10Legoktm: [DNM] jenkins testing [software/cumin] - 10https://gerrit.wikimedia.org/r/651847 [22:26:32] (03PS14) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [22:35:19] (03PS15) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [22:46:17] (03PS16) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [22:57:17] (03Abandoned) 10Legoktm: [DNM] jenkins testing [software/cumin] - 10https://gerrit.wikimedia.org/r/651847 (owner: 10Legoktm) [23:00:18] (03PS17) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [23:12:54] (03PS18) 10Bstorm: wikireplicas: Work toward a proxy setup on multi-instance replicas [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [23:47:07] (03PS1) 10Bstorm: multiinstance proxies: workaround puppetdb not being in cloud [labs/private] - 10https://gerrit.wikimedia.org/r/651857 (https://phabricator.wikimedia.org/T260389) [23:48:22] (03PS2) 10Bstorm: multiinstance proxies: workaround puppetdb not being in cloud [labs/private] - 10https://gerrit.wikimedia.org/r/651857 (https://phabricator.wikimedia.org/T260389) 
[23:54:05] (03PS3) 10Bstorm: multiinstance proxies: workaround puppetdb not being in cloud [labs/private] - 10https://gerrit.wikimedia.org/r/651857 (https://phabricator.wikimedia.org/T260389) [23:56:54] (03CR) 10Bstorm: [V: 03+2 C: 03+2] "I understand this is really hacky, but I am trying my method of overrides in puppet by adding this here. I can revert/change it later if i" [labs/private] - 10https://gerrit.wikimedia.org/r/651857 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [23:59:59] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=- method=POST https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST