[00:00:04] twentyafterfour: #bothumor I � Unicode. All rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210325T0000).
[00:03:21] Puppet, SRE, Wikimedia-Mailing-lists: Make puppet for mailman3 ready for production - https://phabricator.wikimedia.org/T277286 (Legoktm) OK, fixed it with: ` MariaDB [testmailman3]> ALTER DATABASE testmailman3 CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci; Query OK, 1 row affected (0.001 sec) Ma...
[00:05:50] !log T274204 `sudo -i cookbook sre.elasticsearch.rolling-upgrade search_eqiad "eqiad cluster reboot" --task-id T274204 --nodes-per-run 3 --start-datetime 2021-03-24T23:55:35` on `ryankemper@cumin1001` tmux session `elasticsearch_rolling_upgrade_reboots`
[00:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:02] T274204: Deploy new version of Extra Plugin (with Khmer filter) to Elasticsearch cluster - https://phabricator.wikimedia.org/T274204
[00:08:12] (PS1) Andrew Bogott: nova vendordata: change config block type to text/jinja2 [puppet] - https://gerrit.wikimedia.org/r/674734
[00:10:05] (CR) Andrew Bogott: [C: +2] nova vendordata: change config block type to text/jinja2 [puppet] - https://gerrit.wikimedia.org/r/674734 (owner: Andrew Bogott)
[00:10:56] !log deploying phabricator
[00:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:20] (PS1) Dzahn: site/conftool-data: decom jobrunners mw2243,mw2246,mw2247,mw2248 [puppet] - https://gerrit.wikimedia.org/r/674736 (https://phabricator.wikimedia.org/T277780)
[00:14:37] !log phabricator update complete
[00:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:19:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:20:59] RECOVERY - Check systemd state on lists1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:05] (PS1) Legoktm: mailman3: Remove incorrect STATICFILES_DIRS setting [puppet] - https://gerrit.wikimedia.org/r/674737
[00:21:08] SRE, serviceops: bring 35 new mediawiki appserver in codfw into production (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (Dzahn)
[00:21:46] (CR) Legoktm: [V: +2 C: +2] mailman3: Remove incorrect STATICFILES_DIRS setting [puppet] - https://gerrit.wikimedia.org/r/674737 (owner: Legoktm)
[00:23:47] !log mw2377, mw2378 - reboot
[00:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:15] (PS1) Legoktm: mailman3: Connect using MySQL strict mode (STRICT_ALL_TABLES) [puppet] - https://gerrit.wikimedia.org/r/674738
[00:27:09] RECOVERY - DPKG on lists1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[00:29:20] wheee
[00:29:51] !log syncing facts for puppet-compiler
[00:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:03] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2377.codfw.wmnet
[00:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:13] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2378.codfw.wmnet
[00:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:43] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw2377.codfw.wmnet
[00:33:51] !log dzahn@cumin1001 conftool
action : set/weight=10; selector: name=mw2378.codfw.wmnet
[00:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:30] !log mw2377, mw2378 - first scap pull
[00:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:32] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2777.codfw.wmnet
[00:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:47] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2377.codfw.wmnet
[00:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:50:58] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2378.codfw.wmnet
[00:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:08:11] !log mailman3: renamed default site from "example.com" to "lists-next.wikimedia.org"
[01:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:10:45] !log mailman3: added lists-next.wikimedia.org domain
[01:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:27] 2021-03-25 01:09:58 1lPEVY-0003jL-Pn H=muffat.debian.org [2607:f8f0:614:1::1274:33]: SMTP error from remote mail server after RCPT TO:: 451 Greylisted, see http://postgrey.schweikert.ch/help/debian.org.html
[01:27:47] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:32:18] the email eventually went through
[01:33:05] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
[01:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:46:16] SRE, Wikimedia-Mailing-lists: Create `mailman-web` helper alias - https://phabricator.wikimedia.org/T278404 (Legoktm)
[01:47:48] (CR) BryanDavis: "> It would need a rewrite to fix it, and there *already is a querykiller* on the servers. It's set to four hours." [puppet] - https://gerrit.wikimedia.org/r/674710 (https://phabricator.wikimedia.org/T264254) (owner: Bstorm)
[01:49:35] SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Legoktm)
[01:49:46] Puppet, SRE, Wikimedia-Mailing-lists: Make puppet for mailman3 ready for production - https://phabricator.wikimedia.org/T277286 (Legoktm) Open→Resolved a:Legoktm This is basically done by virtue of having the test server working. Announcement to follow shortly.
[02:05:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:07:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:31:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:33:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:10:33] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-upgrade
[03:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:15:33] (PS1) Legoktm: mailman3: Lower logging level to debug [puppet] - https://gerrit.wikimedia.org/r/674748
[03:18:36] (CR) Legoktm: "recheck" [puppet] - https://gerrit.wikimedia.org/r/674748 (owner: Legoktm)
[03:18:37] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:24:33] (CR) Legoktm: "recheck" [puppet] - https://gerrit.wikimedia.org/r/674748 (owner: Legoktm)
[03:25:11] (CR) Legoktm: [C: +2] mailman3: Lower logging level to debug [puppet] - https://gerrit.wikimedia.org/r/674748 (owner: Legoktm)
[03:25:19] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 4.106 second response time
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:28:32] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28770/console" [puppet] - https://gerrit.wikimedia.org/r/674738 (owner: Legoktm)
[03:28:50] (CR) Legoktm: [V: +1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/674738 (owner: Legoktm)
[04:16:53] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
[04:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:25] !log restarted exim4 on lists1002 so it listens on 0.0.0.0 instead of 127.0.0.1
[04:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:50:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:57:24] (PS1) Legoktm: mailman3: Use host IP in MAILMAN_ARCHIVER_FROM [puppet] - https://gerrit.wikimedia.org/r/674750
[04:58:16] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28771/console" [puppet] - https://gerrit.wikimedia.org/r/674750 (owner: Legoktm)
[04:58:31] (CR) Ladsgroup: [C: +1] mailman3: Use host IP in MAILMAN_ARCHIVER_FROM [puppet] - https://gerrit.wikimedia.org/r/674750 (owner: Legoktm)
[04:58:53] (CR) Legoktm: [V: +1 C: +2] mailman3: Use host IP in MAILMAN_ARCHIVER_FROM [puppet] - https://gerrit.wikimedia.org/r/674750 (owner: Legoktm)
[05:19:44] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-upgrade
[05:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:59] (PS1) Andrew Bogott: cloud-init vendordata: try to optimize initial puppet runs [puppet] - https://gerrit.wikimedia.org/r/674752 (https://phabricator.wikimedia.org/T278051)
[05:23:21] (CR) Andrew Bogott: [C: +2] cloud-init vendordata: try to optimize initial puppet runs [puppet] - https://gerrit.wikimedia.org/r/674752 (https://phabricator.wikimedia.org/T278051) (owner: Andrew Bogott)
[05:34:10] ACKNOWLEDGEMENT - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service Ryan Kemper https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:49:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:51:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is
OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:23:41] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99)
[06:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:57] !log disable puppet on all mediawiki servers to merge 673061 (service proxy to listen on ::1)
[06:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:27] Seen as patches aren't being relayed to irc, https://gerrit.wikimedia.org/r/c/operations/puppet/+/674684
[06:44:08] Why aren't puppet patches showing in here, it's up
[06:45:22] (PS3) Effie Mouzeli: hieradata: enable ipv6 on envoy services on all mw servers [puppet] - https://gerrit.wikimedia.org/r/673061 (https://phabricator.wikimedia.org/T255568)
[06:46:09] (CR) Effie Mouzeli: [C: +2] hieradata: enable ipv6 on envoy services on all mw servers [puppet] - https://gerrit.wikimedia.org/r/673061 (https://phabricator.wikimedia.org/T255568) (owner: Effie Mouzeli)
[06:47:13] !log enable puppet on parsoid
[06:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:16] (CR) Ayounsi: [C: +2] Option 82: use-vlan-id [homer/public] - https://gerrit.wikimedia.org/r/674574 (https://phabricator.wikimedia.org/T221388) (owner: Ayounsi)
[06:53:30] !log enable puppet on jobrunners
[06:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:56] (Merged) jenkins-bot: Option 82: use-vlan-id [homer/public] - https://gerrit.wikimedia.org/r/674574 (https://phabricator.wikimedia.org/T221388) (owner: Ayounsi)
[06:57:51] !log Option 82: use-vlan-id
[06:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:09] !log enable puppet on all mediawiki servers
[07:05:21] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:29] XioNoX: Riccardo explained to me Option 82, looks really really nice
[07:31:52] elukey: it's a hack, but it's well supported
[07:31:56] it seems as if you and Riccardo are on a secret mission project
[07:32:09] elegant hack :D
[07:32:28] SRE, Wikimedia-Mailing-lists, Upstream: Fix the problem with gravatar and mailman3 - https://phabricator.wikimedia.org/T256541 (Ladsgroup)
[07:32:38] SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Ladsgroup)
[07:32:52] SRE, Packaging, Wikimedia-Mailing-lists, User-Ladsgroup: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687 (Ladsgroup) Open→Declined @Legoktm is patching the old hyperkitty instead.
[07:35:40] !log restart db2135 T278408 T273281
[07:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:56] T278408: db2135 crashed - https://phabricator.wikimedia.org/T278408
[07:36:33] !log uploaded hyperkitty 1.2.2-1+wmf1 to buster-wikimedia (T276687)
[07:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:43] T276687: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687
[07:41:58] !log upgraded lists1002 to hyperkitty 1.2.2-1+wmf1 (T276687)
[07:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:09] T276687: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687
[07:45:04] SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Legoktm)
[07:45:31] SRE, Wikimedia-Mailing-lists, Upstream: Fix the problem with gravatar and mailman3 - https://phabricator.wikimedia.org/T256541 (Legoktm) Open→Resolved a:Legoktm We ended up patching out gravatar instead:
https://salsa.debian.org/legoktm/hyperkitty/-/commit/de3e8a12825907180949d3f6983dfbda...
[07:48:51] (CR) Filippo Giunchedi: [C: +1] Add normalized object field [software/ecs] - https://gerrit.wikimedia.org/r/672805 (owner: Cwhite)
[07:51:07] RECOVERY - MariaDB Replica Lag: m5 on db2078 is OK: OK slave_sql_lag Replication lag: 0.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:51:24] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28772/console" [puppet] - https://gerrit.wikimedia.org/r/671283 (owner: Herron)
[07:51:43] RECOVERY - MariaDB Replica Lag: m5 on db2135 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:51:57] SRE, DBA, Wikimedia-Mailing-lists: db2135 crashed - https://phabricator.wikimedia.org/T278408 (jcrespo) I restarted the host to check for hw errors. After upgrade and restart, I ran into: ` Error 'Duplicate key name 'ix_mailinglist_list_id'' on query. Default database: 'testmailman3'. Query: 'CREATE...
[07:54:40] SRE, Wikimedia-Mailing-lists, observability: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (Legoktm) @lmata, we could use some help in setting this up.
[07:58:56] (CR) Filippo Giunchedi: "LGTM overall, see inline" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/673594 (https://phabricator.wikimedia.org/T224565) (owner: Herron)
[08:04:27] Amir1: I still see gravatar mentioned on https://lists-next.wikimedia.org/user-profile/, does it make sense to have it if it's off?
[08:05:30] RhinosF1: probably missing the packaging (or it's of postorious?)
file a bug and will handle it
[08:08:03] T278410
[08:08:06] T278410: Gravatar add link still shows in profile - https://phabricator.wikimedia.org/T278410
[08:11:25] !log upgrade hive packages in thirdparty/bigtop15 to 2.3.6-2
[08:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:36] !log upgrade hive packages in thirdparty/bigtop15 to 2.3.6-2 for buster-wikimedia
[08:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:36] (PS1) Kosta Harlan: linkrecommendation: Enable proxy API URL for all releases [deployment-charts] - https://gerrit.wikimedia.org/r/674802
[08:16:41] Just checking: are we running Varnishd 4 in production?
[08:18:18] (PS1) Filippo Giunchedi: alertmanager: send recovery emails for performance [puppet] - https://gerrit.wikimedia.org/r/674803 (https://phabricator.wikimedia.org/T272979)
[08:19:24] Please disregard, looks like varnish 4.1.3
[08:21:16] (CR) Filippo Giunchedi: [C: +2] alertmanager: send recovery emails for performance [puppet] - https://gerrit.wikimedia.org/r/674803 (https://phabricator.wikimedia.org/T272979) (owner: Filippo Giunchedi)
[08:26:52] (CR) Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) (owner: Effie Mouzeli)
[08:31:25] (PS1) Elukey: hadoop: tune Yarn's capacity scheduler defaults [puppet] - https://gerrit.wikimedia.org/r/674806 (https://phabricator.wikimedia.org/T277062)
[08:31:27] (PS14) Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115)
[08:31:46] (CR) jerkins-bot: [V: -1] mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) (owner: Effie Mouzeli)
[08:35:53] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
[08:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:18] (PS10) Effie Mouzeli: hieradata: enable memcached socket mwdebug1001, mwdebug2001 [puppet] - https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115)
[08:40:00] (CR) Effie Mouzeli: "> Patch Set 8:" [puppet] - https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) (owner: Effie Mouzeli)
[08:42:16] (PS15) Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115)
[08:43:33] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
[08:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:07] (PS2) Elukey: hadoop: tune Yarn's capacity scheduler defaults [puppet] - https://gerrit.wikimedia.org/r/674806 (https://phabricator.wikimedia.org/T277062)
[08:45:46] !log drain ganeti2023
[08:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:23] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28774/console" [puppet] - https://gerrit.wikimedia.org/r/674806 (https://phabricator.wikimedia.org/T277062) (owner: Elukey)
[08:51:23] (CR) Joal: [C: +1] "LGTM :)" [puppet] - https://gerrit.wikimedia.org/r/674806 (https://phabricator.wikimedia.org/T277062) (owner: Elukey)
[08:52:49] (CR) Elukey: [V: +1 C: +2] hadoop: tune Yarn's capacity scheduler defaults [puppet] - https://gerrit.wikimedia.org/r/674806 (https://phabricator.wikimedia.org/T277062) (owner: Elukey)
[08:57:22] (CR) Hashar: gerrit: restoring a change adds #Patch-For-Review (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/674548 (https://phabricator.wikimedia.org/T277597) (owner: Hashar)
[08:57:30] (PS2) Hashar: gerrit: restoring a change adds #Patch-For-Review [puppet] - https://gerrit.wikimedia.org/r/674548 (https://phabricator.wikimedia.org/T277597)
[08:58:41] (PS2) Hashar: gerrit: escape remarkup for Phabricator comments [puppet] - https://gerrit.wikimedia.org/r/674558 (https://phabricator.wikimedia.org/T93331)
[09:06:57] s
[09:10:01] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 111543560 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[09:17:00] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
[09:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:13] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 711072 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[09:25:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
[09:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:03] !log drain ganeti2024
[09:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:32] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Igor Lidin from Speed & Function - https://phabricator.wikimedia.org/T278327 (Sfigor) I'm acknowledging I have read and signed L3 Wikimedia Server Access Responsibilities document.
[09:31:40] (CR) Jcrespo: "Please note you sent the updates as a new patch."
[software/transferpy] - https://gerrit.wikimedia.org/r/674663 (https://phabricator.wikimedia.org/T268258) (owner: Palak199)
[09:32:48] (PS2) Muehlenhoff: Point irc.wikimedia.org to irc2001 [dns] - https://gerrit.wikimedia.org/r/674617 (https://phabricator.wikimedia.org/T224579)
[09:35:28] (PS7) Palak199: Modify:: The parsing function in transfer.py [software/transferpy] - https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258)
[09:38:05] (Abandoned) Jcrespo: Modify:: Modify path variable in parse function of transfer.py [software/transferpy] - https://gerrit.wikimedia.org/r/674663 (https://phabricator.wikimedia.org/T268258) (owner: Palak199)
[09:38:07] (CR) Muehlenhoff: [C: +2] Point irc.wikimedia.org to irc2001 [dns] - https://gerrit.wikimedia.org/r/674617 (https://phabricator.wikimedia.org/T224579) (owner: Muehlenhoff)
[09:38:58] (CR) Jcrespo: "recheck" [software/transferpy] - https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: Palak199)
[09:41:28] (CR) jerkins-bot: [V: -1] Modify:: The parsing function in transfer.py [software/transferpy] - https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: Palak199)
[09:45:10] (CR) Palak199: "I mentioned about it in the message of this" [software/transferpy] - https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: Palak199)
[09:45:20] (PS1) Matthias Mullie: Enable MediaSearch by default for anonymous users [mediawiki-config] - https://gerrit.wikimedia.org/r/674820
[09:48:58] (PS8) Palak199: Modify:: The parsing function in transfer.py [software/transferpy] - https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258)
[09:49:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:49:41] (CR) Jbond: netbase: add new module to manage /etc/services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: Jbond)
[09:51:27] (CR) Volans: "@jbond I did a quick pass and I have to admit that this is a bit out of my confort zone for giving a clear vote. I did add a couple of gen" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: Jbond)
[09:51:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:57:12] (CR) Jcrespo: "> Patch Set 7:" [software/transferpy] - https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: Palak199)
[09:59:23] SRE, Traffic: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (ema) >>! In T275809#6862524, @BBlack wrote: > We have such a strategy in our puppetized VCL, currently unused, called "exp" - https://gerrit.wikimedia.org/r/plugins/gitiles/operations/pupp...
[10:00:04] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210325T1000).
[10:00:28] SRE, Data-Persistence-Backup: Setup an Offsite backup infrastructure - https://phabricator.wikimedia.org/T85278 (jcrespo) a:jcrespo→None
[10:01:56] SRE, Data-Persistence-Backup: Setup an Offsite backup infrastructure - https://phabricator.wikimedia.org/T85278 (jcrespo) Open→Resolved a:jcrespo This is technically resolved, for both regular backups and database ones.
However, I may create a new task at some point to remove the offsite copy...
[10:03:21] (CR) David Caro: [C: +1] "If it's not working already +1 for removing it." [puppet] - https://gerrit.wikimedia.org/r/674710 (https://phabricator.wikimedia.org/T264254) (owner: Bstorm)
[10:07:57] (CR) Palak199: "> So my first suggestion would be to change the behavior of finding multiple colons (and its test) to producing an error. If the syntax is" [software/transferpy] - https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: Palak199)
[10:10:17] (CR) Jbond: [C: +1] "LGTM" [software/netbox] - https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[10:13:19] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[10:13:20] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=99)
[10:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:40] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[10:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:51] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[10:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:53] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[10:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:03] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
[10:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:29] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Igor Lidin from Speed & Function - https://phabricator.wikimedia.org/T278327 (jbond) @thcipriani / @wkandek can
you approve this request @KFrancis can you confirm NDA status for igor.l@speedandfunction.com (this is a speed and f...
[10:23:05] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[10:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:45] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Igor Lidin from Speed & Function - https://phabricator.wikimedia.org/T278327 (jbond)
[10:24:13] (CR) Jcrespo: "Yeah, after comparing to scp, I think your idea is a better one. If the path doesn't exist, it will fail anyway, so it should be "safe"." [software/transferpy] - https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: Palak199)
[10:24:14] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[10:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:25] (PS1) Jbond: admin: add : igor.l from speed and function [puppet] - https://gerrit.wikimedia.org/r/674847 (https://phabricator.wikimedia.org/T278327)
[10:28:26] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[10:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
[10:29:47] (CR) Jbond: [C: -1] "-1 for now while we await approvals" [puppet] - https://gerrit.wikimedia.org/r/674847 (https://phabricator.wikimedia.org/T278327) (owner: Jbond)
[10:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:08] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Igor Lidin from Speed & Function - https://phabricator.wikimedia.org/T278327 (jbond)
[10:31:24] (CR) Volans: [C: -1] "Some comments inline, I did a quick test on netbox-next and those are my
related comments:" (9 comments) [software/netbox] - https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[10:34:51] !log drain ganeti1005
[10:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:32] (PS2) Kosta Harlan: linkrecommendation: Enable proxy API URL and Swagger UI for all releases [deployment-charts] - https://gerrit.wikimedia.org/r/674802
[10:35:50] !log running cleanup on aqs1004-a: nodetool-a cleanup "local_group_default_T_pageviews_per_project_v2" data
[10:35:52] (CR) Kosta Harlan: [C: +2] linkrecommendation: Enable proxy API URL and Swagger UI for all releases [deployment-charts] - https://gerrit.wikimedia.org/r/674802 (owner: Kosta Harlan)
[10:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:34] !log running general nodetool cleanup on aqs1004-a
[10:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:27] (Merged) jenkins-bot: linkrecommendation: Enable proxy API URL and Swagger UI for all releases [deployment-charts] - https://gerrit.wikimedia.org/r/674802 (owner: Kosta Harlan)
[10:40:30] SRE, SRE-Access-Requests: Request for access to mailman3-roots role for Ladsgroup - https://phabricator.wikimedia.org/T278078 (Kormat) Is there anything further that needs to happen on this task?
[10:40:40] (CR) Jbond: [C: +1] Add CAS authentication support (1 comment) [software/netbox] - https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[10:40:42] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host furud.codfw.wmnet
[10:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:50] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[10:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:37] (03CR) 10Muehlenhoff: Add CAS authentication support (031 comment) [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [10:44:46] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host furud.codfw.wmnet [10:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:55] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host flerovium.eqiad.wmnet [10:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:15] (03CR) 10Volans: [C: 04-1] "clarification inline" (031 comment) [software/netbox] - 10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [10:47:28] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Fix the problem with gravatar and mailman3 - https://phabricator.wikimedia.org/T256541 (10Aklapper) Related, mentioning just for cross-linking: {T263161} [10:48:37] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [10:48:37] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [10:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:03] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flerovium.eqiad.wmnet [10:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:32] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [10:54:32] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[10:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:01] (03PS1) 10Jbond: PKI: make pki serveres spare in prep for rebuild [puppet] - 10https://gerrit.wikimedia.org/r/674854 [10:57:50] (03CR) 10Jbond: [C: 03+2] PKI: make pki serveres spare in prep for rebuild [puppet] - 10https://gerrit.wikimedia.org/r/674854 (owner: 10Jbond) [10:58:10] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti1005.eqiad.wmnet [10:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, apergos, and duesen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210325T1100). [11:01:45] PROBLEM - Host kubestagetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [11:02:24] !jouncebot refresh [11:02:24] a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [11:02:37] jouncebot: refresh [11:02:39] I refreshed my knowledge about deployments. [11:02:50] I keep on forgetting about the ! [11:03:56] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1005.eqiad.wmnet [11:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:19] oh heck [11:04:26] this is today? [11:04:57] yup [11:05:42] there is a video call for it, which is also being recorded right now [11:05:48] oh! 
uh, sec [11:05:55] RECOVERY - Host kubestagetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [11:06:15] I'm off today so it's my bad to sign up for something on a holiday [11:06:20] but let me get in the call [11:06:29] (03CR) 10Jbond: [C: 03+1] Create group for root access to snapshot, dumpsdata and labstore1006,7 hosts [puppet] - 10https://gerrit.wikimedia.org/r/672879 (https://phabricator.wikimedia.org/T277629) (owner: 10ArielGlenn) [11:08:22] (03PS1) 10Kosta Harlan: linkrecommendation: Fix Swagger UI URL prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/674856 [11:08:30] So should I schedule my backport for later? [11:08:56] I assume it’s good if we have something to backport during the backport training :) [11:09:05] might just take a bit longer than usual, hope that’s okay [11:10:07] * Urbanecm will have a config patch soon-ish, if possible [11:10:13] !log drain ganeti1006 [11:10:15] but it can wait for later [11:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:22] (03CR) 10Ladsgroup: [C: 03+2] "backport" [skins/Vector] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674382 (owner: 10Phuedx) [11:11:31] I can do the backports yay [11:11:35] hooray [11:12:16] Urbanecm: we could deploy your config patch now as part of the training, if it's ready to go [11:12:33] kostajh: I will have it in a few minutes [11:15:09] (03Abandoned) 10Kosta Harlan: linkrecommendation: Fix Swagger UI URL prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/674856 (owner: 10Kosta Harlan) [11:15:36] (03PS1) 10Urbanecm: tawiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674857 (https://phabricator.wikimedia.org/T278369) [11:15:42] kostajh: ^^it's this one^^ [11:15:46] and it's even Growth related :) [11:16:10] Urbanecm: ty [11:16:35] kostajh: is there a call or something I can join?
Just curious how this works 🙂 [11:17:18] (03PS1) 10Jbond: O:pki::root: move pki-root server to pki-root role [puppet] - 10https://gerrit.wikimedia.org/r/674859 [11:17:22] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pki1001.eqiad.wmnet with reason: REIMAGE [11:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:41] Urbanecm: it's a training for new deployers, I assume it's stuff you already know well by now :) [11:18:21] I can't really tell from https://wikitech.wikimedia.org/wiki/Caching_overview , but I'm wondering if the MediaWiki Action API is behind varnish? [11:18:45] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host dns4001.wikimedia.org [11:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:55] kostajh: I know, I'm just curious :D. But I'm also fine if someone just pings me when it's ready to be tested on mwdebug :) [11:19:03] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:19:31] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki1001.eqiad.wmnet with reason: REIMAGE [11:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:57] awight: yes, if you run `curl -v 'https://www.mediawiki.org/w/api.php?action=query&format=json&smaxage=60'` you can see “x-cache-status: hit-front” after the second request [11:20:19] awight: but note authed requests will bypass cache anyway [11:20:21] but AFAIK MediaWiki only sends cache headers if the request explicitly specified maxage or smaxage in the parameters [11:20:41] (as you can have privileged view access) [11:21:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={bird,haproxy,pdnsrec} site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:22:23] awight: see https://www.mediawiki.org/wiki/API:Caching_data/en for some docs on this [11:22:23] MediaWiki is supposed to skip sending the cache headers if the response contains non-public data [11:22:39] IIUC, the fact that authenticated requests bypass varnish is specific to Wikimedia’s production deployment [11:22:47] as a safeguard against MediaWiki bugs, I assume [11:23:00] Lucas_WMDE: it also uses user's language [11:23:09] (03CR) 10Ladsgroup: [C: 03+2] "backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674857 (https://phabricator.wikimedia.org/T278369) (owner: 10Urbanecm) [11:23:56] (03Merged) 10jenkins-bot: tawiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674857 (https://phabricator.wikimedia.org/T278369) (owner: 10Urbanecm) [11:24:11] (03PS1) 10Ladsgroup: Disable Legacy javascript in fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674861 (https://phabricator.wikimedia.org/T72470) [11:24:29] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns4001.wikimedia.org [11:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:26:03] Urbanecm: it's live in mwdebug1001, please test [11:26:06] Amir1: thanks, looking [11:26:28] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This is great work, but I have a few comments on the implementation, specifically:" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [11:27:12] (03CR) 10Giuseppe Lavagetto: [C: 03+1] P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 (owner: 10Jbond) [11:27:15] Amir1: it works, please sync [11:27:19] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pki2001.codfw.wmnet with reason: REIMAGE [11:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:23] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki2001.codfw.wmnet with reason: REIMAGE [11:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:21] !log ladsgroup@deploy1002 Synchronized wmf-config: [[gerrit:674857|tawiki: Enable Growth features in dark mode]] (T278369) (duration: 01m 30s) [11:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:31] T278369: Deploy Growth features on Tamil Wikipedia - https://phabricator.wikimedia.org/T278369 [11:33:55] !log ladsgroup@deploy1002 Synchronized dblists/: [[gerrit:674857|tawiki: Enable Growth features in dark mode]], Part II (T278369) (duration: 01m 07s) [11:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:10] (03Merged) 10jenkins-bot: Inform anonymous A/B test by tracking time from navigationStart [skins/Vector] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674382 (owner: 10Phuedx) [11:34:10] Lucas_WMDE: Urbanecm: Thanks for educating me, really helpful 
links and explanation! [11:34:14] (03CR) 10Giuseppe Lavagetto: P:netbase: parse the service catalogue and inject the service ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [11:34:20] any time awight :) [11:34:49] Urbanecm: it's done yay [11:34:55] thanks Amir1 :) [11:35:29] yw [11:38:42] PROBLEM - AQS root url on aqs1012 is CRITICAL: connect to address 10.64.32.16 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [11:38:43] phuedx: live in mwdebug1001 [11:39:16] PROBLEM - AQS root url on aqs1014 is CRITICAL: connect to address 10.64.48.62 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [11:39:18] PROBLEM - AQS root url on aqs1015 is CRITICAL: connect to address 10.64.48.63 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [11:39:30] Thanks. Taking a look now [11:39:42] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti1006.eqiad.wmnet [11:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:57] (03PS2) 10Ladsgroup: Disable Legacy javascript in fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674861 (https://phabricator.wikimedia.org/T72470) [11:41:04] Not seeing any errors in the console, Amir1. LGTM [11:41:09] (03CR) 10Volans: "Looks ok to me, few nits inline, nothing major AFAICT. I didn't test it though." 
(039 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [11:41:26] phuedx: okay, syncing [11:41:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:43:34] ACKNOWLEDGEMENT - AQS root url on aqs1012 is CRITICAL: connect to address 10.64.32.16 and port 7232: Connection refused Hnowlan New hosts, not pooled. https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [11:43:34] ACKNOWLEDGEMENT - AQS root url on aqs1013 is CRITICAL: connect to address 10.64.32.136 and port 7232: Connection refused Hnowlan New hosts, not pooled. https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [11:43:34] ACKNOWLEDGEMENT - AQS root url on aqs1014 is CRITICAL: connect to address 10.64.48.62 and port 7232: Connection refused Hnowlan New hosts, not pooled. https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [11:43:35] ACKNOWLEDGEMENT - AQS root url on aqs1015 is CRITICAL: connect to address 10.64.48.63 and port 7232: Connection refused Hnowlan New hosts, not pooled. https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [11:43:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:43:59] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1006.eqiad.wmnet [11:44:03] !log ladsgroup@deploy1002 Synchronized php-1.36.0-wmf.36/skins/Vector/resources: [[gerrit:674382|Inform anonymous A/B test by tracking time from navigationStart (T275807)]] (duration: 01m 09s) [11:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:17] T275807: Create harness for the language switcher A/B test - https://phabricator.wikimedia.org/T275807 [11:44:25] phuedx: sync'ed [11:44:49] Amir1: Thanks! I'll monitor the metric in Graphite/Grafana [11:46:56] !log drain ganeti1007 [11:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:04] (03CR) 10Ladsgroup: [C: 03+2] Disable Legacy javascript in fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674861 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [11:47:48] (03Merged) 10jenkins-bot: Disable Legacy javascript in fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674861 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [11:51:02] (03CR) 10Volans: "LGTM, just a couple of final tweaks and it's ready!" 
(033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [11:52:54] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:674861|Disable Legacy javascript in fawikiquote]] (T72470) (duration: 01m 07s) [11:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:04] T72470: Remove legacy javascript globals - https://phabricator.wikimedia.org/T72470 [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210325T1200) [12:08:33] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti1007.eqiad.wmnet [12:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1007.eqiad.wmnet [12:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:43] !log drain ganeti1008 [12:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:51] (03PS41) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [12:22:23] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [12:24:39] (03PS42) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [12:24:48] (03PS15) 10Jbond: P:base: add ability to manage services file [puppet] - 10https://gerrit.wikimedia.org/r/670918 [12:24:58] (03PS15) 10Jbond: P:netbase: parse the service catalogue and inject the service ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [12:25:12] (03CR) 
10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [12:28:07] (03PS43) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [12:29:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28775/console" [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [12:40:12] (03CR) 10Herron: [C: 03+2] initial grafana::grizzly module and profile [puppet] - 10https://gerrit.wikimedia.org/r/671283 (owner: 10Herron) [12:42:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28778/console" [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [12:45:16] (03CR) 10Palak199: "> In this case, the problem is the unit test is assuming the wrong behaviour, so you just need to modify the test to behave as you expect." [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [12:47:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,pdu_sentry4} site={codfw,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:48:41] (03CR) 10Jbond: "Thanks for the reviews have responded and added more comments to the merge.rb function" (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [12:49:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:52:05] !log aqs1004 nodetool-a cleanup finished [12:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:22] 10SRE, 10Wikimedia-Mailing-lists, 10observability: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (10lmata) Sure thing @legoktm will discuss with team and share notes here [12:57:35] (03PS1) 10Herron: add "cdofw" to typos [puppet] - 10https://gerrit.wikimedia.org/r/674863 [13:00:04] hashar and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210325T1300). [13:05:35] 10SRE, 10Wikimedia-Mailing-lists: lists-next: no clickable link in “confirm” email - https://phabricator.wikimedia.org/T278432 (10LucasWerkmeister) So the more useful version of list subscription confirmation emails just doesn’t exist anymore? Why? [13:08:16] (03PS1) 10Arturo Borrero Gonzalez: toolforge: services: drop code to support cold-standby approach [puppet] - 10https://gerrit.wikimedia.org/r/674869 (https://phabricator.wikimedia.org/T278354) [13:08:18] herron: hi, deployment-mwlog01 puppet was disabled by you, as the comment indicates it was for testing some config changes, does it still have to be disabled or can I re-enable? [13:09:09] hey Majavah sure that's good to be re-enabled if needed [13:09:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: services: drop code to support cold-standby approach [puppet] - 10https://gerrit.wikimedia.org/r/674869 (https://phabricator.wikimedia.org/T278354) (owner: 10Arturo Borrero Gonzalez) [13:11:35] herron: thanks, now re-enabled, but udp2log-mw.service is failing to start [13:11:49] I'll have a look [13:12:06] thanks! 
[13:14:14] Majavah: looks ok now, and seeing logs flowing at /srv/mw-log [13:15:33] (03CR) 10Jbond: [V: 03+1] "thanks see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [13:16:56] (03PS1) 10Arturo Borrero Gonzalez: toolforge: services: don't install grid stuff [puppet] - 10https://gerrit.wikimedia.org/r/674872 (https://phabricator.wikimedia.org/T278354) [13:17:09] (03CR) 10Jbond: netbase: add new module to manage /etc/services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [13:17:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: services: don't install grid stuff [puppet] - 10https://gerrit.wikimedia.org/r/674872 (https://phabricator.wikimedia.org/T278354) (owner: 10Arturo Borrero Gonzalez) [13:18:34] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti1008.eqiad.wmnet [13:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:20:21] PROBLEM - Host kubetcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:23:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:24:06] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1008.eqiad.wmnet [13:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:15] 10SRE, 10Wikimedia-Mailing-lists: lists-next: no clickable link in “confirm” email - https://phabricator.wikimedia.org/T278432 (10Ladsgroup) Spam redaction? 
Mass subscription redaction? [13:25:35] RECOVERY - Host kubetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [13:26:01] (03PS1) 10Muehlenhoff: Remove access for phamhi [puppet] - 10https://gerrit.wikimedia.org/r/674874 (https://phabricator.wikimedia.org/T278435) [13:26:07] (03PS19) 10Herron: mwlog: "tee" udp2logs received to all mwlog hosts [puppet] - 10https://gerrit.wikimedia.org/r/673594 (https://phabricator.wikimedia.org/T224565) [13:27:05] !log reduce webperf1001/webperf2001 to 4G RAM (xhgui has been split off to separate VMs) [13:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:32] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 1:00:00 on webperf1001.eqiad.wmnet with reason: adapt RAM [13:27:33] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on webperf1001.eqiad.wmnet with reason: adapt RAM [13:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:32] (03PS2) 10Gilles: Add edge cache hostname to Server-Timing header [puppet] - 10https://gerrit.wikimedia.org/r/673295 (https://phabricator.wikimedia.org/T277769) [13:34:26] (03CR) 10Herron: "Thanks! 
comments addressed in the current PS, and here's an updated PCC https://puppet-compiler.wmflabs.org/compiler1003/28779/mwlog1001.e" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673594 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [13:34:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:34:31] (03PS2) 10Jbond: O:pki::root: move pki-root server to pki-root role [puppet] - 10https://gerrit.wikimedia.org/r/674859 [13:36:35] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: db2135 crashed - https://phabricator.wikimedia.org/T278408 (10LSobanski) p:05Triage→03Medium [13:36:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:43:24] (03PS3) 10Gilles: Add edge cache hostname to Server-Timing header [puppet] - 10https://gerrit.wikimedia.org/r/673295 (https://phabricator.wikimedia.org/T277769) [13:43:51] (03CR) 10jerkins-bot: [V: 04-1] Add edge cache hostname to Server-Timing header [puppet] - 10https://gerrit.wikimedia.org/r/673295 (https://phabricator.wikimedia.org/T277769) (owner: 10Gilles) [13:45:29] !log drain ganeti1009 [13:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:33] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for phamhi [puppet] - 10https://gerrit.wikimedia.org/r/674874 (https://phabricator.wikimedia.org/T278435) (owner: 10Muehlenhoff) [13:47:07] (03PS1) 10Elukey: Fix two typos in admin_ng's README [deployment-charts] - 10https://gerrit.wikimedia.org/r/674878 [13:47:54] (03PS4) 10Gilles: Add edge cache hostname to Server-Timing header [puppet] - 10https://gerrit.wikimedia.org/r/673295 
(https://phabricator.wikimedia.org/T277769) [13:49:38] (03CR) 10Elukey: [C: 03+2] Fix two typos in admin_ng's README [deployment-charts] - 10https://gerrit.wikimedia.org/r/674878 (owner: 10Elukey) [13:50:48] (03CR) 10Gilles: Add edge cache hostname to Server-Timing header (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673295 (https://phabricator.wikimedia.org/T277769) (owner: 10Gilles) [14:00:37] 10SRE, 10netops: Higher latency on Lumen eqiad/esams link - https://phabricator.wikimedia.org/T277654 (10ayounsi) Opened another ticket 20938358 [14:02:00] (03CR) 10Filippo Giunchedi: [C: 03+1] mwlog: "tee" udp2logs received to all mwlog hosts [puppet] - 10https://gerrit.wikimedia.org/r/673594 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [14:03:53] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: db2135 crashed - https://phabricator.wikimedia.org/T278408 (10Kormat) This looks like https://jira.mariadb.org/browse/MDEV-23019, which was fixed in 10.4.14. The server was running 10.4.13 when the crash occurred. The server is now running 10.4.18. 
[14:06:12] (03PS1) 10Muehlenhoff: Remove Hieu from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/674880 (https://phabricator.wikimedia.org/T278435) [14:09:22] (03PS1) 10Matthias Mullie: Enable media change tags on wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674882 (https://phabricator.wikimedia.org/T266067) [14:13:18] !log update phabricator again (last night's update undid a hotfix that is now fixed properly) [14:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:37] (03CR) 10Muehlenhoff: [C: 03+2] Remove Hieu from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/674880 (https://phabricator.wikimedia.org/T278435) (owner: 10Muehlenhoff) [14:19:50] (03CR) 10Herron: [C: 03+2] mwlog: "tee" udp2logs received to all mwlog hosts [puppet] - 10https://gerrit.wikimedia.org/r/673594 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [14:28:55] 10SRE, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10Volans) >>! In T265904#6937430, @ayounsi wrote: > @volans, @crusnov, what do you think about changing the `status` of those IPs from `active` to `SLAAC`? I think those should not be in Net... [14:34:52] 10SRE, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10ayounsi) >>! In T265904#6945448, @Volans wrote: >>>! In T265904#6937430, @ayounsi wrote: >> @volans, @crusnov, what do you think about changing the `status` of those IPs from `active` to `S... 
[14:41:47] (03CR) 10Jbond: [C: 03+2] O:pki::root: move pki-root server to pki-root role [puppet] - 10https://gerrit.wikimedia.org/r/674859 (owner: 10Jbond) [14:45:00] !log herron@cumin1001 START - Cookbook sre.dns.netbox [14:45:05] (03PS1) 10Jbond: site.pp: Use correct role name for root CA [puppet] - 10https://gerrit.wikimedia.org/r/674896 [14:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:39] (03CR) 10Jbond: [C: 03+2] site.pp: Use correct role name for root CA [puppet] - 10https://gerrit.wikimedia.org/r/674896 (owner: 10Jbond) [14:48:31] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:30] (03PS1) 10Jbond: hiera: convert pki.*db_pass values to Sensetive [puppet] - 10https://gerrit.wikimedia.org/r/674902 [14:55:08] (03CR) 10Jbond: [C: 03+2] hiera: convert pki.*db_pass values to Sensetive [puppet] - 10https://gerrit.wikimedia.org/r/674902 (owner: 10Jbond) [14:55:55] 10SRE, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10crusnov) >>! In T265904#6945448, @Volans wrote: > We should fix the script IMHO and once we're sure it would not re-add any SLAAC IP just delete all of them from Netbox. Thoughts? This sou... [14:56:13] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/673295 (https://phabricator.wikimedia.org/T277769) (owner: 10Gilles) [14:56:28] (03PS9) 10Ayounsi: Add Capirca definitions exporter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) [14:56:37] (03CR) 10Ayounsi: "Thanks!" (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [15:00:32] Amir1: duesen -- just watch the reply of the first deployment training -- well done! 
[15:01:26] (03PS1) 10Jbond: pki-int: add ROOT ocsp and debmonitor intermidiate certificate [puppet] - 10https://gerrit.wikimedia.org/r/674905 [15:01:53] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Rakefile: allow running on a subset of charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/673994 (owner: 10Giuseppe Lavagetto) [15:02:02] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: allow running on a subset of charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/673994 (owner: 10Giuseppe Lavagetto) [15:02:04] (03PS2) 10Giuseppe Lavagetto: Rakefile: allow running on a subset of charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/673994 [15:02:40] (03CR) 10Cwhite: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:08:01] @Tchanders I think it turned out to be pretty scary overall :) Which replies where do you mean? [15:10:44] (03PS1) 10Elukey: hadoop: increase the Yarn's defaults for vmem vs pmem ratio [puppet] - 10https://gerrit.wikimedia.org/r/674910 (https://phabricator.wikimedia.org/T278441) [15:14:01] (03CR) 10Elukey: [C: 03+2] hadoop: increase the Yarn's defaults for vmem vs pmem ratio [puppet] - 10https://gerrit.wikimedia.org/r/674910 (https://phabricator.wikimedia.org/T278441) (owner: 10Elukey) [15:17:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:19:32] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [15:19:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:20:46] duesen: sorry, typo -- "replay" :) [15:21:21] (03CR) 10Volans: pki-int: add ROOT ocsp and debmonitor intermidiate certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674905 (owner: 10Jbond) [15:22:50] (03PS2) 10Jbond: pki-int: add ROOT ocsp and debmonitor intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/674905 [15:23:11] (03CR) 10Jbond: pki-int: add ROOT ocsp and debmonitor intermediate certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674905 (owner: 10Jbond) [15:24:09] @thcipriani ah, right! I think if I was to watch that as part of my onboarding, i'd never want to do deployments... [15:24:25] :) [15:24:52] (03PS1) 10Jbond: O:pki::multirootca: add multirootca role [puppet] - 10https://gerrit.wikimedia.org/r/674914 [15:24:54] (03PS1) 10Jbond: pki1001: move host into multirootca role [puppet] - 10https://gerrit.wikimedia.org/r/674915 [15:24:56] (03PS1) 10Jbond: pki2001: move to multirootca role [puppet] - 10https://gerrit.wikimedia.org/r/674916 [15:25:18] duesen: now that we've been through one (there's another this afternoon) we'll get feedback from everyone who participated and see what can make it less terrifying (and there is work in process for "make the actual process less terrifying") [15:25:58] (03CR) 10jerkins-bot: [V: 04-1] O:pki::multirootca: add multirootca role [puppet] - 10https://gerrit.wikimedia.org/r/674914 (owner: 10Jbond) [15:26:35] (03PS2) 10Jbond: O:pki::multirootca: add multirootca role [puppet] - 10https://gerrit.wikimedia.org/r/674914 [15:27:05] (03Abandoned) 10Jbond: pki-int: add ROOT ocsp and debmonitor intermediate certificate [puppet] - 10https://gerrit.wikimedia.org/r/674905 (owner: 10Jbond) [15:27:37] (03CR) 10jerkins-bot: [V: 04-1] O:pki::multirootca: add multirootca role [puppet] - 10https://gerrit.wikimedia.org/r/674914 (owner: 
10Jbond) [15:31:14] (03PS3) 10Jbond: O:pki::multirootca: add multirootca role [puppet] - 10https://gerrit.wikimedia.org/r/674914 [15:32:18] (03PS2) 10Jbond: pki1001: move host into multirootca role [puppet] - 10https://gerrit.wikimedia.org/r/674915 [15:32:27] (03PS2) 10Jbond: pki2001: move to multirootca role [puppet] - 10https://gerrit.wikimedia.org/r/674916 [15:32:53] (03CR) 10jerkins-bot: [V: 04-1] O:pki::multirootca: add multirootca role [puppet] - 10https://gerrit.wikimedia.org/r/674914 (owner: 10Jbond) [15:33:00] (03PS1) 10Ema: varnish: add bot_posts_blocked_nets to traffic cloud project [puppet] - 10https://gerrit.wikimedia.org/r/674920 (https://phabricator.wikimedia.org/T272330) [15:34:36] (03PS4) 10Jbond: O:pki::multirootca: add multirootca role [puppet] - 10https://gerrit.wikimedia.org/r/674914 [15:35:17] thcipriani: if you want the process less terrifying, I have one plea https://phabricator.wikimedia.org/T223602 [15:35:22] 🥺 [15:35:49] (03CR) 10Ema: [C: 03+2] varnish: add bot_posts_blocked_nets to traffic cloud project [puppet] - 10https://gerrit.wikimedia.org/r/674920 (https://phabricator.wikimedia.org/T272330) (owner: 10Ema) [15:36:29] (03PS8) 10Volans: netbox: add NetboxServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) [15:37:05] duesen: I'm sorry, I didn't want to sound terrifying [15:37:38] just be careful, one wrong command, and everything goes kaboom [15:40:53] (03PS1) 10Elukey: hadoop: disable vmem vs pmem check in Yarn defaults [puppet] - 10https://gerrit.wikimedia.org/r/674921 (https://phabricator.wikimedia.org/T278441) [15:41:54] (03PS1) 10Ema: varnish: add phabricator_abusers to traffic cloud project [puppet] - 10https://gerrit.wikimedia.org/r/674922 (https://phabricator.wikimedia.org/T270618) [15:41:57] (03CR) 10jerkins-bot: [V: 04-1] hadoop: disable vmem vs pmem check in Yarn defaults [puppet] - 10https://gerrit.wikimedia.org/r/674921 
(https://phabricator.wikimedia.org/T278441) (owner: 10Elukey) [15:43:05] (03PS2) 10Elukey: hadoop: disable vmem vs pmem check in Yarn defaults [puppet] - 10https://gerrit.wikimedia.org/r/674921 (https://phabricator.wikimedia.org/T278441) [15:44:28] (03CR) 10Elukey: [C: 03+2] hadoop: disable vmem vs pmem check in Yarn defaults [puppet] - 10https://gerrit.wikimedia.org/r/674921 (https://phabricator.wikimedia.org/T278441) (owner: 10Elukey) [15:45:32] !log installing openssl updates on buster [15:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:08] (03CR) 10Ema: [C: 03+2] varnish: add phabricator_abusers to traffic cloud project [puppet] - 10https://gerrit.wikimedia.org/r/674922 (https://phabricator.wikimedia.org/T270618) (owner: 10Ema) [15:50:08] (03PS1) 10Ema: varnish: add text_abuse_nets to traffic cloud project [puppet] - 10https://gerrit.wikimedia.org/r/674923 (https://phabricator.wikimedia.org/T193762) [15:52:32] (03CR) 10Ema: [C: 03+2] varnish: add text_abuse_nets to traffic cloud project [puppet] - 10https://gerrit.wikimedia.org/r/674923 (https://phabricator.wikimedia.org/T193762) (owner: 10Ema) [15:55:42] (03PS1) 10Hashar: Disallow negative or decimal values in pages tag [extensions/ProofreadPage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674834 (https://phabricator.wikimedia.org/T278400) [15:55:55] (03CR) 10Hashar: [C: 03+2] Disallow negative or decimal values in pages tag [extensions/ProofreadPage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674834 (https://phabricator.wikimedia.org/T278400) (owner: 10Hashar) [16:00:04] jbond42 and cdanis: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210325T1600). 
[16:00:57] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10LarsWirzenius) I've not managed to do anything... [16:01:27] !log A:cp rolling ats-{tls,backend}-restart for openssl upgrades -- https://www.openssl.org/news/secadv/20210325.txt [16:01:33] (03CR) 10jerkins-bot: [V: 04-1] Disallow negative or decimal values in pages tag [extensions/ProofreadPage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674834 (https://phabricator.wikimedia.org/T278400) (owner: 10Hashar) [16:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:53] (03Merged) 10jenkins-bot: Disallow negative or decimal values in pages tag [extensions/ProofreadPage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674834 (https://phabricator.wikimedia.org/T278400) (owner: 10Hashar) [16:02:26] !log restart idp service [16:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:44] !log restart apache service on gerrit [16:03:49] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) Please see details on T211974 where @Ha... [16:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:05] 10SRE, 10SRE-Access-Requests: Request for access to mailman3-roots role for Ladsgroup - https://phabricator.wikimedia.org/T278078 (10Dzahn) 05Open→03Resolved a:03Dzahn Until yesterday it was supposed to stay open because the role was not applied yet on any machine. But now it is. Closing this as resolve... 
[16:06:11] (03CR) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [16:07:13] Amir1: that task is a good one, the other one I have in mind is doing restarts as part of deploys which would make deploys "atomic" so the section on sync-order wouldn't have to exist. I don't think that you did anything to make the process seem more terrifying than it actually is, I think you made it seem approachable by reiterating that there is documentation that you can step [16:07:15] through in a more-or-less rote way. I also enjoyed that you didn't hesitate when asked what scap stands for: "scatter crap around production" :D [16:08:02] !log restart Apache on matomo/piwik [16:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:50] 10SRE, 10Wikimedia-Mailing-lists: lists-next: not receiving own emails - https://phabricator.wikimedia.org/T278434 (10Ladsgroup) p:05Triage→03Lowest That seems like an almost useless feature. It's available in archives and your own sent mails, improves performance as well. I never had this, I think you cou... [16:09:42] pushing a backport [16:10:52] !log restarting apache on dbmonitor [16:10:58] thcipriani: it was on top of scap age in wikitech for ages IIRC [16:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:06] *page [16:11:33] indeed [16:12:15] 10SRE, 10Wikimedia-Mailing-lists: lists-next: not receiving own emails - https://phabricator.wikimedia.org/T278434 (10Majavah) There's a per-user setting named "Receive own postings", have you turned that on? 
[16:12:17] !lof restarting nginx on apt* [16:12:21] !log restarting nginx on apt* [16:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:58] !log restart routinator on rpki* [16:13:05] !log hashar@deploy1002 Synchronized php-1.36.0-wmf.36/extensions/ProofreadPage: Disallow negative or decimal values in pages tag - T278400 (duration: 01m 32s) [16:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:16] T278400: PHP Warning: array_key_exists(): The first argument should be either a string or an integer - https://phabricator.wikimedia.org/T278400 [16:14:46] ip addr [16:15:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:16:16] XioNoX: anychance you are around to take a look at that ^^ i just restarted it so it may be temporary [16:16:37] (03PS3) 10H.krishna123: Add logger functionality to recover-dump, add logger statements, added unit test to test initializing logging [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) [16:18:31] (03PS8) 10Jeena Huneidi: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) [16:18:39] !log restart apache on netbox [16:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:20:12] (03PS5) 10Ema: Add edge cache hostname to Server-Timing header [puppet] - 10https://gerrit.wikimedia.org/r/673295 (https://phabricator.wikimedia.org/T277769) (owner: 10Gilles) [16:20:16] XioNoX: either thanks or dont worry it recovered :) [16:20:29] (03CR) 10Jeena Huneidi: Rsync private mediawiki files to releases server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [16:20:42] !log restart apache on lists1002 [16:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:19] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/673295 (https://phabricator.wikimedia.org/T277769) (owner: 10Gilles) [16:22:32] !log restart slapd on ldap-corp [16:22:34] (03PS9) 10Jeena Huneidi: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) [16:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:07] (03CR) 10H.krishna123: "Yes it is clear now, sorry, I was also confused as well, too much haste to reply and I should have read that properly beforehand." 
[software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [16:23:51] (03CR) 10H.krishna123: "recheck" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [16:24:48] !log restart slapd on ldap-replica [16:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:30] (03CR) 10H.krishna123: "> Patch Set 3:" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [16:25:52] (03CR) 10Giuseppe Lavagetto: Add liveness/readiness probe script to php-fpm images (036 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/672767 (https://phabricator.wikimedia.org/T276908) (owner: 10Giuseppe Lavagetto) [16:26:46] (03CR) 10Jcrespo: "recheck" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [16:27:12] (03PS34) 10CRusnov: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) [16:27:14] (03CR) 10CRusnov: dhcp: Introduce automation proxies for management networks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [16:27:14] !log restarting dnsdist/rdns-recursor on malmok [16:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:21] (03CR) 10CRusnov: "Nits should be resolved and this is cleanly against production again." 
[puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [16:28:54] (03PS14) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) [16:28:58] (03CR) 10Ayounsi: [C: 03+2] Add Capirca definitions exporter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [16:30:22] someone who is handling openssl security upgrades: I installed openssl security upgrades on deployment-cache-* hosts and restarted trafficserver-tls.service, anything else on deployment-prep that might need them installed? [16:31:32] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to gitlab1001 / gitlab1002 for Igor Lidin from Speed & Function - https://phabricator.wikimedia.org/T278327 (10KFrancis) @jbond Hi John, As Igor is an employee of Speed & Function, he is covered under Speed & Function's agreement in Coupa.... 
[16:31:37] (03CR) 10Ayounsi: [C: 03+2] Add Capirca definitions exporter (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [16:32:08] (03PS1) 10Andrew Bogott: Update fqdn of the grid master to use eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/674930 [16:32:13] !log restarting apache on an-tool1007/turnilo [16:32:14] (03PS2) 10Giuseppe Lavagetto: Add liveness/readiness probe script to php-fpm images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/672767 (https://phabricator.wikimedia.org/T276908) [16:32:16] (03PS2) 10Giuseppe Lavagetto: mcrouter: add healthz script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/673845 [16:32:18] (03PS3) 10Giuseppe Lavagetto: [WiP] test harness for php-fpm images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/672768 [16:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:40] Majavah: thanks [16:33:00] (03CR) 10Andrew Bogott: [C: 03+2] Update fqdn of the grid master to use eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/674930 (owner: 10Andrew Bogott) [16:35:35] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-upgrade [16:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:38:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:38:54] (03CR) 10Jcrespo: "Small reminder that you may want to start working on a first draft of a GSoC proposal as per https://www.mediawiki.org/wiki/Google_Summer_" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [16:39:04] !log Restarted Apache 2 on contint2001 / contint1001 [16:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:14] Majavah: primarily the caches, yes. Thanks! Cloud VPS runs unattended-upgrades, so the updates will trickle in over night as well [16:43:41] moritzm: does it? for example deployment-mx02 currently has 82 packages with updates pending :/ [16:44:52] (03PS9) 10Volans: netbox: add NetboxServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) [16:44:55] 10SRE, 10netops: Higher latency on Lumen eqiad/esams link - https://phabricator.wikimedia.org/T277654 (10ayounsi) > The latency increased on the circuit due to a route change on your circuit. The route had to be changed to accommodate an equipment decommissioning. The current length of the path has increased d... 
[16:46:45] (03PS10) 10Volans: netbox: add NetboxServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) [16:47:49] 10SRE, 10Analytics: wmf-auto-restart.py + lsof + /mnt/hdfs may need to be tuned - https://phabricator.wikimedia.org/T278371 (10Ottomata) a:03elukey [16:50:36] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add liveness/readiness probe script to php-fpm images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/672767 (https://phabricator.wikimedia.org/T276908) (owner: 10Giuseppe Lavagetto) [16:54:38] (03CR) 10Ahmon Dancy: [C: 03+1] Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [16:55:15] (03PS35) 10CRusnov: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) [16:58:19] (03CR) 10CRusnov: [C: 03+1] "Looks good to me." [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [17:00:04] chrisalbon and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210325T1700). [17:00:17] (03CR) 10Volans: [C: 03+1] "LGTM, make sure to double check with PCC the latest PS to be on the safe side." 
[puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [17:00:55] (03CR) 10Volans: [C: 03+2] netbox: add NetboxServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [17:03:26] (03PS1) 10Andrew Bogott: profile::wmcs::nfsclient: include nfs-common package [puppet] - 10https://gerrit.wikimedia.org/r/674933 [17:07:40] (03Merged) 10jenkins-bot: netbox: add NetboxServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [17:08:51] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [17:08:51] (03PS3) 10Volans: icinga: add new IcingaHosts class [software/spicerack] - 10https://gerrit.wikimedia.org/r/674549 (https://phabricator.wikimedia.org/T277740) [17:08:53] (03PS1) 10Volans: icinga: simplify command_file detection [software/spicerack] - 10https://gerrit.wikimedia.org/r/674934 [17:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:25] (03PS2) 10Andrew Bogott: profile::wmcs::nfsclient: include nfs-common package [puppet] - 10https://gerrit.wikimedia.org/r/674933 [17:09:53] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::nfsclient: include nfs-common package [puppet] - 10https://gerrit.wikimedia.org/r/674933 (owner: 10Andrew Bogott) [17:11:10] (03PS3) 10Andrew Bogott: profile::wmcs::nfsclient: include nfs-common package [puppet] - 10https://gerrit.wikimedia.org/r/674933 [17:14:12] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . 
[17:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:53] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/674936 [17:16:32] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [17:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log