[00:05:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:48] (03PS1) 10Dzahn: mediawiki::maintenance: install modsecurity-crs [puppet] - 10https://gerrit.wikimedia.org/r/618161 (https://phabricator.wikimedia.org/T255629) [00:43:45] (03PS2) 10Dzahn: mediawiki::maintenance: install modsecurity-crs [puppet] - 10https://gerrit.wikimedia.org/r/618161 (https://phabricator.wikimedia.org/T255629) [00:47:52] (03PS3) 10Dzahn: mediawiki::maintenance: install modsecurity-crs [puppet] - 10https://gerrit.wikimedia.org/r/618161 (https://phabricator.wikimedia.org/T255629) [01:15:57] (03CR) 10Dzahn: "installed the package on mwmaint* manually to not leave these in an unstable state. this fixes things and resolves the original ticket. mw" [puppet] - 10https://gerrit.wikimedia.org/r/606218 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn) [01:17:01] (03CR) 10Dzahn: "installed the package on mwmaint* manually to not leave these in an unstable state. this fixes things and resolves the original ticket. mw" [puppet] - 10https://gerrit.wikimedia.org/r/618161 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn) [01:17:14] (03CR) 10Dzahn: "needed follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/618161" [puppet] - 10https://gerrit.wikimedia.org/r/607848 (https://phabricator.wikimedia.org/T255629) (owner: 10Dzahn) [01:45:37] 10Operations, 10Mail, 10OTRS, 10Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (10Dzahn) @DannyS712 Thanks for pointing that out. I made [[ https://www.wikidata.org/w/index.php?title=Wikidata%3ALiving_people&type... 
[01:47:49] (03PS1) 10Tim Starling: MW firejail: blacklist /run and conf cache [puppet] - 10https://gerrit.wikimedia.org/r/618163 (https://phabricator.wikimedia.org/T257090) [01:56:50] (03CR) 10C. Scott Ananian: [C: 03+1] "LGTM, but we'll have to get it SWATted." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618155 (owner: 10Arlolra) [01:56:56] (03PS6) 10C. Scott Ananian: Alternate configuration mechanism for Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) [02:05:29] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.3 [core] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618164 [02:05:38] RECOVERY - dump of es5 in codfw on icinga1001 is OK: Last dump for es5 at codfw (es2025.codfw.wmnet) taken on 2020-08-04 00:00:01 (514 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [02:07:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:13:50] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:22:08] (03CR) 10Tim Starling: [C: 03+2] "I tested it on mw1286 to confirm that I've got the syntax right. I confirmed that the glob is doing what it's meant to do." [puppet] - 10https://gerrit.wikimedia.org/r/618163 (https://phabricator.wikimedia.org/T257090) (owner: 10Tim Starling) [03:18:25] (03PS1) 10C. 
Scott Ananian: Bump wikimedia/parsoid to v0.13.0-a4 [vendor] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618038 (https://phabricator.wikimedia.org/T251422) [03:38:11] (03CR) 10Subramanya Sastry: "Looks like we missed the branch cut for wmf.3." [vendor] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618038 (https://phabricator.wikimedia.org/T251422) (owner: 10C. Scott Ananian) [03:42:22] (03CR) 10Subramanya Sastry: [C: 03+1] Bump wikimedia/parsoid to v0.13.0-a4 [vendor] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618038 (https://phabricator.wikimedia.org/T251422) (owner: 10C. Scott Ananian) [03:48:21] (03CR) 10C. Scott Ananian: [C: 03+2] "Applying subbu's C+2, he apparently doesn't have the proper permissions?" [vendor] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618038 (https://phabricator.wikimedia.org/T251422) (owner: 10C. Scott Ananian) [03:48:54] (03CR) 10Legoktm: [C: 03+1] "Belatedly" [puppet] - 10https://gerrit.wikimedia.org/r/618163 (https://phabricator.wikimedia.org/T257090) (owner: 10Tim Starling) [03:53:37] !log added subbu to wmf-deployment Gerrit group [03:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:56:06] !log added Arlo to wmf-deployment Gerrit group [03:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:11:03] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to v0.13.0-a4 [vendor] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618038 (https://phabricator.wikimedia.org/T251422) (owner: 10C. 
Scott Ananian) [04:47:33] (03PS2) 10DannyS712: Branch commit for wmf/1.36.0-wmf.3 [core] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618164 (https://phabricator.wikimedia.org/T257971) (owner: 10TrainBranchBot) [04:52:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:54:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:55:02] (03CR) 10Marostegui: [C: 03+1] Point muswiki and mhwiktionary to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618090 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [04:55:20] RECOVERY - dump of es5 in eqiad on icinga1001 is OK: Last dump for es5 at eqiad (es1025.eqiad.wmnet) taken on 2020-08-04 00:00:01 (514 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [04:58:11] (03PS2) 10Marostegui: mariadb: Promote db1107 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/617997 (https://phabricator.wikimedia.org/T257540) [05:01:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1089 into API', diff saved to https://phabricator.wikimedia.org/P12145 and previous config saved to /var/cache/conftool/dbconfig/20200804-050150-marostegui.json [05:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:23] !log Reboot db1107 to pick up the last kernel [05:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:12] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:08:16] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:18:22] (03PS1) 10Marostegui: mariadb: Reimage db1119 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/618172 (https://phabricator.wikimedia.org/T250666) [05:18:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 for reimage', diff saved to https://phabricator.wikimedia.org/P12146 and previous config saved to /var/cache/conftool/dbconfig/20200804-051843-marostegui.json [05:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:05] (03CR) 10Marostegui: [C: 03+2] mariadb: Reimage db1119 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/618172 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [05:35:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [05:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:46] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:56:36] (03PS1) 10Tim Starling: Re-enable LilyPond/Score in safe mode (3rd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618039 [05:58:06] (03PS2) 10Tim Starling: Re-enable LilyPond/Score in safe mode (3rd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618039 (https://phabricator.wikimedia.org/T257091) [05:58:15] (03PS3) 10Tim Starling: Re-enable LilyPond/Score in safe mode (3rd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618039 
(https://phabricator.wikimedia.org/T257091) [06:10:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3317 for MCR', diff saved to https://phabricator.wikimedia.org/P12147 and previous config saved to /var/cache/conftool/dbconfig/20200804-061003-marostegui.json [06:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:26] (03CR) 10Chad: Revoke all remaining group memberships, etc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617749 (owner: 10Chad) [06:10:57] (03PS4) 10Chad: Revoke all remaining group memberships, etc [puppet] - 10https://gerrit.wikimedia.org/r/617749 [06:11:01] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Re-enable LilyPond/Score in safe mode (3rd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618039 (https://phabricator.wikimedia.org/T257091) (owner: 10Tim Starling) [06:11:03] (03CR) 10Tim Starling: [C: 03+2] Re-enable LilyPond/Score in safe mode (3rd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618039 (https://phabricator.wikimedia.org/T257091) (owner: 10Tim Starling) [06:11:48] (03Merged) 10jenkins-bot: Re-enable LilyPond/Score in safe mode (3rd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618039 (https://phabricator.wikimedia.org/T257091) (owner: 10Tim Starling) [06:12:05] (03PS5) 10Chad: Revoke all remaining group memberships, etc [puppet] - 10https://gerrit.wikimedia.org/r/617749 [06:12:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P12148 and previous config saved to /var/cache/conftool/dbconfig/20200804-061209-marostegui.json [06:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'More weight to db1089 on main traffic', diff saved to https://phabricator.wikimedia.org/P12149 and previous config saved to /var/cache/conftool/dbconfig/20200804-061255-marostegui.json 
[06:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:45] !log tstarling@deploy1001 Synchronized wmf-config/CommonSettings.php: re-enabling lilypond execution in safe mode 3rd attempt (duration: 00m 58s) [06:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:51] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [06:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P12150 and previous config saved to /var/cache/conftool/dbconfig/20200804-062256-marostegui.json [06:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Restore original weight to db1089 on main traffic', diff saved to https://phabricator.wikimedia.org/P12151 and previous config saved to /var/cache/conftool/dbconfig/20200804-062358-marostegui.json [06:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:58] <_joe_> !log restarting docker daemon on kubestage1002, seems like a case of https://github.com/moby/moby/issues/29635 [06:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:11] (03PS1) 10Marostegui: db1119: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/618177 [06:29:51] (03CR) 10Marostegui: [C: 03+2] db1119: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/618177 (owner: 10Marostegui) [06:30:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P12152 and previous config saved to /var/cache/conftool/dbconfig/20200804-063026-marostegui.json [06:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:10] 
(03PS1) 10Elukey: Fix Spark config file path for Daily EL2Druid Analyitcs jobs [puppet] - 10https://gerrit.wikimedia.org/r/618227 (https://phabricator.wikimedia.org/T254493) [06:37:47] (03CR) 10Elukey: [C: 03+2] Fix Spark config file path for Daily EL2Druid Analyitcs jobs [puppet] - 10https://gerrit.wikimedia.org/r/618227 (https://phabricator.wikimedia.org/T254493) (owner: 10Elukey) [06:42:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1119', diff saved to https://phabricator.wikimedia.org/P12153 and previous config saved to /var/cache/conftool/dbconfig/20200804-064223-marostegui.json [06:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:34] (03PS1) 10Elukey: Fix daily config file path for Spark EL2Druid Analytics jobs [puppet] - 10https://gerrit.wikimedia.org/r/618229 (https://phabricator.wikimedia.org/T254493) [06:44:45] (03CR) 10Muehlenhoff: [C: 03+2] Also exclude /mnt/hdfs on analytics_test_cluster::coordinator from debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/618018 (owner: 10Muehlenhoff) [06:48:22] (03CR) 10Elukey: [C: 03+2] Fix daily config file path for Spark EL2Druid Analytics jobs [puppet] - 10https://gerrit.wikimedia.org/r/618229 (https://phabricator.wikimedia.org/T254493) (owner: 10Elukey) [06:52:05] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, I'll merge in a bit." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617749 (owner: 10Chad) [06:53:45] (03PS6) 10Muehlenhoff: Revoke all remaining group memberships, etc [puppet] - 10https://gerrit.wikimedia.org/r/617749 (owner: 10Chad) [06:55:32] (03CR) 10Muehlenhoff: [C: 03+2] Revoke all remaining group memberships, etc [puppet] - 10https://gerrit.wikimedia.org/r/617749 (owner: 10Chad) [07:00:32] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 240, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:44] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:57] (03CR) 10Giuseppe Lavagetto: [C: 03+1] New upstream version 2.16.9 [debs/helm] - 10https://gerrit.wikimedia.org/r/616065 (https://phabricator.wikimedia.org/T258773) (owner: 10JMeybohm) [07:06:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:13:16] (03CR) 10Hashar: "recheck CI now uses buster-backport" [debs/helm] - 10https://gerrit.wikimedia.org/r/616065 (https://phabricator.wikimedia.org/T258773) (owner: 10JMeybohm) [07:17:49] (03CR) 10JMeybohm: [C: 03+2] New upstream version 2.16.9 [debs/helm] - 10https://gerrit.wikimedia.org/r/616065 (https://phabricator.wikimedia.org/T258773) (owner: 10JMeybohm) [07:20:57] (03PS3) 10JMeybohm: eventgate: Update repository URL in requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/617695 (https://phabricator.wikimedia.org/T253843) [07:21:35] (03Merged) 10jenkins-bot: New upstream version 2.16.9 [debs/helm] - 10https://gerrit.wikimedia.org/r/616065 (https://phabricator.wikimedia.org/T258773) (owner: 10JMeybohm) [07:25:06] PROBLEM - MediaWiki exceptions and 
fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:25:27] !log installing rails security updates [07:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:47] !log Start topology changes on m2 - T257540 [07:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:49] T257540: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540 [07:28:10] !log remove nonstop-bridging from asw2-esams - T191667 [07:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:12] T191667: Juniper HA audit - https://phabricator.wikimedia.org/T191667 [07:28:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:29:11] !log remove nonstop-bridging from eqiad asw2 switches - T191667 [07:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:24] XioNoX: starting the procedure for the druid upgrade :) [07:29:35] elukey: ok! 
[07:29:35] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1107 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/617997 (https://phabricator.wikimedia.org/T257540) (owner: 10Marostegui) [07:32:30] !log remove nonstop-bridging from fasw-c-eqiad switches - T191667 [07:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:33:59] !log upgrade druid analytics (backend for Turnilo/Superset/etc..) to 0.19 [07:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:59] (03CR) 10Elukey: [C: 03+2] role::druid::analytics::worker: upgrade druid to 0.19 [puppet] - 10https://gerrit.wikimedia.org/r/618005 (https://phabricator.wikimedia.org/T244482) (owner: 10Elukey) [07:36:05] (03CR) 10Ayounsi: [C: 03+2] Remove nonstop-bridging from switches [homer/public] - 10https://gerrit.wikimedia.org/r/609139 (https://phabricator.wikimedia.org/T191667) (owner: 10Ayounsi) [07:36:33] (03Merged) 10jenkins-bot: Remove nonstop-bridging from switches [homer/public] - 10https://gerrit.wikimedia.org/r/609139 (https://phabricator.wikimedia.org/T191667) (owner: 10Ayounsi) [07:38:02] 10Operations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): Juniper HA audit - https://phabricator.wikimedia.org/T191667 (10ayounsi) 05Open→03Resolved [07:38:05] !log imported helm_2.16.9-1 to buster-wikimedia [07:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:40:24] (03CR) 10Kormat: [C: 03+2] switchover: Fix import path [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618063 (owner: 10Kormat) [07:43:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/617728 (https://phabricator.wikimedia.org/T52178) (owner: 10MSantos) [07:43:23] !log imported helm_2.16.9-1 to stretch-wikimedia [07:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:32] !log imported helm_2.16.9-1 to jessie-wikimedia [07:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:00] !log installing poppler security updates on stretch [07:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:42] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:58:03] (03PS3) 10Hashar: zuul: stop prefixing report with the job name [puppet] - 10https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) [07:58:27] (03CR) 10Hashar: zuul: stop prefixing report with the job name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: 10Hashar) [08:00:04] marostegui, jynus, and akosiaris: That opportune time is upon us again. Time for a m2 database master failover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200804T0800). [08:00:08] \o/ [08:00:10] let's go? 
[08:00:13] +1 [08:00:17] ok [08:00:23] !log Failover m2 from db1132 to db1107 -T257540 [08:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:31] T257540: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540 [08:02:07] all done [08:02:30] otrs works fine, no need for any actions [08:02:43] I see connections from debmonitor on the new master too [08:02:54] akosiaris: any chance you can generate a write on otrs? [08:03:18] sure [08:03:34] just closed a ticket as spam [08:03:39] worked fine [08:04:05] cooool [08:04:39] recommendation-api isn't complaining either [08:05:21] marostegui: copy process is about to finish, when that happens we can move db1117 and you can shutdown db1132 if you want [08:05:29] excellent [08:06:02] actually, copy just finished [08:06:07] marostegui: sorry, had totally forgotten about the failover, my bad [08:06:20] debmonitor usually reconnects automatically [08:06:31] (03PS1) 10Filippo Giunchedi: librenms: replace python2 purge.py with native options [puppet] - 10https://gerrit.wikimedia.org/r/618235 (https://phabricator.wikimedia.org/T257017) [08:06:32] but lmk if there is any issue [08:06:38] volans: yeah, I didn't ping you cause last time we saw it wasn't needed :) [08:06:45] (03CR) 10jerkins-bot: [V: 04-1] librenms: replace python2 purge.py with native options [puppet] - 10https://gerrit.wikimedia.org/r/618235 (https://phabricator.wikimedia.org/T257017) (owner: 10Filippo Giunchedi) [08:07:30] (03PS2) 10Filippo Giunchedi: librenms: replace python2 purge.py with native options [puppet] - 10https://gerrit.wikimedia.org/r/618235 (https://phabricator.wikimedia.org/T257017) [08:07:46] marostegui: looks like it isn't needed for OTRS/recommendation-api either. Want me to update some docs about that? [08:08:01] akosiaris: yeah, that'd be great :) [08:09:46] do you want to run the move? [08:10:28] jynus: did you start replication or should I? 
[08:10:34] it started automatically [08:10:38] ah cool [08:10:39] by the backup [08:10:44] it caught up already [08:10:45] then I will check its status and do the move [08:11:09] will you test db-move-replica or use the local version? [08:11:35] I have used move_replica from the repo [08:11:40] ok [08:11:45] worked well! [08:11:49] cool [08:13:09] I will give db1107 a few hours before I reimage and move db1132 [08:13:58] !log cleaning up a bunch of prefix limit reached issues [08:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:02] open connections halved: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?panelId=9&fullscreen&orgId=1&var-site=All&var-group=misc&var-shard=m2&var-role=All&from=1596507237819&to=1596528837819 [08:14:07] (03PS1) 10Marostegui: db1132: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/618236 (https://phabricator.wikimedia.org/T257540) [08:15:07] !log installing remaining cups security updates [08:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:10] db2122 is failing to get prometheus metrics [08:15:27] maybe needs a restart [08:15:33] (03CR) 10Marostegui: [C: 03+2] db1132: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/618236 (https://phabricator.wikimedia.org/T257540) (owner: 10Marostegui) [08:15:35] 95% of the time it is that [08:15:41] should I? [08:15:48] go ahead please [08:17:53] now only 29 failures, kicking it worked [08:18:21] I'm wondering if we should just add a preexec or something to the mariadb unit? [08:19:53] or maybe WantedBy to the prometheus one? 
[08:24:39] (03PS2) 10Filippo Giunchedi: alertmanager: add IRC notifier [puppet] - 10https://gerrit.wikimedia.org/r/617688 (https://phabricator.wikimedia.org/T258948) [08:24:41] (03PS2) 10Filippo Giunchedi: role: add alertmanager::irc to alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/617689 (https://phabricator.wikimedia.org/T258948) [08:25:20] (03PS1) 10ZPapierski: Additional prefixes for sdoc for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) [08:33:58] (03CR) 10Ayounsi: "Thanks!" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/618235 (https://phabricator.wikimedia.org/T257017) (owner: 10Filippo Giunchedi) [08:36:54] (03CR) 10Filippo Giunchedi: alertmanager: add IRC notifier (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617688 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [08:39:21] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10tstarling) [08:39:37] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10tstarling) 05Open→03Resolved [08:42:22] (03PS1) 10Marostegui: mariadb: Reimage db2134 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/618238 (https://phabricator.wikimedia.org/T259589) [08:43:06] (03CR) 10Marostegui: [C: 03+2] mariadb: Reimage db2134 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/618238 (https://phabricator.wikimedia.org/T259589) (owner: 10Marostegui) [08:45:35] (03PS3) 10Filippo Giunchedi: librenms: replace python2 purge.py with native options, extend retention. 
[puppet] - 10https://gerrit.wikimedia.org/r/618235 (https://phabricator.wikimedia.org/T257017) [08:45:41] (03CR) 10Hnowlan: [C: 03+1] changeprop: Update repository URL in requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/617694 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [08:46:44] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24291/netmon1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/618235 (https://phabricator.wikimedia.org/T257017) (owner: 10Filippo Giunchedi) [08:46:48] (03CR) 10JMeybohm: [C: 03+2] changeprop: Update repository URL in requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/617694 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [08:47:55] (03Merged) 10jenkins-bot: changeprop: Update repository URL in requirements [deployment-charts] - 10https://gerrit.wikimedia.org/r/617694 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [08:50:02] PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [08:52:16] ^ me [08:52:29] (03CR) 10Ayounsi: [C: 03+1] librenms: replace python2 purge.py with native options, extend retention. [puppet] - 10https://gerrit.wikimedia.org/r/618235 (https://phabricator.wikimedia.org/T257017) (owner: 10Filippo Giunchedi) [08:55:43] (03CR) 10Hnowlan: [C: 03+2] api-gateway: add helmfile.d configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [08:56:28] James_F, Warning: file_put_contents(/home/jforrester/updateVarDumps-record-group1 -- seems to be happening again? 
62 in the past 15 minutes [08:56:47] (03Merged) 10jenkins-bot: api-gateway: add helmfile.d configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [08:58:45] !log installing python3.5 security updates [08:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:04] (03PS1) 10Alexandros Kosiaris: kubernetes: Stop sending ICMP redirects [puppet] - 10https://gerrit.wikimedia.org/r/618239 (https://phabricator.wikimedia.org/T226237) [09:04:11] 10Operations, 10netops: Make eqord it's own AS - https://phabricator.wikimedia.org/T259593 (10ayounsi) p:05Triage→03Medium [09:04:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [09:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC at https://puppet-compiler.wmflabs.org/compiler1003/24292/ says ok, I am gonna merge this to at least mitigate the issue at the linked" [puppet] - 10https://gerrit.wikimedia.org/r/618239 (https://phabricator.wikimedia.org/T226237) (owner: 10Alexandros Kosiaris) [09:16:16] (03PS1) 10Alexandros Kosiaris: admin: Add dvrandecic to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/618243 (https://phabricator.wikimedia.org/T259388) [09:17:44] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10akosiaris) [09:19:11] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10akosiaris) @DVrandecic The patchset is ready to be merged, but we can't proceed without 
your signature on the L3 document. Could you... [09:25:16] liw: Yeah, unfortunately they'll keep happening until the script run finishes. [09:25:59] James_F, ok [09:26:42] (03PS1) 10Elukey: prometheus::druid_exporter: adjust metric list for Druid 0.19 [puppet] - 10https://gerrit.wikimedia.org/r/618244 (https://phabricator.wikimedia.org/T244482) [09:27:55] (03CR) 10Elukey: [C: 03+2] prometheus::druid_exporter: adjust metric list for Druid 0.19 [puppet] - 10https://gerrit.wikimedia.org/r/618244 (https://phabricator.wikimedia.org/T244482) (owner: 10Elukey) [09:28:26] (03PS1) 10Tobias Andersson: DNM WIP: add new limited bot group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) [09:29:39] (03PS2) 10Tobias Andersson: DNM WIP: add new limited bot group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) [09:29:58] (03CR) 10Vgutierrez: [C: 03+2] Release 0.28 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/618081 (https://phabricator.wikimedia.org/T259338) (owner: 10Vgutierrez) [09:30:11] (03PS3) 10Tobias Andersson: DNM WIP: add new limited bot group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) [09:31:31] (03CR) 10Filippo Giunchedi: [C: 03+2] librenms: replace python2 purge.py with native options, extend retention. [puppet] - 10https://gerrit.wikimedia.org/r/618235 (https://phabricator.wikimedia.org/T257017) (owner: 10Filippo Giunchedi) [09:32:52] (03Merged) 10jenkins-bot: Release 0.28 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/618081 (https://phabricator.wikimedia.org/T259338) (owner: 10Vgutierrez) [09:34:15] PROBLEM - Check systemd state on druid1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:21] (03PS1) 10Filippo Giunchedi: librenms: fix purge.py definition [puppet] - 10https://gerrit.wikimedia.org/r/618246 [09:34:25] PROBLEM - Check systemd state on druid1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:32] ahhh lovely [09:34:36] druid is me [09:34:41] PROBLEM - Check systemd state on druid1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:49] PROBLEM - Check systemd state on druid1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:00] a winner is you elukey [09:35:01] PROBLEM - Check systemd state on druid1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:18] godog: apparently I am not able to write correct json [09:35:28] (03CR) 10Filippo Giunchedi: [C: 03+2] librenms: fix purge.py definition [puppet] - 10https://gerrit.wikimedia.org/r/618246 (owner: 10Filippo Giunchedi) [09:35:49] PROBLEM - Check systemd state on druid1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:55] PROBLEM - Check systemd state on druid1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:36:06] I can agree, seems more like for machines than humans [09:36:09] PROBLEM - Check systemd state on an-druid1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:36:17] RECOVERY - Check systemd state on druid1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:03] PROBLEM - Check systemd state on druid1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:13] PROBLEM - Check systemd state on an-druid1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:47] (03PS1) 10Elukey: prometheus::druid_exporter: remove extra comma in 0.19 config [puppet] - 10https://gerrit.wikimedia.org/r/618248 [09:38:50] (03CR) 10Elukey: [C: 03+2] prometheus::druid_exporter: remove extra comma in 0.19 config [puppet] - 10https://gerrit.wikimedia.org/r/618248 (owner: 10Elukey) [09:42:11] RECOVERY - Check systemd state on druid1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:22] (03PS1) 10Marostegui: db2134: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/618249 (https://phabricator.wikimedia.org/T259589) [09:42:39] RECOVERY - Check systemd state on druid1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:33] (03CR) 10Marostegui: [C: 03+2] db2134: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/618249 (https://phabricator.wikimedia.org/T259589) (owner: 10Marostegui) [09:44:01] RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:44:13] RECOVERY - Check systemd state on druid1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
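Editor's note: the druid systemd failures above were traced in the channel to an extra comma in a JSON config ("remove extra comma in 0.19 config"). A minimal sketch of how such a file could be checked before merging, assuming only Python's standard json module. The sample strings and metric names are hypothetical and are not the actual exporter config.

```python
import json


def check_json(text):
    """Return None if text parses as strict JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# Hypothetical metric lists; the first has a trailing comma, which strict JSON rejects.
bad = '{"metrics": ["query/time", "query/bytes",]}'
good = '{"metrics": ["query/time", "query/bytes"]}'

print(check_json(bad))   # an error description (exact wording varies by Python version)
print(check_json(good))  # None
```

Running a check like this in CI or a pre-commit hook would likely have caught the trailing comma before the exporter units failed on the druid hosts.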
[09:44:25] RECOVERY - Check systemd state on druid1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:44:47] RECOVERY - Check systemd state on an-druid1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:15] RECOVERY - Check systemd state on druid1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:21] RECOVERY - Check systemd state on druid1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:33] RECOVERY - Check systemd state on druid1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:37] RECOVERY - Check systemd state on an-druid1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:49:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P12154 and previous config saved to /var/cache/conftool/dbconfig/20200804-094909-marostegui.json [09:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:31] (03PS1) 10Giuseppe Lavagetto: deployment_server::helmfile: use to_yaml instead of a template [puppet] - 10https://gerrit.wikimedia.org/r/618252 [09:56:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P12155 and previous config saved to /var/cache/conftool/dbconfig/20200804-095608-marostegui.json [09:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:15] (03PS1) 10Hnowlan: api-gateway: add dummy tokens [labs/private] - 10https://gerrit.wikimedia.org/r/618254 
(https://phabricator.wikimedia.org/T254906) [09:59:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] deployment_server::helmfile: use to_yaml instead of a template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/618252 (owner: 10Giuseppe Lavagetto) [10:00:15] (03PS2) 10Giuseppe Lavagetto: deployment_server::helmfile: use to_yaml instead of a template [puppet] - 10https://gerrit.wikimedia.org/r/618252 [10:00:30] librenms-wmf: welcome back! [10:00:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3317 for MCR and PK change T259524', diff saved to https://phabricator.wikimedia.org/P12156 and previous config saved to /var/cache/conftool/dbconfig/20200804-100035-marostegui.json [10:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:38] T259524: Review revision table and make sure that the PK is always rev_id - https://phabricator.wikimedia.org/T259524 [10:00:45] 04Critical Testing transport from LibreNMS [10:00:53] yay [10:01:01] now to puppetize the change [10:01:12] \o/ [10:01:46] (03PS1) 10Vgutierrez: api: Exclude not valid parts from get_directory_metadata output [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/618257 (https://phabricator.wikimedia.org/T259338) [10:01:49] (03PS1) 10Vgutierrez: Release 0.28 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/618258 (https://phabricator.wikimedia.org/T259338) [10:03:22] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:05:27] (03PS1) 10Ayounsi: Fix LibreNMs IRC bot [puppet] - 10https://gerrit.wikimedia.org/r/618259 [10:07:12] (03PS3) 10Giuseppe Lavagetto: deployment_server::helmfile: use to_yaml instead of a template [puppet] - 10https://gerrit.wikimedia.org/r/618252 [10:08:40] (03CR) 
10Filippo Giunchedi: [C: 03+1] Fix LibreNMs IRC bot [puppet] - 10https://gerrit.wikimedia.org/r/618259 (owner: 10Ayounsi) [10:08:52] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:08:58] (03CR) 10Ayounsi: [C: 03+2] Fix LibreNMs IRC bot [puppet] - 10https://gerrit.wikimedia.org/r/618259 (owner: 10Ayounsi) [10:10:59] (03CR) 10Vgutierrez: [C: 03+2] api: Exclude not valid parts from get_directory_metadata output [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/618257 (https://phabricator.wikimedia.org/T259338) (owner: 10Vgutierrez) [10:11:12] (03CR) 10Vgutierrez: [C: 03+2] Release 0.28 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/618258 (https://phabricator.wikimedia.org/T259338) (owner: 10Vgutierrez) [10:14:06] (03Merged) 10jenkins-bot: api: Exclude not valid parts from get_directory_metadata output [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/618257 (https://phabricator.wikimedia.org/T259338) (owner: 10Vgutierrez) [10:14:14] (03Merged) 10jenkins-bot: Release 0.28 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/618258 (https://phabricator.wikimedia.org/T259338) (owner: 10Vgutierrez) [10:14:55] You are not authorized to (de)voice librenms-wmf on #wikimedia-operations. 
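Editor's note: the RIPE Atlas alert above flips between CRITICAL at 53 failed probes and OK at 48, with "alerts on 50" in both messages. A small sketch of that threshold arithmetic follows. It assumes the check fires when the failed count is strictly greater than the threshold; the log does not show what happens at exactly 50, so that boundary is a guess, and the function name is made up for illustration.

```python
def atlas_state(failed, threshold=50):
    """Classify a probe-failure count; strict '>' is an assumption, see note above."""
    return "CRITICAL" if failed > threshold else "OK"


# Counts taken from the two eqsin messages in the log.
for failed in (53, 48):
    print(failed, atlas_state(failed))
```

With counts hovering a few probes either side of the threshold, this kind of check will flap, which matches the PROBLEM/RECOVERY pair five minutes apart.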
[10:16:42] XioNoX: if the framework that librenms-wmf is using can handle authing with NickServ, it can be given a cloak and it will auto voice [10:17:08] yeah it's authenticated to nickserv [10:18:02] !log installing imagemagick security updates on stretch [10:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/24295/" [puppet] - 10https://gerrit.wikimedia.org/r/618252 (owner: 10Giuseppe Lavagetto) [10:19:21] (03PS1) 10Vgutierrez: debian: Add release 0.28 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/618261 (https://phabricator.wikimedia.org/T259338) [10:20:07] XioNoX: https://meta.wikimedia.org/wiki/IRC/Cloaks#Obtaining_a_cloak covers the process [10:20:27] thx [10:20:47] <_joe_> XioNoX: done [10:21:02] once it has the cloak, the -ChanServ- 27 *!*@wikimedia/bot/* rule will kick in [10:21:24] <_joe_> if I understood correctly you needed it now, correct? [10:22:36] <_joe_> we will go the auto way once you have a cloak XioNoX :) [10:22:56] sounds good, thx! [10:23:27] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.28 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/618261 (https://phabricator.wikimedia.org/T259338) (owner: 10Vgutierrez) [10:24:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment_server::helmfile: use to_yaml instead of a template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/618252 (owner: 10Giuseppe Lavagetto) [10:26:12] (03Merged) 10jenkins-bot: debian: Add release 0.28 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/618261 (https://phabricator.wikimedia.org/T259338) (owner: 10Vgutierrez) [10:27:07] irc is so annoying... [10:27:55] /msg wmopbot botcloak 9515386d70d63e337d12d112dbbdd963 [10:28:05] wtf [10:28:51] lol :) [10:29:15] quick!
we shall organize an irc channel takeover *g* [10:29:26] why /msg doesn't work in Xchat? [10:31:27] XioNoX: it looks like there's a space at the beginning of that line [10:31:38] alright I think I got it [10:32:59] !log upload acme-chief 0.28 to apt.wm.o (buster) - T259338 [10:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:03] T259338: do not generate metadata for parts that aren't allowed - https://phabricator.wikimedia.org/T259338 [10:38:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:40:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:44:08] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: analytics1050 host + mgmt down - https://phabricator.wikimedia.org/T258370 (10elukey) @Cmjohnson yep do anything that you need! 
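Editor's note: the auto-voice discussion above hinges on the ChanServ rule `*!*@wikimedia/bot/*` matching the bot's hostmask once it has a cloak. A rough sketch of that wildcard match, using Python's fnmatch as a stand-in for IRC mask matching. This is an approximation: fnmatch also gives `[` and `]` special meaning, which IRC masks do not, and IRC case-mapping is simplified to lower-casing. Both example hostmasks are hypothetical.

```python
from fnmatch import fnmatchcase


def mask_matches(mask, hostmask):
    """Approximate IRC mask matching: '*' and '?' wildcards, case-insensitive."""
    return fnmatchcase(hostmask.lower(), mask.lower())


RULE = "*!*@wikimedia/bot/*"  # mask quoted in the channel

# Hypothetical nick!user@host strings, with and without a bot cloak.
cloaked = "librenms-wmf!~librenms@wikimedia/bot/librenms-wmf"
uncloaked = "librenms-wmf!~librenms@203.0.113.7"

print(mask_matches(RULE, cloaked))    # True
print(mask_matches(RULE, uncloaked))  # False
```

This is why being authenticated to NickServ was not enough on its own: without the cloak, the host part does not fall under `wikimedia/bot/*`, so voice had to be given manually in the meantime.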
[10:47:11] !log upgrade acme-chief to version 0.28 [10:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:36] !log installing tomcat8 security updates [10:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:30] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: do not generate metadata for parts that aren't allowed - https://phabricator.wikimedia.org/T259338 (10Vgutierrez) 05Open→03Resolved [10:55:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:57:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:57:54] ^ looks like a maintenance script got lots of permission denied errors [10:58:07] AbuseFilter’s updateVarDumps.php [10:58:44] hm, not sure if that’s the same channel actually [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European mid-day backport window(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200804T1100). 
[11:00:08] o/ [11:00:21] nothing in the calendar, but I’ll deploy a (non-public) fix for T259565 [11:00:22] T259565: [Regression] Unparsed wikitext in various JavaScript messages - https://phabricator.wikimedia.org/T259565 [11:00:42] o/ once you're done, let's jsonize repo [11:01:09] \o/ [11:01:13] exciting [11:03:52] !log installing e2fsprogs security updates for stretch [11:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:40] Lucas_WMDE: oh, that script still runs? Thought it's a one-off and ignored the permission denied messages, hopefully that doesn't throw important logging under the bus :/ [11:08:58] testing on mwdebug1001 [11:10:10] looks good [11:12:39] !log Deployed patch for T86738 / T259565 [11:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:42] T259565: [Regression] Unparsed wikitext in various JavaScript messages - https://phabricator.wikimedia.org/T259565 [11:13:47] Lucas_WMDE: Yeah, still running. [11:16:23] I think I’m done [11:16:29] Amir1: do you want to do the repo honors? [11:16:46] Lucas_WMDE: since you have everything, do you want to do it? [11:16:52] ok, sure [11:16:56] cool! [11:17:08] what’s the gerrit change again? ^^ [11:17:10] * Lucas_WMDE looks [11:17:53] uh, does the gerrit change even exist? [11:18:18] Amir1: ? [11:18:21] or should I create it? [11:18:33] I thought you did [11:18:37] I can double check [11:18:57] I can’t find anything [11:19:00] I don't have any open one [11:19:03] there is a merged change to use JSON on beta [11:19:08] I don’t have a local branch either [11:19:10] I’ll create it [11:19:19] Cool, thanks! [11:22:30] (03PS1) 10Lucas Werkmeister (WMDE): Load WikibaseRepo using extension registration in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618266 (https://phabricator.wikimedia.org/T257433) [11:22:46] ^ [11:23:19] (03CR) 10Ladsgroup: [C: 03+1] "Let's get the party started."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/618266 (https://phabricator.wikimedia.org/T257433) (owner: 10Lucas Werkmeister (WMDE)) [11:23:46] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Load WikibaseRepo using extension registration in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618266 (https://phabricator.wikimedia.org/T257433) (owner: 10Lucas Werkmeister (WMDE)) [11:24:31] wait, why does operations-mw-config-php72-composer-diffConfig-docker show up as a failure? [11:24:31] (03Merged) 10jenkins-bot: Load WikibaseRepo using extension registration in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618266 (https://phabricator.wikimedia.org/T257433) (owner: 10Lucas Werkmeister (WMDE)) [11:24:46] Lucas_WMDE: No change. [11:24:53] Lucas_WMDE: It literally explains. :-) [11:25:12] I was looking at the red/green pills under the commit message, no explanation there :) [11:25:14] "FAILURE No change detected against the current configuration." [11:25:23] and the consoleLog didn’t say that either [11:25:25] but thanks [11:25:29] Which you'd expect, because you made a non-variant change. 
[11:25:41] ok then that makes sense [11:25:44] :) [11:25:47] Lucas_WMDE: you should take the red pill sometimes [11:26:21] testing on mwdebug1001 [11:27:59] basic repo functionality still seems to work [11:30:03] everything is working as far as I can tell [11:30:08] let’s look at logstash [11:30:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:30:29] (03PS2) 10ZPapierski: Additional prefixes for sdoc for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) [11:30:31] hmmmm [11:31:51] we haven't deployed yet :D [11:31:57] yeah [11:32:24] I think it’s still updateVarDumps.php [11:32:28] if I read logstash correctly [11:32:40] plus some good old “Invariant failed: Bad UTF-8 at end of string (3 byte sequence)” [11:33:01] grafana looks like the errors went down again [11:34:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:34:26] the mwdebug logstash has one error about a master connection being made from a get request [11:34:43] from the Translate extension, apparently [11:34:47] but I suspect that’s not new [11:34:53] s/error/warning/ [11:35:29] I think I’ll go ahead and sync [11:35:38] (03PS4) 10Hnowlan: Add discovery and disabled LVS components for API gateway [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) [11:35:43] \o/ [11:36:54] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:618266|Load WikibaseRepo using extension registration in production (T257433)]] (duration: 00m 58s) [11:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:58] T257433: Convert Repo to use Extension Registration - https://phabricator.wikimedia.org/T257433 [11:37:03] 🎉🎉🎉 [11:37:19] !log installing openjdk-11 security updates [11:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:27] thus endeth the age of extension PHP entry points in Wikimedia production [11:39:01] $ git grep '\$IP/extensions/.*\.php' | wc -l [11:39:03] 0 [11:39:04] \o/ \o/ \o/ [11:39:51] we need to update https://extreg-wos.toolforge.org/ :) [11:40:30] Where is the source? 
[11:40:43] heck if I know ^^ [11:40:49] I remember filing a phab task for it before, hang on [11:41:10] https://phabricator.wikimedia.org/T217408 [11:41:41] !log Deploy schema change on s3 codfw master, lag might show up on codfw s3 T259238 [11:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:44] T259238: Code and production differs on s3 on pagelinks table - https://phabricator.wikimedia.org/T259238 [11:41:47] I send an email to ops-l [11:42:30] thanks [11:43:16] 10Operations, 10Citoid, 10Services: Bind Citoid service to a static IP address (or addresses) - https://phabricator.wikimedia.org/T259040 (10Mvolz) @akosiaris is this feasible/desirable? [11:43:27] !log EU backport window done [11:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:44] (03PS1) 10JMeybohm: helmfile: New upstream version 0.125.2 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572) [11:51:34] 10Operations, 10netops: automatically sample from all FPCs on core routers - https://phabricator.wikimedia.org/T257392 (10ayounsi) > or we'd be able to express this via the wildcarding functionality* in Juniper configs Seems like the cleanest way. Here is what it would look like on cr1-eqiad (not committed):...
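Editor's note: the `git grep '\$IP/extensions/.*\.php' | wc -l` check pasted at 11:39 counts lines that still load an extension through a PHP entry point, and printed 0 after the WikibaseRepo sync. A minimal Python sketch of the same count over a list of lines follows. The pattern is the one from the log; the sample config lines and the `Example` extension name are hypothetical.

```python
import re

# Same pattern as the git grep in the log: a literal "$IP/extensions/", anything, then ".php".
PATTERN = re.compile(r"\$IP/extensions/.*\.php")


def count_php_entry_points(lines):
    """Count lines matching the pattern, like `git grep ... | wc -l` does per matching line."""
    return sum(1 for line in lines if PATTERN.search(line))


# Hypothetical config lines: one legacy PHP entry point, one extension-registration call.
sample = [
    'require_once "$IP/extensions/Example/Example.php";',
    "wfLoadExtension( 'Example' );",
]

print(count_php_entry_points(sample))  # 1
```

A count of zero across the config repository is what the channel was celebrating: no remaining lines of the legacy form.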
[11:51:38] (03PS1) 10Ema: atskafka: double buffering_ms [puppet] - 10https://gerrit.wikimedia.org/r/618275 (https://phabricator.wikimedia.org/T254317) [11:52:07] (03CR) 10jerkins-bot: [V: 04-1] helmfile: New upstream version 0.125.2 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [11:52:49] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/618275 (https://phabricator.wikimedia.org/T254317) (owner: 10Ema) [11:53:08] (03PS2) 10JMeybohm: helmfile: New upstream version 0.125.2 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572) [11:56:12] (03CR) 10jerkins-bot: [V: 04-1] helmfile: New upstream version 0.125.2 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [11:56:48] (03CR) 10Ema: [C: 03+2] atskafka: double buffering_ms [puppet] - 10https://gerrit.wikimedia.org/r/618275 (https://phabricator.wikimedia.org/T254317) (owner: 10Ema) [12:00:32] !log helm was updated: 2.16.7-2 -> 2.16.9-1 on chartmuseum*, contint*, deploy* [12:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:45] (03CR) 10JMeybohm: "jenkins fails because of missing backports: https://gerrit.wikimedia.org/r/c/integration/config/+/618276" [debs/helmfile] - 10https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [12:07:54] (03PS2) 10MSantos: Enable printBackground to fix style issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/617728 (https://phabricator.wikimedia.org/T52178) [12:14:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:18:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:23:51] (03CR) 10Hnowlan: [C: 03+1] Add api.wikimedia.org and api.m.wikimedia.org DNS entries [dns] - 10https://gerrit.wikimedia.org/r/599273 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup) [12:24:56] (03CR) 10Hnowlan: [C: 03+1] mediawiki: Add api.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/599751 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup) [12:31:40] (03CR) 10Filippo Giunchedi: "LGTM overall, I think we should deploy it to the logstash7 cluster though as the logstash cluster is soon to be deprecated" [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [12:35:29] 10Operations, 10DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) p:05High→03Medium [12:37:05] (03PS4) 10Tobias Andersson: DNM WIP: add new limited bot group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) [12:38:12] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (CI & Testing services): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611 (10hashar) [12:38:38] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (CI & Testing services): The python-build images regenerate wheels even when matching ones are already available - 
https://phabricator.wikimedia.org/T259611 (10hashar) [12:40:05] (03PS2) 10Hashar: python-build: allow reuse of existing wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) [12:40:37] (03CR) 10Hashar: "Amended commit message to link this change to T259611" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [12:43:34] (03Abandoned) 10Elukey: setup.py: skip pytest>= 6.0.0 to avoid prospector failures [software/spicerack] - 10https://gerrit.wikimedia.org/r/617378 (owner: 10Elukey) [12:50:30] (03PS1) 10Marostegui: wikireplica_dns.yaml: Depool dbproxy1018 [puppet] - 10https://gerrit.wikimedia.org/r/618283 (https://phabricator.wikimedia.org/T255408) [12:52:08] (03PS1) 10Filippo Giunchedi: prometheus: lowercase alerts annotations [puppet] - 10https://gerrit.wikimedia.org/r/618284 (https://phabricator.wikimedia.org/T258948) [13:01:53] (03CR) 10Elukey: Initial release of wmflib (0318 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [13:05:58] (03PS3) 10Elukey: Initial release of wmflib [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) [13:07:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:11:46] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:13:16] (03PS1) 10Ema: varnishmtail: check if varnishncsa is still running [puppet] - 10https://gerrit.wikimedia.org/r/618308 (https://phabricator.wikimedia.org/T259020) [13:18:14] (03CR) 10Vgutierrez: varnishmtail: check if varnishncsa is still running (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/618308 (https://phabricator.wikimedia.org/T259020) (owner: 10Ema) [13:19:11] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/618243 (https://phabricator.wikimedia.org/T259388) (owner: 10Alexandros Kosiaris) [13:19:14] (03CR) 10Kormat: [C: 03+1] wikireplica_dns.yaml: Depool dbproxy1018 [puppet] - 10https://gerrit.wikimedia.org/r/618283 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [13:21:51] (03CR) 10Elukey: [C: 03+1] Install anaconda-wmf on stat nodes [puppet] - 10https://gerrit.wikimedia.org/r/618106 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [13:23:04] 10Operations, 10netops: Re-prioritize peering over transit - https://phabricator.wikimedia.org/T259614 (10ayounsi) p:05Triage→03Medium [13:26:53] (03CR) 10Mholloway: [C: 03+2] Enable printBackground to fix style issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/617728 (https://phabricator.wikimedia.org/T52178) (owner: 10MSantos) [13:27:06] (03PS1) 10Kormat: check_health: update from check_mariadb.py in puppet repo [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618310 (https://phabricator.wikimedia.org/T259516) [13:27:33] (03CR) 10jerkins-bot: [V: 04-1] check_health: update from check_mariadb.py in puppet repo [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618310 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [13:27:58] (03Merged) 10jenkins-bot: Enable printBackground to fix style issues 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/617728 (https://phabricator.wikimedia.org/T52178) (owner: 10MSantos) [13:28:57] (03PS2) 10Ema: varnishmtail: check if varnishncsa is still running [puppet] - 10https://gerrit.wikimedia.org/r/618308 (https://phabricator.wikimedia.org/T259020) [13:29:35] (03PS1) 10Mforns: analytics::reportupdater::jobs: Absent ee-beta-features job [puppet] - 10https://gerrit.wikimedia.org/r/618311 (https://phabricator.wikimedia.org/T256195) [13:29:38] (03CR) 10Ema: varnishmtail: check if varnishncsa is still running (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/618308 (https://phabricator.wikimedia.org/T259020) (owner: 10Ema) [13:30:46] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey) @Jclark-ctr would it be possible to move some CX allocations to BX? I am asking since workers on ROW C will be a little bit more unbalanced, bu... [13:34:01] (03PS3) 10Ottomata: Install anaconda-wmf on stat nodes [puppet] - 10https://gerrit.wikimedia.org/r/618106 (https://phabricator.wikimedia.org/T251006) [13:34:06] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Install anaconda-wmf on stat nodes [puppet] - 10https://gerrit.wikimedia.org/r/618106 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [13:34:38] (03CR) 10Elukey: [C: 03+2] analytics::reportupdater::jobs: Absent ee-beta-features job [puppet] - 10https://gerrit.wikimedia.org/r/618311 (https://phabricator.wikimedia.org/T256195) (owner: 10Mforns) [13:35:46] 10Operations, 10Desktop Improvements, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 (10ema) @ovasileva: shall we close this now? If other invalidations are needed in the fu... 
[13:37:00] (03PS1) 10Ema: Revert "ATS: force cache revalidation on a few wikis" [puppet] - 10https://gerrit.wikimedia.org/r/618294 (https://phabricator.wikimedia.org/T256750) [13:40:19] 10Operations, 10Desktop Improvements, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 (10ovasileva) 05Open→03Resolved a:03ovasileva Sounds good to me, thanks @ema! [13:41:45] 10Operations, 10LDAP-Access-Requests: Add Carol Dunn to the wmf LDAP group - https://phabricator.wikimedia.org/T259615 (10nshahquinn-wmf) [13:42:44] (03CR) 10Tobias Andersson: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) (owner: 10Tobias Andersson) [13:43:30] (03PS2) 10Kormat: check_health: update from check_mariadb.py in puppet repo [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618310 (https://phabricator.wikimedia.org/T259516) [13:48:40] (03PS15) 10Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) [13:49:33] (03CR) 10Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [13:49:44] (03CR) 10jerkins-bot: [V: 04-1] Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [13:49:44] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:49:46] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:51:02] !log Install newer openjdk on contint2001 and restarting CI Jenkins [13:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:51] 10Operations, 10Citoid, 10Services: Bind Citoid service to a static IP address (or addresses) - https://phabricator.wikimedia.org/T259040 (10akosiaris) It's definitely not desirable. I 've commented already in T254700#6359644, best to keep the discussion there? [13:53:28] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:56:44] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [13:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:47] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [13:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:52] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[13:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:00] (03PS1) 10JMeybohm: helm-diff: New upstream version 3.1.2 [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618314 (https://phabricator.wikimedia.org/T25857) [13:57:52] (03CR) 10jerkins-bot: [V: 04-1] helm-diff: New upstream version 3.1.2 [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618314 (https://phabricator.wikimedia.org/T25857) (owner: 10JMeybohm) [13:58:32] PROBLEM - DPKG on stat1007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:58:46] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:00:24] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:02:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:04:24] (03CR) 10Alexandros Kosiaris: "Most of the work for reasons that elude me currently used to happen in the buster-wikimedia branch. 
Does this change also diverge from tha" [debs/helmfile] - 10https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [14:05:18] (03PS16) 10Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) [14:05:20] (03PS1) 10Giuseppe Lavagetto: Bumping termbox chart version to pick up the changes to the tls templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/618316 [14:05:22] (03PS1) 10Giuseppe Lavagetto: Enable the service proxy on termbox in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/618317 [14:05:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:05:44] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:06:27] (03CR) 10jerkins-bot: [V: 04-1] Bumping termbox chart version to pick up the changes to the tls templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/618316 (owner: 10Giuseppe Lavagetto) [14:06:34] (03CR) 10jerkins-bot: [V: 04-1] Enable the service proxy on termbox in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/618317 (owner: 10Giuseppe Lavagetto) [14:06:41] (03CR) 10jerkins-bot: [V: 04-1] Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [14:06:43] (03CR) 10JMeybohm: "> Patch Set 2:" [debs/helmfile] - 
10https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [14:09:54] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:10:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P12157 and previous config saved to /var/cache/conftool/dbconfig/20200804-141004-marostegui.json [14:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:14] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:10:18] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:10:35] (03PS2) 10JMeybohm: helm-diff: New upstream version 3.1.2 [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618314 (https://phabricator.wikimedia.org/T25857) [14:10:39] (03PS3) 10JMeybohm: helmfile: New upstream version 0.125.2 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572) [14:10:46] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 3310.55 ms [14:11:05] (03CR) 10jerkins-bot: [V: 04-1] helm-diff: New upstream version 3.1.2 [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618314 (https://phabricator.wikimedia.org/T25857) (owner: 10JMeybohm) [14:12:04] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:12:36] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:13:26] (03PS17) 10Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) [14:13:28] (03PS2) 10Giuseppe Lavagetto: Bumping termbox chart version to pick up the changes to the tls templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/618316 [14:13:30] (03PS2) 10Giuseppe Lavagetto: Enable the service proxy on termbox in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/618317 [14:13:43] (03CR) 10jerkins-bot: [V: 04-1] helmfile: New upstream version 0.125.2 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [14:14:00] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey) To summarize: if I start from https://phabricator.wikimedia.org/T243521#6005828, more precisely from the view of the distribution of hadoop wor... 
[14:14:56] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:15:02] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:15:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] common_templates: Sort values for checksum/tls-certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/618008 (owner: 10JMeybohm) [14:15:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:15:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P12158 and previous config saved to /var/cache/conftool/dbconfig/20200804-141556-marostegui.json [14:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:20] (03PS2) 10Giuseppe Lavagetto: common_templates: Sort values for checksum/tls-certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/618008 (owner: 10JMeybohm) [14:16:40] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 212.30 ms [14:16:48] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 38, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:18:05] (03CR) 10JMeybohm: "recheck with buster backports" [debs/helmfile] - 10https://gerrit.wikimedia.org/r/618273 
(https://phabricator.wikimedia.org/T258572) (owner: 10JMeybohm) [14:18:07] (03CR) 10JMeybohm: "recheck with buster backports" [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618314 (https://phabricator.wikimedia.org/T25857) (owner: 10JMeybohm) [14:18:30] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 233.66 ms [14:20:50] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:22:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P12159 and previous config saved to /var/cache/conftool/dbconfig/20200804-142220-marostegui.json [14:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:30] (03CR) 10Volans: "Nice fixes and really nice to have CI working (and passing flying colors)." 
(0320 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [14:25:31] (03PS4) 10JMeybohm: helmfile: New upstream version 0.125.2 [debs/helmfile] - 10https://gerrit.wikimedia.org/r/618273 (https://phabricator.wikimedia.org/T258572) [14:26:41] (03PS4) 10Elukey: Initial release of wmflib [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) [14:27:06] (03CR) 10Alexandros Kosiaris: deployment_server::helmfile: use to_yaml instead of a template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/618252 (owner: 10Giuseppe Lavagetto) [14:27:08] (03CR) 10Ppchelko: [C: 03+1] api-gateway: add dummy tokens [labs/private] - 10https://gerrit.wikimedia.org/r/618254 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [14:27:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P12160 and previous config saved to /var/cache/conftool/dbconfig/20200804-142710-marostegui.json [14:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:40] 10Operations, 10Citoid, 10Services: Bind Citoid service to a static IP address (or addresses) - https://phabricator.wikimedia.org/T259040 (10akosiaris) Actually, lemme merge this back in it. no reason to have a lot of tasks open that track exactly the same thing. 
[14:27:50] (03PS18) 10Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) [14:27:52] (03PS3) 10Giuseppe Lavagetto: Bumping termbox chart version to pick up the changes to the tls templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/618316 [14:27:54] (03PS3) 10Giuseppe Lavagetto: Enable the service proxy on termbox in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/618317 [14:28:01] 10Operations, 10Citoid, 10Services: Bind Citoid service to a static IP address (or addresses) - https://phabricator.wikimedia.org/T259040 (10akosiaris) [14:28:53] (03PS3) 10ZPapierski: Additional prefixes for sdoc for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) [14:29:23] (03CR) 10Elukey: Initial release of wmflib (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [14:29:28] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:30:30] (03CR) 10jerkins-bot: [V: 04-1] Additional prefixes for sdoc for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) (owner: 10ZPapierski) [14:30:58] 10Operations, 10Platform Engineering, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 6 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10akosiaris) >>! In T224041#6353152, @jeena wrote: > We attempted to run the tests using C... 
[14:35:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1136 for MCR', diff saved to https://phabricator.wikimedia.org/P12161 and previous config saved to /var/cache/conftool/dbconfig/20200804-143524-marostegui.json [14:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:56] (03PS4) 10ZPapierski: Additional prefixes for sdoc for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/618237 (https://phabricator.wikimedia.org/T258625) [14:37:41] 10Operations, 10Platform Engineering, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 6 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10akosiaris) >>! In T224041#6353283, @jeena wrote: > So I re-tried installing the chart on... [14:37:55] (03CR) 10Kormat: "I've done basic testing of this in the pontoon virtual env, and it seems to work fine." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618310 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [14:41:04] !log powercycling analytics1050 T258370 [14:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:07] T258370: analytics1050 host + mgmt down - https://phabricator.wikimedia.org/T258370 [14:48:50] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: analytics1050 host + mgmt down - https://phabricator.wikimedia.org/T258370 (10Cmjohnson) 05Open→03Resolved @elukey the power reset cleared the issue, I was able to login to the idrac and reach console com2. 
Resolving the task [14:48:53] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [14:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:30] !log swapping kubernetes1010 network cable T257542 [14:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:32] T257542: Interface errors on asw2-b-eqiad:ge-5/0/35 (kubernetes1010) - https://phabricator.wikimedia.org/T257542 [15:00:36] 10Operations, 10ops-eqiad: Interface errors on asw2-b-eqiad:ge-5/0/35 (kubernetes1010) - https://phabricator.wikimedia.org/T257542 (10Cmjohnson) 05Open→03Resolved Swapped the network cable and cleared the switch errors. Resolving, if the problem persists please open again. [15:01:48] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] api-gateway: add dummy tokens [labs/private] - 10https://gerrit.wikimedia.org/r/618254 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [15:02:14] PROBLEM - nova-compute proc minimum on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:03:48] (03CR) 10Jcrespo: "I trust you with code changes. I am wondering what's the plan about packaging. 
Not the one shown here (which is ok as is), but regarding l" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618310 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [15:04:10] RECOVERY - nova-compute proc minimum on cloudvirt1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:08:39] !log installing qemu security updates on cloudvirt* Stretch hosts [15:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:10] (03CR) 10Kormat: "> Patch Set 2:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618310 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [15:10:32] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:11:17] (03PS3) 10Kormat: check_health: update from check_mariadb.py in puppet repo [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618310 (https://phabricator.wikimedia.org/T259516) [15:11:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:12:51] (03Merged) 10jenkins-bot: Add local service proxy to the tls terminator v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:14:15] (03PS3) 10Ottomata: Remove now unused wgEventServiceStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616570 (https://phabricator.wikimedia.org/T229863) [15:16:31] (03CR) 10Ottomata: [C: 03+2] Remove now unused wgEventServiceStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616570 (https://phabricator.wikimedia.org/T229863) (owner: 10Ottomata) [15:18:49] !log 
installing jackson-databind security issues [15:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:57] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove now unused wgEventServiceStreamConfig - T229863 (duration: 00m 58s) [15:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:59] T229863: Refactor EventBus mediawiki configuration - https://phabricator.wikimedia.org/T229863 [15:20:38] (03PS5) 10Elukey: Initial release of wmflib [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) [15:21:16] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-17) rack/setup/install - https://phabricator.wikimedia.org/T255520 (10Cmjohnson) [15:22:06] (03CR) 10Elukey: Initial release of wmflib (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [15:22:08] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Bumping termbox chart version to pick up the changes to the tls templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/618316 (owner: 10Giuseppe Lavagetto) [15:22:35] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [15:23:01] (03CR) 10Elukey: Initial release of wmflib (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [15:23:17] (03Merged) 10jenkins-bot: Bumping termbox chart version to pick up the changes to the tls templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/618316 (owner: 10Giuseppe Lavagetto) [15:26:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:27:20] (03CR) 10Volans: "I just noticed we're missing a .gitignore, definitely we want that with some basic stuff for now." (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [15:29:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:31:47] (03CR) 10Jcrespo: [C: 03+1] "General +1, we can do careful testing on deploy, specially given it is a paging check and affects all databases." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618310 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [15:31:49] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[15:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:18] (03PS1) 10Ottomata: EventStreamConfig - Set default topic_prefixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618330 (https://phabricator.wikimedia.org/T255888) [15:33:30] (03CR) 10Kormat: [C: 03+2] check_health: update from check_mariadb.py in puppet repo [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618310 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [15:34:01] (03Merged) 10jenkins-bot: check_health: update from check_mariadb.py in puppet repo [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618310 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [15:34:06] (03CR) 10jerkins-bot: [V: 04-1] EventStreamConfig - Set default topic_prefixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618330 (https://phabricator.wikimedia.org/T255888) (owner: 10Ottomata) [15:34:29] (03PS3) 10JMeybohm: helm-diff: New upstream version 3.1.2 [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/618314 (https://phabricator.wikimedia.org/T258572) [15:36:21] (03PS2) 10Ottomata: EventStreamConfig - Set default topic_prefixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618330 (https://phabricator.wikimedia.org/T255888) [15:37:42] PROBLEM - Stale file for node-exporter textfile in eqiad on icinga1001 is CRITICAL: cluster=analytics file=intel_microcode.prom instance=analytics1050 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [15:38:17] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [15:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:03] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . 
[15:39:03] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [15:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:33] (03CR) 10Ottomata: [C: 03+2] EventStreamConfig - Set default topic_prefixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618330 (https://phabricator.wikimedia.org/T255888) (owner: 10Ottomata) [15:42:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:43:49] (03CR) 10BryanDavis: [C: 03+1] wikireplica_dns.yaml: Depool dbproxy1018 [puppet] - 10https://gerrit.wikimedia.org/r/618283 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [15:47:19] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-02) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (10Cmjohnson) [15:47:44] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-02) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (10Cmjohnson) BIOS/IDRAC updated, switch ports labeled but disabled [15:50:14] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:51:43] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [15:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:56] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Cmjohnson) @elukey I am sorry, 10G rackspace is already very limited and with these being 2U servers it's even tighter. The racking configuration will... [15:53:45] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Set default topic_prefixes - T255888 (duration: 00m 58s) [15:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:48] T255888: EventStreamConfig's auto-topics config is incorrect - https://phabricator.wikimedia.org/T255888 [15:58:38] RECOVERY - Stale file for node-exporter textfile in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [15:59:53] (03PS1) 10Elukey: Move oozie server to an-scheduler1001 [puppet] - 10https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412) [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet request window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200804T1600). Please do the needful. 
[16:00:54] (03PS5) 10Lucas Werkmeister (WMDE): Enable Data Bridge on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595542 (https://phabricator.wikimedia.org/T232584) [16:00:56] (03PS5) 10Lucas Werkmeister (WMDE): DNM: Enable Data Bridge on Catalan Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595543 (https://phabricator.wikimedia.org/T232584) [16:01:14] (03CR) 10jerkins-bot: [V: 04-1] Move oozie server to an-scheduler1001 [puppet] - 10https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [16:02:20] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [16:02:21] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [16:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:51] (03PS2) 10Elukey: Move oozie server to an-scheduler1001 [puppet] - 10https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412) [16:04:53] (03PS6) 10Cwhite: prometheus: puppetized install of prometheus-es-exporter [puppet] - 10https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) [16:05:20] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey) @Cmjohnson yep no problem, thanks :) [16:05:37] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' . 
[16:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:09] (03CR) 10jerkins-bot: [V: 04-1] Move oozie server to an-scheduler1001 [puppet] - 10https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [16:08:55] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:34] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10Cmjohnson) @elukey @wiki_willy I am getting ready to do all of these and T259071 this week. What do you need? Do they no longer need 10G? Please let me know be... [16:13:40] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10elukey) Yep all good, you can proceed with 10G :) [16:14:14] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/618284 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:14:52] (03PS3) 10Elukey: Move oozie server to an-scheduler1001 [puppet] - 10https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412) [16:15:40] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' . 
[16:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:27] (03PS4) 10Elukey: Move oozie server to an-scheduler1001 [puppet] - 10https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412) [16:18:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:48] (03PS5) 10Elukey: Move oozie server to an-scheduler1001 [puppet] - 10https://gerrit.wikimedia.org/r/618339 (https://phabricator.wikimedia.org/T257412) [16:24:39] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [16:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:37] (03PS1) 10Nray: Re-enable growth study quick surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618343 (https://phabricator.wikimedia.org/T257015) [16:27:44] (03PS1) 10Hnowlan: api-gateway: remove out-dated repositories definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/618344 [16:28:06] (03PS1) 10Herron: alerting_host: assign alert[12]001 role::alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/618345 (https://phabricator.wikimedia.org/T247966) [16:28:28] (03PS2) 10Nray: Re-enable growth study quick survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618343 (https://phabricator.wikimedia.org/T257015) [16:28:30] (03CR) 10jerkins-bot: [V: 04-1] alerting_host: assign alert[12]001 role::alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/618345 (https://phabricator.wikimedia.org/T247966) (owner: 10Herron) [16:29:08] (03PS2) 10Herron: alerting_host: assign alert[12]001 role::alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/618345 (https://phabricator.wikimedia.org/T247966) [16:31:16] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[16:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:18] (CR) JMeybohm: [C: +1] "Must have missed those. Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/618344 (owner: Hnowlan) [16:33:32] (PS1) JMeybohm: blubberoid: remove out-dated repositories definition [deployment-charts] - https://gerrit.wikimedia.org/r/618347 (https://phabricator.wikimedia.org/T253843) [16:34:10] (CR) Hnowlan: [C: +2] api-gateway: remove out-dated repositories definition [deployment-charts] - https://gerrit.wikimedia.org/r/618344 (owner: Hnowlan) [16:35:37] (Merged) jenkins-bot: api-gateway: remove out-dated repositories definition [deployment-charts] - https://gerrit.wikimedia.org/r/618344 (owner: Hnowlan) [16:38:52] (PS1) Ssingh: aptrepo: add a component for pdns-recursor [puppet] - https://gerrit.wikimedia.org/r/618349 (https://phabricator.wikimedia.org/T252132) [16:40:56] (CR) Arlolra: "> Patch Set 1: Code-Review+1" [mediawiki-config] - https://gerrit.wikimedia.org/r/618155 (owner: Arlolra) [16:43:08] (CR) Herron: "PCC https://puppet-compiler.wmflabs.org/compiler1003/24299/" [puppet] - https://gerrit.wikimedia.org/r/618345 (https://phabricator.wikimedia.org/T247966) (owner: Herron) [16:43:24] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [16:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:49] (PS1) JMeybohm: helm: Replace repo update cronjob by systemd timer [puppet] - https://gerrit.wikimedia.org/r/618350 [16:49:57] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[16:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:33] (CR) Herron: [C: +1] alertmanager: add IRC notifier [puppet] - https://gerrit.wikimedia.org/r/617688 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi) [16:51:25] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [16:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:12] (PS1) JMeybohm: releases: Remove deployment-charts repo [puppet] - https://gerrit.wikimedia.org/r/618352 (https://phabricator.wikimedia.org/T253843) [17:00:05] halfak and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200804T1700). Please do the needful. [17:02:39] (CR) Brennen Bearnes: [C: +2] Branch commit for wmf/1.36.0-wmf.3 [core] (wmf/1.36.0-wmf.3) - https://gerrit.wikimedia.org/r/618164 (https://phabricator.wikimedia.org/T257971) (owner: TrainBranchBot) [17:03:44] !log 1.36.0-wmf.3 was branched at 2d0cf09cdf for T257971 [17:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:59] T257971: 1.36.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T257971 [17:08:06] (CR) Dzahn: [C: +2] "affects only mwmaint* servers and is already installed but would not be puppetized otherwise" [puppet] - https://gerrit.wikimedia.org/r/618161 (https://phabricator.wikimedia.org/T255629) (owner: Dzahn) [17:08:14] (CR) Dzahn: [C: +2] "upstream bug reported at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=967080" [puppet] - https://gerrit.wikimedia.org/r/618161 (https://phabricator.wikimedia.org/T255629) (owner: Dzahn) [17:09:29] there it was again, puppet-merge sees multiple changes but you say "no" to the first one and then you can merge your own one [17:09:37] sometimes it just works like that [17:17:00] !log
hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [17:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:08] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [17:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:40] (PS10) Elukey: Move mjolnir's daemons to search-loader hosts [puppet] - https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) [17:25:38] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [17:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:07] (Merged) jenkins-bot: Branch commit for wmf/1.36.0-wmf.3 [core] (wmf/1.36.0-wmf.3) - https://gerrit.wikimedia.org/r/618164 (https://phabricator.wikimedia.org/T257971) (owner: TrainBranchBot) [17:29:54] (PS4) Dzahn: aphlict: make client port and IP also configurable, rename parameters [puppet] - https://gerrit.wikimedia.org/r/616922 [17:31:47] (PS1) Hnowlan: api-gatway: while testing, don't use a http liveness check [deployment-charts] - https://gerrit.wikimedia.org/r/618355 (https://phabricator.wikimedia.org/T254906) [17:33:27] (CR) Elukey: [C: +2] Move mjolnir's daemons to search-loader hosts [puppet] - https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) (owner: Elukey) [17:33:50] (CR) Hnowlan: [C: +2] api-gatway: while testing, don't use a http liveness check [deployment-charts] - https://gerrit.wikimedia.org/r/618355 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan) [17:34:40] merged dummy tokens via puppet-merge :) [17:35:23] (Merged) jenkins-bot: api-gatway: while testing, don't use a http liveness check [deployment-charts] - https://gerrit.wikimedia.org/r/618355
(https://phabricator.wikimedia.org/T254906) (owner: Hnowlan) [17:36:46] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [17:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:30] (PS1) Brennen Bearnes: testwikis wikis to 1.36.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/618356 [17:38:32] (CR) Brennen Bearnes: [C: +2] testwikis wikis to 1.36.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/618356 (owner: Brennen Bearnes) [17:39:30] (Merged) jenkins-bot: testwikis wikis to 1.36.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/618356 (owner: Brennen Bearnes) [17:39:48] (CR) Dzahn: [C: +1] "looks alright. just that hosts who have alerting_hosts role get removed from .. monitoring.." [puppet] - https://gerrit.wikimedia.org/r/618345 (https://phabricator.wikimedia.org/T247966) (owner: Herron) [17:40:08] !log brennen@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.3 [17:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:13] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[17:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:30] (PS1) Hnowlan: api-gateway: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/618358 (https://phabricator.wikimedia.org/T254906) [17:45:04] (CR) Hnowlan: [C: +2] api-gateway: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/618358 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan) [17:46:20] (Merged) jenkins-bot: api-gateway: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/618358 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan) [17:46:34] (PS1) Herron: kafkamon: add role::kafak::monitoring_buster, assign kafkamon[12]002 [puppet] - https://gerrit.wikimedia.org/r/618359 [17:47:30] (CR) jerkins-bot: [V: -1] kafkamon: add role::kafak::monitoring_buster, assign kafkamon[12]002 [puppet] - https://gerrit.wikimedia.org/r/618359 (owner: Herron) [17:48:01] (PS5) Dzahn: aphlict: make client port and IP also configurable, rename parameters [puppet] - https://gerrit.wikimedia.org/r/616922 [17:48:22] (PS2) Herron: kafkamon: add role::kafak::monitoring_buster, assign kafkamon[12]002 [puppet] - https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) [17:50:16] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[17:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:54] Operations, ops-eqiad, decommission-hardware: Decommission weblog1001 (unrack or return to spares) - https://phabricator.wikimedia.org/T259217 (Cmjohnson) a:Cmjohnson [17:51:14] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=mjolnir site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:51:17] Operations, ops-eqiad, DC-Ops, decommission-hardware: Decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (Cmjohnson) a:Jclark-ctr→Cmjohnson [17:51:38] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (Cmjohnson) a:Jclark-ctr→Cmjohnson [17:52:14] Operations, ops-eqiad, DC-Ops: decomission oresrdb100[12] - https://phabricator.wikimedia.org/T254238 (Cmjohnson) a:Jclark-ctr→Cmjohnson [17:58:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200804T1800) [18:00:08] (PS6) Dzahn: aphlict: make client port and IP also configurable, rename parameters [puppet] - https://gerrit.wikimedia.org/r/616922 [18:01:03] Operations, VPS-Projects, Wikimedia-Mailing-lists, User-Ladsgroup, cloud-services-team (Kanban): Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28 - https://phabricator.wikimedia.org/T259444 (Ladsgroup) >>!
In T259444#6356557, @akosiaris wrote: > I am guessing this isn't... [18:02:08] (PS1) Hnowlan: api-gateway: enable routing rules [deployment-charts] - https://gerrit.wikimedia.org/r/618362 (https://phabricator.wikimedia.org/T254906) [18:04:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:04:33] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/24304/aphlict1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/616922 (owner: Dzahn) [18:05:22] (PS7) Dzahn: aphlict: make client port and IP also configurable, rename parameters [puppet] - https://gerrit.wikimedia.org/r/616922 [18:05:40] (CR) Dzahn: "has V+2 and C+2, where's my submit button?" [puppet] - https://gerrit.wikimedia.org/r/616922 (owner: Dzahn) [18:06:46] (CR) Dzahn: "+0.5 - As far as I know this is just fine but also I have neither added nor used it."
[puppet] - https://gerrit.wikimedia.org/r/618352 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm) [18:07:23] (CR) Dzahn: [C: +2] aphlict: make client port and IP also configurable, rename parameters [puppet] - https://gerrit.wikimedia.org/r/616922 (owner: Dzahn) [18:07:39] (CR) Ppchelko: [C: +1] "Some random chatter inlined, feel free to ignore" (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/618362 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan) [18:08:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:11:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:12:58] (CR) Dzahn: ""type": "client"," [puppet] - https://gerrit.wikimedia.org/r/616922 (owner: Dzahn) [18:13:32] (PS3) Dzahn: ores: add envoy-proxy for TLS termination behind ATS [puppet] - https://gerrit.wikimedia.org/r/615569 (https://phabricator.wikimedia.org/T210411) [18:13:39] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/618349 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh) [18:19:08] (PS2) Hnowlan: api-gateway: enable routing rules [deployment-charts] - https://gerrit.wikimedia.org/r/618362 (https://phabricator.wikimedia.org/T254906) [18:19:24] !log temp disabling puppet on all ores hosts to add envoy [18:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:39] (CR) Dzahn: [C: +2]
ores: add envoy-proxy for TLS termination behind ATS [puppet] - https://gerrit.wikimedia.org/r/615569 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [18:20:09] (CR) Hnowlan: api-gateway: enable routing rules (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/618362 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan) [18:20:37] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 59 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:20:46] (PS3) Hnowlan: api-gateway: enable routing rules [deployment-charts] - https://gerrit.wikimedia.org/r/618362 (https://phabricator.wikimedia.org/T254906) [18:21:17] (PS1) Bstorm: galera: try out parallel replication [puppet] - https://gerrit.wikimedia.org/r/618363 [18:21:42] (CR) jerkins-bot: [V: -1] galera: try out parallel replication [puppet] - https://gerrit.wikimedia.org/r/618363 (owner: Bstorm) [18:22:33] (PS1) Elukey: role::search::loader: add URL of elastic search endpoing [puppet] - https://gerrit.wikimedia.org/r/618365 (https://phabricator.wikimedia.org/T258245) [18:23:07] (CR) Hnowlan: [C: +2] api-gateway: enable routing rules [deployment-charts] - https://gerrit.wikimedia.org/r/618362 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan) [18:24:44] (Merged) jenkins-bot: api-gateway: enable routing rules [deployment-charts] - https://gerrit.wikimedia.org/r/618362 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan) [18:25:08] (PS2) Bstorm: galera: try out parallel replication [puppet] - https://gerrit.wikimedia.org/r/618363 [18:25:41] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/24306/" [puppet] - https://gerrit.wikimedia.org/r/618365 (https://phabricator.wikimedia.org/T258245) (owner: Elukey)
[18:25:47] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:26:10] (CR) Bstorm: "We could totally reduce the number, but 24 *should* be safe. 8 or 10 might seem gentler since these are rather multi-purpose servers. 4 t" [puppet] - https://gerrit.wikimedia.org/r/618363 (owner: Bstorm) [18:26:38] Operations, VPS-Projects, Wikimedia-Mailing-lists, User-Ladsgroup, and 2 others: Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28 - https://phabricator.wikimedia.org/T259444 (bd808) Open→Resolved a:bd808 I made the record. Let's try to remember that it exists wh... [18:26:42] Operations, Wikimedia-Mailing-lists, User-Ladsgroup: Setup Mailman3 in Cloud VPS - https://phabricator.wikimedia.org/T258365 (bd808) [18:26:55] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [18:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:12] (CR) Ebernhardson: [C: +1] "Looks like that should work" [puppet] - https://gerrit.wikimedia.org/r/618365 (https://phabricator.wikimedia.org/T258245) (owner: Elukey) [18:28:25] (CR) Elukey: [C: +2] role::search::loader: add URL of elastic search endpoing [puppet] - https://gerrit.wikimedia.org/r/618365 (https://phabricator.wikimedia.org/T258245) (owner: Elukey) [18:28:58] (PS3) Bstorm: galera: try out parallel replication [puppet] - https://gerrit.wikimedia.org/r/618363 [18:35:43] (CR) Andrew Bogott: [C: +1] galera: try out parallel replication [puppet] - https://gerrit.wikimedia.org/r/618363 (owner: Bstorm) [18:37:22] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[18:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:53] (PS1) Dzahn: ssl: add TLS certificate for ORES [puppet] - https://gerrit.wikimedia.org/r/618367 (https://phabricator.wikimedia.org/T210411) [18:40:00] (PS1) Dzahn: add fake key for ORES TLS cert [labs/private] - https://gerrit.wikimedia.org/r/618368 (https://phabricator.wikimedia.org/T210411) [18:40:42] (CR) Ssingh: [C: +2] aptrepo: add a component for pdns-recursor [puppet] - https://gerrit.wikimedia.org/r/618349 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh) [18:41:30] (CR) Dzahn: [C: +2] ssl: add TLS certificate for ORES [puppet] - https://gerrit.wikimedia.org/r/618367 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [18:42:16] (CR) Dzahn: [V: +2 C: +2] add fake key for ORES TLS cert [labs/private] - https://gerrit.wikimedia.org/r/618368 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [18:42:41] (PS2) Dzahn: add fake key for ORES TLS cert [labs/private] - https://gerrit.wikimedia.org/r/618368 (https://phabricator.wikimedia.org/T210411) [18:42:51] (CR) Dzahn: [V: +2 C: +2] add fake key for ORES TLS cert [labs/private] - https://gerrit.wikimedia.org/r/618368 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [18:46:07] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:46:12] !log letting puppet install envoy on all ores2* hosts [18:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:36] (CR) Bstorm: [C: +2] galera: try out parallel replication [puppet] - https://gerrit.wikimedia.org/r/618363 (owner: Bstorm) [18:49:49] !log letting puppet install envoy on all ores1* hosts [18:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:09] PROBLEM - Ensure local MW versions match expected deployment on mw1409 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:12] PROBLEM - Ensure local MW versions match expected deployment on mw1343 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:12] PROBLEM - Ensure local MW versions match expected deployment on mw2187 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:16] PROBLEM - Ensure local MW versions match expected deployment on mw2333 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:16] PROBLEM - Ensure local MW versions match expected deployment on mw2350 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:16] PROBLEM - Ensure local MW versions match expected deployment on mw2371 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:16] PROBLEM - Ensure local MW versions match expected deployment on mw2373 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:16] PROBLEM - Ensure local MW versions match expected
deployment on mw2280 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:28] PROBLEM - Ensure local MW versions match expected deployment on mw2199 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:28] PROBLEM - Ensure local MW versions match expected deployment on mwdebug2001 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:32] PROBLEM - Ensure local MW versions match expected deployment on mw1363 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:51:40] what.. [18:51:44] jouncebot: now [18:51:44] For the next 0 hour(s) and 8 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200804T1800) [18:52:04] PROBLEM - Ensure local MW versions match expected deployment on mw1383 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:04] PROBLEM - Ensure local MW versions match expected deployment on mw1380 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:04] PROBLEM - Ensure local MW versions match expected deployment on mw2362 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:06] PROBLEM - Ensure local MW versions match expected deployment on mw2216 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:06] PROBLEM - Ensure local MW versions match expected deployment on mw2274 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:08] is somebody deploying ? 
[18:52:08] PROBLEM - Ensure local MW versions match expected deployment on mw2208 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:08] PROBLEM - Ensure local MW versions match expected deployment on mw2190 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:08] PROBLEM - Ensure local MW versions match expected deployment on mw1300 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:14] PROBLEM - Ensure local MW versions match expected deployment on labweb1002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:16] PROBLEM - Ensure local MW versions match expected deployment on mw2194 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:22] PROBLEM - Ensure local MW versions match expected deployment on mw1397 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:22] PROBLEM - Ensure local MW versions match expected deployment on mw1325 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:22] PROBLEM - Ensure local MW versions match expected deployment on mw2359 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:22] PROBLEM - Ensure local MW versions match expected deployment on scandium is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:24] PROBLEM - Ensure local MW versions match expected deployment on mw2295 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:24] PROBLEM - Ensure local MW versions match expected deployment on wtp2020 is CRITICAL: CRITICAL: 3 mismatched 
wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:24] PROBLEM - Ensure local MW versions match expected deployment on mw2209 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:26] PROBLEM - Ensure local MW versions match expected deployment on snapshot1007 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:26] PROBLEM - Ensure local MW versions match expected deployment on mw2310 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:26] PROBLEM - Ensure local MW versions match expected deployment on wtp2008 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:34] PROBLEM - Ensure local MW versions match expected deployment on mw1387 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:36] PROBLEM - Ensure local MW versions match expected deployment on snapshot1005 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:36] PROBLEM - Ensure local MW versions match expected deployment on wtp1036 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:42] PROBLEM - Ensure local MW versions match expected deployment on mw1305 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:42] PROBLEM - Ensure local MW versions match expected deployment on mw1302 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:44] PROBLEM - Ensure local MW versions match expected deployment on wtp1039 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:44] PROBLEM - Ensure local MW versions 
match expected deployment on mw1269 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:44] PROBLEM - Ensure local MW versions match expected deployment on mw1297 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:44] PROBLEM - Ensure local MW versions match expected deployment on wtp1027 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:44] PROBLEM - Ensure local MW versions match expected deployment on mw2319 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:44] PROBLEM - Ensure local MW versions match expected deployment on mw2369 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:44] PROBLEM - Ensure local MW versions match expected deployment on mw2241 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:52:45] PROBLEM - Ensure local MW versions match expected deployment on mw2221 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:00] PROBLEM - Ensure local MW versions match expected deployment on mw2300 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:00] PROBLEM - Ensure local MW versions match expected deployment on mw2193 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:02] PROBLEM - Ensure local MW versions match expected deployment on mw1385 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:02] PROBLEM - Ensure local MW versions match expected deployment on mw1275 is CRITICAL: CRITICAL: 3 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [18:53:04] PROBLEM - Ensure local MW versions match expected deployment on labweb1001 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:05] o_O [18:53:06] brennen: hi?^ [18:53:08] PROBLEM - Ensure local MW versions match expected deployment on wtp1040 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:08] PROBLEM - Ensure local MW versions match expected deployment on mw1354 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:08] PROBLEM - Ensure local MW versions match expected deployment on mw1338 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:08] PROBLEM - Ensure local MW versions match expected deployment on mw1289 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:09] PROBLEM - Ensure local MW versions match expected deployment on mw1267 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:09] PROBLEM - Ensure local MW versions match expected deployment on mw2263 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:09] PROBLEM - Ensure local MW versions match expected deployment on mw2283 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:10] PROBLEM - Ensure local MW versions match expected deployment on mw2260 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:10] PROBLEM - Ensure local MW versions match expected deployment on mw2140 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:12] PROBLEM - Ensure 
local MW versions match expected deployment on mw1327 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:12] PROBLEM - Ensure local MW versions match expected deployment on mw1311 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:12] PROBLEM - Ensure local MW versions match expected deployment on mw2331 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:12] PROBLEM - Ensure local MW versions match expected deployment on mw2275 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:12] brennen: scapping? [18:53:20] PROBLEM - Ensure local MW versions match expected deployment on mw1270 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:23] mutante: yeah [18:53:24] PROBLEM - Ensure local MW versions match expected deployment on mw2266 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:24] PROBLEM - Ensure local MW versions match expected deployment on mw2138 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:24] PROBLEM - Ensure local MW versions match expected deployment on mw2136 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:28] PROBLEM - Ensure local MW versions match expected deployment on mw1351 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:28] PROBLEM - Ensure local MW versions match expected deployment on mw1321 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:29] PROBLEM - Ensure local MW versions match expected deployment on mw2311 is CRITICAL: 
CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:30] PROBLEM - Ensure local MW versions match expected deployment on mw2249 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:30] PROBLEM - Ensure local MW versions match expected deployment on mw2135 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:30] PROBLEM - Ensure local MW versions match expected deployment on wtp2018 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:34] PROBLEM - Ensure local MW versions match expected deployment on mw1290 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:34] PROBLEM - Ensure local MW versions match expected deployment on wtp2003 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:41] brennen: the alerts are not common though [18:53:42] PROBLEM - Ensure local MW versions match expected deployment on mw2284 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:46] PROBLEM - Ensure local MW versions match expected deployment on mw1399 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:47] yeah, this is not expected [18:53:56] PROBLEM - Ensure local MW versions match expected deployment on mw1393 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:53:56] currently syncing testwikis [18:54:00] PROBLEM - Ensure local MW versions match expected deployment on mw1375 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:54:14] PROBLEM - Ensure local MW versions match expected deployment on mw1274 is 
CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:54:20] PROBLEM - Ensure local MW versions match expected deployment on mw1318 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:54:20] PROBLEM - Ensure local MW versions match expected deployment on mw1347 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:54:20] PROBLEM - Ensure local MW versions match expected deployment on mw2330 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:54:24] PROBLEM - Ensure local MW versions match expected deployment on mw1395 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:54:32] PROBLEM - Ensure local MW versions match expected deployment on mw1273 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:54:34] PROBLEM - Ensure local MW versions match expected deployment on mw1362 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:54:34] PROBLEM - Ensure local MW versions match expected deployment on mw1316 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:54:56] PROBLEM - Ensure local MW versions match expected deployment on mw1349 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:54:56] PROBLEM - Ensure local MW versions match expected deployment on mw2372 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:55:17] !log upload pdns-recursor_4.3.3-1~deb10u1 to apt.wm.o (buster) - T252132 [18:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:20] T252132: Deploy Wikidough: 
Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 [18:55:26] PROBLEM - Ensure local MW versions match expected deployment on mw1296 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:55:26] PROBLEM - Ensure local MW versions match expected deployment on mwdebug1002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:55:28] PROBLEM - Ensure local MW versions match expected deployment on mw1322 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:55:29] PROBLEM - Ensure local MW versions match expected deployment on mw2247 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:55:32] PROBLEM - Ensure local MW versions match expected deployment on snapshot1008 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:55:42] PROBLEM - Ensure local MW versions match expected deployment on mw1378 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:55:50] PROBLEM - Ensure local MW versions match expected deployment on mw1400 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:55:54] PROBLEM - Ensure local MW versions match expected deployment on mw2292 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:55:54] PROBLEM - Ensure local MW versions match expected deployment on mw2273 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:55:59] PROBLEM - Ensure local MW versions match expected deployment on mw1328 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:55:59] PROBLEM - 
Ensure local MW versions match expected deployment on mw1368 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:56:04] PROBLEM - Ensure local MW versions match expected deployment on mw2253 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:56:10] PROBLEM - Ensure local MW versions match expected deployment on mw1312 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:56:10] PROBLEM - Ensure local MW versions match expected deployment on wtp1035 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [18:56:20] i don't quite know the basis for that test - comparing deployment wikiversions.json? [18:56:28] RECOVERY - Ensure local MW versions match expected deployment on mw1343 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:30] RECOVERY - Ensure local MW versions match expected deployment on mw2187 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:34] RECOVERY - Ensure local MW versions match expected deployment on mw2350 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:34] RECOVERY - Ensure local MW versions match expected deployment on mw2333 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:34] RECOVERY - Ensure local MW versions match expected deployment on mw2371 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:34] RECOVERY - Ensure local MW versions match expected deployment on mw2373 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:34] RECOVERY - Ensure local MW versions match expected deployment on mw2280 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [18:56:40] ok, i JUST downtime it for 2 hours, lol [18:56:44] RECOVERY - Ensure local MW versions match expected deployment on mwdebug2001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:44] RECOVERY - Ensure local MW versions match expected deployment on mw2199 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:56:46] (PS3) Herron: kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002 [puppet] - https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) [18:56:47] so we could talk about it without the noise [18:56:52] mutante: thx [18:57:16] RECOVERY - Ensure local MW versions match expected deployment on mw1380 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:24] so.. the check command is: nrpe_check!check_mw_wikiversion_difference!20 [18:57:26] RECOVERY - Ensure local MW versions match expected deployment on labweb1002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:28] RECOVERY - Ensure local MW versions match expected deployment on mw2194 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:34] RECOVERY - Ensure local MW versions match expected deployment on mw1397 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:34] RECOVERY - Ensure local MW versions match expected deployment on mw1325 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:34] RECOVERY - Ensure local MW versions match expected deployment on mw2359 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:36] RECOVERY - Ensure local MW versions match expected deployment on scandium is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [18:57:38] RECOVERY - Ensure local MW versions match expected deployment on mw2295 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:39] RECOVERY - Ensure local MW versions match expected deployment on wtp2020 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:39] RECOVERY - Ensure local MW versions match expected deployment on snapshot1007 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:42] RECOVERY - Ensure local MW versions match expected deployment on wtp2008 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:49] RECOVERY - Ensure local MW versions match expected deployment on mw1387 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:49] RECOVERY - Ensure local MW versions match expected deployment on wtp1036 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:57:49] the NRPE part tells us it is being executed on the servers themselves [18:57:58] RECOVERY - Ensure local MW versions match expected deployment on mw1305 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:00] RECOVERY - Ensure local MW versions match expected deployment on wtp1039 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:00] RECOVERY - Ensure local MW versions match expected deployment on mw1269 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:02] RECOVERY - Ensure local MW versions match expected deployment on mw2319 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:02] RECOVERY - Ensure local MW versions match expected deployment on mw2369 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [18:58:02] RECOVERY - Ensure local MW versions match expected deployment on mw2241 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:20] RECOVERY - Ensure local MW versions match expected deployment on mw2300 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:20] RECOVERY - Ensure local MW versions match expected deployment on mw2193 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:26] RECOVERY - Ensure local MW versions match expected deployment on labweb1001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:30] RECOVERY - Ensure local MW versions match expected deployment on mw1289 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:30] RECOVERY - Ensure local MW versions match expected deployment on wtp1040 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:30] RECOVERY - Ensure local MW versions match expected deployment on mw1267 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:30] RECOVERY - Ensure local MW versions match expected deployment on mw2283 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:30] RECOVERY - Ensure local MW versions match expected deployment on mw2140 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:34] RECOVERY - Ensure local MW versions match expected deployment on mw1327 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:34] RECOVERY - Ensure local MW versions match expected deployment on mw1311 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:34] RECOVERY - Ensure local 
MW versions match expected deployment on mw2275 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:42] RECOVERY - Ensure local MW versions match expected deployment on mw1270 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:48] RECOVERY - Ensure local MW versions match expected deployment on mw2138 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:52] RECOVERY - Ensure local MW versions match expected deployment on mw1321 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:54] RECOVERY - Ensure local MW versions match expected deployment on mw2249 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:54] RECOVERY - Ensure local MW versions match expected deployment on wtp2018 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:54] RECOVERY - Ensure local MW versions match expected deployment on mw2135 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:58:58] RECOVERY - Ensure local MW versions match expected deployment on wtp2003 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:04] RECOVERY - Ensure local MW versions match expected deployment on mw2284 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:09] RECOVERY - Ensure local MW versions match expected deployment on mw1399 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:18] RECOVERY - Ensure local MW versions match expected deployment on mw1393 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:25] brennen: on each appserver, it runs: /usr/local/lib/nagios/plugins/check_mw_versions --deployhost 
deploy1001.eqiad.wmnet [18:59:34] RECOVERY - Ensure local MW versions match expected deployment on mw1274 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:34] so it compares to deployment server [18:59:39] RECOVERY - Ensure local MW versions match expected deployment on mw1347 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:40] RECOVERY - Ensure local MW versions match expected deployment on mw1318 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:40] RECOVERY - Ensure local MW versions match expected deployment on mw2330 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:44] RECOVERY - Ensure local MW versions match expected deployment on mw1395 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:49] RECOVERY - Ensure local MW versions match expected deployment on mw1273 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [18:59:52] RECOVERY - Ensure local MW versions match expected deployment on mw1316 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:00:04] brennen and dancy: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200804T1900). 
[19:00:16] RECOVERY - Ensure local MW versions match expected deployment on mw2372 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:00:28] the source of truth is /mediawiki/mediawiki/wikiversions.json on the deploy server [19:00:42] then it compares that to local file /srv/mediawiki/wikiversions.json [19:00:44] RECOVERY - Ensure local MW versions match expected deployment on mw2247 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:00:48] RECOVERY - Ensure local MW versions match expected deployment on mwdebug1002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:00:48] RECOVERY - Ensure local MW versions match expected deployment on mw1296 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:00:50] RECOVERY - Ensure local MW versions match expected deployment on mw1322 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:00:56] RECOVERY - Ensure local MW versions match expected deployment on snapshot1008 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:06] RECOVERY - Ensure local MW versions match expected deployment on mw1378 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:14] RECOVERY - Ensure local MW versions match expected deployment on mw1400 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:16] RECOVERY - Ensure local MW versions match expected deployment on mw2292 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:18] RECOVERY - Ensure local MW versions match expected deployment on mw2273 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:24] RECOVERY - Ensure local MW versions match expected deployment on mw1368 
is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:24] RECOVERY - Ensure local MW versions match expected deployment on mw2253 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:24] RECOVERY - Ensure local MW versions match expected deployment on mw1328 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:34] RECOVERY - Ensure local MW versions match expected deployment on wtp1035 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:01:49] RECOVERY - Ensure local MW versions match expected deployment on mw1409 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:02:16] RECOVERY - Ensure local MW versions match expected deployment on mw1363 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:02:16] syncing takes a long time these days, partly due to the infamous T223287. i wonder if the lag has something to do with it. 
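The comparison mutante describes at 19:00:28 and 19:00:42 (the deploy server's wikiversions.json as source of truth, checked against the local /srv/mediawiki/wikiversions.json) can be sketched roughly as below. This is not the real check_mw_versions plugin: the function names are hypothetical, and the sketch assumes each file is a flat JSON object mapping a wiki name to a version string. Only the two status strings are taken from the alerts in this log.

```python
import json


def count_mismatches(expected: dict, local: dict) -> int:
    """Count wikis whose version differs between the two mappings.

    A wiki that is present in only one of the mappings counts as a mismatch.
    """
    wikis = set(expected) | set(local)
    return sum(1 for wiki in wikis if expected.get(wiki) != local.get(wiki))


def check(expected_json: str, local_json: str) -> str:
    """Return a status line in the style of the alerts seen in the channel."""
    mismatched = count_mismatches(json.loads(expected_json), json.loads(local_json))
    if mismatched:
        return f"CRITICAL: {mismatched} mismatched wikiversions"
    return "OKAY: wikiversions in sync"


# Illustrative data only: which wikis were on which version is an assumption.
expected = '{"testwiki": "php-1.36.0-wmf.3", "test2wiki": "php-1.36.0-wmf.3", "enwiki": "php-1.36.0-wmf.2"}'
stale = '{"testwiki": "php-1.36.0-wmf.2", "test2wiki": "php-1.36.0-wmf.2", "enwiki": "php-1.36.0-wmf.2"}'

print(check(expected, stale))
print(check(expected, expected))
```

Under these assumptions, a host whose local file lags the deploy server reports CRITICAL until the sync reaches it, which would be consistent with hosts recovering one by one over several minutes.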
[19:02:17] T223287: Investigate scap cluster_ssh idling until pressing ENTER repeatedly - https://phabricator.wikimedia.org/T223287 [19:02:44] RECOVERY - Ensure local MW versions match expected deployment on mw1383 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:02:44] RECOVERY - Ensure local MW versions match expected deployment on mw1300 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:02:44] RECOVERY - Ensure local MW versions match expected deployment on mw2362 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:02:44] RECOVERY - Ensure local MW versions match expected deployment on mw2216 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:02:44] RECOVERY - Ensure local MW versions match expected deployment on mw2274 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:02:46] RECOVERY - Ensure local MW versions match expected deployment on mw2208 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:02:46] RECOVERY - Ensure local MW versions match expected deployment on mw2190 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:06] RECOVERY - Ensure local MW versions match expected deployment on mw2209 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:08] RECOVERY - Ensure local MW versions match expected deployment on mw2310 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:16] RECOVERY - Ensure local MW versions match expected deployment on snapshot1005 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:24] RECOVERY - Ensure local MW versions match expected deployment on mw1302 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [19:03:28] RECOVERY - Ensure local MW versions match expected deployment on wtp1027 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:28] RECOVERY - Ensure local MW versions match expected deployment on mw1297 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:28] RECOVERY - Ensure local MW versions match expected deployment on mw2221 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:50] RECOVERY - Ensure local MW versions match expected deployment on mw1385 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:50] RECOVERY - Ensure local MW versions match expected deployment on mw1275 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:52] !log current 1.36.0-wmf.3 train status (T257971): mid scap-cdb-rebuild for testwiki sync; will proceed with group0 when finished. 
[19:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:55] T257971: 1.36.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T257971 [19:03:58] RECOVERY - Ensure local MW versions match expected deployment on mw1354 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:58] RECOVERY - Ensure local MW versions match expected deployment on mw1338 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:59] RECOVERY - Ensure local MW versions match expected deployment on mw2263 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:03:59] RECOVERY - Ensure local MW versions match expected deployment on mw2260 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:04:02] RECOVERY - Ensure local MW versions match expected deployment on mw2331 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:04:19] RECOVERY - Ensure local MW versions match expected deployment on mw2266 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:04:19] RECOVERY - Ensure local MW versions match expected deployment on mw2136 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:04:24] RECOVERY - Ensure local MW versions match expected deployment on mw1351 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:04:26] RECOVERY - Ensure local MW versions match expected deployment on mw2311 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:04:32] RECOVERY - Ensure local MW versions match expected deployment on mw1290 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:04:56] RECOVERY - Ensure local MW versions match expected deployment on mw1375 is 
OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:05:18] brennen: i am reading the script and there is an interesting part in it. it tries to detect if prod wikiversions changed recently. then it assumes a recent deploy and does NOT alert [19:05:24] RECOVERY - Ensure local MW versions match expected deployment on mw1362 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:05:32] so this time that attempt to detect it must have failed [19:05:46] RECOVERY - Ensure local MW versions match expected deployment on mw1349 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:05:50] if datetime.datetime.now() < (last_modified + datetime.timedelta(minutes=args.deploytime)): [19:05:53] sys.stderr.write("Production wikiversions changed recently - assuming a recent deploy.") [19:05:56] sys.stderr.write("Not alerting even if we see discrepancies.\n") [19:06:21] (PS1) Ssingh: dnsrecursor: allow installation of pdns-recursor from component [puppet] - https://gerrit.wikimedia.org/r/618376 [19:06:32] i _think_ this has happened before, maybe not as many alerts. [19:06:50] on thinking about it. [19:07:10] RECOVERY - Ensure local MW versions match expected deployment on mw1312 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:07:26] yea.. so .. eh "if datetime.datetime.now() < (last_modified + datetime.timedelta(minutes=args.deploytime)):" [19:07:53] ... 
no_alert = True [19:09:44] (PS2) Ssingh: dnsrecursor: allow installation of pdns-recursor from component [puppet] - https://gerrit.wikimedia.org/r/618376 [19:11:11] !log brennen@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.3 (duration: 91m 03s) [19:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:56] (PS1) Brennen Bearnes: group0 wikis to 1.36.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/618377 [19:12:58] (CR) Brennen Bearnes: [C: +2] group0 wikis to 1.36.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/618377 (owner: Brennen Bearnes) [19:13:00] Operations, VPS-Projects, Wikimedia-Mailing-lists, User-Ladsgroup, and 2 others: Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28 - https://phabricator.wikimedia.org/T259444 (Ladsgroup) Thanks! [19:13:46] (Merged) jenkins-bot: group0 wikis to 1.36.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/618377 (owner: Brennen Bearnes) [19:17:07] (PS3) Ssingh: dnsrecursor: allow installation of pdns-recursor from component [puppet] - https://gerrit.wikimedia.org/r/618376 [19:18:54] (CR) Ssingh: "Confirming with PCC: https://puppet-compiler.wmflabs.org/compiler1001/24309/ that no existing installations of dnsrecursor are affected by" [puppet] - https://gerrit.wikimedia.org/r/618376 (owner: Ssingh) [19:19:22] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.3 [19:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:19] (PS1) Dzahn: ATS: switch ORES to TLS to backends [puppet] - https://gerrit.wikimedia.org/r/618379 (https://phabricator.wikimedia.org/T210411) [19:44:26] (PS1) Ebernhardson: Add search-loader dsh group [puppet] - https://gerrit.wikimedia.org/r/618382 [19:44:40] (CR) jerkins-bot: [V: -1] Add search-loader dsh group [puppet] - 
https://gerrit.wikimedia.org/r/618382 (owner: Ebernhardson) [19:45:26] (PS2) Elukey: Add search-loader dsh group [puppet] - https://gerrit.wikimedia.org/r/618382 (owner: Ebernhardson) [19:45:30] (PS3) Ebernhardson: Add search-loader dsh group [puppet] - https://gerrit.wikimedia.org/r/618382 (https://phabricator.wikimedia.org/T258245) [19:45:48] (CR) jerkins-bot: [V: -1] Add search-loader dsh group [puppet] - https://gerrit.wikimedia.org/r/618382 (https://phabricator.wikimedia.org/T258245) (owner: Ebernhardson) [19:48:12] (PS4) Ebernhardson: Add search-loader dsh group [puppet] - https://gerrit.wikimedia.org/r/618382 (https://phabricator.wikimedia.org/T258245) [19:48:30] Operations, Wikimedia-Mailing-lists, User-Ladsgroup: Setup Mailman3 in Cloud VPS - https://phabricator.wikimedia.org/T258365 (bd808) [19:48:37] (CR) jerkins-bot: [V: -1] Add search-loader dsh group [puppet] - https://gerrit.wikimedia.org/r/618382 (https://phabricator.wikimedia.org/T258245) (owner: Ebernhardson) [19:48:39] Operations, VPS-Projects, Wikimedia-Mailing-lists, User-Ladsgroup, and 2 others: Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28 - https://phabricator.wikimedia.org/T259444 (bd808) Resolved→Open `lang=irc [19:45] < Amir1> bd808: sorry but one thing, gmail gave... 
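The recent-deploy guard that mutante quoted from the check script (19:05:50 through 19:07:53 above) can be isolated into a small function to show when it stops suppressing alerts. This is only a sketch built around the one condition quoted in the channel: the function name and the `now` parameter are hypothetical, added so the behaviour can be tried without touching the clock, and the 20-minute window and the timestamps are assumptions chosen for illustration, not values confirmed by the log.

```python
import datetime


def recently_deployed(last_modified, deploytime_minutes, now=None):
    """Return True while `now` is still inside the grace window.

    Mirrors the quoted condition:
        datetime.datetime.now() < last_modified + timedelta(minutes=deploytime)
    """
    if now is None:
        now = datetime.datetime.now()
    window = datetime.timedelta(minutes=deploytime_minutes)
    return now < last_modified + window


# Hypothetical times: the file changed at 18:30 and the window is 20 minutes.
changed = datetime.datetime(2020, 8, 4, 18, 30)
inside = datetime.datetime(2020, 8, 4, 18, 45)
outside = datetime.datetime(2020, 8, 4, 18, 53)

print(recently_deployed(changed, 20, now=inside))   # still within the window
print(recently_deployed(changed, 20, now=outside))  # window has expired
```

Under these assumptions the guard only helps while the sync finishes inside the window. A sync that runs longer than the window leaves hosts with stale files after suppression has ended. Whether that is what happened during this deploy is not established in the log.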
[19:49:15] (PS5) Ebernhardson: Add search-loader dsh group [puppet] - https://gerrit.wikimedia.org/r/618382 (https://phabricator.wikimedia.org/T258245) [19:49:22] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:52:34] (CR) Dzahn: [C: -2] "connect to ores.discovery.wmnet port 443: Connection refused" [puppet] - https://gerrit.wikimedia.org/r/618379 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [19:54:33] (CR) Ebernhardson: [C: -1] "Might not be necessary, testing locally defining the dsh group in the search/mjolnir/deploy repo" [puppet] - https://gerrit.wikimedia.org/r/618382 (https://phabricator.wikimedia.org/T258245) (owner: Ebernhardson) [19:59:19] (CR) Dzahn: [C: +1] "per mail thread "[Wikitech-l] Replacement for Helm chart repository"" [puppet] - https://gerrit.wikimedia.org/r/618352 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm) [20:08:18] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:12:10] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:12:57] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@b17bfd4]: Move mjolnir daemons from cirrus hosts to dedicated instances [20:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:05] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@b17bfd4]: Move mjolnir daemons from cirrus hosts to dedicated instances (duration: 02m 07s) [20:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:45] (CR) CRusnov: [C: +1] "LGTM!" [dns] - https://gerrit.wikimedia.org/r/617509 (https://phabricator.wikimedia.org/T233183) (owner: Volans) [20:20:12] (PS2) Dzahn: admins: set http_proxy for myself, dzahn [puppet] - https://gerrit.wikimedia.org/r/617512 [20:21:08] (CR) Dzahn: "run 'facter domain' to determine if in eqiad or codfw. then set http_proxy based on it." 
[puppet] - https://gerrit.wikimedia.org/r/617512 (owner: Dzahn) [20:21:12] (CR) Dzahn: [C: +2] admins: set http_proxy for myself, dzahn [puppet] - https://gerrit.wikimedia.org/r/617512 (owner: Dzahn) [20:25:01] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@c80e2e7]: use provided ca certs for elasticsearch [20:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:24] (PS1) Cwhite: profile: disable statsd_exporter relay for ores [puppet] - https://gerrit.wikimedia.org/r/618388 (https://phabricator.wikimedia.org/T205870) [20:26:37] (Abandoned) Dzahn: phabricator: add envoy TLS terminator for aphlict (DO NOT MERGE) [puppet] - https://gerrit.wikimedia.org/r/603895 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn) [20:27:24] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@c80e2e7]: use provided ca certs for elasticsearch (duration: 02m 22s) [20:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:44] (Abandoned) Dzahn: ATS/phabricator: directly talk wss:// to aphlict [puppet] - https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn) [20:27:59] (PS5) Dzahn: ATS: add new backend for phabricator aphlict [puppet] - https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) [20:34:21] (PS1) BryanDavis: wmcs: alphabetize labstore NFS mounts [puppet] - https://gerrit.wikimedia.org/r/618389 [20:34:23] (PS1) BryanDavis: wmcs: Add project NFS for wmde-templates-alpha [puppet] - https://gerrit.wikimedia.org/r/618390 (https://phabricator.wikimedia.org/T259254) [20:35:49] (PS1) Ebernhardson: mjolnir: Provide primary cirrus cluster url to msearch daemon [puppet] - https://gerrit.wikimedia.org/r/618391 (https://phabricator.wikimedia.org/T258245) [20:36:08] (CR) BryanDavis: "The real change I was after is in the child patch at 
Ib21cfc573ac8761b47e8673e86676429dd007392" [puppet] - https://gerrit.wikimedia.org/r/618389 (owner: BryanDavis) [20:36:35] (PS1) Dzahn: ssl: update aphlict TLS cert, add phabricator to SANs [puppet] - https://gerrit.wikimedia.org/r/618392 (https://phabricator.wikimedia.org/T238593) [20:37:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:38:14] (PS2) Dzahn: ssl: update aphlict TLS cert, add phabricator to SANs [puppet] - https://gerrit.wikimedia.org/r/618392 (https://phabricator.wikimedia.org/T238593) [20:41:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:41:49] (CR) Elukey: [C: +2] mjolnir: Provide primary cirrus cluster url to msearch daemon [puppet] - https://gerrit.wikimedia.org/r/618391 (https://phabricator.wikimedia.org/T258245) (owner: Ebernhardson) [20:42:01] (CR) Elukey: mjolnir: Provide primary cirrus cluster url to msearch daemon [puppet] - https://gerrit.wikimedia.org/r/618391 (https://phabricator.wikimedia.org/T258245) (owner: Ebernhardson) [20:42:40] (Abandoned) Ebernhardson: Add search-loader dsh group [puppet] - https://gerrit.wikimedia.org/r/618382 (https://phabricator.wikimedia.org/T258245) (owner: Ebernhardson) [20:43:10] (PS2) Ebernhardson: mjolnir: Provide primary cirrus cluster url to msearch daemon [puppet] - https://gerrit.wikimedia.org/r/618391 (https://phabricator.wikimedia.org/T258245) [20:44:50] (CR) Elukey: [C: +2] mjolnir: Provide 
primary cirrus cluster url to msearch daemon [puppet] - 10https://gerrit.wikimedia.org/r/618391 (https://phabricator.wikimedia.org/T258245) (owner: 10Ebernhardson) [20:46:29] (03PS2) 10BryanDavis: wmcs: alphabetize labstore NFS mounts [puppet] - 10https://gerrit.wikimedia.org/r/618389 [20:46:49] (03PS2) 10BryanDavis: wmcs: Add project NFS for wmde-templates-alpha [puppet] - 10https://gerrit.wikimedia.org/r/618390 (https://phabricator.wikimedia.org/T259254) [20:46:56] (03PS1) 10Mholloway: Update mobileapps to 2020-08-04-201901-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618393 [20:47:34] (03PS3) 10BryanDavis: wmcs: Add project NFS for wmde-templates-alpha [puppet] - 10https://gerrit.wikimedia.org/r/618390 (https://phabricator.wikimedia.org/T259254) [20:49:05] (03CR) 10Mholloway: [C: 03+2] Update mobileapps to 2020-08-04-201901-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618393 (owner: 10Mholloway) [20:50:18] (03Merged) 10jenkins-bot: Update mobileapps to 2020-08-04-201901-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618393 (owner: 10Mholloway) [20:52:56] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[20:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:42] (03CR) 10Dzahn: [C: 03+2] ssl: update aphlict TLS cert, add phabricator to SANs [puppet] - 10https://gerrit.wikimedia.org/r/618392 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [20:54:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:54:55] (03CR) 10Dzahn: "> Patch Set 4: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [20:55:12] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [20:55:12] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . 
[20:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:33] (03PS1) 10Ottomata: Add eventgate-logging-external streams, and add destination_event_service to all stream configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618394 (https://phabricator.wikimedia.org/T251935) [20:58:07] (03CR) 10jerkins-bot: [V: 04-1] Add eventgate-logging-external streams, and add destination_event_service to all stream configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618394 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [21:02:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:03:42] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [21:03:42] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [21:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:28] (03PS1) 10Ottomata: eventgate-logging-external - Use MW EventStreamConfig API to get static stream configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/618395 (https://phabricator.wikimedia.org/T251935) [21:06:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:06:42] (03PS2) 10Ottomata: Add eventgate-logging-external streams, and add destination_event_service to all stream configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618394 (https://phabricator.wikimedia.org/T251935) [21:07:36] (03CR) 10jerkins-bot: [V: 04-1] Add eventgate-logging-external streams, and add destination_event_service to all stream configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618394 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [21:08:19] (03PS3) 10Ottomata: Add eventgate-logging-external streams, and add destination_event_service to all stream configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618394 (https://phabricator.wikimedia.org/T251935) [21:09:18] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:13:54] PROBLEM - kubelet operational latencies on kubernetes1011 is CRITICAL: instance=kubernetes1011.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:14:10] PROBLEM - kubelet operational latencies on kubernetes1009 is CRITICAL: instance=kubernetes1009.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:15:56] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:16:18] PROBLEM - kubelet operational latencies on kubernetes2014 is CRITICAL: instance=kubernetes2014.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:17:44] RECOVERY - kubelet operational latencies on kubernetes1011 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:17:48] PROBLEM - kubelet operational latencies on kubernetes2010 is CRITICAL: instance=kubernetes2010.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:18:16] RECOVERY - kubelet operational latencies on kubernetes2014 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:19:56] RECOVERY - kubelet operational latencies on kubernetes1009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:21:38] RECOVERY - kubelet operational latencies on kubernetes2010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:21:42] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:26:13] that opcache free space alert on mwdebug1002 is interesting [21:27:11] (03PS1) 10Mholloway: Update proton to 2020-07-30-193337-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618398 [21:29:00] (03CR) 10Mholloway: [C: 03+2] Update proton to 2020-07-30-193337-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618398 (owner: 10Mholloway) [21:30:05] (03Merged) 10jenkins-bot: Update proton to 2020-07-30-193337-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/618398 (owner: 10Mholloway) [21:34:29] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [21:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:51] (03CR) 10Ppchelko: [C: 04-1] Add eventgate-logging-external streams, and add destination_event_service to all stream configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618394 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [21:38:39] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [21:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:52] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . 
[21:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:59:53] 10Operations, 10serviceops: httpbb: Mapping between tests and hosts - https://phabricator.wikimedia.org/T259665 (10RLazarus) p:05Triage→03Medium [22:07:50] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [22:08:47] 10Puppet, 10SRE-tools, 10Python3-Porting, 10User-MoritzMuehlenhoff, and 2 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10crusnov) update on this project: I have a repeatable automated survey thus: ` git grep '#!.*python' |grep -v python3... [22:13:20] i'm guessing these are opcache corruption? https://logstash.wikimedia.org/goto/d6d027fbb2ee766b796f1a57d71ac3a6 [22:14:06] Class 'Wikimedi`\ParamValidator\TypeDef\IntegerDef' not found [22:17:57] Wikimedi`? 
[22:19:07] that is one bit off, I hate it [22:21:57] mw1404 [22:25:14] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 49.83 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [22:26:24] (03CR) 10Dzahn: [C: 03+1] "> This change confuses me, mostly because of the gid 903 in data.yaml." [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [22:27:08] (03CR) 10Dzahn: [C: 03+1] "also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/607853 which has 2 x +1 already" [puppet] - 10https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [22:29:55] rzl: ok if i go ahead and restart php on that box? [22:32:23] (or more generally: is there any reason not to do that in obvious cases...) [22:39:18] as long as you're using the same script that scap runs it should be safe as far as I understand it [22:39:30] i.e., it should depool and repool automagically [22:41:44] !log restarting php7.2-fpm on mw1404 for opcache issues [22:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:26] which looks like /usr/local/sbin/restart-php7.2-fpm [22:43:41] yep [22:43:48] did the trick. [22:43:55] ack [22:43:58] thx. [22:44:44] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:52:23] 10Operations, 10Fundraising-Backlog: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10thcipriani) Wiki creation is something for which I usually have to call on @Reedy's expertise as the keeper of the secrets of Wiki creation. [22:54:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:56:03] (03PS4) 10Dzahn: rsync::quickdatacopy: add optional parameter to let rsync --delete files [puppet] - 10https://gerrit.wikimedia.org/r/610389 (https://phabricator.wikimedia.org/T247652) [22:59:23] (03CR) 10Ppchelko: [C: 03+1] eventgate-logging-external - Use MW EventStreamConfig API to get static stream configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/618395 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [23:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200804T2300). [23:05:43] 10Operations, 10Fundraising-Backlog, 10User-Urbanecm, 10Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10thcipriani) [23:06:00] RoanKattouw Niharika Urbanecm wikitech shows nothing scheduled. Can I add T259633? 
[23:06:01] T259633: Add import sources for lijwikisource - https://phabricator.wikimedia.org/T259633 [23:06:33] (03PS1) 10DannyS712: Add import sources for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618303 (https://phabricator.wikimedia.org/T259633) [23:08:43] noticing a lot of memory exhaustion errors for parsoid on srwikitionary. anybody know what's up with that? [23:11:37] (03PS2) 10DannyS712: Add import sources for lijwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618303 (https://phabricator.wikimedia.org/T259633) [23:12:42] 10Operations, 10Fundraising-Backlog, 10User-Urbanecm, 10Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Ladsgroup) Hello, can you put a similar form like {T259432} on the description of the ticket? Then the bot w... [23:13:25] patch is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/618303 [23:13:44] (03CR) 10DannyS712: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/618303 (https://phabricator.wikimedia.org/T259633) (owner: 10DannyS712) [23:16:58] (03CR) 10Cwhite: [C: 03+1] kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) (owner: 10Herron) [23:17:36] (03CR) 10Cwhite: [C: 03+1] alerting_host: assign alert[12]001 role::alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/618345 (https://phabricator.wikimedia.org/T247966) (owner: 10Herron) [23:17:38] (03PS4) 10CRusnov: rotatedump: Enhance to retain period copies [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/562408 (https://phabricator.wikimedia.org/T231512) [23:17:46] (03CR) 10jerkins-bot: [V: 04-1] rotatedump: Enhance to retain period copies [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/562408 (https://phabricator.wikimedia.org/T231512) (owner: 10CRusnov) [23:19:03] (03PS5) 10CRusnov: rotatedump: Enhance to retain period copies [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/562408 (https://phabricator.wikimedia.org/T231512) [23:19:33] (03CR) 10Cwhite: [C: 03+1] alertmanager: add IRC notifier [puppet] - 10https://gerrit.wikimedia.org/r/617688 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [23:31:39] (03PS5) 10Dzahn: rsync::quickdatacopy: add optional parameter to let rsync --delete files [puppet] - 10https://gerrit.wikimedia.org/r/610389 (https://phabricator.wikimedia.org/T247652) [23:36:55] brennen: i just saw a ping from you, but my crappy irc client won't seem to tell me what channel it was in [23:37:13] was it regarding the memory exhaustion on srwikitionary that you mentioned above? [23:37:25] yes. :-) [23:37:39] hey cscott - yeah, think Reedy pm'd you. [23:38:31] dancy just happened to notice the exceptions-and-fatals alert above and we glanced at grafana. 
seems unusual, but isn't trickling out into any of the log dashboards we track most closely after a deploy. [23:39:31] srwiktionary isn't in group0 is it? [23:39:36] nope. [23:39:39] aka this is running wmf.2, whatever it is [23:39:55] just seemed an unusual volume. [23:40:47] wow this channel is noisy! could you help point me to the alert and/or grafana dashboard in question? [23:42:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:42:35] ^^ [23:42:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:43:41] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/24313/" [puppet] - 10https://gerrit.wikimedia.org/r/610389 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [23:43:59] looks to have dropped off just now. bot traffic, maybe. [23:44:04] brennen, dancy: seems to have been elevated for the past hour, and now down [23:46:13] a slight increase in cluster load over roughly the same time period. that might point to a bot i guess. [23:47:13] our dashboard seems a bit broken, though, the memory summary panes are "N/A" [23:47:15] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&from=now-3h&to=now [23:47:46] but if you look at the individual machines in the cluster, they all popped up for an hour, just now ending [23:48:15] especially the "memory per host" graph [23:48:28] so a bot re-rendering a bunch of big pages [23:48:35] yeah, makes sense. 
[23:49:12] i poked the rest of the team, and i'll keep the dashboard open to keep an eye on it [23:51:12] cscott: thanks - apologies for noise, but it seemed like the general flapping of the exceptions/fatals alert and filtering OOM stuff out in logstash might have been masking something worth flagging. [23:51:59] no worries [23:56:46] (03CR) 10CRusnov: "> Patch Set 3:" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/562408 (https://phabricator.wikimedia.org/T231512) (owner: 10CRusnov)