[00:00:04] <jouncebot>	 Deploy window No deploys all day! DC Switchover. See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201027T0000)
[00:12:13] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:13:45] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1141 is OK: OK slave_sql_lag Replication lag: 59.87 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:13:45] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1143 is OK: OK slave_sql_lag Replication lag: 59.88 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:13:45] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1121 is OK: OK slave_sql_lag Replication lag: 59.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:13:47] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1138 is OK: OK slave_sql_lag Replication lag: 59.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:13:57] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1148 is OK: OK slave_sql_lag Replication lag: 56.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:13:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:14:01] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on dbstore1004 is OK: OK slave_sql_lag Replication lag: 56.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:14:25] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1144 is OK: OK slave_sql_lag Replication lag: 52.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:14:45] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1149 is OK: OK slave_sql_lag Replication lag: 50.67 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:14:49] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1142 is OK: OK slave_sql_lag Replication lag: 50.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:14:51] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 50.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:14:57] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1081 is OK: OK slave_sql_lag Replication lag: 48.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:19] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1146 is OK: OK slave_sql_lag Replication lag: 42.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:25] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1125 is OK: OK slave_sql_lag Replication lag: 40.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:27] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1147 is OK: OK slave_sql_lag Replication lag: 40.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:23:00] <wikibugs>	 (PS7) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[00:23:47] <wikibugs>	 (CR) jerkins-bot: [V: -1] per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn)
[01:31:45] <wikibugs>	 Operations, ops-eqsin, serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (RobH) For some reason (we found this out a few months ago), Dell Singapore part replacements don't go out with return tags.  They require you to call and sch...
[01:35:31] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1101 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:02:57] <icinga-wm>	 PROBLEM - Check the last execution of package_builder_Clean_up_build_directory on deneb is CRITICAL: CRITICAL: Status of the systemd unit package_builder_Clean_up_build_directory https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:04:09] <icinga-wm>	 PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:07] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.15 [core] (wmf/1.36.0-wmf.15) - https://gerrit.wikimedia.org/r/636549
[02:09:06] <wikibugs>	 (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.15 [core] (wmf/1.36.0-wmf.15) - https://gerrit.wikimedia.org/r/636549 (https://phabricator.wikimedia.org/T263181) (owner: TrainBranchBot)
[02:09:33] <wikibugs>	 (CR) DannyS712: "No deployment this week, though should this be merged anyway? Or just abandoned?" [core] (wmf/1.36.0-wmf.15) - https://gerrit.wikimedia.org/r/636549 (https://phabricator.wikimedia.org/T263181) (owner: TrainBranchBot)
[03:27:51] <icinga-wm>	 PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:27] <icinga-wm>	 PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:03] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:30:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:35] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1097 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:53:13] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1100 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:32:13] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:32:17] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:55:19] <icinga-wm>	 PROBLEM - Check systemd state on dbprov1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:08:54] <wikibugs>	 Operations, SRE-tools: Systemd session creation fails under I/O load - https://phabricator.wikimedia.org/T199911 (Marostegui)
[06:09:03] <wikibugs>	 Operations, Data-Persistence-Backup, SRE-tools: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323 (Marostegui) Declined→Open This has happened again: ` [05:55:19]  <+icinga-wm> PROBLEM - Check systemd state on dbprov1003 is CRITICAL: CRITICAL - degrad...
[06:35:56] <wikibugs>	 (PS1) JMeybohm: Don't send duplicate events from resync to sink [software/heptiolabs/eventrouter] (v0.3-wmf) - https://gerrit.wikimedia.org/r/636553 (https://phabricator.wikimedia.org/T262675)
[06:35:58] <wikibugs>	 (PS1) JMeybohm: Lower label cardinality of prometheus metrics [software/heptiolabs/eventrouter] (v0.3-wmf) - https://gerrit.wikimedia.org/r/636554 (https://phabricator.wikimedia.org/T262675)
[06:37:17] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] Don't send duplicate events from resync to sink [software/heptiolabs/eventrouter] (v0.3-wmf) - https://gerrit.wikimedia.org/r/636553 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[06:37:22] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] Lower label cardinality of prometheus metrics [software/heptiolabs/eventrouter] (v0.3-wmf) - https://gerrit.wikimedia.org/r/636554 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[06:40:01] <wikibugs>	 (PS1) JMeybohm: eventrouter: don't send duplicate events, fix metrics [docker-images/production-images] - https://gerrit.wikimedia.org/r/636555 (https://phabricator.wikimedia.org/T262675)
[06:40:32] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] eventrouter: don't send duplicate events, fix metrics [docker-images/production-images] - https://gerrit.wikimedia.org/r/636555 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[06:42:20] <ryankemper>	 !log T263970 Set number of replicas to 2 (from previous value of 1) for all codfw indices matching `apifeatureusage*`, new shards have been assigned without issue
[06:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:27] <stashbot>	 T263970: ElasticSearch unassigned shard check apifeatureusage-2020.06.30@codfw and enwiki_general_1587198756@eqiad - https://phabricator.wikimedia.org/T263970
[06:43:41] <wikibugs>	 (PS1) JMeybohm: eventrouter: Pump image version and resources [deployment-charts] - https://gerrit.wikimedia.org/r/636556 (https://phabricator.wikimedia.org/T262675)
[06:44:02] <wikibugs>	 (PS2) JMeybohm: eventrouter: Bump image version and resources [deployment-charts] - https://gerrit.wikimedia.org/r/636556 (https://phabricator.wikimedia.org/T262675)
[06:48:21] <wikibugs>	 (CR) JMeybohm: [C: +2] eventrouter: Bump image version and resources [deployment-charts] - https://gerrit.wikimedia.org/r/636556 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[06:50:24] <jayme>	 !log published docker-registry.discovery.wmnet/eventrouter:0.3.0-4
[06:50:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:09] <wikibugs>	 (Merged) jenkins-bot: eventrouter: Bump image version and resources [deployment-charts] - https://gerrit.wikimedia.org/r/636556 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[06:58:37] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' .
[06:58:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:11] <wikibugs>	 (CR) Elukey: geoip: cleanup having moved archiving to launcher (2 comments) [puppet] - https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152) (owner: Razzi)
[07:01:19] <wikibugs>	 (PS1) Marostegui: orchestrator.conf: Change RecoveryPeriodBlockSeconds default [puppet] - https://gerrit.wikimedia.org/r/636558 (https://phabricator.wikimedia.org/T265990)
[07:02:51] <wikibugs>	 (CR) Elukey: "Let's also remove all the profile::tlsproxy::service::* configs in hiera :)" [puppet] - https://gerrit.wikimedia.org/r/636514 (https://phabricator.wikimedia.org/T240439) (owner: Razzi)
[07:06:29] <icinga-wm>	 PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:55] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1097 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:18:25] <wikibugs>	 (PS1) Elukey: base::standard_packages: avoid mcelog with kernels >= 4.12 [puppet] - https://gerrit.wikimedia.org/r/636560
[07:19:30] <wikibugs>	 (PS8) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[07:21:58] <wikibugs>	 (CR) JMeybohm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/634936 (owner: Giuseppe Lavagetto)
[07:22:25] <wikibugs>	 (CR) JMeybohm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/634933 (owner: Giuseppe Lavagetto)
[07:29:22] <wikibugs>	 (Abandoned) Tobias Andersson: Add new slow-bot group for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) (owner: Tobias Andersson)
[07:31:02] <wikibugs>	 Operations, Commons, DBA, Platform Engineering, and 2 others: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (Marostegui) Another spike yesterday on DELETEs {F32415982}  Checking binlogs from 22:17 to 22:22...
[07:35:38] <godog>	 !log swift codfw-prod: bump object weight for ms-be2057 - T261633
[07:35:38] <wikibugs>	 (CR) DCausse: [C: +1] Increase cirrus morelike pool counter by 20% [mediawiki-config] - https://gerrit.wikimedia.org/r/636480 (owner: Ebernhardson)
[07:35:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:44] <stashbot>	 T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633
[07:38:17] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:39:59] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:51:51] <wikibugs>	 (CR) Tobias Andersson: [C: +1] Enable propagatePageDeletion on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/636453 (owner: Lucas Werkmeister (WMDE))
[08:04:28] <wikibugs>	 Operations, vm-requests: Site: 1 VM request for Analytics test cluster - https://phabricator.wikimedia.org/T266064 (elukey) Open→Resolved a:elukey This is done!
[08:08:14] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (Marostegui)
[08:08:39] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (Marostegui) Per my chat with Chris, updating the rack location from A2 to A1 and from C2 to C3
[08:15:48] <godog>	 !log update thanos-fe2002 to thanos 0.16.0 - T261281
[08:15:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:54] <stashbot>	 T261281: Improve performance of Thanos (+ Prometheus) - https://phabricator.wikimedia.org/T261281
[08:18:45] <icinga-wm>	 PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:11] <wikibugs>	 (PS3) Elukey: sre.hadoop.init-hadoop-workers: add more defensive code [cookbooks] - https://gerrit.wikimedia.org/r/636403 (https://phabricator.wikimedia.org/T260411)
[08:21:46] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/636560 (owner: Elukey)
[08:21:52] <wikibugs>	 (CR) Ayounsi: [C: +2] Add Z side device/interface/vlan and cable to PuppetDB importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/634017 (https://phabricator.wikimedia.org/T262899) (owner: Ayounsi)
[08:24:48] <wikibugs>	 (CR) Elukey: [C: +2] sre.hadoop.init-hadoop-workers: add more defensive code [cookbooks] - https://gerrit.wikimedia.org/r/636403 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[08:25:49] <wikibugs>	 (PS1) Muehlenhoff: Fix auto restart on IDP test hosts [puppet] - https://gerrit.wikimedia.org/r/636605
[08:26:33] <wikibugs>	 Operations, Commons, DBA, Platform Engineering, and 2 others: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (Marostegui) Another spike from 08:05 to 08:06 and this is what the binlog shows (number of state...
[08:27:17] <icinga-wm>	 RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[08:27:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:19] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[08:30:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Set expiry headers on thumbnails [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[08:32:09] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[08:32:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:19] <icinga-wm>	 PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:22] <wikibugs>	 (CR) Kormat: [C: +1] orchestrator.conf: Change RecoveryPeriodBlockSeconds default [puppet] - https://gerrit.wikimedia.org/r/636558 (https://phabricator.wikimedia.org/T265990) (owner: Marostegui)
[08:34:19] <wikibugs>	 Operations, Performance-Team, SRE-swift-storage, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (fgiunchedi) From my POV as Swift maintainer I'm ok to go ahead with testing etc.  Although it'd be really good to have at...
[08:34:41] <wikibugs>	 (CR) Marostegui: [C: +2] orchestrator.conf: Change RecoveryPeriodBlockSeconds default [puppet] - https://gerrit.wikimedia.org/r/636558 (https://phabricator.wikimedia.org/T265990) (owner: Marostegui)
[08:35:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:37:07] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:39:48] <logmsgbot>	 !log elukey@cumin1001 END (ERROR) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=97)
[08:39:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:43] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:42:11] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:43:53] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:45:28] <wikibugs>	 (PS1) Muehlenhoff: Remove auto restart for pmacctd [puppet] - https://gerrit.wikimedia.org/r/636607
[08:47:19] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:48:23] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:32] <wikibugs>	 (PS1) Elukey: sre.hadoop.init-hadoop-workers: avoid wipefs [cookbooks] - https://gerrit.wikimedia.org/r/636608
[08:50:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:52:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:53:23] <wikibugs>	 (CR) Elukey: [C: +2] sre.hadoop.init-hadoop-workers: avoid wipefs [cookbooks] - https://gerrit.wikimedia.org/r/636608 (owner: Elukey)
[08:54:09] <wikibugs>	 (PS2) Gehel: [wdqs-data-reload] load all lexemes chunks [cookbooks] - https://gerrit.wikimedia.org/r/636018 (owner: DCausse)
[08:55:24] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[08:55:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:47] <wikibugs>	 (CR) Gehel: [C: +2] [wdqs-data-reload] load all lexemes chunks [cookbooks] - https://gerrit.wikimedia.org/r/636018 (owner: DCausse)
[08:56:22] <wikibugs>	 (PS1) Kormat: mariadb: Set both db_inventory nodes read-write [puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003)
[08:56:44] <wikibugs>	 (PS2) Kormat: mariadb: Set both db_inventory nodes read-write [puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003)
[08:58:33] <logmsgbot>	 !log elukey@cumin1001 END (ERROR) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=97)
[08:58:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:53] <wikibugs>	 (PS3) Kormat: mariadb: Set both db_inventory nodes read-write [puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003)
[09:02:06] <wikibugs>	 (CR) Kormat: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1001/26145/" [puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003) (owner: Kormat)
[09:02:20] <wikibugs>	 (CR) Kormat: [C: -2] "Not merging this today." [puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003) (owner: Kormat)
[09:02:35] <wikibugs>	 (CR) Marostegui: [C: +1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003) (owner: Kormat)
[09:04:29] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:40] <wikibugs>	 (CR) Ema: [C: +2] ATS: add metric trafficserver_tls_client_total_time [puppet] - https://gerrit.wikimedia.org/r/635276 (https://phabricator.wikimedia.org/T265869) (owner: Ema)
[09:14:49] <wikibugs>	 Operations, Data-Persistence-Backup, SRE-tools: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323 (jcrespo) So my guess is this is only happening on buster.
[09:15:25] <wikibugs>	 (PS1) Kormat: orchestrator: Search both eqiad and codfw dns [puppet] - https://gerrit.wikimedia.org/r/636613 (https://phabricator.wikimedia.org/T265990)
[09:17:02] <wikibugs>	 (CR) Kormat: [C: -2] "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1003/26146/dborch1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/636613 (https://phabricator.wikimedia.org/T265990) (owner: Kormat)
[09:17:13] <wikibugs>	 (CR) Marostegui: [C: +1] orchestrator: Search both eqiad and codfw dns [puppet] - https://gerrit.wikimedia.org/r/636613 (https://phabricator.wikimedia.org/T265990) (owner: Kormat)
[09:19:21] <wikibugs>	 (PS1) Muehlenhoff: Only handle auto restart of Jenkins on active instance [puppet] - https://gerrit.wikimedia.org/r/636614
[09:20:24] <wikibugs>	 (CR) Ayounsi: [C: +1] Remove auto restart for pmacctd [puppet] - https://gerrit.wikimedia.org/r/636607 (owner: Muehlenhoff)
[09:21:40] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm optional minor nit" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/636560 (owner: Elukey)
[09:21:50] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove auto restart for pmacctd [puppet] - https://gerrit.wikimedia.org/r/636607 (owner: Muehlenhoff)
[09:22:34] <wikibugs>	 (CR) Jbond: [C: +2] remote-backup-mariadb: update cron to systemd::timer::job [puppet] - https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[09:23:04] <moritzm>	 jbond42: shall I merge along?
[09:23:04] <wikibugs>	 (PS1) Kormat: orchestrator: Install mariadb client [puppet] - https://gerrit.wikimedia.org/r/636616 (https://phabricator.wikimedia.org/T265990)
[09:23:18] <jbond42>	 yes go ahead, i need a follow up one but that one is fine to go now
[09:23:19] <icinga-wm>	 RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:24] <moritzm>	 ack, doing
[09:23:37] <icinga-wm>	 RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:53] <icinga-wm>	 RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:49] <wikibugs>	 (CR) Kormat: [C: -2] "-2: Don't merge today" [puppet] - https://gerrit.wikimedia.org/r/636616 (https://phabricator.wikimedia.org/T265990) (owner: Kormat)
[09:25:08] <wikibugs>	 (PS1) Marostegui: orchestrator.conf: Add query to detect alias [puppet] - https://gerrit.wikimedia.org/r/636617 (https://phabricator.wikimedia.org/T266485)
[09:25:17] <wikibugs>	 (PS1) Jbond: mariadb-snapshot: disable monitoring on timer job [puppet] - https://gerrit.wikimedia.org/r/636618
[09:25:54] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] mariadb-snapshot: disable monitoring on timer job [puppet] - https://gerrit.wikimedia.org/r/636618 (owner: Jbond)
[09:26:02] <wikibugs>	 (CR) Kormat: [C: +1] orchestrator.conf: Add query to detect alias [puppet] - https://gerrit.wikimedia.org/r/636617 (https://phabricator.wikimedia.org/T266485) (owner: Marostegui)
[09:26:33] <wikibugs>	 (CR) Marostegui: [C: +1] orchestrator: Install mariadb client [puppet] - https://gerrit.wikimedia.org/r/636616 (https://phabricator.wikimedia.org/T265990) (owner: Kormat)
[09:27:18] <wikibugs>	 (CR) Kormat: "Great, thank you! <3" [puppet] - https://gerrit.wikimedia.org/r/636067 (https://phabricator.wikimedia.org/T266338) (owner: Dzahn)
[09:33:51] <wikibugs>	 (CR) Arturo Borrero Gonzalez: toolforge: script to make long-running processes on bastions less good (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) (owner: Bstorm)
[09:35:36] <wikibugs>	 (CR) Kormat: "Can you update the orchestrator grants template as well, please?" [puppet] - https://gerrit.wikimedia.org/r/636617 (https://phabricator.wikimedia.org/T266485) (owner: Marostegui)
[09:36:05] <wikibugs>	 (CR) Marostegui: "> Patch Set 1: -Code-Review" [puppet] - https://gerrit.wikimedia.org/r/636617 (https://phabricator.wikimedia.org/T266485) (owner: Marostegui)
[09:37:00] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "thanks for working on this!" [puppet] - https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: Andrew Bogott)
[09:40:43] <wikibugs>	 (PS1) Jbond: systemd::timer::job: add complex interval type checking [puppet] - https://gerrit.wikimedia.org/r/636622 (https://phabricator.wikimedia.org/T265138)
[09:43:31] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/636605 (owner: Muehlenhoff)
[09:44:59] <wikibugs>	 (PS1) Ayounsi: ImportPuppetDB: add cable color/type [software/netbox-extras] - https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899)
[09:45:03] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fix auto restart on IDP test hosts [puppet] - https://gerrit.wikimedia.org/r/636605 (owner: Muehlenhoff)
[09:45:26] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/636614 (owner: Muehlenhoff)
[09:46:06] <wikibugs>	 (CR) Jbond: [C: +2] systemd::timer::job: add complex interval type checking [puppet] - https://gerrit.wikimedia.org/r/636622 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[09:46:23] <icinga-wm>	 PROBLEM - eventlogging Varnishkafka log producer on cp4032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:46:43] <icinga-wm>	 PROBLEM - statsv Varnishkafka log producer on cp4032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:46:50] <elukey>	 mmmm
[09:46:53] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp4032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:47:24] <ema>	 looking
[09:48:05] <elukey>	 it says varnishkafka-webrequest.service: Job varnishkafka-webrequest.service/start failed with result 'dependency'.
[09:48:12] <elukey>	 anybody working on it?
[09:48:14] <wikibugs>	 (PS2) Ayounsi: ImportPuppetDB: add cable color/type [software/netbox-extras] - https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899)
[09:48:16] <wikibugs>	 (PS3) Ayounsi: Update AssignIPs to handle switch port and cable [software/netbox-extras] - https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339)
[09:48:18] <wikibugs>	 (PS4) Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849
[09:48:20] <wikibugs>	 (PS3) Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339)
[09:48:29] * jbond42 taking a peek at cp4032
[09:48:39] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp4032 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:48:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ImportPuppetDB: add cable color/type [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899) (owner: 10Ayounsi)
[09:48:57] <ema>	 elukey: yeah, I downgraded varnish to 6.0.2-1 and apparently the ABI isn't compatible -.-
[09:49:05] <elukey>	 ah!
[09:49:10] <elukey>	 okok that explains
[09:49:21] <ema>	 sorry, I was about to !log
[09:49:39] <jbond42>	 ema: fyi i did a puppet run which started a whole bunch of other varnish related things
[09:49:47] <jbond42>	 do you want the puppet agent output?
[09:49:59] <icinga-wm>	 RECOVERY - eventlogging Varnishkafka log producer on cp4032 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:50:17] <icinga-wm>	 RECOVERY - statsv Varnishkafka log producer on cp4032 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:50:23] <ema>	 jbond42: no need, I'm looking at /var/log/puppet.log. Thanks!
[09:50:29] <jbond42>	 ack
[09:51:48] <wikibugs>	 (03PS3) 10Ayounsi: ImportPuppetDB: add cable color/type [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899)
[09:51:50] <wikibugs>	 (03PS4) 10Ayounsi: Update AssignIPs to handle switch port and cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339)
[09:51:52] <wikibugs>	 (03PS5) 10Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849
[09:51:54] <wikibugs>	 (03PS4) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339)
[09:55:33] <wikibugs>	 (03PS1) 10Jbond: systemd::timer::job: switch monitoring_enabled default to false [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138)
[09:57:03] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:58:17] <wikibugs>	 (03PS2) 10Jbond: systemd::timer::job: switch monitoring_enabled default to false [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138)
[09:58:49] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:06] <wikibugs>	 (03PS2) 10Marostegui: orchestrator.conf: Add query to detect alias [puppet] - 10https://gerrit.wikimedia.org/r/636617 (https://phabricator.wikimedia.org/T266485)
[10:01:34] <wikibugs>	 10Operations, 10Analytics-Radar, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10MoritzMuehlenhoff) @gsingers We have three major types of NDA/MOU under which people get access to PII-sensitive data on our servers: * Everyone who's WMF staff has signed an NDA a...
[10:04:05] <wikibugs>	 (03CR) 10Jbond: "PCC running https://puppet-compiler.wmflabs.org/compiler1002/26150/" [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond)
[10:04:57] <wikibugs>	 (03Abandoned) 10Hashar: Add dns entry for zuul.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/634913 (https://phabricator.wikimedia.org/T207008) (owner: 10Ladsgroup)
[10:05:00] <wikibugs>	 (03Abandoned) 10Hashar: mediawiki: Funnel zuul.wikimedia.org to integration.wikimedia.org/zuul [puppet] - 10https://gerrit.wikimedia.org/r/634914 (https://phabricator.wikimedia.org/T207008) (owner: 10Ladsgroup)
[10:05:11] <wikibugs>	 10Operations, 10DNS, 10Traffic, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Create redirect to integration.wikimedia.org/zuul - https://phabricator.wikimedia.org/T207008 (10hashar) 05Open→03Declined The canonical URL is https://integration.wikimedia.org/zuul/  which one ca...
[10:05:25] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10DNS, 10Traffic, and 3 others: Create redirect to integration.wikimedia.org/zuul - https://phabricator.wikimedia.org/T207008 (10hashar)
[10:06:29] <wikibugs>	 (03PS1) 10Elukey: sre.hadoop.init-hadoop-worker: add explicit systemd reload steps [cookbooks] - 10https://gerrit.wikimedia.org/r/636631
[10:06:38] <XioNoX>	 !log update policies from-zone production to-zone junos-host on mr1-codfw - T265589
[10:06:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:45] <icinga-wm>	 PROBLEM - Host mr1-codfw IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:ffff::6)
[10:10:08] <XioNoX>	 ah
[10:10:17] <XioNoX>	 forgot to allow icmpv6?
[10:11:45] <XioNoX>	 yup
[10:12:06] <XioNoX>	 should come back?
[10:14:53] <icinga-wm>	 RECOVERY - Host mr1-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.58 ms
[10:15:03] <XioNoX>	 good
[10:15:10] <XioNoX>	 !log update policies from-zone production to-zone junos-host on mr1-esams - T265589
[10:15:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:04] <wikibugs>	 (03PS1) 10Jcrespo: dbstore_multiinstance: Add profile to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911)
[10:19:13] <XioNoX>	 !log update policies from-zone production to-zone junos-host on mr1-ulsfo - T265589
[10:19:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:40] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Indeed GerritExtDistProvider relies on 'repoListUrl' whenever it is set.  I am not familiar with extension distributor or I would just hav" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636083 (https://phabricator.wikimedia.org/T266024) (owner: 10Legoktm)
[10:20:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] dbstore_multiinstance: Add profile to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911) (owner: 10Jcrespo)
[10:20:15] <XioNoX>	 !log update policies from-zone production to-zone junos-host on mr1-eqsin - T265589
[10:20:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:39] <wikibugs>	 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) Given that the amount of changes between 5.1.3 and 6.0.6 is considerable, I was thinking of following this "bisect-like" approach: packa...
[10:21:44] <XioNoX>	 !log update policies from-zone production to-zone junos-host on mr1-eqiad - T265589
[10:21:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:25] <ema>	 elukey, jbond42: FTR this is the sadness of the day https://phabricator.wikimedia.org/T264398#6580942
[10:23:01] <wikibugs>	 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10ayounsi)
[10:23:35] * elukey plays sad_trombone.wav for ema
[10:24:44] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hadoop.init-hadoop-worker: add explicit systemd reload steps [cookbooks] - 10https://gerrit.wikimedia.org/r/636631 (owner: 10Elukey)
[10:24:56] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "The cron removal can be deployed." [puppet] - 10https://gerrit.wikimedia.org/r/636084 (https://phabricator.wikimedia.org/T266024) (owner: 10Dzahn)
[10:25:14] <wikibugs>	 (03PS2) 10Jcrespo: dbstore_multiinstance: Add profile to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911)
[10:25:42] <wikibugs>	 (03PS3) 10Jcrespo: dbstore_multiinstance: Add module to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911)
[10:26:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] dbstore_multiinstance: Add module to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911) (owner: 10Jcrespo)
[10:27:45] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[10:27:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:17] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[10:31:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:49] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[10:33:30] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Add apache httpd base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324)
[10:33:32] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add an httpd-fcgi image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324)
[10:33:55] <icinga-wm>	 PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:34:27] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry2002 is OK: HTTP OK: HTTP/1.1 200 OK - 2567 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Docker
[10:35:15] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "As I get it the restart sends a SIGTERM to java/Jenkins which would cause it to abruptly terminate all jobs currently running." [puppet] - 10https://gerrit.wikimedia.org/r/636614 (owner: 10Muehlenhoff)
[10:35:20] <wikibugs>	 (03PS4) 10Ayounsi: Add switch interface support to decom script [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341)
[10:35:31] <icinga-wm>	 RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:37:39] <wikibugs>	 (03CR) 10Muehlenhoff: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/636614 (owner: 10Muehlenhoff)
[10:39:56] <wikibugs>	 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10akosiaris) > I tried installing 6.0.2 on cp4032, and to my surprise I found out that 6.0.6 and 6.0.2 are not binary compatible:  {meme, src=f...
[10:40:19] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[10:40:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:17] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[10:44:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:37] <wikibugs>	 (03PS3) 10Jbond: systemd::timer::job: switch monitoring_enabled default to false [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138)
[10:45:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job: switch monitoring_enabled default to false [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond)
[10:46:16] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[10:46:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:01] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[10:52:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:33] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[10:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:57] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "Yes that got enabled for release Jenkins instances back in May 2018 ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/430562 ).   I g" [puppet] - 10https://gerrit.wikimedia.org/r/636614 (owner: 10Muehlenhoff)
[10:58:49] <wikibugs>	 (03PS8) 10Filippo Giunchedi: grafana: sync users and roles from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/635559 (https://phabricator.wikimedia.org/T265712)
[10:58:51] <wikibugs>	 (03PS1) 10Filippo Giunchedi: grafana: introduce 'profile::grafana::active_host' [puppet] - 10https://gerrit.wikimedia.org/r/636637 (https://phabricator.wikimedia.org/T265712)
[10:58:53] <wikibugs>	 (03PS1) 10Filippo Giunchedi: grafana: add user sync from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712)
[11:00:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] grafana: add user sync from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi)
[11:01:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/26152/" [puppet] - 10https://gerrit.wikimedia.org/r/636637 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi)
[11:02:02] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[11:02:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:13] <wikibugs>	 (03CR) 10Hashar: grafana: sync users and roles from LDAP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635559 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi)
[11:04:43] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Indeed they are "ensure => absent" so that is merely a puppet manifest cleanup. Thx!" [puppet] - 10https://gerrit.wikimedia.org/r/636085 (owner: 10Dzahn)
[11:06:30] <wikibugs>	 (03PS2) 10Filippo Giunchedi: grafana: add user sync from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712)
[11:06:59] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[11:07:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635559 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi)
[11:09:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: grafana: sync users and roles from LDAP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635559 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi)
[11:10:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/636637 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi)
[11:12:12] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[11:12:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:39] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[11:13:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:05] <ema>	 !log A:cp remove libvarnishapi1, replaced by libvarnishapi2 a while ago T261487
[11:14:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:10] <stashbot>	 T261487: Varnish 6.0 needs a SONAME version bump - https://phabricator.wikimedia.org/T261487
[11:14:33] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Add an httpd-fcgi image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324)
[11:14:57] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "What Filippo said: one can just run the tests via the Docker container /  ./utils/run_ci_locally.sh" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635559 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi)
[11:15:00] <wikibugs>	 (03CR) 10Muehlenhoff: grafana: add user sync from LDAP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi)
[11:19:03] <icinga-wm>	 PROBLEM - Check systemd state on cp3052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:19:07] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[11:19:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: introduce 'profile::grafana::active_host' [puppet] - 10https://gerrit.wikimedia.org/r/636637 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi)
[11:20:09] <wikibugs>	 (03PS4) 10Jcrespo: dbstore_multiinstance: Add module to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911)
[11:20:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: sync users and roles from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/635559 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi)
[11:20:57] <wikibugs>	 (03PS5) 10Jcrespo: dbstore_multiinstance: Add module to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911)
[11:21:05] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[11:21:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:48] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbstore_multiinstance: Add module to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911) (owner: 10Jcrespo)
[11:24:37] <wikibugs>	 10Operations, 10Commons, 10DBA, 10Platform Engineering, and 2 others: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10matthiasmullie) At first sight, these DB operations seem to make sense: bots are in the process...
[11:25:10] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[11:25:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:48] <wikibugs>	 10Operations, 10DBA, 10Data-Persistence, 10Release-Engineering-Team-TODO, and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (10Kormat) I just came across this: https://github.com/openark/orchestrator/blob/master/docs/ci-env.md#run-orchestrator-with-envir...
[11:26:07] <icinga-wm>	 RECOVERY - Check systemd state on cp3052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:27:20] <wikibugs>	 10Operations, 10Commons, 10DBA, 10Platform Engineering, and 2 others: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10jcrespo) Independently of the source of the issue, could these regenerations be throttled/rate l...
[11:27:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[11:27:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: grafana: add user sync from LDAP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi)
[11:33:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi)
[11:33:33] <wikibugs>	 (03PS1) 10Jbond: change to test pcc [puppet] - 10https://gerrit.wikimedia.org/r/636641
[11:35:22] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[11:35:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:46] <wikibugs>	 (03PS1) 10Jcrespo: dbprov: Apply the cleanup stale scope sessions to the right set of hosts [puppet] - 10https://gerrit.wikimedia.org/r/636642 (https://phabricator.wikimedia.org/T199911)
[11:41:51] <wikibugs>	 (03Abandoned) 10Hashar: Recommendation API: upgrade node to version 10 [puppet] - 10https://gerrit.wikimedia.org/r/560454 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov)
[11:46:54] <wikibugs>	 (03PS2) 10Jcrespo: dbprov: Apply the cleanup stale scope sessions to the right set of hosts [puppet] - 10https://gerrit.wikimedia.org/r/636642 (https://phabricator.wikimedia.org/T199911)
[11:49:52] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbprov: Apply the cleanup stale scope sessions to the right set of hosts [puppet] - 10https://gerrit.wikimedia.org/r/636642 (https://phabricator.wikimedia.org/T199911) (owner: 10Jcrespo)
[11:50:13] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Cleanup old cron deletions after some time after deploy [puppet] - 10https://gerrit.wikimedia.org/r/636644 (https://phabricator.wikimedia.org/T265138)
[11:52:11] <wikibugs>	 (03CR) 10Jcrespo: "To be deployed in a few days." [puppet] - 10https://gerrit.wikimedia.org/r/636644 (https://phabricator.wikimedia.org/T265138) (owner: 10Jcrespo)
[11:52:46] <wikibugs>	 (03PS1) 10Volans: requests: add new module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/636645
[11:57:15] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:29] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:03] <icinga-wm>	 RECOVERY - Check systemd state on dbprov1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] sretest: Experiment with preserving docker rules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/634192 (owner: 10Alexandros Kosiaris)
[12:31:31] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: sretest: Experiment with preserving docker rules [puppet] - 10https://gerrit.wikimedia.org/r/634192
[12:44:43] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:51:17] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[12:51:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:19] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:55:20] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[12:55:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:48] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[12:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:08] <elukey>	 I promise I am going to stop very soon, few hosts to go :)
[12:57:11] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:59:11] <wikibugs>	 (03PS1) 10Jbond: pcc: expore posting pcc to gerrit comments [puppet] - 10https://gerrit.wikimedia.org/r/636652
[13:01:54] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[13:01:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:09] <wikibugs>	 (03CR) 10Jbond: "@paladox, hashar wonder if you know of a way to authenticate to gerrit api using some type of user token like Jenkins instead of requiring" [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond)
[13:03:01] <icinga-wm>	 PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:03:19] <wikibugs>	 (03CR) 10Paladox: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond)
[13:03:21] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[13:04:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[13:04:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:59] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry2001 is OK: HTTP OK: HTTP/1.1 200 OK - 2567 bytes in 0.315 second response time https://wikitech.wikimedia.org/wiki/Docker
[13:05:27] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond)
[13:06:00] <wikibugs>	 (03CR) 10Jbond: "will leave this here as the other changes may be useful to merge" [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond)
[13:07:59] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[13:08:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:15] <icinga-wm>	 RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:09:52] <wikibugs>	 (03PS3) 10Marostegui: orchestrator.conf: Add query to detect alias [puppet] - 10https://gerrit.wikimedia.org/r/636617 (https://phabricator.wikimedia.org/T266485)
[13:10:39] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[13:10:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:25] <wikibugs>	 10Operations, 10netops: Apply uRPF strict mode on Customer links - https://phabricator.wikimedia.org/T266561 (10ayounsi) p:05Triage→03Low
[13:13:59] <wikibugs>	 (03PS1) 10Ayounsi: Add uRPF strict mode to Customers links [homer/public] - 10https://gerrit.wikimedia.org/r/636653 (https://phabricator.wikimedia.org/T266561)
[13:15:00] <rzl>	 good morning! in 15 minutes in this channel, we'll start getting set up for the DC switchover from codfw back to eqiad -- MW read-only window starts in 45 minutes
[13:15:20] <wikibugs>	 (03PS4) 10Jbond: systemd::timer::job: switch monitoring_enabled default to false [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138)
[13:15:21] <rzl>	 if you have root on cumin1001, you can watch my screen by running "sudo -i tmux attach -rt switchdc" there
[13:15:34] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[13:15:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:47] <rzl>	 elukey: almost done? :)
[13:17:20] <elukey>	 I am done!
[13:17:28] <godog>	 rzl: thanks! will follow on tmux
[13:20:19] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:20:31] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:21:38] <XioNoX>	 I'm around if needed
[13:21:51] <wikibugs>	 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 3 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10akosiaris) >>! In T266373#6576166, @CDanis wrote: >>>! In T266373#6576161, @Framawiki wrote: >> See also [[...
[13:23:46] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (10Papaul)
[13:23:51] <wikibugs>	 10Operations, 10Data-Persistence-Backup, 10SRE-tools: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323 (10jcrespo) 05Open→03Resolved a:03jcrespo ` <icinga-wm> RECOVERY - Check systemd state on dbprov1003 is OK: OK - running: The system is fully operational htt...
[13:23:53] <wikibugs>	 10Operations, 10SRE-tools: Systemd session creation fails under I/O load - https://phabricator.wikimedia.org/T199911 (10jcrespo)
[13:25:36] * volans here < rzl 
[13:26:39] <rzl>	 👋
[13:26:54] <wikibugs>	 (03PS6) 10Ottomata: Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130)
[13:27:30] <ottomata>	 (don't worry not merging ^ til after switch settles)
[13:27:48] <rzl>	 thanks!
[13:28:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata)
[13:28:35] <wikibugs>	 (03PS7) 10Ottomata: Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130)
[13:29:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata)
[13:30:17] <rzl>	 okay, let's go :)
[13:30:35] <rzl>	 again if you have root on cumin1001, you can watch my screen by running "sudo -i tmux attach -rt switchdc" there -- if you change the flags at all, please keep -r so that yours is read-only
[13:30:58] <rzl>	 also, the session overall takes the SMALLEST terminal size, so please don't join from a tiny window, or no one else will be able to see anything
[13:31:08] <rzl>	 whoever just joined with an 80x20 terminal, that's you :)
[13:31:13] <ottomata>	 cool!
[13:31:16] <ottomata>	 wasn't me! :p
[13:31:19] <marostegui>	 hahaha
[13:31:31] <rzl>	 aim for at least, let's say, 120x40 please
[13:31:33] <godog>	 hahah
[13:31:38] <_joe_>	 yeah please
[13:31:39] <kormat>	 12x4 it is
[13:31:40] <rzl>	 if your screen is cramped, I'd like you to deal with it so that I don't have to <3
[13:32:12] <ottomata>	 well this is fun.
[13:32:21] <_joe_>	 not really.
[13:32:31] <rzl>	 the commrel banner should be live now, but may not show up immediately everywhere due to caching
[13:32:45] <wikibugs>	 Operations, Puppet: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 (MoritzMuehlenhoff) Sounds good to me
[13:33:03] <marostegui>	 rzl: yep, not up yet on es wiki
[13:33:03] <rzl>	 _joe_, volans: check me?
[13:33:13] <volans>	 rzl: direction correct!
[13:33:19] <jynus>	 I see it on mw.org but not on enwiki
[13:33:27] <_joe_>	 +1
[13:33:51] <jynus>	 now I see it
[13:34:07] <rzl>	 whoever's terminal is 20 lines tall, please make it taller :) the whole session takes the smallest size
[13:34:24] <rzl>	 I'm going to start phase 0 prep steps now, any objectinos?
[13:34:26] <rzl>	 *objections
[13:34:47] <_joe_>	 go for me
[13:35:11] <marostegui>	 +1
[13:35:11] <rzl>	 running 00-reduce-ttl first just so we don't have to think about the five-minute wait time
[13:35:15] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl
[13:35:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:22] <rzl>	 some "failed to call" messages here are expected
[13:35:31] <rzl>	 since we just recheck until the records converge
[13:35:34] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0)
[13:35:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:44] <rzl>	 the old TTL will be up at 13:41
[13:35:56] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet
[13:35:59] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0)
[13:36:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:07] <marostegui>	 banner up on eswiki, for the record
[13:36:20] <_joe_>	 itwiki too
[13:36:24] <rzl>	 > The last Puppet run was at Tue Oct 27 13:22:37 UTC 2020 (13 minutes ago). Puppet is disabled. cookbooks.sre.switchdc.mediawiki
[13:36:31] <rzl>	 confirming puppet is disabled on mwmaint1002 and 2001
[13:36:58] <rzl>	 starting the cache warmups in eqiad, we may see some icinga noise about request latency, that's not user-impacting
[13:37:02] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches
[13:37:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:25] <_joe_>	 rzl: I would suggest we run it multiple times
[13:37:32] <rzl>	 yep agreed
[13:37:32] <volans>	 yep, as usual
[13:38:00] <rzl>	 I'll pause briefly in between for graph clarity
[13:38:16] <rzl>	 we can't set R/O for another 20+ minutes anyway, no rush
[13:38:50] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0)
[13:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:54] <volans>	 _joe_: is the current for i in range(3): enough or you meant re-run the whole warmup cookbook again?
[13:39:23] <volans>	 not that it harms to run it more, just that we could increase that range(3) directly ;)
[13:39:53] <rzl>	 haha, good morning eqiad appservers https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard
[13:40:38] <_joe_>	 so one thing I'm noticing is we're not warming up the apis at all
[13:41:04] <rzl>	 hm, that's true
[13:41:39] <rzl>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/%2B/master/cookbooks/sre/switchdc/mediawiki/00-warmup-caches.py#30 it looks like we just pass appservers.svc here, presumably we could just add the api name too?
[13:41:54] <rzl>	 I don't want to touch the cookbook now but we could run the warmup script directly
[13:42:33] <_joe_>	 well no probably the warmup script calls appserver-specific endpoints
[13:43:10] <rzl>	 mm, they probably have a lot of cache in common but that's fair
[13:44:06] <rzl>	 I'll make a note for next time -- do you want to do anything about it now, or postpone until we get it fixed, or keep going as-is?
[13:44:08] <volans>	 rzl: are you running the warmup again in the meanwhile?
[13:44:46] <rzl>	 a latency bump on the api servers wouldn't be as bad as on the appservers, I'm inclined to say let's not postpone over it
[13:45:03] <rzl>	 volans: can do but let's sort this out first, we're still plenty ahead of schedule
[13:45:27] <_joe_>	 go as-is
[13:45:42] <rzl>	 ack
[13:45:46] <rzl>	 do you want a rerun?
[13:45:52] <_joe_>	 I mean we did the same last time right?
[13:45:55] <_joe_>	 yes please
[13:46:06] <rzl>	 we did yeah
[13:46:10] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches
[13:46:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:16] <_joe_>	 I want to see those p75 and p99 go down when it runs
[13:46:28] <volans>	 the URLs called can be checked on the mwmaint hosts in /var/lib/mediawiki-cache-warmup/
[13:46:37] <volans>	 urls-cluster.txt and urls-server.txt
[13:46:58] <wikibugs>	 (Abandoned) Gehel: [wip] logstash: move to role / profiles [puppet] - https://gerrit.wikimedia.org/r/390039 (owner: Gehel)
[13:47:02] <wikibugs>	 (Abandoned) Gehel: logstash: dedicated components in our APT repository [puppet] - https://gerrit.wikimedia.org/r/390402 (https://phabricator.wikimedia.org/T179964) (owner: Gehel)
[13:47:05] <wikibugs>	 (Abandoned) Gehel: service: use the canonical definition of logstash host [puppet] - https://gerrit.wikimedia.org/r/396072 (https://phabricator.wikimedia.org/T182304) (owner: Gehel)
[13:47:38] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0)
[13:47:41] <icinga-wm>	 PROBLEM - Disk space on maps2002 is CRITICAL: DISK CRITICAL - free space: /srv 63626 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2002&var-datasource=codfw+prometheus/ops
[13:47:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:01] <wikibugs>	 (Abandoned) Gehel: wdqs: increase frequency of update lag check [puppet] - https://gerrit.wikimedia.org/r/443374 (owner: Gehel)
[13:48:06] <wikibugs>	 (Abandoned) Gehel: Introduce parameter data types for a few defined types. [puppet] - https://gerrit.wikimedia.org/r/440117 (owner: Gehel)
[13:48:10] <wikibugs>	 (Abandoned) Gehel: elasticsearch - notifiy nginx of SSL certificate changes [puppet] - https://gerrit.wikimedia.org/r/333664 (owner: Gehel)
[13:48:15] <wikibugs>	 (Abandoned) Gehel: logstash: kafka analytics cluster isn't available from deployment-prep [puppet] - https://gerrit.wikimedia.org/r/414638 (owner: Gehel)
[13:48:20] <rzl>	 tail latency is definitely less but I'll give it another run in a couple of minutes
[13:48:25] <_joe_>	 +1
[13:48:32] <volans>	 ack avg response time also much better
[13:48:44] <volans>	 I guess as improvement we could increase the range and add some sleep between the runs
[13:48:45] <rzl>	 I think the third is likely to be ~the same as the second, but
[13:48:47] <volans>	 in the cookbook
[13:48:55] <wikibugs>	 (Abandoned) Gehel: Increase time before alter for elasticsearch disk space issues [puppet] - https://gerrit.wikimedia.org/r/290487 (https://phabricator.wikimedia.org/T136702) (owner: Gehel)
[13:49:01] <rzl>	 volans: yeah agreed, although the nice thing about having it this way is I can run other steps in between
[13:49:13] <rzl>	 I guess I haven't needed to, though, so that probably doesn't matter
[13:49:17] <volans>	 ideally it should parse the results and retry until it converges
[13:49:38] <rzl>	 yeah, let's do more design work on it later :)
[13:49:46] <rzl>	 rerunning
[13:49:50] <_joe_>	 we can probably also work on a more comprehensive set of urls
[13:49:50] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches
[13:49:51] <volans>	 +1
[13:49:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:37] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0)
[13:50:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:54] <rzl>	 while we wait, now is a good time to pick a language wiki and prepare a test edit, without saving it
[13:51:08] <rzl>	 once we go read-only, I'll ask you to save your test edit and confirm that it doesn't work
[13:51:21] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:35] * volans gets meta
[13:51:46] <_joe_>	 I got itwiki (group2)
[13:51:47] <Urbanecm>	 meta is not a language wiki :-)
[13:51:50] <marostegui>	 rzl: I will take eswiki
[13:51:53] <moritzm>	 dewiki
[13:52:09] <_joe_>	 no one takes enwiki?
[13:52:11] <_joe_>	 or wikidata
[13:52:16] <jynus>	 testwiki (s3) here
[13:52:18] <marostegui>	 and commons
[13:52:39] <rzl>	 if someone could also test via mobile app I'd appreciate that too
[13:53:05] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[13:53:16] <rzl>	 ^ expected
[13:53:20] <wkandek>	 ptwiki
[13:53:26] <rzl>	 _joe_: is it me or is the third latency bump slightly worse than the second?
[13:53:29] <_joe_>	 I got also wikidata
[13:53:36] <_joe_>	 rzl: it's much worse
[13:53:39] <rzl>	 I'm going to give it one more
[13:53:41] <jynus>	 the mobile app shows no alert
[13:53:43] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches
[13:53:43] <marostegui>	 I will take enwiki then
[13:53:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:47] <marostegui>	 eswiki and enwiki
[13:53:52] <kormat>	 i'm taking enwiki on mobile
[13:53:54] <_joe_>	 which is because we clearly need to send more requests
[13:53:57] <marostegui>	 ah nice, thanks kormat 
[13:53:57] <volans>	 yeah the last run was worse than the previous
[13:54:09] <jynus>	 the last, worse?
[13:54:25] <_joe_>	 I think because some urls we just require to the load-balancers
[13:54:26] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0)
[13:54:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:42] <_joe_>	 anyways, I think we're good to go
[13:54:43] <marostegui>	 rzl: going to downtime masters for 10 minutes now
[13:54:48] <rzl>	 marostegui: ack, thanks
[13:54:58] <volans>	 marostegui: make it 12
[13:54:59] <volans>	 :D
[13:55:03] <marostegui>	 volans: ok XD
[13:55:04] <volans>	 the time to run puppet
[13:55:23] <_joe_>	 rzl: do you want to start read-only at the hour?
[13:55:34] <rzl>	 _joe_: at or a little after, no particular urgency there
[13:55:34] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime
[13:55:35] <volans>	 it's 5m + RO time  + get to run the step that runs puppet
[13:55:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:47] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:55:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:52] <rzl>	 I'm going ahead with stopping maintenance scripts now, objections?
[13:56:00] <_joe_>	 yeah I was asking in the context of marostegui downtiming for 10 minutes
[13:56:01] <marostegui>	 rzl: go for it
[13:56:08] <_joe_>	 rzl: ack
[13:56:09] <volans>	 marostegui: [nit] double sudo
[13:56:18] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance
[13:56:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:24] <marostegui>	 volans: :(
[13:56:34] <wikibugs>	 (CR) Mholloway: [C: +1] Update mobileapps to 2020-10-26-150740-production [deployment-charts] - https://gerrit.wikimedia.org/r/636496 (https://phabricator.wikimedia.org/T264024) (owner: Ppchelko)
[13:56:37] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0)
[13:56:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:44] <volans>	 expect the stray php process message
[13:56:46] <volans>	 as usual
[13:56:54] <volans>	 who's looking?
[13:57:00] <rzl>	 yep, was just about to say
[13:57:17] <_joe_>	 I will take a look
[13:57:24] <volans>	 cirrus update index
[13:57:41] <volans>	 AFAICT
[13:57:57] <volans>	 seems the only one to me
[13:58:04] <_joe_>	 ryankemper: there is a script run by you
[13:58:13] <_joe_>	 I am killing it now
[13:58:21] <rzl>	 cirrus will keep sending some queries to codfw after the switchover, as a mitigation from last time
[13:58:21] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[13:59:11] <volans>	 apparently since oct. 15th
[13:59:13] <rzl>	 not sure but I don't think that index update is part of that mitigation
[13:59:14] <volans>	 gehel: ^^^
[13:59:14] <_joe_>	 uhm 
[13:59:21] <_joe_>	 ryan is running a while true
[13:59:45] <volans>	 log in /home/ryankemper/cirrus_log/viwikisource.codfw.reindex.log
[13:59:50] <dcausse>	 _joe_: it's probably a complete reindex, you should kill its whole session
[13:59:57] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:00] <_joe_>	 ok all clear
[14:00:04] <_joe_>	 we can proceed 
[14:00:13] <Trizek>	 Good luck!
[14:00:16] <rzl>	 _joe_: is that systemd alert the same?
[14:00:19] <revi>	 :popcorn:
[14:00:20] <gehel>	 dcausse: thanks for being faster than me :)
[14:00:24] <volans>	 _joe_: I still see mwscriptwikiset extensions/FlaggedRevs/maintenance/updateStats.php
[14:00:30] <Spookreeeno>	 Echoing Trizek, good luck all
[14:00:32] <volans>	 seems new though
[14:01:04] <jynus>	 rzl: mediawiki_job_parser_cache_purging.service  loaded failed failed
[14:01:13] <jynus>	 on mwmaint2001
[14:01:15] <_joe_>	 volans: yes, we need to switchover so that jobs don't start again
[14:01:23] <volans>	 ok
[14:01:31] <_joe_>	 rzl: proceed please, I'm killing that script now
[14:01:32] <volans>	 thought we disabled maintenance
[14:01:36] <rzl>	 okay, as before, I won't stop between steps
[14:01:37] <volans>	 :)
[14:01:40] <volans>	 +1
[14:01:40] <rzl>	 say "stop" in here if anything is wrong
[14:01:54] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly
[14:01:54] <jynus>	 parsercache purging is not a problem unless failing for days
[14:01:55] <_joe_>	 go
[14:01:55] <logmsgbot>	 !log rzl@cumin1001 MediaWiki read-only period starts at: 2020-10-27 14:01:54.999830
[14:01:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:09] <_joe_>	 confirmed read-only
[14:02:15] <rzl>	 test edits please
[14:02:15] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0)
[14:02:18] <marostegui>	 confirmed on eswiki and enwiki
[14:02:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:19] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly
[14:02:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:25] <volans>	 confirmed RO
[14:02:26] <jynus>	 same on testwiki
[14:02:31] <Urbanecm>	 mobile app is RO for cswiki
[14:02:38] <revi>	 readonly on kowiki
[14:02:42] <kormat>	 mobile app RO for enwiki
[14:02:43] <WXIII>	 i see RO on nlwiki
[14:02:48] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0)
[14:02:51] <wkandek>	 RO on ptwiki
[14:02:51] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki
[14:02:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:54] <sobanski>	 RO: plwiki on mobile
[14:02:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:57] <marostegui>	 commons on RO too
[14:03:13] <moritzm>	 dewiki correctly readonly
[14:03:17] <jynus>	 "The wiki is currently in read-only mode" on app
[14:03:20] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0)
[14:03:24] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions
[14:03:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:25] <jynus>	 that error message was nicely fixed
[14:03:27] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0)
[14:03:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:30] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite
[14:03:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:32] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0)
[14:03:35] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite
[14:03:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:42] <rzl>	 test edits again after this, please
[14:03:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:50] <_joe_>	 uhm
[14:03:52] <cdanis>	 rzl: 3 edits on hatnote post-switchover
[14:03:53] <icinga-wm>	 PROBLEM - Check the last execution of mediawiki_job_parser_cache_purging on mwmaint2001 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_parser_cache_purging https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:03:56] <_joe_>	 "failed to get siteinfo?"
[14:04:12] <cdanis>	 now a few dozen
[14:04:13] <rzl>	 _joe_: that just means we're waiting for siteinfo to show RW again
[14:04:16] <marostegui>	 write ok on commons
[14:04:23] <logmsgbot>	 !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=99)
[14:04:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:27] <rzl>	 "failed to call" is just the generic "still retrying" message
[14:04:28] <revi>	 ok on kowiki
[14:04:31] <mark>	 that message is confusing, we should really rewrite that
[14:04:32] <_joe_>	 latencies are very high
[14:04:33] <rzl>	 we may have hit a bad host though, retrying
[14:04:37] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite
[14:04:38] <kormat>	 write succeeded on en.m.wikipedia
[14:04:40] <akosiaris>	 ok elwiki
[14:04:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:52] <WXIII>	 i could edit / do a deletion on nlwiki
[14:04:53] <revi>	 takes a bit more time on loading tho
[14:05:02] <jynus>	 uff sending an edit to testwiki is taking a lot
[14:05:14] <Spookreeeno>	 What Revi said
[14:05:15] <_joe_>	 yeah we're seeing latencies skyrocketing
[14:05:17] <WXIII>	 css loaded quite slow tho
[14:05:18] <marostegui>	 eswiki, enwiki and commons worked for me
[14:05:23] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 844 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:05:23] <logmsgbot>	 !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=99)
[14:05:26] <_joe_>	 but they should recover soon
[14:05:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:33] <rzl>	 latencies are much higher than they were from the warmup script yeah
[14:05:35] <Urbanecm>	 managed to save an edit at cswiki via mobile app, but the app is lazier
[14:05:41] <volans>	 write succeeded on meta, although pretty slow
[14:05:41] <jynus>	 testwiki worked, but took 30s+
[14:05:44] <rzl>	 but it does look like caches are filling
[14:05:56] <_joe_>	 yes
[14:06:05] <sobanski>	 Edit worked on plwiki on mobile
[14:06:05] <wikibugs>	 Operations, Traffic: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567 (ema)
[14:06:06] <cdanis>	 latencies are on their way down
[14:06:07] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime
[14:06:09] <rzl>	 I'm going to pause briefly and then rerun 07-set-readwrite one more time when the siteinfo check should be faster
[14:06:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:11] <_joe_>	 we're out of the woods for latency
[14:06:14] <wikibugs>	 Operations, Traffic: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567 (ema) p:Triage→Medium
[14:06:15] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[14:06:19] <marostegui>	 Gave a bit more downtime to the masters
[14:06:21] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:06:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:38] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.1717 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver
[14:06:40] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.1937 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[14:06:40] <_joe_>	 latencies should be ok in a couple minutes tops
[14:06:55] <icinga-wm>	 PROBLEM - Check systemd state on kubestage1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:57] <Spookreeeno>	 Enwp is hanging on login page
[14:07:03] <rzl>	 worker saturation is consistent with high latency, should recover
[14:07:05] <_joe_>	 s3 seems to have latencies
[14:07:09] * volans checking kubestage1001
[14:07:13] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:07:19] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me
[14:07:27] <volans>	 just wmf_auto_restart_mcelog.service, ignore it
[14:07:37] <_joe_>	 apis have recovered
[14:07:40] <akosiaris>	 volans: ok thanks. I missed that
[14:07:40] <jynus>	 write latency is better on s3 now
[14:07:42] <akosiaris>	 oh pages
[14:07:46] <moritzm>	 it's caused by the move to 4.19 on kubestage,patch for that one is pending
[14:07:53] <cdanis>	 akosiaris: yeah, php worker saturation, already recovering
[14:07:53] <_joe_>	 please focus
[14:08:03] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
[14:08:09] <_joe_>	 ^^
[14:08:11] <_joe_>	 good
[14:08:21] <rzl>	 latency is coming back down to the normal range
[14:08:24] <wkandek>	 ptwiki edit worked
[14:08:25] <Trizek>	 The banner displayed to warn our users is still visible for IPs, but would disappear soon. 
[14:08:26] <icinga-wm>	 RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.7013 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver
[14:08:27] <_joe_>	 p99 is now normal
[14:08:28] <icinga-wm>	 RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6271 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[14:08:38] <_joe_>	 we should be out of the woods
[14:08:40] <rzl>	 Trizek: ack, sounds good
[14:08:44] <marostegui>	 s8 which gave problems on the first switch is also ok
[14:08:52] <rzl>	 rerunning 07-set-readwrite one more time just to verify the siteinfo comes back correct
[14:09:01] <volans>	 k
[14:09:02] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite
[14:09:03] <logmsgbot>	 !log rzl@cumin1001 MediaWiki read-only period ends at: 2020-10-27 14:09:02.873019
[14:09:03] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0)
[14:09:04] <_joe_>	 can someone check that the systemd timers are running correctly on mwmaint1002?
[14:09:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:07] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:09:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:31] <volans>	 _joe_: how so? the maintenance is not yet re-enabled there
[14:09:39] <rzl>	 running 08-start-maintenance now to get the wdqs dispatcher going, that's the only one controlled by that step
[14:09:41] <revi>	 loading fine, can lock accounts as normal, thanks people!
[14:09:45] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance
[14:09:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:53] <_joe_>	 volans: most scripts are just always running
[14:10:03] <_joe_>	 and do not start iff not in the active dc automatically
[14:10:05] <rzl>	 volans: everything using the systemd wrapper reads the active dc from conftool
[14:10:13] <volans>	 k
[14:10:38] <rzl>	 for future switchovers we might consider adding a "pause all maintenance scripts" flag to the conftool data so we can control it from here
[14:10:45] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:10:58] <marostegui>	 checking db1075 (s3) which seems to be struggling a bit
[14:11:01] <_joe_>	 confirmed, they're running
[14:11:10] <icinga-wm>	 PROBLEM - MariaDB read only s3 #page on db2105 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 23000057s, event_scheduler: True, 84.12 QPS, connection latency: 0.002210s, query latency: 0.000468s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[14:11:18] <_joe_>	 rzl: let's proceed?
[14:11:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[14:11:21] <akosiaris>	 Hm, I just got another page, delayed?
[14:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:23] <rzl>	 _joe_: already running
[14:11:25] <marostegui>	 rzl: giving them another 5 minutes
[14:11:29] <icinga-wm>	 PROBLEM - MariaDB read only es5 on es1024 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.12-MariaDB-log, Uptime 15127019s, event_scheduler: True, 117.28 QPS, connection latency: 0.002505s, query latency: 0.000627s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[14:11:30] <bblack>	 yeah delayed I think, alert1001
[14:11:32] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:11:33] <akosiaris>	 ok
[14:11:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:36] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0)
[14:11:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:42] <rzl>	 fyi the "master comes back in read only" alerts are fine
[14:11:48] <rzl>	 running puppet on the db masters now to clear them
[14:11:50] <cdanis>	 those are downtimes expiring
[14:11:52] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-run-puppet-on-db-masters
[14:11:55] <marostegui>	 rzl: +1
[14:11:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:01] <bblack>	 oh that makes more sense
[14:12:18] <rzl>	 while this runs, is anyone aware of any ongoing issues?
[14:12:19] <volans>	 rzl: you can also run the tendril one I'd say
[14:12:28] <volans>	 leave just the TTL out for a bit
[14:12:29] <volans>	 JIC
[14:12:34] <rzl>	 volans: yep that's the plan
[14:12:49] <_joe_>	 everyone: any ongoing issues?
[14:12:58] <_joe_>	 things you see while using the wikis
[14:12:58] <icinga-wm>	 RECOVERY - MariaDB read only s3 #page on db2105 is OK: Version 10.1.43-MariaDB, Uptime 23000165s, read_only: True, event_scheduler: True, 193.82 QPS, connection latency: 0.002153s, query latency: 0.000537s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[14:12:59] <jynus>	 rzl: write activity is much lower than before switch from the db point of view
[14:13:17] <jynus>	 rzl: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=7&orgId=1&from=1603804389476&to=1603807989476&var-site=eqiad&var-group=core&var-shard=All&var-role=All
[14:13:24] <cdanis>	 rzl: I think the cookbook is missing a step -- I think we also need to run puppet on alert1001/alert2001 after running it on the primary DBs
[14:13:31] <_joe_>	 yes
[14:13:38] <cdanis>	 I ran puppet by hand on alert1001 and *that* is what triggered the above recovery
[14:13:55] <moritzm>	 dewiki seems to be perfectly fine, made an edit, logged off and on and all functionality I checked is fine
[14:14:12] <rzl>	 cdanis: ack, thanks
[14:14:22] <rzl>	 jynus: I turned off stacked view and that looks like it's mostly driven by s8
[14:14:33] <cdanis>	 also, there's no issues reported in #wp-en or #-tech, and nothing on enwiki or commons technical village pump either
[14:14:40] <icinga-wm>	 RECOVERY - MariaDB read only es5 on es1024 is OK: Version 10.4.12-MariaDB-log, Uptime 15127210s, read_only: False, event_scheduler: True, 113.86 QPS, connection latency: 0.002341s, query latency: 0.000626s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[14:14:44] <revi>	 Everything fine @ meta / kowiki as well
[14:14:49] <jynus>	 rzl: then it is just the dispatching
[14:14:59] <Spookreeeno>	 Scanning discord for any reports
[14:15:01] <_joe_>	 thanks revi :)
[14:15:06] <rzl>	 jynus: yeah agree -- should be back to normal in the next few minutes
[14:15:13] <_joe_>	 Spookreeeno: we have a discord? :O
[14:15:17] <rzl>	 let me know if it doesn't recover :)
[14:15:21] <revi>	 https://enwp.org/WP:DISCORD
[14:15:22] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-run-puppet-on-db-masters (exit_code=0)
[14:15:24] <Spookreeeno>	 ^
[14:15:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:28] <_joe_>	 TIL :P
[14:15:29] <Spookreeeno>	 We do _joe_
[14:15:33] <revi>	 or the lowercase Discord.
[14:15:40] <rzl>	 puppet run on DB masters complete, you can go ahead and un-downtime, marostegui
[14:15:42] <revi>	 I just typed and didn't check if that shortcut exists
[14:15:49] <Spookreeeno>	 No issues with Wikipedia reported
[14:15:52] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-update-tendril
[14:15:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:57] <marostegui>	 rzl: ack, should expire in less than 5 minutes anyways
[14:16:00] <rzl>	 sgtm
[14:16:02] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-update-tendril (exit_code=0)
[14:16:03] <_joe_>	 Spookreeeno: thanks :)
[14:16:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:18] <rzl>	 marostegui: give tendril a look?
[14:16:22] <marostegui>	 rzl: checking
[14:16:43] <rzl>	 keeping the DNS TTLs short until the dust clears, maybe another 15m or so if nothing comes up
[14:16:52] <marostegui>	 rzl: looks good apart from pcXXXX hosts, which we saw on the first switch, I will fix those manually
[14:16:57] <Spookreeeno>	 Least I can do _joe_
[14:17:06] <volans>	 marostegui: what's test-s1?
[14:17:13] <rzl>	 marostegui: cool, thanks
[14:17:15] <marostegui>	 volans: a test host
[14:17:29] <godog>	 I'm looking at incident 574 on VO, it should have recovered by itself but clearly not
[14:17:47] <cdanis>	 !log ran puppet on alert1001
[14:17:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:18] <_joe_>	 godog: can we remove /ack it?
[14:18:34] <godog>	 _joe_: yeah I ack'd and resolved it now
[14:18:48] <godog>	 the icinga alert isn't firing anymore anyways
[14:20:41] <marostegui>	 rzl: tendril fixed
[14:21:02] <_joe_>	 rzl: so we don't have as precise a timing as last time, but I think we can use the first edits after the switchover as indication, it was less than 2 minutes again
[14:21:08] <rzl>	 hmm, GET 5xxs from appservers are still nonzero, anyone able to logdive and see what that's about?
[14:21:29] <_joe_>	 rzl: sure, but... that's not unusual IIRC?
[14:21:41] <marostegui>	 s3 and s8 seem to have recovered from high latencies
[14:21:42] <_joe_>	 rzl: what are you basing your statement on?
[14:21:43] <rzl>	 they look a little higher than codfw pre-switch
[14:21:50] <rzl>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now&var-datasource=codfw%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200 before
[14:21:57] <rzl>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200 after
[14:22:20] <rzl>	 sorry those are the whole page -- I'm at 5XX Error rate by HTTP method
[14:22:44] <jynus>	 this was the top failing url POST https://intake-analytics.wikimedia.org/v1/events?hasty=true
[14:23:00] <_joe_>	 that's eventgate-analytics
[14:23:14] <rzl>	 we usually serve the occasional 500, but not a consistent ~0.1%
[14:23:18] <_joe_>	 let's see fatalmonitor
[14:23:43] <cdanis>	 https://logstash.wikimedia.org/goto/26e46707185208cdd2df18155ce8ea47
[14:23:46] <rzl>	 jynus: whatever I'm looking at, it's GET not POST
[14:23:48] <cdanis>	 I think it's restbase?
[14:23:49] <_joe_>	 [{exception_id}] {exception_url} Wikimedia\Rdbms\DBReadOnlyError from line 994 of /srv/mediawiki/php-1.36.0-wmf.14/includes/libs/rdbms/database/Database.php: Database is read-only: You can't edit now. This is because of maintenance. Copy and save your t 
[14:23:49] <jynus>	 lots of failures from grafana
[14:24:02] <_joe_>	 oh sorry
[14:24:22] <cdanis>	 most of the 5xx I'm seeing right now are for /api/rest_v1/page/related queries
[14:24:25] <_joe_>	 logstash is useless
[14:24:32] <ottomata>	 hello looking?
[14:24:32] <cdanis>	 filtering out the grafana and the non-GETs
[14:24:42] <ottomata>	 intake-analytics is eventgate-analytics-external
[14:24:52] <_joe_>	 logstash is lagged, don't trust it
[14:24:57] <cdanis>	 oh, great
[14:24:58] <jynus>	 ottomata: that was overall, they may not be happening now
[14:25:03] <ottomata>	 ok
[14:25:06] <jynus>	 in fact I think it is what _joe_ says
[14:25:12] <cdanis>	 shouldn't we have an alert about that?
[14:25:13] <jynus>	 delayed ones
[14:25:44] <_joe_>	 oh no sorry
[14:25:50] <_joe_>	 those are jobrunners
[14:25:57] <_joe_>	 with hanging jobs
[14:26:02] <wikibugs>	 (PS1) Ema: Add 1.8 to NEWS [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/636665
[14:26:03] <jynus>	 on codfw?
[14:26:04] <wikibugs>	 (PS1) Ema: 1.8: add 'ABI vrt' to vmod_netmapper.vcc [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/636666 (https://phabricator.wikimedia.org/T266567)
[14:26:27] <marostegui>	 we still have jobrunners attempting to write to codfw according to logstash, yes
[14:26:31] <_joe_>	 yes, videoscaling
[14:26:37] <_joe_>	 let me fix it
[14:26:47] <effie>	 could it be jobs still running?
[14:26:58] <jynus>	 maybe those could be very delayed as they disconnect and can take a long time
[14:27:17] <effie>	 or finished after the switchover anyway
[14:27:29] <jynus>	 yeah, that is what I mean
[14:27:38] <jynus>	 errors are still going down on logstash
[14:27:39] <_joe_>	 !log restart php-fpm on jobrunners in codfw
[14:27:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:47] <_joe_>	 now they should go away
[14:27:50] <dcausse>	 cdanis: /api/rest_v1/page/related is probably CirrusSearch, I still see some rejections despite the mitigations put in place (but should be fine by now, I no longer see rejections)
[14:28:10] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1096 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:28:33] <jynus>	 is wikidata dispatching enabled? write load on wikidata seems quite low
[14:28:34] <_joe_>	 rzl: sorry now that I've solved the spamminess on the jobrunners in codfw
[14:28:38] <cdanis>	 yeah dcausse restbase is still serving errors there
[14:28:55] <rzl>	 jynus: it should be, we ran that step, double-checking
[14:28:57] <cdanis>	 not an alarming absolute rate, but I think it is the majority of 5xx right now
[14:29:02] <_joe_>	 was it appservers or api-appservers?
[14:29:10] <rzl>	 _joe_: appservers is where I was looking
[14:29:39] <rzl>	 api-appserver 5xxs are an order of magnitude lower, looks much more normal
[14:29:55] <_joe_>	 rzl: I can't see errors right now on a random appserver
[14:31:01] <_joe_>	 oh I think I know what the problem is, persistent connections in cpjobqueue
[14:31:10] <rzl>	 jynus: I do see the wikidata dispatchers running in eqiad
[14:31:34] <jynus>	 rzl: I trust you, I am just seeing a very different workload before and now
[14:31:53] <rzl>	 yeah for sure, just ruling out one cause :)
[14:31:58] <rzl>	 hmm
[14:32:06] <_joe_>	 can someone please check if the codfw lvs for jobrunners is still receiving requests?
[14:32:08] <marostegui>	 I think it is pretty similar already: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=7&orgId=1&from=now-6h&to=now&refresh=1m&var-site=All&var-group=core&var-shard=All&var-role=All
[14:32:20] <jynus>	 marostegui: see https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-server=db1104&var-port=9104&from=1603798302103&to=1603809102104 vs https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-server=db2079&var-port=9104&from=1603798335564&to=1603809135564
[14:32:55] <jynus>	 could be not necessarily us
[14:33:02] <_joe_>	 akosiaris: can you do a rolling restart of cpjobqueue in codfw?
[14:33:07] <jynus>	 e.g. maybe bots were stopped/failed
[14:33:52] <marostegui>	 jynus: it is not that different if you zoom in I think (checking also the figures on the right side): https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-server=db2079&var-port=9104&from=1603805967656&to=1603807409504 vs https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-server=db1104&var-port=9104&from=1603806915643&to=1603809102104
[14:33:52] <akosiaris>	 sure
[14:33:55] <akosiaris>	 _joe_: sure
[14:34:01] <jynus>	 marostegui: ok
[14:34:26] <jynus>	 more spiky, though
[14:34:55] <jynus>	 although it may have stabilized now
[14:35:23] <marostegui>	 yeah, a bit more spiky but that can be related to coldness too
[14:36:22] <jynus>	 labsdb wikireplicas looking good
[14:36:30] <_joe_>	 akosiaris: turning it off and on is ok too
[14:36:50] <akosiaris>	 !log rolling restart of all pods in codfw changeprop-jobqueue
[14:36:54] <akosiaris>	 _joe_: already underway
[14:36:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:58] <wikibugs>	 (CR) Vgutierrez: [C: +1] Add 1.8 to NEWS [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/636665 (owner: Ema)
[14:37:09] <_joe_>	 akosiaris: 
[14:37:12] <_joe_>	 thanks :)
[14:37:19] <wikibugs>	 (CR) Vgutierrez: [C: +1] 1.8: add 'ABI vrt' to vmod_netmapper.vcc [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/636666 (https://phabricator.wikimedia.org/T266567) (owner: Ema)
[14:37:21] <wikibugs>	 (CR) Ema: [V: +2 C: +2] Add 1.8 to NEWS [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/636665 (owner: Ema)
[14:38:03] <wikibugs>	 (CR) Ema: [V: +2 C: +2] 1.8: add 'ABI vrt' to vmod_netmapper.vcc [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/636666 (https://phabricator.wikimedia.org/T266567) (owner: Ema)
[14:39:48] <akosiaris>	 it's taking its sweet time
[14:40:12] <_joe_>	 !log restarting envoyproxy on the jobrunners in codfw
[14:40:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:17] <_joe_>	 akosiaris: I found the right solution
[14:40:39] <_joe_>	 the errors in logstash should go down now
[14:41:06] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:41:29] <akosiaris>	 ok
[14:42:00] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 23 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:42:13] <wikibugs>	 (PS1) Ema: Merge branch 'master' into debian [software/varnish/libvmod-netmapper] (debian) - https://gerrit.wikimedia.org/r/636668
[14:42:15] <wikibugs>	 (PS1) Ema: Release 1.9-1 [software/varnish/libvmod-netmapper] (debian) - https://gerrit.wikimedia.org/r/636669 (https://phabricator.wikimedia.org/T266567)
[14:42:28] <wikibugs>	 Operations, observability: Two close pages for idle workers api + appserver didn't auto-resolve on recovery - https://phabricator.wikimedia.org/T266570 (fgiunchedi)
[14:42:32] <_joe_>	 rzl: I can't see evidence of an elevated error rate now besides the jobrunners I just fixed with the old dear "wrench in the wheels" method
[14:42:50] <rzl>	 _joe_: works for me
[14:42:54] <wikibugs>	 (CR) jerkins-bot: [V: -1] Release 1.9-1 [software/varnish/libvmod-netmapper] (debian) - https://gerrit.wikimedia.org/r/636669 (https://phabricator.wikimedia.org/T266567) (owner: Ema)
[14:43:06] <wikibugs>	 (CR) Ema: [V: +2 C: +2] Merge branch 'master' into debian [software/varnish/libvmod-netmapper] (debian) - https://gerrit.wikimedia.org/r/636668 (owner: Ema)
[14:43:27] <_joe_>	 rzl: we should add a -restart-envoy-jobrunners step
[14:43:34] <rzl>	 noted
[14:43:52] <_joe_>	 it just serves to ensure we obliterate the perennial connections changeprop creates
[14:44:15] <_joe_>	 after that connection dies, changeprop re-resolves the ip and correctly goes to eqiad
[14:44:15] <wikibugs>	 (CR) Ema: [V: +2 C: +2] Release 1.9-1 [software/varnish/libvmod-netmapper] (debian) - https://gerrit.wikimedia.org/r/636669 (https://phabricator.wikimedia.org/T266567) (owner: Ema)
[14:44:18] <rzl>	 nod
[14:45:46] <wikibugs>	 (PS1) Mholloway: [BETA] Fix Echo Push Beta Cluster config [mediawiki-config] - https://gerrit.wikimedia.org/r/636670
[14:46:42] <rzl>	 _joe_: it's gone to zero on grafana too, nice
[14:47:15] <rzl>	 okay, are we still tracking anything else?
[14:47:36] <rzl>	 marostegui: are you happy?
[14:47:45] <marostegui>	 rzl: all good here
[14:48:29] <rzl>	 if no objections I'll restore the TTLs
[14:48:34] <_joe_>	 +1
[14:48:57] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl
[14:49:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:04] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] grafana: add user sync from LDAP [puppet] - https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712) (owner: Filippo Giunchedi)
[14:49:11] <wikibugs>	 (PS3) Filippo Giunchedi: grafana: add user sync from LDAP [puppet] - https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712)
[14:49:23] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0)
[14:49:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:15] <_joe_>	 https://grafana.wikimedia.org/d/000000208/edit-count?viewPanel=8&orgId=1&refresh=5m&from=now-1h&to=now
[14:50:26] <_joe_>	 pretty nice
[14:50:53] <wikibugs>	 (CR) MSantos: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/636670 (owner: Mholloway)
[14:51:15] <rzl>	 zooming out to the 3h view it's still lower than before -- wouldn't be surprised if some bots crashed on the first "you can't edit right now" and need to be restarted
[14:55:10] <rzl>	 at some point we should figure out how short a read-only period to claim this time :D but for now it looks like we're all clear
[14:55:14] <rzl>	 thanks everybody, nice work
[14:55:15] <ema>	 !log upload libvmod-netmapper 1.9-1 to buster-wikimedia component/varnish6 T266567
[14:55:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:21] <stashbot>	 T266567: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567
[14:55:25] <_joe_>	 rzl: that happens at *every* switchover
[14:55:43] <rzl>	 heh yeah, figures
[14:55:43] <bblack>	 good argument for regularly returning artificial versions of such rare responses at a low rate
[14:55:46] <_joe_>	 nice work indeed everyone :)
[14:55:54] <bblack>	 so developers aren't so surprised by them
[14:56:09] <_joe_>	 yeah let's surprise users instead :P
[14:56:14] <bblack>	 yes! :)
[14:56:42] <wikibugs>	 (PS1) Niedzielski: admin: remove niedzielski [puppet] - https://gerrit.wikimedia.org/r/636671
[15:00:19] <wikibugs>	 (CR) Reedy: [C: -1] admin: remove niedzielski (1 comment) [puppet] - https://gerrit.wikimedia.org/r/636671 (owner: Niedzielski)
[15:03:16] <wikibugs>	 (CR) Muehlenhoff: "@Stephen, once" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/636671 (owner: Niedzielski)
[15:07:53] <wikibugs>	 (PS10) Bstorm: toolforge: script to make long-running processes on bastions less good [puppet] - https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300)
[15:08:21] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: Andrew Bogott)
[15:10:11] <wikibugs>	 (CR) Bstorm: toolforge: script to make long-running processes on bastions less good (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) (owner: Bstorm)
[15:12:41] <wikibugs>	 (PS3) Muehlenhoff: Grafana config changes for CAS-enabled grafana-rw.w.o vhost [puppet] - https://gerrit.wikimedia.org/r/629122 (https://phabricator.wikimedia.org/T262512)
[15:13:21] <ema>	 !log cp4032: varnish-frontend-restart with libvmod-netmapper 1.9-1 T266567
[15:13:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:27] <stashbot>	 T266567: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567
[15:14:31] <ottomata>	 NICE STUFF!
[15:15:52] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[15:17:13] <wikibugs>	 (CR) Jbond: systemd::timer::job: switch monitoring_enabled default to false (1 comment) [puppet] - https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[15:18:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:20:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:21:42] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:25:33] <ema>	 !log cp4032: downgrade varnish to 6.0.4 T264398
[15:25:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:39] <stashbot>	 T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398
[15:26:03] <wikibugs>	 (CR) DannyS712: [C: +1] Add growthexperiments to allowed logtypes [puppet] - https://gerrit.wikimedia.org/r/636436 (https://phabricator.wikimedia.org/T266477) (owner: Urbanecm)
[15:26:13] <wikibugs>	 (PS1) Jcrespo: Revert "mariadb: Set db1077 in read-write" [puppet] - https://gerrit.wikimedia.org/r/636686
[15:26:15] <wikibugs>	 (CR) Filippo Giunchedi: systemd::timer::job: switch monitoring_enabled default to false (1 comment) [puppet] - https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[15:26:26] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "mariadb: Set db1077 in read-write" [puppet] - https://gerrit.wikimedia.org/r/636686 (owner: Jcrespo)
[15:26:39] <wikibugs>	 Operations, netops: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (ayounsi) Stalled→Resolved a:ayounsi All done here. I don't think it's worth doing more intrusive testing.
[15:28:16] <wikibugs>	 (PS2) Jcrespo: mariadb: Cleanup old cron deletions after some time after deploy [puppet] - https://gerrit.wikimedia.org/r/636644 (https://phabricator.wikimedia.org/T265138)
[15:28:18] <wikibugs>	 (PS2) Jcrespo: Revert "mariadb: Set db1077 in read-write" [puppet] - https://gerrit.wikimedia.org/r/636686
[15:28:18] <wikibugs>	 Operations, Traffic: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567 (ema) Open→Resolved a:ema Done in libvmod-netmapper 1.9-1, closing.
[15:28:27] <wikibugs>	 (PS3) Jcrespo: Revert "mariadb: Set db1077 in read-write" [puppet] - https://gerrit.wikimedia.org/r/636686
[15:28:30] <wikibugs>	 (Abandoned) DannyS712: SpecialInvestigateBlock: Don't assume 'DisableUTEdit' exists [extensions/CheckUser] (wmf/1.36.0-wmf.10) - https://gerrit.wikimedia.org/r/631478 (https://phabricator.wikimedia.org/T264302) (owner: Dbarratt)
[15:28:44] <wikibugs>	 (Abandoned) DannyS712: SpecialInvestigateBlock: Don't assume 'DisableUTEdit' exists [extensions/CheckUser] (wmf/1.36.0-wmf.11) - https://gerrit.wikimedia.org/r/631466 (https://phabricator.wikimedia.org/T264302) (owner: Jforrester)
[15:29:05] <wikibugs>	 (CR) Jcrespo: "This was a configuration not reverted, not noticed until eqiad was primary again." [puppet] - https://gerrit.wikimedia.org/r/636686 (owner: Jcrespo)
[15:30:13] <wikibugs>	 (CR) Jcrespo: [C: +2] Revert "mariadb: Set db1077 in read-write" [puppet] - https://gerrit.wikimedia.org/r/636686 (owner: Jcrespo)
[15:30:54] <wikibugs>	 Operations, Traffic, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (ema) With T266567 out of the way, we can now try different Varnish 6 versions, at least as long as they're VRT-compatible.
[15:31:05] <wikibugs>	 (Abandoned) DannyS712: Collect data about CodeMirror preference usage [extensions/WikimediaEvents] (wmf/1.36.0-wmf.11) - https://gerrit.wikimedia.org/r/631229 (https://phabricator.wikimedia.org/T260138) (owner: Krinkle)
[15:31:47] <wikibugs>	 (PS5) Jbond: systemd::timer::job: switch monitoring_enabled default to false [puppet] - https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138)
[15:32:09] <wikibugs>	 (CR) Jbond: "updated thanks" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[15:36:58] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[15:38:43] <wikibugs>	 Operations, Patch-Needs-Improvement: logrotate cronspam on ms-be1040 - https://phabricator.wikimedia.org/T205974 (fgiunchedi) Open→Invalid IIRC this was fixed, boldly resolving
[15:39:17] <wikibugs>	 Operations, Patch-Needs-Improvement: puppet should try to mount all mountable swift filesystems - https://phabricator.wikimedia.org/T126574 (fgiunchedi) a:fgiunchedi→None
[15:39:32] <wikibugs>	 Operations, observability, User-CDanis, User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (fgiunchedi) a:fgiunchedi→None
[15:42:28] <wikibugs>	 (CR) Ema: Exclude predefined user agents from eventlogging data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: Ottomata)
[15:45:06] <wikibugs>	 Operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750 (MoritzMuehlenhoff)
[15:46:13] <wikibugs>	 (CR) Jcrespo: [C: +1] "Ok as it is, although it could also be set as master on hiera (slightly more conventional), but not a huge issue given it is not the final" [puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003) (owner: Kormat)
[15:48:10] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:48:19] <wikibugs>	 (PS2) Elukey: base::standard_packages: avoid mcelog with kernels >= 4.12 [puppet] - https://gerrit.wikimedia.org/r/636560
[15:49:42] <wikibugs>	 (PS3) Elukey: base::standard_packages: avoid mcelog with kernels >= 4.12 [puppet] - https://gerrit.wikimedia.org/r/636560
[15:49:45] <sukhe>	 /query rzl very nice work on the switchovers!
[15:49:47] <sukhe>	 er
[15:49:48] <sukhe>	 ha
[15:49:54] <rzl>	 haha thank you! :D
[15:49:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:50:15] <sukhe>	 irssi plugin idea: intercept private messages
[15:50:19] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/636560 (owner: Elukey)
[15:50:50] <rzl>	 the credit goes to everyone else for doing the work to make sure everything went right when I pressed the button
[15:51:15] <sukhe>	 I don't understand much of it so I have been following to try to learn what and how it happens :)
[15:52:25] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/636560 (owner: Elukey)
[15:53:56] <wikibugs>	 (CR) Elukey: [C: +2] base::standard_packages: avoid mcelog with kernels >= 4.12 [puppet] - https://gerrit.wikimedia.org/r/636560 (owner: Elukey)
[15:56:41] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission-hardware: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (Papaul) ` [edit interfaces interface-range vlan-private1-c-codfw] -    member-range ge-4/0/1 to ge-4/0/10; [edit interfaces interface-range disabled]      member...
[15:58:07] <WXIII>	 just wondering... how long was the read-only this time?
[15:58:59] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[15:59:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:38] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (elukey)
[16:01:52] <wikibugs>	 Operations, ops-eqiad, DC-Ops: hardware troubleshooting: PSU Failure on wtp1033 - https://phabricator.wikimedia.org/T266575 (wiki_willy)
[16:02:22] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:05:02] <rzl>	 just missed WXIII but for anyone else who's wondering, the final score this time was 1 minute 49 seconds :)
[16:05:18] <elukey>	 !!!!
[16:05:20] <logmsgbot>	 !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:05:21] <elukey>	 nice!
[16:05:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:44] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[16:05:45] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[16:05:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:52] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission-hardware: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (Papaul)
[16:05:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:55] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[16:05:55] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[16:05:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:14] <rzl>	 (plus a brief period of sluggishness afterward, which is normal -- during which edit rates were nonzero but lower than usual)
[16:06:22] <wikibugs>	 (CR) Mholloway: [C: +2] [BETA] Fix Echo Push Beta Cluster config [mediawiki-config] - https://gerrit.wikimedia.org/r/636670 (owner: Mholloway)
[16:07:31] <wikibugs>	 (Merged) jenkins-bot: [BETA] Fix Echo Push Beta Cluster config [mediawiki-config] - https://gerrit.wikimedia.org/r/636670 (owner: Mholloway)
[16:12:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:12:50] <wikibugs>	 (PS2) Matthias Mullie: Add another SDC property to search for matching media statements [mediawiki-config] - https://gerrit.wikimedia.org/r/633982 (https://phabricator.wikimedia.org/T264925)
[16:12:57] <wikibugs>	 (CR) Matthias Mullie: [C: +1] Add another SDC property to search for matching media statements [mediawiki-config] - https://gerrit.wikimedia.org/r/633982 (https://phabricator.wikimedia.org/T264925) (owner: Matthias Mullie)
[16:14:08] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:14:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:18:26] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1096 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:14] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:24:43] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: hardware troubleshooting: PSU Failure on wtp1033 - https://phabricator.wikimedia.org/T266575 (10Cmjohnson) 05Open→03Resolved power cable was loose...fixed
[16:25:38] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Papaul)
[16:28:42] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:28:44] <icinga-wm>	 PROBLEM - Check systemd state on idp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:29:52] <wikibugs>	 (03PS1) 10Cwhite: Enable search slowlog by default for ECS indices. [software/ecs] - 10https://gerrit.wikimedia.org/r/636685
[16:30:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:30:48] <icinga-wm>	 PROBLEM - IPMI Sensor Status on analytics1072 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:34:02] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[16:34:05] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[16:34:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:32] <icinga-wm>	 RECOVERY - IPMI Sensor Status on wtp1033 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:51:51] <wikibugs>	 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (10Urbanecm) @Ammarpad Thanks for your comment. I wasn't able to open any of those PDFs with any PDF viewer installed (all browsers I...
[16:54:27] <wikibugs>	 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (10RhinosF1) I get ERR_CONNECTION_CLOSED on https://en.wikipedia.org/api/rest_v1/page/pdf/Rail_transport_modelling
[16:58:28] <wikibugs>	 (03PS1) 10JMeybohm: Update patches and changelog [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/636713
[16:58:37] <wikibugs>	 10Operations: Phase out DSA keys for SSH access (ssh-dss) - https://phabricator.wikimedia.org/T177371 (10akosiaris) a:05akosiaris→03None
[17:01:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Update patches and changelog [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/636713 (owner: 10JMeybohm)
[17:01:28] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Update patches and changelog [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/636713 (owner: 10JMeybohm)
[17:01:46] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1098 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:09:03] <wikibugs>	 10Operations, 10Puppet: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 (10jbond)
[17:13:17] <wikibugs>	 (03PS2) 10Razzi: geoip: cleanup having moved archiving to launcher [puppet] - 10https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152)
[17:14:37] <wikibugs>	 10Operations: IPMI Audit 2018-04 - https://phabricator.wikimedia.org/T193155 (10Volans) 05Open→03Resolved Resolving as it's a too old audit now, we could re-run it again if needed, but we've also alerts that check most of those scenarios.
[17:16:50] <wikibugs>	 10Operations, 10User-Kormat: cumin: If no command is provided, output nodelist to stdout - https://phabricator.wikimedia.org/T261861 (10Volans) This is fixed in master in 6d16ec2 (list of hosts to stdout, rest to stderr), but not yet deployed.
[17:19:40] <wikibugs>	 10Operations, 10Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10Volans) a:05Volans→03None
[17:20:41] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: remove absented cron jobs for Bugzilla updates [puppet] - 10https://gerrit.wikimedia.org/r/636085 (owner: 10Dzahn)
[17:20:44] <wikibugs>	 (03CR) 10Razzi: geoip: cleanup having moved archiving to launcher (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi)
[17:21:24] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: remove 'list_mediawiki_extensions' cron job [puppet] - 10https://gerrit.wikimedia.org/r/636084 (https://phabricator.wikimedia.org/T266024) (owner: 10Dzahn)
[17:22:31] <wikibugs>	 (03CR) 10Elukey: geoip: cleanup having moved archiving to launcher (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi)
[17:22:39] <mutante>	 !log gerrit1001/2001 - sudo rm /var/www/mediawiki-extensions.txt
[17:22:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:00] <wikibugs>	 (03CR) 10Dzahn: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/636084 (https://phabricator.wikimedia.org/T266024) (owner: 10Dzahn)
[17:23:45] <wikibugs>	 (03Abandoned) 10Milimetric: Revert "camus - don't check eqiad topics while DC switchover to codfw is ongoing" [puppet] - 10https://gerrit.wikimedia.org/r/623556 (https://phabricator.wikimedia.org/T261865) (owner: 10Milimetric)
[17:27:09] <wikibugs>	 (03PS1) 10Milimetric: analytics/camus: switch back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/636717 (https://phabricator.wikimedia.org/T261865)
[17:29:33] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] analytics/camus: switch back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/636717 (https://phabricator.wikimedia.org/T261865) (owner: 10Milimetric)
[17:34:36] <wikibugs>	 (03CR) 10Ottomata: Exclude predefined user agents from eventlogging data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata)
[17:35:02] <wikibugs>	 (03PS8) 10Ottomata: Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130)
[17:35:27] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] ldap::client::labs: fix 'Unknown variable: '::restricted..' [puppet] - 10https://gerrit.wikimedia.org/r/633838 (https://phabricator.wikimedia.org/T101447) (owner: 10Dzahn)
[17:36:28] <wikibugs>	 (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/26166/cp1075.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata)
[17:36:34] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata)
[17:40:58] <wikibugs>	 (03CR) 10Zoranzoki21: [C: 03+1] "Thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/636436 (https://phabricator.wikimedia.org/T266477) (owner: 10Urbanecm)
[17:41:15] <wikibugs>	 (03PS1) 10Ottomata: eventgate-analytics-external - bump to image 2020-10-27-173311-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/636722 (https://phabricator.wikimedia.org/T266573)
[17:42:17] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics-external - bump to image 2020-10-27-173311-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/636722 (https://phabricator.wikimedia.org/T266573) (owner: 10Ottomata)
[17:44:20] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[17:44:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:42] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1099 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:46:41] <mutante>	 I think we have new systemd alerts on random hosts sometimes due to recent changes from crons to systemd timers.
[17:46:55] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[17:46:55] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[17:46:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:27] <rzl>	 mutante: yeah -- that's a feature, it means the jobs are monitored for failures we didn't used to know about :)
[17:48:24] <mutante>	 rzl: I think it's the auto_restart_service
[17:48:50] <mutante>	 i once ran systemctl reset-failed on a bunch of them and they all cleared.. but it wasn't a one-time thing due to the change 
[17:49:30] <icinga-wm>	 PROBLEM - Host ms-be1057 is DOWN: PING CRITICAL - Packet loss = 100%
[17:50:31] <mutante>	 yea, example releases2001.. restart_jenkins and restart_rsync .. hrmm
[17:51:42] <mutante>	 "Service jenkins not present or not running" ok:) so trying to restart a service that is not present = fail :)
[17:52:08] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Netbox Error for asw2-d4-eqiad - https://phabricator.wikimedia.org/T265393 (10wiki_willy) 05Open→03Resolved a:05wiki_willy→03Cmjohnson Netbox error is resolved now.   Closing task.  Thanks, Willy
[17:52:09] <mutante>	 and a bunch of failover servers have puppet code to make sure it's absent on the inactive host 
[17:53:01] <mutante>	 yep, one way it happens is when a service is masked .. like jenkins in this case
[17:53:27] <rzl>	 yeah that makes sense
[17:53:35] <rzl>	 sounds like wmf-auto-restart.py should exit 0 in that situation
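A minimal sketch of the behaviour rzl suggests above: a restart helper that exits 0 when the unit is masked, absent, or not running, so the systemd timer that calls it does not end up in a failed state. This is not the actual wmf-auto-restart.py code; the function names `unit_state`, `should_skip` and `run` are hypothetical, and the only external call assumed is `systemctl show` with its `LoadState` and `ActiveState` properties.

```python
import subprocess


def unit_state(service):
    """Return (LoadState, ActiveState) for a unit, read from `systemctl show`."""
    out = subprocess.run(
        ["systemctl", "show", "--property=LoadState,ActiveState", service],
        capture_output=True,
        text=True,
        check=False,
    ).stdout
    props = dict(line.split("=", 1) for line in out.splitlines() if "=" in line)
    # Fall back to "nothing there" if systemctl printed nothing usable.
    return props.get("LoadState", "not-found"), props.get("ActiveState", "inactive")


def should_skip(load_state, active_state):
    """True when there is nothing to restart: unit masked, missing, or not running."""
    return load_state in ("masked", "not-found") or active_state != "active"


def run(service):
    """Hypothetical entry point: returns the exit code the helper would use."""
    load_state, active_state = unit_state(service)
    if should_skip(load_state, active_state):
        # Returning early also avoids any later step that needs the process PID.
        print(f"Service {service} not present or not running, nothing to do")
        return 0
    # A real helper would check for outdated libraries and restart the unit here.
    return 0
```

On a host like releases2001, where jenkins is masked, `should_skip("masked", "inactive")` is true and `run("jenkins")` would return 0 instead of failing the timer unit.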
[17:55:50] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[17:55:50] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[17:55:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:34] <wikibugs>	 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Bstorm) a:05Bstorm→03RobH Assigning to @RobH to start discussion of how to act...
[18:06:44] <wikibugs>	 (03PS1) 10Dzahn: wmf-auto-restart: return 0 if service is not present or running [puppet] - 10https://gerrit.wikimedia.org/r/636728
[18:09:30] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:12:04] <wikibugs>	 (03PS1) 10Volans: cli: change confirmation input check [software/cumin] - 10https://gerrit.wikimedia.org/r/636729
[18:13:52] <wikibugs>	 (03PS2) 10Volans: cli: change confirmation input check [software/cumin] - 10https://gerrit.wikimedia.org/r/636729
[18:14:52] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:15:07] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] cli: change confirmation input check [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 (owner: 10Volans)
[18:15:36] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "I think this will just cause the script to fail at the next step when it tries to get the process PID.  ultimately we shouldn't have servi" [puppet] - 10https://gerrit.wikimedia.org/r/636728 (owner: 10Dzahn)
[18:16:49] <wikibugs>	 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10RobH) a:05RobH→03Cmjohnson So this appears like it is basically requiring some...
[18:18:01] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/26168/" [puppet] - 10https://gerrit.wikimedia.org/r/636420 (https://phabricator.wikimedia.org/T266470) (owner: 10DCausse)
[18:18:32] <wikibugs>	 (03PS1) 10Dzahn: releases: only use auto-restart script if actual rsyncd is present [puppet] - 10https://gerrit.wikimedia.org/r/636730
[18:18:33] <wikibugs>	 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10RobH) p:05Triage→03Medium
[18:21:20] <wikibugs>	 (03CR) 10Jbond: "lgtm but might be worth a pcc just incase" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn)
[18:21:22] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] "ah..no, wait. this needs to be only on the primary" [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn)
[18:23:44] <wikibugs>	 (03PS2) 10Dzahn: releases: only use auto-restart script on primary server [puppet] - 10https://gerrit.wikimedia.org/r/636730
[18:25:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] releases: only use auto-restart script on primary server [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn)
[18:25:52] <wikibugs>	 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Bstorm) Thank you for helping me sort this out.
[18:26:46] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] "jerkins: the variables are already used.. it was copy/paste.. what?" [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn)
[18:27:50] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] "ok.. I see. it's outside the loop..amending" [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn)
[18:27:53] <wikibugs>	 (03PS3) 10Razzi: geoip: cleanup having moved archiving to launcher [puppet] - 10https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152)
[18:28:24] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Cmjohnson) production cables are in port 10 on both switches
[18:30:18] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1101 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/s 14 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:33:07] <wikibugs>	 (03PS3) 10Dzahn: releases: only use auto-restart script on primary server [puppet] - 10https://gerrit.wikimedia.org/r/636730
[18:35:00] <logmsgbot>	 !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[18:35:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:50] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "now it should be fine: https://puppet-compiler.wmflabs.org/compiler1002/26169/" [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn)
[18:53:42] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1101 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:55:18] <wikibugs>	 (03CR) 10Ryan Kemper: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/26170/" [puppet] - 10https://gerrit.wikimedia.org/r/636432 (https://phabricator.wikimedia.org/T255399) (owner: 10DCausse)
[19:08:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:10:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:41:01] <wikibugs>	 (03PS2) 10Ppchelko: Add changeprop rules for newcomerTasksCacheRefreshJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/636078 (https://phabricator.wikimedia.org/T260758) (owner: 10Catrope)
[19:44:17] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+1] "This will work, yeah. Will deploy tomorrow." [deployment-charts] - 10https://gerrit.wikimedia.org/r/636078 (https://phabricator.wikimedia.org/T260758) (owner: 10Catrope)
[19:56:19] <wikibugs>	 10Operations, 10MediaWiki-General, 10Traffic, 10HTTPS: Protocol-relative URLs are poorly supported or unsupported by a number of HTTP clients - https://phabricator.wikimedia.org/T54253 (10Tgr)
[20:00:29] <wikibugs>	 (03PS1) 10Amire80: Add Fahrrad-Datenautobahn (Category:Wikipedia) to the German RSS Planet [puppet] - 10https://gerrit.wikimedia.org/r/636741
[20:07:01] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "looks good to me. thank you for handling these!" [puppet] - 10https://gerrit.wikimedia.org/r/636741 (owner: 10Amire80)
[20:10:38] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] [wdqs] add support for streaming updater lag metric [puppet] - 10https://gerrit.wikimedia.org/r/636432 (https://phabricator.wikimedia.org/T255399) (owner: 10DCausse)
[20:11:18] <wikibugs>	 (03PS1) 10Dzahn: wikistats: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/636743
[20:11:29] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Peachey88)
[20:13:33] <mutante>	 !log gerrit1001/gerrit2001: manually deleting list_mediawiki_extensions cron job (T266024)
[20:13:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:42] <stashbot>	 T266024: Phase out https://gerrit.wikimedia.org/mediawiki-extensions.txt - https://phabricator.wikimedia.org/T266024
[20:21:26] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] releases: only use auto-restart script on primary server [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn)
[20:29:13] <wikibugs>	 (03CR) 10Dzahn: "Notice: /Stage[main]/Rsyslog/File[/etc/rsyslog.d/20-wmf-auto-restart-rsync.conf]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn)
[20:43:20] <icinga-wm>	 RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:43:23] <mutante>	 !log releases2002 - systemctl reset-failed .. after removing wmf_auto_restart_rsync
[20:43:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:30] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:55:09] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10jijiki)
[20:55:15] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: ms-be1057 down - cable disconnected? - https://phabricator.wikimedia.org/T266604 (10Dzahn)
[20:55:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:55:47] <icinga-wm>	 ACKNOWLEDGEMENT - Host ms-be1057 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T266604
[20:56:19] <mutante>	 !log ms-be1057 is network down but running, NO-CARRIER on NIC, cable disconnected?
[20:56:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:36] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:40:26] <mutante>	 !log mwmaint2001 - systemctl reset-failed - mediawiki_job_parser_cache_purging.service
[21:40:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:53] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: ms-be1057 down - cable disconnected? - https://phabricator.wikimedia.org/T266604 (10RobH) p:05Triage→03High The switch port sees this port enabled (admin up) but link down, supporting that it could be a bad cable or cable disconnect.  ` Interface       Admin Link Desc...
[21:43:34] <icinga-wm>	 RECOVERY - Check the last execution of mediawiki_job_parser_cache_purging on mwmaint2001 is OK: OK: Status of the systemd unit mediawiki_job_parser_cache_purging https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:56:49] <wikibugs>	 (03CR) 10Razzi: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26171/" [puppet] - 10https://gerrit.wikimedia.org/r/636514 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi)
[22:01:16] <wikibugs>	 (03PS3) 10Volans: cli: change confirmation input check [software/cumin] - 10https://gerrit.wikimedia.org/r/636729
[22:01:37] <volans>	 rzl: ^^^ hope it addresses your IRC comment from earlier
[22:18:38] <wikibugs>	 (03PS1) 10Dzahn: apache: add 20.wikipedia.org redirect to wikimediafoundation site [puppet] - 10https://gerrit.wikimedia.org/r/636755 (https://phabricator.wikimedia.org/T264367)
[22:20:05] <mutante>	 !log systemctl reset-failed on various servers to see which are coming back later from failed auto_restart and which don't
[22:20:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:58] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1098 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:21:04] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1099 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:21:48] <icinga-wm>	 RECOVERY - Check systemd state on mw1381 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:21:52] <icinga-wm>	 RECOVERY - Check systemd state on kubestage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:22:08] <icinga-wm>	 RECOVERY - Check systemd state on idp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:22:18] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:22:18] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:22:52] <icinga-wm>	 RECOVERY - Check systemd state on kubestage1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:23:35] <wikibugs>	 10Operations, 10serviceops, 10Datacenter-Switchover: Siteinfo timeout during switch datacenter - https://phabricator.wikimedia.org/T266618 (10Volans) p:05Triage→03Medium
[22:23:38] <icinga-wm>	 RECOVERY - Check the last execution of package_builder_Clean_up_build_directory on deneb is OK: OK: Status of the systemd unit package_builder_Clean_up_build_directory https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:24:04] <icinga-wm>	 RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:29:22] <icinga-wm>	 PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:29:52] <wikibugs>	 (03PS3) 10Dzahn: wikistats: allow to 'absent' import/dump crons as well [puppet] - 10https://gerrit.wikimedia.org/r/633845
[22:32:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:34:38] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:42:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/26172/wikistats-wild-tiger.wikistats.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/633845 (owner: 10Dzahn)
[22:44:16] <wikibugs>	 (03PS1) 10Dave Pifke: [WIP] Experimental support for speedscope.app [puppet] - 10https://gerrit.wikimedia.org/r/636759
[22:50:12] <wikibugs>	 (03PS2) 10Dzahn: wikistats: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/636743 (https://phabricator.wikimedia.org/T266479)
[22:50:25] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/26173/wikistats-wild-tiger.wikistats.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/636743 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn)
[22:51:18] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] wikistats: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/636743 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn)
[22:51:23] <wikibugs>	 (03PS3) 10Dzahn: wikistats: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/636743 (https://phabricator.wikimedia.org/T266479)
[22:53:37] <wikibugs>	 (03PS2) 10Dave Pifke: [WIP] Experimental support for speedscope.app [puppet] - 10https://gerrit.wikimedia.org/r/636759
[22:58:17] <wikibugs>	 (03PS10) 10Dzahn: mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138)
[22:58:21] <wikibugs>	 (03PS4) 10Dzahn: puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138)
[22:59:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[23:05:45] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] base/labs: add systemd timer to clean puppet client bucket (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn)
[23:06:32] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] base::labs: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635905 (owner: 10Dzahn)
[23:07:27] <wikibugs>	 (03PS4) 10Dzahn: planet: replace update cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636105 (https://phabricator.wikimedia.org/T265138)
[23:07:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] planet: replace update cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636105 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[23:09:59] <wikibugs>	 (03PS2) 10Dzahn: dumps: rm profile::dumps::distribution::datasets::cleanup_miscdatasets [puppet] - 10https://gerrit.wikimedia.org/r/636087 (https://phabricator.wikimedia.org/T265138)
[23:10:16] <wikibugs>	 (03PS8) 10Dzahn: gerrit: replace cron jobs with systemd timers (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/633857 (https://phabricator.wikimedia.org/T265138)
[23:10:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] gerrit: replace cron jobs with systemd timers (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/633857 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[23:15:48] <wikibugs>	 10Operations, 10Puppet, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10Dzahn) patches belonging to this ticket that had not been linked and are already merged:  topic branch: https://gerrit.wikimedia.org/r/q/t...
[23:17:00] <wikibugs>	 (03PS3) 10Dzahn: base/labs: add systemd timer to clean puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885)
[23:17:13] <wikibugs>	 (03CR) 10Dzahn: "> Patch Set 2:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[23:17:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] base/labs: add systemd timer to clean puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn)
[23:20:06] <wikibugs>	 (03PS4) 10Dzahn: base/labs: add systemd timer to clean puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885)
[23:24:09] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10Andrew)
[23:24:46] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew)
[23:26:06] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) @Cmjohnson, in case you were waiting to do all these in bulk: all remaining cloudvirts are now ready for upgrade.
[23:34:41] <wikibugs>	 (03PS9) 10Dzahn: gerrit: replace clear_gerrit_logs cron job with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/633857 (https://phabricator.wikimedia.org/T265138)
[23:45:49] <wikibugs>	 (03PS5) 10Dzahn: planet: replace update cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636105 (https://phabricator.wikimedia.org/T265138)
[23:57:13] <cmjohnson1>	 Andrewbogott the cloudvirts can be updated anytime?
[23:57:31] <wikibugs>	 (03PS1) 10Bstorm: k8s-haproxy: take steps to fix logging [puppet] - 10https://gerrit.wikimedia.org/r/636765 (https://phabricator.wikimedia.org/T266593)
[23:58:22] <andrewbogott>	 cmjohnson1: yep!  They're empty now.  Most are downtimed already I think.
[23:58:55] <cmjohnson1>	 Cool!! Thanks
[23:59:25] <wikibugs>	 (03CR) 10Bstorm: k8s-haproxy: take steps to fix logging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636765 (https://phabricator.wikimedia.org/T266593) (owner: 10Bstorm)
[23:59:29] <andrewbogott>	 ok, now they're all downtimed :)