[00:00:04] Deploy window No deploys all day! DC Switchover. See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201027T0000)
[00:12:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:13:45] RECOVERY - MariaDB Replica Lag: s4 on db1141 is OK: OK slave_sql_lag Replication lag: 59.87 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:13:45] RECOVERY - MariaDB Replica Lag: s4 on db1143 is OK: OK slave_sql_lag Replication lag: 59.88 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:13:45] RECOVERY - MariaDB Replica Lag: s4 on db1121 is OK: OK slave_sql_lag Replication lag: 59.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:13:47] RECOVERY - MariaDB Replica Lag: s4 on db1138 is OK: OK slave_sql_lag Replication lag: 59.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:13:57] RECOVERY - MariaDB Replica Lag: s4 on db1148 is OK: OK slave_sql_lag Replication lag: 56.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:13:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:14:01] RECOVERY - MariaDB Replica Lag: s4 on dbstore1004 is OK: OK slave_sql_lag Replication lag: 56.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:14:25] RECOVERY - MariaDB Replica Lag: s4 on db1144 is OK: OK slave_sql_lag Replication lag: 52.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:14:45] RECOVERY - MariaDB Replica Lag: s4 on db1149 is OK: OK slave_sql_lag Replication lag: 50.67 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:14:49] RECOVERY - MariaDB Replica Lag: s4 on db1142 is OK: OK slave_sql_lag Replication lag: 50.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:14:51] RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 50.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:14:57] RECOVERY - MariaDB Replica Lag: s4 on db1081 is OK: OK slave_sql_lag Replication lag: 48.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:19] RECOVERY - MariaDB Replica Lag: s4 on db1146 is OK: OK slave_sql_lag Replication lag: 42.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:25] RECOVERY - MariaDB Replica Lag: s4 on db1125 is OK: OK slave_sql_lag Replication lag: 40.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:27] RECOVERY - MariaDB Replica Lag: s4 on db1147 is OK: OK slave_sql_lag Replication lag: 40.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:23:00] (PS7) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] -
https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[00:23:47] (CR) jerkins-bot: [V: -1] per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn)
[01:31:45] Operations, ops-eqsin, serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (RobH) For some reason (we found this out a few months ago), Dell Singapore part replacements don't go out with return tags. They require you to call and sch...
[01:35:31] PROBLEM - Check systemd state on an-worker1101 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:02:57] PROBLEM - Check the last execution of package_builder_Clean_up_build_directory on deneb is CRITICAL: CRITICAL: Status of the systemd unit package_builder_Clean_up_build_directory https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:04:09] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:07] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.15 [core] (wmf/1.36.0-wmf.15) - https://gerrit.wikimedia.org/r/636549
[02:09:06] (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.15 [core] (wmf/1.36.0-wmf.15) - https://gerrit.wikimedia.org/r/636549 (https://phabricator.wikimedia.org/T263181) (owner: TrainBranchBot)
[02:09:33] (CR) DannyS712: "No deployment this week, though should this be merged anyway? Or just abandoned?"
[core] (wmf/1.36.0-wmf.15) - https://gerrit.wikimedia.org/r/636549 (https://phabricator.wikimedia.org/T263181) (owner: TrainBranchBot)
[03:27:51] PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:27] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:30:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:35] PROBLEM - Check systemd state on an-worker1097 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:53:13] PROBLEM - Check systemd state on an-worker1100 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:32:13] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:32:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:55:19] PROBLEM - Check systemd state on dbprov1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:08:54] Operations, SRE-tools: Systemd session creation fails under I/O load - https://phabricator.wikimedia.org/T199911 (Marostegui)
[06:09:03] Operations, Data-Persistence-Backup, SRE-tools: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323 (Marostegui) Declined→Open This has happened again: ` [05:55:19] <+icinga-wm> PROBLEM - Check systemd state on dbprov1003 is CRITICAL: CRITICAL - degrad...
[06:35:56] (PS1) JMeybohm: Don't send duplicate events from resync to sink [software/heptiolabs/eventrouter] (v0.3-wmf) - https://gerrit.wikimedia.org/r/636553 (https://phabricator.wikimedia.org/T262675)
[06:35:58] (PS1) JMeybohm: Lower label cardinality of prometheus metrics [software/heptiolabs/eventrouter] (v0.3-wmf) - https://gerrit.wikimedia.org/r/636554 (https://phabricator.wikimedia.org/T262675)
[06:37:17] (CR) JMeybohm: [V: +2 C: +2] Don't send duplicate events from resync to sink [software/heptiolabs/eventrouter] (v0.3-wmf) - https://gerrit.wikimedia.org/r/636553 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[06:37:22] (CR) JMeybohm: [V: +2 C: +2] Lower label cardinality of prometheus metrics [software/heptiolabs/eventrouter] (v0.3-wmf) - https://gerrit.wikimedia.org/r/636554 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[06:40:01] (PS1) JMeybohm: eventrouter: don't send duplicate events, fix metrics [docker-images/production-images] - https://gerrit.wikimedia.org/r/636555 (https://phabricator.wikimedia.org/T262675)
[06:40:32] (CR) JMeybohm: [V: +2 C: +2] eventrouter: don't send duplicate events, fix metrics [docker-images/production-images] - https://gerrit.wikimedia.org/r/636555 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[06:42:20] !log T263970 Set number of replicas to 2 (from previous value of 1) for all codfw indices matching
`apifeatureusage*`, new shards have been assigned without issue
[06:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:27] T263970: ElasticSearch unassigned shard check apifeatureusage-2020.06.30@codfw and enwiki_general_1587198756@eqiad - https://phabricator.wikimedia.org/T263970
[06:43:41] (PS1) JMeybohm: eventrouter: Pump image version and resources [deployment-charts] - https://gerrit.wikimedia.org/r/636556 (https://phabricator.wikimedia.org/T262675)
[06:44:02] (PS2) JMeybohm: eventrouter: Bump image version and resources [deployment-charts] - https://gerrit.wikimedia.org/r/636556 (https://phabricator.wikimedia.org/T262675)
[06:48:21] (CR) JMeybohm: [C: +2] eventrouter: Bump image version and resources [deployment-charts] - https://gerrit.wikimedia.org/r/636556 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[06:50:24] !log published docker-registry.discovery.wmnet/eventrouter:0.3.0-4
[06:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:09] (Merged) jenkins-bot: eventrouter: Bump image version and resources [deployment-charts] - https://gerrit.wikimedia.org/r/636556 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[06:58:37] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' .
[06:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:11] (CR) Elukey: geoip: cleanup having moved archiving to launcher (2 comments) [puppet] - https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152) (owner: Razzi)
[07:01:19] (PS1) Marostegui: orchestrator.conf: Change RecoveryPeriodBlockSeconds default [puppet] - https://gerrit.wikimedia.org/r/636558 (https://phabricator.wikimedia.org/T265990)
[07:02:51] (CR) Elukey: "Let's also remove all the profile::tlsproxy::service::* configs in hiera :)" [puppet] - https://gerrit.wikimedia.org/r/636514 (https://phabricator.wikimedia.org/T240439) (owner: Razzi)
[07:06:29] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:55] RECOVERY - Check systemd state on an-worker1097 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:18:25] (PS1) Elukey: base::standard_packages: avoid mcelog with kernels >= 4.12 [puppet] - https://gerrit.wikimedia.org/r/636560
[07:19:30] (PS8) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[07:21:58] (CR) JMeybohm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/634936 (owner: Giuseppe Lavagetto)
[07:22:25] (CR) JMeybohm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/634933 (owner: Giuseppe Lavagetto)
[07:29:22] (Abandoned) Tobias Andersson: Add new slow-bot group for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/618245 (https://phabricator.wikimedia.org/T258354) (owner: Tobias Andersson)
[07:31:02] Operations, Commons, DBA, Platform Engineering, and 2 others: Increase on
database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (Marostegui) Another spike yesterday on DELETEs {F32415982} Checking binlogs from 22:17 to 22:22...
[07:35:38] !log swift codfw-prod: bump object weight for ms-be2057 - T261633
[07:35:38] (CR) DCausse: [C: +1] Increase cirrus morelike pool counter by 20% [mediawiki-config] - https://gerrit.wikimedia.org/r/636480 (owner: Ebernhardson)
[07:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:44] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633
[07:38:17] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:39:59] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[07:51:51] (CR) Tobias Andersson: [C: +1] Enable propagatePageDeletion on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/636453 (owner: Lucas Werkmeister (WMDE))
[08:04:28] Operations, vm-requests: Site: 1 VM request for Analytics test cluster - https://phabricator.wikimedia.org/T266064 (elukey) Open→Resolved a:elukey This is done!
[08:08:14] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (Marostegui)
[08:08:39] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (Marostegui) Per my chat with Chris, updating the rack location from A2 to A1 and from C2 to C3
[08:15:48] !log update thanos-fe2002 to thanos 0.16.0 - T261281
[08:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:54] T261281: Improve performance of Thanos (+ Prometheus) - https://phabricator.wikimedia.org/T261281
[08:18:45] PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:11] (PS3) Elukey: sre.hadoop.init-hadoop-workers: add more defensive code [cookbooks] - https://gerrit.wikimedia.org/r/636403 (https://phabricator.wikimedia.org/T260411)
[08:21:46] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/636560 (owner: Elukey)
[08:21:52] (CR) Ayounsi: [C: +2] Add Z side device/interface/vlan and cable to PuppetDB importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/634017 (https://phabricator.wikimedia.org/T262899) (owner: Ayounsi)
[08:24:48] (CR) Elukey: [C: +2] sre.hadoop.init-hadoop-workers: add more defensive code [cookbooks] - https://gerrit.wikimedia.org/r/636403 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[08:25:49] (PS1) Muehlenhoff: Fix auto restart on IDP test hosts [puppet] - https://gerrit.wikimedia.org/r/636605
[08:26:33] Operations, Commons, DBA, Platform Engineering, and 2 others: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432
(Marostegui) Another spike from 08:05 to 08:06 and this is what the binlog shows (number of state...
[08:27:17] RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:28] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[08:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[08:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:50] (CR) Filippo Giunchedi: [C: +1] Set expiry headers on thumbnails [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[08:32:09] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[08:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:19] PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:22] (CR) Kormat: [C: +1] orchestrator.conf: Change RecoveryPeriodBlockSeconds default [puppet] - https://gerrit.wikimedia.org/r/636558 (https://phabricator.wikimedia.org/T265990) (owner: Marostegui)
[08:34:19] Operations, Performance-Team, SRE-swift-storage, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (fgiunchedi) From my POV as Swift maintainer I'm ok to go ahead with testing etc. Although it'd be really good to have at...
[08:34:41] (CR) Marostegui: [C: +2] orchestrator.conf: Change RecoveryPeriodBlockSeconds default [puppet] - https://gerrit.wikimedia.org/r/636558 (https://phabricator.wikimedia.org/T265990) (owner: Marostegui)
[08:35:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:37:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:39:48] !log elukey@cumin1001 END (ERROR) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=97)
[08:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:43] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:42:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:43:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:45:28] (PS1) Muehlenhoff: Remove auto restart for pmacctd [puppet] - https://gerrit.wikimedia.org/r/636607
[08:47:19] RECOVERY - Check systemd state on an-worker1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:48:23] RECOVERY - Check systemd state on an-worker1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:32] (PS1) Elukey: sre.hadoop.init-hadoop-workers: avoid wipefs [cookbooks] - https://gerrit.wikimedia.org/r/636608
[08:50:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:52:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:53:23] (CR) Elukey: [C: +2] sre.hadoop.init-hadoop-workers: avoid wipefs [cookbooks] - https://gerrit.wikimedia.org/r/636608 (owner: Elukey)
[08:54:09] (PS2) Gehel: [wdqs-data-reload] load all lexemes chunks [cookbooks] - https://gerrit.wikimedia.org/r/636018 (owner: DCausse)
[08:55:24] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[08:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:47] (CR) Gehel: [C: +2] [wdqs-data-reload] load all lexemes chunks [cookbooks] - https://gerrit.wikimedia.org/r/636018 (owner: DCausse)
[08:56:22] (PS1) Kormat: mariadb: Set both db_inventory nodes read-write [puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003)
[08:56:44] (PS2) Kormat: mariadb: Set both db_inventory nodes read-write [puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003)
[08:58:33] !log elukey@cumin1001 END (ERROR) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=97)
[08:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:53] (PS3) Kormat: mariadb: Set both db_inventory nodes read-write [puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003)
[09:02:06] (CR) Kormat: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1001/26145/" [puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003) (owner: Kormat)
[09:02:20] (CR) Kormat: [C: -2] "Not merging this today." [puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003) (owner: Kormat)
[09:02:35] (CR) Marostegui: [C: +1] "thanks!"
[puppet] - https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003) (owner: Kormat)
[09:04:29] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:40] (CR) Ema: [C: +2] ATS: add metric trafficserver_tls_client_total_time [puppet] - https://gerrit.wikimedia.org/r/635276 (https://phabricator.wikimedia.org/T265869) (owner: Ema)
[09:14:49] Operations, Data-Persistence-Backup, SRE-tools: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323 (jcrespo) So my guess is this is only happening on buster.
[09:15:25] (PS1) Kormat: orchestrator: Search both eqiad and codfw dns [puppet] - https://gerrit.wikimedia.org/r/636613 (https://phabricator.wikimedia.org/T265990)
[09:17:02] (CR) Kormat: [C: -2] "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1003/26146/dborch1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/636613 (https://phabricator.wikimedia.org/T265990) (owner: Kormat)
[09:17:13] (CR) Marostegui: [C: +1] orchestrator: Search both eqiad and codfw dns [puppet] - https://gerrit.wikimedia.org/r/636613 (https://phabricator.wikimedia.org/T265990) (owner: Kormat)
[09:19:21] (PS1) Muehlenhoff: Only handle auto restart of Jenkins on active instance [puppet] - https://gerrit.wikimedia.org/r/636614
[09:20:24] (CR) Ayounsi: [C: +1] Remove auto restart for pmacctd [puppet] - https://gerrit.wikimedia.org/r/636607 (owner: Muehlenhoff)
[09:21:40] (CR) Jbond: [C: +1] "lgtm optional minor nit" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/636560 (owner: Elukey)
[09:21:50] (CR) Muehlenhoff: [C: +2] Remove auto restart for pmacctd [puppet] - https://gerrit.wikimedia.org/r/636607 (owner: Muehlenhoff)
[09:22:34] (CR) Jbond: [C: +2] remote-backup-mariadb: update cron to
systemd::timer::job [puppet] - https://gerrit.wikimedia.org/r/636410 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[09:23:04] jbond42: shall I merge along?
[09:23:04] (PS1) Kormat: orchestrator: Install mariadb client [puppet] - https://gerrit.wikimedia.org/r/636616 (https://phabricator.wikimedia.org/T265990)
[09:23:18] yes go ahead i need a follow up one but that one is fine to go now
[09:23:19] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:24] ack, doing
[09:23:37] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:53] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:49] (CR) Kormat: [C: -2] "-2: Don't merge today" [puppet] - https://gerrit.wikimedia.org/r/636616 (https://phabricator.wikimedia.org/T265990) (owner: Kormat)
[09:25:08] (PS1) Marostegui: orchestrator.conf: Add query to detect alias [puppet] - https://gerrit.wikimedia.org/r/636617 (https://phabricator.wikimedia.org/T266485)
[09:25:17] (PS1) Jbond: mariadb-snapshot: disable monitoring on timer job [puppet] - https://gerrit.wikimedia.org/r/636618
[09:25:54] (CR) Jbond: [V: +2 C: +2] mariadb-snapshot: disable monitoring on timer job [puppet] - https://gerrit.wikimedia.org/r/636618 (owner: Jbond)
[09:26:02] (CR) Kormat: [C: +1] orchestrator.conf: Add query to detect alias [puppet] - https://gerrit.wikimedia.org/r/636617 (https://phabricator.wikimedia.org/T266485) (owner: Marostegui)
[09:26:33] (CR) Marostegui: [C: +1] orchestrator: Install mariadb client [puppet] - https://gerrit.wikimedia.org/r/636616 (https://phabricator.wikimedia.org/T265990) (owner: Kormat)
[09:27:18] (CR) Kormat: "Great, thank you! <3" [puppet] - https://gerrit.wikimedia.org/r/636067 (https://phabricator.wikimedia.org/T266338) (owner: Dzahn)
[09:33:51] (CR) Arturo Borrero Gonzalez: toolforge: script to make long-running processes on bastions less good (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) (owner: Bstorm)
[09:35:36] (CR) Kormat: "Can you update the orchestrator grants template as well, please?" [puppet] - https://gerrit.wikimedia.org/r/636617 (https://phabricator.wikimedia.org/T266485) (owner: Marostegui)
[09:36:05] (CR) Marostegui: "> Patch Set 1: -Code-Review" [puppet] - https://gerrit.wikimedia.org/r/636617 (https://phabricator.wikimedia.org/T266485) (owner: Marostegui)
[09:37:00] (CR) Arturo Borrero Gonzalez: [C: +1] "thanks for working on this!" [puppet] - https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: Andrew Bogott)
[09:40:43] (PS1) Jbond: systemd::timer::job: add complex interval type checking [puppet] - https://gerrit.wikimedia.org/r/636622 (https://phabricator.wikimedia.org/T265138)
[09:43:31] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/636605 (owner: Muehlenhoff)
[09:44:59] (PS1) Ayounsi: ImportPuppetDB: add cable color/type [software/netbox-extras] - https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899)
[09:45:03] (CR) Muehlenhoff: [C: +2] Fix auto restart on IDP test hosts [puppet] - https://gerrit.wikimedia.org/r/636605 (owner: Muehlenhoff)
[09:45:26] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/636614 (owner: Muehlenhoff)
[09:46:06] (CR) Jbond: [C: +2] systemd::timer::job: add complex interval type checking [puppet] - https://gerrit.wikimedia.org/r/636622 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[09:46:23] PROBLEM - eventlogging Varnishkafka
log producer on cp4032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:46:43] PROBLEM - statsv Varnishkafka log producer on cp4032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:46:50] mmmm
[09:46:53] PROBLEM - Webrequests Varnishkafka log producer on cp4032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:47:24] looking
[09:48:05] it says varnishkafka-webrequest.service: Job varnishkafka-webrequest.service/start failed with result 'dependency'.
[09:48:12] anybody working on it?
[09:48:14] (PS2) Ayounsi: ImportPuppetDB: add cable color/type [software/netbox-extras] - https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899)
[09:48:16] (PS3) Ayounsi: Update AssignIPs to handle switch port and cable [software/netbox-extras] - https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339)
[09:48:18] (PS4) Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849
[09:48:20] (PS3) Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339)
[09:48:29] * jbond42 taking a peek at cp4032
[09:48:39] RECOVERY - Webrequests Varnishkafka log producer on cp4032 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:48:42] (CR) jerkins-bot: [V: -1] ImportPuppetDB: add cable color/type [software/netbox-extras] - https://gerrit.wikimedia.org/r/636627
(https://phabricator.wikimedia.org/T262899) (owner: Ayounsi)
[09:48:57] elukey: yeah, I downgraded varnish to 6.0.2-1 and apparently the ABI isn't compatible -.-
[09:49:05] ah!
[09:49:10] okok that explains
[09:49:21] sorry, I was about to !log
[09:49:39] ema: fyi i did a puppet run which started a whole bunch of other varnish related things
[09:49:47] do you want the puppet agent output?
[09:49:59] RECOVERY - eventlogging Varnishkafka log producer on cp4032 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:50:17] RECOVERY - statsv Varnishkafka log producer on cp4032 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:50:23] jbond42: no need, I'm looking at /var/log/puppet.log. Thanks!
[09:50:29] ack
[09:51:48] (PS3) Ayounsi: ImportPuppetDB: add cable color/type [software/netbox-extras] - https://gerrit.wikimedia.org/r/636627 (https://phabricator.wikimedia.org/T262899)
[09:51:50] (PS4) Ayounsi: Update AssignIPs to handle switch port and cable [software/netbox-extras] - https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339)
[09:51:52] (PS5) Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849
[09:51:54] (PS4) Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339)
[09:55:33] (PS1) Jbond: systemd::timer::job: switch monitoring_enabled default to false [puppet] - https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138)
[09:57:03] PROBLEM - Check systemd state on ms-be2040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:17] (03PS2) 10Jbond: systemd::timer::job: switch monitoring_enabled default to false [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) [09:58:49] RECOVERY - Check systemd state on ms-be2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:06] (03PS2) 10Marostegui: orchestrator.conf: Add query to detect alias [puppet] - 10https://gerrit.wikimedia.org/r/636617 (https://phabricator.wikimedia.org/T266485) [10:01:34] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10MoritzMuehlenhoff) @gsingers We have three major types of NDA/MOU under which people get access to PII-sensitive data on our servers: * Everyone who's WMF staff has signed an NDA a... [10:04:05] (03CR) 10Jbond: "PCC running https://puppet-compiler.wmflabs.org/compiler1002/26150/" [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [10:04:57] (03Abandoned) 10Hashar: Add dns entry for zuul.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/634913 (https://phabricator.wikimedia.org/T207008) (owner: 10Ladsgroup) [10:05:00] (03Abandoned) 10Hashar: mediawiki: Funnel zuul.wikimedia.org to integration.wikimedia.org/zuul [puppet] - 10https://gerrit.wikimedia.org/r/634914 (https://phabricator.wikimedia.org/T207008) (owner: 10Ladsgroup) [10:05:11] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Create redirect to integration.wikimedia.org/zuul - https://phabricator.wikimedia.org/T207008 (10hashar) 05Open→03Declined The canonical URL is https://integration.wikimedia.org/zuul/ which one ca... 
[10:05:25] 10Operations, 10Continuous-Integration-Infrastructure, 10DNS, 10Traffic, and 3 others: Create redirect to integration.wikimedia.org/zuul - https://phabricator.wikimedia.org/T207008 (10hashar) [10:06:29] (03PS1) 10Elukey: sre.hadoop.init-hadoop-worker: add explicit systemd reload steps [cookbooks] - 10https://gerrit.wikimedia.org/r/636631 [10:06:38] !log update policies from-zone production to-zone junos-host on mr1-codfw - T265589 [10:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:45] PROBLEM - Host mr1-codfw IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:ffff::6) [10:10:08] ah [10:10:17] forgot to allow icmpv6? [10:11:45] yup [10:12:06] should come back? [10:14:53] RECOVERY - Host mr1-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.58 ms [10:15:03] good [10:15:10] !log update policies from-zone production to-zone junos-host on mr1-esams - T265589 [10:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:04] (03PS1) 10Jcrespo: dbstore_multiinstance: Add profile to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911) [10:19:13] !log update policies from-zone production to-zone junos-host on mr1-ulsfo - T265589 [10:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:40] (03CR) 10Hashar: [C: 03+1] "Indeed GerritExtDistProvider relies on 'repoListUrl' whenever it is set. 
I am not familiar with extension distributor or I would just hav" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636083 (https://phabricator.wikimedia.org/T266024) (owner: 10Legoktm) [10:20:04] (03CR) 10jerkins-bot: [V: 04-1] dbstore_multiinstance: Add profile to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911) (owner: 10Jcrespo) [10:20:15] !log update policies from-zone production to-zone junos-host on mr1-eqsin - T265589 [10:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:39] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) Given that the amount of changes between 5.1.3 and 6.0.6 is considerable, I was thinking of following this "bisect-like" apporach: packa... [10:21:44] !log update policies from-zone production to-zone junos-host on mr1-eqiad - T265589 [10:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:25] elukey, jbond42: FTR this is the sadness of the day https://phabricator.wikimedia.org/T264398#6580942 [10:23:01] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10ayounsi) [10:23:35] * elukey plays sad_trombone.wav for ema [10:24:44] (03CR) 10Elukey: [C: 03+2] sre.hadoop.init-hadoop-worker: add explicit systemd reload steps [cookbooks] - 10https://gerrit.wikimedia.org/r/636631 (owner: 10Elukey) [10:24:56] (03CR) 10Hashar: [C: 03+1] "The cron removal can be deployed." 
[puppet] - 10https://gerrit.wikimedia.org/r/636084 (https://phabricator.wikimedia.org/T266024) (owner: 10Dzahn) [10:25:14] (03PS2) 10Jcrespo: dbstore_multiinstance: Add profile to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911) [10:25:42] (03PS3) 10Jcrespo: dbstore_multiinstance: Add module to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911) [10:26:59] (03CR) 10jerkins-bot: [V: 04-1] dbstore_multiinstance: Add module to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911) (owner: 10Jcrespo) [10:27:45] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [10:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [10:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:49] PROBLEM - Docker registry HTTPS interface on registry2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [10:33:30] (03PS4) 10Giuseppe Lavagetto: Add apache httpd base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) [10:33:32] (03PS1) 10Giuseppe Lavagetto: Add an httpd-fcgi image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324) [10:33:55] PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:34:27] RECOVERY - Docker registry HTTPS interface on registry2002 is OK: HTTP OK: HTTP/1.1 200 OK - 2567 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Docker [10:35:15] (03CR) 10Hashar: [C: 04-1] "As I get it 
the restart sends a SIGTERM to java/Jenkins which would cause it to abruptly terminate all jobs currently running." [puppet] - 10https://gerrit.wikimedia.org/r/636614 (owner: 10Muehlenhoff) [10:35:20] (03PS4) 10Ayounsi: Add switch interface support to decom script [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) [10:35:31] RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:37:39] (03CR) 10Muehlenhoff: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/636614 (owner: 10Muehlenhoff) [10:39:56] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10akosiaris) > I tried installing 6.0.2 on cp4032, and to my surprise I found out that 6.0.6 and 6.0.2 are not binary compatible: {meme, src=f... [10:40:19] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [10:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [10:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:37] (03PS3) 10Jbond: systemd::timer::job: switch monitoring_enabled default to false [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) [10:45:51] (03CR) 10jerkins-bot: [V: 04-1] systemd::timer::job: switch monitoring_enabled default to false [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [10:46:16] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [10:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:01] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers 
(exit_code=0) [10:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:33] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [10:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:57] (03CR) 10Hashar: [C: 04-1] "Yes that got enabled for release Jenkins instances back in May 2018 ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/430562 ). I g" [puppet] - 10https://gerrit.wikimedia.org/r/636614 (owner: 10Muehlenhoff) [10:58:49] (03PS8) 10Filippo Giunchedi: grafana: sync users and roles from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/635559 (https://phabricator.wikimedia.org/T265712) [10:58:51] (03PS1) 10Filippo Giunchedi: grafana: introduce 'profile::grafana::active_host' [puppet] - 10https://gerrit.wikimedia.org/r/636637 (https://phabricator.wikimedia.org/T265712) [10:58:53] (03PS1) 10Filippo Giunchedi: grafana: add user sync from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712) [11:00:24] (03CR) 10jerkins-bot: [V: 04-1] grafana: add user sync from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:01:41] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/26152/" [puppet] - 10https://gerrit.wikimedia.org/r/636637 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:02:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [11:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:13] (03CR) 10Hashar: grafana: sync users and roles from LDAP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635559 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:04:43] (03CR) 10Hashar: [C: 03+1] "Indeed they are "ensure => absent" so that is merely a puppet manifest cleanup. Thx!" 
[puppet] - 10https://gerrit.wikimedia.org/r/636085 (owner: 10Dzahn) [11:06:30] (03PS2) 10Filippo Giunchedi: grafana: add user sync from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712) [11:06:59] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [11:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635559 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:09:07] (03CR) 10Filippo Giunchedi: grafana: sync users and roles from LDAP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635559 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:10:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/636637 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:12:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [11:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:39] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [11:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:05] !log A:cp remove libvarnishapi1, replaced by libvarnishapi2 a while ago T261487 [11:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:10] T261487: Varnish 6.0 needs a SONAME version bump - https://phabricator.wikimedia.org/T261487 [11:14:33] (03PS2) 10Giuseppe Lavagetto: Add an httpd-fcgi image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324) [11:14:57] (03CR) 10Hashar: [C: 03+1] "What Filippo said: one can just run the tests via the Docker container / ./utils/run_ci_locally.sh" (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/635559 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:15:00] (03CR) 10Muehlenhoff: grafana: add user sync from LDAP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:19:03] PROBLEM - Check systemd state on cp3052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [11:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:47] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: introduce 'profile::grafana::active_host' [puppet] - 10https://gerrit.wikimedia.org/r/636637 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:20:09] (03PS4) 10Jcrespo: dbstore_multiinstance: Add module to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911) [11:20:22] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: sync users and roles from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/635559 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:20:57] (03PS5) 10Jcrespo: dbstore_multiinstance: Add module to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911) [11:21:05] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [11:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:48] (03CR) 10Jcrespo: [C: 03+2] dbstore_multiinstance: Add module to cleanup stale scope sessions [puppet] - 10https://gerrit.wikimedia.org/r/636633 (https://phabricator.wikimedia.org/T199911) (owner: 10Jcrespo) [11:24:37] 10Operations, 10Commons, 10DBA, 10Platform Engineering, and 2 others: 
Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10matthiasmullie) At first sight, these DB operations seem to make sense: bots are in the process... [11:25:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [11:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:48] 10Operations, 10DBA, 10Data-Persistence, 10Release-Engineering-Team-TODO, and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (10Kormat) I just came across this: https://github.com/openark/orchestrator/blob/master/docs/ci-env.md#run-orchestrator-with-envir... [11:26:07] RECOVERY - Check systemd state on cp3052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:20] 10Operations, 10Commons, 10DBA, 10Platform Engineering, and 2 others: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10jcrespo) Independently of the source of the issue, could these regenerations be throttled/rate l... 
[11:27:22] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [11:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:39] (03CR) 10Filippo Giunchedi: grafana: add user sync from LDAP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:33:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:33:33] (03PS1) 10Jbond: change to test pcc [puppet] - 10https://gerrit.wikimedia.org/r/636641 [11:35:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [11:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:46] (03PS1) 10Jcrespo: dbprov: Apply the cleanup stale scope sessions to the right set of hosts [puppet] - 10https://gerrit.wikimedia.org/r/636642 (https://phabricator.wikimedia.org/T199911) [11:41:51] (03Abandoned) 10Hashar: Recommendation API: upgrade node to version 10 [puppet] - 10https://gerrit.wikimedia.org/r/560454 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [11:46:54] (03PS2) 10Jcrespo: dbprov: Apply the cleanup stale scope sessions to the right set of hosts [puppet] - 10https://gerrit.wikimedia.org/r/636642 (https://phabricator.wikimedia.org/T199911) [11:49:52] (03CR) 10Jcrespo: [C: 03+2] dbprov: Apply the cleanup stale scope sessions to the right set of hosts [puppet] - 10https://gerrit.wikimedia.org/r/636642 (https://phabricator.wikimedia.org/T199911) (owner: 10Jcrespo) [11:50:13] (03PS1) 10Jcrespo: mariadb: Cleanup old cron deletions after some time after deploy [puppet] - 10https://gerrit.wikimedia.org/r/636644 (https://phabricator.wikimedia.org/T265138) [11:52:11] (03CR) 10Jcrespo: "To be deployed in a few days." 
[puppet] - 10https://gerrit.wikimedia.org/r/636644 (https://phabricator.wikimedia.org/T265138) (owner: 10Jcrespo) [11:52:46] (03PS1) 10Volans: requests: add new module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/636645 [11:57:15] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:29] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:03] RECOVERY - Check systemd state on dbprov1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:06] (03CR) 10Alexandros Kosiaris: [C: 03+1] sretest: Experiment with preserving docker rules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/634192 (owner: 10Alexandros Kosiaris) [12:31:31] (03PS4) 10Alexandros Kosiaris: sretest: Experiment with preserving docker rules [puppet] - 10https://gerrit.wikimedia.org/r/634192 [12:44:43] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:17] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [12:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:19] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:55:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [12:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:48] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [12:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:08] I promise I am going to stop very soon, few hosts to go :) [12:57:11] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:59:11] (03PS1) 10Jbond: pcc: expore posting pcc to gerrit comments [puppet] - 10https://gerrit.wikimedia.org/r/636652 [13:01:54] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [13:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:09] (03CR) 10Jbond: "@paladox, hashar wonder if you know of a way to authenticate to gerrit api using some type of user token like Jenkins instead of requiring" [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond) [13:03:01] PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:03:19] (03CR) 10Paladox: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond) [13:03:21] PROBLEM - Docker registry HTTPS interface on registry2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Docker [13:04:28] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [13:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:59] RECOVERY - Docker registry HTTPS interface on registry2001 is OK: HTTP OK: HTTP/1.1 200 OK - 2567 bytes in 0.315 second response time https://wikitech.wikimedia.org/wiki/Docker [13:05:27] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond) [13:06:00] (03CR) 10Jbond: "will leave this here as the other changes may be useful to merge" [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond) [13:07:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [13:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:15] RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:09:52] (03PS3) 10Marostegui: orchestrator.conf: Add query to detect alias [puppet] - 10https://gerrit.wikimedia.org/r/636617 (https://phabricator.wikimedia.org/T266485) [13:10:39] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [13:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:25] 10Operations, 10netops: Apply uRPF strict mode on Customer links - https://phabricator.wikimedia.org/T266561 (10ayounsi) p:05Triage→03Low [13:13:59] (03PS1) 10Ayounsi: Add uRPF strict mode to Customers links [homer/public] - 10https://gerrit.wikimedia.org/r/636653 (https://phabricator.wikimedia.org/T266561) [13:15:00] good morning! 
in 15 minutes in this channel, we'll start getting set up for the DC switchover from codfw back to eqiad -- MW read-only window starts in 45 minutes [13:15:20] (03PS4) 10Jbond: systemd::timer::job: switch monitoring_enabled default to false [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) [13:15:21] if you have root on cumin1001, you can watch my screen by running "sudo -i tmux attach -rt switchdc" there [13:15:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [13:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:47] elukey: almost done? :) [13:17:20] I am done! [13:17:28] rzl: thanks! will follow on tmux [13:20:19] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:31] PROBLEM - Check systemd state on ms-be2038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:38] I'm around if needed [13:21:51] 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 3 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10akosiaris) >>! In T266373#6576166, @CDanis wrote: >>>! In T266373#6576161, @Framawiki wrote: >> See also [[... [13:23:46] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (10Papaul) [13:23:51] 10Operations, 10Data-Persistence-Backup, 10SRE-tools: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323 (10jcrespo) 05Open→03Resolved a:03jcrespo ` RECOVERY - Check systemd state on dbprov1003 is OK: OK - running: The system is fully operational htt... 
[13:23:53] 10Operations, 10SRE-tools: Systemd session creation fails under I/O load - https://phabricator.wikimedia.org/T199911 (10jcrespo) [13:25:36] * volans here < rzl [13:26:39] 👋 [13:26:54] (03PS6) 10Ottomata: Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) [13:27:30] (don't worry not merging ^ til after switch settles) [13:27:48] thanks! [13:28:11] (03CR) 10jerkins-bot: [V: 04-1] Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [13:28:35] (03PS7) 10Ottomata: Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) [13:29:53] (03CR) 10jerkins-bot: [V: 04-1] Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [13:30:17] okay, let's go :) [13:30:35] again if you have root on cumin1001, you can watch my screen by running "sudo -i tmux attach -rt switchdc" there -- if you change the flags at all, please keep -r so that yours is read-only [13:30:58] also, the session overall takes the SMALLEST terminal size, so please don't join from a tiny window, or no one else will be able to see anything [13:31:08] whoever just joined with an 80x20 terminal, that's you :) [13:31:13] cool! [13:31:16] wasn't me! :p [13:31:19] hahaha [13:31:31] aim for at least, let's say, 120x40 please [13:31:33] hahah [13:31:38] <_joe_> yeah please [13:31:39] 12x4 it is [13:31:40] if your screen is cramped, I'd like you to deal with it so that I don't have to <3 [13:32:12] well this is fun. [13:32:21] <_joe_> not really. 
[13:32:31] the commrel banner should be live now, but may not show up immediately everywhere due to caching [13:32:45] 10Operations, 10Puppet: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 (10MoritzMuehlenhoff) Sounds good to me [13:33:03] rzl: yep, not up yet on es wiki [13:33:03] _joe_, volans: check me? [13:33:13] rzl: direction correct! [13:33:19] I see it on mw.org but not on enwiki [13:33:27] <_joe_> +1 [13:33:51] now I see it [13:34:07] whoever's terminal is 20 lines tall, please make it taller :) the whole session takes the smallest size [13:34:24] I'm going to start phase 0 prep steps now, any objectinos? [13:34:26] *objections [13:34:47] <_joe_> go for me [13:35:11] +1 [13:35:11] running 00-reduce-ttl first just so we don't have to think about the five-minute wait time [13:35:15] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [13:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:22] some "failed to call" messages here are expected [13:35:31] since we just recheck until the records converge [13:35:34] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [13:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:44] the old TTL will be up at 13:41 [13:35:56] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [13:35:59] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [13:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:07] banner up on eswiki, for the record [13:36:20] <_joe_> itwiki too [13:36:24] > The last Puppet run was at Tue Oct 27 13:22:37 UTC 2020 (13 minutes ago). Puppet is disabled. 
cookbooks.sre.switchdc.mediawiki [13:36:31] confirming puppet is disabled on mwmaint1002 and 2001 [13:36:58] starting the cache warmups in eqiad, we may see some icinga noise about request latency, that's not user-impacting [13:37:02] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [13:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:25] <_joe_> rzl: I would suggest we run it multiple times [13:37:32] yep agreed [13:37:32] yep, as usual [13:38:00] I'll pause briefly in between for graph clarity [13:38:16] we can't set R/O for another 20+ minutes anyway, no rush [13:38:50] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [13:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:54] _joe_: is the current for i in range(3): enough or you meant re-run the whole warmup cookbook again? [13:39:23] not that it harms to run it more, just that we could increase that range(3) directly ;) [13:39:53] haha, good morning eqiad appservers https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard [13:40:38] <_joe_> so one thing I'm noticing is we're not warming up the apis at all [13:41:04] hm, that's true [13:41:39] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/%2B/master/cookbooks/sre/switchdc/mediawiki/00-warmup-caches.py#30 it looks like we just pass appservers.svc here, presumably we could just add the api name too? [13:41:54] I don't want to touch the cookbook now but we could run the warmup script directly [13:42:33] <_joe_> well no probably the warmup script calls appserver-specific endpoints [13:43:10] mm, they probably have a lot of cache in common but that's fair [13:44:06] I'll make a note for next time -- do you want to do anything about it now, or postpone until we get it fixed, or keep going as-is? [13:44:08] rzl: are you running the warmup again in the meanwhile? 
[13:44:46] a latency bump on the api servers wouldn't be as bad as on the appservers, I'm inclined to say let's not postpone over it [13:45:03] volans: can do but let's sort this out first, we're still plenty ahead of schedule [13:45:27] <_joe_> go as-is [13:45:42] ack [13:45:46] do you want a rerun? [13:45:52] <_joe_> I mean we did the same last time right? [13:45:55] <_joe_> yes please [13:46:06] we did yeah [13:46:10] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [13:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:16] <_joe_> I want to see those p75 and p99 go down when it runs [13:46:28] the URLs called can be checked on the mwmaint hosts in /var/lib/mediawiki-cache-warmup/ [13:46:37] urls-cluster.txt and urls-server.txt [13:46:58] (03Abandoned) 10Gehel: [wip] logstash: move to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/390039 (owner: 10Gehel) [13:47:02] (03Abandoned) 10Gehel: logstash: dedicated components in our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/390402 (https://phabricator.wikimedia.org/T179964) (owner: 10Gehel) [13:47:05] (03Abandoned) 10Gehel: service: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/396072 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [13:47:38] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [13:47:41] PROBLEM - Disk space on maps2002 is CRITICAL: DISK CRITICAL - free space: /srv 63626 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2002&var-datasource=codfw+prometheus/ops [13:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:01] (03Abandoned) 10Gehel: wdqs: increase frequency of update lag check [puppet] - 10https://gerrit.wikimedia.org/r/443374 (owner: 10Gehel) [13:48:06] (03Abandoned) 
Gehel: Introduce parameter data types for a few defined types. [puppet] - https://gerrit.wikimedia.org/r/440117 (owner: Gehel) [13:48:10] (Abandoned) Gehel: elasticsearch - notifiy nginx of SSL certificate changes [puppet] - https://gerrit.wikimedia.org/r/333664 (owner: Gehel) [13:48:15] (Abandoned) Gehel: logstash: kafka analytics cluster isn't available from deployment-prep [puppet] - https://gerrit.wikimedia.org/r/414638 (owner: Gehel) [13:48:20] tail latency is definitely less but I'll give it another run in a couple of minutes [13:48:25] <_joe_> +1 [13:48:32] ack avg response time also much better [13:48:44] I guess as improvement we could increase the range and add some sleep between the runs [13:48:45] I think the third is likely to be ~the same as the second, but [13:48:47] in the cookbook [13:48:55] (Abandoned) Gehel: Increase time before alter for elasticsearch disk space issues [puppet] - https://gerrit.wikimedia.org/r/290487 (https://phabricator.wikimedia.org/T136702) (owner: Gehel) [13:49:01] volans: yeah agreed, although the nice thing about having it this way is I can run other steps in between [13:49:13] I guess I haven't needed to, though, so that probably doesn't matter [13:49:17] ideally it should parse the results and retry until it converges [13:49:38] yeah, let's do more design work on it later :) [13:49:46] rerunning [13:49:50] <_joe_> we can probably also work on a more comprehensive set of urls [13:49:50] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [13:49:51] +1 [13:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:37] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [13:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:54] while we wait, now is a good time to pick a language wiki and prepare a test edit, without saving it [13:51:08] once we go read-only, 
I'll ask you to save your test edit and confirm that it doesn't work [13:51:21] RECOVERY - Check systemd state on ms-be2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:35] * volans gets meta [13:51:46] <_joe_> I got itwiki (group2) [13:51:47] meta is not a language wiki :-) [13:51:50] rzl: I will take eswiki [13:51:53] dewiki [13:52:09] <_joe_> no one takes enwiki? [13:52:11] <_joe_> or wikidata [13:52:16] testwiki (s3) here [13:52:18] and commons [13:52:39] if someone could also test via mobile app I'd appreciate that too [13:53:05] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:53:16] ^ expected [13:53:20] ptwiki [13:53:26] _joe_: is it me or is the third latency bump slightly worse than the second? [13:53:29] <_joe_> I got also wikidata [13:53:36] <_joe_> rzl: it's much worse [13:53:39] I'm going to give it one more [13:53:41] the mobile app shows no alert [13:53:43] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches [13:53:43] I will take enwiki then [13:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:47] eswiki and enwiki [13:53:52] i'm taking enwiki on mobile [13:53:54] <_joe_> which is because we clearly need to send more requests [13:53:57] ah nice, thanks kormat [13:53:57] yeah the last run was worse than the previous [13:54:09] the last, worse? 
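The idea floated in the discussion above (more warmup passes, a sleep between them, and retrying until the results converge) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the cookbook's code: `run_warmup`, the tolerance and the pass limits are hypothetical names and values, and a real version would have to run the warmup script and parse whatever latency figure it reports.

```python
import time
from typing import Callable, List


def warm_until_converged(
    run_warmup: Callable[[], float],  # hypothetical: runs one pass, returns a latency figure
    min_runs: int = 3,
    max_runs: int = 10,
    tolerance: float = 0.10,          # hypothetical: 10% relative change counts as "converged"
    pause_seconds: float = 0.0,       # pause between passes, e.g. for graph clarity
) -> List[float]:
    """Repeat warmup passes until consecutive latencies stop changing much."""
    latencies: List[float] = []
    for attempt in range(max_runs):
        if attempt and pause_seconds:
            time.sleep(pause_seconds)
        latencies.append(run_warmup())
        if len(latencies) >= max(min_runs, 2):
            previous, current = latencies[-2], latencies[-1]
            # Stop once the relative change between two consecutive passes is small.
            # A pass that comes out much worse than the previous one does not count
            # as converged, so the loop keeps going (up to max_runs).
            if previous > 0 and abs(previous - current) / previous <= tolerance:
                break
    return latencies


if __name__ == "__main__":
    # Canned latencies standing in for real warmup passes.
    canned = iter([900.0, 400.0, 380.0, 375.0])
    print(warm_until_converged(lambda: next(canned)))  # prints [900.0, 400.0, 380.0]
```

Comparing consecutive passes rather than checking against a fixed threshold keeps the sketch independent of what a "good" latency is for a given cluster; the `max_runs` cap is there so a noisy series cannot loop forever.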
[13:54:25] <_joe_> I think because some urls we just require to the load-balancers [13:54:26] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) [13:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:42] <_joe_> anyways, I think we're good to go [13:54:43] rzl: going to downtime masters for 10 minutes now [13:54:48] marostegui: ack, thanks [13:54:58] marostegui: make it 12 [13:54:59] :D [13:55:03] volans: ok XD [13:55:04] the time to run puppet [13:55:23] <_joe_> rzl: do you want to start read-only at the hour? [13:55:34] _joe_: at or a little after, no particular urgency there [13:55:34] !log root@cumin1001 START - Cookbook sre.hosts.downtime [13:55:35] it's 5m + RO time + get to run the step that runs puppet [13:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:47] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:52] I'm going ahead with stopping maintenance scripts now, objections? 
[13:56:00] <_joe_> yeah I was asking in the context of marostegui downtiming for 10 minutes [13:56:01] rzl: go for it [13:56:08] <_joe_> rzl: ack [13:56:09] marostegui: [nit] double sudo [13:56:18] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [13:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:24] volans: :( [13:56:34] (03CR) 10Mholloway: [C: 03+1] Update mobileapps to 2020-10-26-150740-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/636496 (https://phabricator.wikimedia.org/T264024) (owner: 10Ppchelko) [13:56:37] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [13:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:44] expect the stray php process message [13:56:46] as usual [13:56:54] who's looking? [13:57:00] yep, was just about to say [13:57:17] <_joe_> I will take a look [13:57:24] cirrus update index [13:57:41] AFAICT [13:57:57] seems the only one to me [13:58:04] <_joe_> ryankemper: there is a cscript ran by you [13:58:13] <_joe_> I am killing it now [13:58:21] cirrus will keep sending some queries to codfw after the switchover, as a mitigation from last time [13:58:21] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:59:11] apparently since oct. 
15th [13:59:13] not sure but I don't think that index update is part of that mitigation [13:59:14] gehel: ^^^ [13:59:14] <_joe_> uhm [13:59:21] <_joe_> ryan is running a while true [13:59:45] log in /home/ryankemper/cirrus_log/viwikisource.codfw.reindex.log [13:59:50] _joe_: it's probably a complete reindex, you should kill its whole session [13:59:57] PROBLEM - Check systemd state on mwmaint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:00] <_joe_> ok all clear [14:00:04] <_joe_> we can proceed [14:00:13] Good luck! [14:00:16] _joe_: is that systemd alert the same? [14:00:19] :popcorn: [14:00:20] dcausse: thanks for being faster than me :) [14:00:24] _joe_: I still see mwscriptwikiset extensions/FlaggedRevs/maintenance/updateStats.php [14:00:30] Echoing Trizek, good luck all [14:00:32] seems new though [14:01:04] rzl: mediawiki_job_parser_cache_purging.service loaded failed failed [14:01:13] on mwmaint2001 [14:01:15] <_joe_> volans: yes, we need to switchover so that jobs don't start again [14:01:23] ok [14:01:31] <_joe_> rzl: proceed please, I'm killing that script now [14:01:32] thought we disabled maintenance [14:01:36] okay, as before, I won't stop between steps [14:01:37] :) [14:01:40] +1 [14:01:40] say "stop" in here if anything is wrong [14:01:54] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [14:01:54] parsercache purging is not a problem unless failing for days [14:01:55] <_joe_> go [14:01:55] !log rzl@cumin1001 MediaWiki read-only period starts at: 2020-10-27 14:01:54.999830 [14:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:09] <_joe_> confirmed read-only [14:02:15] test edits please [14:02:15] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly 
(exit_code=0) [14:02:18] confirmed on eswiki and enwiki [14:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:19] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [14:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:25] confirmed RO [14:02:26] same on testwiki [14:02:31] mobile app is RO for cswiki [14:02:38] readonly on kowiki [14:02:42] mobile app RO for enwiki [14:02:43] i see RO on nlwiki [14:02:48] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [14:02:51] RO on ptwiki [14:02:51] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [14:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:54] RO: plwiki on mobile [14:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:57] commons on RO too [14:03:13] dewiki correctly readonly [14:03:17] "The wiki is currently in read-only mode" on app [14:03:20] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [14:03:24] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions [14:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:25] that error message was nicely fixed [14:03:27] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0) [14:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:30] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [14:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:32] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [14:03:35] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [14:03:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:42] test edits again after this, please [14:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:50] <_joe_> uhm [14:03:52] rzl: 3 edits on hatnote post-switchover [14:03:53] PROBLEM - Check the last execution of mediawiki_job_parser_cache_purging on mwmaint2001 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_parser_cache_purging https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:03:56] <_joe_> "failed to get siteinfo?" [14:04:12] now a few dozen [14:04:13] _joe_: that just means we're waiting for siteinfo to show RW again [14:04:16] write ok on commons [14:04:23] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=99) [14:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:27] "failed to call" is just the generic "still retrying" message [14:04:28] ok on kowiki [14:04:31] that message is confusing, we should really rewrite that [14:04:32] <_joe_> latencies are very high [14:04:33] we may have hit a bad host though, retrying [14:04:37] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [14:04:38] write succeeded on en.m.wikipedia [14:04:40] ok elwiki [14:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:52] i could edit / do a deletion on nlwiki [14:04:53] takes a bit more time on loading tho [14:05:02] uff sending an edit to testwiki is taking a lot [14:05:14] What Revi said [14:05:15] <_joe_> yeah we're seeing latencies skyrocketing [14:05:17] css loaded quite slow tho [14:05:18] eswiki, enwiki and commons worked for me [14:05:23] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 844 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:05:23] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=99) [14:05:26] <_joe_> but they should recovery soon [14:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:33] latencies are much higher than they were from the warmup script yeah [14:05:35] managed to save an edit at cswiki via mobile app, but the app is lazier [14:05:41] write succeeded on meta, although pretty slow [14:05:41] testwiki worked, but took 30s+ [14:05:44] but it does look like caches are filling [14:05:56] <_joe_> yes [14:06:05] Edit worked on plwiki on mobile [14:06:05] 10Operations, 10Traffic: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567 (10ema) [14:06:06] latencies are on their way down [14:06:07] !log root@cumin1001 START - Cookbook sre.hosts.downtime [14:06:09] I'm going to pause briefly and then rerun 07-set-readwrite one more time when the siteinfo check should be faster [14:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:11] <_joe_> we're out of the woods for latency [14:06:14] 10Operations, 10Traffic: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567 (10ema) p:05Triage→03Medium [14:06:15] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:06:19] Gave a bit more downtime to the masters [14:06:21] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:06:24] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:38] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.1717 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [14:06:40] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.1937 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [14:06:40] <_joe_> latencies should be ok in a couple minutes tops [14:06:55] PROBLEM - Check systemd state on kubestage1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:57] Enwp is hanging on login page [14:07:03] worker saturation is consistent with high latency, should recover [14:07:05] <_joe_> s3 seems to have latencies [14:07:09] * volans checking kubestage1001 [14:07:13] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:07:19] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [14:07:27] just wmf_auto_restart_mcelog.service, ignore it [14:07:37] <_joe_> apis have recovered [14:07:40] volans: ok thanks. I missed that [14:07:40] write latency is better on s3 now [14:07:42] oh pages [14:07:46] it's caused by the move to 4.19 on kubestage,patch for that one is pending [14:07:53] akosiaris: yeah, php worker saturation, already recovering [14:07:53] <_joe_> please focus [14:08:03] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [14:08:09] <_joe_> ^^ [14:08:11] <_joe_> good [14:08:21] latency is coming back down to the normal range [14:08:24] ptwiki edit worked [14:08:25] The banner displayed to warn our users is still visible for IPs, but would disappear soon. 
[14:08:26] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.7013 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [14:08:27] <_joe_> p99 is now normal [14:08:28] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6271 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [14:08:38] <_joe_> we should be out of the woods [14:08:40] Trizek: ack, sounds good [14:08:44] s8 which gave problems on the first switch is also ok [14:08:52] rerunning 07-set-readwrite one more time just to verify the siteinfo comes back correct [14:09:01] k [14:09:02] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [14:09:03] !log rzl@cumin1001 MediaWiki read-only period ends at: 2020-10-27 14:09:02.873019 [14:09:03] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [14:09:04] <_joe_> can someone check that the systemd timers are running correctly on mwmaint1002? [14:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:07] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:31] _joe_: how so? the maintenance is not yet re-enabled there [14:09:39] running 08-start-maintenance now to get the wdqs dispatcher going, that's the only one controlled by that step [14:09:41] loading fine, can lock accounts as normal, thanks people! [14:09:45] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [14:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:53] <_joe_> volans: most scripts are just always running [14:10:03] <_joe_> and do not start iff not in the active dc automatically [14:10:05] volans: everything using the systemd wrapper reads the active dc from conftool [14:10:13] k [14:10:38] for future switchovers we might consider adding a "pause all maintenance scripts" flag to the conftool data so we can control it from here [14:10:45] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:10:58] checking db1075 (s3) which seems to be struggling a bit [14:11:01] <_joe_> confirmed, they're running [14:11:10] PROBLEM - MariaDB read only s3 #page on db2105 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 23000057s, event_scheduler: True, 84.12 QPS, connection latency: 0.002210s, query latency: 0.000468s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:11:18] <_joe_> rzl: let's proceed? [14:11:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [14:11:21] Hm, I just got another page, delayed? [14:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:23] _joe_: already running [14:11:25] rzl: giving them another 5 minutes [14:11:29] PROBLEM - MariaDB read only es5 on es1024 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.12-MariaDB-log, Uptime 15127019s, event_scheduler: True, 117.28 QPS, connection latency: 0.002505s, query latency: 0.000627s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:11:30] yeah delayed I think, alert1001 [14:11:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:11:33] ok [14:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:36] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [14:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:42] fyi the "master comes back in read only" alerts are fine [14:11:48] running puppet on the db masters now to clear them [14:11:50] those are downtimes expiring [14:11:52] !log 
rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-run-puppet-on-db-masters [14:11:55] rzl: +1 [14:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:01] oh that makes more sense [14:12:18] while this runs, is anyone aware of any ongoing issues? [14:12:19] rzl: you can also run the tendril one Id say [14:12:28] leave just the TTL out for a bit [14:12:29] JIC [14:12:34] volans: yep that's the plan [14:12:49] <_joe_> everyone: any ongoing issues? [14:12:58] <_joe_> things you see while using the wikis [14:12:58] RECOVERY - MariaDB read only s3 #page on db2105 is OK: Version 10.1.43-MariaDB, Uptime 23000165s, read_only: True, event_scheduler: True, 193.82 QPS, connection latency: 0.002153s, query latency: 0.000537s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:12:59] rzl: write activity is much lower than before switch from the db point of view [14:13:17] rzl: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=7&orgId=1&from=1603804389476&to=1603807989476&var-site=eqiad&var-group=core&var-shard=All&var-role=All [14:13:24] rzl: I think the cookbook is missing a step -- I think we also need to run puppet on alert1001/alert2001 after running it on the primary DBs [14:13:31] <_joe_> yes [14:13:38] I ran puppet by hand on alert1001 and *that* is what triggered the above recovery [14:13:55] dewiki seems to be perfectly fine, made an edit, logged off and on and all functionality I checked is fine [14:14:12] cdanis: ack, thanks [14:14:22] jynus: I turned off stacked view and that looks like it's mostly driven by s8 [14:14:33] also, there's no issues reported in #wp-en or #-tech, and nothing on enwiki or commons technical village pump either [14:14:40] RECOVERY - MariaDB read only es5 on es1024 is OK: Version 10.4.12-MariaDB-log, Uptime 15127210s, read_only: False, event_scheduler: True, 113.86 QPS, connection latency: 0.002341s, query latency: 0.000626s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:14:44] Everything fine @ meta / kowiki as well [14:14:49] rzl: then it is just the dispatching [14:14:59] Scanning discord for any reports [14:15:01] <_joe_> thanks revi :) [14:15:06] jynus: yeah agree -- should be back to normal in the next few minutes [14:15:13] <_joe_> Spookreeeno: we have a discord? :O [14:15:17] let me know if it doesn't recover :) [14:15:21] https://enwp.org/WP:DISCORD [14:15:22] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-run-puppet-on-db-masters (exit_code=0) [14:15:24] ^ [14:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:28] <_joe_> TIL :P [14:15:29] We do _joe_ [14:15:33] or the lowercase Discord. [14:15:40] puppet run on DB masters complete, you can go ahead and un-downtime, marostegui [14:15:42] I just typed and didn't check if that shortcut exists [14:15:49] No issues with Wikipedia reported [14:15:52] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-update-tendril [14:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:57] rzl: ack, should expire in less than 5 minutes anyways [14:16:00] sgtm [14:16:02] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-update-tendril (exit_code=0) [14:16:03] <_joe_> Spookreeeno: thanks :) [14:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:18] marostegui: give tendril a look? [14:16:22] rzl: checking [14:16:43] keeping the DNS TTLs short until the dust clears, maybe another 15m or so if nothing comes up [14:16:52] rzl: looks good apart from pcXXXX hosts, which we saw on the first switch, I will fix those manually [14:16:57] Least I can do _joe_ [14:17:06] marostegui: what's test-s1? 
[14:17:13] marostegui: cool, thanks [14:17:15] volans: a test host [14:17:29] I'm looking at incident 574 on VO, it should have recovered by itself but clearly not [14:17:47] !log ran puppet on alert1001 [14:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:18] <_joe_> godog: can we remove /ack it? [14:18:34] _joe_: yeah I ack'd and resolved it now [14:18:48] the icinga alert isn't firing anymore anyways [14:20:41] rzl: tendril fixed [14:21:02] <_joe_> rzl: so we don't have as precise a timing as last time, but I think we can use the first edits after the switchover as indication, it was less than 2 minutes again [14:21:08] hmm, GET 5xxs from appservers are still nonzero, anyone able to logdive and see what that's about? [14:21:29] <_joe_> rzl: sure, but... that's not unusual IIRC? [14:21:41] s3 and s8 seem to have recovered from high latencies [14:21:42] <_joe_> rzl: what are you basing your statement on? [14:21:43] they look a little higher than codfw pre-switch [14:21:50] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now&var-datasource=codfw%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200 before [14:21:57] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200 after [14:22:20] sorry those are the whole page -- I'm at 5XX Error rate by HTTP method [14:22:44] this was the top failing url POST https://intake-analytics.wikimedia.org/v1/events?hasty=true [14:23:00] <_joe_> that's eventgate-analytics [14:23:14] we usually serve the occasional 500, but not a consistent ~0.1% [14:23:18] <_joe_> let's see fatalmonitor [14:23:43] https://logstash.wikimedia.org/goto/26e46707185208cdd2df18155ce8ea47 [14:23:46] jynus: whatever I'm looking at, it's GET not POST [14:23:48] I think it's restbase? 
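On the timing question raised above: the cookbook did log a start and an end for the read-only period earlier in this channel (14:01:54.999830 and 14:09:02.873019), so the logged window can be computed directly. A minimal sketch; note that the logged end comes from the third run of 07-set-readwrite, after two runs that exited with code 99, and test edits were already succeeding around 14:04, so this figure is an upper bound rather than the user-facing read-only time.

```python
from datetime import datetime

# Timestamps copied from the two "MediaWiki read-only period" !log lines.
FMT = "%Y-%m-%d %H:%M:%S.%f"
read_only_start = datetime.strptime("2020-10-27 14:01:54.999830", FMT)
read_only_end = datetime.strptime("2020-10-27 14:09:02.873019", FMT)

window = read_only_end - read_only_start
print(f"logged read-only window: {window.total_seconds():.3f} s")  # 427.873 s
```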
[14:23:49] <_joe_> [{exception_id}] {exception_url} Wikimedia\Rdbms\DBReadOnlyError from line 994 of /srv/mediawiki/php-1.36.0-wmf.14/includes/libs/rdbms/database/Database.php: Database is read-only: You can't edit now. This is because of maintenance. Copy and save your t [14:23:49] lots of failures from grafana [14:24:02] <_joe_> oh sorry [14:24:22] most of the 5xx I'm seeing right now are for /api/rest_v1/page/related queries [14:24:25] <_joe_> logstash is useless [14:24:32] hello looking? [14:24:32] filtering out the grafana and the non-GETs [14:24:42] intake-analytics is eventgate-analytics-external [14:24:52] <_joe_> logstash is lagged, don't trust it [14:24:57] oh, great [14:24:58] ottomata: that was overall, they may not be happening now [14:25:03] ok [14:25:06] in fact I think it is what _joe_ says [14:25:12] shouldn't we have an alert about that? [14:25:13] delayed ones [14:25:44] <_joe_> oh no sorry [14:25:50] <_joe_> those are jobrunners [14:25:57] <_joe_> with hanging jobs [14:26:02] (PS1) Ema: Add 1.8 to NEWS [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/636665 [14:26:03] on codfw? [14:26:04] (PS1) Ema: 1.8: add 'ABI vrt' to vmod_netmapper.vcc [software/varnish/libvmod-netmapper] - https://gerrit.wikimedia.org/r/636666 (https://phabricator.wikimedia.org/T266567) [14:26:27] we still have jobrunners attempting to write to codfw according to logstash, yes [14:26:31] <_joe_> yes, videoscaling [14:26:37] <_joe_> let me fix it [14:26:47] could it be jobs still running? 
[14:26:58] maybe those could be very delayed as they disconnect and can take a long time [14:27:17] or finished after the switchover anyway [14:27:29] yeah, that is what I mean [14:27:38] errors are still going down on logstash [14:27:39] <_joe_> !log restart php-fpm on jobrunners in codfw [14:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:47] <_joe_> now they should go away [14:27:50] cdanis: /api/rest_v1/page/related is probably CirrusSearch, I still see some rejections despite the mitigations put in place (but should be fine by now, I no longer see rejections) [14:28:10] PROBLEM - Check systemd state on an-worker1096 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:33] is wikidata dispatching enabled? write load on wikidata seems quite low [14:28:34] <_joe_> rzl: sorry now that I've solved the spamminess on the jobrunners in codfw [14:28:38] yeah dcausse restbase is still serving errors there [14:28:55] jynus: it should be, we ran that step, double-checking [14:28:57] not an alarming absolute rate, but I think it is the majority of 5xx right now [14:29:02] <_joe_> was it appservers or api-appservers? [14:29:10] _joe_: appservers is where I was looking [14:29:39] api-appserver 5xxs are an order of magnitude lower, looks much more normal [14:29:55] <_joe_> rzl: I can't see errors right now on a random appserver [14:31:01] <_joe_> oh I think I know what the problem is, persistent connections in cpjobqueue [14:31:10] jynus: I do see the wikidata dispatchers running in eqiad [14:31:34] rzl: I trust you, I am just seeing very different workload before and now [14:31:53] yeah for sure, just ruling out one cause :) [14:31:58] hmm [14:32:06] <_joe_> can someone please check if the codfw lvs for jobrunners is still receiving requests? 
[14:32:08] I think it is pretty similar already: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=7&orgId=1&from=now-6h&to=now&refresh=1m&var-site=All&var-group=core&var-shard=All&var-role=All [14:32:20] marostegui: see https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-server=db1104&var-port=9104&from=1603798302103&to=1603809102104 vs https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-server=db2079&var-port=9104&from=1603798335564&to=1603809135564 [14:32:55] could be not necessarily us [14:33:02] <_joe_> akosiaris: can you do a rolling restart of cpjobqueue in codfw? [14:33:07] e.g. maybe bots were stopped/failed [14:33:52] jynus: it is not that different if you zoom in I think (checking also the figures on the right side): https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-server=db2079&var-port=9104&from=1603805967656&to=1603807409504 vs https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-server=db1104&var-port=9104&from=1603806915643&to=1603809102104 [14:33:52] sure [14:33:55] _joe_: sure [14:34:01] marostegui: ok [14:34:26] more spiky, though [14:34:55] although it may have stabilized now [14:35:23] yeah, a bit more spiky but that can be related to coldness too [14:36:22] labsdb wikireplicas looking good [14:36:30] <_joe_> akosiaris: turning it off and on is ok too [14:36:50] !log rolling restart of all pods in codfw changeprop-jobqueue [14:36:54] _joe_: already underway [14:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:58] (03CR) 10Vgutierrez: [C: 03+1] Add 1.8 to NEWS [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/636665 (owner: 10Ema) [14:37:09] <_joe_> akosiaris: [14:37:12] <_joe_> thanks :) [14:37:19] (03CR) 10Vgutierrez: [C: 03+1] 1.8: add 'ABI vrt' to vmod_netmapper.vcc [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/636666 
(https://phabricator.wikimedia.org/T266567) (owner: 10Ema) [14:37:21] (03CR) 10Ema: [V: 03+2 C: 03+2] Add 1.8 to NEWS [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/636665 (owner: 10Ema) [14:38:03] (03CR) 10Ema: [V: 03+2 C: 03+2] 1.8: add 'ABI vrt' to vmod_netmapper.vcc [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/636666 (https://phabricator.wikimedia.org/T266567) (owner: 10Ema) [14:39:48] it's taking its sweet time [14:40:12] <_joe_> !log restarting envoyproxy on the jobrunners in codfw [14:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:17] <_joe_> akosiaris: I found the right solution [14:40:39] <_joe_> the errors in logstash should go down now [14:41:06] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:29] ok [14:42:00] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 23 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:42:13] (03PS1) 10Ema: Merge branch 'master' into debian [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/636668 [14:42:15] (03PS1) 10Ema: Release 1.9-1 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/636669 (https://phabricator.wikimedia.org/T266567) [14:42:28] 10Operations, 10observability: Two close pages for idle workers api + appserver didn't auto-resolve on recovery - https://phabricator.wikimedia.org/T266570 (10fgiunchedi) [14:42:32] <_joe_> rzl: I can't see evidence of an elevated error rate now besides the jobrunners I just fixed with the old dear "wrench in the wheels" method [14:42:50] _joe_: works for me [14:42:54] (03CR) 10jerkins-bot: [V: 04-1] 
Release 1.9-1 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/636669 (https://phabricator.wikimedia.org/T266567) (owner: 10Ema) [14:43:06] (03CR) 10Ema: [V: 03+2 C: 03+2] Merge branch 'master' into debian [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/636668 (owner: 10Ema) [14:43:27] <_joe_> rzl: we should add a -restart-envoy-jobrunners step [14:43:34] noted [14:43:52] <_joe_> it just serves to ensure we obliterate the perennial connections changeprop creates [14:44:15] <_joe_> after that connection dies, changeprop re-resolves the ip and correctly goes to eqiad [14:44:15] (03CR) 10Ema: [V: 03+2 C: 03+2] Release 1.9-1 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/636669 (https://phabricator.wikimedia.org/T266567) (owner: 10Ema) [14:44:18] nod [14:45:46] (03PS1) 10Mholloway: [BETA] Fix Echo Push Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636670 [14:46:42] _joe_: it's gone to zero on grafana too, nice [14:47:15] okay, are we still tracking anything else? [14:47:36] marostegui: are you happy? 
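Editor's note on the persistent-connection issue described at 14:43:52 and 14:44:15: a client that pins one long-lived connection keeps talking to the old datacenter until that connection dies, and only then looks the name up again. The sketch below illustrates that behaviour, plus an optional maximum connection age. It is a rough model under stated assumptions, not changeprop's or Envoy's actual code: the class name, the injected `resolve` and `clock` callables, and the hostname are all hypothetical.

```python
import time
from typing import Callable, Optional


class ReResolvingClient:
    """Toy model of a client that pins one connection to a resolved address.

    Hypothetical illustration only. The address is looked up again when the
    connection has been dropped or is older than max_age seconds.
    """

    def __init__(self, hostname: str, resolve: Callable[[str], str],
                 max_age: float, clock: Callable[[], float] = time.monotonic):
        self.hostname = hostname
        self._resolve = resolve      # e.g. a DNS lookup; injected for testing
        self._max_age = max_age
        self._clock = clock
        self._address: Optional[str] = None
        self._connected_at = 0.0

    def address(self) -> str:
        """Return the address in use, re-resolving if the connection is gone or too old."""
        now = self._clock()
        if self._address is None or now - self._connected_at >= self._max_age:
            self._address = self._resolve(self.hostname)
            self._connected_at = now
        return self._address

    def drop(self) -> None:
        """Model the connection being killed, e.g. by restarting the proxy in front of it."""
        self._address = None


# Usage with a fake DNS table and a fake clock (all values are made up).
records = {"jobrunner.example.internal": "10.2.0.1"}   # "codfw"
now = [0.0]
client = ReResolvingClient("jobrunner.example.internal", records.__getitem__,
                           max_age=60.0, clock=lambda: now[0])
client.address()                                       # pins 10.2.0.1
records["jobrunner.example.internal"] = "10.1.0.1"     # switchover to "eqiad"
stale = client.address()                               # still 10.2.0.1: connection persists
client.drop()                                          # the "restart envoy" step
fresh = client.address()                               # re-resolved to 10.1.0.1
```

With a finite `max_age` the client would eventually move on its own; the restart step discussed in the channel forces the same re-resolution immediately.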
[14:47:45] rzl: all good here [14:48:29] if no objections I'll restore the TTLs [14:48:34] <_joe_> +1 [14:48:57] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl [14:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:04] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: add user sync from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [14:49:11] (03PS3) 10Filippo Giunchedi: grafana: add user sync from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/636638 (https://phabricator.wikimedia.org/T265712) [14:49:23] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) [14:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:15] <_joe_> https://grafana.wikimedia.org/d/000000208/edit-count?viewPanel=8&orgId=1&refresh=5m&from=now-1h&to=now [14:50:26] <_joe_> pretty nice [14:50:53] (03CR) 10MSantos: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636670 (owner: 10Mholloway) [14:51:15] zooming out to the 3h view it's still lower than before -- wouldn't be surprised if some bots crashed on the first "you can't edit right now" and need to be restarted [14:55:10] at some point we should figure out how short a read-only period to claim this time :D but for now it looks like we're all clear [14:55:14] thanks everybody, nice work [14:55:15] !log upload libvmod-netmapper 1.9-1 to buster-wikimedia component/varnish6 T266567 [14:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:21] T266567: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567 [14:55:25] <_joe_> rzl: that happens at *every* switchover [14:55:43] heh yeah, figures [14:55:43] good argument for regularly returning artificial versions of such rare responses at a low rate [14:55:46] <_joe_> nice work indeed everyone 
:) [14:55:54] so developers aren't so surprised by them [14:56:09] <_joe_> yeah let's surprise users instead :P [14:56:14] yes! :) [14:56:42] (03PS1) 10Niedzielski: admin: remove niedzielski [puppet] - 10https://gerrit.wikimedia.org/r/636671 [15:00:19] (03CR) 10Reedy: [C: 04-1] admin: remove niedzielski (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636671 (owner: 10Niedzielski) [15:03:16] (03CR) 10Muehlenhoff: "@Stephen, once" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636671 (owner: 10Niedzielski) [15:07:53] (03PS10) 10Bstorm: toolforge: script to make long-running processes on bastions less good [puppet] - 10https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) [15:08:21] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps resolv.conf: replace hardcoded .eqiad.wmflabs search with hiera [puppet] - 10https://gerrit.wikimedia.org/r/636476 (https://phabricator.wikimedia.org/T266227) (owner: 10Andrew Bogott) [15:10:11] (03CR) 10Bstorm: toolforge: script to make long-running processes on bastions less good (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [15:12:41] (03PS3) 10Muehlenhoff: Grafana config changes for CAS-enabled grafana-rw.w.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/629122 (https://phabricator.wikimedia.org/T262512) [15:13:21] !log cp4032: varnish-frontend-restart with libvmod-netmapper 1.9-1 T266567 [15:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:27] T266567: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567 [15:14:31] NICE STUFF! 
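Editor's note on the earlier remark (14:51:15) that some bots may have crashed on the first "you can't edit right now": a client can treat the read-only error as temporary and retry instead of exiting. A minimal sketch follows, assuming the response is a parsed MediaWiki Action API JSON body where a failure looks like `{"error": {"code": "readonly", ...}}`. The function name `edit_with_retry` and the `do_edit` callable are hypothetical, and the delays are arbitrary.

```python
import time
from typing import Any, Callable, Dict


def edit_with_retry(do_edit: Callable[[], Dict[str, Any]],
                    max_attempts: int = 5,
                    base_delay: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Call do_edit() and retry with exponential backoff while the wiki is read-only.

    do_edit is a hypothetical callable that performs one edit request and
    returns the parsed JSON response. Any response other than a "readonly"
    error is returned to the caller unchanged.
    """
    response: Dict[str, Any] = {}
    for attempt in range(max_attempts):
        response = do_edit()
        code = response.get("error", {}).get("code")
        if code != "readonly":
            return response
        if attempt < max_attempts - 1:
            # Wait 5s, 10s, 20s, ... before trying again.
            sleep(base_delay * 2 ** attempt)
    return response  # still read-only after max_attempts tries


# Usage with canned responses instead of real HTTP calls.
canned = iter([
    {"error": {"code": "readonly", "info": "The wiki is currently in read-only mode."}},
    {"error": {"code": "readonly", "info": "The wiki is currently in read-only mode."}},
    {"edit": {"result": "Success"}},
])
delays = []
result = edit_with_retry(lambda: next(canned), sleep=delays.append)
```

With the defaults, five attempts span 75 seconds of waiting, which would not have covered the whole read-only window reported later in this log, so a real bot would likely want more attempts or a longer base delay.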
[15:15:52] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [15:17:13] (03CR) 10Jbond: systemd::timer::job: switch monitoring_enabled default to false (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [15:18:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:20:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:21:42] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:33] !log cp4032: downgrade varnish to 6.0.4 T264398 [15:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:39] T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 [15:26:03] (03CR) 10DannyS712: [C: 03+1] Add growthexperiments to allowed logtypes [puppet] - 10https://gerrit.wikimedia.org/r/636436 (https://phabricator.wikimedia.org/T266477) (owner: 10Urbanecm) [15:26:13] (03PS1) 10Jcrespo: Revert "mariadb: Set db1077 in read-write" [puppet] - 10https://gerrit.wikimedia.org/r/636686 [15:26:15] (03CR) 10Filippo Giunchedi: systemd::timer::job: switch monitoring_enabled default to false (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [15:26:26] (03CR) 10jerkins-bot: [V: 04-1] Revert "mariadb: Set db1077 in 
read-write" [puppet] - 10https://gerrit.wikimedia.org/r/636686 (owner: 10Jcrespo) [15:26:39] 10Operations, 10netops: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (10ayounsi) 05Stalled→03Resolved a:03ayounsi All done here. I don't think it's worth doing more intrusive testing. [15:28:16] (03PS2) 10Jcrespo: mariadb: Cleanup old cron deletions after some time after deploy [puppet] - 10https://gerrit.wikimedia.org/r/636644 (https://phabricator.wikimedia.org/T265138) [15:28:18] (03PS2) 10Jcrespo: Revert "mariadb: Set db1077 in read-write" [puppet] - 10https://gerrit.wikimedia.org/r/636686 [15:28:18] 10Operations, 10Traffic: libvmod-netmapper: must specify ABI stanza - https://phabricator.wikimedia.org/T266567 (10ema) 05Open→03Resolved a:03ema Done in libvmod-netmapper 1.9-1, closing. [15:28:27] (03PS3) 10Jcrespo: Revert "mariadb: Set db1077 in read-write" [puppet] - 10https://gerrit.wikimedia.org/r/636686 [15:28:30] (03Abandoned) 10DannyS712: SpecialInvestigateBlock: Don't assume 'DisableUTEdit' exists [extensions/CheckUser] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/631478 (https://phabricator.wikimedia.org/T264302) (owner: 10Dbarratt) [15:28:44] (03Abandoned) 10DannyS712: SpecialInvestigateBlock: Don't assume 'DisableUTEdit' exists [extensions/CheckUser] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631466 (https://phabricator.wikimedia.org/T264302) (owner: 10Jforrester) [15:29:05] (03CR) 10Jcrespo: "This was a configuration not reverted, not noticed until eqiad was primary again." 
[puppet] - 10https://gerrit.wikimedia.org/r/636686 (owner: 10Jcrespo) [15:30:13] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Set db1077 in read-write" [puppet] - 10https://gerrit.wikimedia.org/r/636686 (owner: 10Jcrespo) [15:30:54] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) With T266567 out of the way, we can now try different Varnish 6 versions, at least as long as they're VRT-compatible. [15:31:05] (03Abandoned) 10DannyS712: Collect data about CodeMirror preference usage [extensions/WikimediaEvents] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/631229 (https://phabricator.wikimedia.org/T260138) (owner: 10Krinkle) [15:31:47] (03PS5) 10Jbond: systemd::timer::job: switch monitoring_enabled default to false [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) [15:32:09] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [15:36:58] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [15:38:43] 10Operations, 10Patch-Needs-Improvement: logrotate cronspam on ms-be1040 - https://phabricator.wikimedia.org/T205974 (10fgiunchedi) 05Open→03Invalid IIRC this was fixed, boldly resolving [15:39:17] 10Operations, 10Patch-Needs-Improvement: puppet should try to mount all mountable swift filesystems - https://phabricator.wikimedia.org/T126574 (10fgiunchedi) a:05fgiunchedi→03None [15:39:32] 10Operations, 10observability, 10User-CDanis, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10fgiunchedi) a:05fgiunchedi→03None [15:42:28] (03CR) 10Ema: Exclude predefined user agents from eventlogging data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [15:45:06] 10Operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750 (10MoritzMuehlenhoff) [15:46:13] (03CR) 10Jcrespo: [C: 03+1] "Ok as it is, although it could also be set as master on hiera (slightly more conventional), but not a huge issue given it is not the final" [puppet] - 10https://gerrit.wikimedia.org/r/636609 (https://phabricator.wikimedia.org/T266003) (owner: 10Kormat) [15:48:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:48:19] (03PS2) 10Elukey: base::standard_packages: avoid mcelog with kernels >= 4.12 [puppet] - 10https://gerrit.wikimedia.org/r/636560 [15:49:42] (03PS3) 10Elukey: base::standard_packages: avoid mcelog with kernels >= 4.12 [puppet] - 10https://gerrit.wikimedia.org/r/636560 [15:49:45] /query rzl very nice work on the switchovers! [15:49:47] er [15:49:48] ha [15:49:54] haha thank you! 
:D [15:49:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:50:15] irssi plugin idea: intercept private messages [15:50:19] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/636560 (owner: 10Elukey) [15:50:50] the credit goes to everyone else for doing the work to make sure everything went right when I pressed the button [15:51:15] I don't understand much of it so I have been following to try to learn what and how it happens :) [15:52:25] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/636560 (owner: 10Elukey) [15:53:56] (03CR) 10Elukey: [C: 03+2] base::standard_packages: avoid mcelog with kernels >= 4.12 [puppet] - 10https://gerrit.wikimedia.org/r/636560 (owner: 10Elukey) [15:56:41] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (10Papaul) ` [edit interfaces interface-range vlan-private1-c-codfw] - member-range ge-4/0/1 to ge-4/0/10; [edit interfaces interface-range disabled] member... [15:58:07] just wondering... how long was the read-only this time? [15:58:59] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [15:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:38] 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10elukey) [16:01:52] 10Operations, 10ops-eqiad, 10DC-Ops: hardware troubleshooting: PSU Failure on wtp1033 - https://phabricator.wikimedia.org/T266575 (10wiki_willy) [16:02:22] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:02] just missed WXIII but for anyone else who's wondering, the final score this time was 1 minute 49 seconds :) [16:05:18] !!!! [16:05:20] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:05:21] nice! [16:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:44] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [16:05:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [16:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:52] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (10Papaul) [16:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:55] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [16:05:55] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [16:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:14] (plus a brief period of sluggishness afterward, which is normal -- during which edit rates were nonzero but lower than usual) [16:06:22] (03CR) 10Mholloway: [C: 03+2] [BETA] Fix Echo Push Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636670 (owner: 10Mholloway) [16:07:31] (03Merged) 10jenkins-bot: [BETA] Fix Echo Push Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636670 (owner: 10Mholloway) [16:12:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:12:50] (03PS2) 10Matthias Mullie: Add another SDC property to 
search for matching media statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633982 (https://phabricator.wikimedia.org/T264925) [16:12:57] (03CR) 10Matthias Mullie: [C: 03+1] Add another SDC property to search for matching media statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633982 (https://phabricator.wikimedia.org/T264925) (owner: 10Matthias Mullie) [16:14:08] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:18:26] RECOVERY - Check systemd state on an-worker1096 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:14] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:43] 10Operations, 10ops-eqiad, 10DC-Ops: hardware troubleshooting: PSU Failure on wtp1033 - https://phabricator.wikimedia.org/T266575 (10Cmjohnson) 05Open→03Resolved power cable was loose...fixed [16:25:38] 10Operations, 10ops-codfw, 10serviceops: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Papaul) [16:28:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:28:44] PROBLEM - Check systemd state on idp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:52] (03PS1) 10Cwhite: Enable search slowlog by default for ECS indices. [software/ecs] - 10https://gerrit.wikimedia.org/r/636685 [16:30:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:30:48] PROBLEM - IPMI Sensor Status on analytics1072 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:34:02] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [16:34:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [16:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:32] RECOVERY - IPMI Sensor Status on wtp1033 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:51:51] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (10Urbanecm) @Ammarpad Thansk for your comment. I wasn't able to open any of those PDFs with any PDF viewer installed (all browsers I... 
[16:54:27] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (10RhinosF1) I get ERR_CONNECTION_CLOSED on https://en.wikipedia.org/api/rest_v1/page/pdf/Rail_transport_modelling [16:58:28] (03PS1) 10JMeybohm: Update patches and changelog [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/636713 [16:58:37] 10Operations: Phase out DSA keys for SSH access (ssh-dss) - https://phabricator.wikimedia.org/T177371 (10akosiaris) a:05akosiaris→03None [17:01:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update patches and changelog [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/636713 (owner: 10JMeybohm) [17:01:28] (03CR) 10JMeybohm: [C: 03+2] Update patches and changelog [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/636713 (owner: 10JMeybohm) [17:01:46] PROBLEM - Check systemd state on an-worker1098 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:03] 10Operations, 10Puppet: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 (10jbond) [17:13:17] (03PS2) 10Razzi: geoip: cleanup having moved archiving to launcher [puppet] - 10https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152) [17:14:37] 10Operations: IPMI Audit 2018-04 - https://phabricator.wikimedia.org/T193155 (10Volans) 05Open→03Resolved Resolving as it's a too old audit now, we could re-run it again if needed, but we've also alerts that check most of those scenarios. [17:16:50] 10Operations, 10User-Kormat: cumin: If no command is provided, output nodelist to stdout - https://phabricator.wikimedia.org/T261861 (10Volans) This is fixed in master in 6d16ec2 (list of hosts to stdout, rest to stderr), but not yet deployed. 
[17:19:40] 10Operations, 10Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10Volans) a:05Volans→03None [17:20:41] (03CR) 10Dzahn: [C: 03+2] phabricator: remove absented cron jobs for Bugzilla updates [puppet] - 10https://gerrit.wikimedia.org/r/636085 (owner: 10Dzahn) [17:20:44] (03CR) 10Razzi: geoip: cleanup having moved archiving to launcher (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi) [17:21:24] (03CR) 10Dzahn: [C: 03+2] gerrit: remove 'list_mediawiki_extensions' cron job [puppet] - 10https://gerrit.wikimedia.org/r/636084 (https://phabricator.wikimedia.org/T266024) (owner: 10Dzahn) [17:22:31] (03CR) 10Elukey: geoip: cleanup having moved archiving to launcher (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi) [17:22:39] !log gerrit1001/2001 - sudo rm /var/www/mediawiki-extensions.txt [17:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:00] (03CR) 10Dzahn: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/636084 (https://phabricator.wikimedia.org/T266024) (owner: 10Dzahn) [17:23:45] (03Abandoned) 10Milimetric: Revert "camus - don't check eqiad topics while DC switchover to codfw is ongoing" [puppet] - 10https://gerrit.wikimedia.org/r/623556 (https://phabricator.wikimedia.org/T261865) (owner: 10Milimetric) [17:27:09] (03PS1) 10Milimetric: analytics/camus: switch back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/636717 (https://phabricator.wikimedia.org/T261865) [17:29:33] (03CR) 10Ottomata: [C: 03+2] analytics/camus: switch back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/636717 (https://phabricator.wikimedia.org/T261865) (owner: 10Milimetric) [17:34:36] (03CR) 10Ottomata: Exclude predefined user agents from eventlogging data (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [17:35:02] (03PS8) 10Ottomata: Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) [17:35:27] (03CR) 10Dzahn: [V: 03+1 C: 03+2] ldap::client::labs: fix 'Unknown variable: '::restricted..' [puppet] - 10https://gerrit.wikimedia.org/r/633838 (https://phabricator.wikimedia.org/T101447) (owner: 10Dzahn) [17:36:28] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/26166/cp1075.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [17:36:34] (03CR) 10Ottomata: [C: 03+2] Exclude predefined user agents from eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/636493 (https://phabricator.wikimedia.org/T266130) (owner: 10Ottomata) [17:40:58] (03CR) 10Zoranzoki21: [C: 03+1] "Thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/636436 (https://phabricator.wikimedia.org/T266477) (owner: 10Urbanecm) [17:41:15] (03PS1) 10Ottomata: eventgate-analytics-external - bump to image 2020-10-27-173311-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/636722 (https://phabricator.wikimedia.org/T266573) [17:42:17] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics-external - bump to image 2020-10-27-173311-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/636722 (https://phabricator.wikimedia.org/T266573) (owner: 10Ottomata) [17:44:20] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [17:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:42] PROBLEM - Check systemd state on an-worker1099 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:41] I think we have new systemd alerts on random hosts sometimes due to recent changes from crons to systemd timers. [17:46:55] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [17:46:55] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [17:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:27] mutante: yeah -- that's a feature, it means the jobs are monitored for failures we didn't used to know about :) [17:48:24] rzl: I think it's the auto_restart_service [17:48:50] i once ran systemctl reset-failed on a bunch of them and they all cleared.. but it wasn't a one-time thing due to the change [17:49:30] PROBLEM - Host ms-be1057 is DOWN: PING CRITICAL - Packet loss = 100% [17:50:31] yea, example releases2001.. restart_jenkins and restart_rsync .. hrmm [17:51:42] "Service jenkins not present or not running" ok:) so trying to restart a service that is not present = fail :) [17:52:08] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Netbox Error for asw2-d4-eqiad - https://phabricator.wikimedia.org/T265393 (10wiki_willy) 05Open→03Resolved a:05wiki_willy→03Cmjohnson Netbox error is resolved now. Closing task. Thanks, Willy [17:52:09] and a bunch of failover servers have puppet code to make sure it's absent on the inactive host [17:53:01] yep, one way it happens is when a service is masked .. like jenkins in this case [17:53:27] yeah that makes sense [17:53:35] sounds like wmf-auto-restart.py should exit 0 in that situation [17:55:50] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
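Editor's note on the suggestion at 17:53:35 that the restart script should exit 0 when the service is masked or absent: one way to picture it is to decide, from the unit's state, whether there is anything to restart before doing any further work such as a PID lookup. The sketch below is not the actual wmf-auto-restart.py code. It parses the `Key=Value` text printed by `systemctl show -p LoadState -p ActiveState <unit>`, and the function names are hypothetical.

```python
from typing import Dict


def parse_systemctl_show(output: str) -> Dict[str, str]:
    """Parse the Key=Value lines printed by `systemctl show -p ...`."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def nothing_to_restart(show_output: str) -> bool:
    """Return True when the unit is masked, missing, or simply not running.

    Hypothetical policy: in those cases a restart helper could exit 0
    instead of failing and leaving the host in a 'degraded' systemd state.
    """
    props = parse_systemctl_show(show_output)
    if props.get("LoadState") in ("masked", "not-found"):
        return True
    return props.get("ActiveState") != "active"


def exit_code(show_output: str, restart_succeeded: bool) -> int:
    """Exit status a wrapper might use: 0 if skipped or restarted, 1 on a failed restart."""
    if nothing_to_restart(show_output):
        return 0
    return 0 if restart_succeeded else 1


# Usage with canned output instead of calling systemctl.
masked = "LoadState=masked\nActiveState=inactive\n"
running = "LoadState=loaded\nActiveState=active\n"
skip_code = exit_code(masked, restart_succeeded=False)    # nothing to do, not an error
fail_code = exit_code(running, restart_succeeded=False)   # a real restart failure
```

Keeping the state check separate from the restart itself means a masked unit on an inactive failover host is skipped before any step that assumes a running process.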
[17:55:50] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [17:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:34] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Bstorm) a:05Bstorm→03RobH Assigning to @RobH to start discussion of how to act... [18:06:44] (03PS1) 10Dzahn: wmf-auto-restart: return 0 if service is not present or running [puppet] - 10https://gerrit.wikimedia.org/r/636728 [18:09:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:12:04] (03PS1) 10Volans: cli: change confirmation input check [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 [18:13:52] (03PS2) 10Volans: cli: change confirmation input check [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 [18:14:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:15:07] (03CR) 10BBlack: [C: 03+1] cli: change confirmation input check [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 (owner: 10Volans) [18:15:36] (03CR) 10Jbond: [C: 04-1] "I think this will just cause the script to fail at the next step when it tries to get the process PID. 
ultimately we shouldn't have servi" [puppet] - 10https://gerrit.wikimedia.org/r/636728 (owner: 10Dzahn) [18:16:49] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10RobH) a:05RobH→03Cmjohnson So this appears like it is basically requiring some... [18:18:01] (03CR) 10Ryan Kemper: [C: 03+2] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/26168/" [puppet] - 10https://gerrit.wikimedia.org/r/636420 (https://phabricator.wikimedia.org/T266470) (owner: 10DCausse) [18:18:32] (03PS1) 10Dzahn: releases: only use auto-restart script if actual rsyncd is present [puppet] - 10https://gerrit.wikimedia.org/r/636730 [18:18:33] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10RobH) p:05Triage→03Medium [18:21:20] (03CR) 10Jbond: "lgtm but might be worth a pcc just incase" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn) [18:21:22] (03CR) 10Dzahn: [C: 04-2] "ah..no, wait. this needs to be only on the primary" [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn) [18:23:44] (03PS2) 10Dzahn: releases: only use auto-restart script on primary server [puppet] - 10https://gerrit.wikimedia.org/r/636730 [18:25:03] (03CR) 10jerkins-bot: [V: 04-1] releases: only use auto-restart script on primary server [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn) [18:25:52] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Bstorm) Thank you for helping me sort this out. 
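Note on the wmf-auto-restart thread above: the suggestion was that the script should exit 0 when the service is masked or absent, and the review comment was that the script would otherwise fail at the next step, when it looks up the process PID. A minimal sketch of that early exit follows, assuming the check runs before any PID lookup. All function names here are hypothetical and this is not the actual wmf-auto-restart code; the only real interface used is `systemctl is-active --quiet`, which exits 0 only when the unit is active.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical types and helpers for illustration only.
Runner = Callable[[Sequence[str]], int]


def systemctl_returncode(args: Sequence[str]) -> int:
    """Run systemctl with the given arguments and return its exit status."""
    result = subprocess.run(
        ["systemctl", *args],
        check=False,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def is_active(service: str, run: Runner = systemctl_returncode) -> bool:
    """True only if systemd reports the unit as active."""
    # Absent, masked, and stopped units all give a non-zero exit status.
    return run(["is-active", "--quiet", service]) == 0


def main(
    service: str,
    run: Runner = systemctl_returncode,
    restart: Callable[[str], int] = lambda _service: 0,
) -> int:
    """Return the exit status the wrapper script should use."""
    if not is_active(service, run):
        # Exit cleanly before any PID lookup, so the timer unit
        # is not marked as failed on hosts where the service is absent.
        print(f"Service {service} not present or not running, nothing to do")
        return 0
    # Placeholder for the real restart decision (assumed, not shown in the log).
    return restart(service)
```

Passing `run` and `restart` as parameters keeps the sketch testable without a running systemd. Whether exiting 0 is the right policy is a separate question: the later patch in this log removes the timer from the inactive host instead.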
[18:26:46] (03CR) 10Dzahn: [C: 04-2] "jerkins: the variables are already used.. it was copy/paste.. what?" [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn) [18:27:50] (03CR) 10Dzahn: [C: 04-2] "ok.. I see. it's outside the loop..amending" [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn) [18:27:53] (03PS3) 10Razzi: geoip: cleanup having moved archiving to launcher [puppet] - 10https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152) [18:28:24] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T265086 (10Cmjohnson) production cables are in port 10 on both switches [18:30:18] PROBLEM - Disk space on Hadoop worker on an-worker1101 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/s 14 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:33:07] (03PS3) 10Dzahn: releases: only use auto-restart script on primary server [puppet] - 10https://gerrit.wikimedia.org/r/636730 [18:35:00] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[18:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:50] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "now it should be fine: https://puppet-compiler.wmflabs.org/compiler1002/26169/" [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn) [18:53:42] RECOVERY - Disk space on Hadoop worker on an-worker1101 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:55:18] (03CR) 10Ryan Kemper: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/26170/" [puppet] - 10https://gerrit.wikimedia.org/r/636432 (https://phabricator.wikimedia.org/T255399) (owner: 10DCausse) [19:08:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:10:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:41:01] (03PS2) 10Ppchelko: Add changeprop rules for newcomerTasksCacheRefreshJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/636078 (https://phabricator.wikimedia.org/T260758) (owner: 10Catrope) [19:44:17] (03CR) 10Ppchelko: [C: 03+1] "This will work, yeah. Will deploy tomorrow." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/636078 (https://phabricator.wikimedia.org/T260758) (owner: 10Catrope) [19:56:19] 10Operations, 10MediaWiki-General, 10Traffic, 10HTTPS: Protocol-relative URLs are poorly supported or unsupported by a number of HTTP clients - https://phabricator.wikimedia.org/T54253 (10Tgr) [20:00:29] (03PS1) 10Amire80: Add Fahrrad-Datenautobahn (Category:Wikipedia) to the German RSS Planet [puppet] - 10https://gerrit.wikimedia.org/r/636741 [20:07:01] (03CR) 10Dzahn: [C: 03+2] "looks good to me. thank you for handling these!" [puppet] - 10https://gerrit.wikimedia.org/r/636741 (owner: 10Amire80) [20:10:38] (03CR) 10Ryan Kemper: [C: 03+2] [wdqs] add support for streaming updater lag metric [puppet] - 10https://gerrit.wikimedia.org/r/636432 (https://phabricator.wikimedia.org/T255399) (owner: 10DCausse) [20:11:18] (03PS1) 10Dzahn: wikistats: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/636743 [20:11:29] 10Operations, 10ops-codfw, 10serviceops: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Peachey88) [20:13:33] !log gerrit1001/gerrit2001: manually deleting list_mediawiki_extensions cron job (T266024) [20:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:42] T266024: Phase out https://gerrit.wikimedia.org/mediawiki-extensions.txt - https://phabricator.wikimedia.org/T266024 [20:21:26] (03CR) 10Dzahn: [V: 03+1 C: 03+2] releases: only use auto-restart script on primary server [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn) [20:29:13] (03CR) 10Dzahn: "Notice: /Stage[main]/Rsyslog/File[/etc/rsyslog.d/20-wmf-auto-restart-rsync.conf]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/636730 (owner: 10Dzahn) [20:43:20] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:23] !log releases2002 - systemctl reset-failed .. after removing wmf_auto_restart_rsync [20:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:55:09] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10jijiki) [20:55:15] 10Operations, 10ops-eqiad, 10DC-Ops: ms-be1057 down - cable disconnected? - https://phabricator.wikimedia.org/T266604 (10Dzahn) [20:55:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:55:47] ACKNOWLEDGEMENT - Host ms-be1057 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T266604 [20:56:19] !log ms-be1057 is network down but running, NO-CARRIER on NIC, cable disconnected? [20:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:36] RECOVERY - Check systemd state on mwmaint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:26] !log mwmaint2001 - systemctl reset-failed - mediawiki_job_parser_cache_purging.service [21:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:53] 10Operations, 10ops-eqiad, 10DC-Ops: ms-be1057 down - cable disconnected? 
- https://phabricator.wikimedia.org/T266604 (10RobH) p:05Triage→03High The switch port sees this port enabled (admin up) but link down, supporting that it could be a bad cable or cable disconnect. ` Interface Admin Link Desc... [21:43:34] RECOVERY - Check the last execution of mediawiki_job_parser_cache_purging on mwmaint2001 is OK: OK: Status of the systemd unit mediawiki_job_parser_cache_purging https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:56:49] (03CR) 10Razzi: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26171/" [puppet] - 10https://gerrit.wikimedia.org/r/636514 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [22:01:16] (03PS3) 10Volans: cli: change confirmation input check [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 [22:01:37] rzl: ^^^ hope it addresses your IRC comment from earlier [22:18:38] (03PS1) 10Dzahn: apache: add 20.wikipedia.org redirect to wikimediafoundation site [puppet] - 10https://gerrit.wikimedia.org/r/636755 (https://phabricator.wikimedia.org/T264367) [22:20:05] !log systemctl reset-failed on various servers to see which are coming back later from failed auto_restart and which don't [22:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:58] RECOVERY - Check systemd state on an-worker1098 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:04] RECOVERY - Check systemd state on an-worker1099 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:48] RECOVERY - Check systemd state on mw1381 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:52] RECOVERY - Check systemd state on kubestage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[22:22:08] RECOVERY - Check systemd state on idp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:22:18] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:22:18] RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:22:52] RECOVERY - Check systemd state on kubestage1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:23:35] 10Operations, 10serviceops, 10Datacenter-Switchover: Siteinfo timeout during switch datacenter - https://phabricator.wikimedia.org/T266618 (10Volans) p:05Triage→03Medium [22:23:38] RECOVERY - Check the last execution of package_builder_Clean_up_build_directory on deneb is OK: OK: Status of the systemd unit package_builder_Clean_up_build_directory https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:24:04] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:22] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:52] (03PS3) 10Dzahn: wikistats: allow to 'absent' import/dump crons as well [puppet] - 10https://gerrit.wikimedia.org/r/633845 [22:32:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:34:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:42:02] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/26172/wikistats-wild-tiger.wikistats.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/633845 (owner: 10Dzahn) [22:44:16] (03PS1) 10Dave Pifke: [WIP] Experimental support for speedscope.app [puppet] - 10https://gerrit.wikimedia.org/r/636759 [22:50:12] (03PS2) 10Dzahn: wikistats: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/636743 (https://phabricator.wikimedia.org/T266479) [22:50:25] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/26173/wikistats-wild-tiger.wikistats.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/636743 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [22:51:18] (03CR) 10Dzahn: [V: 03+1 C: 03+2] wikistats: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/636743 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [22:51:23] (03PS3) 10Dzahn: wikistats: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/636743 (https://phabricator.wikimedia.org/T266479) [22:53:37] (03PS2) 10Dave Pifke: [WIP] Experimental support for speedscope.app [puppet] - 
10https://gerrit.wikimedia.org/r/636759 [22:58:17] (03PS10) 10Dzahn: mirrors: replace cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636082 (https://phabricator.wikimedia.org/T265138) [22:58:21] (03PS4) 10Dzahn: puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) [22:59:00] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: replace cron to remove old reports with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/636104 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [23:05:45] (03CR) 10Dzahn: [V: 03+1] base/labs: add systemd timer to clean puppet client bucket (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [23:06:32] (03CR) 10Dzahn: [C: 03+2] base::labs: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635905 (owner: 10Dzahn) [23:07:27] (03PS4) 10Dzahn: planet: replace update cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636105 (https://phabricator.wikimedia.org/T265138) [23:07:49] (03CR) 10jerkins-bot: [V: 04-1] planet: replace update cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636105 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [23:09:59] (03PS2) 10Dzahn: dumps: rm profile::dumps::distribution::datasets::cleanup_miscdatasets [puppet] - 10https://gerrit.wikimedia.org/r/636087 (https://phabricator.wikimedia.org/T265138) [23:10:16] (03PS8) 10Dzahn: gerrit: replace cron jobs with systemd timers (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/633857 (https://phabricator.wikimedia.org/T265138) [23:10:27] (03CR) 10jerkins-bot: [V: 04-1] gerrit: replace cron jobs with systemd timers (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/633857 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [23:15:48] 10Operations, 10Puppet, 
10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10Dzahn) patches belonging to this ticket that had not been linked and are already merged: topic branch: https://gerrit.wikimedia.org/r/q/t... [23:17:00] (03PS3) 10Dzahn: base/labs: add systemd timer to clean puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) [23:17:13] (03CR) 10Dzahn: "> Patch Set 2:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [23:17:41] (03CR) 10jerkins-bot: [V: 04-1] base/labs: add systemd timer to clean puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [23:20:06] (03PS4) 10Dzahn: base/labs: add systemd timer to clean puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) [23:24:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10Andrew) [23:24:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [23:26:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) @Cmjohnson, in case you were waiting to do all these in bulk: all remaining cloudvirts are now ready for upgrade. 
[23:34:41] (03PS9) 10Dzahn: gerrit: replace clear_gerrit_logs cron job with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/633857 (https://phabricator.wikimedia.org/T265138) [23:45:49] (03PS5) 10Dzahn: planet: replace update cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/636105 (https://phabricator.wikimedia.org/T265138) [23:57:13] Andrewbogott the cloudvirts can be updated anytime? [23:57:31] (03PS1) 10Bstorm: k8s-haproxy: take steps to fix logging [puppet] - 10https://gerrit.wikimedia.org/r/636765 (https://phabricator.wikimedia.org/T266593) [23:58:22] cmjohnson1: yep! They're empty now. Most are downtimed already I think. [23:58:55] Cool!! Thanks [23:59:25] (03CR) 10Bstorm: k8s-haproxy: take steps to fix logging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636765 (https://phabricator.wikimedia.org/T266593) (owner: 10Bstorm) [23:59:29] ok, now they're all downtimed :)