[00:00:10] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:38] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:00] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:04] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:12] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:55:11] Operations, Fundraising-Backlog, Traffic, fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (DStrine)
[03:47:46] (PS3) KartikMistry: Add --notify-age-in-days option to notify users before draft purge [puppet] - https://gerrit.wikimedia.org/r/622528 (https://phabricator.wikimedia.org/T261189)
[03:48:01] (CR) KartikMistry: "This change is ready for review."
[puppet] - https://gerrit.wikimedia.org/r/622528 (https://phabricator.wikimedia.org/T261189) (owner: KartikMistry)
[05:12:06] (PS3) ArielGlenn: move dumps around on the snapshots in prep for network upgrade work [puppet] - https://gerrit.wikimedia.org/r/623177 (https://phabricator.wikimedia.org/T196487)
[05:14:20] (CR) ArielGlenn: [C: +2] move dumps around on the snapshots in prep for network upgrade work [puppet] - https://gerrit.wikimedia.org/r/623177 (https://phabricator.wikimedia.org/T196487) (owner: ArielGlenn)
[05:18:56] (PS1) Marostegui: dbproxy1021,1017: Test db1128 as m5 master [puppet] - https://gerrit.wikimedia.org/r/623222
[05:19:32] (CR) Marostegui: [C: +2] dbproxy1021,1017: Test db1128 as m5 master [puppet] - https://gerrit.wikimedia.org/r/623222 (owner: Marostegui)
[05:21:06] !log Reload haproxy on dbproxy1017 and dbproxy1021 to test db1128
[05:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:16] PROBLEM - Check systemd state on snapshot1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:23:00] PROBLEM - puppet last run on labsdb1012 is CRITICAL: CRITICAL: Puppet last ran 5 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[05:23:12] ^ I just enabled it :)
[05:28:50] RECOVERY - puppet last run on labsdb1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:03:43] marostegui: something is off, 8 AM and still not alter table running on a Monday
[06:03:49] *no alter
[06:04:11] :D
[06:06:38] !log reimage kafka-jumbo1005 to Debian Buster
[06:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:52] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[06:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:24] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 130 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[06:29:32] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 95 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[06:29:58] Operations, Fundraising-Backlog, Traffic, fundraising-tech-ops, FR-Email: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (CCogdill_WMF)
[06:30:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:35] the kafka alerts are expected
[06:39:04]
PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is CRITICAL: 113 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006
[06:53:34] (PS1) Elukey: install_server: ignore swap when reimaging kafka jumbo nodes [puppet] - https://gerrit.wikimedia.org/r/623227 (https://phabricator.wikimedia.org/T255123)
[06:55:30] elukey: cause we have stopped the alters before the DC switch tomorrow!
[06:55:40] ahhhhh
[06:55:42] :D
[06:59:24] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006
[07:00:18] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[07:02:20] (PS1) Giuseppe Lavagetto: dnsdisc: change retry logic [software/spicerack] - https://gerrit.wikimedia.org/r/623228 (https://phabricator.wikimedia.org/T260889)
[07:03:06] (CR) Volans: [C: +1] "Sure, LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/623228 (https://phabricator.wikimedia.org/T260889) (owner: Giuseppe Lavagetto)
[07:05:47] (CR) Giuseppe Lavagetto: [C: +2] dnsdisc: change retry logic [software/spicerack] - https://gerrit.wikimedia.org/r/623228 (https://phabricator.wikimedia.org/T260889) (owner: Giuseppe Lavagetto)
[07:08:16] (PS2) Volans: dnsdisc: change retry logic [software/spicerack] -
https://gerrit.wikimedia.org/r/623228 (https://phabricator.wikimedia.org/T260889) (owner: Giuseppe Lavagetto)
[07:10:01] (PS1) Muehlenhoff: Add debmonitor::server role to debmonitor2002 [puppet] - https://gerrit.wikimedia.org/r/623229 (https://phabricator.wikimedia.org/T261489)
[07:12:02] !log Sanitize jawikivoyage on db2094:3325 and db1124:3325 T260482
[07:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:07] T260482: Prepare and check storage layer for jawikivoyage - https://phabricator.wikimedia.org/T260482
[07:13:18] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[07:13:59] Operations, Patch-For-Review: Upgrade debmonitor to Buster - https://phabricator.wikimedia.org/T261489 (MoritzMuehlenhoff) p:Triage→Medium
[07:14:25] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.41 [software/spicerack] - https://gerrit.wikimedia.org/r/623230
[07:16:55] (CR) jerkins-bot: [V: -1] CHANGELOG: add changelogs for release v0.0.41 [software/spicerack] - https://gerrit.wikimedia.org/r/623230 (owner: Volans)
[07:17:39] (PS2) Volans: CHANGELOG: add changelogs for release v0.0.41 [software/spicerack] - https://gerrit.wikimedia.org/r/623230
[07:19:51] (CR) jerkins-bot: [V: -1] CHANGELOG: add changelogs for release v0.0.41 [software/spicerack] - https://gerrit.wikimedia.org/r/623230 (owner: Volans)
[07:24:44] !log installing openexr security updates on buster
[07:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:08] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/623230 (owner: Volans)
[07:30:50] !log installing squid security
updates
[07:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:43] Operations, ops-eqiad, Analytics-Clusters, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (elukey) @Jclark-ctr should we sync about this to schedule the first host (when you have time of course)?
[07:33:05] (PS1) Matthias Mullie: Disable MediaSearch A/B test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.6) - https://gerrit.wikimedia.org/r/623150
[07:33:53] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.41 [software/spicerack] - https://gerrit.wikimedia.org/r/623230 (owner: Volans)
[07:36:10] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.41 [software/spicerack] - https://gerrit.wikimedia.org/r/623230 (owner: Volans)
[07:42:03] (PS1) Volans: Upstream release v0.0.41 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/623236
[07:45:22] (CR) Volans: [C: +2] Upstream release v0.0.41 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/623236 (owner: Volans)
[07:45:42] (CR) Kormat: [C: -1] "This won't do what you want it to :)" [puppet] - https://gerrit.wikimedia.org/r/623227 (https://phabricator.wikimedia.org/T255123) (owner: Elukey)
[07:48:12] (Merged) jenkins-bot: Upstream release v0.0.41 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/623236 (owner: Volans)
[07:53:59] !log uploaded spicerack_0.0.41 to apt.wikimedia.org buster-wikimedia
[07:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:21] (PS1) Muehlenhoff: Bump changelog for new 0.2 package [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/623310
[07:56:40] (CR) Elukey: "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/623227 (https://phabricator.wikimedia.org/T255123) (owner: Elukey)
[08:01:34] (PS2) Elukey: install_server: ignore swap
when reimaging kafka jumbo nodes [puppet] - https://gerrit.wikimedia.org/r/623227 (https://phabricator.wikimedia.org/T255123)
[08:02:23] (CR) Muehlenhoff: [V: +2 C: +2] Bump changelog for new 0.2 package [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/623310 (owner: Muehlenhoff)
[08:24:29] (PS1) Muehlenhoff: Add Provides/Replaces/Conflicts for the old wmf-sre-laptop package name [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/623313
[08:33:10] Operations, Wikimedia-Mailing-lists: Mailing list for UG Greece - https://phabricator.wikimedia.org/T261607 (geraki)
[08:38:48] (CR) Kormat: [C: +2] install_server: ignore swap when reimaging kafka jumbo nodes [puppet] - https://gerrit.wikimedia.org/r/623227 (https://phabricator.wikimedia.org/T255123) (owner: Elukey)
[08:43:47] !log installing bind9 security updates on stretch/buster (client-side tools/libs only)
[08:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:07] (CR) Ema: [C: +1] "LGTM, you could explicitly mention "rename binary package wmf-sre-laptop to wmf-laptop-sre" in the changelog/commit message though."
[debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/623313 (owner: Muehlenhoff)
[08:52:06] (PS6) Kormat: mariadb: Create profile::mariadb::packages_wmf [puppet] - https://gerrit.wikimedia.org/r/622972 (https://phabricator.wikimedia.org/T256972)
[08:52:08] (PS3) Kormat: mariadb: Allow overriding of wmf-mariadb version in hiera [puppet] - https://gerrit.wikimedia.org/r/622995 (https://phabricator.wikimedia.org/T256972)
[08:52:10] (PS15) Kormat: mariadb: Add profile::mariadb::common [puppet] - https://gerrit.wikimedia.org/r/622578 (https://phabricator.wikimedia.org/T256972)
[08:54:11] (CR) Muehlenhoff: [V: +2 C: +2] Add Provides/Replaces/Conflicts for the old wmf-sre-laptop package name [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/623313 (owner: Muehlenhoff)
[08:59:22] (PS1) Muehlenhoff: Merge the two Conflicts: [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/623317
[09:02:23] (CR) Muehlenhoff: [V: +2 C: +2] Merge the two Conflicts: [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/623317 (owner: Muehlenhoff)
[09:20:45] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:24:30] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:24:39] XioNoX: FYI ^^^
[09:26:16] volans: that's CF peering, not in use yet
[09:26:21] (PS1) Volans: scripts: remove unused code path [software/netbox-extras] - https://gerrit.wikimedia.org/r/623321
[09:26:36] XioNoX: ack, I didn't make it in time to check on icinga the interface name
[09:26:40] it should not be flapping though, so if there is more of it I'll remove it from monitoring
[09:27:12] (CR) Volans: [C: +2] "Trivial, self-merging, lmk if you have any comment." [software/netbox-extras] - https://gerrit.wikimedia.org/r/623321 (owner: Volans)
[09:31:36] (CR) Gehel: [C: +1] "minor style issues reported by prospector, otherwise LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper)
[09:35:20] (PS1) Giuseppe Lavagetto: spicerack: create configuration for the switchdc.services cookbook [puppet] - https://gerrit.wikimedia.org/r/623323
[09:36:27] (CR) jerkins-bot: [V: -1] spicerack: create configuration for the switchdc.services cookbook [puppet] - https://gerrit.wikimedia.org/r/623323 (owner: Giuseppe Lavagetto)
[09:37:59] (PS2) Giuseppe Lavagetto: spicerack: create configuration for the switchdc.services cookbook [puppet] - https://gerrit.wikimedia.org/r/623323
[09:39:04] (CR) jerkins-bot: [V: -1] spicerack: create configuration for the switchdc.services cookbook [puppet] - https://gerrit.wikimedia.org/r/623323 (owner: Giuseppe Lavagetto)
[09:41:49] (PS3) Giuseppe Lavagetto: spicerack: create configuration for the switchdc.services cookbook [puppet] - https://gerrit.wikimedia.org/r/623323
[09:42:12] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/24812/cumin1001.eqiad.wmnet/fulldiff.html" [puppet] - https://gerrit.wikimedia.org/r/623323 (owner: Giuseppe Lavagetto)
[09:51:33] !log executed /srv/phab/phabricator/bin/remove destroy @klausman on phab1001 (following https://wikitech.wikimedia.org/wiki/Phabricator#Delete_a_user) to clear inconsistent state of new account (wrong email address)
[09:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:32] elukey: "destroy @klausman" - i approve!
[09:53:43] /o\ Ohnoes.
[09:53:51] elukey: can you also do the same with kormat's?
[09:54:05] * kormat sobs
[09:54:45] marostegui: I already tried but there was some extra protection in place, he must have a lot of powerful friends
[09:55:00] friends? that's surprising!
[09:55:06] marostegui: i agree
[09:55:09] ahahhahaha
[09:58:41] (PS1) Hnowlan: Add title for apiportalwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/623325 (https://phabricator.wikimedia.org/T246945)
[09:59:40] Who needs friends when you have Kompromat.
[09:59:48] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:59:52] (note the eerie similarity of that word to "kormat")
[10:00:40] huuh
[10:01:07] marostegui: less destroying, more DBA
[10:01:11] ;-)
[10:04:00] (PS3) Hnowlan: mediawiki: Add api.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/599751 (https://phabricator.wikimedia.org/T246945) (owner: Ladsgroup)
[10:04:15] (PS7) Jbond: use dnsmasq: add configuration to use dnsmasq with WMF config [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/614787
[10:04:17] (CR) jerkins-bot: [V: -1] mediawiki: Add api.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/599751 (https://phabricator.wikimedia.org/T246945) (owner: Ladsgroup)
[10:05:15] (PS8) Jbond: use dnsmasq: add configuration to use dnsmasq with WMF config [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/614787
[10:06:07] (PS9) Jbond: use dnsmasq: add configuration to use dnsmasq with WMF config [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/614787
[10:07:06] (CR) Jbond: "rebased, ready for review" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/614787 (owner: Jbond)
[10:07:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:09:47] topic diff: taking over clinic duty today from Cole
[10:12:27] (PS3) Hnowlan: api-gateway: Collect metrics [deployment-charts] - https://gerrit.wikimedia.org/r/623012 (https://phabricator.wikimedia.org/T254910)
[10:13:00] PROBLEM - Logstash Elasticsearch indexing errors #o11y on icinga1001 is CRITICAL: 8.021 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[10:15:03] Operations, SRE-Access-Requests: Requesting access to RESOURCE for klausman - https://phabricator.wikimedia.org/T261626 (klausman)
[10:15:36] Operations, SRE-Access-Requests: Requesting access to Production for klausman - https://phabricator.wikimedia.org/T261626 (klausman)
[10:18:30] (PS4) Hnowlan: mediawiki: Add api.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/599751 (https://phabricator.wikimedia.org/T246945) (owner: Ladsgroup)
[10:18:38] Operations, CommRel-Specialists-Support (Jul-Sep-2020), User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (Trizek-WMF) I prepared the banner that will be displayed from 13:30 to 14:05 UTC on all wikis, for both logged-in and logged out users. At...
[10:20:05] (CR) Jbond: [C: -1] base: remove override and conditionals for rasdaemon install (2 comments) [puppet] - https://gerrit.wikimedia.org/r/623027 (https://phabricator.wikimedia.org/T205396) (owner: Dzahn)
[10:22:15] RECOVERY - Logstash Elasticsearch indexing errors #o11y on icinga1001 is OK: (C)8 ge (W)1 ge 0.04583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[10:22:21] (CR) Hnowlan: [C: +2] api-gateway: Collect metrics [deployment-charts] - https://gerrit.wikimedia.org/r/623012 (https://phabricator.wikimedia.org/T254910) (owner: Hnowlan)
[10:23:32] (Merged) jenkins-bot: api-gateway: Collect metrics [deployment-charts] - https://gerrit.wikimedia.org/r/623012 (https://phabricator.wikimedia.org/T254910) (owner: Hnowlan)
[10:23:44] Operations, CommRel-Specialists-Support (Jul-Sep-2020), User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (Trizek-WMF) p:Medium→High
[10:25:44] Operations, CommRel-Specialists-Support (Jul-Sep-2020), User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (Trizek-WMF)
[10:27:51] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[10:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:51] Operations, observability, Patch-For-Review: Evaluate/integrate rasdaemon as a replacement for mcelog - https://phabricator.wikimedia.org/T205396 (jbond) >>! In T205396#4955858, @CDanis wrote: > @jbond kindly backported the buster version of rasdaemon to stretch. I'm going to attempt installing it o...
[10:30:05] jan_drewniak: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200831T1030).
[10:30:36] (PS1) Muehlenhoff: Adapt Makefile to build artefacts for Buster [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/623329
[10:33:30] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (Daimona)
[10:34:55] (CR) Jbond: [C: -1] "LGTM however the ircecho class is also used by shinken which is called by profile::wmcs::shinken and configured via role::wmcs::shinken" [puppet] - https://gerrit.wikimedia.org/r/623018 (owner: Dzahn)
[10:36:22] (PS1) Giuseppe Lavagetto: sre.switchdc.services: use configuration file [cookbooks] - https://gerrit.wikimedia.org/r/623330
[10:36:31] (PS2) Muehlenhoff: Adapt Makefile/Dockerfile to build artefacts for Buster [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/623329
[10:37:04] PROBLEM - Logstash Elasticsearch indexing errors #o11y on icinga1001 is CRITICAL: 8.438 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[10:38:40] (CR) Volans: [C: +1] "LGTM, if possible try locally to build the artifacts to check that everything works before merging."
[software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/623329 (owner: Muehlenhoff)
[10:38:44] (CR) Jbond: [C: +1] "LGTM but I'm no expert on scap" [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/623329 (owner: Muehlenhoff)
[10:39:26] (CR) Jbond: [C: +2] hiera3: remove old hiera backend files [puppet] - https://gerrit.wikimedia.org/r/615161 (owner: Jbond)
[10:40:48] RECOVERY - Logstash Elasticsearch indexing errors #o11y on icinga1001 is OK: (C)8 ge (W)1 ge 0.008333 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[10:46:22] PROBLEM - Logstash Elasticsearch indexing errors #o11y on icinga1001 is CRITICAL: 13.63 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[10:47:33] (PS1) Giuseppe Lavagetto: service::catalog: fix a couple monitoring instances.
[puppet] - https://gerrit.wikimedia.org/r/623331
[10:48:24] (CR) Volans: [C: -1] "One small issue, see inline, looks good otherwise" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/623330 (owner: Giuseppe Lavagetto)
[10:49:14] Operations, SRE-Access-Requests: Requesting access to Production for klausman - https://phabricator.wikimedia.org/T261626 (Aklapper) > Name of approving party (hiring manager for WMF staff): Nuria Ruiz CC'ing @nuria
[10:51:26] (PS2) Giuseppe Lavagetto: service::catalog: add monitoring to restbase via TLS [puppet] - https://gerrit.wikimedia.org/r/623331
[10:56:23] (CR) Giuseppe Lavagetto: sre.switchdc.services: use configuration file (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/623330 (owner: Giuseppe Lavagetto)
[10:56:36] (PS2) Giuseppe Lavagetto: sre.switchdc.services: use configuration file [cookbooks] - https://gerrit.wikimedia.org/r/623330
[10:58:33] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/24813/" [puppet] - https://gerrit.wikimedia.org/r/623331 (owner: Giuseppe Lavagetto)
[11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European mid-day backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200831T1100).
[11:00:05] Zoranzoki21, Ashot1997, and matthiasmullie: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:34] Hello, I'm here :)
[11:00:45] o/
[11:00:50] Hello
[11:01:00] (CR) Volans: [C: +1] "LGTM, let's test it!"
[cookbooks] - https://gerrit.wikimedia.org/r/623330 (owner: Giuseppe Lavagetto)
[11:02:17] (CR) Giuseppe Lavagetto: [C: +2] sre.switchdc.services: use configuration file [cookbooks] - https://gerrit.wikimedia.org/r/623330 (owner: Giuseppe Lavagetto)
[11:02:39] !log upgraded spicerack to 0.0.41 on cumin hosts
[11:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:11] (Merged) jenkins-bot: sre.switchdc.services: use configuration file [cookbooks] - https://gerrit.wikimedia.org/r/623330 (owner: Giuseppe Lavagetto)
[11:04:56] (CR) Muehlenhoff: "I made a local test run and the buster artefacts look sane; the binary packages use their cp37m counterpart on Buster." [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/623329 (owner: Muehlenhoff)
[11:05:01] (CR) Muehlenhoff: [V: +2 C: +2] Adapt Makefile/Dockerfile to build artefacts for Buster [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/623329 (owner: Muehlenhoff)
[11:08:24] Zoranzoki21: matthiasmullie : Ashot1997: I can deploy today
[11:09:00] (PS1) Giuseppe Lavagetto: sre.switchdc.services: fix bad syntax [cookbooks] - https://gerrit.wikimedia.org/r/623334
[11:09:13] Urbanecm: Great, I came home earlier.
:)
[11:09:35] (CR) Giuseppe Lavagetto: [C: +2] sre.switchdc.services: fix bad syntax [cookbooks] - https://gerrit.wikimedia.org/r/623334 (owner: Giuseppe Lavagetto)
[11:10:46] (Merged) jenkins-bot: sre.switchdc.services: fix bad syntax [cookbooks] - https://gerrit.wikimedia.org/r/623334 (owner: Giuseppe Lavagetto)
[11:11:00] (CR) Urbanecm: [C: +2] Disable MediaSearch A/B test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.6) - https://gerrit.wikimedia.org/r/623150 (owner: Matthias Mullie)
[11:11:38] (CR) Urbanecm: [C: +2] Enable sitenotice on mobile for closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/622777 (https://phabricator.wikimedia.org/T261357) (owner: Zoranzoki21)
[11:12:27] (Merged) jenkins-bot: Enable sitenotice on mobile for closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/622777 (https://phabricator.wikimedia.org/T261357) (owner: Zoranzoki21)
[11:13:11] (CR) Urbanecm: [C: +2] Enable Signature button on Wikiproject for hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/623109 (https://phabricator.wikimedia.org/T261550) (owner: Ashot1997)
[11:13:45] Zoranzoki21: please test at mwdebug1002
[11:14:12] Urbanecm: Ok, testing...
[11:14:16] (Merged) jenkins-bot: Disable MediaSearch A/B test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.6) - https://gerrit.wikimedia.org/r/623150 (owner: Matthias Mullie)
[11:15:18] Umm... My internet is currently slow.. I can't open Gerrit nor Phabricator nor wikis...
[11:15:37] (PS2) Urbanecm: Enable Signature button on Wikiproject for hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/623109 (https://phabricator.wikimedia.org/T261550) (owner: Ashot1997)
[11:15:39] (CR) Urbanecm: [C: +2] Enable Signature button on Wikiproject for hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/623109 (https://phabricator.wikimedia.org/T261550) (owner: Ashot1997)
[11:16:08] Okay is, you can deploy my patch...
[11:16:41] (Merged) jenkins-bot: Enable Signature button on Wikiproject for hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/623109 (https://phabricator.wikimedia.org/T261550) (owner: Ashot1997)
[11:17:11] Zoranzoki21: syncing
[11:17:15] Ashot1997: you're next!
[11:17:23] Great!
[11:17:55] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: b74893fecdaae599077daad5b1219ad3b9bc7fc9: Enable sitenotice on mobile for closed wikis (T261357) (duration: 00m 56s)
[11:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:00] T261357: Sitenotice of locked wikis is not displayed on mobile - https://phabricator.wikimedia.org/T261357
[11:18:19] Urbanecm: Should work, I see sitenotice on my phone :)
[11:18:21] Ashot1997: could you test at mwdebug1002, please?
[11:18:23] Zoranzoki21: cool!
[11:18:27] (PS1) Muehlenhoff: Create artifacts for 0.2.7 on Buster [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/623335 (https://phabricator.wikimedia.org/T261489)
[11:19:15] @Urbanecm: It works
[11:20:27] (CR) Arturo Borrero Gonzalez: [C: +1] "We tested this patch (livehacked) in codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/615161 (owner: Jbond)
[11:20:58] Ashot1997: thanks, syncing
[11:21:17] Urbanecm: cool, thanks ^_^
[11:22:37] !log removing old hiera version 1 and 3 backends
[11:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:43] (CR) Jbond: [C: +2] hiera3: remove old hiera backend files [puppet] - https://gerrit.wikimedia.org/r/615161 (owner: Jbond)
[11:22:48] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 81f88fde2aad23a619047b1177a6188f51df11a9: Enable Signature button on Wikiproject for hywiki (T261550) (duration: 00m 54s)
[11:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:51] T261550: Enable "Signature" button on "Վիքինախագիծ" (ns 102) namespace for hywiki
- https://phabricator.wikimedia.org/T261550
[11:23:06] Ashot1997: okay, should be live :)
[11:23:38] Yes, it is. Thanks ^_^
[11:23:49] happy to help!
[11:24:31] matthiasmullie: pulled your backport onto mwdebug1002, could you test, please?
[11:24:43] will do!
[11:26:18] (PS2) Muehlenhoff: Add debmonitor::server role to debmonitor2002 [puppet] - https://gerrit.wikimedia.org/r/623229 (https://phabricator.wikimedia.org/T261489)
[11:29:16] Urbanecm: not exactly doing what I expect, but could be cache-related
[11:29:21] works fine with ?debug=1
[11:29:25] so let's proceed
[11:29:34] matthiasmullie: okay, syncing then :)
[11:29:35] should be fine anyway :)
[11:29:39] thanks!
[11:31:40] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/623229 (https://phabricator.wikimedia.org/T261489) (owner: Muehlenhoff)
[11:32:11] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.6/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: 5d583d9550787a8e36c29ca841233615405fcb7e: Disable MediaSearch A/B test (duration: 00m 55s)
[11:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:22] matthiasmullie: here you go!
[11:32:34] thanks Urbanecm !
[11:37:32] RECOVERY - Logstash Elasticsearch indexing errors #o11y on icinga1001 is OK: (C)8 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[11:37:46] (PS6) Jbond: profile::openstack: drop legacy validate_ functions [puppet] - https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013)
[11:38:50] (CR) jerkins-bot: [V: -1] profile::openstack: drop legacy validate_ functions [puppet] - https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) (owner: Jbond)
[11:39:54] (CR) Volans: [C: +1] "LGTM" [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/623335 (https://phabricator.wikimedia.org/T261489) (owner: Muehlenhoff)
[11:44:43] (CR) Muehlenhoff: [V: +2 C: +2] Create artifacts for 0.2.7 on Buster [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/623335 (https://phabricator.wikimedia.org/T261489) (owner: Muehlenhoff)
[11:45:32] Operations, ops-eqiad, DBA, DC-Ops: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (Marostegui)
[11:45:43] Operations, ops-eqiad, DBA, DC-Ops: Wed, Sept 9 PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (Marostegui)
[11:45:53] Operations, ops-eqiad, DBA, DC-Ops: Thur, Sept 10 PDU Upgrade 12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 (Marostegui)
[11:46:00] Operations, ops-eqiad, DBA, DC-Ops: Mon, Sept 14 PDU Upgrade 12pm-4pm UTC- Racks C2 and C3 - https://phabricator.wikimedia.org/T261455 (Marostegui)
[11:46:08] Operations, ops-eqiad, DBA, DC-Ops: Tue, Sept 15 PDU Upgrade 12pm-4pm UTC- Racks C4 and C5 - https://phabricator.wikimedia.org/T261456 (Marostegui)
[11:46:14] Operations, ops-eqiad, DBA, DC-Ops: Wed, Sept 16 PDU
Upgrade 12pm-4pm UTC- Racks C6 and C7 - https://phabricator.wikimedia.org/T261457 (Marostegui) [11:46:24] Operations, ops-eqiad, DBA, DC-Ops: Mon, Sept 21 PDU Upgrade 12pm-4pm UTC- Racks D1 and D2 - https://phabricator.wikimedia.org/T261459 (Marostegui) [11:47:24] Operations, Analytics, Traffic: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (ema) [11:47:49] Operations, Traffic, Patch-For-Review: Analyze custom varnish 5.1 patches considering the migration to varnish 6 - https://phabricator.wikimedia.org/T260702 (ema) [11:47:51] Operations, Analytics, Traffic: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (ema) [11:48:22] Operations, Analytics, Traffic: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (ema) p:Triage→Medium [11:49:45] Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (Elitre) [11:49:55] Operations, MediaWiki-General, serviceops, MW-1.34-notes (1.34.0-wmf.16; 2019-07-30), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (matej_suchanek) Where is this now? Did the maintenance script act... 
[11:50:47] Operations, Maps, Product-Infrastructure-Team-Backlog, Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (Elitre) [11:58:49] !log reimage kafka-jumbo1001 to Buster [11:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:03] (CR) Urbanecm: [C: +1] Add title for apiportalwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/623325 (https://phabricator.wikimedia.org/T246945) (owner: Hnowlan) [12:05:11] !log oblivian@cumin2001 START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep [12:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:25] !log oblivian@cumin2001 END (PASS) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=0) [12:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:26] Operations, ops-eqiad, DBA, DC-Ops: Mon, Sept 21 PDU Upgrade 12pm-4pm UTC- Racks D1 and D2 - https://phabricator.wikimedia.org/T261459 (Marostegui) dbproxy1016 needs to be failover. I will take care of that now Please take extra care of db1125. Thanks! [12:13:07] Operations, ops-eqiad, DBA, DC-Ops: Wed, Sept 9 PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (Marostegui) Please take extra care of db1122 as it is an eqiad master and lots of slaves hang from it. We might stop mysql there just in case prior the maintenance. 
[12:13:29] (PS1) Jbond: apereo_Cas: fix fixtures [puppet] - https://gerrit.wikimedia.org/r/623341 [12:13:54] !log oblivian@cumin2001 START - Cookbook sre.switchdc.services.01-switch-dc [12:13:54] !log oblivian@cumin2001 Switching services restbase-async: eqiad => codfw [12:13:56] !log oblivian@cumin2001 END (PASS) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=0) [12:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:03] !log oblivian@cumin2001 START - Cookbook sre.switchdc.services.02-restore-ttl [12:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:20] !log oblivian@cumin2001 END (PASS) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=0) [12:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:08] (CR) jerkins-bot: [V: -1] apereo_Cas: fix fixtures [puppet] - https://gerrit.wikimedia.org/r/623341 (owner: Jbond) [12:15:15] Operations, SRE-Access-Requests: Requesting access to Production for klausman - https://phabricator.wikimedia.org/T261626 (klausman) L3 is signed. [12:15:24] (CR) Jbond: [V: +2 C: +2] "overriding CI base on the shared spec helper is broken" [puppet] - https://gerrit.wikimedia.org/r/623341 (owner: Jbond) [12:15:43] Operations, ops-eqiad, DBA, DC-Ops: Thur, Sept 10 PDU Upgrade 12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 (Marostegui) Please take extra care with db1123, db1093 and db1109, they are an eqiad masters and lots of slaves hang from them. We might stop mysql just in case. 
[12:16:01] <_joe_> volans: lol I did not make the sleep depend on --dry-run [12:16:21] Operations, SRE-Access-Requests: Requesting access to Production for klausman - https://phabricator.wikimedia.org/T261626 (klausman) [12:16:54] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 94 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [12:16:59] Operations, SRE-Access-Requests: Requesting access to Production for klausman - https://phabricator.wikimedia.org/T261626 (klausman) [12:17:07] _joe_: yeah I noticed earlier too [12:17:12] we do in the mediawiki one [12:17:15] nbd [12:17:32] Operations, SRE-Access-Requests: Requesting access to Production for klausman - https://phabricator.wikimedia.org/T261626 (klausman) [12:18:19] Operations, ops-eqiad, DBA, DC-Ops: Mon, Sept 14 PDU Upgrade 12pm-4pm UTC- Racks C2 and C3 - https://phabricator.wikimedia.org/T261455 (Marostegui) Please take extra care with db1087, db1100 and db1109, they are an eqiad masters and lots of slaves hang from them. We might stop mysql just in case. 
[12:19:29] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 108 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [12:19:31] (PS1) Giuseppe Lavagetto: service::catalog: add monitoring to kibana-ssl [puppet] - https://gerrit.wikimedia.org/r/623344 [12:20:00] Operations, ops-eqiad, DBA, DC-Ops: Tue, Sept 15 PDU Upgrade 12pm-4pm UTC- Racks C4 and C5 - https://phabricator.wikimedia.org/T261456 (Marostegui) Be careful with dbproxy1018 and dbproxy1019, and labsdb1010 as they serve cloud infra and that service will not be switched to codfw. [12:21:58] (CR) Filippo Giunchedi: [C: +1] service::catalog: add monitoring to kibana-ssl [puppet] - https://gerrit.wikimedia.org/r/623344 (owner: Giuseppe Lavagetto) [12:23:29] (PS1) Giuseppe Lavagetto: cxserver-https: add monitoring [puppet] - https://gerrit.wikimedia.org/r/623345 [12:23:41] (CR) Muehlenhoff: [C: +2] Add debmonitor::server role to debmonitor2002 [puppet] - https://gerrit.wikimedia.org/r/623229 (https://phabricator.wikimedia.org/T261489) (owner: Muehlenhoff) [12:23:43] (CR) Giuseppe Lavagetto: [C: +2] service::catalog: add monitoring to kibana-ssl [puppet] - https://gerrit.wikimedia.org/r/623344 (owner: Giuseppe Lavagetto) [12:24:10] Operations, SRE-swift-storage: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (fgiunchedi) [12:24:14] (CR) Giuseppe Lavagetto: [C: +2] cxserver-https: add monitoring [puppet] - https://gerrit.wikimedia.org/r/623345 (owner: Giuseppe Lavagetto) [12:24:25] _joe_: shall I merge your patch along?for kibana-ssl [12:24:27] (PS2) Giuseppe Lavagetto: cxserver-https: add monitoring [puppet] - https://gerrit.wikimedia.org/r/623345 [12:24:52] <_joe_> moritzm: sure, I was 
planning to merge it and the next one together, but nothing really changes :) [12:25:31] ok, simply merge mine alongside when you're ready, then? [12:25:36] <_joe_> let me merge your change too, yes [12:25:40] (PS1) Jbond: spec_helper: fix hiera definition in shared spec helper [puppet] - https://gerrit.wikimedia.org/r/623346 [12:25:58] <_joe_> moritzm: done [12:26:26] Operations, SRE-swift-storage, User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (fgiunchedi) [12:26:45] thx [12:26:52] (CR) Jbond: [C: +2] spec_helper: fix hiera definition in shared spec helper [puppet] - https://gerrit.wikimedia.org/r/623346 (owner: Jbond) [12:30:11] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is CRITICAL: 72 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [12:30:18] this is me --^ [12:32:23] (PS7) Jbond: profile::openstack: drop legacy validate_ functions [puppet] - https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) [12:33:25] (PS1) CDanis: Add SRV record verification for Element Matrix Services [dns] - https://gerrit.wikimedia.org/r/623348 (https://phabricator.wikimedia.org/T261531) [12:33:56] (CR) Jbond: "ready: rebased" [puppet] - https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) (owner: Jbond) [12:34:22] (PS1) Marostegui: wmnet: Failover m3 dbproxy [dns] - https://gerrit.wikimedia.org/r/623349 (https://phabricator.wikimedia.org/T261459) [12:36:34] (CR) Filippo Giunchedi: icinga: support contactgroups stubs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/622588 (owner: Filippo Giunchedi) [12:36:49] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 80 ge 
10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [12:37:45] !log oblivian@cumin2001 START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep [12:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:37] (CR) Filippo Giunchedi: [C: +1] kibana: move kibana.yml settings to parameters [puppet] - https://gerrit.wikimedia.org/r/622651 (owner: Herron) [12:41:07] (CR) Filippo Giunchedi: [C: +1] "LGTM, please note that new metrics will be created as part of this change (different instance label) and the old ones will stop updating." [puppet] - https://gerrit.wikimedia.org/r/622836 (https://phabricator.wikimedia.org/T252773) (owner: Herron) [12:43:02] !log oblivian@cumin2001 END (PASS) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=0) [12:43:04] (PS3) Filippo Giunchedi: aptrepo: import current reprepro 'updates' keys [puppet] - https://gerrit.wikimedia.org/r/621506 (https://phabricator.wikimedia.org/T260883) [12:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:11] (CR) Filippo Giunchedi: aptrepo: import reprepro 'updates' public keys (1 comment) [puppet] - https://gerrit.wikimedia.org/r/621485 (https://phabricator.wikimedia.org/T260883) (owner: Filippo Giunchedi) [12:44:44] !log oblivian@cumin2001 START - Cookbook sre.switchdc.services.01-switch-dc [12:44:44] !log oblivian@cumin2001 Switching services restbase-async: eqiad => codfw [12:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:46] !log oblivian@cumin2001 END (PASS) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=0) [12:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:11] !log oblivian@cumin2001 START - Cookbook sre.switchdc.services.02-restore-ttl [12:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:28] (PS1) Muehlenhoff: debmonitor: Fix dependencies on Buster [puppet] - https://gerrit.wikimedia.org/r/623354 [12:45:32] !log oblivian@cumin2001 END (PASS) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=0) [12:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:32] (CR) jerkins-bot: [V: -1] debmonitor: Fix dependencies on Buster [puppet] - https://gerrit.wikimedia.org/r/623354 (owner: Muehlenhoff) [12:46:59] (PS1) Elukey: install_server: fix partman recipe for kafka jumbo [puppet] - https://gerrit.wikimedia.org/r/623356 [12:47:42] (PS3) Filippo Giunchedi: aptrepo: import reprepro 'updates' public keys [puppet] - https://gerrit.wikimedia.org/r/621485 (https://phabricator.wikimedia.org/T260883) [12:47:43] (PS4) Filippo Giunchedi: aptrepo: import current reprepro 'updates' keys [puppet] - https://gerrit.wikimedia.org/r/621506 (https://phabricator.wikimedia.org/T260883) [12:47:59] !log oblivian@cumin2001 START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep [12:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:17] (CR) Elukey: [C: +2] install_server: fix partman recipe for kafka jumbo [puppet] - https://gerrit.wikimedia.org/r/623356 (owner: Elukey) [12:50:31] (PS2) Muehlenhoff: debmonitor: Fix dependencies on Buster [puppet] - https://gerrit.wikimedia.org/r/623354 [12:50:48] (PS1) Klausman: admin: Add user klausman [puppet] - https://gerrit.wikimedia.org/r/623357 [12:50:50] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/623357 (owner: Klausman) [12:51:24] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/621485 (https://phabricator.wikimedia.org/T260883) (owner: Filippo Giunchedi) [12:53:21] !log oblivian@cumin2001 END (PASS) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=0) [12:53:22] (CR) Elukey: "It is also ok to add yourself to the groups!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623357 (owner: Klausman) [12:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:51] !log oblivian@cumin2001 START - Cookbook sre.switchdc.services.01-switch-dc [12:53:51] !log oblivian@cumin2001 Switching services parsoid: eqiad => codfw [12:53:53] !log oblivian@cumin2001 END (PASS) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=0) [12:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:13] !log oblivian@cumin2001 START - Cookbook sre.switchdc.services.02-restore-ttl [12:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:27] !log oblivian@cumin2001 END (PASS) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=0) [12:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:31] (PS2) Klausman: admin: Add user klausman [puppet] - https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) [12:55:08] Operations: FY2020-2021 Q1 eqiad -> codfw switchover - https://phabricator.wikimedia.org/T243316 (Marostegui) [12:55:40] Operations, SRE-swift-storage, User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - 
https://phabricator.wikimedia.org/T261633 (fgiunchedi) p:Triage→High [12:55:58] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to Production for klausman - https://phabricator.wikimedia.org/T261626 (fgiunchedi) p:Triage→Medium [12:56:42] Operations, DNS, Matrix, Traffic, Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (fgiunchedi) p:Triage→Medium [12:56:58] Operations, Analytics, Traffic: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (ema) When it comes to varnish-modules, our current version (0.12.1-1+wmf2) does not build against 6.0.x, and same goes for varnish-modules 0.16.0 currently in testing. Luckily though, with a few changes... [12:57:00] Operations, Wikimedia-Mailing-lists: Mailing list for UG Greece - https://phabricator.wikimedia.org/T261607 (fgiunchedi) p:Triage→Medium [12:57:03] (CR) Klausman: "> It is also ok to add yourself to the groups!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) (owner: Klausman) [12:57:48] (CR) Muehlenhoff: [C: +1] "Grepping for VerifyRelease in updates this seems complete. However there are two definitions with two keys (Cassanda and HW monitoring too" [puppet] - https://gerrit.wikimedia.org/r/621506 (https://phabricator.wikimedia.org/T260883) (owner: Filippo Giunchedi) [12:59:01] Operations, SRE-Access-Requests: Request for access to analytics-privatedata-users for cparle - https://phabricator.wikimedia.org/T260450 (fgiunchedi) [12:59:18] (CR) Elukey: "For the groups, I'd say "ops" and "analytics-privatedata-users", should be good for the moment. 
Nuria will have to approve, and we'll also" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) (owner: Klausman) [13:02:42] (PS3) Klausman: admin: Add user klausman [puppet] - https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) [13:02:57] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [13:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:49] (CR) Klausman: "> Patch Set 2:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) (owner: Klausman) [13:05:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:57] (CR) Kormat: [C: +1] wmnet: Failover m3 dbproxy [dns] - https://gerrit.wikimedia.org/r/623349 (https://phabricator.wikimedia.org/T261459) (owner: Marostegui) [13:06:30] I am going to failover m3 (phabricator) dbproxy, it should be transparent, but if you notice issues, please let me know [13:06:43] (CR) Marostegui: [C: +2] wmnet: Failover m3 dbproxy [dns] - https://gerrit.wikimedia.org/r/623349 (https://phabricator.wikimedia.org/T261459) (owner: Marostegui) [13:07:09] !log Failover m3 (phabricator) proxy from dbproxy1016 to dbproxy1020 - T261459 [13:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:12] T261459: Mon, Sept 21 PDU Upgrade 12pm-4pm UTC- Racks D1 and D2 - https://phabricator.wikimedia.org/T261459 [13:07:42] (CR) Filippo Giunchedi: "> Patch Set 4: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/621506 (https://phabricator.wikimedia.org/T260883) (owner: Filippo Giunchedi) [13:09:53] Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Mon, Sept 21 PDU Upgrade 12pm-4pm UTC- Racks D1 and D2 - https://phabricator.wikimedia.org/T261459 (Marostegui) 
dbproxy1016 is no longer active, its service has been failed over. [13:10:03] (CR) Muehlenhoff: [C: +2] debmonitor: Fix dependencies on Buster [puppet] - https://gerrit.wikimedia.org/r/623354 (owner: Muehlenhoff) [13:10:40] PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor2002 is CRITICAL: connect to address 10.192.32.42 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor [13:13:14] PROBLEM - debmonitor.wikimedia.org:7443 CDN on debmonitor2002 is CRITICAL: connect to address 10.192.32.42 and port 7443: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor [13:14:03] (PS5) Filippo Giunchedi: aptrepo: import current reprepro 'updates' keys [puppet] - https://gerrit.wikimedia.org/r/621506 (https://phabricator.wikimedia.org/T260883) [13:14:05] (PS1) Filippo Giunchedi: aptrepo: remove obsolete keys for external repos [puppet] - https://gerrit.wikimedia.org/r/623359 (https://phabricator.wikimedia.org/T260883) [13:14:26] RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor2002 is OK: HTTP OK: Status line output matched HTTP/1.1 400 - 405 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [13:14:42] (PS1) Ema: Depool eqiad for 2020-2021 DC switchover [dns] - https://gerrit.wikimedia.org/r/623360 (https://phabricator.wikimedia.org/T243314) [13:15:26] (CR) Clarakosi: [C: +1] Install OAuthRateLimiter III: Install where enabled (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/622898 (https://phabricator.wikimedia.org/T258423) (owner: Ppchelko) [13:15:43] PROBLEM - debmonitor.wikimedia.org:80 on debmonitor2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Debmonitor [13:15:52] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:15:53] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:15:54] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:51] (CR) Ottomata: "Cool! for kicks, want to try to schedule this on in a backport window? Starting at step 2. here" [mediawiki-config] - https://gerrit.wikimedia.org/r/623048 (https://phabricator.wikimedia.org/T260382) (owner: Bearloga) [13:17:57] (CR) Cicalese: [C: +1] Add title for apiportalwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/623325 (https://phabricator.wikimedia.org/T246945) (owner: Hnowlan) [13:18:00] (PS1) Jbond: sslcert::x509_to_pkcs12: add define for creating p12 files [puppet] - https://gerrit.wikimedia.org/r/623361 (https://phabricator.wikimedia.org/T253957) [13:18:02] (PS1) Jbond: base::puppet: add ability to create p12 puppet cert [puppet] - https://gerrit.wikimedia.org/r/623362 (https://phabricator.wikimedia.org/T253957) [13:18:04] (PS1) Jbond: puppet ssl p12: enable generation of puppet p12 cert on test cluster [puppet] - https://gerrit.wikimedia.org/r/623363 (https://phabricator.wikimedia.org/T253957) [13:19:24] (CR) jerkins-bot: [V: -1] base::puppet: add ability to create p12 puppet cert [puppet] - https://gerrit.wikimedia.org/r/623362 (https://phabricator.wikimedia.org/T253957) (owner: Jbond) [13:21:36] (PS2) Jbond: base::puppet: add ability to create p12 puppet cert [puppet] - https://gerrit.wikimedia.org/r/623362 (https://phabricator.wikimedia.org/T253957) [13:22:40] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [13:24:12] (PS2) Jbond: puppet ssl p12: enable generation of puppet p12 cert on test cluster [puppet] - 
https://gerrit.wikimedia.org/r/623363 (https://phabricator.wikimedia.org/T253957) [13:30:37] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [13:31:26] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [13:34:50] (PS7) Kormat: WIP mariadb: simplify package_wmf [puppet] - https://gerrit.wikimedia.org/r/622569 [13:35:53] FYI: services switchover will start in about 25 minutes, followed by depooling eqiad -- please plan to hold off on any other production changes for a bit :) [13:35:56] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [13:39:14] (CR) Andrew Bogott: [C: +1] profile::openstack: drop legacy validate_ functions [puppet] - https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) (owner: Jbond) [13:39:55] (CR) Andrew Bogott: [C: +1] remove python2-only function from puppet_alert to move to py3 [puppet] - https://gerrit.wikimedia.org/r/622844 (https://phabricator.wikimedia.org/T218426) (owner: Bstorm) [13:40:27] (PS1) Ostrzyciel: Disable the reverted tag on all wikis except testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/623364 
(https://phabricator.wikimedia.org/T254074) [13:41:53] !log dropping many databases from m5, as per T261152 [13:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:57] T261152: Drop openstack databases from m5-master - https://phabricator.wikimedia.org/T261152 [13:42:27] andrewbogott: \o/ [13:43:45] (CR) Bearloga: "> Cool! for kicks, want to try to schedule this on in a backport window?" [mediawiki-config] - https://gerrit.wikimedia.org/r/623048 (https://phabricator.wikimedia.org/T260382) (owner: Bearloga) [13:43:53] (CR) Kosta Harlan: [C: +1] Disable the reverted tag on all wikis except testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/623364 (https://phabricator.wikimedia.org/T254074) (owner: Ostrzyciel) [13:48:34] (PS6) Jakob: Add `wmgWikibaseClientMainEntitySourceName` to InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/622612 (https://phabricator.wikimedia.org/T258060) (owner: Itamar Givon) [14:00:00] * volans here [14:00:11] one second late, I'm disappointed :) [14:00:19] Mon 16:00:00 * | volans here [14:00:20] that's just the latency [14:00:22] yep [14:00:22] from my log [14:00:40] so, recap of the plan for anyone following along: [14:01:24] today we're depooling eqiad for active-active services, using the sre.switchdc.services cookbook: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/services [14:01:49] following that we'll be depooling it at the caching layer as well, so that all user requests flow to the other four DCs [14:02:11] none of this is expected to be user-impacting, and we do individual pieces of it pretty regularly [14:02:36] <_joe_> there might be some noticeable latency added to some requests [14:02:40] tomorrow we'll be making the larger maneuver, which is moving the active DC for mediawiki from eqiad to codfw [14:02:52] <_joe_> given we'll be retreiving sessions from a remote dc [14:03:45] 
right -- and in the period in between, we'll see some hairpinning, as codfw users are sent to eqiad for MW and then back to codfw for sessions [14:04:51] we'll be keeping the TTLs short for about an hour or so, and if the latency hit is unacceptable, or we have any other issues, the backout plan is to repool some or all services in eqiad as needed [14:07:13] I'm sharing my terminal on cumin1001, for anyone (with root there) who wants to follow along: sudo -i tmux attach -rt switchdc [14:07:23] if modifying the flags, please keep -r, so that your session is read-only :) [14:08:00] we can read it :) [14:08:10] I'll wait one minute for anyone else to connect, then get started -- any questions or objections here? [14:08:12] and you have to match the window size with that of rzl :) [14:08:42] yeah, it takes the size of the smallest window, so as long as yours isn't too tiny I'll be fine :D [14:09:14] okay, rolling [14:09:48] volans, _joe_: lgty? [14:09:50] 80x24 should be enough for everyone :-P [14:10:04] <_joe_> +1 [14:10:23] --services SERVICES is not needed as we're doing all of htem right? [14:10:34] right, we're leaning on the new defaults [14:10:37] <_joe_> exactly [14:10:46] that got deployed earlier, right? 
[14:11:11] oh it's just the cookbook change, so it goes out with puppet [14:11:29] (PS2) Ppchelko: Add title for apiportalwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/623325 (https://phabricator.wikimedia.org/T246945) (owner: Hnowlan) [14:11:33] reducing the ttls now [14:11:34] and the spicerack change that got released [14:11:36] (CR) Muehlenhoff: "Dug a little deeper and those are actually not expired keys, but those repos use multiple keys:" [puppet] - https://gerrit.wikimedia.org/r/623359 (https://phabricator.wikimedia.org/T260883) (owner: Filippo Giunchedi) [14:11:39] so yeah, go ahead [14:11:39] !log rzl@cumin1001 START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep [14:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:57] for anyone watching, these retries are expected [14:12:04] we're just waiting for the TTL change to propagate everywhere [14:12:08] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=99) [14:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:26] slightly misleading message, maybe something to correct for the future [14:12:38] _joe_, volans: we expected that to have caught up with the longer retries though, right? [14:12:47] https://streamhut.io/ might be a good tool for wider-audience-watches-wizard. But we'd need some private hosting of it (by default it's pretty public, and I wouldn't want to be the one accidentally sharing key material or a password that way) [14:12:57] I guess more records make it slower to converge for some reason [14:13:02] <_joe_> rzl: yes, retry please? 
[14:13:09] !log rzl@cumin1001 START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep [14:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:22] it's going now [14:13:25] * volans tailing the logs [14:13:41] context here is we've been working on an issue where multiple simultaneous changes to confd don't always make it out everywhere consistently, but they settle down on a retry [14:13:49] *changes via confd excuse me [14:14:11] it also only affects TTL changes, so we're not worried about this affecting the performance of the actual switch [14:14:26] now that we've set the TTL from 300 to 10 seconds, we wait 300 seconds for it to expire everywhere [14:15:37] while we blame joe [14:18:07] annnnnnnd, [14:18:08] Operations, Data-Services, SRE-Access-Requests, Patch-For-Review, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (jijiki) a:jijiki→None [14:18:35] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=0) [14:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:44] okay! now the fun part :) [14:18:54] <_joe_> indeed [14:19:04] I'm going to have appserver dashboards open, but the more eyes the better [14:19:23] FTR: the TTL change worked, on cp3050 I see the A record of restbase being resolved every 10s [14:19:35] perfect, thanks [14:19:52] rzl: I'll check with confctl --object-type discovery select 'dnsdisc=.*' get | sort [14:20:10] 👍 [14:20:14] going ahead with 01-switch-dc, objections? 
[14:20:26] nope [14:20:37] !log rzl@cumin1001 START - Cookbook sre.switchdc.services.01-switch-dc [14:20:37] !log rzl@cumin1001 Switching services apertium, termbox, search, api-gateway, ores, sessionstore, eventgate-main, graphoid, eventstreams, wikifeeds, wdqs, parsoid, eventgate-logging-external, wdqs-internal, echostore, mathoid, mobileapps, proton, restbase, kartotherian, recommendation-api, eventgate-analytics-external, restbase-async, citoid, schema, cxserver, eventgate-analytics, zotero: eqiad => codfw [14:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:54] again, expected retries, for about ten seconds [14:20:57] woohoo [14:21:08] hmm [14:21:39] okay, that took a little longer than expected to converge, possibly the confd issue still? [14:21:48] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=0) [14:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:13] we'll hold here until 15:30 UTC and monitor [14:23:18] <_joe_> there seems to be more errors than usual on POSTs [14:23:24] question, is aqs expected to be out of the list? [14:23:39] (03PS1) 10Jason Linehan: Enables MediaWiki client errors on commonswiki and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) [14:23:50] volans: aqs is only in eqiad [14:23:51] <_joe_> volans: yes [14:23:55] k, thx [14:24:14] yeah, "don't touch anything that belongs to analytics" is the rule of thumb I've been following ) [14:24:17] *:) [14:24:49] _joe_: say more about POSTs? 
[14:25:08] there's the latency increase, not horrible though [14:25:12] oh, I see a spike that recovered [14:25:19] <_joe_> rzl: it was a spike [14:25:37] <_joe_> cdanis: we expected a latency increase [14:25:41] I know :) [14:25:49] there's also two similar-magnitude and shape spikes in POST errors over the past 3h [14:25:49] <_joe_> given sessionstore has an added 40ms of latency now [14:25:52] two other [14:26:16] one place to follow the latency increase is from the POV of ATS backends in esams [14:26:18] <_joe_> over the last day [14:26:19] https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cluster=text&var-origin=api-rw.discovery.wmnet&var-origin=appservers-rw.discovery.wmnet&var-origin=restbase.discovery.wmnet&from=now-1h&to=now [14:26:20] so I think that is just coincidence [14:26:25] yeah, it costs us about 50 ms at the mean and 100 ms at the 95th, about as expected [14:26:40] as reported by the appserver, that is [14:27:09] ema: nod [14:27:46] median TTFB is 236 ms now, 197 yesterday at this time [14:28:00] is anyone watching logstash? 
[14:28:05] yes [14:28:07] PROBLEM - Kartotherian LVS codfw #page on kartotherian.svc.codfw.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [14:28:22] looking, that might be us [14:28:34] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3061 is CRITICAL: cluster=cache_upload instance=cp3061 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3061 [14:28:38] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1084 is CRITICAL: cluster=cache_upload instance=cp1084 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1084 [14:28:40] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3059 is CRITICAL: cluster=cache_upload instance=cp3059 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3059 [14:28:40] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2037 is CRITICAL: cluster=cache_text instance=cp2037 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037 [14:28:52] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5003 is CRITICAL: cluster=cache_upload instance=cp5003 job=purged site=eqsin topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003 [14:28:52] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4027 is CRITICAL: cluster=cache_text instance=cp4027 job=purged site=ulsfo topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4027 [14:29:00] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3056 is CRITICAL: cluster=cache_text instance=cp3056 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [14:29:14] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1079 is CRITICAL: cluster=cache_text instance=cp1079 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [14:29:20] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2035 is CRITICAL: cluster=cache_text instance=cp2035 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035 [14:29:21] <_joe_> did we move the jobrunners? 
[14:29:22] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1089 is CRITICAL: cluster=cache_text instance=cp1089 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1089 [14:29:22] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1080 is CRITICAL: cluster=cache_upload instance=cp1080 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1080 [14:29:22] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1082 is CRITICAL: cluster=cache_upload instance=cp1082 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1082 [14:29:24] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2039 is CRITICAL: cluster=cache_text instance=cp2039 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039 [14:29:24] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2041 is CRITICAL: cluster=cache_text instance=cp2041 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2041 [14:29:26] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2038 is CRITICAL: cluster=cache_upload instance=cp2038 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2038 [14:29:26] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2036 is CRITICAL: cluster=cache_upload instance=cp2036 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2036 [14:29:26] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4021 is CRITICAL: cluster=cache_upload instance=cp4021 job=purged site=ulsfo topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4021 [14:29:27] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4023 is CRITICAL: cluster=cache_upload instance=cp4023 job=purged site=ulsfo topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4023 [14:29:30] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3052 is CRITICAL: cluster=cache_text instance=cp3052 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3052 [14:29:32] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3054 is CRITICAL: cluster=cache_text instance=cp3054 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [14:29:34] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3060 is CRITICAL: cluster=cache_text instance=cp3060 job=purged 
site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [14:29:37] _joe_: no [14:29:38] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5006 is CRITICAL: cluster=cache_upload instance=cp5006 job=purged site=eqsin topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [14:29:40] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2040 is CRITICAL: cluster=cache_upload instance=cp2040 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2040 [14:29:40] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2032 is CRITICAL: cluster=cache_upload instance=cp2032 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2032 [14:29:40] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4030 is CRITICAL: cluster=cache_text instance=cp4030 job=purged site=ulsfo topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4030 [14:29:40] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3064 is CRITICAL: cluster=cache_text instance=cp3064 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3064 [14:29:42] PROBLEM - Time elapsed since the 
last kafka event processed by purged on cp1083 is CRITICAL: cluster=cache_text instance=cp1083 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1083 [14:29:44] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2027 is CRITICAL: cluster=cache_text instance=cp2027 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027 [14:29:50] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5007 is CRITICAL: cluster=cache_text instance=cp5007 job=purged site=eqsin topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [14:29:52] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5010 is CRITICAL: cluster=cache_text instance=cp5010 job=purged site=eqsin topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010 [14:29:54] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5011 is CRITICAL: cluster=cache_text instance=cp5011 job=purged site=eqsin topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [14:29:58] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4028 is CRITICAL: cluster=cache_text instance=cp4028 job=purged site=ulsfo topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [14:30:00] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2028 is CRITICAL: cluster=cache_upload instance=cp2028 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2028 [14:30:00] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1075 is CRITICAL: cluster=cache_text instance=cp1075 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1075 [14:30:00] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3058 is CRITICAL: cluster=cache_text instance=cp3058 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3058 [14:30:00] moving to #wikimedia-sre to dodge icinga noise [14:30:02] I believe the purged alerts are a false alarm [14:30:02] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1085 is CRITICAL: cluster=cache_text instance=cp1085 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1085 [14:30:06] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2031 is CRITICAL: cluster=cache_text instance=cp2031 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2031 [14:30:08] PROBLEM - Time elapsed since the 
last kafka event processed by purged on cp1086 is CRITICAL: cluster=cache_upload instance=cp1086 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1086 [14:30:10] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1078 is CRITICAL: cluster=cache_upload instance=cp1078 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1078 [14:30:10] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1087 is CRITICAL: cluster=cache_text instance=cp1087 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [14:30:10] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5005 is CRITICAL: cluster=cache_upload instance=cp5005 job=purged site=eqsin topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005 [14:30:10] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5008 is CRITICAL: cluster=cache_text instance=cp5008 job=purged site=eqsin topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [14:30:12] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2030 is CRITICAL: cluster=cache_upload instance=cp2030 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2030 [14:30:12] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3062 is CRITICAL: cluster=cache_text instance=cp3062 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3062 [14:30:16] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2034 is CRITICAL: cluster=cache_upload instance=cp2034 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2034 [14:30:16] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2029 is CRITICAL: cluster=cache_text instance=cp2029 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2029 [14:30:16] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3050 is CRITICAL: cluster=cache_text instance=cp3050 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [14:30:20] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5004 is CRITICAL: cluster=cache_upload instance=cp5004 job=purged site=eqsin topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5004 [14:30:20] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4026 is CRITICAL: cluster=cache_upload instance=cp4026 job=purged 
site=ulsfo topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4026 [14:30:24] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1081 is CRITICAL: cluster=cache_text instance=cp1081 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1081 [14:30:24] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3063 is CRITICAL: cluster=cache_upload instance=cp3063 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3063 [14:30:26] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3065 is CRITICAL: cluster=cache_upload instance=cp3065 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3065 [14:30:26] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5002 is CRITICAL: cluster=cache_upload instance=cp5002 job=purged site=eqsin topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002 [14:30:26] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1077 is CRITICAL: cluster=cache_text instance=cp1077 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077 [14:30:28] PROBLEM - Time elapsed since the last kafka event 
processed by purged on cp1088 is CRITICAL: cluster=cache_upload instance=cp1088 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1088 [14:30:30] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1090 is CRITICAL: cluster=cache_upload instance=cp1090 job=purged site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1090 [14:30:32] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4032 is CRITICAL: cluster=cache_text instance=cp4032 job=purged site=ulsfo topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4032 [14:30:32] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4029 is CRITICAL: cluster=cache_text instance=cp4029 job=purged site=ulsfo topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4029 [14:30:36] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4022 is CRITICAL: cluster=cache_upload instance=cp4022 job=purged site=ulsfo topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4022 [14:30:36] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4024 is CRITICAL: cluster=cache_upload instance=cp4024 job=purged site=ulsfo topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4024 [14:30:37] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3051 is CRITICAL: cluster=cache_upload instance=cp3051 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3051 [14:30:37] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4031 is CRITICAL: cluster=cache_text instance=cp4031 job=purged site=ulsfo topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [14:30:37] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3053 is CRITICAL: cluster=cache_upload instance=cp3053 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3053 [14:30:44] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5001 is CRITICAL: cluster=cache_upload instance=cp5001 job=purged site=eqsin topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5001 [14:30:44] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5009 is CRITICAL: cluster=cache_text instance=cp5009 job=purged site=eqsin topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [14:30:46] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1076 is CRITICAL: cluster=cache_upload instance=cp1076 job=purged 
site=eqiad topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1076 [14:30:46] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4025 is CRITICAL: cluster=cache_upload instance=cp4025 job=purged site=ulsfo topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025 [14:30:54] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3055 is CRITICAL: cluster=cache_upload instance=cp3055 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3055 [14:31:02] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2033 is CRITICAL: cluster=cache_text instance=cp2033 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033 [14:31:13] wow [14:31:14] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2042 is CRITICAL: cluster=cache_upload instance=cp2042 job=purged site=codfw topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2042 [14:31:16] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3057 is CRITICAL: cluster=cache_upload instance=cp3057 job=purged site=esams topic=eqiad.resource-purge https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3057 [14:32:16] <_joe_> vgutierrez: it's expected 
[14:32:25] <_joe_> we're not producing purges via eqiad anymore [14:33:16] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:33:18] !log rzl@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [14:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:50] <_joe_> so it seems maps can't handle the load in a single DC [14:33:52] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:33:52] <_joe_> gehel: ^^ [14:34:06] also ryankemper [14:34:10] looking [14:35:01] <_joe_> elukey: this is more a matter of "we need more servers", hence I was pinging the boss :P [14:35:06] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 8.410 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:35:34] RECOVERY - Kartotherian LVS codfw #page on kartotherian.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [14:35:35] _joe_ I didn't want to say that the boss wasn't enough, added Ryan in Cc just in case :) [14:35:42] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:36:39] _joe_: known, I think [14:36:47] ugh I apologize, I heard the ping of the sms just now and realized that I must have heard an earlier one but was too deep in the code weeds to realize it :-( [14:36:53] there is more hardware coming Soon™ [14:37:00] who's the boss of maps? [14:37:10] apergos: no worries, a bunch of us were looking [14:37:33] thank goodness! 
[14:38:29] !log installing rake security updates on stretch [14:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:15] moritzm: we're in the switchdc maint window, was this coordinated? [14:40:44] CPU saturation on maps codfw seems to go down, not much else we can do except wait for new hardware... [14:41:05] ack [14:41:22] call it a successful test then, I guess :) sorry to spice up your afternoon [14:42:02] mark: this is just for edge DCs, but I can also wait [14:42:38] if it can wait then let's not, and if it can't wait let's coordinate it, we have enough variables as it is :) [14:42:54] ok [14:42:57] CPU almost back to normal on maps, I assume there is cache somewhere that is now hot [14:43:58] gehel: we repooled kartotherian in eqiad at 14:33 [14:44:22] so, not caching, we just put the capacity back [14:44:23] rzl: Oh, I missed that one. That correlates [14:44:28] or at least not only caching [14:44:34] gehel: https://grafana.wikimedia.org/d/XhFPDdMGz/cluster-overview?orgId=1&from=now-1h&to=now&var-site=eqiad&var-cluster=maps&var-instance=All&var-datasource=thanos [14:49:43] (03PS1) 10Vgutierrez: Release 2.0.91-3wm [software/varnish/libvmod-tbf] (debian) - 10https://gerrit.wikimedia.org/r/623396 (https://phabricator.wikimedia.org/T261632) [14:54:24] (03PS1) 10CDanis: purged: only care about the lowest last-event-time [puppet] - 10https://gerrit.wikimedia.org/r/623398 [14:55:44] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Depool eqiad for 2020-2021 DC switchover [dns] - 10https://gerrit.wikimedia.org/r/623360 (https://phabricator.wikimedia.org/T243314) (owner: 10Ema) [14:56:09] (03CR) 10RLazarus: [C: 03+1] Depool eqiad for 2020-2021 DC switchover [dns] - 10https://gerrit.wikimedia.org/r/623360 (https://phabricator.wikimedia.org/T243314) (owner: 10Ema) [14:56:50] (03PS2) 10Ema: Depool eqiad for 2020-2021 DC switchover [dns] - 10https://gerrit.wikimedia.org/r/623360 (https://phabricator.wikimedia.org/T243316) [14:57:16] (03PS1) 
10Hnowlan: api-gateway: expose port for admin interface [deployment-charts] - 10https://gerrit.wikimedia.org/r/623399 (https://phabricator.wikimedia.org/T254910) [14:57:22] (03PS1) 10Effie Mouzeli: admin: add cparle to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/623400 (https://phabricator.wikimedia.org/T260450) [14:57:38] (03CR) 10jerkins-bot: [V: 04-1] admin: add cparle to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/623400 (https://phabricator.wikimedia.org/T260450) (owner: 10Effie Mouzeli) [14:57:58] (03CR) 10Ema: [C: 03+2] Depool eqiad for 2020-2021 DC switchover [dns] - 10https://gerrit.wikimedia.org/r/623360 (https://phabricator.wikimedia.org/T243316) (owner: 10Ema) [14:58:26] !log Traffic: depool eqiad from user traffic T243316 [14:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:31] T243316: FY2020-2021 Q1 eqiad -> codfw switchover - https://phabricator.wikimedia.org/T243316 [15:01:00] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10Patch-For-Review, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10Bstorm) Are DCops considered part of ops only? They are already in that with... 
[15:01:12] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) [15:01:13] (03CR) 10Jbond: [C: 03+2] profile::openstack: drop legacy validate_ functions [puppet] - 10https://gerrit.wikimedia.org/r/616742 (https://phabricator.wikimedia.org/T259013) (owner: 10Jbond) [15:05:37] (03Abandoned) 10Effie Mouzeli: admin: add cparle to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/623400 (https://phabricator.wikimedia.org/T260450) (owner: 10Effie Mouzeli) [15:09:29] 10Operations, 10Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10CKoerner_WMF) 05Open→03Resolved a:03CKoerner_WMF Thanks @Nintendofan885 for the reminder. Resolved! [15:09:46] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 39.47 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:10:13] <_joe_> 😱 [15:10:22] <_joe_> ema: ^^ :) [15:10:24] :) [15:10:54] traffic drop in eqiad expected, it's depooled [15:11:00] 📉 [15:11:01] <_joe_> I know [15:11:14] very not stonks [15:11:23] <_joe_> have you ever seen me react with 😱 to a production issue? [15:12:17] _joe_: sure, just mentioning that for the record/lurkers [15:12:58] (03CR) 10Ppchelko: "Hm, the access to it breaches more and more defenses..." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/623399 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [15:23:31] (03PS2) 10CDanis: purged: only care about the lowest last-event-time [puppet] - 10https://gerrit.wikimedia.org/r/623398 [15:24:51] (03CR) 10Ema: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/623398 (owner: 10CDanis) [15:25:14] (03CR) 10CDanis: [C: 03+2] purged: only care about the lowest last-event-time [puppet] - 10https://gerrit.wikimedia.org/r/623398 (owner: 10CDanis) [15:27:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:30:19] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4021 is OK: (C)5000 gt (W)3000 gt 95.02 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4021 [15:30:33] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3064 is OK: (C)5000 gt (W)3000 gt 279.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3064 [15:30:33] 🎉 ema ^ [15:30:37] prepare for more icinga-wm spam :) [15:30:47] \o/ [15:30:53] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3050 is OK: (C)5000 gt (W)3000 gt 247.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [15:30:59] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5011 is OK: (C)5000 gt (W)3000 gt 477.6 https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [15:32:58] !log rzl@cumin1001 START - Cookbook sre.switchdc.services.02-restore-ttl [15:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:28] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=99) [15:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:43] !log rzl@cumin1001 START - Cookbook sre.switchdc.services.02-restore-ttl [15:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:56] same as before (for context) [15:34:09] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=0) [15:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:38:16] (03PS1) 10Effie Mouzeli: admin: add cparle to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/623407 (https://phabricator.wikimedia.org/T260450) [15:41:38] (03PS1) 10Ryan Kemper: elasticsearch: fix prom query syntax [software/spicerack] - 10https://gerrit.wikimedia.org/r/623408 [15:42:49] (03PS1) 10MarcoAurelio: [WIP] Limit Echo's `push-subscription-manager` group to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 [15:43:13] (03PS2) 10Ryan Kemper: elasticsearch: fix prom query syntax [software/spicerack] - 10https://gerrit.wikimedia.org/r/623408 [15:43:51] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to production shell and wmf ldap access for Razzi Abuissa - https://phabricator.wikimedia.org/T261443 (10razzi) 05Open→03Resolved SSH is 
working! Thanks all [15:44:07] (03CR) 10Ryan Kemper: "Minor error in query syntax was breaking everything. This makes the prometheus query work as tested like so:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/623408 (owner: 10Ryan Kemper) [15:44:21] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1075 is OK: (C)5000 gt (W)3000 gt 90.77 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1075 [15:44:21] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2028 is OK: (C)5000 gt (W)3000 gt 35.55 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2028 [15:44:21] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2041 is OK: (C)5000 gt (W)3000 gt 49.43 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2041 [15:44:23] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5007 is OK: (C)5000 gt (W)3000 gt 506.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [15:44:27] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4028 is OK: (C)5000 gt (W)3000 gt 103.5 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [15:44:31] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3058 is OK: (C)5000 gt (W)3000 gt 338.6 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3058 [15:44:31] RECOVERY - Time elapsed 
since the last kafka event processed by purged on cp1078 is OK: (C)5000 gt (W)3000 gt 202.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1078 [15:44:33] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1087 is OK: (C)5000 gt (W)3000 gt 77.48 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [15:44:33] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2031 is OK: (C)5000 gt (W)3000 gt 33.95 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2031 [15:44:41] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2042 is OK: (C)5000 gt (W)3000 gt 54.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2042 [15:44:41] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2039 is OK: (C)5000 gt (W)3000 gt 130.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039 [15:44:43] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2030 is OK: (C)5000 gt (W)3000 gt 27.54 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2030 [15:44:47] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3056 is OK: (C)5000 gt (W)3000 gt 311.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [15:44:47] RECOVERY - Time elapsed since 
the last kafka event processed by purged on cp3062 is OK: (C)5000 gt (W)3000 gt 296.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3062 [15:44:47] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2034 is OK: (C)5000 gt (W)3000 gt 26.69 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2034 [15:44:47] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2029 is OK: (C)5000 gt (W)3000 gt 55.16 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2029 [15:44:51] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1077 is OK: (C)5000 gt (W)3000 gt 133.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077 [15:44:51] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3059 is OK: (C)5000 gt (W)3000 gt 193.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3059 [15:44:55] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3051 is OK: (C)5000 gt (W)3000 gt 261.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3051 [15:44:59] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3057 is OK: (C)5000 gt (W)3000 gt 363.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3057 [15:45:07] RECOVERY - Time elapsed since the 
last kafka event processed by purged on cp1079 is OK: (C)5000 gt (W)3000 gt 143.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [15:45:09] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4027 is OK: (C)5000 gt (W)3000 gt 87.49 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4027 [15:45:51] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [15:46:18] (03CR) 10Gehel: [C: 03+2] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/623408 (owner: 10Ryan Kemper) [15:47:43] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2033 is OK: (C)5000 gt (W)3000 gt 121.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033 [15:48:08] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4024 is OK: (C)5000 gt (W)3000 gt 78.31 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4024 [15:48:10] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4032 is OK: (C)5000 gt (W)3000 gt 90.98 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4032 [15:50:50] !log rzl@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=eqiad [15:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:26] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3055 is OK: (C)5000 gt (W)3000 gt 257.1 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3055 [15:51:26] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4025 is OK: (C)5000 gt (W)3000 gt 72.97 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025 [15:51:28] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5003 is OK: (C)5000 gt (W)3000 gt 438.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003 [15:51:34] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3053 is OK: (C)5000 gt (W)3000 gt 152.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3053 [15:51:48] PROBLEM - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:40] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5006 is OK: (C)5000 gt (W)3000 gt 625.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [15:52:56] RECOVERY - Check systemd state on wdqs2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:40] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1089 is OK: (C)5000 gt (W)3000 gt 195.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1089 [15:53:42] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4029 is OK: (C)5000 gt (W)3000 gt 261.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4029 [15:53:52] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:54:01] wdqs2008 is me [15:54:10] (it is depooled) [15:56:00] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1081 is OK: (C)5000 gt (W)3000 gt 121.6 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1081 [15:56:02] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5001 is OK: (C)5000 gt (W)3000 gt 380.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5001 [15:56:02] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2036 is OK: (C)5000 gt (W)3000 gt 37.77 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2036 [15:56:04] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5009 is OK: (C)5000 gt (W)3000 gt 310.5 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [15:56:10] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1076 is OK: (C)5000 gt (W)3000 gt 184.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1076 [15:56:18] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4022 is OK: (C)5000 gt (W)3000 gt 90.29 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4022 [15:56:18] RECOVERY - Time elapsed since the last kafka event 
processed by purged on cp4030 is OK: (C)5000 gt (W)3000 gt 70.79 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4030 [15:56:28] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5005 is OK: (C)5000 gt (W)3000 gt 307 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005 [15:56:28] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2040 is OK: (C)5000 gt (W)3000 gt 38.29 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2040 [15:57:05] (03PS2) 10MarcoAurelio: [WIP] Limit Echo's `push-subscription-manager` group to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 [15:57:30] (03CR) 10MarcoAurelio: "WIP patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (owner: 10MarcoAurelio) [15:57:52] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3061 is OK: (C)5000 gt (W)3000 gt 235.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3061 [15:57:52] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3063 is OK: (C)5000 gt (W)3000 gt 220.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3063 [15:57:52] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3065 is OK: (C)5000 gt (W)3000 gt 192.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3065 [15:57:54] RECOVERY - Time elapsed since the last kafka 
event processed by purged on cp4031 is OK: (C)5000 gt (W)3000 gt 225.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [15:57:57] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Limit Echo's `push-subscription-manager` group to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (owner: 10MarcoAurelio) [15:57:58] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1082 is OK: (C)5000 gt (W)3000 gt 542.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1082 [15:59:02] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1083 is OK: (C)5000 gt (W)3000 gt 1567 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1083 [15:59:02] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2032 is OK: (C)5000 gt (W)3000 gt 23.69 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2032 [15:59:38] PROBLEM - Too many messages in kafka logging-eqiad #o11y on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:00:02] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2035 is OK: (C)5000 gt (W)3000 gt 42.17 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035 [16:00:04] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4026 is OK: (C)5000 gt (W)3000 gt 61.96 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4026 [16:01:19] (03CR) 10Nuria: [C: 03+1] admin: Add user klausman [puppet] - 10https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) (owner: 10Klausman) [16:03:02] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2038 is OK: (C)5000 gt (W)3000 gt 30.75 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2038 [16:03:04] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1084 is OK: (C)5000 gt (W)3000 gt 166.5 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1084 [16:03:20] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5002 is OK: (C)5000 gt (W)3000 gt 441.5 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002 [16:03:59] (03PS3) 10MarcoAurelio: [WIP] Limit Echo's `push-subscription-manager` group to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 [16:04:10] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1085 is OK: (C)5000 gt (W)3000 gt 233.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1085 [16:04:22] RECOVERY - Time elapsed since the last kafka event processed by 
purged on cp2037 is OK: (C)5000 gt (W)3000 gt 47.37 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037 [16:05:26] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3054 is OK: (C)5000 gt (W)3000 gt 296.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [16:05:26] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3060 is OK: (C)5000 gt (W)3000 gt 386.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [16:06:38] mdholloway: hi - would https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/623370 serve the purpose? [16:08:46] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5004 is OK: (C)5000 gt (W)3000 gt 431.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5004 [16:09:48] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3052 is OK: (C)5000 gt (W)3000 gt 289.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3052 [16:13:05] (03CR) 10Ottomata: admin: Add razzi to users and add to analytics groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622878 (https://phabricator.wikimedia.org/T261443) (owner: 10Razzi) [16:13:12] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1090 is OK: (C)5000 gt (W)3000 gt 449.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1090 [16:13:45] (03PS4) 
10MarcoAurelio: [WIP] Limit Echo's `push-subscription-manager` group to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 [16:14:20] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2027 is OK: (C)5000 gt (W)3000 gt 52.39 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027 [16:16:44] (03PS2) 10Urbanecm: itwiki: Assign patrol right to autopatrolled instead of autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623191 (https://phabricator.wikimedia.org/T261587) [16:17:28] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1088 is OK: (C)5000 gt (W)3000 gt 181.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1088 [16:17:30] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5010 is OK: (C)5000 gt (W)3000 gt 764.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010 [16:17:42] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4023 is OK: (C)5000 gt (W)3000 gt 105.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4023 [16:17:54] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5008 is OK: (C)5000 gt (W)3000 gt 561.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [16:19:07] (03CR) 10Dzahn: "ah, ok. well, that's unfortunate because shinken is unmaintained both upstream and at WMF. I don't think it's worth getting into it then." 
[puppet] - 10https://gerrit.wikimedia.org/r/623018 (owner: 10Dzahn) [16:19:40] (03Abandoned) 10Dzahn: ircecho: split server var into FQDN and port, data types, hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/623018 (owner: 10Dzahn) [16:19:42] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1080 is OK: (C)5000 gt (W)3000 gt 223.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1080 [16:19:46] (03PS5) 10MarcoAurelio: CommonSettings.php: limit new Echo's `push-subscription-manager` group to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (https://phabricator.wikimedia.org/T261625) [16:19:58] (03CR) 10MarcoAurelio: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (https://phabricator.wikimedia.org/T261625) (owner: 10MarcoAurelio) [16:20:08] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1086 is OK: (C)5000 gt (W)3000 gt 280.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1086 [16:20:25] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): wikimedia.pl returns a HTTP 429 error (let it access varnish maps_domains) - https://phabricator.wikimedia.org/T261506 (10TOR) If this is not fixed today (within the next couple hours) we will be forced to use the English-language interface at... 
[16:20:39] (03CR) 10Jdlrobson: [C: 03+1] Enables MediaWiki client errors on commonswiki and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [16:21:10] (03Abandoned) 10Dzahn: base: remove override and conditionals for rasdaemon install [puppet] - 10https://gerrit.wikimedia.org/r/623027 (https://phabricator.wikimedia.org/T205396) (owner: 10Dzahn) [16:21:13] (03CR) 10Jdlrobson: [C: 03+1] "Traffic is 2417 errors per day, so I think we have capacity for these 2 wikis, even without the pending change that's rolling out on this " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [16:21:14] jouncebot: next [16:21:15] In 0 hour(s) and 38 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200831T1700) [16:21:37] (03PS4) 10Dzahn: decom releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/621090 (https://phabricator.wikimedia.org/T260742) [16:22:48] (03CR) 10Urbanecm: [C: 03+1] "LGTM, this should fix it 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (https://phabricator.wikimedia.org/T261625) (owner: 10MarcoAurelio) [16:23:49] meow :) [16:24:44] (03PS1) 10CDanis: add wikimedia.pl to list of allowed Maps Referers [puppet] - 10https://gerrit.wikimedia.org/r/623416 (https://phabricator.wikimedia.org/T261506) [16:25:20] (03CR) 10Majavah: CommonSettings.php: limit new Echo's `push-subscription-manager` group to Meta-Wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (https://phabricator.wikimedia.org/T261625) (owner: 10MarcoAurelio) [16:27:52] RECOVERY - Too many messages in kafka logging-eqiad #o11y on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:28:14] (03PS1) 10Urbanecm: Add two domains to wgCopyUploadsDomains for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623418 (https://phabricator.wikimedia.org/T261562) [16:28:16] (03PS2) 10CDanis: add wikimedia.pl to list of allowed Maps Referers [puppet] - 10https://gerrit.wikimedia.org/r/623416 (https://phabricator.wikimedia.org/T261506) [16:29:22] (03PS2) 10Hnowlan: api-gateway: expose port for admin interface [deployment-charts] - 10https://gerrit.wikimedia.org/r/623399 (https://phabricator.wikimedia.org/T254910) [16:30:31] 10Operations, 10Analytics-Radar, 10Traffic, 10Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (10Milimetric) [16:30:49] (03CR) 10Hnowlan: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/623399 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [16:33:08] (03CR) 10Jason Linehan: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [16:33:24] (03CR) 10BBlack: [C: 03+1] add wikimedia.pl to list of allowed Maps Referers [puppet] - 10https://gerrit.wikimedia.org/r/623416 (https://phabricator.wikimedia.org/T261506) (owner: 10CDanis) [16:33:38] (03CR) 10CDanis: [C: 03+2] add wikimedia.pl to list of allowed Maps Referers [puppet] - 10https://gerrit.wikimedia.org/r/623416 (https://phabricator.wikimedia.org/T261506) (owner: 10CDanis) [16:35:20] 10Operations, 10Maps, 10Traffic, 10Patch-For-Review, 10Wiki-Loves-Monuments (2020): wikimedia.pl returns a HTTP 429 error (let it access varnish maps_domains) - https://phabricator.wikimedia.org/T261506 (10CDanis) 05Open→03Resolved A fix has been merged and should take effect within the 
next half hou... [16:37:26] 10Operations, 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (10fgiunchedi) In terms of disk benchmarks, I've ran an initial ~1h stress test with `fio` running a mix of random read/writes and sequential reads and writes. The idea be... [16:38:11] (03CR) 10Dzahn: "@hashar helm charts move is not related to this" [puppet] - 10https://gerrit.wikimedia.org/r/621090 (https://phabricator.wikimedia.org/T260742) (owner: 10Dzahn) [16:38:35] (03PS2) 10Dzahn: graphite: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621364 [16:40:49] (03CR) 10Dzahn: [C: 03+2] graphite: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621364 (owner: 10Dzahn) [16:41:09] (03PS6) 10Filippo Giunchedi: aptrepo: import current reprepro 'updates' keys [puppet] - 10https://gerrit.wikimedia.org/r/621506 (https://phabricator.wikimedia.org/T260883) [16:41:11] (03PS2) 10Filippo Giunchedi: aptrepo: add note re: multiple keys for 'updates' [puppet] - 10https://gerrit.wikimedia.org/r/623359 (https://phabricator.wikimedia.org/T260883) [16:41:13] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/623359 (https://phabricator.wikimedia.org/T260883) (owner: 10Filippo Giunchedi) [16:42:21] (03CR) 10MarcoAurelio: CommonSettings.php: limit new Echo's `push-subscription-manager` group to Meta-Wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (https://phabricator.wikimedia.org/T261625) (owner: 10MarcoAurelio) [16:42:41] (03PS8) 10Dzahn: prometheus: hiera() -> lookup(), add data type for prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/621759 [16:42:52] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10Papaul) [16:43:14] (03CR) 10Jdlrobson: [C: 03+1] 
"> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [16:43:26] (03CR) 10Elukey: "Approved from the Analytics side, waiting for the feedback from the SRE team for the inclusion in the ops group :)" [puppet] - 10https://gerrit.wikimedia.org/r/623357 (https://phabricator.wikimedia.org/T261626) (owner: 10Klausman) [16:43:58] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (10CDanis) FTR, in T261506 I added wikimedia.pl to our list of allowed domains. * They're an affiliate, listed on metawiki for some time, whi... [16:46:07] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.42 [software/spicerack] - 10https://gerrit.wikimedia.org/r/623422 [16:46:21] 10Operations, 10Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10Dzahn) Was this done on blog.wikimedia.org itself without needing a change by SRE? [16:47:29] (03CR) 10Ppchelko: [C: 03+1] "This looks so much better!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/623399 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [16:48:13] (03CR) 10Dzahn: [C: 03+2] prometheus: hiera() -> lookup(), add data type for prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/621759 (owner: 10Dzahn) [16:48:37] (03PS3) 10Hnowlan: api-gateway: expose port for admin interface [deployment-charts] - 10https://gerrit.wikimedia.org/r/623399 (https://phabricator.wikimedia.org/T254910) [16:48:58] (03CR) 10Urbanecm: [C: 04-1] "Majavah is right 😊" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (https://phabricator.wikimedia.org/T261625) (owner: 10MarcoAurelio) [16:49:02] PROBLEM - Long running screen/tmux on an-launcher1002 is CRITICAL: CRIT: Long running SCREEN process. (user: otto PID: 1271, 1732356s 1728000s). 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [16:49:11] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.42 [software/spicerack] - 10https://gerrit.wikimedia.org/r/623422 (owner: 10Volans) [16:50:17] (03CR) 10Majavah: CommonSettings.php: limit new Echo's `push-subscription-manager` group to Meta-Wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (https://phabricator.wikimedia.org/T261625) (owner: 10MarcoAurelio) [16:51:22] (03CR) 10Effie Mouzeli: [C: 03+2] admin: add cparle to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/623407 (https://phabricator.wikimedia.org/T260450) (owner: 10Effie Mouzeli) [16:51:50] (03CR) 10Hnowlan: [C: 03+2] api-gateway: expose port for admin interface [deployment-charts] - 10https://gerrit.wikimedia.org/r/623399 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [16:51:59] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.42 [software/spicerack] - 10https://gerrit.wikimedia.org/r/623422 (owner: 10Volans) [16:53:00] (03Merged) 10jenkins-bot: api-gateway: expose port for admin interface [deployment-charts] - 10https://gerrit.wikimedia.org/r/623399 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [16:53:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-privatedata-users for cparle - https://phabricator.wikimedia.org/T260450 (10jijiki) 05Open→03Resolved a:03jijiki @Cparle done :) [16:53:46] 10Operations, 10serviceops: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (10jijiki) 05Open→03Resolved We can close this for now [16:55:44] (03CR) 10Jason Linehan: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [16:57:07] 10Operations, 10Icinga, 10serviceops: incident 20170323-wikibase did not trigger Icinga paging - 
https://phabricator.wikimedia.org/T161528 (10Dzahn) 05Declined→03Open [16:58:29] 10Operations, 10Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10CKoerner_WMF) Yes, this was handled on the host side of things. Sorry for the noise. [16:59:54] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-privatedata-users for cparle - https://phabricator.wikimedia.org/T260450 (10Dzahn) @jijiki I think it's missing the "krb: present" in admin.yaml and other steps needed for T260450#6415718. [17:00:04] gehel and onimisionipe: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200831T1700). [17:00:52] greg-g: looks like my change to the deploy calendar isn't effective. Did I miss someting? Or is there a 1 week delay or similar? [17:01:08] cc thcipriani ^ [17:01:09] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-privatedata-users for cparle - https://phabricator.wikimedia.org/T260450 (10jijiki) yeah I missed the update from the other task [17:01:35] jouncebot: refresh [17:01:37] I refreshed my knowledge about deployments. [17:01:39] jouncebot: now [17:01:39] For the next 0 hour(s) and 28 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200831T1700) [17:01:43] 🤔 [17:01:50] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:11] gehel: I made the calendar a few days before I merged your change, next week it'll be up-to-date, sorry for confusion [17:02:23] thcipriani: thanks! 
[17:02:55] likewise for puppet request window [17:02:59] (03PS6) 10MarcoAurelio: CommonSettings.php: limit new Echo's `push-subscription-manager` group to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (https://phabricator.wikimedia.org/T261625) [17:04:52] (03CR) 10MarcoAurelio: "I'm not sure this is the right approach now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (https://phabricator.wikimedia.org/T261625) (owner: 10MarcoAurelio) [17:05:09] (03CR) 10MarcoAurelio: CommonSettings.php: limit new Echo's `push-subscription-manager` group to Meta-Wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (https://phabricator.wikimedia.org/T261625) (owner: 10MarcoAurelio) [17:07:23] (03PS1) 10Volans: Upstream release v0.0.42 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/623426 [17:09:28] 10Operations, 10Analytics, 10Research, 10WMF-Legal, 10User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Milimetric) We have to think more about how to accomplish this, taking into account all the security implications we've... 
[17:09:37] (03PS7) 10MarcoAurelio: CommonSettings.php: limit new Echo's `push-subscription-manager` group to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (https://phabricator.wikimedia.org/T261625) [17:14:14] 10Operations, 10ops-eqiad, 10netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (10RobH) [17:14:17] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10User-Kormat: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10RobH) [17:17:04] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.42 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/623426 (owner: 10Volans) [17:18:27] (03PS1) 10Ottomata: Add type annotation to profile::analytics::cluster::packages::common [puppet] - 10https://gerrit.wikimedia.org/r/623428 (https://phabricator.wikimedia.org/T252617) [17:20:52] (03PS1) 10Effie Mouzeli: admin: add krb:present to cparle [puppet] - 10https://gerrit.wikimedia.org/r/623429 (https://phabricator.wikimedia.org/T260450) [17:21:18] !log uploaded spicerack_0.0.42 to apt.wikimedia.org buster-wikimedia [17:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:27] ryankemper, gehel ^^^ [17:21:45] volans: thanks! [17:21:47] thanks! 
[17:21:57] (03CR) 10Effie Mouzeli: [C: 03+2] admin: add krb:present to cparle [puppet] - 10https://gerrit.wikimedia.org/r/623429 (https://phabricator.wikimedia.org/T260450) (owner: 10Effie Mouzeli) [17:25:29] (03PS2) 10Ottomata: Add type annotation to profile::analytics::cluster::packages::common [puppet] - 10https://gerrit.wikimedia.org/r/623428 (https://phabricator.wikimedia.org/T252617) [17:26:03] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WDQS Categories reload is failing on thankyouwiki - https://phabricator.wikimedia.org/T261097 (10CBogen) [17:33:51] (03PS2) 10CDanis: Add SRV record verification for Element Matrix Services [dns] - 10https://gerrit.wikimedia.org/r/623348 (https://phabricator.wikimedia.org/T261531) [17:34:34] (03CR) 10CDanis: [C: 03+2] Add SRV record verification for Element Matrix Services [dns] - 10https://gerrit.wikimedia.org/r/623348 (https://phabricator.wikimedia.org/T261531) (owner: 10CDanis) [17:42:21] (03CR) 10MarcoAurelio: [C: 03+1] Update help URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609748 (https://phabricator.wikimedia.org/T256623) (owner: 10Awight) [17:54:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/623359 (https://phabricator.wikimedia.org/T260883) (owner: 10Filippo Giunchedi) [17:57:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/621506 (https://phabricator.wikimedia.org/T260883) (owner: 10Filippo Giunchedi) [18:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200831T1800) [18:00:04] bearloga, Urbanecm, and hauskatze: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:26] * hauskatze raises hand [18:00:32] I can deploy today [18:01:07] bearloga: hello, are you around? :-) [18:01:19] Urbanecm: yep [18:01:40] cool! [18:01:58] (03PS2) 10Urbanecm: wgEventStreams: Stream for MEP-iOS pilot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623048 (https://phabricator.wikimedia.org/T260382) (owner: 10Bearloga) [18:02:01] (03CR) 10Urbanecm: [C: 03+2] wgEventStreams: Stream for MEP-iOS pilot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623048 (https://phabricator.wikimedia.org/T260382) (owner: 10Bearloga) [18:02:21] duplicated wikibugs? [18:02:49] (03Merged) 10jenkins-bot: wgEventStreams: Stream for MEP-iOS pilot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623048 (https://phabricator.wikimedia.org/T260382) (owner: 10Bearloga) [18:02:54] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Upgrade grafana to 6.4.4 - https://phabricator.wikimedia.org/T220838 (10MoritzMuehlenhoff) For posterity: This update also fixed CVE-2019-19499 (https://swarm.ptsecurity.com/grafana-6-4-3-arbitrary-file-read/), which was only di... [18:03:43] bearloga: I pulled your change onto mwdebug1002 - is it possible to test it there? [18:05:46] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (10bcampbell) Vendor says it looks to be all correct now. They shared this link: https://federationte... [18:06:15] Urbanecm: not really :\ it's an event stream config that's not registered with eventlogging and is for the wikipedia ios app to send events to Analytics Engineer's EventGate intake service [18:06:23] oh wait [18:06:39] does mwdebug1002 have a mediawiki api I can query? 
[18:06:58] bearloga: yes [18:07:02] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (10CDanis) 05Open→03Resolved a:03CDanis [18:07:15] bearloga: you can do something like `curl -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' 'https://en.wikipedia.org/w/api.php' [18:07:20] `curl -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' 'https://en.wikipedia.org/w/api.php` [18:08:59] `curl -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' 'https://en.wikipedia.org/w/api.php?action=streamconfigs&format=json&constraints=destination_event_service=eventgate-analytics-external'` cool! yep, I can see 'ios.edit_history_compare' in the response [18:09:17] Urbanecm: thank you for helping with that! [18:09:43] great, I'll sync that then :) [18:09:54] thank you very much!!! [18:10:11] happy to help! [18:11:00] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 846c5448f950b4d0d7eedce570e46d74ca62ca38: wgEventStreams: Stream for MEP-iOS pilot (T260382) (duration: 00m 55s) [18:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:08] T260382: Migrate MobileWikiAppiOSEditHistoryCompare schema to MEP - https://phabricator.wikimedia.org/T260382 [18:11:14] (03CR) 10Urbanecm: [C: 03+2] CommonSettings.php: limit new Echo's `push-subscription-manager` group to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (https://phabricator.wikimedia.org/T261625) (owner: 10MarcoAurelio) [18:11:19] bearloga: should be done :) [18:11:23] hauskatze: you're next! 
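The mwdebug check in the exchange above (the `X-Wikimedia-Debug` header plus the `action=streamconfigs` query) could be scripted roughly as follows. This is only a sketch: the function names `build_streamconfigs_request` and `response_mentions_stream` are made up for illustration, the request is built but never sent, and the substring check simply mirrors "I can see 'ios.edit_history_compare' in the response" without assuming anything about the JSON layout.

```python
from urllib.parse import urlencode
from urllib.request import Request

# URL taken from the curl command quoted in the log.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_streamconfigs_request(backend: str, event_service: str) -> Request:
    """Build (but do not send) a streamconfigs query routed to a debug backend.

    Hypothetical helper: the name and signature are illustrative only.
    """
    params = {
        "action": "streamconfigs",
        "format": "json",
        "constraints": f"destination_event_service={event_service}",
    }
    # urlencode percent-encodes the "=" inside the constraints value.
    url = f"{API_URL}?{urlencode(params)}"
    # The header value uses the same "backend=<host>" form as the curl call.
    return Request(url, headers={"X-Wikimedia-Debug": f"backend={backend}"})


def response_mentions_stream(body: str, stream_name: str) -> bool:
    """Return True if the stream name appears anywhere in the response text."""
    return stream_name in body


if __name__ == "__main__":
    req = build_streamconfigs_request(
        "mwdebug1002.eqiad.wmnet", "eventgate-analytics-external"
    )
    print(req.full_url)
    print(req.header_items())
```

Passing the resulting request to `urllib.request.urlopen` would perform the actual call; that step is left out here so the sketch runs without network access.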
[18:11:37] I have a bad feeling about it Urbanecm [18:11:46] thankfully the revert button exists [18:11:56] (03Merged) 10jenkins-bot: CommonSettings.php: limit new Echo's `push-subscription-manager` group to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623370 (https://phabricator.wikimedia.org/T261625) (owner: 10MarcoAurelio) [18:11:57] yup :) [18:12:11] hauskatze: pulled onto mwdebug1002 [18:12:15] * hauskatze sees smoke already coming from mwdebug [18:12:58] Urbanecm: ok, x-wikimedia-debug got uninstalled for some reason, re-downloading [18:13:00] one min [18:13:12] hauskatze: it works at 110 % for some reason, it removes it from everywhere [18:13:37] even meta? [18:13:56] unless I'm missing something [18:14:19] checking [18:14:44] (03Abandoned) 10Dzahn: prometheus: add more data types to all exporters [puppet] - 10https://gerrit.wikimedia.org/r/621770 (owner: 10Dzahn) [18:15:12] enwiki checked, removed there [18:15:14] moving to meta [18:15:55] Yeah, it's gone from meta as well [18:16:08] f#### [18:16:21] you might need a global $wgDBname [18:16:34] or move the if check outside the function [18:16:36] ah, ah... [18:16:51] is it something that can be fixed on the fly? [18:16:54] hauskatze: on it [18:17:31] * hauskatze registers mediawiki-hacks-are-bad.site :P [18:18:03] :D [18:18:24] hauskatze: try now?
[18:18:40] * hauskatze rechecks [18:18:47] (03PS3) 10Ppchelko: Install OAuthRateLimiter III: Install where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622898 (https://phabricator.wikimedia.org/T246271) [18:18:49] (03PS3) 10Ppchelko: Install OAuthRateLimiter extension IV: Enable on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622899 [18:18:55] see it a meta [18:19:00] going to check elsewhere [18:19:28] (03PS1) 10Urbanecm: Follow-up for a1b0d6e: Get $wgDBname in Echo's ext function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623435 (https://phabricator.wikimedia.org/T261625) [18:19:42] cannot see at enwiki [18:19:55] so, I'm going to merge & sync then :) [18:19:59] thanks hauskatze and Majavah [18:20:15] (03CR) 10Urbanecm: [C: 03+2] Follow-up for a1b0d6e: Get $wgDBname in Echo's ext function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623435 (https://phabricator.wikimedia.org/T261625) (owner: 10Urbanecm) [18:20:20] cannot see at frwiki either [18:20:24] lgtm now I think [18:20:56] (03Merged) 10jenkins-bot: Follow-up for a1b0d6e: Get $wgDBname in Echo's ext function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623435 (https://phabricator.wikimedia.org/T261625) (owner: 10Urbanecm) [18:20:57] of course I had to be wgDBname [18:21:03] thanks :) [18:23:25] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: a1b0d6e4e7da9bf45ae7381d2c1d9814e6b36498: b609cd53273e922cd8af5507660b9d10c6da09b3: CommonSettings.php: limit new Echos `push-subscription-manager` group to Meta-Wiki (T261625) (duration: 00m 54s) [18:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:35] T261625: Limit the new "push subscription managers" user group to metawiki - https://phabricator.wikimedia.org/T261625 [18:23:54] so, that's done hauskatze :) [18:24:03] (03PS3) 10Urbanecm: itwiki: Assign patrol right to autopatrolled instead of autoconfirmed [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/623191 (https://phabricator.wikimedia.org/T261587) [18:24:10] (03CR) 10Urbanecm: [C: 03+2] itwiki: Assign patrol right to autopatrolled instead of autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623191 (https://phabricator.wikimedia.org/T261587) (owner: 10Urbanecm) [18:24:15] thanks! [18:24:23] happy to help [18:25:04] (03Merged) 10jenkins-bot: itwiki: Assign patrol right to autopatrolled instead of autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623191 (https://phabricator.wikimedia.org/T261587) (owner: 10Urbanecm) [18:27:03] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: bb28e9da8057a4c92cd4d564ffd000f320338cda: itwiki: Assign patrol right to autopatrolled instead of autoconfirmed (T261587) (duration: 00m 53s) [18:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:12] T261587: Moving "patrol" right from autoconfirmed to autopatrolled users on it.wiki - https://phabricator.wikimedia.org/T261587 [18:29:35] (03PS2) 10Urbanecm: Add two domains to wgCopyUploadsDomains for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623418 (https://phabricator.wikimedia.org/T261562) [18:29:42] (03CR) 10Urbanecm: [C: 03+2] Add two domains to wgCopyUploadsDomains for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623418 (https://phabricator.wikimedia.org/T261562) (owner: 10Urbanecm) [18:30:30] (03Merged) 10jenkins-bot: Add two domains to wgCopyUploadsDomains for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623418 (https://phabricator.wikimedia.org/T261562) (owner: 10Urbanecm) [18:32:44] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 16197aabc88f098568a04984a20149de3b7fdeaf: Add two domains to wgCopyUploadsDomains for commonswiki (T261562; T261575) (duration: 00m 54s) [18:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:54] 
T261575: Add storage.idigbio.org to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T261575 [18:32:55] T261562: Add serv.biokic.asu.edu to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T261562 [18:38:17] !log Morning B&C done [18:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:04] 10Operations, 10Wikimedia-Mailing-lists: Disable google code in mailinglists - https://phabricator.wikimedia.org/T261084 (10jijiki) 05Open→03Resolved done [18:47:09] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production for klausman - https://phabricator.wikimedia.org/T261626 (10jijiki) a:03klausman [18:48:37] Urbanecm: I lost the opportunity to schedule https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/609748 as well [18:48:43] but probably needs a rebase [18:49:15] hauskatze: well, I can sync that too :) [18:49:33] (03PS2) 10Urbanecm: Update help URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609748 (https://phabricator.wikimedia.org/T256623) (owner: 10Awight) [18:50:13] hauskatze: do you know how to test that? [18:50:21] Nice. I'll send my fees to awight later then :P [18:50:35] let me see [18:51:06] perhaps by importing a file to commons using fileimporter [18:51:28] I don't have one at hand at this moment [18:51:47] me neither [18:52:38] we can then leave that one for another time [18:52:39] hauskatze: what about https://en.wikipedia.org/wiki/File:A_young_man_hiking.jpg?
[18:53:47] looks ok for commons but I gotta go now [18:54:01] I'll be back, I think, in an hour or so [18:54:20] okay, so let's leave it for later [19:05:57] (03PS1) 10Andrew Bogott: Ceph mon nodes: re-enable prometheus access [puppet] - 10https://gerrit.wikimedia.org/r/623440 (https://phabricator.wikimedia.org/T261684) [19:09:40] (03PS2) 10Andrew Bogott: Ceph mon nodes: re-enable prometheus access [puppet] - 10https://gerrit.wikimedia.org/r/623440 (https://phabricator.wikimedia.org/T261684) [19:11:37] (03CR) 10Andrew Bogott: [C: 03+2] Ceph mon nodes: re-enable prometheus access [puppet] - 10https://gerrit.wikimedia.org/r/623440 (https://phabricator.wikimedia.org/T261684) (owner: 10Andrew Bogott) [19:14:18] (03CR) 10Dzahn: [V: 03+1 C: 03+2] OTRS: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/623023 (owner: 10Dzahn) [19:17:52] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [19:17:53] @seen ckoerner [19:17:53] hauskatze: Last time I saw ckoerner they were leaving the channel #wmhack at 5/3/2017 5:45:57 PM (1216d1h31m56s ago) [19:18:00] hmm [19:18:05] 10Operations, 10Traffic, 10conftool, 10serviceops, 10Patch-For-Review: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (10Volans) After today's failure of the `check_ttl` step in the switchdc of the services, I had a ch... [19:19:59] (03CR) 10Dzahn: "noop on mendelevium" [puppet] - 10https://gerrit.wikimedia.org/r/623023 (owner: 10Dzahn) [19:24:55] (03CR) 10Dzahn: [C: 03+2] "noop in compiler in prod. cloud is already broken unrelated to this." 
[puppet] - 10https://gerrit.wikimedia.org/r/623076 (owner: 10Dzahn) [19:30:20] (03CR) 10Dzahn: "noop on deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/623076 (owner: 10Dzahn) [19:35:49] (03PS1) 10Dzahn: unbreak puppet on deployment_servers due to missing 'mcrouter_wancache::use_onhost_memcache' [puppet] - 10https://gerrit.wikimedia.org/r/623444 [19:36:12] (03CR) 10jerkins-bot: [V: 04-1] unbreak puppet on deployment_servers due to missing 'mcrouter_wancache::use_onhost_memcache' [puppet] - 10https://gerrit.wikimedia.org/r/623444 (owner: 10Dzahn) [19:48:07] (03PS2) 10Dzahn: unbreak deployment_servers due to missing mcrouter 'use_onhost_memcache' [puppet] - 10https://gerrit.wikimedia.org/r/623444 [19:56:44] (03PS1) 10Dzahn: devtools: unbreak deployment_server by mcrouter use_onhost_memcache: false [puppet] - 10https://gerrit.wikimedia.org/r/623446 [20:00:04] halfak and accraze: Time to snap out of that daydream and deploy Services – Graphoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200831T2000). [20:02:11] Do you know if Chris Koerner connects to IRC? 
[20:02:57] @seen ckoerner_wmf [20:02:57] hauskatze: Last time I saw CKoerner_WMF they were joining the channel, they are still in the channel #wikivoyage at 8/25/2020 11:01:15 AM (6d9h1m42s ago) [20:03:48] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [20:12:18] ^ looking at this [20:20:43] !log `sudo systemctl restart elasticsearch_6@production-search-psi-eqiad.service` on `elastic1052.eqiad.wmnet` [20:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:22] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [20:33:32] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (10JMinor) @CDanis Yes, I will make a subtask for tracking these and any other affiliated domains that need an exemption. We're also send... [20:34:27] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (10JMinor) To this specifically: > Someone recently mention to me "the only way to prevent WMF people from doing making stupid mistakes thes... [21:00:04] Reedy and sbassett: My dear minions, it's time we take the moon! Just kidding. 
Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200831T2100). [21:00:14] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [21:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:58] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Support maps serving for affiliate sites via an allow list. - https://phabricator.wikimedia.org/T261694 (10JMinor) [21:04:21] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10JMinor) [21:06:13] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:53] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [21:08:36] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [21:58:16] 10Operations, 10Wikimedia-Mailing-lists: Disable google code in mailinglists - https://phabricator.wikimedia.org/T261084 (10Urbanecm) Thanks! [22:04:17] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, 10Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (10DStrine) 05Open→03Declined [22:11:20] wikibugs_: restart [22:46:12] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, 10Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (10Urbanecm) Just out of curiosity, why was this declined @DStrine? 
[23:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Evening backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200831T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:20] I'll deploy my own change [23:02:58] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, 10Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (10DStrine) 05Declined→03Open [23:04:18] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, 10Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (10DStrine) This is done from my perspective. I'll open it if others are still using it. We... [23:09:35] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable (future) mw-reverted tag for all wikis except testwiki (T254074) (duration: 00m 57s) [23:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:43] T254074: Implement the reverted edit tag - https://phabricator.wikimedia.org/T254074 [23:27:33] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install memory upgrades in ores100[1-9] - https://phabricator.wikimedia.org/T259909 (10Jclark-ctr) [23:27:56] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install memory upgrades in ores100[1-9] - https://phabricator.wikimedia.org/T259909 (10Jclark-ctr) Received memory. 
placed in storage room [23:30:30] !log crusnov@deploy1001 Started deploy [netbox/deploy@2fc439e]: Test deploy of 2.8.9 to netbox-next [23:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:42] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Jclark-ctr) [23:30:52] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Jclark-ctr) received memory placed in storage room [23:31:27] !log crusnov@deploy1001 Finished deploy [netbox/deploy@2fc439e]: Test deploy of 2.8.9 to netbox-next (duration: 00m 57s) [23:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:31] !log crusnov@deploy1001 Started deploy [netbox/deploy@2fc439e]: Test deploy of 2.8.9 to netbox-next pt2 [23:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:36] !log crusnov@deploy1001 Finished deploy [netbox/deploy@2fc439e]: Test deploy of 2.8.9 to netbox-next pt2 (duration: 00m 05s) [23:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:34] !log crusnov@deploy1001 Started deploy [netbox/deploy@2fc439e]: Deploy of 2.8.9 to netbox1001 [23:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:33] !log crusnov@deploy1001 Finished deploy [netbox/deploy@2fc439e]: Deploy of 2.8.9 to netbox1001 (duration: 00m 58s) [23:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:42] !log crusnov@deploy1001 Started deploy [netbox/deploy@2fc439e]: Deploy of 2.8.9 to netbox2001 [23:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:55] !log crusnov@deploy1001 Finished deploy [netbox/deploy@2fc439e]: Deploy of 2.8.9 to netbox2001 (duration: 01m 12s) 
[23:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:10] !log crusnov@deploy1001 Started deploy [netbox/deploy@2fc439e]: Deploy of 2.8.9 (final) [23:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:27] !log crusnov@deploy1001 Finished deploy [netbox/deploy@2fc439e]: Deploy of 2.8.9 (final) (duration: 00m 17s) [23:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:26] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:40:52] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:42:18] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:42:44] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state