[00:32:45] (CR) Dzahn: "I was about to say the name needs to be added to the cert, but then i saw we already have etherpad.discovery.wmnet in files/ssl even." [dns] - https://gerrit.wikimedia.org/r/631557 (owner: Dzahn)
[00:36:30] (CR) Dzahn: "~/puppet/files/ssl$ openssl x509 -in etherpad.discovery.wmnet.crt -noout -text | grep DNS" [dns] - https://gerrit.wikimedia.org/r/631557 (owner: Dzahn)
[00:37:17] (CR) Dzahn: "[etherpad1002:/etc/envoy/listeners.d] $ grep etherpad 00-tls_terminator_7443.yaml" [puppet] - https://gerrit.wikimedia.org/r/631555 (owner: Dzahn)
[00:39:59] (PS1) Dzahn: add etherpad-next.discovery, point to etherpad1003 [dns] - https://gerrit.wikimedia.org/r/631559
[00:41:25] (PS1) Dzahn: site: add etherpad role to etherpad1003 [puppet] - https://gerrit.wikimedia.org/r/631560
[00:45:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:14] Operations, DNS, Matrix, Traffic, Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (CDanis) Hey, sorry for the delay, we should be able to deploy this tomorrow.
[00:50:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:52] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:08] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: CRITICAL: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:27:55] (CR) Ammarpad: "The commit summary does not say much and maybe is not quite accurate. But I am assuming this only applies to links for CI test results jus" [puppet] - https://gerrit.wikimedia.org/r/631237 (owner: Paladox)
[04:34:56] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 59.37 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:36:34] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 72.44 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:56:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:57:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:08:26] Operations, ops-eqiad, netops, User-Kormat, User-jijiki: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (Marostegui)
[05:27:50] (PS1) Marostegui: dns: Remove es2011 dns entries [dns] - https://gerrit.wikimedia.org/r/631566 (https://phabricator.wikimedia.org/T264261)
[05:28:45] (PS1) Marostegui: instances.yaml: Remove es2011 from dbctl [puppet] - https://gerrit.wikimedia.org/r/631567 (https://phabricator.wikimedia.org/T264261)
[05:29:30] (CR) Marostegui: [C: +2] instances.yaml: Remove es2011 from dbctl [puppet] - https://gerrit.wikimedia.org/r/631567 (https://phabricator.wikimedia.org/T264261) (owner: Marostegui)
[05:30:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es2011 from dbctl T264261', diff saved to https://phabricator.wikimedia.org/P12893 and previous config saved to /var/cache/conftool/dbconfig/20201002-053020-marostegui.json
[05:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:27] T264261: decommission es2011.codfw.wmnet - https://phabricator.wikimedia.org/T264261
[05:33:55] (PS1) Marostegui: es2026: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/631568
[05:34:29] (CR) Marostegui: [C: +2] es2026: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/631568 (owner: Marostegui)
[05:41:08] (PS1) Marostegui: mariadb: Remove es2011 puppet references [puppet] - https://gerrit.wikimedia.org/r/631569 (https://phabricator.wikimedia.org/T264261)
[05:41:25] (PS2) Marostegui: mariadb: Remove es2011 puppet references [puppet] - https://gerrit.wikimedia.org/r/631569 (https://phabricator.wikimedia.org/T264261)
[05:43:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:57] (CR) Marostegui: [C: +2] mariadb: Remove es2011 puppet references [puppet] - https://gerrit.wikimedia.org/r/631569 (https://phabricator.wikimedia.org/T264261) (owner: Marostegui)
[05:48:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:00] (CR) Marostegui: [C: +2] dns: Remove es2011 dns entries [dns] - https://gerrit.wikimedia.org/r/631566 (https://phabricator.wikimedia.org/T264261) (owner: Marostegui)
[05:50:34] Operations, ops-codfw, DC-Ops, decommission-hardware: decommission es2011.codfw.wmnet - https://phabricator.wikimedia.org/T264261 (Marostegui) Ready for #dc-ops
[05:57:35] (CR) Elukey: [C: +2] Set debian buster for stat100[467] [puppet] - https://gerrit.wikimedia.org/r/631544 (https://phabricator.wikimedia.org/T255028) (owner: Elukey)
[05:57:41] (PS2) Elukey: Set debian buster for stat100[467] [puppet] - https://gerrit.wikimedia.org/r/631544 (https://phabricator.wikimedia.org/T255028)
[05:59:12] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (Marostegui) Great news, thank you @Cmjohnson! The racking plan looks good to me, we don't have much requirements other than trying to spr...
[06:20:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:46:12] (CR) Elukey: [C: +2] Add page_props and user_properties to analytics sqoop [puppet] - https://gerrit.wikimedia.org/r/629070 (https://phabricator.wikimedia.org/T258047) (owner: Joal)
[06:47:50] (PS1) Elukey: Set an-worker110[0-2] as Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/631683 (https://phabricator.wikimedia.org/T255138)
[06:51:15] (CR) Elukey: [C: +2] Set an-worker110[0-2] as Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/631683 (https://phabricator.wikimedia.org/T255138) (owner: Elukey)
[06:51:16] <_joe_> !log restarting php-fpm on all appservers in eqiad, in batches of 10%, for testing the procedure suggested at T264362
[06:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:22] T264362: Scap feature: restart php-fpm on deployment - https://phabricator.wikimedia.org/T264362
[06:52:40] PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100%
[06:52:59] this is downtime expired, it has bitten me last week --^
[06:53:01] <_joe_> uh that wasn't me :P
[06:53:06] <_joe_> hah ok
[06:53:12] :D
[06:53:19] <_joe_> why do we just downtime for a short time hosts that are completely down?
[06:53:40] I added a week and reported in the task, didn't know how much time it was needed, checking again
[06:53:58] https://phabricator.wikimedia.org/T261130
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201002T0700)
[07:03:42] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/631438 (https://phabricator.wikimedia.org/T226703) (owner: Filippo Giunchedi)
[07:07:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:07:23] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/631421 (https://phabricator.wikimedia.org/T264221) (owner: Arturo Borrero Gonzalez)
[07:08:43] (PS1) Giuseppe Lavagetto: Reduce reconnectTimeout for etcd to 0.1 seconds [debs/pybal] (1.15) - https://gerrit.wikimedia.org/r/631686 (https://phabricator.wikimedia.org/T264362)
[07:12:26] Operations, Editing-team, MassMessage, WMF-JobQueue, Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Joe) >>! In T93049#6501395, @Legoktm wrote: > While MassMessage is how users see the problem (e.g. no one...
[07:14:26] (CR) Hashar: "I had the same stance as Kosta: target=_blank is an antipattern since browsers do not offer a way to NOT open in a new page whereas one ca" [puppet] - https://gerrit.wikimedia.org/r/631237 (owner: Paladox)
[07:20:45] (Abandoned) Gehel: [wip] adding some type annotations [software/cumin] - https://gerrit.wikimedia.org/r/630202 (owner: Gehel)
[07:23:57] !log installing libx11 security updates on buster
[07:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:08] (PS1) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[07:29:43] !log prometheus codfw/k8s, add 50G to the LV
[07:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:45] (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:31:59] (PS2) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[07:34:41] (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:35:17] !log swift codfw-prod bump weight for ms-be2057 - T261633
[07:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:22] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633
[07:37:39] (PS3) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[07:40:17] (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:40:26] Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (MoritzMuehlenhoff)
[07:40:32] (CR) Filippo Giunchedi: [C: +2] profile: remove rollout flag for rsyslog queues [puppet] - https://gerrit.wikimedia.org/r/631438 (https://phabricator.wikimedia.org/T226703) (owner: Filippo Giunchedi)
[07:40:57] (PS4) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[07:42:05] !log installing libcommons-compress-java security updates
[07:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:39] (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:50:40] Operations: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 (MoritzMuehlenhoff)
[07:52:40] Operations, vm-requests: Site: 4 VM request for LDAP replicas - https://phabricator.wikimedia.org/T264390 (MoritzMuehlenhoff)
[07:52:49] (PS5) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[07:52:59] Operations, vm-requests: Site: 4 VM request for LDAP replicas - https://phabricator.wikimedia.org/T264390 (MoritzMuehlenhoff) p:Triage→Medium
[07:53:16] Operations: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 (MoritzMuehlenhoff) p:Triage→Medium
[07:54:56] (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:55:39] Operations, SRE-Access-Requests: Update public key for production shell - https://phabricator.wikimedia.org/T264392 (DED)
[07:56:06] (PS6) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[07:58:24] (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:59:01] Operations, SRE-Access-Requests: Update public key for production shell - https://phabricator.wikimedia.org/T264392 (ArielGlenn) Note to whoever handles this: after seeing accountcheck email, I let the user know via irc that there was an issue, and this task was filed as a result.
[08:01:51] Operations, SRE-Access-Requests: Update public key for production shell - https://phabricator.wikimedia.org/T264392 (MoritzMuehlenhoff) a:herron This got added in T263692, also assigning to Keith.
[08:07:08] Operations, Wikimedia-Mailing-lists: Wrong charset on mailman HTML (Japanese) - https://phabricator.wikimedia.org/T264384 (Aklapper)
[08:07:19] Operations, Wikimedia-Mailing-lists, I18n: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (Aklapper)
[08:10:49] (CR) Ema: [C: +1] Reduce reconnectTimeout for etcd to 0.1 seconds [debs/pybal] (1.15) - https://gerrit.wikimedia.org/r/631686 (https://phabricator.wikimedia.org/T264362) (owner: Giuseppe Lavagetto)
[08:16:17] !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@5713fb0]: Fix lexeme dumps expected date
[08:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:52] !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@5713fb0]: Fix lexeme dumps expected date (duration: 01m 35s)
[08:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:59] (PS7) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[08:18:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:19:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:20:15] (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[08:21:36] (PS8) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[08:24:23] !log installing nginx security updates on puppetdb*
[08:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:17] (CR) Alexandros Kosiaris: "Just out of curiosity, why not just try to import 5.2 instead? It's not that big a deal (couple of changes and cross fleet PCCs) and it wi" [puppet] - https://gerrit.wikimedia.org/r/631522 (owner: Dzahn)
[08:26:24] Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (MoritzMuehlenhoff)
[08:29:21] (CR) Alexandros Kosiaris: [C: +1] "Heh, thanks! I 've never had the time to do this more consistently, this is nice!" [puppet] - https://gerrit.wikimedia.org/r/631555 (owner: Dzahn)
[08:29:44] !log installing pyzmq bugfix update from buster point release
[08:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:56] (PS9) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[08:29:58] (CR) Alexandros Kosiaris: [C: +1] site: add etherpad role to etherpad1003 [puppet] - https://gerrit.wikimedia.org/r/631560 (owner: Dzahn)
[08:30:09] !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@5713fb0]: Fix lexeme dumps expected date
[08:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:42] !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@5713fb0]: Fix lexeme dumps expected date (duration: 00m 33s)
[08:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:31] Operations, vm-requests: Site: 4 VM request for LDAP replicas - https://phabricator.wikimedia.org/T264390 (akosiaris) /me rubberstamping. Thanks for this!
[08:36:02] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (JMeybohm)
[08:36:13] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 (JMeybohm)
[08:41:26] !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@5713fb0]: Test stat1007 deploy
[08:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:01] !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@5713fb0]: Test stat1007 deploy (duration: 00m 34s)
[08:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:06] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey) a:RobH→Cmjohnson @Cmjohnson an-worker1111 seems to be in the wrong rack: cloudsw1-c8-eqiad.mgmt.eqiad.wmnet https://librenms.wikimedia....
[08:43:09] Operations, Performance-Team, Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles)
[08:46:26] Operations, Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Migrate Gerrit to profile::java - https://phabricator.wikimedia.org/T264182 (hashar) A few notes: Gerrit runs with Java 8 for now. The reason, I believe, is that wh...
[08:47:49] RECOVERY - cassandra-b service on restbase1029 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:48:39] RECOVERY - cassandra-b SSL 10.64.16.181:7001 on restbase1029 is OK: SSL OK - Certificate restbase1029-b valid until 2022-09-29 10:16:48 +0000 (expires in 727 days) https://phabricator.wikimedia.org/T120662
[08:49:46] Operations, Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Migrate Gerrit to profile::java - https://phabricator.wikimedia.org/T264182 (MoritzMuehlenhoff) >>! In T264182#6511478, @hashar wrote: > We need the `dbg` package in...
[08:52:55] Operations, Performance-Team, Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) The most logical suspect is the rollout of Varnish 6 {T263557}
[08:53:19] (CR) Arturo Borrero Gonzalez: [C: +2] aptrepo: thirdparty/kubeadm-k8s-1-17: introduce helm3 package [puppet] - https://gerrit.wikimedia.org/r/631421 (https://phabricator.wikimedia.org/T264221) (owner: Arturo Borrero Gonzalez)
[08:54:04] !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@5713fb0]: Test stat1007 deploy
[08:54:07] !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@5713fb0]: Test stat1007 deploy (duration: 00m 03s)
[08:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:02] (CR) Alexandros Kosiaris: [C: +1] add etherpad.discovery.wmnet, point to etherpad1002 [dns] - https://gerrit.wikimedia.org/r/631557 (owner: Dzahn)
[08:58:15] Operations, Performance-Team, Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) [[ https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?orgId=1 | It seems to affect North America before Europe ]], and the timing lines up with the ro...
[08:58:28] Operations, Performance-Team, Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) p:Triage→High
[08:59:54] !log root@cumin1001 START - Cookbook sre.hosts.downtime
[08:59:55] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:59] !log root@cumin1001 START - Cookbook sre.hosts.downtime
[09:00:00] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:05] !log root@cumin1001 START - Cookbook sre.hosts.downtime
[09:00:05] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:20] (PS1) Arturo Borrero Gonzalez: aptrepo: add gpg key for baltocdn external repository [puppet] - https://gerrit.wikimedia.org/r/631711 (https://phabricator.wikimedia.org/T264221)
[09:01:40] (CR) Arturo Borrero Gonzalez: [C: +2] aptrepo: add gpg key for baltocdn external repository [puppet] - https://gerrit.wikimedia.org/r/631711 (https://phabricator.wikimedia.org/T264221) (owner: Arturo Borrero Gonzalez)
[09:01:56] (PS3) JMeybohm: lvs: Remove mathoid non-TLS endpoint 2/2 [puppet] - https://gerrit.wikimedia.org/r/629329 (https://phabricator.wikimedia.org/T255875)
[09:05:03] !log gerrit: running garbage collector
[09:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:04] !log bootstrapping restbase1029-b cassandra
[09:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:00] (CR) JMeybohm: [C: +2] lvs: Remove mathoid non-TLS endpoint 2/2 [puppet] - https://gerrit.wikimedia.org/r/629329 (https://phabricator.wikimedia.org/T255875) (owner: JMeybohm)
[09:08:21] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm
[09:08:21] !log jmm@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[09:08:21] (PS3) JMeybohm: lvs: Remove zotero non-TLS endpoint 2/2 [puppet] - https://gerrit.wikimedia.org/r/629339 (https://phabricator.wikimedia.org/T255869)
[09:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:02] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm
[09:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:42] (CR) JMeybohm: [C: +2] lvs: Remove zotero non-TLS endpoint 2/2 [puppet] - https://gerrit.wikimedia.org/r/629339 (https://phabricator.wikimedia.org/T255869) (owner: JMeybohm)
[09:10:44] Operations, Machine Learning Platform, ORES, Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Joe) Checkin in to report that calls from OKAPI have stopped tonight. Thanks @RBrounley_WMF (and the team)! So if we still see the starvatio...
[09:11:13] (PS1) Arturo Borrero Gonzalez: aptrepo: updates: fix baltocdn key id [puppet] - https://gerrit.wikimedia.org/r/631712 (https://phabricator.wikimedia.org/T264221)
[09:11:49] !log added helm3 package to buster-wikimedia/thirdparty/kubeadm-k8s-1-17 (T264221)
[09:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:54] T264221: Upgrade the nginx ingress controller in Toolforge (and likely PAWS) - https://phabricator.wikimedia.org/T264221
[09:12:03] (CR) Jbond: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/631522 (owner: Dzahn)
[09:12:23] (CR) Arturo Borrero Gonzalez: [C: +2] aptrepo: updates: fix baltocdn key id [puppet] - https://gerrit.wikimedia.org/r/631712 (https://phabricator.wikimedia.org/T264221) (owner: Arturo Borrero Gonzalez)
[09:12:25] !log running puppet on lvs servers - T255875 T255869
[09:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:31] T255875: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875
[09:12:32] T255869: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869
[09:14:41] !log restarting pybal on lvs1016.eqiad.wmnet,lvs2010.codfw.wmnet - T255875 T255869
[09:14:44] Operations, SRE-Access-Requests: Update public key for production shell for dedcode - https://phabricator.wikimedia.org/T264392 (Peachey88)
[09:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:52] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1117.eqiad.wmnet'] ` The l...
[09:17:23] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.16:1969, 10.2.1.20:10042]) https://wikitech.wikimedia.org/wiki/PyBal
[09:17:32] pybal is me
[09:17:49] !log restarting pybal on lvs1015.eqiad.wmnet,lvs2009.codfw.wmnet - T255875 T255869
[09:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:55] T255875: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875
[09:17:56] T255869: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869
[09:18:14] !log running ipvsadm -D -t 10.2.2.20:10042; ipvsadm -D -t 10.2.2.16:1969 on lvs1016.eqiad.wmnet,lvs1015.eqiad.wmnet - T255875 T255869
[09:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:01] !log running ipvsadm -D -t 10.2.1.20:10042; ipvsadm -D -t 10.2.1.16:1969 on lvs2010.codfw.wmnet,lvs2009.codfw.wmnet - T255875 T255869
[09:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:19] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:20:28] Operations, Wikimedia-Mailing-lists, I18n: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (whym) In the case of [Japanese](https://lists.wikimedia.org/mailman/listinfo/wikija-l), the encoding set by...
[09:21:19] (CR) Alexandros Kosiaris: [C: -2] "Actually, this won't work. etherpad can't be run multi instance currently. The reason for that is that the ueberdb, the component that abs" [puppet] - https://gerrit.wikimedia.org/r/631560 (owner: Dzahn)
[09:22:13] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869 (JMeybohm)
[09:22:18] Operations, Prod-Kubernetes, serviceops, Kubernetes: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 (JMeybohm)
[09:22:23] (CR) Alexandros Kosiaris: [C: -2] "Same reasoning as https://gerrit.wikimedia.org/r/c/operations/puppet/+/631560. We can't have 2 etherpad instances running in parallel, it " [dns] - https://gerrit.wikimedia.org/r/631559 (owner: Dzahn)
[09:22:38] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (JMeybohm)
[09:27:00] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[09:27:01] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:15] !log kormat@cumin1001 dbctl commit (dc=all): 'db2106 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12896 and previous config saved to /var/cache/conftool/dbconfig/20201002-092715-kormat.json
[09:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:21] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[09:28:14] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[09:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:35] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (JMeybohm)
[09:30:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:16] (PS1) Muehlenhoff: Add DNS entry ldap-replica1001.wikimedia.org based on what the makevm cookbook assigned [dns] - https://gerrit.wikimedia.org/r/631713 (https://phabricator.wikimedia.org/T264390)
[09:35:43] (CR) Muehlenhoff: [C: +2] Add DNS entry ldap-replica1001.wikimedia.org based on what the makevm cookbook assigned [dns] - https://gerrit.wikimedia.org/r/631713 (https://phabricator.wikimedia.org/T264390) (owner: Muehlenhoff)
[09:47:25] PROBLEM - SSH on ms-be2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:48:07] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[09:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:16] (CR) Filippo Giunchedi: [C: +2] am: ensure ends_at is sent with a timezone [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/631495 (owner: Filippo Giunchedi)
[09:48:18] (CR) Filippo Giunchedi: [V: +2 C: +2] am: ensure ends_at is sent with a timezone [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/631495 (owner: Filippo Giunchedi)
[09:48:26] (CR) Filippo Giunchedi: [V: +2 C: +2] am: catch and report ApiException [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/631499 (owner: Filippo Giunchedi)
[09:48:47] RECOVERY - SSH on ms-be2020 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:48:49] PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:51:29] (PS1) Kormat: dbutil: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/631717
[09:51:29] RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:52:23] (PS2) Kormat: dbutil: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/631717
[09:56:08] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[09:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:19] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey)
[09:58:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[09:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:23] Operations: makevm cookbook fails get_vm() call - https://phabricator.wikimedia.org/T264409 (MoritzMuehlenhoff)
[09:59:44] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey) an-worker1117 is fixed, it was preferring to PXE boot as opposed to boot from disk, so the loop was endless.
[10:06:47] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm [10:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:14] !log kormat@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 33%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12897 and previous config saved to /var/cache/conftool/dbconfig/20201002-101313-kormat.json [10:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:19] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [10:13:20] (03PS1) 10JMeybohm: deployment_server::helmfile: Allow default secrets per environment [puppet] - 10https://gerrit.wikimedia.org/r/631720 (https://phabricator.wikimedia.org/T260917) [10:14:22] (03CR) 10jerkins-bot: [V: 04-1] deployment_server::helmfile: Allow default secrets per environment [puppet] - 10https://gerrit.wikimedia.org/r/631720 (https://phabricator.wikimedia.org/T260917) (owner: 10JMeybohm) [10:14:37] (03CR) 10Kormat: [C: 03+2] dbutil: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631717 (owner: 10Kormat) [10:15:41] (03Merged) 10jenkins-bot: dbutil: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631717 (owner: 10Kormat) [10:16:22] (03PS2) 10JMeybohm: deployment_server::helmfile: Allow default secrets per environment [puppet] - 10https://gerrit.wikimedia.org/r/631720 (https://phabricator.wikimedia.org/T260917) [10:16:24] (03PS1) 10Muehlenhoff: Add DNS entry ldap-replica1001.wikimedia.org based on what the makevm cookbook assigned [dns] - 10https://gerrit.wikimedia.org/r/631721 (https://phabricator.wikimedia.org/T264390) [10:16:53] (03PS2) 10Muehlenhoff: Add DNS entry ldap-replica1002.wikimedia.org based on what the makevm cookbook assigned [dns] - 10https://gerrit.wikimedia.org/r/631721 (https://phabricator.wikimedia.org/T264390) [10:20:43] (03PS1) 
10Kormat: WMFReplication: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631722 [10:21:57] (03CR) 10Kormat: [C: 03+2] WMFReplication: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631722 (owner: 10Kormat) [10:22:46] (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entry ldap-replica1002.wikimedia.org based on what the makevm cookbook assigned [dns] - 10https://gerrit.wikimedia.org/r/631721 (https://phabricator.wikimedia.org/T264390) (owner: 10Muehlenhoff) [10:22:58] (03Merged) 10jenkins-bot: WMFReplication: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631722 (owner: 10Kormat) [10:23:52] (03PS1) 10JMeybohm: Add dummy default secrets [labs/private] - 10https://gerrit.wikimedia.org/r/631724 (https://phabricator.wikimedia.org/T260917) [10:23:57] PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:24:19] (03PS1) 10Klausman: aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408) [10:24:30] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add dummy default secrets [labs/private] - 10https://gerrit.wikimedia.org/r/631724 (https://phabricator.wikimedia.org/T260917) (owner: 10JMeybohm) [10:26:05] (03CR) 10Kosta Harlan: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox) [10:26:53] RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:28:17] !log kormat@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 67%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12898 and previous config saved to /var/cache/conftool/dbconfig/20201002-102817-kormat.json [10:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:28:23] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [10:36:06] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [10:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:47] (03PS3) 10JMeybohm: deployment_server::helmfile: Allow default secrets per environment [puppet] - 10https://gerrit.wikimedia.org/r/631720 (https://phabricator.wikimedia.org/T260917) [10:38:57] PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:40:25] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [10:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:59] (03CR) 10Filippo Giunchedi: [C: 03+1] mtail: upgrade mtail across the fleet to 3.0.0~rc35-3+wmf3 [puppet] - 10https://gerrit.wikimedia.org/r/631501 (https://phabricator.wikimedia.org/T263728) (owner: 10Cwhite) [10:43:21] !log kormat@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12899 and previous config saved to /var/cache/conftool/dbconfig/20201002-104320-kormat.json [10:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:27] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [10:44:43] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [10:44:44] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:54] !log kormat@cumin1001 dbctl commit (dc=all): 'db2110 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12900 and 
previous config saved to /var/cache/conftool/dbconfig/20201002-104453-kormat.json [10:44:57] Operations, Performance-Team, Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) I've improved the per-DC/host dashboard: https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host The change is clearly visible on Esams on 2020-09-29 and on Eqsin... [10:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:19] Operations, Performance-Team, Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) a:ema [10:46:16] Operations, Performance-Team, Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) @ema as discussed on IRC, it seems sensible to roll back the change on at least one host on Esams for a few days next week to verify that this is what's causing the issue. [10:46:25] RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:24] !log jmm@cumin2001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97) [10:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:33] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [10:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:15] !log jmm@cumin2001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97) [10:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:56] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [10:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:27] (PS1) Muehlenhoff: Add DNS entry ldap-replica2004.wikimedia.org based on what the makevm cookbook assigned [dns] - https://gerrit.wikimedia.org/r/631750
(https://phabricator.wikimedia.org/T264390) [11:10:01] (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entry ldap-replica2004.wikimedia.org based on what the makevm cookbook assigned [dns] - 10https://gerrit.wikimedia.org/r/631750 (https://phabricator.wikimedia.org/T264390) (owner: 10Muehlenhoff) [11:12:13] 10Operations, 10Performance-Team, 10Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6512038, @Gilles wrote: > @ema as discussed on IRC, it seems sensible to roll back the change on at least one host on Esams for a few days next week to verify t... [11:22:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [11:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:54] 10Operations: makevm cookbook fails get_vm() call - https://phabricator.wikimedia.org/T264409 (10MoritzMuehlenhoff) So, I've created three VMs and this happened in one out of three cases only. [11:33:33] (03PS1) 10Muehlenhoff: Add DHCP entries for ldap-replica100[12], ldap-replica2004 [puppet] - 10https://gerrit.wikimedia.org/r/631755 (https://phabricator.wikimedia.org/T264390) [11:35:07] (03PS2) 10Muehlenhoff: Add DHCP entries for ldap-replica100[12], ldap-replica2004 [puppet] - 10https://gerrit.wikimedia.org/r/631755 (https://phabricator.wikimedia.org/T264390) [11:42:13] (03CR) 10Muehlenhoff: [C: 03+2] Add DHCP entries for ldap-replica100[12], ldap-replica2004 [puppet] - 10https://gerrit.wikimedia.org/r/631755 (https://phabricator.wikimedia.org/T264390) (owner: 10Muehlenhoff) [11:53:22] !log kormat@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 75%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12901 and previous config saved to /var/cache/conftool/dbconfig/20201002-115322-kormat.json [11:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:45] T259831: Schema change to make change_tag.ct_rc_id unsigned - 
https://phabricator.wikimedia.org/T259831 [12:02:23] RECOVERY - cassandra-b CQL 10.64.16.181:9042 on restbase1029 is OK: TCP OK - 0.000 second response time on 10.64.16.181 port 9042 https://phabricator.wikimedia.org/T93886 [12:03:34] (PS1) Filippo Giunchedi: hieradata: move deployment-prep swift settings off Horizon [puppet] - https://gerrit.wikimedia.org/r/631758 [12:04:18] (CR) Filippo Giunchedi: "Once this is merged I'll remove the settings from Horizon hiera project/prefix/instance (!) puppet" [puppet] - https://gerrit.wikimedia.org/r/631758 (owner: Filippo Giunchedi) [12:05:19] !log bootstrapping restbase1029-c [12:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:51] RECOVERY - cassandra-c service on restbase1029 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:06:35] RECOVERY - cassandra-c SSL 10.64.16.182:7001 on restbase1029 is OK: SSL OK - Certificate restbase1029-c valid until 2022-09-29 10:16:51 +0000 (expires in 726 days) https://phabricator.wikimedia.org/T120662 [12:08:26] !log kormat@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12902 and previous config saved to /var/cache/conftool/dbconfig/20201002-120825-kormat.json [12:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:38] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [12:14:06] Operations, SRE-swift-storage, User-fgiunchedi: ms-be2017 slower than the rest of the cluster while rebalancing - https://phabricator.wikimedia.org/T264270 (fgiunchedi) Open→Resolved a:fgiunchedi Tentatively resolving since a reboot seem to indeed bring speed back up, will reopen if/when...
[12:14:08] 10Operations, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (10fgiunchedi) [12:18:19] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:18:19] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:31] !log kormat@cumin1001 dbctl commit (dc=all): 'db2140 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12903 and previous config saved to /var/cache/conftool/dbconfig/20201002-121830-kormat.json [12:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:35] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35487824 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:23:18] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [12:25:57] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 101856 and 80 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:26:25] !log disable puppet on mwdebug1001 [12:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:21] PROBLEM - Memcached on mwdebug1001 is CRITICAL: connect to address 10.64.32.123 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [12:35:45] RECOVERY - Memcached on mwdebug1001 is OK: TCP OK - 0.010 second response time on 10.64.32.123 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [12:42:35] PROBLEM - Memcached on mwdebug1001 is CRITICAL: connect to address 10.64.32.123 and port 11210: Connection refused 
https://wikitech.wikimedia.org/wiki/Memcached [12:44:50] that is me ^ [12:45:05] sorry [12:45:34] another alert is coming, sorry for the noise [12:50:09] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:24] 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Pchelolo) In this case the queue is executing the job multiple times because these are different jobs. Th... [12:52:49] RECOVERY - Memcached on mwdebug1001 is OK: TCP OK - 0.001 second response time on 10.64.32.123 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [12:54:38] 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Samwalton9) It seems like the Signpost article was a user error, per the discussion at https://en.wikiped... [12:57:28] 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Pchelolo) > The Books & Bytes example I mentioned further up, however, was definitely only sent once. Un... 
[12:58:19] (03PS1) 10Elukey: Set an-worker110[6-9] as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/631764 (https://phabricator.wikimedia.org/T255140) [12:58:30] (03CR) 10Muehlenhoff: "Something to fix before this gets enabled in general: This currently breaks the Icinga check for free disk space:" [puppet] - 10https://gerrit.wikimedia.org/r/630889 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [13:00:12] !log installing Linux 4.19.146 on Buster updates (from latest Buster point release, at this point only installing the updates, no reboots (yet)) [13:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:40] 10Operations, 10Performance-Team, 10Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (10BBlack) Just throwing in some random points/counterpoints to ponder: * It's possible it does take more than a day or three for the frontend caches to settle into an optimal patter... [13:04:45] PROBLEM - Memcached on mwdebug1001 is CRITICAL: connect to address 10.64.32.123 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:06:03] (03CR) 10Elukey: [C: 03+2] Set an-worker110[6-9] as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/631764 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey) [13:06:25] RECOVERY - Memcached on mwdebug1001 is OK: TCP OK - 0.002 second response time on 10.64.32.123 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [13:09:07] PROBLEM - etherpad_up reduced availability on alert1001 is CRITICAL: 0 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:09:35] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1002 is CRITICAL: connect to address 10.64.32.178 and port 9001: Connection refused https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [13:09:55] (03PS1) 
10Marostegui: labsdb: Change weights on labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/631768 [13:10:07] PROBLEM - etherpad_lite_process_running on etherpad1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [13:12:18] 10Operations, 10Performance-Team, 10Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (10BBlack) Eh maybe a few more to think about too: * The train this week caused a fair amount of churn with the rollout + rollback of 1.36.0-wmf.11. Is there any chance the train i... [13:13:07] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki) **What happens when onhost memcached in unavailable? ** https://phabr... [13:13:47] (03CR) 10Marostegui: [C: 03+2] labsdb: Change weights on labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/631768 (owner: 10Marostegui) [13:14:45] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10jijiki) **What happens when onhost memcached in unavailable? ** https://phabricator.wikimedia.org/T244340#6211682 @elukey @aaron With the con... 
[13:15:35] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:42] !log kormat@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 75%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12904 and previous config saved to /var/cache/conftool/dbconfig/20201002-132042-kormat.json [13:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:49] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [13:25:54] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:27:38] PROBLEM - DPKG on etherpad1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:32:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:34:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:35:46] !log kormat@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12905 and previous config saved to /var/cache/conftool/dbconfig/20201002-133545-kormat.json [13:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:52] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [13:36:17] (03PS1) 10Muehlenhoff: Add Cumin aliases for Thanos [puppet] - 10https://gerrit.wikimedia.org/r/631773 [13:37:53] !log Create bot_passwords table at fishbowl wikis (T258356) [13:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:59] T258356: Allow users at all private/fishbowl wikis to use botpasswords - https://phabricator.wikimedia.org/T258356 [13:38:44] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 23579 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:39:44] (03PS1) 10Ppchelko: Force local short descriptions for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631775 (https://phabricator.wikimedia.org/T263493) [13:46:55] (03CR) 10Elukey: "pcc looks good https://puppet-compiler.wmflabs.org/compiler1003/25623/" [puppet] - 10https://gerrit.wikimedia.org/r/629409 (https://phabricator.wikimedia.org/T251777) (owner: 10Fdans) [13:51:36] PROBLEM - SSH on ms-be2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:52:30] RECOVERY - etherpad_lite_process_running on etherpad1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js 
https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [13:52:40] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1002 is OK: HTTP OK: HTTP/1.1 200 OK - 8996 bytes in 0.696 second response time https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org [13:55:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:56:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:56:24] PROBLEM - Disk space on ping3001 is CRITICAL: DISK CRITICAL - free space: / 72 MB (2% inode=69%): /tmp 72 MB (2% inode=69%): /var/tmp 72 MB (2% inode=69%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping3001&var-datasource=esams+prometheus/ops [13:56:35] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki) >>! In T244340#6197415, @Krinkle wrote: > If the local-memcached's bl... 
[13:57:44] RECOVERY - DPKG on etherpad1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:57:50] RECOVERY - SSH on ms-be2020 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:00:43] 10Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (10MoritzMuehlenhoff) [14:03:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:04:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:04:42] PROBLEM - Disk space on ping2001 is CRITICAL: DISK CRITICAL - free space: / 64 MB (2% inode=68%): /tmp 64 MB (2% inode=68%): /var/tmp 64 MB (2% inode=68%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping2001&var-datasource=codfw+prometheus/ops [14:07:45] (03CR) 10Filippo Giunchedi: [C: 03+1] Add Cumin aliases for Thanos [puppet] - 10https://gerrit.wikimedia.org/r/631773 (owner: 10Muehlenhoff) [14:08:13] !log purging some unused kernels on ping* (these only have 3GB "disks") [14:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:25] !log enable puppet on mwdebug1001 [14:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:22] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/631782 [14:15:57] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/631783 [14:16:56] RECOVERY - Disk space on ping3001 is OK: 
DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping3001&var-datasource=esams+prometheus/ops [14:17:14] (03PS2) 10JMeybohm: envoy: New upstream version 1.15.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/631383 (https://phabricator.wikimedia.org/T264157) [14:18:52] RECOVERY - Disk space on ping2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping2001&var-datasource=codfw+prometheus/ops [14:19:48] !log installing LLVM 7 bugfix updates from Buster point release [14:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:06] (03PS1) 10CDanis: /home/cdanis dotfiles updates [puppet] - 10https://gerrit.wikimedia.org/r/631785 [14:21:06] RECOVERY - etherpad_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:22:01] (03CR) 10CDanis: [C: 03+2] /home/cdanis dotfiles updates [puppet] - 10https://gerrit.wikimedia.org/r/631785 (owner: 10CDanis) [14:23:56] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:24:26] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:58] (03PS2) 10Klausman: aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408) [14:36:38] (03CR) 10Elukey: [C: 03+1] aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 
(https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [14:37:53] (03CR) 10CDanis: [C: 03+2] foundation.wikimedia.org: Add .well-known/matrix/server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631530 (https://phabricator.wikimedia.org/T261531) (owner: 10Krinkle) [14:37:59] (03CR) 10CDanis: [C: 03+2] docroot: expand foundation.wikimedia.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631529 (https://phabricator.wikimedia.org/T261531) (owner: 10Krinkle) [14:38:46] (03Merged) 10jenkins-bot: docroot: expand foundation.wikimedia.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631529 (https://phabricator.wikimedia.org/T261531) (owner: 10Krinkle) [14:38:50] (03Merged) 10jenkins-bot: foundation.wikimedia.org: Add .well-known/matrix/server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631530 (https://phabricator.wikimedia.org/T261531) (owner: 10Krinkle) [14:43:12] ^ changes look good on mwdebug2002, didn't break foundation.wm.o, and the .well-known file is served [14:45:18] !log cdanis@deploy1001 Synchronized docroot/wikimediafoundation.org: Separate foundation.wikimedia.org docroot & add .well-known/matrix/server T261531 4573776bd 2fb4c20ae (duration: 01m 01s) [14:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:23] T261531: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 [14:46:11] (03PS1) 10Herron: admin: update ssh key for user dedcode [puppet] - 10https://gerrit.wikimedia.org/r/631788 (https://phabricator.wikimedia.org/T264392) [14:46:34] (03PS1) 10Hnowlan: WIP: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) [14:47:25] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - 
https://phabricator.wikimedia.org/T261531 (CDanis) This is live now @bcampbell -- have Element give it a shot and let us know? [14:47:40] (CR) jerkins-bot: [V: -1] WIP: cassandra: use profile::java [puppet] - https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: Hnowlan) [14:49:15] nice! --^ [14:51:15] Operations, serviceops, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (CDanis) [14:51:30] Operations, serviceops, Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (CDanis) [14:53:43] Operations, CommRel-Specialists-Support (Oct-Dec-2020), User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (Trizek-WMF) p:Triage→High [14:54:03] Operations, CommRel-Specialists-Support (Oct-Dec-2020), User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (Trizek-WMF) >>! In T264364#6510505, @Johan wrote: > Whoever ends up handling this on the community side:: I'll add this to next week's Tec...
[14:55:24] RECOVERY - cassandra-c CQL 10.64.16.182:9042 on restbase1029 is OK: TCP OK - 0.000 second response time on 10.64.16.182 port 9042 https://phabricator.wikimedia.org/T93886 [14:57:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_restbase_esams,swagger_check_wikifeeds_eqiad} site={eqiad,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:58:35] (03PS1) 10Elukey: Import the action module from spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) [15:00:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:01:13] (03CR) 10jerkins-bot: [V: 04-1] Import the action module from spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [15:04:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:05:15] (03PS1) 10Clarakosi: changeprop: Add x-request-id header to jobqueue requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/631794 [15:05:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:06:01] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF) [15:08:04] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF) [15:12:12] (03CR) 10Ppchelko: [C: 04-1] "You need to bump the chart version in Chart.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/631794 (owner: 10Clarakosi) [15:13:02] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) I just had to do another service restart. [15:14:07] (03PS2) 10Clarakosi: changeprop: Add x-request-id header to jobqueue requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/631794 [15:14:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:14:35] (03CR) 10Ppchelko: [C: 03+1] "Let's deploy on Mon" [deployment-charts] - 10https://gerrit.wikimedia.org/r/631794 (owner: 10Clarakosi) [15:16:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:19:39] (03PS2) 10Hnowlan: WIP: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) [15:20:47] (03CR) 10jerkins-bot: [V: 04-1] WIP: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [15:21:07] (03CR) 10Dzahn: [C: 03+2] bastionhost::pop: remove tftp from bastions [puppet] - 10https://gerrit.wikimedia.org/r/629496 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [15:23:39] (03CR) 10Dzahn: "This removed the ferm firewall holes for TFTP (but service was already stopped anyways)" [puppet] - 10https://gerrit.wikimedia.org/r/629496 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [15:25:38] !log cdanis@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=ores,name=eqiad [15:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:40] (03CR) 10Elukey: "@Volans: I am a bit confused about why prospector fails here and not for spicerack, so let me know if you have any ideas before I get too " [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [15:28:08] RECOVERY - cassandra-a SSL 10.64.48.234:7001 on restbase1030 is OK: SSL OK - Certificate restbase1030-a valid until 2022-09-29 10:16:53 +0000 (expires in 726 days) https://phabricator.wikimedia.org/T120662 [15:28:11] !log bootstrapping restbase1030-a [15:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:46] RECOVERY - cassandra-a service on restbase1030 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:32:22] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Patch-For-Review: Configure subdomain 
foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (10bcampbell) Thanks all. It's working. https://federationtester.matrix.org/#foundation.wikimedia.org [15:33:13] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (10CDanis) 05Open→03Resolved 🎉 [15:39:17] <_joe_> !log restarting redis on rdb2003, instance 6380 [15:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:25] (03CR) 10Dzahn: "> Patch Set 1: Code-Review-2" [puppet] - 10https://gerrit.wikimedia.org/r/631560 (owner: 10Dzahn) [15:42:50] (03Abandoned) 10Dzahn: add etherpad-next.discovery, point to etherpad1003 [dns] - 10https://gerrit.wikimedia.org/r/631559 (owner: 10Dzahn) [15:45:47] (03CR) 10Dzahn: [C: 03+1] "I'm fine with either and don't have an opinion on this. Let me know once there is consensus one way or the other for merging." [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox) [15:51:13] (03PS1) 10JMeybohm: profile::envoy::builder: fix repos in chroot [puppet] - 10https://gerrit.wikimedia.org/r/631802 [15:54:25] (03PS2) 10Elukey: Import the action module from spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) [15:54:36] (03CR) 10Cwhite: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/629653 (https://phabricator.wikimedia.org/T263727) (owner: 10Giuseppe Lavagetto) [15:55:17] (03CR) 10Elukey: "Interesting, I swapped "setup" with "setup_method" and it worked." [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [15:57:33] 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 3 others: ORES redis: max number of clients reached... 
- https://phabricator.wikimedia.org/T263910 (10Joe) Small status update: in order to grant everyone a quieter weekend (hopefully!), we've repooled eqiad and raised manually the max client... [16:05:59] (03CR) 10Hnowlan: [C: 03+1] profile::envoy::builder: fix repos in chroot [puppet] - 10https://gerrit.wikimedia.org/r/631802 (owner: 10JMeybohm) [16:10:34] 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre) Thanks everyone for your help here. [16:11:27] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Elitre) Thanks for the shoutout in All Staff BTW! [16:11:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:13:10] (03PS2) 10JMeybohm: profile::envoy::builder: fix repos in chroot [puppet] - 10https://gerrit.wikimedia.org/r/631802 [16:13:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:15:13] (03CR) 10Dzahn: "very good. imho nothing should be in Horizon and everything in the repo. for this reason." [puppet] - 10https://gerrit.wikimedia.org/r/631758 (owner: 10Filippo Giunchedi) [16:23:32] (03CR) 10Dzahn: "I am seeing more things like "swift::params::account_keys" in project puppet but they are not in this change yet?" [puppet] - 10https://gerrit.wikimedia.org/r/631758 (owner: 10Filippo Giunchedi) [16:25:30] (03CR) 10Dzahn: "But I also don't see a prefix puppet for swift hosts.. 
were these on individual instances?" [puppet] - 10https://gerrit.wikimedia.org/r/631758 (owner: 10Filippo Giunchedi) [16:27:59] (03CR) 10Dzahn: [C: 03+2] add etherpad.discovery.wmnet, point to etherpad1002 [dns] - 10https://gerrit.wikimedia.org/r/631557 (owner: 10Dzahn) [16:28:02] (03PS2) 10Dzahn: add etherpad.discovery.wmnet, point to etherpad1002 [dns] - 10https://gerrit.wikimedia.org/r/631557 [16:32:11] (03PS2) 10Dzahn: ATS/Etherpad: replace backend host name with discovery record [puppet] - 10https://gerrit.wikimedia.org/r/631555 [16:39:33] 10Operations, 10SRE-tools: makevm cookbook fails get_vm() call - https://phabricator.wikimedia.org/T264409 (10herron) p:05Triage→03Medium [16:42:30] 10Operations, 10SRE-tools: makevm cookbook fails get_vm() call - https://phabricator.wikimedia.org/T264409 (10Volans) 05Open→03Invalid Nothing to do here, that is not a failure, the cookbook is just polling to get the created VM. The `[1/20, retrying in 3.00s]` is the first call that failed, the second one...
[16:43:57] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Change urbanecm's SSH production key - https://phabricator.wikimedia.org/T264345 (10herron) p:05Triage→03Medium [16:46:31] (03CR) 10Hnowlan: [C: 03+1] profile::envoy::builder: fix repos in chroot [puppet] - 10https://gerrit.wikimedia.org/r/631802 (owner: 10JMeybohm) [16:49:03] !log disable puppet on mw2271 and briefly depool it [16:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:51] (03PS1) 10Ssingh: dnsdist: temporarily disable validate_cmd for dnsdist.conf [puppet] - 10https://gerrit.wikimedia.org/r/631827 (https://phabricator.wikimedia.org/T263789) [16:56:14] PROBLEM - Memcached on mw2271 is CRITICAL: connect to address 10.192.48.93 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [16:57:14] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/25628/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631827 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh) [17:03:24] PROBLEM - Check systemd state on mw2271 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:14] RECOVERY - Check systemd state on mw2271 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:12] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:55] (03CR) 10Dzahn: [C: 03+2] ATS/Etherpad: replace backend host name with discovery record [puppet] - 10https://gerrit.wikimedia.org/r/631555 (owner: 10Dzahn) [17:15:13] 10Operations, 10Maps: Migrate maps to Buster - https://phabricator.wikimedia.org/T264292 (10herron) p:05Triage→03Medium [17:15:45] 10Operations, 10Data-Persistence-Backup, 10Goal: Track all directly-owned SRE datasets into the new inventory system - https://phabricator.wikimedia.org/T264275 (10herron) p:05Triage→03Medium [17:15:55] !log submitted puppet refactoring change on maps servers [17:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:02] 10Operations, 10Data-Persistence-Backup, 10Epic, 10Goal: Plan WMF infrastructure for 100% coverage of data recovery - https://phabricator.wikimedia.org/T264272 (10herron) p:05Triage→03Medium [17:16:19] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (10herron) p:05Triage→03Medium [17:16:47] 10Operations, 10Puppet: Switch puppetdb to profile::java - https://phabricator.wikimedia.org/T264178 (10herron) p:05Triage→03Medium [17:17:18] 10Operations, 10Patch-For-Review: Switch cergen to profile::java - https://phabricator.wikimedia.org/T264177 (10herron) p:05Triage→03Medium [17:17:36] (03CR) 10Dzahn: "confirmed NOOP on maps2004" [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn) [17:17:56] 10Operations, 10Analytics: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (10herron) p:05Triage→03Medium [17:18:21] 10Operations: Migrate remaining services using Java to profile::java - https://phabricator.wikimedia.org/T264174 (10herron) p:05Triage→03Medium [17:18:36] 10Operations, 10Data-Persistence-Backup, 10Goal: Define a methodology to track WMF services backup requirements - https://phabricator.wikimedia.org/T264274 (10herron) 
p:05Triage→03Medium [17:18:59] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Update public key for production shell for dedcode - https://phabricator.wikimedia.org/T264392 (10herron) p:05Triage→03High [17:19:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:20:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:21:31] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10Sbailey) Ok, there must be some other way to verify security. My previous SSH key is gone, I need a new one installed so I can log in to scandium somehow. [17:22:43] (03CR) 10CDanis: [C: 03+1] admin: update ssh key for user dedcode [puppet] - 10https://gerrit.wikimedia.org/r/631788 (https://phabricator.wikimedia.org/T264392) (owner: 10Herron) [17:32:13] (03CR) 10Herron: [C: 03+2] admin: update ssh key for user dedcode [puppet] - 10https://gerrit.wikimedia.org/r/631788 (https://phabricator.wikimedia.org/T264392) (owner: 10Herron) [17:40:58] 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10wiki_willy) After escalating to technical account rep, replacement CPUs are being shipped by Dell, and can wait to be replaced when Papaul is back from vacation. 
[17:43:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:46:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:48:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10herron) Hi @Sbailey I've reached out to you via google chat and by email to verify. Thanks! [17:53:35] (03CR) 10Dzahn: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/631522 (owner: 10Dzahn) [18:04:02] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:51] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Update public key for production shell for dedcode - https://phabricator.wikimedia.org/T264392 (10herron) 05Open→03Resolved This has been done, I'll transition to resolved now [18:11:03] (03Abandoned) 10Dzahn: site: add etherpad role to etherpad1003 [puppet] - 10https://gerrit.wikimedia.org/r/631560 (owner: 10Dzahn) [18:14:54] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@da6a098]: oozie: query_clicks_hourly needs to wait on codfw events [18:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:55] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@da6a098]: oozie: query_clicks_hourly needs to wait on codfw events (duration: 02m 01s) [18:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:28] PROBLEM - Prometheus jobs reduced availability on 
alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:22:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:27:31] !log enable puppet on mw2271 [18:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:53] (03PS1) 10Dzahn: remove etherpad1003 from site and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/631842 [18:35:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [18:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:38] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:34] RECOVERY - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is OK: TCP OK - 0.000 second response time on 10.64.48.234 port 9042 https://phabricator.wikimedia.org/T93886 [18:59:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_wikifeeds_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:00:26] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/631848 [19:01:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:04:52] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/631848 [19:05:35] (03PS1) 10Razzi: oozie: use admin groups to determine admin access [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) [19:08:36] (03CR) 10Razzi: "Catalog compiler: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/25634/console" [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [19:09:02] (03CR) 10Dzahn: "@Razzi Hi, are you ok with this? would be nice if we can maybe merge this before the oozie changes you are working on, will probably requi" [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn) [19:10:18] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10Krinkle) Short summary of IRC convo: Per [doc](https://docs.google.com/docume... [19:12:02] (03CR) 10Dzahn: "this shows how it's no difference on an oozie::server https://puppet-compiler.wmflabs.org/compiler1002/25636/an-coord1001.eqiad.wmnet/inde" [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn) [19:14:33] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [19:14:38] 10Operations, 10vm-requests: EQIAD: 1 VM request for etherpad - https://phabricator.wikimedia.org/T101492 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `etherpad1003.eqiad.wmnet` - etherpad1003.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found Ganeti... 
[19:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:42] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "and here is the same for some oozie::clients https://puppet-compiler.wmflabs.org/compiler1003/25637/" [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn) [19:17:20] (03CR) 10Dzahn: [C: 03+2] remove etherpad1003 from site and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/631842 (owner: 10Dzahn) [19:17:57] (03CR) 10Razzi: [C: 03+2] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn) [19:18:44] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:51] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "thank you:) submitting" [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn) [19:20:17] (03PS3) 10CDanis: VCL: A heavy hammer for dire circumstances. [puppet] - 10https://gerrit.wikimedia.org/r/631848 [19:22:05] (03CR) 10Dzahn: "I ran puppet on an-coord1001 an-airflow1001 and there was no change." [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn) [19:22:53] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:23:45] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:24:34] (03CR) 10Dzahn: "great idea to avoid the additional list of admins in yaml !:) Thanks for letting me merge the hiera->lookup change before this.
This will " [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [19:27:34] (03PS2) 10Dzahn: oozie: use admin groups to determine admin access [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [19:28:35] (03CR) 10Dzahn: "PS2: manual rebase on top of I4d869ebfdcb9e1de (there, fixed it so you don't have to because of my change)" [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [19:30:29] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/25638/an-coord1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [19:31:05] (03PS1) 10Revi: GrowthExperiments: Change Help Page URL for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631854 (https://phabricator.wikimedia.org/T254364) [19:33:07] (03PS1) 10Dzahn: remove etherpad1003 [dns] - 10https://gerrit.wikimedia.org/r/631855 [19:33:52] (03CR) 10Dzahn: [C: 03+2] remove etherpad1003 [dns] - 10https://gerrit.wikimedia.org/r/631855 (owner: 10Dzahn) [19:35:39] (03CR) 10Dzahn: [C: 03+1] dnsdist: temporarily disable validate_cmd for dnsdist.conf [puppet] - 10https://gerrit.wikimedia.org/r/631827 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh) [19:37:37] (03CR) 10Dzahn: [C: 03+1] dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh) [19:38:12] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "per https://phabricator.wikimedia.org/T264345#6510040" [puppet] - 10https://gerrit.wikimedia.org/r/631515 (https://phabricator.wikimedia.org/T264345) (owner: 10Urbanecm) [19:40:07] chaomodus: could you have a look at the uncommitted dns check please? 
[19:42:41] (03PS1) 10Bstorm: Fixing typo in docker registry class [puppet] - 10https://gerrit.wikimedia.org/r/631856 [19:43:26] (03CR) 10Bstorm: [C: 03+2] Fixing typo in docker registry class [puppet] - 10https://gerrit.wikimedia.org/r/631856 (owner: 10Bstorm) [19:43:50] (03CR) 10Dzahn: [C: 03+1] "While I can't really confirm the firewall part (I get connected to something from bast3004 with ipmitool command provided), I am still all" [puppet] - 10https://gerrit.wikimedia.org/r/631430 (owner: 10Muehlenhoff) [19:48:59] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:30] (03CR) 10Ssingh: [C: 03+2] dnsdist: temporarily disable validate_cmd for dnsdist.conf [puppet] - 10https://gerrit.wikimedia.org/r/631827 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh) [19:55:09] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:57:15] (03PS2) 10Dzahn: Remove profile::ipmi::mgmt from role::bastionhost::pop [puppet] - 10https://gerrit.wikimedia.org/r/631430 (owner: 10Muehlenhoff) [19:57:17] (03PS1) 10Dzahn: consolidate bastionhost roles, remove module [puppet] - 10https://gerrit.wikimedia.org/r/631858 [19:57:57] (03CR) 10Dzahn: [C: 03+1] "after this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/631858" [puppet] - 10https://gerrit.wikimedia.org/r/631430 (owner: 10Muehlenhoff) [20:03:28] (03PS18) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) [20:03:33] (03PS2) 10Dzahn: consolidate bastionhost roles, remove module [puppet] - 10https://gerrit.wikimedia.org/r/631858 [20:04:40] (03CR) 10Ssingh: "Rebased on top of production; no other changes: 
https://puppet-compiler.wmflabs.org/compiler1003/25641/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh) [20:04:51] (03CR) 10Ssingh: [C: 03+2] dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh) [20:06:37] 10Operations, 10Traffic, 10Patch-For-Review: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 (10ssingh) [20:06:41] (03PS3) 10Dzahn: consolidate bastionhost roles, remove module [puppet] - 10https://gerrit.wikimedia.org/r/631858 [20:11:02] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/25642/" [puppet] - 10https://gerrit.wikimedia.org/r/631858 (owner: 10Dzahn) [20:12:49] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:52] (03CR) 10Herron: [C: 03+2] admin: Change urbanecm's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/631515 (https://phabricator.wikimedia.org/T264345) (owner: 10Urbanecm) [20:16:59] (03PS3) 10Herron: admin: Change urbanecm's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/631515 (https://phabricator.wikimedia.org/T264345) (owner: 10Urbanecm) [20:21:03] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:28:10] herron: thanks, the new key seems to work [20:30:33] (03PS2) 10Dzahn: trafficserver: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/631291 [20:30:54] (03CR) 10Dzahn: trafficserver: replace hiera() with lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631291 (owner: 10Dzahn) [20:31:51] Urbanecm: np! 
[20:34:23] (03PS3) 10Dzahn: cassandra: add data types, remove validation code [puppet] - 10https://gerrit.wikimedia.org/r/630312 [20:34:43] (03CR) 10Dzahn: cassandra: add data types, remove validation code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn) [20:34:45] (03CR) 10jerkins-bot: [V: 04-1] cassandra: add data types, remove validation code [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn) [20:52:59] 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10leila) @gsingers Please review and approve if you are fine with it. [20:56:41] 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10gsingers) Approved. [21:07:55] mutante: the 'Uncommitted DNS changes in Netbox' icinga alert is because of changes related to etherpad1003 that were not committed apparently [21:08:10] (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/compiler1003/25643/deploy1001.eqiad.wmnet/change.deploy1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/631306 (owner: 10Dzahn) [21:09:27] volans|off: oh? but there is nothing pending commit [21:10:00] I used decom cookbook and it told me to manually remove from DNS and then i did that? 
[21:10:59] see the alert above (21:55 UTC and the related wikitech page) [21:11:29] Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2) [21:11:31] here is the change where i removed it https://gerrit.wikimedia.org/r/c/operations/dns/+/631855 [21:11:39] it's the automated one [21:11:41] not the manual one [21:11:46] see https://phabricator.wikimedia.org/T101492#6513678 [21:11:55] it told me explicitly to manually remove it [21:12:10] yes but the automated one failed too [21:12:21] we need both automated and manual until we fully migrate all DCs [21:12:26] 3 out of 5 are done [21:12:31] (the pops) [21:12:53] sorry, i don't know what you mean by "not committed" then [21:13:53] repeat running the decom cookbook after making the manual change? [21:14:24] Fri 21:55:09 icinga-wm| PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:15:21] it means that changes in netbox are not committed to the automated dns repository [21:15:28] i can help from here if need be [21:15:42] I'll dig more next week on the transient failure [21:22:55] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [21:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:31] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:15] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [21:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:37] I ran sre.dns.netbox and it showed the diff.. committed.. rescheduled icinga check.
did not fix it [21:36:53] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:36:59] ran cookbook a second time. no more diff [21:37:02] there it is, ok [21:43:44] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10decommission-hardware: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10Dzahn) The hosts here are showing up in a weird state. When running the DNS cookbook you get warnings that these hosts exist but are not "in devices... [21:55:33] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,name=mw2271.codfw.wmnet [22:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:19] !log depooling mw2271 because Icinga alerts about memcached and SAL shows there were ongoing tests of some kind on it [22:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:09] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [22:03:23] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [mostread] 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:49] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/re [22:04:17] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:04:17] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:04:17] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image dat [22:04:17] 016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a respo 
[22:04:17] https://wikitech.wikimedia.org/wiki/Wikifeeds [22:04:39] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_4101: Servers kubernetes2010.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:04:49] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:04:49] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [22:04:51] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:04:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_restbase_cluster_eqiad,swagger_check_wikifeeds_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:05:01] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[22:05:59] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:05:59] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:05:59] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:05:59] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:06:09] ugh.. 
short spike in response time on appservers but already over [22:06:31] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:06:37] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:06:43] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_4101: Servers kubernetes2016.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:06:45] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:06:48] PROBLEM - LVS wikifeeds codfw port 4101/tcp - A node webservice supporting featured wiki content feeds. 
termbox.svc.eqiad.wmnet IPv4 #page on wikifeeds.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:06:53] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:07:23] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [22:07:23] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:07:23] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:07:23] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:07:27] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [22:07:29] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value [22:07:29] g keys: [mostread] 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:07:40] good evening [22:07:59] there was a short spike in response time on appservers but it is already back to normal [22:08:05] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:08:06] then rescheduled some of the restbase checks [22:08:10] and seeing recoveries [22:08:11] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:11] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:11] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:11] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:11] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:11] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:11] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:13] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:13] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:23] mutante: ah cool [22:08:24] RECOVERY - LVS wikifeeds codfw port 4101/tcp - A node webservice supporting 
featured wiki content feeds. termbox.svc.eqiad.wmnet IPv4 #page on wikifeeds.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1003 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:08:26] no idea where it came from? [22:08:27] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:08:28] looks like i can just ACK it in VO [22:08:29] no [22:09:05] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [22:09:05] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:09:07] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:09:07] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:09:19] was about to send the RESOLVED code in VO but it is already done too [22:09:48] rescheduling the rest of them [22:10:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:10:16] whatever it was, it correlates with a jump in s8 scrape time [22:10:45] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:59] hi [22:11:24] cdanis: hi, latency spike recovered on its own, just curiosity at this point [22:11:29] dbtree does not show lag on s8 [22:11:49] not replication lag, mutante, but the time that it took for the prometheus exporter to run its load-checking queries [22:12:04] which tracks reasonably well with either query queue depth or cpu saturation [22:12:05] cdanis: this spike https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-datasource=codfw%20prometheus%2Fops&var-method=GET&viewPanel=9 [22:12:14] which, on s8, because it is hit in some way by ~every wikipedia [22:12:20] often causes appserver latency spikes [22:12:47] we had several incidents of this in eqiad due to cpu starvation on s8 until DBA added an extra host [22:14:28] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2080&var-port=9104 [22:14:57] there was a pretty big read load spike on all s8 replicas [22:15:55] db2080 (part of s8) seems to be pretty busy and just started having traffic this morning [22:16:38] on some of the hosts there was a smaller load spike at 21:00 (the top of the previous hour) -- and the latest one correlates with the top of this hour [22:16:40] 👀 [22:16:51] https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-server=db2091&var-port=9104 [22:17:18] we didn't add any new crons or anything did we [22:17:36] 2098 and 2099 also show CPU jumps from ~0 to ~30% starting at 21:30 and 22:04 respectively [22:17:37] good point it started at the top of the hour. 
same is true for response time on appservers [22:17:42] 30 seconds after the hour [22:18:05] those are s2 and s4 though, might be unrelated [22:18:32] just stands out because the rest of them are flat, except for the CPU bump on the s8 hosts [22:18:41] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=codfw&var-cluster=mysql&var-instance=All&var-datasource=thanos&from=now-1h&to=now is where I'm looking [22:21:45] 17899 Oct 2 22:00:04 mwmaint1002 systemd[1]: Started MediaWiki periodic job mediawiki_tor_exit_node. [22:21:47] I'm inclined to leave it and go back to our evenings, I don't think there's anything much to do here [22:21:48] 17900 Oct 2 22:00:04 mwmaint1002 systemd[1]: Started MediaWiki periodic job db_lag_stats_reporter. [22:21:51] 17901 Oct 2 22:00:04 mwmaint1002 systemd[1]: Started MediaWiki periodic job wikibase_repo_prune2. [22:21:54] 17902 Oct 2 22:00:04 mwmaint1002 systemd[1]: Started MediaWiki periodic job wikibase_repo_prune_test. [22:21:57] 17903 Oct 2 22:00:04 mwmaint1002 systemd[1]: Started MediaWiki periodic job wikidata-updateQueryServiceLag. [22:22:00] 17904 Oct 2 22:00:04 mwmaint1002 systemd[1]: Started MediaWiki periodic job update_flaggedrev_stats. 
[22:22:03] all those at that time , heh [22:22:18] well, our evenings in ET that is, excuse me mutante [22:22:35] I don't think I'd seen "mediawiki_tor_exit_node" before [22:22:48] I assume that flags them as open proxies [22:22:56] for blocking editing from them [22:23:02] ah sure yeah [22:23:25] yes, that's it, and you did not see it because we used that as the first cron to convert to a timer [22:23:28] and then you did the rest [22:23:33] or so [22:23:37] ha, got it [22:24:26] anyway it doesn't look like any of those are new [22:24:42] wait that's mwmaint1002 [22:24:47] mwmaint2001 is the right host [22:24:53] the only part i am wondering is "weren't these more spaced out in time" [22:25:00] ah, of course [22:25:26] (the timers will run in eqiad, they'll just check etcd, see they're in the wrong place, and not do anything) [22:25:37] ok, you answered the next question [22:33:41] yeah I got nothing 🤷 if it comes back we can take another look [22:34:29] yea, i checked mwmaint2001 logs around that time but the crons are not new and not matching the pattern to start at exactly the hour and stop 3 min later [22:34:36] agreed rzl [22:37:52] btw, replica lag on db2099 (s4) which you mentioned earlier has disabled notifications in Icinga. 
so that could mean it's known unless they were forgotten [22:42:45] (CR) Dzahn: [V: -1] "Duh, I was wondering why this still says "require_encrypted_keys' expects a Boolean value, got String" but "yes" is not really Boolean :)" [puppet] - https://gerrit.wikimedia.org/r/631306 (owner: Dzahn) [22:44:49] (PS3) Dzahn: keyholder: hiera->lookup, data types [puppet] - https://gerrit.wikimedia.org/r/631306 [22:58:23] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/25644/deploy1001.eqiad.wmnet/index.html https://puppet-compiler.wmflabs.org/compiler1002/" [puppet] - https://gerrit.wikimedia.org/r/631306 (owner: Dzahn) [23:00:36] (CR) Dzahn: "and here is the reason I added the Optionals before:" [puppet] - https://gerrit.wikimedia.org/r/630312 (owner: Dzahn) [23:04:29] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/25648/netflow3001.esams.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/631303 (owner: Dzahn) [23:13:20] Operations, Analytics, Research-Backlog, WMF-Legal, User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (leila) [23:16:31] PROBLEM - SSH on ms-be2056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:21:21] RECOVERY - SSH on ms-be2056 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:21:43] (CR) Dzahn: cache::ssl::unified: hiera->lookup, add data types (6 comments) [puppet] - https://gerrit.wikimedia.org/r/628460 (owner: Dzahn) [23:21:54] (PS3) Dzahn: cache::ssl::unified: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/628460 [23:22:34] (CR) Dzahn: "yea, these should not even be inside the module but refactoring base is for another day" [puppet] - https://gerrit.wikimedia.org/r/631307 (owner: 
Dzahn) [23:22:56] (CR) jerkins-bot: [V: -1] cache::ssl::unified: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/628460 (owner: Dzahn) [23:27:45] Operations, MediaWiki-Documentation, User-Dereckson, patch-welcome: Repair "svn.wikimedia.org/doc/" redirect for doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (Dereckson) **Current status** svn.wikimedia.org/ redirects to https://phabricator.wikimedia.org/diffusion/ svn.wikimedia... [23:33:25] Operations, Analytics, Research-Backlog, WMF-Legal, User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (Nuria) My opinion on this request is that having non throughly supervised contributors accessing data introduces... [23:51:28] (PS2) Dzahn: thumbor: role->profile, hiera->lookup, data types [puppet] - https://gerrit.wikimedia.org/r/630694 [23:53:15] (PS4) Dzahn: cache::ssl::unified: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/628460 [23:54:17] (CR) jerkins-bot: [V: -1] cache::ssl::unified: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/628460 (owner: Dzahn) [23:56:00] (CR) Dzahn: wmcs::postgres: hiera->lookup and add data types (1 comment) [puppet] - https://gerrit.wikimedia.org/r/628459 (owner: Dzahn) [23:56:05] (PS2) Dzahn: wmcs::postgres: hiera->lookup and add data types [puppet] - https://gerrit.wikimedia.org/r/628459 [23:57:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:58:51] (PS5) Dzahn: cache::ssl::unified: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/628460 [23:59:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All 
metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets