[00:22:11] (PS14) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[00:27:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:29:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:11:44] PROBLEM - MariaDB read only s1 on db1091 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[01:11:56] PROBLEM - MariaDB Replica IO: s1 #page on db1091 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:12:27] PROBLEM - mysqld processes #page on db1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:12:45] PROBLEM - MariaDB Replica SQL: s1 #page on db1091 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:13:04] PROBLEM - Check systemd state on db1091 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:13:05] PROBLEM - MariaDB disk space #page on db1091 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:13:36] 👋
[01:16:01] !log rzl@cumin1001 dbctl commit (dc=all): 'Depool db1091', diff saved to https://phabricator.wikimedia.org/P13124 and previous config saved to /var/cache/conftool/dbconfig/20201101-011600-rzl.json
[01:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:19] 👋 sorry I'm late
[01:16:24] thanks rzl
[01:16:52] haven't dug any further yet, just starting to look around
[01:17:15] previously on battlestar galactica, https://phabricator.wikimedia.org/T225060
[01:17:28] yeah I thought the machine name was familiar
[01:18:46] Is it the battery?
[01:20:52] the host is up, but I can't log in -- I get an input/output error on executing /usr/bin/zsh. specifying another command doesn't work either. I'm suspecting some filesystem badness?
[01:21:04] cdanis: yeah was just in the middle of pasting the same
[01:21:18] https://www.irccloud.com/pastebin/sCkE9o4X/
[01:21:50] its root is readonly, judging from its logs on centrallog1001
[01:22:44] it has a USB hub bouncing on and off over and over
[01:23:10] spooky
[01:23:29] I'm inclined to just write it up and leave it depooled until Monday
[01:23:34] yeah +1
[01:23:56] mind dumping what you have in a task?
[01:24:04] willdo
[01:24:21] PROBLEM - MariaDB Replica Lag: s1 #page on db1091 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:24:39] heh, correction, write it up, leave it depooled, and downtime it
[01:25:13] heh yes
[01:25:20] we ... should have smarter alerts
[01:25:30] Some Day™
[01:25:34] I'm downtiming for 40 hours, which takes us to midday Monday
[01:26:44] !log rzl@cumin1001 START - Cookbook sre.hosts.downtime
[01:26:45] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[01:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:50] specifically it takes us to 17:26 UTC Monday (cc cdanis to note in the task)
[01:27:55] also acked the victorops alerts so it stops calling me 🙃
[01:31:08] Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (CDanis) Resolved→Open db1091 had some hardware failure again about 01:11 UTC. Got a bunch of errors on sd 0:1:0:0 / sda, a bunch of SCSI commands failing with hostbyte=DID_NO_CONNECT...
[01:34:01] thanks cdanis
[01:34:57] thanks!
[01:34:59] the VO incident is acked but not resolved, does it re-fire after 24h in that state?
[01:35:04] oh no
[01:35:18] I probably shouldn't still be fuzzy on that but I am
[01:35:25] I think it does but I am not 100%
[01:36:05] I guess we should resolve it then?
[01:36:22] I will do that now and learn more about this on Monday
[01:36:37] heh or cdanis will beat me to it
[01:36:38] beat you ;)
[01:38:20] closing out for now, thanks all
[01:40:13] Yeah, it would fire again 24h later if ack'd. +1 for resolve.
[01:51:30] Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (colewhite) ` /system1/log1/record21 Targets Properties number=21 severity=Critical date=11/01/2020 time=01:07 description=Drive Array Controller Failure (Slot 1) Ver...
[02:01:36] PROBLEM - snapshot of s7 in codfw on alert1001 is CRITICAL: snapshot for s7 at codfw taken more than 3 days ago: Most recent backup 2020-10-29 01:45:27 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[02:27:24] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:32:28] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:56:16] (CR) jerkins-bot: [V: -1] peek: make git::clone ensure => latest [puppet] - https://gerrit.wikimedia.org/r/637790 (https://phabricator.wikimedia.org/T265912) (owner: Reedy)
[02:57:16] (PS3) Reedy: peek: make git::clone ensure => latest [puppet] - https://gerrit.wikimedia.org/r/637790 (https://phabricator.wikimedia.org/T265912)
[02:57:53] Operations, Traffic, Wikidata, wikiba.se website, HTTPS: Set HSTS on wikiba.se (force HTTPS) - https://phabricator.wikimedia.org/T232246 (Ladsgroup)
[03:34:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:01] (PS1) Ladsgroup: Rework DNS entries of wikis in wikimedia.org [dns] - https://gerrit.wikimedia.org/r/637849 (https://phabricator.wikimedia.org/T152882)
[03:41:42] Operations, DNS, Traffic, Mobile, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (Ladsgroup) a:Ladsgroup This should fix it ^
[04:05:05] (PS1) Ladsgroup: [WIP] varnish: Improve wording of the browser security error a bit [puppet] - https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656)
[04:13:30] Operations, Education-Program-Dashboard, Traffic, Programs-and-Events-Dashboard-Sprint 2: Cache education dashboard pages - https://phabricator.wikimedia.org/T120509 (Ladsgroup) Open→Declined Let's close it. Reopen if you disagree
[04:30:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:45] (PS1) Ladsgroup: varnish: Replace "Expires" in Set-Cookie with "Max-Age" [puppet] - https://gerrit.wikimedia.org/r/637851 (https://phabricator.wikimedia.org/T147967)
[04:47:12] Operations, Analytics, Traffic: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560 (Ladsgroup) Open→Declined Three years have passed from this incident and as result, there's no data left from that time to examin...
[05:15:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:17:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:18:00] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[05:21:18] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[05:56:56] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:58] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:12:00] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:56:08] (PS1) Ladsgroup: mailman: Set the charset utf-8 as charset of English [puppet] - https://gerrit.wikimedia.org/r/637852 (https://phabricator.wikimedia.org/T261031)
[06:58:26] Operations, Wikimedia-Mailing-lists, I18n, Patch-For-Review: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (Ladsgroup) >>! In T261031#6593671, @gerritbot wrote: > Change 637852 had a related pa...
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201101T0700)
[07:26:34] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster={misc,swift} file={intel_microcode.prom,smartmon.prom} instance={ms-be1057,relforge1004} job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[07:27:28] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:32:32] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:03:42] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:42] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:51:52] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:10:22] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:11:54] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:14:00] (PS1) ArielGlenn: Make sure there is always a directory in which to write table info per wiki [dumps] - https://gerrit.wikimedia.org/r/637855
[09:15:06] (CR) ArielGlenn: [C: +2] Make sure there is always a directory in which to write table info per wiki [dumps] - https://gerrit.wikimedia.org/r/637855 (owner: ArielGlenn)
[09:15:36] (Merged) jenkins-bot: Make sure there is always a directory in which to write table info per wiki [dumps] - https://gerrit.wikimedia.org/r/637855 (owner: ArielGlenn)
[09:16:34] !log ariel@deploy1001 Started deploy [dumps/dumps@6c7d811]: create empty dir for tableinfo if needed
[09:16:38] !log ariel@deploy1001 Finished deploy [dumps/dumps@6c7d811]: create empty dir for tableinfo if needed (duration: 00m 04s)
[09:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:13] that's a ubn deploy, sure enough my carefully tested fixes broke the beginning of the run but I think it should be fixed now
[09:23:23] Operations, Product-Infrastructure-Team-Backlog, Proton, Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (Bodhisattwa) @Hrishikes couldn't reproduce the problem and downloaded the normal pdf. I am sharing the file here as per his request...
[09:35:10] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:36:52] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:47:34] (PS1) ArielGlenn: enable general directory creation during createdirs job [dumps] - https://gerrit.wikimedia.org/r/637856
[09:50:04] (PS2) ArielGlenn: enable general directory creation during createdirs job [dumps] - https://gerrit.wikimedia.org/r/637856
[09:50:50] (CR) ArielGlenn: [C: +2] enable general directory creation during createdirs job [dumps] - https://gerrit.wikimedia.org/r/637856 (owner: ArielGlenn)
[09:51:15] (Merged) jenkins-bot: enable general directory creation during createdirs job [dumps] - https://gerrit.wikimedia.org/r/637856 (owner: ArielGlenn)
[09:52:40] !log ariel@deploy1001 Started deploy [dumps/dumps@de4c823]: actually allow per run dir to be made early in the run
[09:52:44] !log ariel@deploy1001 Finished deploy [dumps/dumps@de4c823]: actually allow per run dir to be made early in the run (duration: 00m 04s)
[09:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:32] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:58:56] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:59:02] ok well that actually fixed the problem. I can kick the crons on the other snapshots and they'll all start up now
[10:23:34] and they are all running. in the spirit of 'deploy on a sunday and then run away' I have to go do errands (but everything is now running properly). I will be checking back in on things later
[10:46:43] (PS1) Marostegui: db1091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/637857 (https://phabricator.wikimedia.org/T225060)
[10:47:24] (CR) Marostegui: [C: +2] db1091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/637857 (https://phabricator.wikimedia.org/T225060) (owner: Marostegui)
[10:49:52] Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui) Thanks everyone who responded to this incident. Looks like we'd need another disk for this host. @wiki_willy do we have some spares? This host is scheduled for replacement with th...
[11:27:28] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:32:28] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:12:00] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:13:08] everything still looks fine, so that's that
[13:09:54] Operations, Performance-Team, Traffic, serviceops, Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (Lam1982019)
[13:43:27] Operations, Product-Infrastructure-Team-Backlog, Proton, Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (Hrishikes) And here is the Czech Wiki file. {F32421293}
[13:46:10] Operations, Product-Infrastructure-Team-Backlog, Proton, Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (Hrishikes) I am not getting invalid pdfs. {F32421295}
[13:46:14] Operations, Performance-Team, Traffic, serviceops, Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (Aklapper)
[14:07:15] (PS1) Hamish: Add wgImportSources for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388)
[14:27:58] (PS1) Hamish: Add wgNamespaceAliases for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/637870 (https://phabricator.wikimedia.org/T266925)
[15:18:00] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[15:56:58] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:02] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:27:18] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:32:20] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:38:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:40:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:57:22] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:02:34] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:14] (CR) Zoranzoki21: [C: +1] "This should be good." [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[17:31:25] (CR) jerkins-bot: [V: -1] Add wgImportSources for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[17:32:12] (CR) Zoranzoki21: [C: -1] Add wgImportSources for zhwikinews (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[17:42:56] Operations, Commons, MediaWiki-File-management, StructuredDataOnCommons, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (Multichill) I just noticed this also breaks https://commons.wikimedia.org/wiki/Sp...
[18:57:22] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:01:04] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:24] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:27:05] (PS15) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[19:55:28] (Abandoned) DannyS712: Branch commit for wmf/1.36.0-wmf.15 [core] (wmf/1.36.0-wmf.15) - https://gerrit.wikimedia.org/r/636549 (https://phabricator.wikimedia.org/T263181) (owner: TrainBranchBot)
[19:57:20] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:02:06] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:57:58] Operations, Fundraising-Backlog, Traffic, fundraising-tech-ops, FR-Email: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (Aklapper) Stalled→Open Resetting task status per last comment.
[21:26:58] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:32:02] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:27:26] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:30:48] Operations, Commons, SRE-swift-storage: Recently more broken files (premature end of file at 5MB size) that were cross-wiki uploaded to Commons - https://phabricator.wikimedia.org/T266903 (Peachey88)
[22:32:28] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:41:01] !log mwscript extensions/OATHAuth/maintenance/disableOATHAuthForUser.php --wiki=metawiki Turkmen # T266976
[22:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:09] T266976: Please reset 2FA for my account. - https://phabricator.wikimedia.org/T266976
[22:49:59] Operations, DNS, Internet-Archive, Traffic, Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (Aklapper) Neither @brion nor IA folks have answered, and @80686 hasn't been active for two years here. Proposing...
[22:51:40] Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584 (Aklapper) Stalled→Open The previous comments don't explain who or what (task?) exactly this task is stalled on (["If a report is waiting for further input (e.g. from its reporter or a third party)...