[00:22:11] <wikibugs>	 (PS14) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[00:27:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:29:12] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:11:44] <icinga-wm>	 PROBLEM - MariaDB read only s1 on db1091 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[01:11:56] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s1 #page on db1091 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:12:27] <icinga-wm>	 PROBLEM - mysqld processes #page on db1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:12:45] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s1 #page on db1091 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:13:04] <icinga-wm>	 PROBLEM - Check systemd state on db1091 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:13:05] <icinga-wm>	 PROBLEM - MariaDB disk space #page on db1091 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:13:36] <rzl>	 👋
[01:16:01] <logmsgbot>	 !log rzl@cumin1001 dbctl commit (dc=all): 'Depool db1091', diff saved to https://phabricator.wikimedia.org/P13124 and previous config saved to /var/cache/conftool/dbconfig/20201101-011600-rzl.json
[01:16:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:19] <cdanis>	 👋 sorry I'm late
[01:16:24] <cdanis>	 thanks rzl
[01:16:52] <rzl>	 haven't dug any further yet, just starting to look around
[01:17:15] <rzl>	 previously on battlestar galactica, https://phabricator.wikimedia.org/T225060
[01:17:28] <cdanis>	 yeah I thought the machine name was familiar
[01:18:46] <wkandek>	 Is it the battery?
[01:20:52] <cdanis>	 the host is up, but I can't log in -- I get an input/output error on executing /usr/bin/zsh.  specifying another command doesn't work either.  I'm suspecting some filesystem badness?
[01:21:04] <rzl>	 cdanis: yeah was just in the middle of pasting the same
[01:21:18] <rzl>	 https://www.irccloud.com/pastebin/sCkE9o4X/
[01:21:50] <cdanis>	 its root is readonly, judging from its logs on centrallog1001
[01:22:44] <cdanis>	 it has a USB hub bouncing on and off over and over
[01:23:10] <rzl>	 spooky
[01:23:29] <rzl>	 I'm inclined to just write it up and leave it depooled until Monday
[01:23:34] <cdanis>	 yeah +1
[01:23:56] <rzl>	 mind dumping what you have in a task?
[01:24:04] <cdanis>	 willdo
[01:24:21] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 #page on db1091 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:24:39] <rzl>	 heh, correction, write it up, leave it depooled, and downtime it
[01:25:13] <cdanis>	 heh yes
[01:25:20] <cdanis>	 we ... should have smarter alerts
[01:25:30] <cdanis>	 Some Day™
[01:25:34] <rzl>	 I'm downtiming for 40 hours, which takes us to midday Monday
[01:26:44] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.hosts.downtime
[01:26:45] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[01:26:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:50] <rzl>	 specifically it takes us to 17:26 UTC Monday (cc cdanis to note in the task)
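The downtime arithmetic above can be checked with a short sketch. It assumes the downtime began at 01:26 UTC on 2020-11-01 (the date comes from the dbctl config filename earlier in the log, the time from the cookbook's log line); the helper name `downtime_end` is hypothetical and is not part of the sre.hosts.downtime cookbook.

```python
from datetime import datetime, timedelta, timezone


def downtime_end(start: datetime, hours: int) -> datetime:
    """Return the moment a downtime of `hours` hours expires."""
    return start + timedelta(hours=hours)


# Assumed start: when the downtime cookbook was logged (01:26 UTC, Sunday 2020-11-01).
start = datetime(2020, 11, 1, 1, 26, tzinfo=timezone.utc)
end = downtime_end(start, 40)

# 24 hours reaches 01:26 Monday; the remaining 16 hours land at 17:26 Monday.
print(end.isoformat())
```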
[01:27:55] <cdanis>	 also acked the victorops alerts so it stops calling me 🙃
[01:31:08] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (CDanis) Resolved→Open db1091 had some hardware failure again about 01:11 UTC.  Got a bunch of errors on sd 0:1:0:0 / sda, a bunch of SCSI commands failing with hostbyte=DID_NO_CONNECT...
[01:34:01] <rzl>	 thanks cdanis 
[01:34:57] <cdanis>	 thanks!
[01:34:59] <rzl>	 the VO incident is acked but not resolved, does it re-fire after 24h in that state?
[01:35:04] <cdanis>	 oh no
[01:35:18] <rzl>	 I probably shouldn't still be fuzzy on that but I am
[01:35:25] <cdanis>	 I think it does but I am not 100%
[01:36:05] <rzl>	 I guess we should resolve it then?
[01:36:22] <rzl>	 I will do that now and learn more about this on Monday
[01:36:37] <rzl>	 heh or cdanis will beat me to it
[01:36:38] <cdanis>	 beat you ;)
[01:38:20] <rzl>	 closing out for now, thanks all
[01:40:13] <shdubsh>	 Yeah, it would fire again 24h later if ack'd.  +1 for resolve. 
[01:51:30] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (colewhite) ` /system1/log1/record21   Targets   Properties     number=21     severity=Critical     date=11/01/2020     time=01:07     description=Drive Array Controller Failure (Slot 1)   Ver...
[02:01:36] <icinga-wm>	 PROBLEM - snapshot of s7 in codfw on alert1001 is CRITICAL: snapshot for s7 at codfw taken more than 3 days ago: Most recent backup 2020-10-29 01:45:27 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[02:27:24] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:32:28] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:56:16] <wikibugs>	 (CR) jenkins-bot: [V: -1] peek: make git::clone ensure => latest [puppet] - https://gerrit.wikimedia.org/r/637790 (https://phabricator.wikimedia.org/T265912) (owner: Reedy)
[02:57:16] <wikibugs>	 (PS3) Reedy: peek: make git::clone ensure => latest [puppet] - https://gerrit.wikimedia.org/r/637790 (https://phabricator.wikimedia.org/T265912)
[02:57:53] <wikibugs>	 Operations, Traffic, Wikidata, wikiba.se website, HTTPS: Set HSTS on wikiba.se (force HTTPS) - https://phabricator.wikimedia.org/T232246 (Ladsgroup)
[03:34:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:01] <wikibugs>	 (PS1) Ladsgroup: Rework DNS entries of wikis in wikimedia.org [dns] - https://gerrit.wikimedia.org/r/637849 (https://phabricator.wikimedia.org/T152882)
[03:41:42] <wikibugs>	 Operations, DNS, Traffic, Mobile, and 2 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (Ladsgroup) a:Ladsgroup This should fix it ^
[04:05:05] <wikibugs>	 (PS1) Ladsgroup: [WIP] varnish: Improve wording of the browser security error a bit [puppet] - https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656)
[04:13:30] <wikibugs>	 Operations, Education-Program-Dashboard, Traffic, Programs-and-Events-Dashboard-Sprint 2: Cache education dashboard pages - https://phabricator.wikimedia.org/T120509 (Ladsgroup) Open→Declined Let's close it. Reopen if you disagree
[04:30:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:45] <wikibugs>	 (PS1) Ladsgroup: varnish: Replace "Expires" in Set-Cookie with "Max-Age" [puppet] - https://gerrit.wikimedia.org/r/637851 (https://phabricator.wikimedia.org/T147967)
[04:47:12] <wikibugs>	 Operations, Analytics, Traffic: Artificial spike in offset of unique devices  from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560 (Ladsgroup) Open→Declined Three years have passed from this incident and as result, there's no data left from that time to examin...
[05:15:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:17:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:18:00] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[05:21:18] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[05:56:56] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:58] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:12:00] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:56:08] <wikibugs>	 (PS1) Ladsgroup: mailman: Set the charset utf-8 as charset of English [puppet] - https://gerrit.wikimedia.org/r/637852 (https://phabricator.wikimedia.org/T261031)
[06:58:26] <wikibugs>	 Operations, Wikimedia-Mailing-lists, I18n, Patch-For-Review: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (Ladsgroup) >>! In T261031#6593671, @gerritbot wrote: > Change 637852 had a related pa...
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201101T0700)
[07:26:34] <icinga-wm>	 PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster={misc,swift} file={intel_microcode.prom,smartmon.prom} instance={ms-be1057,relforge1004} job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[07:27:28] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:32:32] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:03:42] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:42] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:51:52] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:10:22] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:11:54] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:14:00] <wikibugs>	 (PS1) ArielGlenn: Make sure there is always a directory in which to write table info per wiki [dumps] - https://gerrit.wikimedia.org/r/637855
[09:15:06] <wikibugs>	 (CR) ArielGlenn: [C: +2] Make sure there is always a directory in which to write table info per wiki [dumps] - https://gerrit.wikimedia.org/r/637855 (owner: ArielGlenn)
[09:15:36] <wikibugs>	 (Merged) jenkins-bot: Make sure there is always a directory in which to write table info per wiki [dumps] - https://gerrit.wikimedia.org/r/637855 (owner: ArielGlenn)
[09:16:34] <logmsgbot>	 !log ariel@deploy1001 Started deploy [dumps/dumps@6c7d811]: create empty dir for tableinfo if needed
[09:16:38] <logmsgbot>	 !log ariel@deploy1001 Finished deploy [dumps/dumps@6c7d811]: create empty dir for tableinfo if needed (duration: 00m 04s)
[09:16:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:13] <apergos>	 that's a ubn deploy, sure enough my carefully tested fixes broke the beginning of the run but I think it should be fixed now
[09:23:23] <wikibugs>	 Operations, Product-Infrastructure-Team-Backlog, Proton, Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (Bodhisattwa) @Hrishikes couldn't reproduce the problem and downloaded the normal pdf. I am sharing the file here as per his request...
[09:35:10] <icinga-wm>	 PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:36:52] <icinga-wm>	 RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[09:47:34] <wikibugs>	 (PS1) ArielGlenn: enable general directory creation during createdirs job [dumps] - https://gerrit.wikimedia.org/r/637856
[09:50:04] <wikibugs>	 (PS2) ArielGlenn: enable general directory creation during createdirs job [dumps] - https://gerrit.wikimedia.org/r/637856
[09:50:50] <wikibugs>	 (CR) ArielGlenn: [C: +2] enable general directory creation during createdirs job [dumps] - https://gerrit.wikimedia.org/r/637856 (owner: ArielGlenn)
[09:51:15] <wikibugs>	 (Merged) jenkins-bot: enable general directory creation during createdirs job [dumps] - https://gerrit.wikimedia.org/r/637856 (owner: ArielGlenn)
[09:52:40] <logmsgbot>	 !log ariel@deploy1001 Started deploy [dumps/dumps@de4c823]: actually allow per run dir to be made early in the run
[09:52:44] <logmsgbot>	 !log ariel@deploy1001 Finished deploy [dumps/dumps@de4c823]: actually allow per run dir to be made early in the run (duration: 00m 04s)
[09:52:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:32] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:58:56] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:59:02] <apergos>	 ok well that actually fixed the problem. I can kick the crons on the other snapshots and they'll all start up now
[10:23:34] <apergos>	 and they are all running. in the spirit of 'deploy on a sunday and then run away' I have to go do errands (but everything is now running properly). I will be checking back in on things later 
[10:46:43] <wikibugs>	 (PS1) Marostegui: db1091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/637857 (https://phabricator.wikimedia.org/T225060)
[10:47:24] <wikibugs>	 (CR) Marostegui: [C: +2] db1091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/637857 (https://phabricator.wikimedia.org/T225060) (owner: Marostegui)
[10:49:52] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui) Thanks everyone who responded to this incident. Looks like we'd need another disk for this host. @wiki_willy do we have some spares? This host is scheduled for replacement with th...
[11:27:28] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:32:28] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:12:00] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:13:08] <apergos>	 everything still looks fine, so that's that
[13:09:54] <wikibugs>	 Operations, Performance-Team, Traffic, serviceops, Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (Lam1982019)
[13:43:27] <wikibugs>	 Operations, Product-Infrastructure-Team-Backlog, Proton, Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (Hrishikes) And here is the Czech Wiki file.  {F32421293}
[13:46:10] <wikibugs>	 Operations, Product-Infrastructure-Team-Backlog, Proton, Traffic, and 2 others: PDF download generates invalid PDF files - https://phabricator.wikimedia.org/T266559 (Hrishikes) I am not getting invalid pdfs.  {F32421295}
[13:46:14] <wikibugs>	 Operations, Performance-Team, Traffic, serviceops, Performance Issue: Very long response time on frwiki main page - https://phabricator.wikimedia.org/T266865 (Aklapper)
[14:07:15] <wikibugs>	 (PS1) Hamish: Add wgImportSources for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388)
[14:27:58] <wikibugs>	 (PS1) Hamish: Add wgNamespaceAliases for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/637870 (https://phabricator.wikimedia.org/T266925)
[15:18:00] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[15:56:58] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:02] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:27:18] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:32:20] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:38:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:40:18] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:57:22] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:02:34] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:14] <wikibugs>	 (CR) Zoranzoki21: [C: +1] "This should be good." [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[17:31:25] <wikibugs>	 (CR) jenkins-bot: [V: -1] Add wgImportSources for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[17:32:12] <wikibugs>	 (CR) Zoranzoki21: [C: -1] Add wgImportSources for zhwikinews (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[17:42:56] <wikibugs>	 Operations, Commons, MediaWiki-File-management, StructuredDataOnCommons, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (Multichill) I just noticed this also breaks https://commons.wikimedia.org/wiki/Sp...
[18:57:22] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:01:04] <icinga-wm>	 RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:24] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:27:05] <wikibugs>	 (PS15) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[19:55:28] <wikibugs>	 (Abandoned) DannyS712: Branch commit for wmf/1.36.0-wmf.15 [core] (wmf/1.36.0-wmf.15) - https://gerrit.wikimedia.org/r/636549 (https://phabricator.wikimedia.org/T263181) (owner: TrainBranchBot)
[19:57:20] <icinga-wm>	 RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:02:06] <icinga-wm>	 PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:57:58] <wikibugs>	 Operations, Fundraising-Backlog, Traffic, fundraising-tech-ops, FR-Email: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (Aklapper) Stalled→Open Resetting task status per last comment.
[21:26:58] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:32:02] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:27:26] <icinga-wm>	 RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:30:48] <wikibugs>	 Operations, Commons, SRE-swift-storage: Recently more broken files (premature end of file at 5MB size) that were cross-wiki uploaded to Commons - https://phabricator.wikimedia.org/T266903 (Peachey88)
[22:32:28] <icinga-wm>	 PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:41:01] <Urbanecm>	 !log mwscript extensions/OATHAuth/maintenance/disableOATHAuthForUser.php --wiki=metawiki Turkmen # T266976
[22:41:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:09] <stashbot>	 T266976: Please reset 2FA for my account. - https://phabricator.wikimedia.org/T266976
[22:49:59] <wikibugs>	 Operations, DNS, Internet-Archive, Traffic, Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216 (Aklapper) Neither @brion nor IA folks have answered, and @80686 hasn't been active for two years here. Proposing...
[22:51:40] <wikibugs>	 Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584 (Aklapper) Stalled→Open The previous comments don't explain who or what (task?) exactly this task is stalled on (["If a report is waiting for further input (e.g. from its reporter or a third party)...