[00:04:13] 10SRE, 10LDAP-Access-Requests: Superset access for Mikeraish - https://phabricator.wikimedia.org/T279147 (10Dzahn)
[00:20:49] PROBLEM - cassandra-a SSL 10.192.48.121:7001 on restbase2017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[00:20:49] PROBLEM - cassandra-a CQL 10.192.48.121:9042 on restbase2017 is CRITICAL: connect to address 10.192.48.121 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[00:21:31] PROBLEM - Check systemd state on restbase2017 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:13] PROBLEM - cassandra-a service on restbase2017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:39:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:40:39] RECOVERY - Check systemd state on restbase2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:25] RECOVERY - cassandra-a service on restbase2017 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:42:23] RECOVERY - cassandra-a SSL 10.192.48.121:7001 on restbase2017 is OK: SSL OK - Certificate restbase2017-a valid until 2022-10-08 10:53:59 +0000 (expires in 553 days) https://phabricator.wikimedia.org/T120662
[00:42:23] RECOVERY - cassandra-a CQL 10.192.48.121:9042 on restbase2017 is OK: TCP OK - 0.033 second response time on 10.192.48.121 port 9042 https://phabricator.wikimedia.org/T93886
[00:44:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:20:20] 10SRE, 10ops-codfw, 10DC-Ops: Netbox/Accounting Discrepancies - https://phabricator.wikimedia.org/T279214 (10Papaul) 05Open→03Resolved fixed
[02:57:32] 10SRE, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10Krinkle)
[02:57:51] 10SRE, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10Krinkle)
[03:18:15] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:23:01] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.186 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:13:05] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:14:29] Eh
[04:16:20] I marked the page as resolved
[05:35:22] !log andrew@deploy1002 Started deploy [horizon/deploy@35199a3]: upgrade labtesthorizon to the Wallaby branch
[05:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:28] !log andrew@deploy1002 Finished deploy [horizon/deploy@35199a3]: upgrade labtesthorizon to the Wallaby branch (duration: 03m 05s)
[05:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:17] (03PS2) 10ArielGlenn: preclaim job fragments before claiming them [dumps] - 10https://gerrit.wikimedia.org/r/676632 (https://phabricator.wikimedia.org/T252396)
[06:43:55] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[06:52:35] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1060-production-search-eqiad on elastic1060 is CRITICAL: 285.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-eqiad&var-instance=elastic1060&panelId=37
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210403T0700)
[07:22:31] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:29:45] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:31:23] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3175 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[07:38:29] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[07:46:19] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:59:49] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.5556 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[08:00:35] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:19:29] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[08:25:05] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:49:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:52:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:12:25] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1060-production-search-eqiad on elastic1060 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-eqiad&var-instance=elastic1060&panelId=37
[10:10:09] (03CR) 10Aklapper: "Lego: I have no idea how to do that; care to elaborate? :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667571 (https://phabricator.wikimedia.org/T188997) (owner: 10Aklapper)
[10:34:11] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 186476776 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:36:33] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 354936 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:36:41] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikibase_repo_prune_test.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:41:43] (03PS1) 10Zabe: Add 'apihighlimits' to bots on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676716 (https://phabricator.wikimedia.org/T279226)
[10:46:29] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:54] (03CR) 10Reedy: [C: 04-2] "As per https://commons.wikimedia.org/wiki/Special:ListGroupRights, the bots user group already has it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676716 (https://phabricator.wikimedia.org/T279226) (owner: 10Zabe)
[11:03:23] (03CR) 10Zabe: "> Patch Set 1: Code-Review-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676716 (https://phabricator.wikimedia.org/T279226) (owner: 10Zabe)
[12:17:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:19:35] 10SRE, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (10Nemo_bis) > Swap URLs of lists-next.wikimedia.org to lists.wikimedia.org and lists.wikimedia.org to lists-old.wikimedia.org (make sure archive URLs don't break) So what'...
[12:20:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:49:37] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 165309816 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:52:07] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 635952 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:59:47] PROBLEM - Host ms-be2028 is DOWN: PING CRITICAL - Packet loss = 100%
[13:01:54] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) I think we should be dropping archives of wikidata-bugs {T262773}, It's tiny in fs but migration and other work is a big...
[13:02:40] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) >>! In T278609#6969578, @Legoktm wrote: >>>! In T278609#6967023, @Ladsgroup wrote: >> - The search index though is 17 MB...
[13:58:20] (03CR) 10DharmrajRathod98: "> Patch Set 9:" (034 comments) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98)
[15:00:44] !log andrew@deploy1002 Started deploy [horizon/deploy@8833f80]: upgrade labtesthorizon to the Wallaby branch
[15:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:35] !log andrew@deploy1002 Finished deploy [horizon/deploy@8833f80]: upgrade labtesthorizon to the Wallaby branch (duration: 11m 51s)
[15:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:39] RECOVERY - Host ms-be2028 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms
[15:33:11] !log andrew@deploy1002 Started deploy [horizon/deploy@3a84c77]: upgrade labtesthorizon to the Wallaby branch
[15:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:29] !log andrew@deploy1002 Finished deploy [horizon/deploy@3a84c77]: upgrade labtesthorizon to the Wallaby branch (duration: 02m 18s)
[15:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:07] (03CR) 10H.krishna123: "No problem!" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98)
[15:43:47] (03CR) 10H.krishna123: "Just to add, the old regex considers .gz invalid, so it is a safe assumption that the filename should have either "no format" or ".tar.gz"" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98)
[16:08:15] PROBLEM - Host ms-be2028 is DOWN: PING CRITICAL - Packet loss = 100%
[16:21:31] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) One silver lining would be that lots of large mailing lists are large because they have lots of massive attachments and h...
[16:44:14] !log power reset for ms-be2028 - not reachable via ssh, no tty available via mgmt console, NMI unrecoverable errors logged in iLo's system logs
[16:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:21] RECOVERY - Host ms-be2028 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms
[16:48:43] I mistakenly opened a 1GB file using vim on lists1001; if it went down, I'm sorry
[16:54:33] PROBLEM - very high load average likely xfs on ms-be2028 is CRITICAL: CRITICAL - load average: 170.18, 134.75, 66.56 https://wikitech.wikimedia.org/wiki/Swift
[17:04:09] PROBLEM - very high load average likely xfs on ms-be2028 is CRITICAL: CRITICAL - load average: 214.16, 167.54, 108.86 https://wikitech.wikimedia.org/wiki/Swift
[17:10:42] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) Okay, I made a "discovery" which I wish I didn't. It's not fully clear to me yet. But I'm sure with migrating to mailman3,...
[17:12:08] (03PS10) 10DharmrajRathod98: Improved: regex-validation in cli/recover-dump and added unit test file in test/unit [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754)
[17:12:49] PROBLEM - Host ms-be2028 is DOWN: PING CRITICAL - Packet loss = 100%
[17:26:52] !log andrew@deploy1002 Started deploy [horizon/deploy@3a84c77]: upgrade labtesthorizon to the Wallaby branch
[17:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:06] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) so if my calculations are correct, for each public mailing list with archives enabled, it keeps nine copies of each email...
[17:30:28] !log andrew@deploy1002 Finished deploy [horizon/deploy@3a84c77]: upgrade labtesthorizon to the Wallaby branch (duration: 03m 35s)
[17:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:46] Amir1: that's insane
[17:33:27] Amir1: mailman2 or mailman3 keeps nine copies?
[17:33:34] mailman2
[17:35:03] RhinosF1: shocked? no, surprised? also no
[17:37:24] Amir1: insane but yeah somehow not surprising
[17:38:45] how many copies will mailman3 keep? given that they're stored on a replicated mysql cluster with multiple replicas in each dc
[17:38:56] nine copies, lol
[17:39:18] I guess they took the 'extra precautions' very seriously
[17:40:13] Amir1: by the way, I sent you an email; nothing urgent
[17:40:29] Tbh most insane software isn't shocking or surprising
[17:42:35] Majavah: m5 only has replication to three replicas, so it'll be four copies + one search index on the VM itself, but that gives us the ability to search
[17:43:25] oh, m5 is that small
[17:44:07] and it's also horizontally scalable (i.e. they're stored on different hardware across the country and you can just buy more, while the VM is all in one place)
[17:44:21] tabbycat: yeah, I'll respond once I'm done with the ticket
[17:44:40] no trouble & thanks
[18:11:29] (03PS1) 10Andrew Bogott: OpenStack Horizon: add version switching for policy files [puppet] - 10https://gerrit.wikimedia.org/r/676742 (https://phabricator.wikimedia.org/T261138)
[18:12:15] (03CR) 10jerkins-bot: [V: 04-1] OpenStack Horizon: add version switching for policy files [puppet] - 10https://gerrit.wikimedia.org/r/676742 (https://phabricator.wikimedia.org/T261138) (owner: 10Andrew Bogott)
[18:17:52] (03Abandoned) 10Andrew Bogott: OpenStack Horizon: add version switching for policy files [puppet] - 10https://gerrit.wikimedia.org/r/676742 (https://phabricator.wikimedia.org/T261138) (owner: 10Andrew Bogott)
[18:18:05] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:25:23] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:56:27] (03PS1) 10Andrew Bogott: Horizon: clean up some old ensure => absent file resources [puppet] - 10https://gerrit.wikimedia.org/r/676744
[18:56:29] (03PS1) 10Andrew Bogott: OpenStack Horion: create /etc/openstack-dashboard/default_policies [puppet] - 10https://gerrit.wikimedia.org/r/676745 (https://phabricator.wikimedia.org/T261138)
[18:57:08] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: clean up some old ensure => absent file resources [puppet] - 10https://gerrit.wikimedia.org/r/676744 (owner: 10Andrew Bogott)
[18:57:17] (03CR) 10jerkins-bot: [V: 04-1] OpenStack Horion: create /etc/openstack-dashboard/default_policies [puppet] - 10https://gerrit.wikimedia.org/r/676745 (https://phabricator.wikimedia.org/T261138) (owner: 10Andrew Bogott)
[19:00:02] (03PS2) 10Andrew Bogott: OpenStack Horion: create /etc/openstack-dashboard/default_policies [puppet] - 10https://gerrit.wikimedia.org/r/676745 (https://phabricator.wikimedia.org/T261138)
[19:01:03] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Horion: create /etc/openstack-dashboard/default_policies [puppet] - 10https://gerrit.wikimedia.org/r/676745 (https://phabricator.wikimedia.org/T261138) (owner: 10Andrew Bogott)
[19:01:08] (03PS3) 10Andrew Bogott: OpenStack Horion: create /etc/openstack-dashboard/default_policies [puppet] - 10https://gerrit.wikimedia.org/r/676745 (https://phabricator.wikimedia.org/T261138)
[19:18:18] !log andrew@deploy1002 Started deploy [horizon/deploy@df2b0b4]: upgrade labtesthorizon to the Wallaby branch
[19:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:30] !log andrew@deploy1002 Finished deploy [horizon/deploy@df2b0b4]: upgrade labtesthorizon to the Wallaby branch (duration: 02m 11s)
[19:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:03] RECOVERY - Host ms-be2028 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms
[19:30:27] RECOVERY - very high load average likely xfs on ms-be2028 is OK: OK - load average: 44.49, 15.96, 5.76 https://wikitech.wikimedia.org/wiki/Swift
[19:33:48] (03PS1) 10Andrew Bogott: Added some templates and files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/676747 (https://phabricator.wikimedia.org/T261138)
[19:33:50] (03PS1) 10Andrew Bogott: Horizon: move codfw1dev to Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/676748 (https://phabricator.wikimedia.org/T261138)
[19:40:30] (03PS2) 10Andrew Bogott: Added some templates and files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/676747 (https://phabricator.wikimedia.org/T261138)
[19:40:32] (03PS2) 10Andrew Bogott: Horizon: move codfw1dev to Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/676748 (https://phabricator.wikimedia.org/T261138)
[19:42:03] (03CR) 10Andrew Bogott: [C: 03+2] Added some templates and files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/676747 (https://phabricator.wikimedia.org/T261138) (owner: 10Andrew Bogott)
[19:42:15] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: move codfw1dev to Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/676748 (https://phabricator.wikimedia.org/T261138) (owner: 10Andrew Bogott)
[20:02:35] PROBLEM - Check systemd state on ms-be2032 is CRITICAL: CRITICAL - degraded: The following units failed: session-49922.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:07:47] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:09:49] RECOVERY - Check systemd state on ms-be2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:29:41] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:41:45] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:44:09] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:53:45] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:18:05] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:34:57] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:49:35] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:56:49] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:08:49] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:19:11] 10SRE, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (10Legoktm) >>! In T52864#6970300, @Nemo_bis wrote: >> Swap URLs of lists-next.wikimedia.org to lists.wikimedia.org and lists.wikimedia.org to lists-old.wikimedia.org (make...
[22:20:13] (03CR) 10Legoktm: Fix broken rendering of characters in EasyTimeline for Yue Chinese (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667571 (https://phabricator.wikimedia.org/T188997) (owner: 10Aklapper)
[22:28:19] PROBLEM - Memory correctable errors -EDAC- on thumbor2001 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor2001&var-datasource=codfw+prometheus/ops
[22:42:53] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:47:49] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:07:01] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:16:37] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:24:09] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:47:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:49:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets