[00:04:13] 10SRE, 10LDAP-Access-Requests: Superset access for Mikeraish - https://phabricator.wikimedia.org/T279147 (10Dzahn)
[00:20:49] PROBLEM - cassandra-a SSL 10.192.48.121:7001 on restbase2017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[00:20:49] PROBLEM - cassandra-a CQL 10.192.48.121:9042 on restbase2017 is CRITICAL: connect to address 10.192.48.121 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[00:21:31] PROBLEM - Check systemd state on restbase2017 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:13] PROBLEM - cassandra-a service on restbase2017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:39:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:40:39] RECOVERY - Check systemd state on restbase2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:25] RECOVERY - cassandra-a service on restbase2017 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:42:23] RECOVERY - cassandra-a SSL 10.192.48.121:7001 on restbase2017 is OK: SSL OK - Certificate restbase2017-a valid until 2022-10-08 10:53:59 +0000 (expires in 553 days) https://phabricator.wikimedia.org/T120662
[00:42:23] RECOVERY - cassandra-a CQL 10.192.48.121:9042 on restbase2017 is OK: TCP OK - 0.033 second response time on 10.192.48.121 port 9042 https://phabricator.wikimedia.org/T93886
[00:44:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:20:20] 10SRE, 10ops-codfw, 10DC-Ops: Netbox/Accounting Discrepancies - https://phabricator.wikimedia.org/T279214 (10Papaul) 05Open→03Resolved fixed
[02:57:32] 10SRE, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10Krinkle)
[02:57:51] 10SRE, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10Krinkle)
[03:18:15] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:23:01] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.186 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:13:05] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:14:29] Eh
[04:16:20] I marked the page as resolved
[05:35:22] !log andrew@deploy1002 Started deploy [horizon/deploy@35199a3]: upgrade labtesthorizon to the Wallaby branch
[05:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:28] !log andrew@deploy1002 Finished deploy [horizon/deploy@35199a3]: upgrade labtesthorizon to the Wallaby branch (duration: 03m 05s)
[05:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:17] (03PS2) 10ArielGlenn: preclaim job fragments before claiming them [dumps] - 10https://gerrit.wikimedia.org/r/676632 (https://phabricator.wikimedia.org/T252396)
[06:43:55] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[06:52:35] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1060-production-search-eqiad on elastic1060 is CRITICAL: 285.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-eqiad&var-instance=elastic1060&panelId=37
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210403T0700)
[07:22:31] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:29:45] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:31:23] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3175 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[07:38:29] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[07:46:19] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[07:59:49] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.5556 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[08:00:35] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:19:29] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[08:25:05] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[08:49:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:52:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:12:25] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1060-production-search-eqiad on elastic1060 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-eqiad&var-instance=elastic1060&panelId=37
[10:10:09] (03CR) 10Aklapper: "Lego: I have no idea how to do that; care to elaborate? :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667571 (https://phabricator.wikimedia.org/T188997) (owner: 10Aklapper)
[10:34:11] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 186476776 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:36:33] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 354936 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:36:41] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikibase_repo_prune_test.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:41:43] (03PS1) 10Zabe: Add 'apihighlimits' to bots on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676716 (https://phabricator.wikimedia.org/T279226)
[10:46:29] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:54] (03CR) 10Reedy: [C: 04-2] "As per https://commons.wikimedia.org/wiki/Special:ListGroupRights, the bots user group already has it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676716 (https://phabricator.wikimedia.org/T279226) (owner: 10Zabe)
[11:03:23] (03CR) 10Zabe: "> Patch Set 1: Code-Review-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676716 (https://phabricator.wikimedia.org/T279226) (owner: 10Zabe)
[12:17:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:19:35] 10SRE, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (10Nemo_bis) > Swap URLs of lists-next.wikimedia.org to lists.wikimedia.org and lists.wikimedia.org to lists-old.wikimedia.org (make sure archive URLs don't break) So what'...
[12:20:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:49:37] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 165309816 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:52:07] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 635952 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:59:47] PROBLEM - Host ms-be2028 is DOWN: PING CRITICAL - Packet loss = 100%
[13:01:54] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) I think we should be dropping archives of wikidata-bugs {T262773}, It's tiny in fs but migration and other work is a big...
[13:02:40] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) >>! In T278609#6969578, @Legoktm wrote: >>>! In T278609#6967023, @Ladsgroup wrote: >> - The search index though is 17 MB...
[13:58:20] (03CR) 10DharmrajRathod98: "> Patch Set 9:" (034 comments) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98)
[15:00:44] !log andrew@deploy1002 Started deploy [horizon/deploy@8833f80]: upgrade labtesthorizon to the Wallaby branch
[15:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:35] !log andrew@deploy1002 Finished deploy [horizon/deploy@8833f80]: upgrade labtesthorizon to the Wallaby branch (duration: 11m 51s)
[15:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:39] RECOVERY - Host ms-be2028 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms
[15:33:11] !log andrew@deploy1002 Started deploy [horizon/deploy@3a84c77]: upgrade labtesthorizon to the Wallaby branch
[15:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:29] !log andrew@deploy1002 Finished deploy [horizon/deploy@3a84c77]: upgrade labtesthorizon to the Wallaby branch (duration: 02m 18s)
[15:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:07] (03CR) 10H.krishna123: "No problem!" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98)
[15:43:47] (03CR) 10H.krishna123: "Just to add, the old regex considers .gz invalid, so it is a safe assumption that the filename should have either "no format" or ".tar.gz"" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98)
[16:08:15] PROBLEM - Host ms-be2028 is DOWN: PING CRITICAL - Packet loss = 100%
[16:21:31] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) One silver lining would be that lots of large mailing lists are large because they have lots of massive attachments and h...
[16:44:14] !log power reset for ms-be2028 - not reachable via ssh, no tty available via mgmt console, NMI unrecoverable errors logged in iLo's system logs
[16:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:21] RECOVERY - Host ms-be2028 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms
[16:48:43] I mistakenly opened a 1GB file using vim on lists1001; if it went down, I'm sorry
[16:54:33] PROBLEM - very high load average likely xfs on ms-be2028 is CRITICAL: CRITICAL - load average: 170.18, 134.75, 66.56 https://wikitech.wikimedia.org/wiki/Swift
[17:04:09] PROBLEM - very high load average likely xfs on ms-be2028 is CRITICAL: CRITICAL - load average: 214.16, 167.54, 108.86 https://wikitech.wikimedia.org/wiki/Swift
[17:10:42] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) Okay, I made a "discovery" which I wish I didn't. It's not fully clear to me yet. But I'm sure with migrating to mailman3,...
[17:12:08] (03PS10) 10DharmrajRathod98: Improved: regex-validation in cli/recover-dump and added unit test file in test/unit [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754)
[17:12:49] PROBLEM - Host ms-be2028 is DOWN: PING CRITICAL - Packet loss = 100%
[17:26:52] !log andrew@deploy1002 Started deploy [horizon/deploy@3a84c77]: upgrade labtesthorizon to the Wallaby branch
[17:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:06] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) so if my calculations are correct, for each public mailing list with archives enabled, it keeps nine copies of each email...
[17:30:28] !log andrew@deploy1002 Finished deploy [horizon/deploy@3a84c77]: upgrade labtesthorizon to the Wallaby branch (duration: 03m 35s)
[17:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:46] Amir1: that's insane
[17:33:27] Amir1: mailman2 or mailman3 keeps nine copies?
[17:33:34] mailman2
[17:35:03] RhinosF1: shocked? no, surprised? also no
[17:37:24] Amir1: insane but yeah somehow not surprising
[17:38:45] how many copies will mailman3 keep? given that they're stored on a replicated mysql cluster with multiple replicas in each dc
[17:38:56] nine copies, lol
[17:39:18] I guess they took the 'extra precautions' very seriously
[17:40:13] Amir1: by the way, I sent you an email; nothing urgent
[17:40:29] Tbh most insane software isn't shocking or surprising
[17:42:35] Majavah: m5 only has replication to three replicas, so it'll be four copies + one search index on the VM itself, but that gives us the ability to search
[17:43:25] oh, m5 is that small
[17:44:07] and it's also horizontally scalable (i.e. they're stored on different hardware across the country and you can just buy more, while the VM is all in one place)
[17:44:21] tabbycat: yeah, I'll respond once I'm done with the ticket
[17:44:40] no trouble & thanks
[18:11:29] (03PS1) 10Andrew Bogott: OpenStack Horizon: add version switching for policy files [puppet] - 10https://gerrit.wikimedia.org/r/676742 (https://phabricator.wikimedia.org/T261138)
[18:12:15] (03CR) 10jerkins-bot: [V: 04-1] OpenStack Horizon: add version switching for policy files [puppet] - 10https://gerrit.wikimedia.org/r/676742 (https://phabricator.wikimedia.org/T261138) (owner: 10Andrew Bogott)
[18:17:52] (03Abandoned) 10Andrew Bogott: OpenStack Horizon: add version switching for policy files [puppet] - 10https://gerrit.wikimedia.org/r/676742 (https://phabricator.wikimedia.org/T261138) (owner: 10Andrew Bogott)
[18:18:05] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:25:23] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:56:27] (03PS1) 10Andrew Bogott: Horizon: clean up some old ensure => absent file resources [puppet] - 10https://gerrit.wikimedia.org/r/676744
[18:56:29] (03PS1) 10Andrew Bogott: OpenStack Horion: create /etc/openstack-dashboard/default_policies [puppet] - 10https://gerrit.wikimedia.org/r/676745 (https://phabricator.wikimedia.org/T261138)
[18:57:08] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: clean up some old ensure => absent file resources [puppet] - 10https://gerrit.wikimedia.org/r/676744 (owner: 10Andrew Bogott)
[18:57:17] (03CR) 10jerkins-bot: [V: 04-1] OpenStack Horion: create /etc/openstack-dashboard/default_policies [puppet] - 10https://gerrit.wikimedia.org/r/676745 (https://phabricator.wikimedia.org/T261138) (owner: 10Andrew Bogott)
[19:00:02] (03PS2) 10Andrew Bogott: OpenStack Horion: create /etc/openstack-dashboard/default_policies [puppet] - 10https://gerrit.wikimedia.org/r/676745 (https://phabricator.wikimedia.org/T261138)
[19:01:03] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Horion: create /etc/openstack-dashboard/default_policies [puppet] - 10https://gerrit.wikimedia.org/r/676745 (https://phabricator.wikimedia.org/T261138) (owner: 10Andrew Bogott)
[19:01:08] (03PS3) 10Andrew Bogott: OpenStack Horion: create /etc/openstack-dashboard/default_policies [puppet] - 10https://gerrit.wikimedia.org/r/676745 (https://phabricator.wikimedia.org/T261138)
[19:18:18] !log andrew@deploy1002 Started deploy [horizon/deploy@df2b0b4]: upgrade labtesthorizon to the Wallaby branch
[19:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:30] !log andrew@deploy1002 Finished deploy [horizon/deploy@df2b0b4]: upgrade labtesthorizon to the Wallaby branch (duration: 02m 11s)
[19:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:03] RECOVERY - Host ms-be2028 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms
[19:30:27] RECOVERY - very high load average likely xfs on ms-be2028 is OK: OK - load average: 44.49, 15.96, 5.76 https://wikitech.wikimedia.org/wiki/Swift
[19:33:48] (03PS1) 10Andrew Bogott: Added some templates and files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/676747 (https://phabricator.wikimedia.org/T261138)
[19:33:50] (03PS1) 10Andrew Bogott: Horizon: move codfw1dev to Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/676748 (https://phabricator.wikimedia.org/T261138)
[19:40:30] (03PS2) 10Andrew Bogott: Added some templates and files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/676747 (https://phabricator.wikimedia.org/T261138)
[19:40:32] (03PS2) 10Andrew Bogott: Horizon: move codfw1dev to Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/676748 (https://phabricator.wikimedia.org/T261138)
[19:42:03] (03CR) 10Andrew Bogott: [C: 03+2] Added some templates and files for OpenStack Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/676747 (https://phabricator.wikimedia.org/T261138) (owner: 10Andrew Bogott)
[19:42:15] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: move codfw1dev to Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/676748 (https://phabricator.wikimedia.org/T261138) (owner: 10Andrew Bogott)
[20:02:35] PROBLEM - Check systemd state on ms-be2032 is CRITICAL: CRITICAL - degraded: The following units failed: session-49922.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:07:47] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:09:49] RECOVERY - Check systemd state on ms-be2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:29:41] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:41:45] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:44:09] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:53:45] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:18:05] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:34:57] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:49:35] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:56:49] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:08:49] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:19:11] 10SRE, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (10Legoktm) >>! In T52864#6970300, @Nemo_bis wrote: >> Swap URLs of lists-next.wikimedia.org to lists.wikimedia.org and lists.wikimedia.org to lists-old.wikimedia.org (make...
[22:20:13] (03CR) 10Legoktm: Fix broken rendering of characters in EasyTimeline for Yue Chinese (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667571 (https://phabricator.wikimedia.org/T188997) (owner: 10Aklapper)
[22:28:19] PROBLEM - Memory correctable errors -EDAC- on thumbor2001 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor2001&var-datasource=codfw+prometheus/ops
[22:42:53] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:47:49] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:07:01] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:16:37] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:24:09] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:47:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:49:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets