[00:30:19] PROBLEM - PHP opcache health on mw2332 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:37:33] RECOVERY - PHP opcache health on mw2332 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:41:31] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 39336472 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:07] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1917936 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:05] PROBLEM - PHP opcache health on mw2291 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:07:51] RECOVERY - PHP opcache health on mw2291 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:16:33] (PS3) Dave Pifke: php: $enable_request_profiling should affect CLI [puppet] - https://gerrit.wikimedia.org/r/599476 (https://phabricator.wikimedia.org/T253547)
[01:17:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:20:47] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:26:50] (CR) Ryan Kemper: [C: +2] [cirrus] prepend the wiki id in comp suggest logs [puppet] - https://gerrit.wikimedia.org/r/602015 (owner: DCausse)
[01:27:28] (CR) Ryan Kemper: [C: +2] "This seems like a great quality-of-life change. (Sorry for the delayed review, I need to start checking my review queue more regularly :P)" [puppet] - https://gerrit.wikimedia.org/r/602015 (owner: DCausse)
[01:28:32] (CR) Ryan Kemper: "I just did a puppet-merge so we should start seeing this change take effect in the next runs." [puppet] - https://gerrit.wikimedia.org/r/602015 (owner: DCausse)
[01:34:59] (CR) Ryan Kemper: [C: +2] query_service: Move shared config into common file [puppet] - https://gerrit.wikimedia.org/r/599145 (owner: EBernhardson)
[01:35:21] (CR) Ryan Kemper: [C: +2] "This looks great! Let's get this submitted/puppet-merged on monday" [puppet] - https://gerrit.wikimedia.org/r/599145 (owner: EBernhardson)
[01:36:32] (CR) Ryan Kemper: [C: +2] osm: fix shellcheck issues [puppet] - https://gerrit.wikimedia.org/r/602647 (https://phabricator.wikimedia.org/T254480) (owner: Jbond)
[01:37:03] (CR) Ryan Kemper: "These changes have been puppet-merged."
[puppet] - https://gerrit.wikimedia.org/r/602647 (https://phabricator.wikimedia.org/T254480) (owner: Jbond)
[01:46:53] (CR) Krinkle: [C: +1] php: $enable_request_profiling should affect CLI [puppet] - https://gerrit.wikimedia.org/r/599476 (https://phabricator.wikimedia.org/T253547) (owner: Dave Pifke)
[01:50:04] (CR) Krinkle: [C: +1] "Puppet compiler: https://wikitech.wikimedia.org/wiki/Performance/Runbook/Puppet_patches#Puppet_compiler" [puppet] - https://gerrit.wikimedia.org/r/599476 (https://phabricator.wikimedia.org/T253547) (owner: Dave Pifke)
[01:50:49] PROBLEM - Check systemd state on db1141 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:52:39] RECOVERY - Check systemd state on db1141 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:25] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:33:19] RECOVERY - Check the last execution of mediawiki_job_cirrus_build_completion_indices_codfw on mwmaint1002 is OK: OK: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_codfw https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:35:47] RECOVERY - Check the last execution of mediawiki_job_cirrus_build_completion_indices_eqiad on mwmaint1002 is OK: OK: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_eqiad https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:38:48] (PS1) Andrew Bogott: Add an even newer cloud-vps cumin key [labs/private] - https://gerrit.wikimedia.org/r/602788 (https://phabricator.wikimedia.org/T254589)
[02:38:57] (CR) Andrew Bogott: [V: +2 C: +2] Add an even newer cloud-vps cumin key [labs/private] -
https://gerrit.wikimedia.org/r/602788 (https://phabricator.wikimedia.org/T254589) (owner: Andrew Bogott)
[03:22:19] PROBLEM - PHP opcache health on mw2289 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:42:23] RECOVERY - PHP opcache health on mw2289 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[04:34:40] Puppet, Cloud-VPS, cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (Andrew)
[05:08:47] Operations, MachineVision, Product-Infrastructure-Team-Backlog, Structured-Data-Backlog, and 3 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (Lazy-restless) Resolved→Open
[06:42:39] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200606T0700)
[07:03:13] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2020 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:09:57] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:34:03] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2020 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:05:09] RECOVERY - ganeti-confd running on ganeti2016 is OK: PROCS OK: 1 process with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[09:05:29] RECOVERY - ganeti-mond running on ganeti2016 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[09:57:57] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:03:43] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 578 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:04:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:06:05] RECOVERY - Prometheus jobs reduced availability on
icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:12:13] PROBLEM - PHP opcache health on mw2262 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[11:28:33] RECOVERY - PHP opcache health on mw2262 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[12:37:07] Operations, Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (QChris)
[12:52:45] (CR) QChris: [C: -1] "Voting CR-1 to mark that we won't include this in the upcoming Gerrit 3.x upgrade in June 2020. Since the Gerrit upgrade itself is huge," [puppet] - https://gerrit.wikimedia.org/r/556270 (owner: Paladox)
[13:25:59] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:25:59] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:26:41] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:33:15] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:33:59] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:36:55] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:48:42] (Abandoned) Paladox: gerrit: Switch db from mysql to H2 [puppet] - https://gerrit.wikimedia.org/r/488093
(https://phabricator.wikimedia.org/T211139) (owner: Paladox)
[13:49:22] (Abandoned) Paladox: letsencrypt: Sync acme-tiny script from upstream [puppet] - https://gerrit.wikimedia.org/r/539618 (owner: Paladox)
[13:49:35] (Abandoned) Paladox: Test: Do not merge [puppet] - https://gerrit.wikimedia.org/r/513642 (owner: Paladox)
[13:51:55] (CR) QChris: "Just to help me with the rationale here: This is cosmetics, right?" [puppet] - https://gerrit.wikimedia.org/r/532391 (owner: Paladox)
[13:52:22] (PS5) Paladox: Test [puppet] - https://gerrit.wikimedia.org/r/351540
[13:52:28] (Abandoned) Paladox: Test [puppet] - https://gerrit.wikimedia.org/r/351540 (owner: Paladox)
[13:53:49] (CR) Paladox: "@Qchris yes, this is not needed for the upgrade." [puppet] - https://gerrit.wikimedia.org/r/532391 (owner: Paladox)
[13:54:30] (Abandoned) Paladox: gerrit: Make ipv6 optional (for the cloud) [puppet] - https://gerrit.wikimedia.org/r/602690 (owner: Paladox)
[13:55:41] (CR) QChris: "@paladox: Since Dzahn said that we should not need this, what's the" [puppet] - https://gerrit.wikimedia.org/r/539211 (owner: Paladox)
[13:56:43] (CR) Paladox: "@Qchris this is not needed for the upgrade (nor for prod). This is for cloud/wmcs." [puppet] - https://gerrit.wikimedia.org/r/539211 (owner: Paladox)
[13:58:35] (CR) QChris: "It seems WMF is fine with the current log setup."
[puppet] - https://gerrit.wikimedia.org/r/508657 (owner: Paladox)
[15:50:26] (PS4) Privacybatm: Write documentation using Sphinx [software/transferpy] - https://gerrit.wikimedia.org/r/602719 (https://phabricator.wikimedia.org/T253219)
[15:55:25] (PS5) Privacybatm: Write documentation using Sphinx [software/transferpy] - https://gerrit.wikimedia.org/r/602719 (https://phabricator.wikimedia.org/T253219)
[16:03:08] Operations, Hindi-Sites, MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), Patch-For-Review, and 3 others: Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (CptViraj)
[16:39:41] Operations, Wikimedia-Mailing-lists, Privacy, Security, User-Josve05a: Stop storing Mailman passwords in plain text - https://phabricator.wikimedia.org/T181803 (Ladsgroup)
[16:39:46] Operations, Security-Team, Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (Ladsgroup) Stalled→Open The official support for upgrading from mailman2 to mailman3 is there. You can see it in https://...
[16:55:28] Thanks Amir1
[16:56:01] hauskatze: no worries, this should have been done looong time ago :D
[16:56:29] I think so
[16:56:38] Glad to see it is moving again :)
[17:15:32] Amir1: is it mailman 3 that includes searching list archives?
[17:15:40] yup
[17:16:01] and no longer unencrypted passwords, or shared passwords
[17:16:10] YES
[17:16:23] it's like 'WT*' in 2020
[17:16:30] replace * with F :P
[17:17:05] Amir1: I asked as that was one question on my 6 month old https://phabricator.wikimedia.org/T242520
[17:24:26] https://docs.mailman3.org/en/latest/migration.html#why-upgrade
[17:24:30] Basically this
[17:25:05] "A real database backend for settings and configuration" what did it even use before? flat files?
[17:25:09] Amir1: I wonder if I should nudge that task to see if it's needed
[17:25:34] Is strict sql mode on for prod wikis?
[17:25:51] Majavah: I assume lol
[17:26:17] RhinosF1: yeah, good idea. Also I should send an email to SRE asking for help
[17:26:37] Amir1: I'll comment now
[17:27:18] I need to find out the Exim config of beta cluster
[17:27:31] and some puppet files
[17:28:01] Operations, Wikimedia-Mailing-lists, Patch-For-Review, User-RhinosF1: Allow Cloud mailing list to be indexed - https://phabricator.wikimedia.org/T242520 (RhinosF1) Quick Nudge: T52864 has been unblocked so mailman been searchable might be soon™️. Is this still desired?
[17:31:41] Operations, Wikimedia-Mailing-lists, Patch-For-Review, User-RhinosF1: Allow Cloud mailing list to be indexed - https://phabricator.wikimedia.org/T242520 (RhinosF1) https://lists.wikimedia.org/pipermail/cloud/2020-June/001117.html
[17:47:17] Operations, MachineVision, Product-Infrastructure-Team-Backlog, Structured-Data-Backlog, and 3 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (Multichill) @Lazy-restless care to explain why you re-opened this task? Looks like @Mholloway...
[17:51:18] Operations, MachineVision, Product-Infrastructure-Team-Backlog, Structured-Data-Backlog, and 3 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (Multichill) Open→Resolved Nevermind, this user is blocked on two Wikipedia's, see http...
[18:12:45] Operations, Performance-Team, serviceops, Patch-For-Review: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (Ladsgroup) >>! In T244340#6195903, @Joe wrote: > The problem we're trying to solve here is not an individu...
[18:39:45] (PS2) Privacybatm: transferpy: Package transferpy [software/transferpy] - https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736)
[18:48:36] marostegui: hola, you there?
[19:20:14] Operations, Analytics, Analytics-Kanban, EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (MrJaroslavik) Hey, can be fixed this problem?
[19:27:00] (PS1) Privacybatm: setup.py: Add RemoteExecution module to setup.py [software/transferpy] - https://gerrit.wikimedia.org/r/602879 (https://phabricator.wikimedia.org/T248256)
[19:35:25] (CR) Privacybatm: "> For the error you were getting, check virtual env configuration, it may be pointing to the wrong place and making the python path not wo" [software/transferpy] - https://gerrit.wikimedia.org/r/602618 (https://phabricator.wikimedia.org/T248256) (owner: Privacybatm)
[20:44:33] Operations, MachineVision, Product-Infrastructure-Team-Backlog, Structured-Data-Backlog, and 3 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (Ladsgroup) >>! In T225664#6199341, @Multichill wrote: > Nevermind, this user is blocked on two...
[20:47:54] Operations, Wikimedia-Mailing-lists: Official support for upgrade from existing Mailman 2.1 lists to Mailman 3 - https://phabricator.wikimedia.org/T130554 (Ladsgroup) Open→Resolved The support it seems official now: https://docs.mailman3.org/en/latest/migration.html It says the links to archives...
[20:48:00] Operations, Security-Team, Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (Ladsgroup)
[20:52:16] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:55:53] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:56:39] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:01:19] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:03:29] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:17:49] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:18:07] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:18:33] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:19:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:21:01] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:23:19] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:23:36] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:24:01] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:31:13] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:32:39] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:34:11] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:35:25] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:39:39] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:42:45] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:46:25] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:48:58] (CR) Krinkle: [C: -1] arclamp: add svgs for some key entrypoint/singleton methods calls [puppet] - https://gerrit.wikimedia.org/r/598292 (owner: Aaron Schulz)
[21:59:47] Operations, netops: Telia eqiad<->codfw (IC-307235) outage ref: 01171084 - https://phabricator.wikimedia.org/T254674 (CDanis)
[22:00:35] ACKNOWLEDGEMENT - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP CDanis https://phabricator.wikimedia.org/T254674 Telia Carrier Ref: 01171084 - The acknowledgement expires at: 2020-06-07 09:59:59. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:00:35] ACKNOWLEDGEMENT - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis https://phabricator.wikimedia.org/T254674 Telia Carrier Ref: 01171084 - The acknowledgement expires at: 2020-06-07 09:59:59. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:11:47] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:12:23] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status