[00:50:18] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:48:46] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [02:30:44] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [02:56:20] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:56:24] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:59:04] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:10:06] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:14:44] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:14:48] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:49:38] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:49:38] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:49:46] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:10:44] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:16:32] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:22:12] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:28:00] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:14:26] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:20:16] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:41:16] PROBLEM - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [06:44:56] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200531T0700) [08:52:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:56:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:56:51] !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Vox Golf' 'Colonel Chicken' (T254068) [09:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:56] T254068: Unblock stuck global rename of Colonel_Chicken - https://phabricator.wikimedia.org/T254068 [11:39:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 53 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:45:16] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 46 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:36:36] (03CR) 10JMeybohm: [C: 03+1] "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/599332 (owner: 10Alexandros Kosiaris) [13:35:42] (03PS1) 10JMeybohm: changeprop: Migrate to common_templates 0.2 tls_helper [deployment-charts] - 10https://gerrit.wikimedia.org/r/600862 (https://phabricator.wikimedia.org/T253396) [16:33:16] 10Operations, 10Analytics, 10Event-Platform, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Aklapper) >>! In T213561#4881255, Joe wrote: > Might I suggest that you use a SRV dns record instead? >>! In T213561#4882509, Ottomata wrote: > Kafka doesn... [16:42:05] (03PS1) 10Cwhite: wmflib: add systemd.timer onCalendar support to cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/600928 [16:43:09] (03PS2) 10Cwhite: wmflib: add systemd.timer onCalendar support to cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/600928 [16:48:59] (03PS3) 10Cwhite: wmflib: add systemd.timer onCalendar support to cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/600928 [16:50:00] (03CR) 10Cwhite: "I took a stab at adding onCalendar support to cron_splay(). Let me know what you think!" [puppet] - 10https://gerrit.wikimedia.org/r/600928 (owner: 10Cwhite) [18:33:55] (03PS3) 10Elukey: Add support to pull datapoints from Kafka [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/600295 [18:37:22] (03PS4) 10Elukey: Add support to pull datapoints from Kafka [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/600295 [18:40:25] (03PS4) 10Cwhite: wmflib: add systemd.timer onCalendar support to cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/600928 [18:51:37] (03PS5) 10Elukey: Add support to pull datapoints from Kafka [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/600295 [19:12:18] (03PS1) 10MarcoAurelio: [eswiki] Normalize talk namespaces for Anexo, Portal and Wikiproyecto [mediawiki-config] - 10https://gerrit.wikimedia.org/r/600979 (https://phabricator.wikimedia.org/T254077) [19:13:02] (03PS2) 10MarcoAurelio: [eswiki] Normalize talk namespaces for Anexo, Portal and Wikiproyecto [mediawiki-config] - 10https://gerrit.wikimedia.org/r/600979 (https://phabricator.wikimedia.org/T254077) [19:18:54] PROBLEM - MariaDB Slave Lag: s1 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1127.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [20:03:18] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:04:32] PROBLEM - OSPF status on mr1-codfw is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:05:08] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 86, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:05:54] PROBLEM - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 1 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [20:10:38] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:35:13] paravoid: we lost fpc0 on cr1-codfw [20:43:18] RECOVERY - MariaDB Slave Lag: s1 on db2097 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [20:55:20] My brain just read "trouble shooting" above. [21:01:51] well, I thought in covid when reading about codfw before :P [21:32:41] 10Operations, 10netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (10ayounsi) p:05Triage→03High [21:33:52] ACKNOWLEDGEMENT - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 1 yellow alarms Ayounsi https://phabricator.wikimedia.org/T254110 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [21:33:52] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 86, down: 4, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T254110 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:33:52] ACKNOWLEDGEMENT - OSPF status on mr1-codfw is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP Ayounsi https://phabricator.wikimedia.org/T254110 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:40:03] 10Operations, 10netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (10ayounsi) Opened JTAC case 2020-0531-0098. [21:43:56] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:58:31] 10Operations, 10netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (10ayounsi) Logs and RSI attached to the case. [22:04:54] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 54 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:10:46] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 47 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas