[00:03:53] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:03] PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:27] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:17] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:35] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:33] (CR) DannyS712: [C: +1] Order entries by alphabetical order [dns] - https://gerrit.wikimedia.org/r/623143 (https://phabricator.wikimedia.org/T253439) (owner: Gerrit maintenance bot)
[03:54:55] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:56:49] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200830T0700)
[10:14:01] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 639 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:18:15] (CR) Ammarpad: [C: +1] Allow bureaucrats to remove sysop permissions on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/623119 (https://phabricator.wikimedia.org/T261481) (owner: Mdaniels5757)
[13:15:35] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 22 probes of 639 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:43:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:45:35] PROBLEM - Too many messages in kafka logging-eqiad #o11y on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=n
[15:45:35] tasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[15:47:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:50:08] Operations, DNS, Traffic: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (Tgr) Let's also get rid of the old domain configuration while we are at it.
[15:50:40] Operations, DNS, Matrix, Traffic: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (Tgr)
[15:56:57] Operations, DNS, Matrix, Traffic: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (Tgr) (FWIW, a year ago I ran into [[https://phabricator.wikimedia.org/T223835#5240501|some trouble]] with the DNS method...
[15:58:37] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:58:48] (PS1) Gergő Tisza: Revert "Add .well-known/matrix for wikimedia.org" [mediawiki-config] - https://gerrit.wikimedia.org/r/623147 (https://phabricator.wikimedia.org/T223835)
[16:04:09] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:09:38] (PS1) ArielGlenn: move dumps around on the snapshots in prep network upgrade work [puppet] - https://gerrit.wikimedia.org/r/623177 (https://phabricator.wikimedia.org/T196487)
[16:13:16] !log restarted eqiad v5 logstashes
[16:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:25] (PS1) Andrew Bogott: Openstack Nova: Add new enable_nova_rbd hiera setting [puppet] - https://gerrit.wikimedia.org/r/623178 (https://phabricator.wikimedia.org/T261252)
[16:16:25] (CR) jerkins-bot: [V: -1] Openstack Nova: Add new enable_nova_rbd hiera setting [puppet] - https://gerrit.wikimedia.org/r/623178 (https://phabricator.wikimedia.org/T261252) (owner: Andrew Bogott)
[16:17:34] (PS2) Andrew Bogott: Openstack Nova: Add new enable_nova_rbd hiera setting [puppet] - https://gerrit.wikimedia.org/r/623178 (https://phabricator.wikimedia.org/T261252)
[16:21:01] RECOVERY - Too many messages in kafka logging-eqiad #o11y on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[16:22:55] (PS3) Andrew Bogott: Openstack Nova: Add new enable_nova_rbd hiera setting [puppet] - https://gerrit.wikimedia.org/r/623178 (https://phabricator.wikimedia.org/T261252)
[16:29:24] (CR) Andrew Bogott: [C: +2] Openstack Nova: Add new enable_nova_rbd hiera setting [puppet] - https://gerrit.wikimedia.org/r/623178 (https://phabricator.wikimedia.org/T261252) (owner: Andrew Bogott)
[17:57:21] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:57:36] (PS4) Andrew Bogott: wmcs-ceph-migrate: add a resize step [puppet] - https://gerrit.wikimedia.org/r/623097
[17:57:58] (CR) jerkins-bot: [V: -1] wmcs-ceph-migrate: add a resize step [puppet] - https://gerrit.wikimedia.org/r/623097 (owner: Andrew Bogott)
[17:58:18] (PS5) Andrew Bogott: wmcs-ceph-migrate: add a resize step [puppet] - https://gerrit.wikimedia.org/r/623097
[17:58:42] (CR) jerkins-bot: [V: -1] wmcs-ceph-migrate: add a resize step [puppet] - https://gerrit.wikimedia.org/r/623097 (owner: Andrew Bogott)
[17:59:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:59:31] (PS6) Andrew Bogott: wmcs-ceph-migrate: add a resize step [puppet] - https://gerrit.wikimedia.org/r/623097 (https://phabricator.wikimedia.org/T261252)
[18:00:13] (CR) Andrew Bogott: [C: +2] wmcs-ceph-migrate: add a resize step [puppet] - https://gerrit.wikimedia.org/r/623097 (https://phabricator.wikimedia.org/T261252) (owner: Andrew Bogott)
[18:02:21] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache full. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[18:22:22] (PS2) ArielGlenn: move dumps around on the snapshots in prep for network upgrade work [puppet] - https://gerrit.wikimedia.org/r/623177 (https://phabricator.wikimedia.org/T196487)
[19:57:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:58:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:43:48] (PS1) Urbanecm: itwiki: Assign patrol right to autopatrolled instead of autoconfirmed [mediawiki-config] - https://gerrit.wikimedia.org/r/623191 (https://phabricator.wikimedia.org/T261587)
[23:01:33] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 73 probes of 558 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:07:23] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 52 probes of 558 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:22:47] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 71 probes of 558 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:28:41] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 51 probes of 558 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas