[00:03:53] <icinga-wm>	 PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:03] <icinga-wm>	 PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:27] <icinga-wm>	 PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:17] <icinga-wm>	 PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:35] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:33] <wikibugs>	 (CR) DannyS712: [C: +1] Order entries by alphabetical order [dns] - https://gerrit.wikimedia.org/r/623143 (https://phabricator.wikimedia.org/T253439) (owner: Gerrit maintenance bot)
[03:54:55] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:56:49] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200830T0700)
[10:14:01] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 639 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:18:15] <wikibugs>	 (CR) Ammarpad: [C: +1] Allow bureaucrats to remove sysop permissions on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/623119 (https://phabricator.wikimedia.org/T261481) (owner: Mdaniels5757)
[13:15:35] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 22 probes of 639 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:43:41] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:45:35] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad #o11y on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[15:47:25] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:50:08] <wikibugs>	 Operations, DNS, Traffic: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (Tgr) Let's also get rid of the old domain configuration while we are at it.
[15:50:40] <wikibugs>	 Operations, DNS, Matrix, Traffic: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (Tgr)
[15:56:57] <wikibugs>	 Operations, DNS, Matrix, Traffic: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (Tgr) (FWIW, a year ago I ran into [[https://phabricator.wikimedia.org/T223835#5240501|some trouble]] with the DNS method...
[15:58:37] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:58:48] <wikibugs>	 (PS1) Gergő Tisza: Revert "Add .well-known/matrix for wikimedia.org" [mediawiki-config] - https://gerrit.wikimedia.org/r/623147 (https://phabricator.wikimedia.org/T223835)
[16:04:09] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:09:38] <wikibugs>	 (PS1) ArielGlenn: move dumps around on the snapshots in prep network upgrade work [puppet] - https://gerrit.wikimedia.org/r/623177 (https://phabricator.wikimedia.org/T196487)
[16:13:16] <herron>	 !log restarted eqiad v5 logstashes
[16:13:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:25] <wikibugs>	 (PS1) Andrew Bogott: Openstack Nova: Add new enable_nova_rbd hiera setting [puppet] - https://gerrit.wikimedia.org/r/623178 (https://phabricator.wikimedia.org/T261252)
[16:16:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] Openstack Nova: Add new enable_nova_rbd hiera setting [puppet] - https://gerrit.wikimedia.org/r/623178 (https://phabricator.wikimedia.org/T261252) (owner: Andrew Bogott)
[16:17:34] <wikibugs>	 (PS2) Andrew Bogott: Openstack Nova: Add new enable_nova_rbd hiera setting [puppet] - https://gerrit.wikimedia.org/r/623178 (https://phabricator.wikimedia.org/T261252)
[16:21:01] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad #o11y on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[16:22:55] <wikibugs>	 (PS3) Andrew Bogott: Openstack Nova: Add new enable_nova_rbd hiera setting [puppet] - https://gerrit.wikimedia.org/r/623178 (https://phabricator.wikimedia.org/T261252)
[16:29:24] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Openstack Nova: Add new enable_nova_rbd hiera setting [puppet] - https://gerrit.wikimedia.org/r/623178 (https://phabricator.wikimedia.org/T261252) (owner: Andrew Bogott)
[17:57:21] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:57:36] <wikibugs>	 (PS4) Andrew Bogott: wmcs-ceph-migrate: add a resize step [puppet] - https://gerrit.wikimedia.org/r/623097
[17:57:58] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmcs-ceph-migrate: add a resize step [puppet] - https://gerrit.wikimedia.org/r/623097 (owner: Andrew Bogott)
[17:58:18] <wikibugs>	 (PS5) Andrew Bogott: wmcs-ceph-migrate: add a resize step [puppet] - https://gerrit.wikimedia.org/r/623097
[17:58:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmcs-ceph-migrate: add a resize step [puppet] - https://gerrit.wikimedia.org/r/623097 (owner: Andrew Bogott)
[17:59:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:59:31] <wikibugs>	 (PS6) Andrew Bogott: wmcs-ceph-migrate: add a resize step [puppet] - https://gerrit.wikimedia.org/r/623097 (https://phabricator.wikimedia.org/T261252)
[18:00:13] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs-ceph-migrate: add a resize step [puppet] - https://gerrit.wikimedia.org/r/623097 (https://phabricator.wikimedia.org/T261252) (owner: Andrew Bogott)
[18:02:21] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache full. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[18:22:22] <wikibugs>	 (PS2) ArielGlenn: move dumps around on the snapshots in prep for network upgrade work [puppet] - https://gerrit.wikimedia.org/r/623177 (https://phabricator.wikimedia.org/T196487)
[19:57:05] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:58:57] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:43:48] <wikibugs>	 (PS1) Urbanecm: itwiki: Assign patrol right to autopatrolled instead of autoconfirmed [mediawiki-config] - https://gerrit.wikimedia.org/r/623191 (https://phabricator.wikimedia.org/T261587)
[23:01:33] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 73 probes of 558 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:07:23] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 52 probes of 558 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:22:47] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 71 probes of 558 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:28:41] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 51 probes of 558 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas