[00:44:38] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[00:46:48] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:38:39] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:39:48] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:26:38] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 860.70 seconds
[03:37:49] PROBLEM - puppet last run on cp4028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-ISP.mmdb.gz]
[03:42:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 276.42 seconds
[03:49:08] PROBLEM - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[03:49:38] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:52:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:54:09] RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1015 bytes in 0.020 second response time
[04:03:08] RECOVERY - puppet last run on cp4028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:56:48] (PS1) Legoktm: python: Install dependencies for `python-ldap` [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/453665 (https://phabricator.wikimedia.org/T202218)
[09:03:40] (PS1) Legoktm: Update npm to 6.4.0 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/453666 (https://phabricator.wikimedia.org/T169451)
[09:04:58] (PS2) Legoktm: Update npm to 6.4.0 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/453666 (https://phabricator.wikimedia.org/T169451)
[09:06:10] (PS1) Framawiki: Set $wmgUseFooterContactLink on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/453667 (https://phabricator.wikimedia.org/T202014)
[09:08:26] (CR) Legoktm: [C: 2] python: Install dependencies for `python-ldap` [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/453665 (https://phabricator.wikimedia.org/T202218) (owner: Legoktm)
[09:08:47] (Merged) jenkins-bot: python: Install dependencies for `python-ldap` [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/453665 (https://phabricator.wikimedia.org/T202218) (owner: Legoktm)
[09:41:36] Operations, Wikimedia-Mailing-lists: Wikimedia Community User Group Albania mailing list request - https://phabricator.wikimedia.org/T201670 (Aklapper) >>! In T201670#4512285, @Dzahn wrote: > how are they subscribing? I can't really confirm that behaviour: Side effect of https://gerrit.wikimedia.org/r/...
[09:52:39] PROBLEM - Check systemd state on ms-be2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:02:26] Operations, Traffic, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (Addshore)
[11:11:29] Operations, Wikidata, Wikidata-Campsite, Wikimedia-General-or-Unknown, and 5 others: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520 (Addshore)
[11:22:04] (PS1) Urbanecm: Upload HD logos for lfnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/453692 (https://phabricator.wikimedia.org/T202228)
[11:22:06] (PS1) Urbanecm: Use HD logos for lfnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/453693 (https://phabricator.wikimedia.org/T202228)
[11:23:17] (CR) jerkins-bot: [V: -1] Use HD logos for lfnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/453693 (https://phabricator.wikimedia.org/T202228) (owner: Urbanecm)
[11:29:06] (PS1) Urbanecm: Remove the "autoreview" user group from ru.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/453696 (https://phabricator.wikimedia.org/T202139)
[11:29:47] (CR) jerkins-bot: [V: -1] Remove the "autoreview" user group from ru.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/453696 (https://phabricator.wikimedia.org/T202139) (owner: Urbanecm)
[14:46:48] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[14:47:49] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[15:41:38] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[15:46:59] RECOVERY - Host mr1-eqsin.oob is UP: PING WARNING - Packet loss = 93%, RTA = 280.27 ms
[16:05:59] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[16:11:18] RECOVERY - Host mr1-eqsin.oob is UP: PING WARNING - Packet loss = 58%, RTA = 266.04 ms
[16:57:18] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[16:59:28] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:15:31] Operations, Traffic, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (abian) >>! In T99531#4411395, @abian wrote: > wikiba.se is a bit unstable. Today it has been down for some hours (from ~1:00 UTC to ~5:30...
[17:50:14] (PS1) Andrew Bogott: Openstack: added script to migrate security groups from eqiad to eqiad1 [puppet] - https://gerrit.wikimedia.org/r/453808
[17:51:04] (CR) Andrew Bogott: [C: 2] Openstack: added script to migrate security groups from eqiad to eqiad1 [puppet] - https://gerrit.wikimedia.org/r/453808 (owner: Andrew Bogott)
[21:59:07] Reedy: I think the cherry-pick at https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMessages/+/453877/ went okay.
[21:59:49] I've put that for tomorrow's Morning SWAT, as the EU one is already full