[03:08:22] RECOVERY - exim queue on mx2001 is OK: OK: Less than 1000 mails in exim queue.
[03:28:43] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 877.56 seconds
[03:33:12] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:35:33] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-ISP.mmdb.gz]
[03:57:53] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 264.37 seconds
[04:01:02] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[04:03:42] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:32:35] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:54:12] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 34 probes of 316 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[06:58:04] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:13] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 316 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:04:13] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:08:42] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:14:53] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:27:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:56:41] Operations, OTRS: Upgrade to OTRS version 5.0.30 - https://phabricator.wikimedia.org/T205540 (Framawiki) Thank you for updating so quickly !
[09:55:14] Operations: Add which ldap groups can login on netbox login form - https://phabricator.wikimedia.org/T203840 (Framawiki) Thanks all !
[10:04:03] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:10:33] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:52:52] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[10:56:12] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[11:39:10] (PS2) ArielGlenn: make path to MWScript.php configurable for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/461650 (https://phabricator.wikimedia.org/T204962)
[11:42:01] (CR) ArielGlenn: [C: +2] make path to MWScript.php configurable for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/461650 (https://phabricator.wikimedia.org/T204962) (owner: ArielGlenn)
[12:31:10] (PS2) ArielGlenn: make location of MWScript.php configurable for xml/sql dumps [dumps] - https://gerrit.wikimedia.org/r/461651 (https://phabricator.wikimedia.org/T204962)
[12:38:47] (CR) ArielGlenn: [C: +2] make location of MWScript.php configurable for xml/sql dumps [dumps] - https://gerrit.wikimedia.org/r/461651 (https://phabricator.wikimedia.org/T204962) (owner: ArielGlenn)
[12:38:56] (PS3) ArielGlenn: make location of MWScript.php configurable for xml/sql dumps [dumps] - https://gerrit.wikimedia.org/r/461651 (https://phabricator.wikimedia.org/T204962)
[12:41:50] !log ariel@deploy1001 Started deploy [dumps/dumps@26aaee6]: make location of MWScript.php configurable
[12:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:54] !log ariel@deploy1001 Finished deploy [dumps/dumps@26aaee6]: make location of MWScript.php configurable (duration: 00m 03s)
[12:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:15] (PS2) ArielGlenn: make 'misc cron dumps' use a configured path to MWScript.php [puppet] - https://gerrit.wikimedia.org/r/461667 (https://phabricator.wikimedia.org/T204962)
[13:20:38] (CR) jerkins-bot: [V: -1] make 'misc cron dumps' use a configured path to MWScript.php [puppet] - https://gerrit.wikimedia.org/r/461667 (https://phabricator.wikimedia.org/T204962) (owner: ArielGlenn)
[13:22:35] (PS3) ArielGlenn: make 'misc cron dumps' use a configured path to MWScript.php [puppet] - https://gerrit.wikimedia.org/r/461667 (https://phabricator.wikimedia.org/T204962)
[14:04:32] (PS2) GTirloni: shinken - Tweak Puppet thresholds [puppet] - https://gerrit.wikimedia.org/r/463581 (https://phabricator.wikimedia.org/T161898)
[14:22:10] (CR) ArielGlenn: [C: +2] make 'misc cron dumps' use a configured path to MWScript.php [puppet] - https://gerrit.wikimedia.org/r/461667 (https://phabricator.wikimedia.org/T204962) (owner: ArielGlenn)
[14:31:13] (CR) Andrew Bogott: [C: +1] shinken - Tweak Puppet thresholds [puppet] - https://gerrit.wikimedia.org/r/463581 (https://phabricator.wikimedia.org/T161898) (owner: GTirloni)
[14:31:45] Operations, Wikimedia-Mailing-lists, Patch-For-Review: Mailman issues a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T195750 (valerio.bozzolan) Hello, same error here. Tried to subscribe myself in https://lists.wikimedia.org/mailman/listinfo/mediawiki-l just now.
[15:02:22] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[15:03:32] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[15:30:13] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[15:56:12] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[16:09:19] Operations, Mail, Toolforge, Patch-For-Review, Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812 (valhallasw) To add some context to the current situation -- most of the email sent to security@tools.wmflabs.org is: - from openb...
[17:37:05] Warning Alert for device cr4-ulsfo.wikimedia.org - Inbound interface errors
[17:48:06] Device cr4-ulsfo.wikimedia.org recovered from Inbound interface errors
[18:10:05] Operations, Wikimedia-Mailing-lists, Patch-For-Review: Mailman issues a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T195750 (Aklapper) Does the problem still happen after waiting more than 30 minutes?
[21:04:13] PROBLEM - Restbase root url on restbase2003 is CRITICAL: HTTP CRITICAL - No data received from host
[21:05:22] RECOVERY - Restbase root url on restbase2003 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.127 second response time
[21:08:29] Operations, Scap, Datacenter-Switchover-2018, Patch-For-Review, and 2 others: Scap is checking canary servers in dormant instead of active-dc - https://phabricator.wikimedia.org/T204907 (hashar)
[21:40:29] (PS3) GTirloni: shinken - Tweak Puppet thresholds [puppet] - https://gerrit.wikimedia.org/r/463581 (https://phabricator.wikimedia.org/T161898)
[21:53:22] (CR) GTirloni: [C: +2] shinken - Tweak Puppet thresholds [puppet] - https://gerrit.wikimedia.org/r/463581 (https://phabricator.wikimedia.org/T161898) (owner: GTirloni)
[22:50:43] PROBLEM - Filesystem available is greater than filesystem size on ms-be2041 is CRITICAL: cluster=swift device=/dev/sdi1 fstype=xfs instance=ms-be2041:9100 job=node mountpoint=/srv/swift-storage/sdi1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops
[22:56:22] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[22:58:33] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen