[02:31:15] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198
[02:31:17] (PS1) GeoffreyT2000: Set 'watchcreations' preference to true by default on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/385818 (https://phabricator.wikimedia.org/T178750)
[02:56:34] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198
[03:06:44] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198
[03:26:34] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 731.86 seconds
[03:34:35] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89802.29 seconds
[03:35:14] PROBLEM - puppet last run on mw1326 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:53:25] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect
[04:01:45] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 127.83 seconds
[04:05:16] RECOVERY - puppet last run on mw1326 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:18:04] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198
[04:27:05] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active
[04:32:05] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198
[04:44:15] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198
[04:50:24] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198
[06:27:54] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[06:28:54] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.013 second response time
[08:48:25] there is some planned Telia maintenance but afaics cr1-eqdfw should not be affected, but probably I am wrong :)
[09:06:20] XioNoX: --^ (could be Equinix IPv6 temp issue, not sure :( )
[09:48:57] (said ipv6 since I saw Connection attempt from unconfigured neighbor on show log messages but it is probably not related from what the alarm says..)
[09:56:56] (yes definitely)
[09:57:20] (CR) Zoranzoki21: [C: 1] Set 'watchcreations' preference to true by default on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/385818 (https://phabricator.wikimedia.org/T178750) (owner: GeoffreyT2000)
[10:01:06] in icinga it is a warning for some ASes, I am definitely lost in figuring out what is the problem, but it seems minor :)
[10:06:27] (PS1) Zoranzoki21: Added to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/385829 (https://phabricator.wikimedia.org/T178753)
[10:07:38] (CR) jerkins-bot: [V: -1] Added to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/385829 (https://phabricator.wikimedia.org/T178753) (owner: Zoranzoki21)
[10:07:50] (PS2) Zoranzoki21: Added to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/385829 (https://phabricator.wikimedia.org/T178753)
[11:30:56] Operations, CirrusSearch, Discovery, MediaWiki-JobQueue, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3701806 (Ladsgroup) I think one of the reasons contributing to the problem is the same problem we had with {T171027}, we stopped emitting injectRCRec...
[14:13:18] (PS4) BBlack: eqsin revdns: strawman subnet plan [dns] - https://gerrit.wikimedia.org/r/385402 (https://phabricator.wikimedia.org/T156256)
[14:14:41] (PS5) BBlack: eqsin revdns: strawman subnet plan [dns] - https://gerrit.wikimedia.org/r/385402 (https://phabricator.wikimedia.org/T156256)
[15:45:45] (PS1) Andrew Bogott: wmcs shinken: try to quiet down the puppet failure alerts. [puppet] - https://gerrit.wikimedia.org/r/385842
[15:47:43] (PS2) Andrew Bogott: wmcs shinken: try to quiet down the puppet failure alerts. [puppet] - https://gerrit.wikimedia.org/r/385842
[15:51:33] (CR) Andrew Bogott: [C: 2] wmcs shinken: try to quiet down the puppet failure alerts. [puppet] - https://gerrit.wikimedia.org/r/385842 (owner: Andrew Bogott)
[16:13:54] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.113 second response time
[16:22:49] (Draft2) Jayprakash12345: Add $wgNamespaceRobotPolicies Config for hewikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/385843
[16:24:25] (PS3) Jayprakash12345: Add $wgNamespaceRobotPolicies Config for hewikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/385843 (https://phabricator.wikimedia.org/T178775)
[16:41:04] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.364 second response time
[18:44:57] Operations, Parsoid, Traffic, VisualEditor, HTTPS: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3702293 (PlanetKrypton) This appears to be the response / request and it's accompanying error https://wiki.dronelaws.io/api.php?action=visualedito...
[20:30:08] (PS1) Andrew Bogott: Revert "wmcs shinken: try to quiet down the puppet failure alerts." [puppet] - https://gerrit.wikimedia.org/r/385879
[20:31:18] (CR) Andrew Bogott: [C: 2] Revert "wmcs shinken: try to quiet down the puppet failure alerts." [puppet] - https://gerrit.wikimedia.org/r/385879 (owner: Andrew Bogott)
[21:00:24] PROBLEM - HHVM rendering on mw2121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:01:15] RECOVERY - HHVM rendering on mw2121 is OK: HTTP OK: HTTP/1.1 200 OK - 75445 bytes in 0.305 second response time
[21:26:24] PROBLEM - HHVM rendering on mw2124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:27:14] RECOVERY - HHVM rendering on mw2124 is OK: HTTP OK: HTTP/1.1 200 OK - 75445 bytes in 0.477 second response time
[23:37:25] PROBLEM - Nginx local proxy to apache on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:38:24] RECOVERY - Nginx local proxy to apache on mw2130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.201 second response time