[00:11:14] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1732320 (Tgr) Yes, a couple hours ago. We should write to mediawiki-announce, wait a week or so as a courtesy, and then dro...
[02:30:24] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 08m 09s)
[02:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:35:14] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-17 02:35:14+00:00
[02:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:58:10] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 08m 14s)
[02:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:59:18] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:02:59] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-17 03:02:58+00:00
[03:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:26:37] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:31:58] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 314
[03:33:39] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Seconds_Behind_Master: 52
[04:24:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[04:27:38] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 1 below the confidence bounds
[05:06:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[05:54:12] (PS1) Legoktm: zuul: Add zuul-test-repo helper script [puppet] - https://gerrit.wikimedia.org/r/247031
[06:11:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 9 below the confidence bounds
[06:14:48] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: puppet fail
[06:27:04] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Oct 17 06:27:04 UTC 2015 (duration 27m 3s)
[06:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:30:07] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: puppet fail
[06:30:08] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:18] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: puppet fail
[06:30:37] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:38] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:57] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:18] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:28] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:37] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:38] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:39] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:48] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:58] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:57] PROBLEM - puppet last run on pybal-test2003 is CRITICAL: CRITICAL: puppet fail
[06:41:57] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:55:58] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:56:28] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:56:29] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:56:37] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:38] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:56:58] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:57:19] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:28] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:47] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:48] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:57:57] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:08] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:58:19] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:38] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:58:48] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[07:04:57] RECOVERY - puppet last run on pybal-test2003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[07:37:58] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[08:06:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[08:09:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds
[08:38:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds
[08:40:57] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:43:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[08:47:18] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[08:48:59] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[09:07:47] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[09:34:18] PROBLEM - HTTP on krypton is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:35:17] PROBLEM - grafana.wikimedia.org on krypton is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:35:18] PROBLEM - grafana-admin.wikimedia.org on krypton is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:48:59] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[09:59:55] (CR) Hashar: "I have a similar shortcut on my laptop, it is definitely useful." [puppet] - https://gerrit.wikimedia.org/r/247031 (owner: Legoktm)
[10:31:59] operations, Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1732681 (Nemo_bis) >>! In T115416#1731857, @faidon wrote: >>>! In T115416#1724286, @Nemo_bis wrote: >> Well, gerrit and Phabricator emails are certainly very bad. Multiple bu...
[10:37:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds
[10:38:18] PROBLEM - NTP on krypton is CRITICAL: NTP CRITICAL: No response from NTP server
[11:32:08] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[11:33:48] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[11:38:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[11:42:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 6 below the confidence bounds
[11:55:28] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[12:10:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
[12:12:28] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: puppet fail
[12:27:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 6 below the confidence bounds
[12:39:38] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:51:18] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[13:06:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[13:14:58] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[13:32:18] PROBLEM - YARN NodeManager Node-State on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:39:08] RECOVERY - YARN NodeManager Node-State on analytics1034 is OK: OK: YARN NodeManager analytics1034.eqiad.wmnet:8041 Node-State: RUNNING
[14:05:55] !log reboot krypton, unable to ssh and no console (VM) iowait through the roof
[14:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:07:48] RECOVERY - grafana.wikimedia.org on krypton is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.012 second response time
[14:07:48] RECOVERY - grafana-admin.wikimedia.org on krypton is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 534 bytes in 0.011 second response time
[14:08:18] RECOVERY - HTTP on krypton is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 504 bytes in 0.004 second response time
[14:26:47] RECOVERY - NTP on krypton is OK: NTP OK: Offset -0.003682971001 secs
[15:25:53] operations, Wikimedia-General-or-Unknown, Patch-For-Review: Can't see any page, special:RandomPage gives database error - https://phabricator.wikimedia.org/T115505#1732814 (Aklapper) Note: https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_%28technical%29&oldid=686184426#Blanked_articles...
[16:48:48] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:50:27] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING
[18:12:18] operations, Traffic, Wikimedia-General-or-Unknown, HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1732880 (yuvipanda) Also where should we do the redirect? I guess any requests to w.wiki/(.*) should redirect to meta.wikimedia.org/w/index.php?ti...
[18:13:29] operations, Traffic, Wikimedia-General-or-Unknown, HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1732881 (yuvipanda) Err, by redirect I mean 'route to' from varnish, not an actual redirect.
[18:30:31] (CR) Hashar: [C: 1] zuul: Add zuul-test-repo helper script [puppet] - https://gerrit.wikimedia.org/r/247031 (owner: Legoktm)
[18:31:29] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[18:31:58] ori: ^ this has been happening at least 2-3 days every day and you did get asked to be poked for it :P
[18:34:57] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[18:37:32] I think I filed a task about it YuviPanda
[18:42:27] PROBLEM - WDQS HTTP on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 393 bytes in 0.008 second response time
[18:42:39] PROBLEM - WDQS SPARQL on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 393 bytes in 0.001 second response time
[18:43:42] oops sorry didn't disable notifications, this is expected maintenance ^
[19:14:17] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection timed out
[19:25:58] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 0.005 second response time on port 9042
[19:31:19] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection timed out
[19:33:39] PROBLEM - puppet last run on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:33:39] PROBLEM - RAID on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:07] PROBLEM - salt-minion processes on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:07] PROBLEM - Disk space on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:08] PROBLEM - Disk space on Hadoop worker on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:18] PROBLEM - SSH on analytics1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:34:28] PROBLEM - DPKG on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:28] PROBLEM - configured eth on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:38] PROBLEM - YARN NodeManager Node-State on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:39] PROBLEM - Check size of conntrack table on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:48] PROBLEM - Hadoop DataNode on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:48] PROBLEM - dhclient process on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:35:08] PROBLEM - Hadoop NodeManager on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:46:33] <_joe_> uhm, can someone else take a look? I'm basically sleeping
[19:47:58] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 0.006 second response time on port 9042
[19:58:58] I'll take a look
[20:02:39] !log powercycle analytics1034, no console no ssh
[20:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:03:17] PROBLEM - NTP on analytics1034 is CRITICAL: NTP CRITICAL: No response from NTP server
[20:05:27] RECOVERY - Hadoop NodeManager on analytics1034 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[20:05:48] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 35 minutes ago with 0 failures
[20:05:48] RECOVERY - RAID on analytics1034 is OK: OK: optimal, 13 logical, 14 physical
[20:06:08] RECOVERY - Disk space on analytics1034 is OK: DISK OK
[20:06:08] RECOVERY - salt-minion processes on analytics1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:06:08] RECOVERY - Disk space on Hadoop worker on analytics1034 is OK: DISK OK
[20:06:28] RECOVERY - SSH on analytics1034 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[20:06:37] RECOVERY - DPKG on analytics1034 is OK: All packages OK
[20:06:37] RECOVERY - configured eth on analytics1034 is OK: OK - interfaces up
[20:06:38] RECOVERY - YARN NodeManager Node-State on analytics1034 is OK: OK: YARN NodeManager analytics1034.eqiad.wmnet:8041 Node-State: RUNNING
[20:06:48] RECOVERY - Check size of conntrack table on analytics1034 is OK: OK: nf_conntrack is 0 % full
[20:06:57] RECOVERY - Hadoop DataNode on analytics1034 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[20:06:58] RECOVERY - dhclient process on analytics1034 is OK: PROCS OK: 0 processes with command name dhclient
[20:09:58] RECOVERY - NTP on analytics1034 is OK: NTP OK: Offset 0.00170981884 secs
[20:10:49] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail
[20:23:08] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[20:26:38] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[20:36:18] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[21:02:37] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection timed out
[21:05:49] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 0.003 second response time on port 9042
[21:10:57] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection timed out
[21:14:18] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 0.997 second response time on port 9042
[21:28:02] (PS1) Yuvipanda: k8s: Turn off verbose logging for kube-proxy [puppet] - https://gerrit.wikimedia.org/r/247057
[21:29:08] (PS2) Yuvipanda: k8s: Turn off verbose logging for kube-proxy [puppet] - https://gerrit.wikimedia.org/r/247057
[21:29:18] (CR) Yuvipanda: [C: 2 V: 2] k8s: Turn off verbose logging for kube-proxy [puppet] - https://gerrit.wikimedia.org/r/247057 (owner: Yuvipanda)
[21:39:39] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:41:17] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING
[21:43:42] (CR) JanZerebecki: [C: 1] "I use this all the time, from your home dir :) , this is a better place for it." [puppet] - https://gerrit.wikimedia.org/r/247031 (owner: Legoktm)
[23:02:48] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: puppet fail
[23:17:38] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[23:19:18] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[23:31:09] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[23:31:27] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection timed out
[23:34:39] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 0.001 second response time on port 9042
[23:39:47] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection refused
[23:39:50] PROBLEM - Analytics Cassandra database on aqs1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[23:42:58] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[23:44:37] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[23:49:28] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[23:56:27] RECOVERY - Analytics Cassandra database on aqs1002 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon