[00:01:43] Of course, having them run automatically from terbium would require https://phabricator.wikimedia.org/T98682 being fixed [00:02:08] PROBLEM - puppet last run on mw2029 is CRITICAL puppet fail [00:06:10] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Database: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1426876 (10Krenair) This also prevents us from being able to add silver to the wikis which get QueryPages like Special:Wantedpages automatically updated [00:19:10] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [00:20:29] RECOVERY - puppet last run on mw2029 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [00:22:19] PROBLEM - Apache HTTP on mw1108 is CRITICAL - Socket timeout after 10 seconds [00:22:29] PROBLEM - HHVM rendering on mw1108 is CRITICAL - Socket timeout after 10 seconds [00:22:49] PROBLEM - Hadoop NodeManager on analytics1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [00:23:09] PROBLEM - nutcracker port on mw1108 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:23:28] PROBLEM - configured eth on mw1108 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:23:28] PROBLEM - puppet last run on mw1108 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:23:39] PROBLEM - SSH on mw1108 is CRITICAL: Server answer [00:23:49] PROBLEM - dhclient process on mw1108 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:23:59] PROBLEM - salt-minion processes on mw1108 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:24:09] PROBLEM - HHVM processes on mw1108 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:24:09] PROBLEM - nutcracker process on mw1108 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:24:18] PROBLEM - DPKG on mw1108 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:24:18] PROBLEM - Disk space on mw1108 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:24:19] PROBLEM - RAID on mw1108 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
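(Context for the T98682 comment above: QueryPage special pages such as Special:WantedPages are refreshed by a periodic per-wiki maintenance run; the task blocks adding labswiki/silver to that run because the maintenance host cannot reach the labswiki database. A minimal sketch in Python of what such a per-wiki refresh loop might look like, assuming WMF's mwscript wrapper and a plain-text dblist; the real job is a cron defined in puppet and its exact invocation is not shown in this log.)

#!/usr/bin/env python3
# Illustrative only: refresh QueryPage caches (Special:WantedPages and
# friends) on every wiki listed in a dblist file. The 'mwscript' wrapper
# invocation and the dblist path are assumptions about the maintenance host.
import subprocess

DBLIST = "/srv/mediawiki/dblists/all.dblist"  # assumed path

def refresh_special_pages(dblist=DBLIST):
    with open(dblist) as f:
        wikis = [line.strip() for line in f if line.strip()]
    for wiki in wikis:
        # updateSpecialPages.php is the stock MediaWiki maintenance script
        # behind these periodic refreshes.
        subprocess.run(
            ["mwscript", "updateSpecialPages.php", "--wiki", wiki],
            check=False,  # keep going even if one wiki fails
        )

if __name__ == "__main__":
    refresh_special_pages()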
[00:35:38] RECOVERY - Hadoop NodeManager on analytics1028 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [00:36:59] RECOVERY - Disk space on mw1108 is OK: DISK OK [00:36:59] RECOVERY - nutcracker process on mw1108 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [00:36:59] RECOVERY - DPKG on mw1108 is OK: All packages OK [00:37:08] RECOVERY - RAID on mw1108 is OK no RAID installed [00:37:48] RECOVERY - nutcracker port on mw1108 is OK: TCP OK - 0.000 second response time on port 11212 [00:37:59] RECOVERY - puppet last run on mw1108 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [00:38:08] RECOVERY - configured eth on mw1108 is OK - interfaces up [00:38:19] RECOVERY - SSH on mw1108 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [00:38:20] RECOVERY - dhclient process on mw1108 is OK: PROCS OK: 0 processes with command name dhclient [00:38:30] RECOVERY - salt-minion processes on mw1108 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:38:40] RECOVERY - Apache HTTP on mw1108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [00:38:48] RECOVERY - HHVM processes on mw1108 is OK: PROCS OK: 6 processes with command name hhvm [00:38:50] RECOVERY - HHVM rendering on mw1108 is OK: HTTP OK: HTTP/1.1 200 OK - 64444 bytes in 0.124 second response time [01:23:30] (03CR) 10Krinkle: [C: 04-1] "If we remove 'contentadmin' from the labswiki entry in InitialiseSettings, please leave a comment in its place that we must never add a 's" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222776 (owner: 10Alex Monk) [01:43:08] (03PS1) 10Alex Monk: Remove bastion1 and bastion2 from labs bastion hosts list [puppet] - 10https://gerrit.wikimedia.org/r/222871 [01:46:23] (03PS2) 10Alex Monk: wikitech: Clean up contentadmin rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222776 [01:48:12] (03CR) 10Alex Monk: "Those IPs point to these instances now:" [puppet] - 10https://gerrit.wikimedia.org/r/222871 (owner: 10Alex Monk) [02:17:18] PROBLEM - puppet last run on mw1091 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:17:39] PROBLEM - RAID on mw1091 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:18:09] PROBLEM - HHVM rendering on mw1091 is CRITICAL - Socket timeout after 10 seconds [02:18:49] PROBLEM - Apache HTTP on mw1091 is CRITICAL - Socket timeout after 10 seconds [02:20:09] PROBLEM - nutcracker process on mw1091 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:20:28] PROBLEM - HHVM processes on mw1091 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:20:58] PROBLEM - nutcracker port on mw1091 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:23:46] PROBLEM - SSH on mw1091 is CRITICAL - Socket timeout after 10 seconds [02:23:47] PROBLEM - DPKG on mw1091 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:23:47] PROBLEM - configured eth on mw1091 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:23:47] PROBLEM - Disk space on mw1091 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:23:47] PROBLEM - salt-minion processes on mw1091 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:23:47] PROBLEM - dhclient process on mw1091 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:24:28] RECOVERY - nutcracker process on mw1091 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:24:30] RECOVERY - HHVM processes on mw1091 is OK: PROCS OK: 6 processes with command name hhvm [02:24:58] RECOVERY - nutcracker port on mw1091 is OK: TCP OK - 0.000 second response time on port 11212 [02:25:08] RECOVERY - DPKG on mw1091 is OK: All packages OK [02:25:08] RECOVERY - SSH on mw1091 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:25:08] RECOVERY - configured eth on mw1091 is OK - interfaces up [02:25:18] RECOVERY - Disk space on mw1091 is OK: DISK OK [02:25:29] RECOVERY - salt-minion processes on mw1091 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:25:38] RECOVERY - puppet last run on mw1091 is OK Puppet is currently enabled, last run 29 minutes ago with 0 failures [02:25:39] RECOVERY - dhclient process on mw1091 is OK: PROCS OK: 0 processes with command name dhclient [02:25:50] RECOVERY - RAID on mw1091 is OK no RAID installed [02:26:44] !log l10nupdate Synchronized php-1.26wmf12/cache/l10n: (no message) (duration: 13m 05s) [02:29:00] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.047 second response time [02:29:08] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.929 second response time [02:29:09] RECOVERY - HHVM rendering on mw1058 is OK: HTTP OK: HTTP/1.1 200 OK - 64457 bytes in 0.169 second response time [02:29:09] RECOVERY - HHVM rendering on mw1075 is OK: HTTP OK: HTTP/1.1 200 OK - 64457 bytes in 0.233 second response time [02:29:18] RECOVERY - Apache HTTP on mw1113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.287 second response time [02:29:30] RECOVERY - HHVM rendering on mw1113 is OK: HTTP OK: HTTP/1.1 200 OK - 64457 bytes in 0.298 second response time [02:29:50] PROBLEM - HHVM rendering on mw1061 is CRITICAL - Socket timeout after 10 seconds [02:32:51] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.061 second response time [02:32:51] RECOVERY - HHVM rendering on mw1091 is OK: HTTP OK: HTTP/1.1 200 OK - 64457 bytes in 0.182 second response time [02:32:52] PROBLEM - Apache HTTP on mw1065 is CRITICAL - Socket timeout after 10 seconds [02:32:52] PROBLEM - Apache HTTP on mw1102 is CRITICAL - Socket timeout after 10 seconds [02:32:52] PROBLEM - Apache HTTP on mw1061 is CRITICAL - Socket timeout after 10 seconds [02:32:52] RECOVERY - HHVM rendering on mw1110 is OK: HTTP OK: HTTP/1.1 200 OK - 64457 bytes in 0.134 second response time [02:32:52] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [02:32:52] PROBLEM - SSH on mw1047 is CRITICAL - Socket timeout after 10 seconds [02:32:52] PROBLEM - Apache HTTP on mw1047 is CRITICAL: Connection timed out [02:32:52] PROBLEM - Apache HTTP on mw1089 is CRITICAL: Connection timed out [02:32:52] PROBLEM - puppet last run on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:32:52] PROBLEM - HHVM rendering on mw1109 is CRITICAL: Connection timed out [02:32:52] PROBLEM - HHVM rendering on mw1149 is CRITICAL: Connection timed out [02:32:52] PROBLEM - HHVM rendering on mw1047 is CRITICAL: Connection timed out [02:32:52] PROBLEM - Apache HTTP on mw1088 is CRITICAL - Socket timeout after 10 seconds [02:32:52] PROBLEM - HHVM rendering on mw1065 is CRITICAL - Socket timeout after 10 seconds [02:32:52] PROBLEM - DPKG on mw1061 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - puppet last run on mw1061 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - Apache HTTP on mw1084 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.019 second response time [02:32:52] PROBLEM - HHVM rendering on mw1084 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.016 second response time [02:32:52] PROBLEM - puppet last run on mw1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - nutcracker port on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - puppet last run on mw1072 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - nutcracker port on mw1061 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - HHVM rendering on mw1072 is CRITICAL: Connection timed out [02:32:52] PROBLEM - Apache HTTP on mw1072 is CRITICAL: Connection timed out [02:32:52] PROBLEM - DPKG on mw1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - DPKG on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - HHVM rendering on mw1089 is CRITICAL - Socket timeout after 10 seconds [02:32:52] PROBLEM - DPKG on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - SSH on mw1072 is CRITICAL - Socket timeout after 10 seconds [02:32:52] PROBLEM - salt-minion processes on mw1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - nutcracker process on mw1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - puppet last run on mw1088 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - salt-minion processes on mw1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - RAID on mw1072 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - HHVM rendering on mw1102 is CRITICAL: Connection timed out [02:32:52] PROBLEM - HHVM rendering on mw1088 is CRITICAL: Connection timed out [02:32:52] PROBLEM - SSH on mw1065 is CRITICAL: Server answer [02:32:52] PROBLEM - Apache HTTP on mw1149 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 7.579 second response time [02:32:52] PROBLEM - HHVM processes on mw1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - nutcracker process on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - Disk space on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - nutcracker port on mw1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - salt-minion processes on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - RAID on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:32:52] PROBLEM - nutcracker port on mw1072 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - salt-minion processes on mw1061 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - HHVM processes on mw1061 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - RAID on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - RAID on mw1061 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - configured eth on mw1061 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - nutcracker port on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - RAID on mw1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - configured eth on mw1088 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - DPKG on mw1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - dhclient process on mw1061 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - configured eth on mw1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:52] PROBLEM - Disk space on mw1061 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:58] PROBLEM - SSH on mw1061 is CRITICAL - Socket timeout after 10 seconds [02:32:59] PROBLEM - Disk space on mw1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:59] PROBLEM - nutcracker process on mw1072 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:59] PROBLEM - dhclient process on mw1072 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:09] PROBLEM - RAID on mw1109 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:09] PROBLEM - nutcracker process on mw1061 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:09] PROBLEM - RAID on mw1149 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:18] PROBLEM - configured eth on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:18] PROBLEM - dhclient process on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:19] PROBLEM - puppet last run on mw1109 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:19] PROBLEM - SSH on mw1088 is CRITICAL - Socket timeout after 10 seconds [02:33:20] PROBLEM - RAID on mw1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:20] PROBLEM - SSH on mw1109 is CRITICAL - Socket timeout after 10 seconds [02:33:44] PROBLEM - Apache HTTP on mw1109 is CRITICAL: Connection timed out [02:33:44] wow looks like something wrong is going on... also I;m getting a lot of 503s from wikidata suddenly [02:33:45] PROBLEM - dhclient process on mw1088 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:45] PROBLEM - Disk space on mw1088 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:45] PROBLEM - DPKG on mw1109 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:45] PROBLEM - puppet last run on mw1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:45] PROBLEM - SSH on mw1074 is CRITICAL - Socket timeout after 10 seconds [02:33:45] PROBLEM - Disk space on mw1109 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:45] PROBLEM - nutcracker process on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:33:45] PROBLEM - HHVM processes on mw1074 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:45] PROBLEM - SSH on mw1089 is CRITICAL - Socket timeout after 10 seconds [02:33:45] PROBLEM - salt-minion processes on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:45] PROBLEM - nutcracker port on mw1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:45] PROBLEM - configured eth on mw1109 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:45] PROBLEM - DPKG on mw1088 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:45] PROBLEM - RAID on mw1088 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:45] PROBLEM - HHVM processes on mw1109 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:48] PROBLEM - dhclient process on mw1109 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:49] PROBLEM - configured eth on mw1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:49] PROBLEM - nutcracker process on mw1088 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:58] PROBLEM - Disk space on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:58] PROBLEM - puppet last run on mw1102 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:58] PROBLEM - nutcracker process on mw1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:33:58] RECOVERY - HHVM rendering on mw1072 is OK: HTTP OK: HTTP/1.1 200 OK - 64457 bytes in 0.136 second response time [02:33:59] RECOVERY - puppet last run on mw1072 is OK Puppet is currently enabled, last run 11 minutes ago with 0 failures [02:33:59] RECOVERY - Apache HTTP on mw1072 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [02:33:59] RECOVERY - SSH on mw1072 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:34:00] PROBLEM - dhclient process on mw1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:08] PROBLEM - salt-minion processes on mw1088 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:09] PROBLEM - nutcracker port on mw1088 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:18] RECOVERY - RAID on mw1072 is OK no RAID installed [02:34:19] PROBLEM - HHVM processes on mw1088 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:19] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.058 second response time [02:34:19] PROBLEM - dhclient process on mw1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:19] PROBLEM - dhclient process on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:29] PROBLEM - nutcracker process on mw1109 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:29] RECOVERY - nutcracker port on mw1072 is OK: TCP OK - 0.000 second response time on port 11212 [02:34:55] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.107 second response time [02:34:55] PROBLEM - Disk space on mw1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:55] PROBLEM - HHVM processes on mw1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:34:55] RECOVERY - nutcracker process on mw1072 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:34:55] RECOVERY - dhclient process on mw1072 is OK: PROCS OK: 0 processes with command name dhclient [02:34:58] PROBLEM - salt-minion processes on mw1109 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:59] RECOVERY - RAID on mw1149 is OK no RAID installed [02:35:00] PROBLEM - nutcracker port on mw1109 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:08] PROBLEM - HHVM processes on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:18] PROBLEM - configured eth on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:39] RECOVERY - HHVM rendering on mw1149 is OK: HTTP OK: HTTP/1.1 200 OK - 64457 bytes in 0.142 second response time [02:35:49] RECOVERY - puppet last run on mw1102 is OK Puppet is currently enabled, last run 14 minutes ago with 0 failures [02:35:49] RECOVERY - configured eth on mw1089 is OK - interfaces up [02:35:59] RECOVERY - DPKG on mw1089 is OK: All packages OK [02:35:59] RECOVERY - nutcracker port on mw1047 is OK: TCP OK - 0.000 second response time on port 11212 [02:35:59] RECOVERY - nutcracker process on mw1089 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:36:00] RECOVERY - salt-minion processes on mw1089 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:36:10] RECOVERY - salt-minion processes on mw1065 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:36:11] RECOVERY - dhclient process on mw1089 is OK: PROCS OK: 0 processes with command name dhclient [02:36:19] RECOVERY - HHVM rendering on mw1102 is OK: HTTP OK: HTTP/1.1 200 OK - 64457 bytes in 0.126 second response time [02:36:19] RECOVERY - HHVM processes on mw1065 is OK: PROCS OK: 6 processes with command name hhvm [02:36:29] RECOVERY - Disk space on mw1089 is OK: DISK OK [02:36:29] RECOVERY - SSH on mw1065 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:36:29] RECOVERY - HHVM processes on mw1089 is OK: PROCS OK: 6 processes with command name hhvm [02:36:29] RECOVERY - nutcracker port on mw1065 is OK: TCP OK - 0.000 second response time on port 11212 [02:36:30] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.978 second response time [02:36:39] RECOVERY - RAID on mw1065 is OK no RAID installed [02:36:48] RECOVERY - configured eth on mw1065 is OK - interfaces up [02:36:49] RECOVERY - DPKG on mw1065 is OK: All packages OK [02:36:49] RECOVERY - salt-minion processes on mw1109 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:36:49] RECOVERY - Disk space on mw1065 is OK: DISK OK [02:36:58] RECOVERY - nutcracker port on mw1109 is OK: TCP OK - 0.000 second response time on port 11212 [02:36:59] RECOVERY - RAID on mw1109 is OK no RAID installed [02:37:00] RECOVERY - HHVM processes on mw1047 is OK: PROCS OK: 6 processes with command name hhvm [02:37:09] RECOVERY - puppet last run on mw1109 is OK Puppet is currently enabled, last run 13 minutes ago with 0 failures [02:37:18] RECOVERY - SSH on mw1109 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:37:18] RECOVERY - RAID on mw1089 is OK no RAID installed [02:37:19] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.233 second response time [02:37:19] RECOVERY - puppet last run on mw1089 is OK Puppet is 
currently enabled, last run 13 minutes ago with 0 failures [02:37:29] RECOVERY - DPKG on mw1109 is OK: All packages OK [02:37:30] RECOVERY - Disk space on mw1109 is OK: DISK OK [02:37:30] RECOVERY - SSH on mw1089 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:37:30] RECOVERY - nutcracker port on mw1089 is OK: TCP OK - 0.000 second response time on port 11212 [02:37:31] RECOVERY - configured eth on mw1109 is OK - interfaces up [02:37:31] RECOVERY - Apache HTTP on mw1089 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [02:37:31] RECOVERY - HHVM processes on mw1109 is OK: PROCS OK: 6 processes with command name hhvm [02:37:38] RECOVERY - dhclient process on mw1109 is OK: PROCS OK: 0 processes with command name dhclient [02:37:40] RECOVERY - HHVM rendering on mw1109 is OK: HTTP OK: HTTP/1.1 200 OK - 64457 bytes in 0.286 second response time [02:37:40] RECOVERY - HHVM rendering on mw1065 is OK: HTTP OK: HTTP/1.1 200 OK - 64457 bytes in 0.131 second response time [02:37:49] RECOVERY - nutcracker process on mw1065 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:37:49] RECOVERY - puppet last run on mw1065 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:37:50] RECOVERY - Disk space on mw1047 is OK: DISK OK [02:37:59] RECOVERY - dhclient process on mw1065 is OK: PROCS OK: 0 processes with command name dhclient [02:38:00] RECOVERY - HHVM rendering on mw1089 is OK: HTTP OK: HTTP/1.1 200 OK - 64457 bytes in 0.140 second response time [02:38:09] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:38:19] RECOVERY - nutcracker process on mw1109 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:38:29] RECOVERY - puppet last run on mw1113 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:38:30] RECOVERY - nutcracker process on mw1047 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:38:49] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 50.00% of data above the critical threshold [500.0] [02:38:58] RECOVERY - Disk space on mw1061 is OK: DISK OK [02:38:58] RECOVERY - dhclient process on mw1061 is OK: PROCS OK: 0 processes with command name dhclient [02:38:59] RECOVERY - puppet last run on mw1075 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [02:39:09] RECOVERY - nutcracker process on mw1061 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:39:09] RECOVERY - configured eth on mw1047 is OK - interfaces up [02:39:49] RECOVERY - DPKG on mw1061 is OK: All packages OK [02:39:50] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 32 minutes ago with 0 failures [02:40:08] RECOVERY - nutcracker port on mw1061 is OK: TCP OK - 0.000 second response time on port 11212 [02:40:39] RECOVERY - HHVM processes on mw1061 is OK: PROCS OK: 6 processes with command name hhvm [02:40:40] RECOVERY - salt-minion processes on mw1061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:40:48] RECOVERY - puppet last run on mw1058 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:40:49] RECOVERY - RAID on mw1061 is OK no RAID installed [02:40:50] RECOVERY - configured eth on mw1061 is OK - interfaces up [02:40:59] RECOVERY - SSH on mw1061 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) 
[02:41:38] RECOVERY - Disk space on mw1088 is OK: DISK OK [02:41:39] RECOVERY - dhclient process on mw1088 is OK: PROCS OK: 0 processes with command name dhclient [02:41:40] RECOVERY - SSH on mw1047 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:41:48] RECOVERY - nutcracker process on mw1074 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:41:48] RECOVERY - HHVM processes on mw1074 is OK: PROCS OK: 1 process with command name hhvm [02:41:48] RECOVERY - salt-minion processes on mw1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:42:10] RECOVERY - DPKG on mw1074 is OK: All packages OK [02:42:29] RECOVERY - dhclient process on mw1047 is OK: PROCS OK: 0 processes with command name dhclient [02:42:40] RECOVERY - Disk space on mw1074 is OK: DISK OK [02:42:40] RECOVERY - salt-minion processes on mw1074 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:42:59] RECOVERY - nutcracker port on mw1074 is OK: TCP OK - 0.000 second response time on port 11212 [02:43:00] RECOVERY - HHVM rendering on mw1074 is OK: HTTP OK: HTTP/1.1 200 OK - 64444 bytes in 0.536 second response time [02:43:01] Sigh. [02:43:06] wikitech session issues again [02:43:19] RECOVERY - configured eth on mw1074 is OK - interfaces up [02:43:28] RECOVERY - dhclient process on mw1074 is OK: PROCS OK: 0 processes with command name dhclient [02:43:38] RECOVERY - SSH on mw1074 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:43:40] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [02:44:49] RECOVERY - RAID on mw1074 is OK no RAID installed [02:46:39] RECOVERY - RAID on mw1047 is OK no RAID installed [02:47:40] PROBLEM - dhclient process on mw1088 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:47:40] PROBLEM - Disk space on mw1088 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:47:41] RECOVERY - puppet last run on mw1047 is OK Puppet is currently enabled, last run 28 minutes ago with 0 failures [02:47:49] PROBLEM - HHVM rendering on mw1052 is CRITICAL - Socket timeout after 10 seconds [02:48:08] PROBLEM - Apache HTTP on mw1052 is CRITICAL - Socket timeout after 10 seconds [02:48:09] RECOVERY - DPKG on mw1047 is OK: All packages OK [02:49:09] RECOVERY - puppet last run on mw1074 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:49:19] PROBLEM - RAID on mw1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:49:29] PROBLEM - DPKG on mw1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:49:29] PROBLEM - configured eth on mw1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:49:29] PROBLEM - puppet last run on mw1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:50:08] PROBLEM - Disk space on mw1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:50:09] PROBLEM - dhclient process on mw1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:50:48] PROBLEM - SSH on mw1052 is CRITICAL - Socket timeout after 10 seconds [02:50:50] PROBLEM - nutcracker process on mw1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:51:18] PROBLEM - salt-minion processes on mw1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:51:28] PROBLEM - HHVM processes on mw1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:51:29] PROBLEM - nutcracker port on mw1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:52:50] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [02:57:28] !log LocalisationUpdate completed (1.26wmf12) at 2015-07-05 02:57:28+00:00 [02:57:29] RECOVERY - Disk space on mw1088 is OK: DISK OK [02:57:29] RECOVERY - dhclient process on mw1088 is OK: PROCS OK: 0 processes with command name dhclient [02:57:35] Logged the message, Master [02:57:38] RECOVERY - DPKG on mw1088 is OK: All packages OK [02:57:38] RECOVERY - RAID on mw1088 is OK no RAID installed [02:57:49] RECOVERY - nutcracker process on mw1088 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:58:00] RECOVERY - nutcracker port on mw1088 is OK: TCP OK - 0.000 second response time on port 11212 [02:58:09] RECOVERY - puppet last run on mw1088 is OK Puppet is currently enabled, last run 31 minutes ago with 0 failures [02:58:19] RECOVERY - HHVM processes on mw1088 is OK: PROCS OK: 6 processes with command name hhvm [02:58:19] RECOVERY - HHVM rendering on mw1088 is OK: HTTP OK: HTTP/1.1 200 OK - 64444 bytes in 0.264 second response time [02:58:38] RECOVERY - SSH on mw1052 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:58:50] RECOVERY - configured eth on mw1088 is OK - interfaces up [02:59:19] RECOVERY - SSH on mw1088 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:59:49] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [02:59:51] SMalyshev: Still? [03:00:19] Katie: last one was 02:58:31 [03:02:59] since then no 503s so far [03:03:00] Please file a task in Phabricator ( https://phabricator.wikimedia.org/ ) if it persists. [03:03:00] ok, will do [03:03:00] You're getting the errors at https://www.wikidata.org ? [03:03:00] yes: org.wikidata.query.rdf.tool.exception.ContainedException: Unexpected status code fetching RDF for https://www.wikidata.org/wiki/Special:EntityData/Q20614033.ttl?nocache=1436065106180&flavor=dump: 503 [03:03:00] Hmmm. 
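(The updater error quoted above is an ordinary HTTP fetch of an entity's Turtle dump that happened to hit the 503 spike under discussion. A minimal sketch of that fetch with a retry on 503, standard library only; the URL shape and flavor=dump parameter come from the logged error, while the retry policy is purely illustrative and not how the WDQS updater is actually configured.)

# Illustrative: fetch the RDF (Turtle) dump for a Wikidata entity and retry
# a few times when the appservers answer 503, as they did in this incident.
import time
import urllib.error
import urllib.request

def fetch_entity_ttl(qid, retries=3, backoff=5.0):
    url = ("https://www.wikidata.org/wiki/Special:EntityData/"
           "{}.ttl?flavor=dump".format(qid))
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read().decode("utf-8")
        except urllib.error.HTTPError as e:
            if e.code == 503 and attempt < retries:
                time.sleep(backoff * (attempt + 1))  # simple linear backoff
                continue
            raise  # other errors, or retries exhausted

# Example: ttl = fetch_entity_ttl("Q20614033")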
[03:03:00] RECOVERY - salt-minion processes on mw1088 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:04:49] PROBLEM - SSH on mw1052 is CRITICAL - Socket timeout after 10 seconds [03:05:30] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [03:07:10] RECOVERY - salt-minion processes on mw1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:07:19] RECOVERY - HHVM processes on mw1052 is OK: PROCS OK: 1 process with command name hhvm [03:07:19] RECOVERY - puppet last run on mw1052 is OK Puppet is currently enabled, last run 38 minutes ago with 0 failures [03:07:20] RECOVERY - configured eth on mw1052 is OK - interfaces up [03:07:20] RECOVERY - DPKG on mw1052 is OK: All packages OK [03:07:20] RECOVERY - nutcracker port on mw1052 is OK: TCP OK - 0.000 second response time on port 11212 [03:07:49] RECOVERY - HHVM rendering on mw1052 is OK: HTTP OK: HTTP/1.1 200 OK - 64445 bytes in 5.158 second response time [03:07:58] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.176 second response time [03:07:59] RECOVERY - Disk space on mw1052 is OK: DISK OK [03:07:59] RECOVERY - dhclient process on mw1052 is OK: PROCS OK: 0 processes with command name dhclient [03:08:38] RECOVERY - SSH on mw1052 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [03:08:49] RECOVERY - nutcracker process on mw1052 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [03:09:08] RECOVERY - RAID on mw1052 is OK no RAID installed [03:09:28] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18407 bytes in 0.045 second response time [03:19:28] PROBLEM - puppet last run on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:23:09] RECOVERY - puppet last run on mw1047 is OK Puppet is currently enabled, last run 1 hour ago with 0 failures [03:35:10] PROBLEM - puppet last run on wtp1013 is CRITICAL Puppet has 1 failures [03:35:10] PROBLEM - puppet last run on analytics1032 is CRITICAL Puppet has 1 failures [03:35:58] PROBLEM - puppet last run on mw1180 is CRITICAL Puppet has 1 failures [03:35:58] PROBLEM - puppet last run on mw1181 is CRITICAL Puppet has 1 failures [03:36:08] PROBLEM - puppet last run on mw1156 is CRITICAL Puppet has 1 failures [03:36:09] PROBLEM - puppet last run on mw2189 is CRITICAL Puppet has 2 failures [03:39:19] PROBLEM - HHVM processes on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:39:28] PROBLEM - configured eth on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:39:40] PROBLEM - salt-minion processes on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:39:48] PROBLEM - SSH on mw1047 is CRITICAL - Socket timeout after 10 seconds [03:39:49] PROBLEM - puppet last run on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:39:59] PROBLEM - Disk space on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:40:09] PROBLEM - nutcracker port on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:40:19] PROBLEM - DPKG on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:40:30] PROBLEM - dhclient process on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:40:39] PROBLEM - nutcracker process on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:40:50] PROBLEM - RAID on mw1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:41:09] PROBLEM - puppet last run on mw1085 is CRITICAL puppet fail [03:45:58] RECOVERY - dhclient process on mw1047 is OK: PROCS OK: 0 processes with command name dhclient [03:45:59] RECOVERY - nutcracker process on mw1047 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [03:46:09] RECOVERY - RAID on mw1047 is OK no RAID installed [03:46:39] RECOVERY - HHVM processes on mw1047 is OK: PROCS OK: 6 processes with command name hhvm [03:46:49] RECOVERY - configured eth on mw1047 is OK - interfaces up [03:46:59] RECOVERY - salt-minion processes on mw1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:47:00] RECOVERY - SSH on mw1047 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [03:47:00] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [03:47:09] RECOVERY - puppet last run on mw1047 is OK Puppet is currently enabled, last run 1 hour ago with 0 failures [03:47:09] RECOVERY - HHVM rendering on mw1047 is OK: HTTP OK: HTTP/1.1 200 OK - 64444 bytes in 0.243 second response time [03:47:19] RECOVERY - Disk space on mw1047 is OK: DISK OK [03:47:28] RECOVERY - nutcracker port on mw1047 is OK: TCP OK - 0.000 second response time on port 11212 [03:47:29] RECOVERY - DPKG on mw1047 is OK: All packages OK [03:50:18] PROBLEM - puppet last run on mw1129 is CRITICAL puppet fail [03:51:59] PROBLEM - puppet last run on db1048 is CRITICAL puppet fail [03:52:18] PROBLEM - puppet last run on stat1003 is CRITICAL puppet fail [03:52:28] PROBLEM - puppet last run on mc2014 is CRITICAL puppet fail [03:52:39] PROBLEM - puppet last run on cp4005 is CRITICAL puppet fail [03:52:50] PROBLEM - puppet last run on mw2062 is CRITICAL puppet fail [03:53:39] RECOVERY - puppet last run on wtp1013 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [03:53:39] RECOVERY - puppet last run on analytics1032 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [03:54:39] PROBLEM - puppet last run on cp2018 is CRITICAL puppet fail [03:54:48] PROBLEM - puppet last run on ms-be2012 is CRITICAL puppet fail [03:58:08] RECOVERY - puppet last run on mw1180 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [03:58:11] RECOVERY - puppet last run on mw1156 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [03:59:50] PROBLEM - puppet last run on mw2136 is CRITICAL Puppet has 1 failures [03:59:59] RECOVERY - puppet last run on mw1181 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:00:19] RECOVERY - puppet last run on mw2189 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:02:29] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 11.11% of data above the critical threshold [100000000.0] [04:03:19] RECOVERY - puppet last run on mw1085 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:08:48] RECOVERY - puppet last run on mw1129 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [04:08:58] RECOVERY - puppet last run on stat1003 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [04:09:09] RECOVERY - puppet last run on mw2136 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [04:09:09] RECOVERY - puppet last run on mc2014 is OK Puppet is 
currently enabled, last run 58 seconds ago with 0 failures [04:09:20] RECOVERY - puppet last run on cp4005 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [04:10:29] RECOVERY - puppet last run on db1048 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:11:19] RECOVERY - puppet last run on cp2018 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [04:11:20] RECOVERY - puppet last run on mw2062 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:11:28] RECOVERY - puppet last run on ms-be2012 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [04:22:49] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0] [04:27:00] PROBLEM - Restbase root url on restbase1005 is CRITICAL - Socket timeout after 10 seconds [04:33:20] PROBLEM - HHVM rendering on mw1044 is CRITICAL - Socket timeout after 10 seconds [04:33:39] PROBLEM - Apache HTTP on mw1044 is CRITICAL - Socket timeout after 10 seconds [04:34:29] PROBLEM - nutcracker port on mw1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:34:29] PROBLEM - puppet last run on mw1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:34:38] PROBLEM - HHVM processes on mw1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:34:38] PROBLEM - salt-minion processes on mw1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:34:39] PROBLEM - RAID on mw1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:34:39] PROBLEM - configured eth on mw1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:35:09] PROBLEM - Disk space on mw1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:35:18] PROBLEM - dhclient process on mw1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:35:29] PROBLEM - DPKG on mw1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:35:49] PROBLEM - nutcracker process on mw1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
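(Much of the noise above comes from the "puppet last run" checks, which report whether the agent is enabled, how long ago it last ran, and how many resources failed. A rough stand-in for that kind of check in Python: read the agent's last_run_summary.yaml and complain when the run is stale or had failures. The file path and YAML keys are the usual puppet-agent defaults but should be treated as assumptions; the production check is an NRPE plugin, not this script.)

# Rough stand-in for a "puppet last run" freshness check (requires PyYAML).
# Path and keys are the usual puppet-agent defaults, assumed here.
import sys
import time
import yaml

SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"
MAX_AGE = 3600  # seconds before the last run counts as stale

def check_puppet(path=SUMMARY, max_age=MAX_AGE):
    with open(path) as f:
        summary = yaml.safe_load(f)
    last_run = summary.get("time", {}).get("last_run", 0)
    failed = summary.get("resources", {}).get("failed", 0)
    age = int(time.time() - last_run)
    if failed:
        return 2, "CRITICAL: Puppet has {} failures".format(failed)
    if age > max_age:
        return 2, "CRITICAL: last run {} seconds ago".format(age)
    return 0, "OK: last run {} seconds ago with 0 failures".format(age)

if __name__ == "__main__":
    code, msg = check_puppet()
    print(msg)
    sys.exit(code)  # Nagios-style exit code: 0 = OK, 2 = CRITICAL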
[04:36:09] PROBLEM - SSH on mw1044 is CRITICAL - Socket timeout after 10 seconds [04:46:58] RECOVERY - nutcracker process on mw1044 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [04:47:08] RECOVERY - SSH on mw1044 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [04:47:20] RECOVERY - salt-minion processes on mw1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:47:20] RECOVERY - HHVM processes on mw1044 is OK: PROCS OK: 6 processes with command name hhvm [04:47:20] RECOVERY - nutcracker port on mw1044 is OK: TCP OK - 0.000 second response time on port 11212 [04:47:29] RECOVERY - configured eth on mw1044 is OK - interfaces up [04:47:29] RECOVERY - RAID on mw1044 is OK no RAID installed [04:48:09] RECOVERY - HHVM rendering on mw1044 is OK: HTTP OK: HTTP/1.1 200 OK - 64444 bytes in 0.200 second response time [04:48:09] RECOVERY - Disk space on mw1044 is OK: DISK OK [04:48:09] RECOVERY - dhclient process on mw1044 is OK: PROCS OK: 0 processes with command name dhclient [04:48:28] RECOVERY - DPKG on mw1044 is OK: All packages OK [04:48:28] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.039 second response time [05:07:58] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [05:09:40] RECOVERY - puppet last run on mw1044 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [05:24:47] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Jul 5 05:24:47 UTC 2015 (duration 24m 46s) [05:26:16] * Katie eyes morebots. [05:30:45] morebots [05:30:45] I am a logbot running on tools-exec-1217. [05:30:46] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [05:30:46] To log a message, type !log . 
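(morebots describes itself above: "!log <message>" lines in this channel are appended to the Server Admin Log on wikitech. A minimal sketch of that append using the standard MediaWiki edit API; login and session handling are omitted, so treat this as an outline rather than the bot's actual code.)

# Sketch of the SAL append a logbot performs for "!log ..." lines.
# A real bot must authenticate first; requests is used for brevity.
import time
import requests

API = "https://wikitech.wikimedia.org/w/api.php"

def log_to_sal(nick, message, session=None):
    s = session or requests.Session()
    token = s.get(API, params={
        "action": "query", "meta": "tokens", "format": "json",
    }).json()["query"]["tokens"]["csrftoken"]
    entry = "* {} {}: {}".format(
        time.strftime("%H:%M", time.gmtime()), nick, message)
    r = s.post(API, data={
        "action": "edit",
        "title": "Server Admin Log",
        "appendtext": "\n" + entry,
        "summary": entry,
        "token": token,
        "format": "json",
    })
    r.raise_for_status()
    return r.json()

# Example: log_to_sal("l10nupdate", "LocalisationUpdate completed (1.26wmf12)")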
[06:04:21] Krenair: mukunda/chad/dan, depending on who's around [06:12:27] PROBLEM - puppet last run on ms-be3001 is CRITICAL puppet fail [06:30:48] PROBLEM - puppet last run on cp2003 is CRITICAL puppet fail [06:32:48] PROBLEM - puppet last run on logstash1006 is CRITICAL Puppet has 1 failures [06:32:59] RECOVERY - puppet last run on ms-be3001 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:34:18] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 0 below the confidence bounds [06:34:18] PROBLEM - puppet last run on analytics1030 is CRITICAL Puppet has 1 failures [06:35:29] PROBLEM - puppet last run on elastic1027 is CRITICAL Puppet has 1 failures [06:35:39] PROBLEM - puppet last run on db1051 is CRITICAL Puppet has 1 failures [06:35:48] PROBLEM - puppet last run on ruthenium is CRITICAL Puppet has 1 failures [06:36:10] PROBLEM - puppet last run on mw1046 is CRITICAL Puppet has 1 failures [06:36:19] PROBLEM - puppet last run on mw2163 is CRITICAL Puppet has 1 failures [06:36:51] PROBLEM - puppet last run on labcontrol2001 is CRITICAL Puppet has 1 failures [06:36:59] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures [06:37:09] PROBLEM - puppet last run on mw1123 is CRITICAL Puppet has 1 failures [06:37:29] PROBLEM - puppet last run on mw1060 is CRITICAL Puppet has 1 failures [06:37:49] PROBLEM - puppet last run on mw1052 is CRITICAL Puppet has 1 failures [06:37:59] PROBLEM - puppet last run on mw1235 is CRITICAL Puppet has 1 failures [06:37:59] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [06:38:10] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures [06:38:19] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures [06:38:28] PROBLEM - puppet last run on mw2033 is CRITICAL Puppet has 1 failures [06:38:50] PROBLEM - puppet last run on mw1228 is CRITICAL Puppet has 1 failures [06:40:19] PROBLEM - puppet last run on mw2093 is CRITICAL Puppet has 2 failures [06:43:50] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [06:44:19] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.627 second response time [06:44:38] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [06:44:49] RECOVERY - HHVM rendering on mw1112 is OK: HTTP OK: HTTP/1.1 200 OK - 64444 bytes in 0.207 second response time [06:45:59] RECOVERY - puppet last run on mw2033 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:17] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.535 second response time [06:48:17] RECOVERY - puppet last run on elastic1027 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:48:17] RECOVERY - puppet last run on db1051 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:48:17] RECOVERY - puppet last run on mw1060 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:48:17] RECOVERY - puppet last run on ruthenium is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:48:17] RECOVERY - puppet last run on mw1235 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:48:17] RECOVERY - puppet last run on analytics1030 is OK Puppet 
is currently enabled, last run 34 seconds ago with 0 failures [06:48:17] RECOVERY - puppet last run on cp2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:17] RECOVERY - puppet last run on logstash1006 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:48:17] RECOVERY - HHVM rendering on mw1028 is OK: HTTP OK: HTTP/1.1 200 OK - 64444 bytes in 0.134 second response time [06:48:29] RECOVERY - puppet last run on mw1228 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:49:29] RECOVERY - puppet last run on mw1052 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:38] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [06:49:39] RECOVERY - puppet last run on mw1046 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:49:48] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:49:49] RECOVERY - puppet last run on mw2163 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:50] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:49:58] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:50:09] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.042 second response time [06:50:19] RECOVERY - HHVM rendering on mw1057 is OK: HTTP OK: HTTP/1.1 200 OK - 64444 bytes in 0.426 second response time [06:50:19] RECOVERY - puppet last run on labcontrol2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:29] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:38] RECOVERY - puppet last run on mw1123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:52:29] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.252 second response time [06:53:18] RECOVERY - HHVM rendering on mw1061 is OK: HTTP OK: HTTP/1.1 200 OK - 64444 bytes in 0.138 second response time [06:55:39] RECOVERY - HHVM rendering on mw1070 is OK: HTTP OK: HTTP/1.1 200 OK - 64444 bytes in 0.146 second response time [06:55:39] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.048 second response time [06:55:48] RECOVERY - puppet last run on mw1057 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:49] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.247 second response time [06:58:08] RECOVERY - HHVM rendering on mw1069 is OK: HTTP OK: HTTP/1.1 200 OK - 64444 bytes in 0.141 second response time [06:58:59] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [06:59:09] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.174 second response time [06:59:10] RECOVERY - HHVM rendering on mw1084 is OK: HTTP OK: HTTP/1.1 200 OK - 64452 bytes in 6.167 second response time [07:00:29] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.533 second response time [07:01:06] !log Restarted HHVM for mw1112,1028,1057,1061,1069,1070,1084,1086 [07:01:27] Logged the message, Master [07:01:32] jynus: any idea what's going 
on? [07:02:09] RECOVERY - HHVM rendering on mw1086 is OK: HTTP OK: HTTP/1.1 200 OK - 64444 bytes in 0.155 second response time [07:02:28] bblack, sorry I just wanted to log after the fact [07:02:30] PROBLEM - puppet last run on mw1105 is CRITICAL Puppet has 12 failures [07:02:43] are you just restarting them because they're dead? [07:02:50] or? [07:03:02] bblack, yes, they were dead [07:03:17] any idea how/why? locked up but running? [07:03:24] I assure every time before restarting [07:04:31] well, given the time, I assume "regular" crashing, and I only assume more frequently than usual due to the extra load [07:04:53] what's the extra load? [07:05:01] is this still S:RI load maybe? [07:05:23] I'm also seeing some restbase issues in icinga, which I think mobrovac sounded like he was expecting yesterday :/ [07:05:41] I saw you speaking, so you probably know more than me about that [07:06:03] I have no idea [07:06:57] then the puppet failures, usually proxy-induced [07:07:26] well yeah the 06:xx puppetfails are just puppet being bad I think [07:07:45] !log restarted cassandra + restbase on restbase1005 [07:07:51] Logged the message, Master [07:08:21] so to be clear: HHVM service was running, but curl did return the error page on all of those [07:08:48] RECOVERY - Restbase root url on restbase1005 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.010 second response time [07:09:13] !log restarted cassandra on restbase1004 [07:09:17] I restarted all that were dead in the last hours [07:09:20] ok [07:09:29] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [07:09:30] Logged the message, Master [07:09:56] I do not think there is nothing wrong short-term [07:09:59] PROBLEM - puppet last run on cp3042 is CRITICAL puppet fail [07:10:35] s/nothing/anything/ [07:11:55] well the restbase thing is still wrong even short-term [07:12:11] my suspicion is it may have been the root cause of the fallout the past several hours [07:12:20] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.001 second response time on port 9042 [07:12:44] yeah, I expressed badly, I mean something that requires our attention that hasn't been announced [07:12:46] rb1004 still hasn't recovered CQL yet, we'll see [07:14:14] but I specifically connected at this time because I imagine there would be less ops eyes [07:14:51] well I just got home a bit ago [07:15:10] all of the past ~5h looks awful on channel alerts [07:16:49] that is why I connected, just a few minutes ago, too :-) [07:16:56] hopefully with two dead restbase cassandras restarted, things will stabilize for a while [07:18:26] I'm really pretty displeased with the whole RB/cassandra debacle that's been going on lately :P [07:25:19] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [07:25:27] donations is up [07:27:47] 6operations, 10Traffic, 10fundraising-tech-ops: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1427057 (10Chmarkine) Once these two domain names, `www.donate.wikimediafoundation.org`, `www.donate.mediawiki.org` are removed, `wikimediafoundation.org`... [07:33:50] I really think the Special:RecordImpression problems from the broken campaign going on are to blame for the appserver deaths now [07:34:09] it correlates well, anyways. restbase being unhealthy I think was unrelated. 
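(jynus's observation above, that the hhvm process was still running while curl got back the error page, is exactly the gap between the "HHVM processes" and "HHVM rendering" checks flapping in this log. A small sketch of that combined probe in Python; the SSH-based process count, test URL and Host header are illustrative assumptions, and the real checks are icinga/NRPE definitions in puppet.)

# Illustrative probe for the "process up but not rendering" state discussed
# above: count hhvm processes and attempt an HTTP render, then classify.
import subprocess
import urllib.error
import urllib.request

def hhvm_process_count(host):
    # The production check runs locally via NRPE; over SSH it might look so.
    out = subprocess.run(["ssh", host, "pgrep", "-c", "-x", "hhvm"],
                         capture_output=True, text=True)
    return int(out.stdout.strip() or 0)

def renders_ok(host):
    req = urllib.request.Request(
        "http://{}/wiki/Main_Page".format(host),
        headers={"Host": "en.wikipedia.org"})  # assumed test page/Host header
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def classify(host):
    procs = hhvm_process_count(host)
    if procs and not renders_ok(host):
        return "locked up: hhvm running but not rendering -> restart candidate"
    if not procs:
        return "hhvm not running"
    return "healthy"

# Example: print(classify("mw1105.eqiad.wmnet"))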
[07:34:30] https://phabricator.wikimedia.org/T45250#1427060 [07:35:58] 6operations, 10Traffic, 10fundraising-tech-ops: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1427062 (10Chmarkine) Oh, actually http://www.donate.wikimediafoundation.org/ redirects to https://wikimediafoundation.org/wiki/Home, and http://www.donate... [07:36:21] uh, that thing is still not killed? [07:36:39] RECOVERY - puppet last run on mw1105 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [07:40:02] MaxSem: doesn't seem so on the appserver graphs [07:40:30] and we had a big 503 spike around 02:30, bunch of appservers dying shortly before and for a while after, all correlating with the latest request-rate peak from it [07:41:08] I mean, RecordImpression in general. we should feed it to o r i [07:45:37] :) [07:46:11] but this is different. we apparently just have a runaway poorly configured campaign spamming users and spamming our servers in this particular case [07:46:24] even if S:RI survives a bit longer, that campaign needs to die or get throttled [07:47:11] as current status seems ok (although not stable), I will disconnect now, check again things later [07:47:17] cya jynus [07:47:43] as far as I can tell using an anonymous browsing tab: for regular anon/logged-out pageviews, we're showing the banner 100% of the time [07:47:45] (03PS1) 10Chmarkine: Remove www.donate.wikimediafoundation.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/222876 (https://phabricator.wikimedia.org/T102827) [07:48:11] I've followed like 20 article links, and it keeps popping up on every single one [07:48:38] it does stop showing if you click the X-mark to close it, at least [07:50:08] heh after my 20 or so mostly-blind random clicks in from Main_Page, somehow I ended up on https://en.wikipedia.org/wiki/Encyclopedia_of_the_Central_Intelligence_Agency [07:51:45] http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Application+servers+eqiad&h=mw1033.eqiad.wmnet&jr=&js=&v=20.425&m=ap_rps&vl=req%2Fsec&ti=Requests+per+second [07:51:56] ^ appserver req-rate impact of banner campaign [07:53:20] (03PS1) 10Chmarkine: Remove www.donate.mediawiki.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/222877 (https://phabricator.wikimedia.org/T102827) [07:56:39] PROBLEM - Apache HTTP on mw1105 is CRITICAL - Socket timeout after 10 seconds [07:57:29] PROBLEM - HHVM rendering on mw1105 is CRITICAL - Socket timeout after 10 seconds [07:58:38] PROBLEM - RAID on mw1105 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:58:59] PROBLEM - DPKG on mw1105 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:59:00] PROBLEM - nutcracker port on mw1105 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:59:19] PROBLEM - puppet last run on mw1105 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:59:31] the wikimania campaign? isn't it sliiiightly late to register? [07:59:58] PROBLEM - SSH on mw1105 is CRITICAL - Socket timeout after 10 seconds [07:59:58] PROBLEM - HHVM processes on mw1105 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:59:58] PROBLEM - dhclient process on mw1105 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:59:59] PROBLEM - configured eth on mw1105 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:00:18] PROBLEM - nutcracker process on mw1105 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:00:19] PROBLEM - salt-minion processes on mw1105 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:00:40] PROBLEM - Disk space on mw1105 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:06:28] MaxSem: no idea. they're set to run this campaign from Jul 2-14: https://meta.wikimedia.org/w/index.php?title=Special:CentralNotice&subaction=noticeDetail&notice=wm2015register [08:06:37] yup [08:06:59] if it's causing problems, I guess it's reasonable to disable it for now [08:07:03] I'm considering just blocking S:RI [08:07:19] I really don't know how, at my level of things, I could or should cleanly disable the campaign itself without breaking other things [08:07:38] but I could at least kill the S:RI traffic at varnish [08:08:15] I'd really rather we have someone who knows more about this, or whoever created it, fix their campaign to not be spammy [08:08:31] that can work yes - we don't friggin need to have analytics about a banner as spammy as this [08:14:49] (03PS1) 10BBlack: filter S:RI from wm2015register T45250 [puppet] - 10https://gerrit.wikimedia.org/r/222879 [08:15:23] (03PS2) 10BBlack: filter S:RI from wm2015register T45250 [puppet] - 10https://gerrit.wikimedia.org/r/222879 [08:15:40] (03CR) 10BBlack: [C: 032 V: 032] filter S:RI from wm2015register T45250 [puppet] - 10https://gerrit.wikimedia.org/r/222879 (owner: 10BBlack) [08:18:08] PROBLEM - HHVM rendering on mw1053 is CRITICAL - Socket timeout after 10 seconds [08:18:38] PROBLEM - Apache HTTP on mw1053 is CRITICAL - Socket timeout after 10 seconds [08:18:39] RECOVERY - HHVM processes on mw1105 is OK: PROCS OK: 6 processes with command name hhvm [08:18:39] RECOVERY - dhclient process on mw1105 is OK: PROCS OK: 0 processes with command name dhclient [08:19:18] RECOVERY - Disk space on mw1105 is OK: DISK OK [08:19:29] RECOVERY - DPKG on mw1105 is OK: All packages OK [08:19:29] PROBLEM - HHVM processes on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:19:38] RECOVERY - nutcracker port on mw1105 is OK: TCP OK - 0.000 second response time on port 11212 [08:19:40] PROBLEM - nutcracker port on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:19:40] PROBLEM - nutcracker process on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:19:40] PROBLEM - DPKG on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:19:48] RECOVERY - puppet last run on mw1105 is OK Puppet is currently enabled, last run 43 minutes ago with 0 failures [08:19:58] RECOVERY - HHVM rendering on mw1105 is OK: HTTP OK: HTTP/1.1 200 OK - 64428 bytes in 0.289 second response time [08:20:08] PROBLEM - SSH on mw1053 is CRITICAL - Socket timeout after 10 seconds [08:20:09] PROBLEM - puppet last run on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:20:09] PROBLEM - Disk space on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:20:10] PROBLEM - configured eth on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:20:29] RECOVERY - SSH on mw1105 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:20:29] RECOVERY - configured eth on mw1105 is OK - interfaces up [08:20:29] PROBLEM - dhclient process on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:20:38] PROBLEM - salt-minion processes on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:20:38] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
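(Gerrit change 222879 above drops the Special:RecordImpression beacon traffic generated by the runaway wm2015register campaign before it reaches the appservers. The actual change is Varnish VCL in the puppet repo and is not quoted in this log; the Python below only illustrates the matching logic, and the query-parameter names are assumptions about how the CentralNotice beacon identifies its campaign.)

# Illustration of the request-matching idea behind the S:RI filter above.
# The production implementation is Varnish VCL; parameter names are assumed.
from urllib.parse import urlsplit, parse_qs

BLOCKED_CAMPAIGN = "wm2015register"

def should_drop(url):
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    is_record_impression = (
        "Special:RecordImpression" in parts.path
        or params.get("title", [""])[0] == "Special:RecordImpression")
    return (is_record_impression
            and params.get("campaign", [""])[0] == BLOCKED_CAMPAIGN)

# A front-end cache would answer matching requests directly (e.g. with an
# empty 204) instead of passing them back to the appservers.
assert should_drop(
    "https://en.wikipedia.org/w/index.php?title=Special:RecordImpression"
    "&campaign=wm2015register")
assert not should_drop("https://en.wikipedia.org/wiki/Main_Page")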
[08:20:49] RECOVERY - nutcracker process on mw1105 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:20:50] RECOVERY - salt-minion processes on mw1105 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:21:00] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [08:21:00] RECOVERY - RAID on mw1105 is OK no RAID installed [08:23:34] <_joe_> !log restarted apache on mw1105,mw1092,90,82,78 [08:23:39] Logged the message, Master [08:23:49] <_joe_> !log restarted hhvm because of ooms, not apache [08:23:53] Logged the message, Master [08:27:42] !log FYI: 08:15 < grrrit-wm> (CR) BBlack: [C: 2 V: 2] filter S:RI from wm2015register T45250 [puppet] - https://gerrit.wikimedia.org/r/222879 (owner: BBlack) [08:27:47] Logged the message, Master [08:27:58] RECOVERY - dhclient process on mw1053 is OK: PROCS OK: 0 processes with command name dhclient [08:27:58] RECOVERY - salt-minion processes on mw1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:28:50] RECOVERY - HHVM processes on mw1053 is OK: PROCS OK: 6 processes with command name hhvm [08:28:58] RECOVERY - nutcracker port on mw1053 is OK: TCP OK - 0.000 second response time on port 11212 [08:28:59] RECOVERY - nutcracker process on mw1053 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [08:29:19] RECOVERY - Disk space on mw1053 is OK: DISK OK [08:29:20] RECOVERY - SSH on mw1053 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:29:29] RECOVERY - configured eth on mw1053 is OK - interfaces up [08:29:49] <_joe_> !log restaarted HHVM on mw1059 with heap profiling enabled, collecting data (will stop this evening). 
[08:29:49] RECOVERY - RAID on mw1053 is OK no RAID installed [08:29:53] Logged the message, Master [08:30:49] RECOVERY - DPKG on mw1053 is OK: All packages OK [08:31:08] RECOVERY - puppet last run on mw1053 is OK Puppet is currently enabled, last run 16 minutes ago with 0 failures [08:31:09] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 64428 bytes in 1.656 second response time [08:31:38] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [08:40:34] <_joe_> !log collecting heaps on an api appserver, mw1115, as comparison [08:40:38] Logged the message, Master [08:43:39] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [08:44:19] RECOVERY - Host mw2027 is UP: PING WARNING - Packet loss = 58%, RTA = 43.04 ms [08:50:28] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.049 second response time [08:50:28] RECOVERY - HHVM rendering on mw1051 is OK: HTTP OK: HTTP/1.1 200 OK - 64428 bytes in 0.137 second response time [08:58:09] RECOVERY - HHVM busy threads on mw1051 is OK Less than 30.00% above the threshold [57.6] [08:58:49] RECOVERY - HHVM queue size on mw1051 is OK Less than 30.00% above the threshold [10.0] [09:04:54] (03PS1) 10Chmarkine: Remove www.donate.wiktionary.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/222880 (https://phabricator.wikimedia.org/T102827) [09:29:24] (03PS1) 10Chmarkine: Remove www.donate.wikipedia.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/222883 (https://phabricator.wikimedia.org/T102827) [09:31:30] PROBLEM - Restbase root url on restbase1005 is CRITICAL - Socket timeout after 10 seconds [09:43:34] !log restarted restbase on restbase1005 [09:43:38] Logged the message, Master [09:44:49] RECOVERY - Restbase root url on restbase1005 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.058 second response time [10:17:11] 6operations, 7HHVM: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1427156 (10Joe) I am collecting data on mw1059 (and mw1115 for comparison) to see what changes in the memory profile of both servers over time. In about 1 day of data we should have a very clear pictur... [10:36:00] <_joe_> [10:45:05] Hi [10:45:50] PROBLEM - Restbase root url on restbase1002 is CRITICAL - Socket timeout after 10 seconds [10:45:58] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [10:46:38] PROBLEM - Restbase root url on restbase1003 is CRITICAL - Socket timeout after 10 seconds [10:46:49] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [10:51:39] PROBLEM - puppet last run on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:51:39] PROBLEM - RAID on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:52:18] PROBLEM - Disk space on Hadoop worker on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:52:28] PROBLEM - Disk space on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:52:49] PROBLEM - SSH on analytics1020 is CRITICAL - Socket timeout after 10 seconds [10:53:10] PROBLEM - dhclient process on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:53:10] PROBLEM - DPKG on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
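For readers following the restbase1002-1005 alerts: the "Restbase root url" check boils down to an HTTP GET against each node's RESTBase listener that must return 200 within ten seconds (the recovery above shows roughly 15 kB coming back in well under a second). A minimal hand-run equivalent, assuming RESTBase's usual port 7231 since the exact Icinga command is not quoted in the log:

```python
# Minimal manual version of the "Restbase root url" probe seen in the alerts:
# GET the service root and fail if it is not a 200 within 10 seconds.
# Port 7231 and the hostname are assumptions; the real Icinga command isn't shown here.
import sys
import requests

host = sys.argv[1] if len(sys.argv) > 1 else "restbase1005.eqiad.wmnet"

try:
    resp = requests.get(f"http://{host}:7231/", timeout=10)
except requests.RequestException as exc:
    print(f"CRITICAL - {exc}")
    sys.exit(2)

if resp.status_code == 200:
    print(f"OK - HTTP {resp.status_code} - {len(resp.content)} bytes "
          f"in {resp.elapsed.total_seconds():.3f}s")
    sys.exit(0)

print(f"CRITICAL - HTTP {resp.status_code}")
sys.exit(2)
```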
[10:53:19] PROBLEM - configured eth on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:53:28] PROBLEM - Hadoop DataNode on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:53:29] PROBLEM - salt-minion processes on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:29] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:26] <_joe_> !log restarting cassandra on rb1003,4 and restbase on rb1002,3 [11:03:39] PROBLEM - Cassanda CQL query interface on restbase1003 is CRITICAL: Connection refused [11:03:39] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [11:03:44] <_joe_> also, what the fuck. [11:03:47] Logged the message, Master [11:04:38] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.000 second response time on port 9042 [11:05:29] RECOVERY - Cassanda CQL query interface on restbase1003 is OK: TCP OK - 0.002 second response time on port 9042 [11:10:39] RECOVERY - Restbase root url on restbase1003 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.016 second response time [11:11:48] RECOVERY - Restbase root url on restbase1002 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.009 second response time [11:14:29] PROBLEM - Restbase root url on restbase1001 is CRITICAL - Socket timeout after 10 seconds [11:14:49] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [11:14:49] PROBLEM - Restbase root url on restbase1006 is CRITICAL - Socket timeout after 10 seconds [11:15:40] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [11:17:20] <_joe_> this is bad [11:20:08] <_joe_> !log restarting restbase on rb100{1,4,6} [11:20:28] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [11:21:13] <_joe_> !log restarted cassandra on restbase1004 (again), seemingly crashed for a bad request [11:21:17] Logged the message, Master [11:21:20] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.012 second response time on port 9042 [11:21:59] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.010 second response time [11:22:18] RECOVERY - Restbase root url on restbase1006 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.021 second response time [11:22:46] <_joe_> mobrovac, gwicke urandom this should be fixed first thing on monday - restbase should not keep dying for timeouts [11:33:50] _joe_: agree [11:36:33] _joe_: i did deploy a small fix / improvement for that friday, but apparently we're still missing something [12:16:41] (03CR) 10Krinkle: [C: 031] Set $wgMainStash to redis instead of the DB default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221885 (https://phabricator.wikimedia.org/T88493) (owner: 10Aaron Schulz) [12:18:17] (03CR) 10Krinkle: "Code looks good, but I can't find in the commit nor the referenced task anything about why Redis. 
In case this goes side-ways and/or for w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221885 (https://phabricator.wikimedia.org/T88493) (owner: 10Aaron Schulz) [12:22:49] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [12:23:59] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [12:25:49] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [12:26:38] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.008 second response time on port 9042 [12:27:46] (03CR) 10JanZerebecki: [C: 031] Remove www.donate.wikimediafoundation.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/222876 (https://phabricator.wikimedia.org/T102827) (owner: 10Chmarkine) [12:27:51] Cassanda? [12:28:08] i.e. delenda? :) [12:28:09] (03CR) 10JanZerebecki: [C: 031] Remove www.donate.mediawiki.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/222877 (https://phabricator.wikimedia.org/T102827) (owner: 10Chmarkine) [12:36:29] PROBLEM - Restbase root url on restbase1003 is CRITICAL - Socket timeout after 10 seconds [12:37:25] 6operations, 10Traffic, 10fundraising-tech-ops, 5Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1427281 (10JanZerebecki) [12:42:11] 6operations, 10Traffic, 10fundraising-tech-ops, 5Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1427290 (10JanZerebecki) Regarding the mkt41.net cnames, see also T74514 and T60373. [12:43:58] PROBLEM - puppet last run on restbase1005 is CRITICAL Puppet last ran 2 days ago [12:44:10] (03CR) 10JanZerebecki: [C: 031] Remove www.donate.wikipedia.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/222883 (https://phabricator.wikimedia.org/T102827) (owner: 10Chmarkine) [12:46:19] (03PS1) 10Giuseppe Lavagetto: cassandra: raise heap size to 16Gb [puppet] - 10https://gerrit.wikimedia.org/r/222899 [12:47:14] (03CR) 10JanZerebecki: [C: 031] Remove www.donate.wiktionary.org from DNS [dns] - 10https://gerrit.wikimedia.org/r/222880 (https://phabricator.wikimedia.org/T102827) (owner: 10Chmarkine) [12:49:23] (03CR) 10Mobrovac: [C: 031] "It should. We have restarted the nodes plenty of times this week due to OOMs." [puppet] - 10https://gerrit.wikimedia.org/r/222899 (owner: 10Giuseppe Lavagetto) [12:49:29] RECOVERY - puppet last run on restbase1005 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [12:49:44] (03CR) 10Giuseppe Lavagetto: [C: 032] cassandra: raise heap size to 16Gb [puppet] - 10https://gerrit.wikimedia.org/r/222899 (owner: 10Giuseppe Lavagetto) [12:51:36] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: consider moving Cassandra to G1GC in production - https://phabricator.wikimedia.org/T103161#1427300 (10mobrovac) [12:53:10] RECOVERY - Restbase root url on restbase1003 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.019 second response time [12:55:34] !log restbase rolling restart of cassandra to apply the 16G heap change https://gerrit.wikimedia.org/r/222899 [12:55:38] Logged the message, Master [13:31:43] any idea why wikitech instantly looses knowledge of sessions? 
(as in you log in, reload you are logged out; or the login screen already says you need cookies; even though the cookie is correctly sent) [13:44:37] jzerebecki: https://phabricator.wikimedia.org/T104766 was filed recently [13:58:20] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [13:59:29] PROBLEM - puppet last run on eventlog2001 is CRITICAL puppet fail [14:06:48] Nemo_bis: thx, that's it [14:17:30] 6operations, 10OCG-General-or-Unknown, 6Services: Issues with OCG service in production - https://phabricator.wikimedia.org/T104708#1427364 (10TheDJ) Shouldn't we have event logging for this or something ? It all breaks quite common and we seem to find out everything through user feedback... [14:18:20] RECOVERY - puppet last run on eventlog2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:34:18] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:13:25] (03PS1) 10Yuvipanda: celery: Create simple module for celery workers [puppet] - 10https://gerrit.wikimedia.org/r/222913 [15:13:27] (03PS1) 10Yuvipanda: ores: Move ores initial install setup into a base class [puppet] - 10https://gerrit.wikimedia.org/r/222914 [15:13:55] (03CR) 10Yuvipanda: [C: 032 V: 032] celery: Create simple module for celery workers [puppet] - 10https://gerrit.wikimedia.org/r/222913 (owner: 10Yuvipanda) [15:16:43] !log restarted nutcracker on silver. [15:16:46] Logged the message, Master [15:20:29] (03CR) 10Yuvipanda: [C: 032] ores: Move ores initial install setup into a base class [puppet] - 10https://gerrit.wikimedia.org/r/222914 (owner: 10Yuvipanda) [15:24:42] (03PS1) 10Yuvipanda: ores: Fix scoping issue with src/venv/config paths [puppet] - 10https://gerrit.wikimedia.org/r/222915 [15:24:54] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Fix scoping issue with src/venv/config paths [puppet] - 10https://gerrit.wikimedia.org/r/222915 (owner: 10Yuvipanda) [15:39:01] (03PS2) 10Yuvipanda: Remove bastion1 and bastion2 from labs bastion hosts list [puppet] - 10https://gerrit.wikimedia.org/r/222871 (owner: 10Alex Monk) [15:39:11] (03CR) 10Yuvipanda: [C: 032 V: 032] "Thanks! :)" [puppet] - 10https://gerrit.wikimedia.org/r/222871 (owner: 10Alex Monk) [15:50:56] (03PS1) 10Yuvipanda: [WIP] ores: worker role [puppet] - 10https://gerrit.wikimedia.org/r/222919 [16:17:39] PROBLEM - puppet last run on mw1167 is CRITICAL Puppet has 1 failures [16:38:29] RECOVERY - puppet last run on mw1167 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:56:08] PROBLEM - puppet last run on mw1163 is CRITICAL Puppet has 1 failures [16:56:09] PROBLEM - puppet last run on mw1125 is CRITICAL Puppet has 1 failures [17:06:33] (03CR) 10Steinsplitter: "can we merge this pls ASAP? https://phabricator.wikimedia.org/T104178#1427512" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221600 (owner: 10Matanya) [17:08:04] (03CR) 10Alex Monk: "Steinsplitter: Today is a Sunday." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221600 (owner: 10Matanya) [17:10:10] (03CR) 10Steinsplitter: "And tomorrow is monday." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/221600 (owner: 10Matanya) [17:11:09] RECOVERY - puppet last run on mw1125 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:12:49] RECOVERY - puppet last run on mw1163 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:17:33] (03CR) 10Awight: "Code should include a "TODO: Horrific workaround to an unspeakable status quo" :D" [puppet] - 10https://gerrit.wikimedia.org/r/222879 (owner: 10BBlack) [17:20:09] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0] [17:21:22] (03CR) 10Ori.livneh: "https://youtu.be/kfVsfOSbJY0?t=125" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221600 (owner: 10Matanya) [17:23:31] Krenair: it would be the simplest way to move url whitelisting onwik. O_O [17:23:48] so it don't takes ages to whiteliste it. [17:24:53] Greg said those are ok to deploy on Weekend [17:24:54] btw [17:24:58] it schouldn't be criticism. just a thought :> [17:25:38] PROBLEM - puppet last run on cp3015 is CRITICAL puppet fail [17:34:56] hoo, where/when? [17:35:10] Krenair: Not sure when... probably this channel [17:35:36] +1 ori :p [17:35:47] Fine, I'll deploy it now [17:36:23] Krenair: I have a noc symlink update as well - if you want to do that as iirc that doesn't need any deployment-y stuff :) [17:36:51] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [17:40:08] JohnFLewis, we sync that stuff anyway [17:40:42] Krenair: want me to link you to the patch and you can do it now or wait for a pickup on Monday? [17:40:47] And actually [17:40:50] noc is served from terbium, not tin [17:40:53] So it does need syncing [17:41:11] oh its from terbium? heh thought it was tin [17:41:45] nope [17:42:19] RECOVERY - puppet last run on cp3015 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:42:29] (03PS4) 10Alex Monk: add unibas.ch to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221600 (owner: 10Matanya) [17:43:04] (03CR) 10Alex Monk: [C: 032] add unibas.ch to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221600 (owner: 10Matanya) [17:43:10] (03Merged) 10jenkins-bot: add unibas.ch to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/221600 (owner: 10Matanya) [17:44:14] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/221600/ (duration: 00m 12s) [17:44:18] Logged the message, Master [17:46:29] JohnFLewis, where is this symlink change then? [17:46:49] Krenair: https://gerrit.wikimedia.org/r/#/c/222290/ [17:47:59] (03CR) 10Alex Monk: [C: 032] refresh symlinks to catch new dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222290 (owner: 10John F. Lewis) [17:48:30] (03Merged) 10jenkins-bot: refresh symlinks to catch new dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222290 (owner: 10John F. Lewis) [17:49:24] !log krenair Synchronized docroot/noc/conf: https://gerrit.wikimedia.org/r/#/c/222290/ (duration: 00m 13s) [17:49:28] Logged the message, Master [17:51:34] JohnFLewis, I reckon we're still missing stuff for those dblists [17:51:44] e.g. aren't we supposed to list them at https://noc.wikimedia.org/conf/ ? 
[17:52:20] and https://noc.wikimedia.org/conf/visualeditor.dblist and https://noc.wikimedia.org/conf/mediaviewer.dblist are 403 [17:52:24] (but nonglobal.dblist is fine) [17:53:04] and also https://noc.wikimedia.org/conf/highlight.php?file=mediaviewer.dblist and https://noc.wikimedia.org/conf/highlight.php?file=visualeditor.dblist don't work either [17:53:05] https://github.com/wikimedia/operations-mediawiki-config/blob/master/docroot/noc/conf/index.php#L63 [17:53:22] honestly I wonder why we still have this now that it's all in a public git repo [17:53:37] Yeah... [17:53:46] Because it's more convenient for end users, I guess [17:53:57] Would be better to build it up on the public git repo, thought [17:54:05] * though [17:54:06] they just were updated when I ran that script for another conf change so I just committed those as well [17:54:31] so hang on... those two don't actually exist? [17:54:34] I can't find them [17:54:51] someone didn't clean things up then apparently [17:55:41] We did have a visualeditor.dblist at one point [17:55:59] But apparently not anymore. [17:56:14] thats why I committed them as-is because I remember them existing though I didn't know they were removed then at some point [17:56:14] mediaviewer.dblist was also removed [17:56:28] ok [17:57:10] ahh, I see [17:57:17] they're still listed in createTxtFileSymlinks [17:57:21] I'll clean this up [18:01:47] (03PS1) 10Alex Monk: Clean up noc symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222932 [18:11:16] (03CR) 10Alex Monk: [C: 032] Clean up noc symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222932 (owner: 10Alex Monk) [18:11:22] (03Merged) 10jenkins-bot: Clean up noc symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222932 (owner: 10Alex Monk) [18:11:58] !log krenair Synchronized docroot/noc: https://gerrit.wikimedia.org/r/#/c/222932/ (duration: 00m 12s) [18:12:02] Logged the message, Master [19:10:26] (03PS1) 10Alex Monk: Update README to remove pmtpa references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222941 [19:10:28] (03PS1) 10Alex Monk: Get rid of most of noc.wikimedia.org/conf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222942 [19:10:33] JohnFLewis, hoo: ^ [19:10:35] (03CR) 10jenkins-bot: [V: 04-1] Get rid of most of noc.wikimedia.org/conf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222942 (owner: 10Alex Monk) [19:11:27] Krenair: Don't think we can do that [19:11:32] why not? [19:11:40] Because we want to keep the urls working [19:11:42] I guess [19:11:49] we could redirect stuff [19:11:51] We could redirect them to $gitfileview [19:12:10] well, but probably not git.wm.o as that's more often down than up [19:12:47] (03PS2) 10Alex Monk: Get rid of most of noc.wikimedia.org/conf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222942 [19:41:00] (03CR) 10John F. 
Lewis: [C: 031] Update README to remove pmtpa references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222941 (owner: 10Alex Monk) [19:49:18] PROBLEM - puppet last run on cp3010 is CRITICAL puppet fail [20:05:58] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:51] looks like cassandra on restbase1005 could use a restart (looking at http://grafana.wikimedia.org/#/dashboard/db/restbase-cassandra-thread-pools ) [20:16:48] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [20:16:49] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [20:16:58] subbu, what's going on with cassandra on restbase? [20:17:35] MaxSem, i don't know .. all i know is that marko gabriel and erik have been working on for a couple days before the weekend. [20:18:46] i periodically chekc parsoid ganglia graphs (a few times in the day) and that is how I discovered that something is off since there is very little load on the parsoid cluster right now. [20:20:26] ok, emailing the ops list and signing off. [20:26:10] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [20:35:09] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:07:09] (03CR) 10Alex Monk: [C: 04-1] "todo: have these urls redirect to git" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222942 (owner: 10Alex Monk) [21:33:52] (03CR) 10BBlack: [C: 04-1] "text-lb doesn't have certs for this, and I don't think it's the right answer to this problem." [dns] - 10https://gerrit.wikimedia.org/r/222860 (https://phabricator.wikimedia.org/T104735) (owner: 10John F. Lewis) [22:03:42] !log restbase rolling restart of restbase [22:03:46] Logged the message, Master [22:08:10] PROBLEM - Restbase root url on restbase1001 is CRITICAL: Connection refused [22:18:07] (03PS2) 10Ori.livneh: tlsproxy: add negotiated cipher to conn props [puppet] - 10https://gerrit.wikimedia.org/r/222842 (owner: 10BBlack) [22:18:08] (03PS1) 10Ori.livneh: varnishxcps: transform 'C' key to 'ssl_cipher' [puppet] - 10https://gerrit.wikimedia.org/r/222983 [22:19:19] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.014 second response time [22:19:21] (03CR) 10Ori.livneh: [C: 031] "LGTM. I5f3fcf87a8 should go out first (and be allowed to propagate to all Varnishes), so we don't end up with a 'C' metric in Graphite." [puppet] - 10https://gerrit.wikimedia.org/r/222842 (owner: 10BBlack) [22:23:06] (03CR) 10Ori.livneh: [C: 032] varnishxcps: transform 'C' key to 'ssl_cipher' [puppet] - 10https://gerrit.wikimedia.org/r/222983 (owner: 10Ori.livneh) [22:30:37] !log Restarted logstash on logstah1001; Hung due to OOM errors [22:30:42] Logged the message, Master [22:31:46] That's the second time in less than a week that logstash has OOM'ed on logstash1001. Something new [22:32:53] joy [23:05:09] Help! I think MediaWiki is stupid! [23:05:44] Yeah, it's run by computers... we know that. [23:05:45] With the humourous beginning out of the way, I need serious help with a bug that's preventing me from helping with a serious privacy violation. [23:06:05] See the repetition? I'm /that/ stressed out [23:08:20] (03Abandoned) 10John F. Lewis: (www.)wmfusercontent.org point to text-lb [dns] - 10https://gerrit.wikimedia.org/r/222860 (https://phabricator.wikimedia.org/T104735) (owner: 10John F. 
Lewis) [23:10:00] * odder sobs help help help [23:10:35] Did you file it? [23:14:46] Not sure it's a bug though, maybe it's the supposed behaviour [23:14:52] Need troubleshooting first [23:15:51] If it's a serious issue, you can still open a ticket [23:56:30] 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1428414 (10bd808) >>! In T95436#1423807, @Krenair wrote: > How are we going to handle sync of mediawiki-staging between tin and mira? Wouldn't we want any sort of gi...
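On the varnishxcps change above (r222983): per its commit message it only renames the terse 'C' key, which r222842 adds to the connection-properties data for the negotiated cipher, into a readable 'ssl_cipher' name before the metrics reach Graphite. The script itself is not quoted in the log; a sketch of that kind of key renaming, with an assumed header format and assumed names for the other keys, might look like:

```python
# Sketch of the key renaming described by r222983: expand terse
# connection-properties keys into readable metric names before emitting stats.
# The "K1=v1; K2=v2" format and the non-'C' key names are assumptions,
# not taken from the real varnishxcps script.
KEY_NAMES = {
    "C": "ssl_cipher",     # negotiated cipher, added by r222842
    "SSL": "ssl_version",  # assumed
    "H2": "http2",         # assumed
}

def parse_conn_props(header_value: str) -> dict:
    """Parse 'K1=v1; K2=v2; ...' into {metric_key: value}."""
    props = {}
    for pair in header_value.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        props[KEY_NAMES.get(key, key)] = value
    return props

print(parse_conn_props("H2=0; SSL=TLSv1.2; C=ECDHE-ECDSA-AES128-GCM-SHA256"))
# {'http2': '0', 'ssl_version': 'TLSv1.2', 'ssl_cipher': 'ECDHE-ECDSA-AES128-GCM-SHA256'}
```

Hence the deploy-ordering note in Ori's review above: the rename has to be in place and propagated before the cipher key starts flowing, so a stray 'C' metric never lands in Graphite.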