[00:51:29] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 301 seconds
[00:52:30] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 365 seconds
[00:53:39] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds
[00:53:40] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[01:53:55] ops-core: Renumber virt0 - https://phabricator.wikimedia.org/T83843#952922 (faidon) Open>Resolved a: faidon Well, virt0 was a Tampa host and 208.80.152.0/25 was a Tampa subnet. This is obviously gone now :)
[01:53:58] ops-core: Renumber virt0 - https://phabricator.wikimedia.org/T83843#952925 (faidon)
[02:35:20] (PS1) OliverKeyes: Change the URLs used by Pybal to simplify tracking for Analytics [puppet] - https://gerrit.wikimedia.org/r/182558
[02:45:21] (CR) Ori.livneh: "Can't you reject requests originating in the cluster?" [puppet] - https://gerrit.wikimedia.org/r/182558 (owner: OliverKeyes)
[02:50:23] (CR) OliverKeyes: "We can, but:" [puppet] - https://gerrit.wikimedia.org/r/182558 (owner: OliverKeyes)
[03:00:36] (CR) Faidon Liambotis: [C: -1] "We really need the ability to check real pages and make sure that they work. ("Undefined" is a real page too, btw, and /could/ work for us" [puppet] - https://gerrit.wikimedia.org/r/182558 (owner: OliverKeyes)
[04:06:01] (CR) OliverKeyes: "Moving the problem, yes, but essentially it's amalgamating. As said in the commit message, we already have to exclude /wiki/Undefined due " [puppet] - https://gerrit.wikimedia.org/r/182558 (owner: OliverKeyes)
[05:38:31] PROBLEM - Disk space on search1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:05:30] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: puppet fail
[06:06:00] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:06:40] PROBLEM - puppet last run on amssq44 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:06:40] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:07:09] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:13:19] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0; ge-0/0/0: down - Core: msw-oe12-esams
[06:14:20] PROBLEM - puppet last run on amssq41 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:16:39] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:22:40] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:23:29] RECOVERY - puppet last run on amssq44 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:23:29] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:23:50] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:24:30] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:29:00] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:20] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:40] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:50] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:19] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:10] RECOVERY - puppet last run on amssq41 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:33:20] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:45:20] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:46:00] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:46:10] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:46:40] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:59] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:15:30] PROBLEM - Disk space on search1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:31:11] operations, Wikimedia-SSL-related: replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#953034 (Chmarkine)
[10:38:06] operations, Wikimedia-SSL-related: replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#953076 (Chmarkine)
[11:05:00] RECOVERY - Disk space on search1021 is OK: DISK OK
[11:09:19] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 961.60004874
[11:20:29] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667
[11:24:20] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[11:30:40] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[11:36:50] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[16:21:22] (CR) Chad: "Could switch it to Special:Blankpage." [puppet] - https://gerrit.wikimedia.org/r/182558 (owner: OliverKeyes)
[16:51:20] PROBLEM - Parsoid on wtp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:00:30] anybody around?
[17:00:49] PROBLEM - Parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:00:52] the parsoid cluster is seeing very high load
[17:01:06] and I don't seem to have the right to restart parsoids any more
[17:01:51] /cc paravoid akosiaris ori godog _joe_
[17:07:10] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:07:21] <_joe_> gwicke: uh, strange
[17:07:32] <_joe_> gwicke: should I just restart parsoid?
[17:08:00] PROBLEM - Parsoid on wtp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:08:28] _joe_: yes
[17:08:38] <_joe_> on the whole cluster?
[17:08:45] in the logs I'm seeing a second parsoid process trying to start up with the current one still running
[17:08:49] _joe_: yes
[17:09:26] <_joe_> gwicke: mmmh it looks worse than just needing a restart
[17:09:39] there is some upstart issue where it doesn't really kill all processes before starting the new process
[17:10:14] so a manual kill -9 node && service parsoid start might be in order
[17:10:37] <_joe_> that's what I am doing
[17:11:19] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.017 second response time
[17:11:31] * gwicke is looking forward to systemd
[17:11:51] <_joe_> 'service parsoid stop; killall -9 nodejs; service parsoid start'
[17:11:54] <_joe_> ok?
[17:11:57] yes
[17:12:52] (PS3) Hoo man: Fix sitelinkgroups for Wikibase clients [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[17:13:19] <_joe_> !log restarting parsoid across the cluster
[17:13:29] Logged the message, Master
[17:13:30] (PS4) Hoo man: Fix specialSiteLinkGroups for Wikibase clients [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[17:14:19] there are a lot of lines like this in /var/log/parsoid/parsoid.log:
[17:14:20] {"name":"parsoid","hostname":"wtp1019","pid":31788,"level":50,"logType":"error","process":{"name":"worker","pid":31788},"msg":"Port 8000 is already in use. Exiting.","longMsg":"Port 8000 is already in use. Exiting.","time":"2015-01-03T17:13:50.965Z","v":0}
[17:14:30] evidently a second parsoid process trying to start up
[17:14:38] <_joe_> possibly
[17:14:50] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.017 second response time
[17:15:07] <_joe_> I am restarting parsoid on 2 servers at a time
[17:15:11] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.016 second response time
[17:15:44] wtp1019 is back to normal
[17:16:21] the traffic pattern I see there looks normalish too
[17:17:29] load on the cluster went up quite a bit a few hours ago: https://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&c=Parsoid+eqiad&m=cpu_report&s=by+name&mc=2&g=cpu_report
[17:18:29] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 13 failures
[17:18:36] <_joe_> gwicke: I am on vacation until monday, so I didn't take the time to debug this better
[17:18:49] while network went down in the same period: https://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&c=Parsoid+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report
[17:18:55] <_joe_> but systems where I restarted node are ok now
[17:18:57] so not an external load surge it seems
[17:19:28] *nod*
[17:19:36] <_joe_> I see quite the opposite
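The 'service parsoid stop; killall -9 nodejs; service parsoid start' sequence _joe_ runs above can be sketched as a small shell helper. This is only a sketch of what the log describes, not a deployed script: the upstart `parsoid` job and the `nodejs` process name are taken from the log, and `restart_parsoid_host` is a made-up name.

```shell
# Hypothetical helper mirroring the manual per-host restart described above.
# Assumes an upstart "parsoid" job whose workers run as "nodejs"; upstart
# can leave stale workers behind on restart (see T75395).
restart_parsoid_host() {
  service parsoid stop || true           # may fail if the job is already wedged
  killall -9 nodejs 2>/dev/null || true  # ensure no stale worker still holds port 8000
  service parsoid start
}
```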
[17:20:38] <_joe_> we had a network traffic surge that eventually resulted in a drop when the cpu skyrocketed, so looks like some anomalous load made node get into a bad state
[17:20:59] cpu skyrocketed after network had dropped already
[17:21:09] <_joe_> anyways, it's recovering now
[17:21:27] cpu went up at 3:30, network dropped at around 2:30
[17:21:38] <_joe_> right
[17:21:39] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.013 second response time
[17:22:06] I'm also on vacation, so might not spend my entire morning investigating this
[17:22:14] <_joe_> :)
[17:22:19] <_joe_> wise choice
[17:22:27] <_joe_> it's mostly solved now btw
[17:23:06] the restart failure is tracked in https://phabricator.wikimedia.org/T75395
[17:23:41] <_joe_> ok, everything restarted, we should be ok now
[17:23:48] * _joe_ goes away again
[17:23:57] _joe_: thanks for your help!
[17:24:12] and enjoy the remainder of your vacation
[17:28:09] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail
[17:36:19] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[17:38:15] hmm, load seems to be climbing back up again
[17:46:59] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:48:54] there is very little external traffic in the parsoid varnish frontend (grep -v 10.64)
[18:03:18] most requests are coming from mw* hosts; https://gdash.wikimedia.org/dashboards/apimethods/ is sadly broken, so I'm wondering if there is any other monitoring for API methods
[18:03:36] didn't find anything for the visualeditor action in graphite
[18:08:19] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[18:09:30] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[18:13:39] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0
[18:15:50] PROBLEM - Parsoid on wtp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:16:09] PROBLEM - Parsoid on wtp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:17:09] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail
[18:28:19] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[18:28:20] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[18:33:18] and we are back in a restart loop: {"name":"parsoid","hostname":"wtp1019","pid":21289,"level":50,"logType":"error","process":{"name":"worker","pid":21289},"msg":"Port 8000 is already in use. Exiting.","longMsg":"Port 8000 is already in use. Exiting.","time":"2015-01-03T18:32:54.835Z","v":0}
[18:34:00] load is also really high /cc _joe_ akosiaris paravoid ori godog jgage mutante
[18:38:00] PROBLEM - Parsoid on wtp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:48:40] PROBLEM - RAID on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:49:39] PROBLEM - puppet last run on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
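The "Port 8000 is already in use" error above means a stale worker still holds parsoid's listening port when the new process tries to bind. A quick way to check from the host itself is to look at the listening sockets (a sketch; it assumes a Linux box with `ss` from iproute2 available, which the wtp* hosts should have):

```shell
# Report whether anything is still listening on parsoid's port 8000,
# matching the "Port 8000 is already in use" errors in parsoid.log.
if ss -tln | grep -q ':8000 '; then
  echo "port 8000 busy"
else
  echo "port 8000 free"
fi
```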
[18:49:50] RECOVERY - RAID on wtp1005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[18:50:31] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[18:52:03] (PS1) GWicke: Make parsoid roots actually root on the parsoid boxes [puppet] - https://gerrit.wikimedia.org/r/182585
[18:52:30] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:52:50] PROBLEM - Parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:53:10] PROBLEM - puppet last run on wtp1022 is CRITICAL: CRITICAL: Puppet has 10 failures
[18:55:30] I got to go, hope somebody else can take over handling the parsoid issue
[18:57:10] PROBLEM - SSH on wtp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:57:29] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:51] PROBLEM - Parsoid on wtp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:57:56] (CR) Aude: [C: -1] "looks ok, generally however this is also a repo setting (see comments)" (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[18:58:00] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:49] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:59:09] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures
[18:59:29] RECOVERY - SSH on wtp1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[18:59:49] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[18:59:50] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:02:50] PROBLEM - SSH on wtp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:03:10] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:03:20] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:03:40] PROBLEM - RAID on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:03:49] PROBLEM - Parsoid on wtp1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:04:29] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures
[19:06:00] RECOVERY - SSH on wtp1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:06:19] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:06:59] PROBLEM - SSH on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:07:00] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:07:40] PROBLEM - puppet last run on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:07:49] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:00] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.011 second response time
[19:08:50] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:09:20] PROBLEM - SSH on wtp1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:10:19] RECOVERY - SSH on wtp1003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:11:00] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:11:10] RECOVERY - SSH on wtp1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:11:20] RECOVERY - RAID on wtp1004 is OK: OK: no disks configured for RAID
[19:11:59] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures
[19:12:09] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[19:12:19] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:13:10] PROBLEM - dhclient process on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:13:10] PROBLEM - DPKG on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:13:19] PROBLEM - salt-minion processes on wtp1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:13:19] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:13:19] PROBLEM - puppet last run on wtp1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:13:20] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures
[19:13:29] PROBLEM - SSH on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:13:40] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:14:09] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:14:10] RECOVERY - dhclient process on wtp1006 is OK: PROCS OK: 0 processes with command name dhclient
[19:14:10] RECOVERY - DPKG on wtp1006 is OK: All packages OK
[19:14:20] RECOVERY - salt-minion processes on wtp1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:14:29] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures
[19:14:29] PROBLEM - puppet last run on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:15:20] PROBLEM - parsoid disk space on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:15:30] PROBLEM - salt-minion processes on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:15:40] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:16:19] PROBLEM - DPKG on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:16:20] RECOVERY - parsoid disk space on wtp1004 is OK: DISK OK
[19:16:30] RECOVERY - salt-minion processes on wtp1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:16:30] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[19:16:32] (CR) Aude: Fix specialSiteLinkGroups for Wikibase clients (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[19:17:20] RECOVERY - DPKG on wtp1012 is OK: All packages OK
[19:17:40] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:18:09] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 30 minutes ago with 0 failures
[19:18:29] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:19:50] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:20:50] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:21:30] PROBLEM - puppet last run on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:22:40] PROBLEM - RAID on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:22:40] PROBLEM - configured eth on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:22:40] PROBLEM - dhclient process on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:22:47] (CR) Tpt: "I'm not sure it's a good idea to merge the siteLinkGroups option of the clients with the same option of the repos because we will have to " [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[19:23:00] PROBLEM - Parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:23:29] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:23:31] PROBLEM - SSH on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:23:39] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures
[19:23:40] PROBLEM - RAID on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:23:40] RECOVERY - RAID on wtp1004 is OK: OK: no disks configured for RAID
[19:23:40] RECOVERY - dhclient process on wtp1004 is OK: PROCS OK: 0 processes with command name dhclient
[19:23:40] RECOVERY - configured eth on wtp1004 is OK: NRPE: Unable to read output
[19:23:50] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:24:29] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:24:39] RECOVERY - RAID on wtp1003 is OK: OK: no disks configured for RAID
[19:24:50] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures
[19:25:30] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:25:39] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:26:49] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:27:59] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 30 minutes ago with 0 failures
[19:28:00] PROBLEM - RAID on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:28:40] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:29:00] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:29:19] PROBLEM - DPKG on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:29:29] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:30:00] PROBLEM - puppet last run on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:30:11] RECOVERY - DPKG on wtp1014 is OK: All packages OK
[19:31:30] PROBLEM - DPKG on wtp1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:31:39] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 44 minutes ago with 0 failures
[19:32:09] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:32:19] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:32:29] RECOVERY - RAID on wtp1003 is OK: OK: no disks configured for RAID
[19:32:30] RECOVERY - DPKG on wtp1009 is OK: All packages OK
[19:33:40] PROBLEM - puppet last run on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:09] PROBLEM - salt-minion processes on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:10] PROBLEM - RAID on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:20] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures
[19:34:20] PROBLEM - DPKG on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:30] PROBLEM - SSH on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:34:30] PROBLEM - configured eth on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:30] PROBLEM - RAID on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:35:10] PROBLEM - SSH on wtp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:35:10] PROBLEM - Disk space on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:35:10] PROBLEM - Parsoid on wtp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:35:30] PROBLEM - Disk space on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:35:30] PROBLEM - dhclient process on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:35:30] PROBLEM - parsoid disk space on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:35:39] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures
[19:35:59] PROBLEM - Parsoid on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:36:09] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.010 second response time
[19:36:09] RECOVERY - SSH on wtp1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:36:09] RECOVERY - Disk space on wtp1002 is OK: DISK OK
[19:36:10] RECOVERY - salt-minion processes on wtp1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:36:10] RECOVERY - RAID on wtp1014 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:36:29] RECOVERY - dhclient process on wtp1006 is OK: PROCS OK: 0 processes with command name dhclient
[19:36:29] RECOVERY - Disk space on wtp1006 is OK: DISK OK
[19:36:29] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:36:30] RECOVERY - DPKG on wtp1006 is OK: All packages OK
[19:36:30] RECOVERY - parsoid disk space on wtp1006 is OK: DISK OK
[19:36:39] RECOVERY - SSH on wtp1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:36:39] RECOVERY - configured eth on wtp1006 is OK: NRPE: Unable to read output
[19:36:39] RECOVERY - RAID on wtp1006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:36:40] PROBLEM - puppet last run on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:36:50] PROBLEM - SSH on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:36:50] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.012 second response time
[19:36:50] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures
[19:37:11] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:37:30] PROBLEM - DPKG on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:37:40] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 19 minutes ago with 0 failures
[19:39:04] (PS5) Tpt: Display links to Wikidata in the other project sidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/181871
[19:39:19] PROBLEM - RAID on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:39:40] PROBLEM - configured eth on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:39:49] PROBLEM - salt-minion processes on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:39:49] PROBLEM - dhclient process on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:39:49] PROBLEM - Parsoid on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:39:50] PROBLEM - RAID on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:40:10] PROBLEM - Disk space on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:40:20] PROBLEM - configured eth on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:40:21] RECOVERY - RAID on wtp1004 is OK: OK: no disks configured for RAID
[19:40:39] PROBLEM - parsoid disk space on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:40:40] RECOVERY - configured eth on wtp1014 is OK: NRPE: Unable to read output
[19:40:50] RECOVERY - RAID on wtp1014 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:41:10] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:41:20] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:41:56] (PS6) Tpt: Display links to Wikidata in the other project sidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/181871
[19:42:25] (CR) Tpt: "PS 5-6: removes duplication of sitegroup configuration" [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[19:42:40] RECOVERY - parsoid disk space on wtp1012 is OK: DISK OK
[19:43:00] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.715 second response time
[19:43:00] RECOVERY - dhclient process on wtp1012 is OK: PROCS OK: 0 processes with command name dhclient
[19:43:00] RECOVERY - salt-minion processes on wtp1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:43:10] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:43:20] RECOVERY - Disk space on wtp1012 is OK: DISK OK
[19:43:29] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 45 minutes ago with 0 failures
[19:43:29] RECOVERY - configured eth on wtp1012 is OK: NRPE: Unable to read output
[19:43:29] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:44:10] PROBLEM - Disk space on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:45:00] PROBLEM - puppet last run on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:45:10] RECOVERY - Disk space on wtp1004 is OK: DISK OK
[19:45:40] PROBLEM - puppet last run on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:45:59] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures
[19:45:59] PROBLEM - SSH on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:46:00] PROBLEM - RAID on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:46:20] PROBLEM - dhclient process on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:46:20] PROBLEM - salt-minion processes on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:46:40] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[19:47:00] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:47:10] PROBLEM - RAID on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:47:10] PROBLEM - dhclient process on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:47:19] PROBLEM - SSH on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:47:30] RECOVERY - dhclient process on wtp1012 is OK: PROCS OK: 0 processes with command name dhclient
[19:47:30] RECOVERY - salt-minion processes on wtp1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:47:40] RECOVERY - DPKG on wtp1012 is OK: All packages OK
[19:48:00] PROBLEM - SSH on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:48:09] RECOVERY - SSH on wtp1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:48:09] RECOVERY - dhclient process on wtp1004 is OK: PROCS OK: 0 processes with command name dhclient
[19:48:09] RECOVERY - RAID on wtp1011 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:48:10] RECOVERY - SSH on wtp1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:49:09] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:49:10] RECOVERY - RAID on wtp1004 is OK: OK: no disks configured for RAID
[19:49:30] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[19:50:00] PROBLEM - Disk space on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:50:00] PROBLEM - dhclient process on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:50:00] PROBLEM - parsoid disk space on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:50:09] PROBLEM - puppet last run on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:50:19] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:51:01] RECOVERY - dhclient process on wtp1007 is OK: PROCS OK: 0 processes with command name dhclient
[19:51:01] RECOVERY - Disk space on wtp1007 is OK: DISK OK
[19:51:01] RECOVERY - parsoid disk space on wtp1007 is OK: DISK OK
[19:52:30] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:53:00] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:53:10] PROBLEM - DPKG on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:53:20] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:53:20] RECOVERY - puppet last run on wtp1007 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures
[19:53:29] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:54:09] PROBLEM - SSH on wtp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:54:40] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 56 minutes ago with 0 failures
[19:55:00] RECOVERY - SSH on wtp1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:55:29] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:55:29] RECOVERY - DPKG on wtp1012 is OK: All packages OK
[19:55:40] PROBLEM - puppet last run on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:56:10] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[19:56:49] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:57:41] hmm
[19:57:52] gwicke: or cscott_away or anyone else who knows about parsoid around?
[19:58:04] oh, there was an email
[19:58:05] * YuviPanda checks
[19:58:20] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 6 failures
[19:58:50] PROBLEM - DPKG on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:00] PROBLEM - puppet last run on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:10] PROBLEM - Disk space on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:10] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:10] PROBLEM - configured eth on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:10] PROBLEM - SSH on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:59:19] PROBLEM - Parsoid on wtp1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:59:30] PROBLEM - parsoid disk space on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:40] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:50] PROBLEM - dhclient process on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:50] PROBLEM - salt-minion processes on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:00:10] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures
[20:00:59] RECOVERY - dhclient process on wtp1012 is OK: PROCS OK: 0 processes with command name dhclient
[20:00:59] RECOVERY - salt-minion processes on wtp1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:01:10] RECOVERY - DPKG on wtp1012 is OK: All packages OK
[20:01:20] RECOVERY - Disk space on wtp1012 is OK: DISK OK
[20:01:30] RECOVERY - configured eth on wtp1012 is OK: NRPE: Unable to read output
[20:01:30] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[20:01:39] RECOVERY - parsoid disk space on wtp1012 is OK: DISK OK
[20:01:49] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[20:02:20] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:03:40] PROBLEM - RAID on wtp1024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:03:40] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[20:04:39] RECOVERY - RAID on wtp1024 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[20:05:50] PROBLEM - puppet last run on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:06:07] !log restarting parsoid on wtp1008
[20:06:15] Logged the message, Master
[20:06:20] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.019 second response time
[20:06:36] (PS7) Tpt: Display links to Wikidata in the other project sidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/181871
[20:06:40] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[20:07:00] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.021 second response time
[20:07:30] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:07:40] PROBLEM - SSH on wtp1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:07:59] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 17 minutes ago with 0 failures
[20:08:40] RECOVERY - SSH on wtp1024 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[20:10:43] hmm, strace wasn’t really helpful
[20:10:49] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[20:12:19] !log restarting parsoid on all wtp* hosts
[20:12:28] Logged the message, Master
[20:12:39] PROBLEM - SSH on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:12:39] PROBLEM - configured eth on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:12:39] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:12:40] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:13:21] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.018 second response time
[20:13:30] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[20:13:40] RECOVERY - configured eth on wtp1012 is OK: NRPE: Unable to read output
[20:15:00] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.023 second response time
[20:15:10] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.021 second response time
[20:15:10] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.012 second response time
[20:15:20] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.011 second response time
[20:15:20] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.010 second response time
[20:15:32] (CR) Aude: [C: +1] Display links to Wikidata in the other project sidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[20:16:09] PROBLEM - RAID on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:16:10] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[20:16:19] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.024 second response time
[20:17:00] RECOVERY - RAID on wtp1004 is OK: OK: no disks configured for RAID
[20:17:30] PROBLEM - configured eth on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:17:40] PROBLEM - Parsoid on wtp1011 is CRITICAL: HTTP CRITICAL - No data received from host
[20:18:00] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:18:10] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:18:20] RECOVERY - configured eth on wtp1011 is OK: NRPE: Unable to read output
[20:18:33] (CR) Hoo man: [C: -1] "New global should probably be unset at the end of the script. Fine despite." [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[20:18:40] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.016 second response time
[20:18:50] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.010 second response time
[20:19:50] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:22:21] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[20:22:30] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[20:23:00] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[20:23:49] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[20:26:49] hmm
[20:28:55] hey subbu
[20:29:08] hi
[20:29:20] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[20:29:20] !log manually restarted parsoid on wtp1012
[20:29:27] Logged the message, Master
[20:29:42] i just saw gwicke's mails .. and looking at logstash to see what i can find there.
[20:30:00] subbu: cool. I can be around for as long as you like.
[20:30:01] my sense is that parsoid is choking on commons.wikimedia.org/wiki/User:Nordlicht8/Equestriansports/2014_November_21-30
[20:30:19] a simple strace showed a lot of futex waiting, but I didn’t investigate
[20:30:23] much
[20:30:37] oh
[20:31:08] but, i am not sure ..trying that page locally
[20:31:33] i see a bunch of extension failure errors in logstash for that page.
[20:32:23] it sure has zillions of photos on it.
[20:32:33] but, gwicke have you found anything so far?
[20:33:58] dunno if gwicke is still around
[20:34:15] I got to go, hope somebody else can take over handling the parsoid issue
[20:34:35] ok.
[20:35:03] no, that page finished parsing locally.
[20:36:22] I wish people used !log
[20:38:38] (CR) Hoo man: Display links to Wikidata in the other project sidebar (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[20:39:33] subbu: hmm, some hosts are loading up again
[20:39:35] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Parsoid+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[20:39:53] yes .. i see it. i am looking at logs to see what page it is choking on.
[20:40:06] ok
[20:43:45] ok, found a/the page .. http://ur.wikipedia.org/wiki/%D9%86%D8%A7%D9%85_%D9%85%D9%82%D8%A7%D9%85%D8%A7%D8%AA_%D8%A7%DB%92
[20:44:20] now, have to figure out how to fix it.
[20:44:50] subbu: are you still in India? :)
[20:45:09] no, back in US .. currently in cleveland .. back in mpls tomorrow
[20:45:13] ah, nice
[20:45:21] so it isn’t crazy late for you :D good.
[20:45:37] no :)
[20:45:43] hmm, should I restart things once again to give you time to fix things?
[20:46:15] yes .. we might have to do it a few times .. unless the job clears out from the job queue (assuming that is where it is coming from).
[20:46:38] !log restarting parsoid on wtp* again
[20:46:46] Logged the message, Master
[20:49:00] subbu: doing a rolling restart 2 machines at a time
[20:49:05] k
[20:49:08] should probably increase the number of machines per hit next time
[20:49:18] subbu: I can be around for another hour, I think.
[20:49:24] YuviPanda, thanks very much btw for sticking around and helping out.
[20:49:29] k.
[20:49:43] i should have a handle on it well within that time, i am hoping.
[20:50:00] PROBLEM - SSH on wtp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:50:20] it is a gigantic 25K list .. i assume our list algorithm needs fixup.
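[editor's note] The rolling restart described above ("2 machines at a time", later with a `killall -9` of runaway node processes) can be sketched roughly as below. This is a hypothetical illustration, not the actual dsh/salt tooling used that day: the host list, the ssh invocation, and the service commands are all assumptions.

```python
import subprocess

def batches(hosts, size=2):
    """Yield consecutive groups of at most `size` hosts."""
    for i in range(0, len(hosts), size):
        yield hosts[i:i + size]

def rolling_restart(hosts, size=2, dry_run=True):
    """Restart parsoid `size` hosts at a time, finishing each batch
    before moving on so most of the pool keeps serving traffic."""
    for batch in batches(hosts, size):
        for host in batch:
            # kill -9 any runaway node processes first, then restart
            cmd = ["ssh", host,
                   "sudo killall -9 nodejs; sudo service parsoid restart"]
            if dry_run:
                print("would run:", " ".join(cmd))
            else:
                subprocess.run(cmd, check=True)

wtp_hosts = [f"wtp{n}" for n in range(1001, 1025)]  # wtp1001..wtp1024
rolling_restart(wtp_hosts[:4])  # dry run over the first two batches
```

Batching two at a time trades total restart time for availability; "increase the number of machines per hit" corresponds to raising `size`.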
[20:51:00] RECOVERY - SSH on wtp1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[20:51:21] :)
[20:51:29] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:52:20] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[20:55:50] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:56:00] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:56:50] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[20:57:00] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures
[21:10:10] PROBLEM - puppet last run on amssq40 is CRITICAL: CRITICAL: puppet fail
[21:17:36] subbu: am restarting again.
[21:17:49] subbu: when done, I’ll go take a shower and be back in another 20mins or so.
[21:18:13] YuviPanda, sounds good.
[21:19:50] !log restarting parsoid on wtp* hosts again
[21:19:56] Logged the message, Master
[21:28:50] (PS8) Tpt: Display links to Wikidata in the other project sidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/181871
[21:29:12] (CR) Tpt: "PS 8: Removes an obsolete comment" [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[21:30:19] RECOVERY - puppet last run on amssq40 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[21:31:15] subbu: alright, I’m going afk, will be back in about 20mins
[21:31:43] k
[22:01:04] subbu: any luck?
[22:01:09] it’s back! :)
[22:01:39] i know .. fixing the parse is a bigger issue than a simple quick hack .. so, i started looking at timeouts ...
[22:02:24] but, the problem seems to be that the job queue (or wherever the request is coming from) is retrying the title .. so, exploring a couple of other leads.
[22:02:38] lost some time because I was fighting bad wifi here.
[22:07:32] our timeout handling should have handled it and it seems to be handling it looking at kibana .. timeout log events have spiked in the last few hours as expected. but, looks like processes are still stuck (even though they should have been restarted).
[22:08:42] hmm
[22:10:08] one temporary hack is to return a 500 just for this one title.
[22:10:48] subbu: we could potentially do that to punt this until MOnday
[22:10:49] *Monday
[22:11:00] subbu: if we are sure that is what is causing the issue
[22:11:33] it feels horrible, but I don’t think there’s going to be other opsen around today much...
[22:11:43] Where is the Foundation's main DC these days?
[22:11:45] i saw the same title on wtp1002 2 times i checked .. let me look on other cores.
[22:12:00] YuviPanda, and for now, can hardcode a 500 return for that title.
[22:12:15] ok
[22:12:40] Lcawte: Ashburn, Virginia
[22:12:42] aka eqiad
[22:13:09] YuviPanda: Thank you :)
[22:13:57] Lcawte: googling for that got me to http://www.dslreports.com/forum/r27672816-Suspicious-Activity-in-My-Computer
[22:14:03] which I found somewhat hilarious
[22:14:25] YuviPanda, yes .. same title on wtp1002 on multiple processes and on wtp1005 as well.
[22:14:51] subbu: hmm, right. so hack commit (+ phab task?) + deploy?
[22:15:06] !log restarted parsoid on wtp* hosts agian
[22:15:07] *again
[22:15:08] yup, will do.
[22:15:11] Logged the message, Master
[22:15:24] subbu: cool. any ETA?
[22:15:31] 10-15 mins at most.
[22:15:35] subbu: cool
[22:17:43] i'll do the fix on tin since master has a lot of commits (that have been tested) but will only be deployed on Monday.
[22:18:31] subbu: uh, isn’t there a deploy branch of sorts?
[22:19:02] no, we don't. something to fix with our deploy setup.
[22:19:16] subbu: ah, hmm. even then, perhaps just push it to gerrit and cherry-pick on tin?
[22:19:21] so that it exists outside of just tin
[22:21:22] grr .. wifi is really acting up on top of this.
[22:26:31] YuviPanda, ok .. deploying now.
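[editor's note] The "hardcode a 500 return for that title" hack can be sketched as below. Parsoid is a Node.js service, so this Python handler is only an illustration of the idea; `handle_parse_request`, the `BLOCKED` set, and the normalization-aware comparison are invented for the sketch, not the actual change deployed from tin.

```python
import unicodedata

# One known-bad (wiki, title) pair to short-circuit with an HTTP 500.
# Normalizing to NFC guards against visually identical titles that
# arrive in a different Unicode encoding.
BLOCKED = {("ur.wikipedia.org", unicodedata.normalize("NFC", "نام_مقامات_اے"))}

def handle_parse_request(domain, title):
    """Return (http_status, body); 500 for the one blocked title."""
    if (domain, unicodedata.normalize("NFC", title)) in BLOCKED:
        return 500, "Refusing to parse known-bad title"
    return 200, "parsed " + title
```

The point of the hack is to stop retry loops (e.g. from the job queue) from repeatedly wedging worker processes on one pathological page while the real parser fix is developed.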
[22:28:28] synced .. restarting service via dsh
[22:29:49] !log hotfix synced to parsoid cores (to return 500 for urwiki:نام_مقامات_اے); restart coming next
[22:29:58] Logged the message, Master
[22:33:39] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:34:39] PROBLEM - Parsoid on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:35:15] subbu: ^
[22:35:26] YuviPanda, the restarts are going slowly ..
[22:36:19] looking at ganglia, 1004 and 1006 didn't restart with my dsh restart .. so you might have to intervene on those stuck processes.
[22:36:26] let me do so
[22:36:46] the dsh command is now at wtp1012
[22:37:20] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:37:23] !log restarted parsoid on wtp1004
[22:37:31] Logged the message, Master
[22:37:59] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.021 second response time
[22:38:17] !log restarted parsoid on wtp1006
[22:38:29] !log restarted parsoid on wtp1010
[22:38:33] subbu: hmm, why dsh and not salt?
[22:38:43] oh, I guess you don’t have salt perms
[22:38:44] ok
[22:38:47] ya.
[22:38:48] Logged the message, Master
[22:38:56] Logged the message, Master
[22:38:59] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.026 second response time
[22:39:16] YuviPanda, I use the deploy process as on https://wikitech.wikimedia.org/wiki/Parsoid#Deploying_changes
[22:39:27] right
[22:39:30] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.013 second response time
[22:39:50] subbu: let me know if you need a rolling restart or something including a nodejs kill -9.
[22:40:35] subbu: in fact, it might be a good idea anyway...
[22:40:52] dsh is now at wtp1016
[22:40:52] subbu: mind if I do one?
[22:41:18] sure. go ahead.
[22:41:27] should I ^C the dsh then?
[22:41:31] subbu: yeah.
[22:41:37] done.
[22:41:46] will let you do the rolling restarts
[22:42:01] subbu: doing now.
[22:42:10] I can include a killall -9 nodejs in it
[22:42:16] to make sure runaway processes are dead too
[22:43:02] k
[22:46:41] subbu: restarts done
[22:46:54] ok, verifying version.
[22:47:23] seems the right version on all cores
[22:50:28] now to verify that it doesn't come back.
[22:51:07] YuviPanda, thanks for your help.
[22:52:48] subbu: :D yw! I’ll stick around for a while more to see if it stays down
[22:52:56] subbu: can you update ops@?
[22:53:06] and perhaps file a phab ticket for appropriate bug
[22:53:10] yes, i'll .. i have a draft ready to do ..
[22:53:13] *go
[22:53:43] cool
[22:56:01] hmm .. doesn't seem fixed looking at load.
[22:58:48] hmm, yeah, going back up
[22:59:03] I wonder, if one way to fix it is to delete that page or something, temporarily?
[22:59:39] that page is still being parsed, 500 is not being returned ... local dev.
[22:59:49] unlike local dev version on my laptop.
[23:00:39] uh oh.
[23:02:48] YuviPanda, the copy-paste from my laptop didn't work properly .. of course, encoding ..
[23:03:08] subbu: this is why we should’ve put it up on gerrit and cherry-picked it :)
[23:03:17] yup, you are right.
[23:03:19] subbu: also so it can be linked to easier. can we do that?
[23:03:45] we can.
[23:03:45] let me do it now.
[23:04:21] ok
[23:07:13] YuviPanda, https://gerrit.wikimedia.org/r/#/c/182640/
[23:07:51] subbu: cool! :D do deploy, I can do the restarts
[23:10:14] (CR) OliverKeyes: "....." [puppet] - https://gerrit.wikimedia.org/r/182558 (owner: OliverKeyes)
[23:11:13] * subbu is waiting for jenkins to merge
[23:12:00] I'm here now
[23:12:13] how are things? can I do anything?
[23:12:49] paravoid, i decided to return a http 500 for the title that is causing this since the fix seems more involved than I can do in a tired state over the weekend.
[23:13:08] makes sense
[23:13:14] we could do it from varnish as well
[23:13:31] paravoid, ah .. that might work too.
[23:13:45] ur.wikipedia.org/wiki/نام_مقامات_اے is the title
[23:14:10] why is this overloading parsoid?
[23:14:15] it has a 25K list
[23:14:19] and parsoid isn't handling it.
[23:14:26] wtf
[23:14:40] my browser can't handle it either
[23:15:05] we have timeout handling that should have restarted it ... but the processes are getting stuck instead of getting restarted.
[23:15:17] kibana is logging a lot of timeouts for this title.
[23:16:04] i botched the first deploy that returns a http 500 because i did a copy-paste of the code from my laptop to tin to deploy .. and because of encoding issues, the title seems to be different than the local dev version.
[23:16:34] so, the other option YuviPanda suggested was to temporarily blank the title, but that only requires someone to revert it to see this problem again.
[23:16:53] I was mostly making a superprotect joke
[23:17:01] ah .. you lost me :)
[23:17:05] :D
[23:17:21] subbu: did jenkins merge? should I do a restart?
[23:17:41] paravoid, so, should i try parsoid http 500 or should we do varnish?
[23:18:01] well if you're already on it, just go ahead
[23:18:05] hmm, I’m not sure of our setup but if we do a 500 from varnish can we log it? the parsoid code does...
[23:18:06] YuviPanda, yes, it merged. i can deploy ..
[23:18:17] subbu: please do. let me know when you need a clean restart
[23:18:28] and should really fix that bug in parsoid’s init script
[23:18:28] k. will let you know. hang on.
[23:19:17] (https://phabricator.wikimedia.org/T75395 being the init script related bug (perhaps?))
[23:22:03] YuviPanda, okay, synced code.
[23:22:06] requires a restart
[23:22:23] restarting. can you log with commit sha1?
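[editor's note] The botched first deploy is consistent with the title on tin not being byte-identical to the one carried by requests. A quick check of that hypothesis (illustrative, not the actual debugging done that night): decode the percent-encoded title from the URL found earlier and compare it with a pasted literal, with and without Unicode normalization, since NFC/NFD differences look identical on screen but fail a naive string comparison.

```python
from urllib.parse import unquote
import unicodedata

# Percent-encoded title from the ur.wikipedia.org URL found earlier
url_form = "%D9%86%D8%A7%D9%85_%D9%85%D9%82%D8%A7%D9%85%D8%A7%D8%AA_%D8%A7%DB%92"
from_url = unquote(url_form)   # decoded back to the Urdu title
pasted = "نام_مقامات_اے"       # the title as pasted into the hotfix

def same_title(a, b):
    """Compare titles independent of Unicode normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(from_url == pasted)              # raw comparison: encoding-sensitive
print(same_title(from_url, pasted))    # normalized comparison
```

Pushing the change through gerrit instead of copy-pasting across terminals, as suggested above, sidesteps the whole class of clipboard re-encoding problems.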
[23:23:05] !log Try #2: hotfix synced to parsoid cores (to return 500 for urwiki:نام_مقامات_اے); git sha 85d8818ec1b692aaab440630a119c539d63d5ca5
[23:23:11] Logged the message, Master
[23:23:15] YuviPanda, ^
[23:23:20] ty
[23:28:44] subbu: restart completed, btw
[23:28:57] let’s see if it stays down *this* time
[23:29:17] ya.
[23:30:45] * subbu has fingers crossed ...
[23:31:19] i am sending my wife and in-laws for dinner without me :) .. will go after I confirm this is fixed.
[23:32:34] heh, let’s wait for 10mins and then I can go to sleep too
[23:33:07] yes .. it is 5am for you.
[23:34:35] the timeouts on kibana are looking good so far. fewer.
[23:35:02] (PS2) OliverKeyes: Change the URLs used by Pybal to simplify tracking for Analytics [puppet] - https://gerrit.wikimedia.org/r/182558
[23:36:08] paravoid, YuviPanda, from kibana logstash ui: "Returning http 500 for urwiki:نام_مقامات_اے"
[23:36:15] whee
[23:36:24] so, i think that worked.
[23:36:59] ganglia looks greener now, although some hosts still have 50%+ CPU util
[23:37:18] not the best metric, perhaps
[23:37:32] now to figure out why/who is continuously hitting parsoid with that url .. not sure if it is coming from the job queue.
[23:37:44] but, that for another time. i will prepare the update for ops and send it out.
[23:43:43] YuviPanda, alright .. looking good to me.
[23:43:49] subbu: yeah, to me too
[23:43:56] subbu: I think I’ll get some sleep now.
[23:44:07] ok, hitting send on the email to ops list
[23:44:08] subbu: thanks for coming online to get this fixed!
[23:44:34] ya .. glad i noticed.
[23:44:51] :D
[23:45:06] subbu: a non-working parsoid implies dead VE and Flow only, right/
[23:45:06] ?
[23:45:18] yes.
[23:45:49] and content translation.
[23:45:55] right
[23:46:08] and possibly kiwix and ocg.
[23:46:17] emailed to ops.
[23:47:34] paravoid, i am off now .. back online in about 2 hours.
[23:47:48] am off too, back in… a day?
[23:48:18] * YuviPanda has the beginnings of a massive cold