[00:51:29] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 301 seconds
[00:52:30] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 365 seconds
[00:53:39] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds
[00:53:40] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[01:53:55] ops-core: Renumber virt0 - https://phabricator.wikimedia.org/T83843#952922 (faidon) Open>Resolved a: faidon Well, virt0 was a Tampa host and 208.80.152.0/25 was a Tampa subnet. This is obviously gone now :)
[01:53:58] ops-core: Renumber virt0 - https://phabricator.wikimedia.org/T83843#952925 (faidon)
[02:35:20] (PS1) OliverKeyes: Change the URLs used by Pybal to simplify tracking for Analytics [puppet] - https://gerrit.wikimedia.org/r/182558
[02:45:21] (CR) Ori.livneh: "Can't you reject requests originating in the cluster?" [puppet] - https://gerrit.wikimedia.org/r/182558 (owner: OliverKeyes)
[02:50:23] (CR) OliverKeyes: "We can, but:" [puppet] - https://gerrit.wikimedia.org/r/182558 (owner: OliverKeyes)
[03:00:36] (CR) Faidon Liambotis: [C: -1] "We really need the ability to check real pages and make sure that they work. ("Undefined" is a real page too, btw, and /could/ work for us" [puppet] - https://gerrit.wikimedia.org/r/182558 (owner: OliverKeyes)
[04:06:01] (CR) OliverKeyes: "Moving the problem, yes, but essentially it's amalgamating. As said in the commit message, we already have to exclude /wiki/Undefined due " [puppet] - https://gerrit.wikimedia.org/r/182558 (owner: OliverKeyes)
[05:38:31] PROBLEM - Disk space on search1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:05:30] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: puppet fail
[06:06:00] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:06:40] PROBLEM - puppet last run on amssq44 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:06:40] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:07:09] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:13:19] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0; ge-0/0/0: down - Core: msw-oe12-esams
[06:14:20] PROBLEM - puppet last run on amssq41 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:16:39] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:22:40] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:23:29] RECOVERY - puppet last run on amssq44 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:23:29] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:23:50] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:24:30] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:29:00] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:20] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:40] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:50] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:19] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:10] RECOVERY - puppet last run on amssq41 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:33:20] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:45:20] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:46:00] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:46:10] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:46:40] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:59] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:15:30] PROBLEM - Disk space on search1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:31:11] operations, Wikimedia-SSL-related: replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#953034 (Chmarkine)
[10:38:06] operations, Wikimedia-SSL-related: replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#953076 (Chmarkine)
[11:05:00] RECOVERY - Disk space on search1021 is OK: DISK OK
[11:09:19] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 961.60004874
[11:20:29] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667
[11:24:20] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[11:30:40] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[11:36:50] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[16:21:22] (CR) Chad: "Could switch it to Special:Blankpage." [puppet] - https://gerrit.wikimedia.org/r/182558 (owner: OliverKeyes)
[16:51:20] PROBLEM - Parsoid on wtp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:00:30] anybody around?
[17:00:49] PROBLEM - Parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:00:52] the parsoid cluster is seeing very high load
[17:01:06] and I don't seem to have the right to restart parsoids any more
[17:01:51] /cc paravoid akosiaris ori godog _joe_
[17:07:10] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:07:21] <_joe_> gwicke: uh, strange
[17:07:32] <_joe_> gwicke: should I just restart parsoid?
[17:08:00] PROBLEM - Parsoid on wtp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:08:28] _joe_: yes
[17:08:38] <_joe_> on the whole cluster?
[17:08:45] in the logs I'm seeing a second parsoid process trying to start up with the current one still running
[17:08:49] _joe_: yes
[17:09:26] <_joe_> gwicke: mmmh it looks worse than just needing a restart
[17:09:39] there is some upstart issue where it doesn't really kill all processes before starting the new process
[17:10:14] so a manual kill -9 node && service parsoid start might be in order
[17:10:37] <_joe_> that's what I am doing
[17:11:19] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.017 second response time
[17:11:31] * gwicke is looking forward to systemd
[17:11:51] <_joe_> 'service parsoid stop; killall -9 nodejs; service parsoid start'
[17:11:54] <_joe_> ok?
[17:11:57] yes
[17:12:52] (PS3) Hoo man: Fix sitelinkgroups for Wikibase clients [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[17:13:19] <_joe_> !log restarting parsoid across the cluster
[17:13:29] Logged the message, Master
[17:13:30] (PS4) Hoo man: Fix specialSiteLinkGroups for Wikibase clients [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[17:14:19] there are a lot of lines like this in /var/log/parsoid/parsoid.log:
[17:14:20] {"name":"parsoid","hostname":"wtp1019","pid":31788,"level":50,"logType":"error","process":{"name":"worker","pid":31788},"msg":"Port 8000 is already in use. Exiting.","longMsg":"Port 8000 is already in use. Exiting.","time":"2015-01-03T17:13:50.965Z","v":0}
[17:14:30] evidently a second parsoid process trying to start up
[17:14:38] <_joe_> possibly
[17:14:50] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.017 second response time
[17:15:07] <_joe_> I am restarting parsoid on 2 servers at a time
[17:15:11] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.016 second response time
[17:15:44] wtp1019 is back to normal
[17:16:21] the traffic pattern I see there looks normalish too
[17:17:29] load on the cluster went up quite a bit a few hours ago: https://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&c=Parsoid+eqiad&m=cpu_report&s=by+name&mc=2&g=cpu_report
[17:18:29] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 13 failures
[17:18:36] <_joe_> gwicke: I am on vacation until monday, so I didn't take the time to debug this better
[17:18:49] while network went down in the same period: https://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&c=Parsoid+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report
[17:18:55] <_joe_> but systems where I restarted node are ok now
[17:18:57] so not an external load surge it seems
[17:19:28] *nod*
[17:19:36] <_joe_> I see quite the opposite
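The 'service parsoid stop; killall -9 nodejs; service parsoid start' sequence _joe_ runs above can be sketched as a small shell helper. This is only a sketch of what the log describes, not a deployed script: the upstart `parsoid` job and the `nodejs` process name are taken from the log, and `restart_parsoid_host` is a made-up name.

```shell
# Hypothetical helper mirroring the manual per-host restart described above.
# Assumes an upstart "parsoid" job whose workers run as "nodejs"; upstart
# can leave stale workers behind on restart (see T75395).
restart_parsoid_host() {
  service parsoid stop || true           # may fail if the job is already wedged
  killall -9 nodejs 2>/dev/null || true  # ensure no stale worker still holds port 8000
  service parsoid start
}
```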
[17:20:38] <_joe_> we had a network traffic surge that eventually resulted in a drop when the cpu skyrocketed, so looks like some anomalous load made node get into a bad state
[17:20:59] cpu skyrocketed after network had dropped already
[17:21:09] <_joe_> anyways, it's recovering now
[17:21:27] cpu went up at 3:30, network dropped at around 2:30
[17:21:38] <_joe_> right
[17:21:39] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.013 second response time
[17:22:06] I'm also on vacation, so might not spend my entire morning investigating this
[17:22:14] <_joe_> :)
[17:22:19] <_joe_> wise choice
[17:22:27] <_joe_> it's mostly solved now btw
[17:23:06] the restart failure is tracked in https://phabricator.wikimedia.org/T75395
[17:23:41] <_joe_> ok, everything restarted, we should be ok now
[17:23:48] * _joe_ goes away again
[17:23:57] _joe_: thanks for your help!
[17:24:12] and enjoy the remainder of your vacation
[17:28:09] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail
[17:36:19] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[17:38:15] hmm, load seems to be climbing back up again
[17:46:59] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:48:54] there is very little external traffic in the parsoid varnish frontend (grep -v 10.64)
[18:03:18] most requests are coming from mw* hosts; https://gdash.wikimedia.org/dashboards/apimethods/ is sadly broken, so I'm wondering if there is any other monitoring for API methods
[18:03:36] didn't find anything for the visualeditor action in graphite
[18:08:19] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[18:09:30] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[18:13:39] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0
[18:15:50] PROBLEM - Parsoid on wtp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:16:09] PROBLEM - Parsoid on wtp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:17:09] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail
[18:28:19] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[18:28:20] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[18:33:18] and we are back in a restart loop: {"name":"parsoid","hostname":"wtp1019","pid":21289,"level":50,"logType":"error","process":{"name":"worker","pid":21289},"msg":"Port 8000 is already in use. Exiting.","longMsg":"Port 8000 is already in use. Exiting.","time":"2015-01-03T18:32:54.835Z","v":0}
[18:34:00] load is also really high /cc _joe_ akosiaris paravoid ori godog jgage mutante
[18:38:00] PROBLEM - Parsoid on wtp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:48:40] PROBLEM - RAID on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:49:39] PROBLEM - puppet last run on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
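The "Port 8000 is already in use" error above means a stale worker still holds parsoid's listening port when the new process tries to bind. A quick way to check from the host itself is to look at the listening sockets (a sketch; it assumes a Linux box with `ss` from iproute2 available, which the wtp* hosts should have):

```shell
# Report whether anything is still listening on parsoid's port 8000,
# matching the "Port 8000 is already in use" errors in parsoid.log.
if ss -tln | grep -q ':8000 '; then
  echo "port 8000 busy"
else
  echo "port 8000 free"
fi
```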
[18:49:50] RECOVERY - RAID on wtp1005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[18:50:31] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[18:52:03] (PS1) GWicke: Make parsoid roots actually root on the parsoid boxes [puppet] - https://gerrit.wikimedia.org/r/182585
[18:52:30] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:52:50] PROBLEM - Parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:53:10] PROBLEM - puppet last run on wtp1022 is CRITICAL: CRITICAL: Puppet has 10 failures
[18:55:30] I got to go, hope somebody else can take over handling the parsoid issue
[18:57:10] PROBLEM - SSH on wtp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:57:29] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:57:51] PROBLEM - Parsoid on wtp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:57:56] (CR) Aude: [C: -1] "looks ok, generally however this is also a repo setting (see comments)" (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[18:58:00] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:58:49] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:59:09] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures
[18:59:29] RECOVERY - SSH on wtp1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[18:59:49] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[18:59:50] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:02:50] PROBLEM - SSH on wtp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:03:10] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:03:20] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:03:40] PROBLEM - RAID on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:03:49] PROBLEM - Parsoid on wtp1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:04:29] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures
[19:06:00] RECOVERY - SSH on wtp1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:06:19] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:06:59] PROBLEM - SSH on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:07:00] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:07:40] PROBLEM - puppet last run on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:07:49] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:08:00] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.011 second response time
[19:08:50] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:09:20] PROBLEM - SSH on wtp1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:10:19] RECOVERY - SSH on wtp1003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:11:00] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:11:10] RECOVERY - SSH on wtp1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:11:20] RECOVERY - RAID on wtp1004 is OK: OK: no disks configured for RAID
[19:11:59] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures
[19:12:09] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[19:12:19] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:13:10] PROBLEM - dhclient process on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:13:10] PROBLEM - DPKG on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:13:19] PROBLEM - salt-minion processes on wtp1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:13:19] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:13:19] PROBLEM - puppet last run on wtp1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:13:20] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures
[19:13:29] PROBLEM - SSH on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:13:40] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:14:09] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:14:10] RECOVERY - dhclient process on wtp1006 is OK: PROCS OK: 0 processes with command name dhclient
[19:14:10] RECOVERY - DPKG on wtp1006 is OK: All packages OK
[19:14:20] RECOVERY - salt-minion processes on wtp1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:14:29] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures
[19:14:29] PROBLEM - puppet last run on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:15:20] PROBLEM - parsoid disk space on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:15:30] PROBLEM - salt-minion processes on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:15:40] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:16:19] PROBLEM - DPKG on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:16:20] RECOVERY - parsoid disk space on wtp1004 is OK: DISK OK
[19:16:30] RECOVERY - salt-minion processes on wtp1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:16:30] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[19:16:32] (CR) Aude: Fix specialSiteLinkGroups for Wikibase clients (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[19:17:20] RECOVERY - DPKG on wtp1012 is OK: All packages OK
[19:17:40] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:18:09] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 30 minutes ago with 0 failures
[19:18:29] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:19:50] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:20:50] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:21:30] PROBLEM - puppet last run on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:22:40] PROBLEM - RAID on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:22:40] PROBLEM - configured eth on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:22:40] PROBLEM - dhclient process on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:22:47] (CR) Tpt: "I'm not sure it's a good idea to merge the siteLinkGroups option of the clients with the same option of the repos because we will have to " [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[19:23:00] PROBLEM - Parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:23:29] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:23:31] PROBLEM - SSH on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:23:39] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures
[19:23:40] PROBLEM - RAID on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:23:40] RECOVERY - RAID on wtp1004 is OK: OK: no disks configured for RAID
[19:23:40] RECOVERY - dhclient process on wtp1004 is OK: PROCS OK: 0 processes with command name dhclient
[19:23:40] RECOVERY - configured eth on wtp1004 is OK: NRPE: Unable to read output
[19:23:50] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:24:29] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:24:39] RECOVERY - RAID on wtp1003 is OK: OK: no disks configured for RAID
[19:24:50] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures
[19:25:30] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:25:39] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:26:49] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:27:59] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 30 minutes ago with 0 failures
[19:28:00] PROBLEM - RAID on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:28:40] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:29:00] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:29:19] PROBLEM - DPKG on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:29:29] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:30:00] PROBLEM - puppet last run on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:30:11] RECOVERY - DPKG on wtp1014 is OK: All packages OK
[19:31:30] PROBLEM - DPKG on wtp1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:31:39] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 44 minutes ago with 0 failures
[19:32:09] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:32:19] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:32:29] RECOVERY - RAID on wtp1003 is OK: OK: no disks configured for RAID
[19:32:30] RECOVERY - DPKG on wtp1009 is OK: All packages OK
[19:33:40] PROBLEM - puppet last run on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:09] PROBLEM - salt-minion processes on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:10] PROBLEM - RAID on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:20] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures
[19:34:20] PROBLEM - DPKG on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:30] PROBLEM - SSH on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:34:30] PROBLEM - configured eth on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:34:30] PROBLEM - RAID on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:35:10] PROBLEM - SSH on wtp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:35:10] PROBLEM - Disk space on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:35:10] PROBLEM - Parsoid on wtp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:35:30] PROBLEM - Disk space on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:35:30] PROBLEM - dhclient process on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:35:30] PROBLEM - parsoid disk space on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:35:39] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures
[19:35:59] PROBLEM - Parsoid on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:36:09] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.010 second response time
[19:36:09] RECOVERY - SSH on wtp1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:36:09] RECOVERY - Disk space on wtp1002 is OK: DISK OK
[19:36:10] RECOVERY - salt-minion processes on wtp1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:36:10] RECOVERY - RAID on wtp1014 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:36:29] RECOVERY - dhclient process on wtp1006 is OK: PROCS OK: 0 processes with command name dhclient
[19:36:29] RECOVERY - Disk space on wtp1006 is OK: DISK OK
[19:36:29] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:36:30] RECOVERY - DPKG on wtp1006 is OK: All packages OK
[19:36:30] RECOVERY - parsoid disk space on wtp1006 is OK: DISK OK
[19:36:39] RECOVERY - SSH on wtp1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:36:39] RECOVERY - configured eth on wtp1006 is OK: NRPE: Unable to read output
[19:36:39] RECOVERY - RAID on wtp1006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:36:40] PROBLEM - puppet last run on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:36:50] PROBLEM - SSH on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:36:50] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.012 second response time
[19:36:50] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures
[19:37:11] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:37:30] PROBLEM - DPKG on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:37:40] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 19 minutes ago with 0 failures
[19:39:04] (PS5) Tpt: Display links to Wikidata in the other project sidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/181871
[19:39:19] PROBLEM - RAID on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:39:40] PROBLEM - configured eth on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:39:49] PROBLEM - salt-minion processes on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:39:49] PROBLEM - dhclient process on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:39:49] PROBLEM - Parsoid on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:39:50] PROBLEM - RAID on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:40:10] PROBLEM - Disk space on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:40:20] PROBLEM - configured eth on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:40:21] RECOVERY - RAID on wtp1004 is OK: OK: no disks configured for RAID
[19:40:39] PROBLEM - parsoid disk space on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:40:40] RECOVERY - configured eth on wtp1014 is OK: NRPE: Unable to read output
[19:40:50] RECOVERY - RAID on wtp1014 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:41:10] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:41:20] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:41:56] (PS6) Tpt: Display links to Wikidata in the other project sidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/181871
[19:42:25] (CR) Tpt: "PS 5-6: removes duplication of sitegroup configuration" [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[19:42:40] RECOVERY - parsoid disk space on wtp1012 is OK: DISK OK
[19:43:00] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.715 second response time
[19:43:00] RECOVERY - dhclient process on wtp1012 is OK: PROCS OK: 0 processes with command name dhclient
[19:43:00] RECOVERY - salt-minion processes on wtp1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:43:10] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:43:20] RECOVERY - Disk space on wtp1012 is OK: DISK OK
[19:43:29] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 45 minutes ago with 0 failures
[19:43:29] RECOVERY - configured eth on wtp1012 is OK: NRPE: Unable to read output
[19:43:29] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:44:10] PROBLEM - Disk space on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:45:00] PROBLEM - puppet last run on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:45:10] RECOVERY - Disk space on wtp1004 is OK: DISK OK
[19:45:40] PROBLEM - puppet last run on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:45:59] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures
[19:45:59] PROBLEM - SSH on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:46:00] PROBLEM - RAID on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:46:20] PROBLEM - dhclient process on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:46:20] PROBLEM - salt-minion processes on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:46:40] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[19:47:00] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:47:10] PROBLEM - RAID on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:47:10] PROBLEM - dhclient process on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:47:19] PROBLEM - SSH on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:47:30] RECOVERY - dhclient process on wtp1012 is OK: PROCS OK: 0 processes with command name dhclient
[19:47:30] RECOVERY - salt-minion processes on wtp1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:47:40] RECOVERY - DPKG on wtp1012 is OK: All packages OK
[19:48:00] PROBLEM - SSH on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:48:09] RECOVERY - SSH on wtp1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:48:09] RECOVERY - dhclient process on wtp1004 is OK: PROCS OK: 0 processes with command name dhclient
[19:48:09] RECOVERY - RAID on wtp1011 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:48:10] RECOVERY - SSH on wtp1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:49:09] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:49:10] RECOVERY - RAID on wtp1004 is OK: OK: no disks configured for RAID
[19:49:30] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[19:50:00] PROBLEM - Disk space on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:50:00] PROBLEM - dhclient process on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:50:00] PROBLEM - parsoid disk space on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:50:09] PROBLEM - puppet last run on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:50:19] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:51:01] RECOVERY - dhclient process on wtp1007 is OK: PROCS OK: 0 processes with command name dhclient
[19:51:01] RECOVERY - Disk space on wtp1007 is OK: DISK OK
[19:51:01] RECOVERY - parsoid disk space on wtp1007 is OK: DISK OK
[19:52:30] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:53:00] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:53:10] PROBLEM - DPKG on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:53:20] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:53:20] RECOVERY - puppet last run on wtp1007 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures
[19:53:29] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:54:09] PROBLEM - SSH on wtp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:54:40] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 56 minutes ago with 0 failures
[19:55:00] RECOVERY - SSH on wtp1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[19:55:29] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[19:55:29] RECOVERY - DPKG on wtp1012 is OK: All packages OK
[19:55:40] PROBLEM - puppet last run on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:56:10] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[19:56:49] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:57:41] hmm
[19:57:52] gwicke: or cscott_away or anyone else who knows about parsoid around?
[19:58:04] oh, there was an email
[19:58:05] * YuviPanda checks
[19:58:20] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 6 failures
[19:58:50] PROBLEM - DPKG on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:00] PROBLEM - puppet last run on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:10] PROBLEM - Disk space on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:10] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:10] PROBLEM - configured eth on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:10] PROBLEM - SSH on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:59:19] PROBLEM - Parsoid on wtp1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:59:30] PROBLEM - parsoid disk space on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:40] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:50] PROBLEM - dhclient process on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:59:50] PROBLEM - salt-minion processes on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:00:10] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures
[20:00:59] RECOVERY - dhclient process on wtp1012 is OK: PROCS OK: 0 processes with command name dhclient
[20:00:59] RECOVERY - salt-minion processes on wtp1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:01:10] RECOVERY - DPKG on wtp1012 is OK: All packages OK
[20:01:20] RECOVERY - Disk space on wtp1012 is OK: DISK OK
[20:01:30] RECOVERY - configured eth on wtp1012 is OK: NRPE: Unable to read output
[20:01:30] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[20:01:39] RECOVERY - parsoid disk space on wtp1012 is OK: DISK OK
[20:01:49] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[20:02:20] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:03:40] PROBLEM - RAID on wtp1024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:03:40] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[20:04:39] RECOVERY - RAID on wtp1024 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[20:05:50] PROBLEM - puppet last run on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:06:07] !log restarting parsoid on wtp1008
[20:06:15] Logged the message, Master
[20:06:20] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.019 second response time
[20:06:36] (PS7) Tpt: Display links to Wikidata in the other project sidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/181871
[20:06:40] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[20:07:00] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.021 second response time
[20:07:30] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:07:40] PROBLEM - SSH on wtp1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:07:59] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 17 minutes ago with 0 failures
[20:08:40] RECOVERY - SSH on wtp1024 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[20:10:43] hmm, strace wasn’t really helpful
[20:10:49] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[20:12:19] !log restarting parsoid on all wtp* hosts
[20:12:28] Logged the message, Master
[20:12:39] PROBLEM - SSH on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:12:39] PROBLEM - configured eth on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:12:39] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:12:40] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:13:21] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.018 second response time
[20:13:30] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[20:13:40] RECOVERY - configured eth on wtp1012 is OK: NRPE: Unable to read output
[20:15:00] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.023 second response time
[20:15:10] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.021 second response time
[20:15:10] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.012 second response time
[20:15:20] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.011 second response time
[20:15:20] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.010 second response time
[20:15:32] (CR) Aude: [C: +1] Display links to Wikidata in the other project sidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[20:16:09] PROBLEM - RAID on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:16:10] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[20:16:19] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.024 second response time
[20:17:00] RECOVERY - RAID on wtp1004 is OK: OK: no disks configured for RAID
[20:17:30] PROBLEM - configured eth on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:17:40] PROBLEM - Parsoid on wtp1011 is CRITICAL: HTTP CRITICAL - No data received from host
[20:18:00] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:18:10] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:18:20] RECOVERY - configured eth on wtp1011 is OK: NRPE: Unable to read output
[20:18:33] (CR) Hoo man: [C: -1] "New global should probably be unset at the end of the script. Fine despite." [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[20:18:40] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.016 second response time
[20:18:50] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.010 second response time
[20:19:50] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:22:21] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[20:22:30] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[20:23:00] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[20:23:49] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[20:26:49] hmm
[20:28:55] hey subbu
[20:29:08] hi
[20:29:20] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[20:29:20] !log manually restarted parsoid on wtp1012
[20:29:27] Logged the message, Master
[20:29:42] i just saw gwicke's mails .. and looking at logstash to see what i can find there.
[20:30:00] subbu: cool. I can be around for as long as you like.
[20:30:01] my sense is that parsoid is choking on commons.wikimedia.org/wiki/User:Nordlicht8/Equestriansports/2014_November_21-30
[20:30:19] a simple strace showed a lot of futex waiting, but I didn’t investigate
[20:30:23] much
[20:30:37] oh
[20:31:08] but, i am not sure ..trying that page locally
[20:31:33] i see a bunch of extension failure errors in logstash for that page.
[20:32:23] it sure has zillions of photos on it.
[20:32:33] but, gwicke have you found anything so far?
[20:33:58] dunno if gwicke is still around
[20:34:15] I got to go, hope somebody else can take over handling the parsoid issue
[20:34:35] ok.
[20:35:03] no, that page finished parsing locally.
[20:36:22] I wish people used !log
[20:38:38] (CR) Hoo man: Display links to Wikidata in the other project sidebar (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[20:39:33] subbu: hmm, some hosts are loading up again
[20:39:35] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Parsoid+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[20:39:53] yes .. i see it. i am looking at logs to see what page it is choking on.
[20:40:06] ok
[20:43:45] ok, found a/the page .. http://ur.wikipedia.org/wiki/%D9%86%D8%A7%D9%85_%D9%85%D9%82%D8%A7%D9%85%D8%A7%D8%AA_%D8%A7%DB%92
[20:44:20] now, have to figure out how to fix it.
[20:44:50] subbu: are you still in India? :)
[20:45:09] no, back in US .. currently in cleveland .. back in mpls tomorrow
[20:45:13] ah, nice
[20:45:21] so it isn’t crazy late for you :D good.
[20:45:37] no :)
[20:45:43] hmm, should I restart things once again to give you time to fix things?
[20:46:15] yes .. we might have to do it a few times .. unless the job clears out from the job queue (assuming that is where it is coming from).
[20:46:38] !log restarting parsoid on wtp* again
[20:46:46] Logged the message, Master
[20:49:00] subbu: doing a rolling restart 2 machines at a time
[20:49:05] k
[20:49:08] should probably increase the number of machines per hit next time
[20:49:18] subbu: I can be around for another hour, I think.
[20:49:24] YuviPanda, thanks very much btw for sticking around and helping out.
[20:49:29] k.
[20:49:43] i should have a handle on it well within that time, i am hoping.
[20:50:00] PROBLEM - SSH on wtp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:50:20] it is a gigantic 25K list .. i assume our list algorithm needs fixup.
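[editor's note] The rolling restart described above ("2 machines at a time", later with a `killall -9` of runaway node processes) can be sketched roughly as below. This is a hypothetical illustration, not the actual dsh/salt tooling used that day: the host list, the ssh invocation, and the service commands are all assumptions.

```python
import subprocess

def batches(hosts, size=2):
    """Yield consecutive groups of at most `size` hosts."""
    for i in range(0, len(hosts), size):
        yield hosts[i:i + size]

def rolling_restart(hosts, size=2, dry_run=True):
    """Restart parsoid `size` hosts at a time, finishing each batch
    before moving on so most of the pool keeps serving traffic."""
    for batch in batches(hosts, size):
        for host in batch:
            # kill -9 any runaway node processes first, then restart
            cmd = ["ssh", host,
                   "sudo killall -9 nodejs; sudo service parsoid restart"]
            if dry_run:
                print("would run:", " ".join(cmd))
            else:
                subprocess.run(cmd, check=True)

wtp_hosts = [f"wtp{n}" for n in range(1001, 1025)]  # wtp1001..wtp1024
rolling_restart(wtp_hosts[:4])  # dry run over the first two batches
```

Batching two at a time trades total restart time for availability; "increase the number of machines per hit" corresponds to raising `size`.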
[20:51:00] RECOVERY - SSH on wtp1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[20:51:21] :)
[20:51:29] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:52:20] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[20:55:50] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:56:00] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:56:50] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical
[20:57:00] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures
[21:10:10] PROBLEM - puppet last run on amssq40 is CRITICAL: CRITICAL: puppet fail
[21:17:36] subbu: am restarting again.
[21:17:49] subbu: when done, I’ll go take a shower and be back in another 20mins or so.
[21:18:13] YuviPanda, sounds good.
[21:19:50] !log restarting parsoid on wtp* hosts again
[21:19:56] Logged the message, Master
[21:28:50] (PS8) Tpt: Display links to Wikidata in the other project sidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/181871
[21:29:12] (CR) Tpt: "PS 8: Removes an obsolete comment" [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt)
[21:30:19] RECOVERY - puppet last run on amssq40 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[21:31:15] subbu: alright, I’m going afk, will be back in about 20mins
[21:31:43] k
[22:01:04] subbu: any luck?
[22:01:09] it’s back! :)
[22:01:39] i know .. fixing the parse is a bigger issue than a simple quick hack .. so, i started looking at timeouts ...
[22:02:24] but, the problem seems to be that the job queue (or wherever the request is coming from) is retrying the title .. so, exploring a couple of other leads.
[22:02:38] lost some time because I was fighting bad wifi here.
[22:07:32] our timeout handling should have handled it and it seems to be handling it looking at kibana .. timeout log events have spiked in the last few hours as expected. but, looks like processes are still stuck (even though they should have been restarted).
[22:08:42] hmm
[22:10:08] one temporary hack is to return a 500 just for this one title.
[22:10:48] subbu: we could potentially do that to punt this until MOnday
[22:10:49] *Monday
[22:11:00] subbu: if we are sure that is what is causing the issue
[22:11:33] it feels horrible, but I don’t think there’s going to be other opsen around today much...
[22:11:43] Where is the Foundation's main DC these days?
[22:11:45] i saw the same title on wtp1002 2 times i checked .. let me look on other cores.
[22:12:00] YuviPanda, and for now, can hardcode a 500 return for that title.
[22:12:15] ok
[22:12:40] Lcawte: Ashburn, Virginia
[22:12:42] aka eqiad
[22:13:09] YuviPanda: Thank you :)
[22:13:57] Lcawte: googling for that got me to http://www.dslreports.com/forum/r27672816-Suspicious-Activity-in-My-Computer
[22:14:03] which I found somewhat hilarious
[22:14:25] YuviPanda, yes .. same title on wtp1002 on multiple processes and on wtp1005 as well.
[22:14:51] subbu: hmm, right. so hack commit (+ phab task?) + deploy?
[22:15:06] !log restarted parsoid on wtp* hosts agian
[22:15:07] *again
[22:15:08] yup, will do.
[22:15:11] Logged the message, Master
[22:15:24] subbu: cool. any ETA?
[22:15:31] 10-15 mins at most.
[22:15:35] subbu: cool
[22:17:43] i'll do the fix on tin since master has a lot of commits (that have been tested) but will only be deployed on Monday.
[22:18:31] subbu: uh, isn’t there a deploy branch of sorts?
[22:19:02] no, we don't. something to fix with our deploy setup.
[22:19:16] subbu: ah, hmm. even then, perhaps just push it to gerrit and cherry-pick on tin?
[22:19:21] so that it exists outside of just tin
[22:21:22] grr .. wifi is really acting up on top of this.
[22:26:31] YuviPanda, ok .. deploying now.
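[editor's note] The "hardcode a 500 return for that title" hack can be sketched as below. Parsoid is a Node.js service, so this Python handler is only an illustration of the idea; `handle_parse_request`, the `BLOCKED` set, and the normalization-aware comparison are invented for the sketch, not the actual change deployed from tin.

```python
import unicodedata

# One known-bad (wiki, title) pair to short-circuit with an HTTP 500.
# Normalizing to NFC guards against visually identical titles that
# arrive in a different Unicode encoding.
BLOCKED = {("ur.wikipedia.org", unicodedata.normalize("NFC", "نام_مقامات_اے"))}

def handle_parse_request(domain, title):
    """Return (http_status, body); 500 for the one blocked title."""
    if (domain, unicodedata.normalize("NFC", title)) in BLOCKED:
        return 500, "Refusing to parse known-bad title"
    return 200, "parsed " + title
```

The point of the hack is to stop retry loops (e.g. from the job queue) from repeatedly wedging worker processes on one pathological page while the real parser fix is developed.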
[22:28:28] synced .. restarting service via dsh
[22:29:49] !log hotfix synced to parsoid cores (to return 500 for urwiki:نام_مقامات_اے); restart coming next
[22:29:58] Logged the message, Master
[22:33:39] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:34:39] PROBLEM - Parsoid on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:35:15] subbu: ^
[22:35:26] YuviPanda, the restarts are going slowly ..
[22:36:19] looking at ganglia, 1004 and 1006 didn't restart with my dsh restart .. so you might have to intervene on those stuck processes.
[22:36:26] let me do so
[22:36:46] the dsh command is now at wtp1012
[22:37:20] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:37:23] !log restarted parsoid on wtp1004
[22:37:31] Logged the message, Master
[22:37:59] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.021 second response time
[22:38:17] !log restarted parsoid on wtp1006
[22:38:29] !log restarted parsoid on wtp1010
[22:38:33] subbu: hmm, why dsh and not salt?
[22:38:43] oh, I guess you don’t have salt perms
[22:38:44] ok
[22:38:47] ya.
[22:38:48] Logged the message, Master
[22:38:56] Logged the message, Master
[22:38:59] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.026 second response time
[22:39:16] YuviPanda, I use the deploy process as on https://wikitech.wikimedia.org/wiki/Parsoid#Deploying_changes
[22:39:27] right
[22:39:30] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.013 second response time
[22:39:50] subbu: let me know if you need a rolling restart or something including a nodejs kill -9.
[22:40:35] subbu: in fact, it might be a good idea anyway...
[22:40:52] dsh is now at wtp1016
[22:40:52] subbu: mind if I do one?
[22:41:18] sure. go ahead.
[22:41:27] should I ^C the dsh then?
[22:41:31] subbu: yeah.
[22:41:37] done.
[22:41:46] will let you do the rolling restarts
[22:42:01] subbu: doing now.
[22:42:10] I can include a killall -9 nodejs in it
[22:42:16] to make sure runaway processes are dead too
[22:43:02] k
[22:46:41] subbu: restarts done
[22:46:54] ok, verifying version.
[22:47:23] seems the right version on all cores
[22:50:28] now to verify that it doesn't come back.
[22:51:07] YuviPanda, thanks for your help.
[22:52:48] subbu: :D yw! I’ll stick around for a while more to see if it stays down
[22:52:56] subbu: can you update ops@?
[22:53:06] and perhaps file a phab ticket for appropriate bug
[22:53:10] yes, i'll .. i have a draft ready to do ..
[22:53:13] *go
[22:53:43] cool
[22:56:01] hmm .. doesn't seem fixed looking at load.
[22:58:48] hmm, yeah, going back up
[22:59:03] I wonder, if one way to fix it is to delete that page or something, temporarily?
[22:59:39] that page is still being parsed, 500 is not being returned ... local dev.
[22:59:49] unlike local dev version on my laptop.
[23:00:39] uh oh.
[23:02:48] YuviPanda, the copy-paste from my laptop didn't work properly .. of course, encoding ..
[23:03:08] subbu: this is why we should’ve put it up on gerrit and cherry-picked it :)
[23:03:17] yup, you are right.
[23:03:19] subbu: also so it can be linked to easier. can we do that?
[23:03:45] we can.
[23:03:45] let me do it now.
[23:04:21] ok
[23:07:13] YuviPanda, https://gerrit.wikimedia.org/r/#/c/182640/
[23:07:51] subbu: cool! :D do deploy, I can do the restarts
[23:10:14] (CR) OliverKeyes: "....." [puppet] - https://gerrit.wikimedia.org/r/182558 (owner: OliverKeyes)
[23:11:13] * subbu is waiting for jenkins to merge
[23:12:00] I'm here now
[23:12:13] how are things? can I do anything?
[23:12:49] paravoid, i decided to return a http 500 for the title that is causing this since the fix seems more involved than I can do in a tired state over the weekend.
[23:13:08] makes sense
[23:13:14] we could do it from varnish as well
[23:13:31] paravoid, ah .. that might work too.
[23:13:45] ur.wikipedia.org/wiki/نام_مقامات_اے is the title
[23:14:10] why is this overloading parsoid?
[23:14:15] it has a 25K list
[23:14:19] and parsoid isn't handling it.
[23:14:26] wtf
[23:14:40] my browser can't handle it either
[23:15:05] we have timeout handling that should have restarted it ... but the processes are getting stuck instead of getting restarted.
[23:15:17] kibana is logging a lot of timeouts for this title.
[23:16:04] i botched the first deploy that returns a http 500 because i did a copy-paste of the code from my laptop to tin to deploy .. and because of encoding issues, the title seems to be different than the local dev version.
[23:16:34] so, the other option YuviPanda suggested was to temporarily blank the title, but that only requires someone to revert it to see this problem again.
[23:16:53] I was mostly making a superprotect joke
[23:17:01] ah .. you lost me :)
[23:17:05] :D
[23:17:21] subbu: did jenkins merge? should I do a restart?
[23:17:41] paravoid, so, should i try parsoid http 500 or should we do varnish?
[23:18:01] well if you're already on it, just go ahead
[23:18:05] hmm, I’m not sure of our setup but if we do a 500 from varnish can we log it? the parsoid code does...
[23:18:06] YuviPanda, yes, it merged. i can deploy ..
[23:18:17] subbu: please do. let me know when you need a clean restart
[23:18:28] and should really fix that bug in parsoid’s init script
[23:18:28] k. will let you know. hang on.
[23:19:17] (https://phabricator.wikimedia.org/T75395 being the init script related bug (perhaps?))
[23:22:03] YuviPanda, okay, synced code.
[23:22:06] requires a restart
[23:22:23] restarting. can you log with commit sha1?
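[editor's note] The botched first deploy is consistent with the title on tin not being byte-identical to the one carried by requests. A quick check of that hypothesis (illustrative, not the actual debugging done that night): decode the percent-encoded title from the URL found earlier and compare it with a pasted literal, with and without Unicode normalization, since NFC/NFD differences look identical on screen but fail a naive string comparison.

```python
from urllib.parse import unquote
import unicodedata

# Percent-encoded title from the ur.wikipedia.org URL found earlier
url_form = "%D9%86%D8%A7%D9%85_%D9%85%D9%82%D8%A7%D9%85%D8%A7%D8%AA_%D8%A7%DB%92"
from_url = unquote(url_form)   # decoded back to the Urdu title
pasted = "نام_مقامات_اے"       # the title as pasted into the hotfix

def same_title(a, b):
    """Compare titles independent of Unicode normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(from_url == pasted)              # raw comparison: encoding-sensitive
print(same_title(from_url, pasted))    # normalized comparison
```

Pushing the change through gerrit instead of copy-pasting across terminals, as suggested above, sidesteps the whole class of clipboard re-encoding problems.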
[23:23:05] !log Try #2: hotfix synced to parsoid cores (to return 500 for urwiki:نام_مقامات_اے); git sha 85d8818ec1b692aaab440630a119c539d63d5ca5
[23:23:11] Logged the message, Master
[23:23:15] YuviPanda, ^
[23:23:20] ty
[23:28:44] subbu: restart completed, btw
[23:28:57] let’s see if it stays down *this* time
[23:29:17] ya.
[23:30:45] * subbu has fingers crossed ...
[23:31:19] i am sending my wife and in-laws for dinner without me :) .. will go after I confirm this is fixed.
[23:32:34] heh, let’s wait for 10mins and then I can go to sleep too
[23:33:07] yes .. it is 5am for you.
[23:34:35] the timeouts on kibana are looking good so far. fewer.
[23:35:02] (PS2) OliverKeyes: Change the URLs used by Pybal to simplify tracking for Analytics [puppet] - https://gerrit.wikimedia.org/r/182558
[23:36:08] paravoid, YuviPanda, from kibana logstash ui: "Returning http 500 for urwiki:نام_مقامات_اے"
[23:36:15] whee
[23:36:24] so, i think that worked.
[23:36:59] ganglia looks greener now, although some hosts still have 50%+ CPU util
[23:37:18] not the best metric, perhaps
[23:37:32] now to figure out why/who is continuously hitting parsoid with that url .. not sure if it is coming from the job queue.
[23:37:44] but, that for another time. i will prepare the update for ops and send it out.
[23:43:43] YuviPanda, alright .. looking good to me.
[23:43:49] subbu: yeah, to me too
[23:43:56] subbu: I think I’ll get some sleep now.
[23:44:07] ok, hitting send on the email to ops list
[23:44:08] subbu: thanks for coming online to get this fixed!
[23:44:34] ya .. glad i noticed.
[23:44:51] :D
[23:45:06] subbu: a non-working parsoid implies dead VE and Flow only, right/
[23:45:06] ?
[23:45:18] yes.
[23:45:49] and content translation.
[23:45:55] right
[23:46:08] and possibly kiwix and ocg.
[23:46:17] emailed to ops.
[23:47:34] paravoid, i am off now .. back online in about 2 hours.
[23:47:48] am off too, back in… a day?
[23:48:18] * YuviPanda has the beginnings of a massive cold