[00:03:35] New patchset: Jalexander; "Adjust shop link to protorel shop.wikimedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13551
[01:04:43] hi Jeff_Green
[01:04:45] oh hello
[01:04:52] someone paged?
[01:04:57] that'd be me.
[01:05:10] so - lvs3 and 4 have decided that all their backends are down.
[01:05:16] ah! my non-cheap-german burner phone didn't know your caller ID
[01:05:19] nice
[01:05:29] due to the limit on the maximum number of depoolable servers, all services still seem to be functioning.
[01:05:39] ok
[01:05:40] but, for example, take a look at this page:
[01:05:40] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=API+application+servers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[01:05:53] You can see a bunch (10 or so?) servers depooled around 17:00 UTC.
[01:06:01] (and the corresponding CPU increase on the rest)
[01:06:08] yeah
[01:06:21] anything in pybal logs?
[01:06:35] yeah. a bunch of tracebacks
[01:07:02] and it seems like health checks (starting at 17:03ish) taking longer and longer at a rate of one sec/sec.
[01:07:13] i see
[01:07:29] I know there was a bunch of DNS stuff (and some ipv6 stuff?) that mark did earlier today
[01:07:42] yeah, that's what I was just thinking too
[01:07:43] I'm not positive that it's the cause, but DNS resolution is very slow on lvs3/4
[01:07:57] lvs2 does not seem to exhibit the problem.
[01:08:20] which are at eqiad vs pmtpa?
[01:08:58] hmm.. I think both are pmtpa.
[01:09:11] I haven't looked at an eqiad lvs yet (assuming they're numbered 1000+)
[01:09:47] lvs3 has 208.80.152.122 at the top of its resolv.conf, lvs2 does not
[01:09:50] lvs1001 seems to fail in a similar manner.
[01:09:58] I think that's localhost, isn't it?
[01:10:11] yeah.
[01:10:21] it's not responsive
[01:10:41] one thing that dsch noticed is that the allow_from line is new today.
[01:10:51] I haven't tried just kicking pdns_recursor yet.
[01:11:17] I'm going to comment it out of resolv.conf on lvs3 for a minute
[01:11:28] Mailing lists are slow.
[01:11:34] Is that a known thing?
[01:12:03] Brooke: first I've heard, but I just logged in
[01:12:20] Mail delivery seems delayed.
[01:12:32] wikibugs is slow.
[01:12:41] I posted to wikitech-l and I still haven't received it.
[01:12:46] Something probably up with them.
[01:12:48] looking
[01:13:09] queue is a bit high on sodium
[01:13:42] ben: same symptom on sodium, first resolver on the list is failing
[01:14:10] that does seem to be a common thread.
[01:14:22] I'm going to try taking it out on lvs3
[01:14:28] wikibugs isn't actually that badly delayed.
[01:14:31] Only a few minutes.
[01:14:33] ben i did lvs3
[01:14:44] ah.
[01:14:47] so I see.
[01:14:48] any change?
[01:14:48] And the mailing list archive has my mail. I just haven't gotten it back.
[01:14:49] i did not do 4
[01:15:04] Brooke: i suspect this is just a delay due to a DNS resolver snafu
[01:15:30] yes, it looks like lvs3 has re-enabled its backend servers.
[01:15:50] I'll do that on lvs4.
[01:15:53] ok
[01:16:07] that will resolve the immediate bit, then we can play with lvs3 to get a real (aka puppetized) solution.
[01:16:32] ok
[01:17:52] it looks like that worked...
[01:18:11] Mail seems better now.
[01:18:13] Thanks. :-)
[01:18:37] yw
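
The pattern described above -- pybal health checks slowing down while the first nameserver in /etc/resolv.conf fails to answer -- can be confirmed with a quick check like the following. This is a minimal diagnostic sketch, not a command actually run in the log; the hostname queried is illustrative.

    # Time a query against each resolver listed in /etc/resolv.conf.
    # A resolver that refuses or drops queries (as 208.80.152.122 did here)
    # shows up as a timeout or REFUSED status instead of a fast NOERROR.
    for ns in $(awk '/^nameserver/ {print $2}' /etc/resolv.conf); do
        echo "== $ns =="
        dig +time=2 +tries=1 @"$ns" en.wikipedia.org A | grep -E 'status:|Query time'
    done
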
[01:19:26] dschoon just made me realize why it failed.
[01:19:33] ipv6?
[01:19:38] no! (amazingly)
[01:19:43] shocking
[01:19:48] the recursor config specifically allows 127.0.0.1
[01:19:56] i drink enough to be considered ops.
[01:19:57] but /etc/resolv.conf used the public address.
[01:20:05] buh.
[01:20:10] well well well
[01:20:16] there's no place like home!
[01:20:20] (I have not yet tested this theory)
[01:20:26] let's test!
[01:20:34] ...except for, well, not-home. the front door to the treehouse?
[01:20:34] test? why would we do that. ready fire aim!
[01:20:40] localhost still sucks
[01:20:40] shit, this is about to become a backdoor joke.
[01:20:41] * maplebed tests on lvs3
[01:20:55] they can both suck.
[01:21:01] we so rarely have useful logs from our services . . .
[01:21:03] * Jeff_Green weeps.
[01:21:24] I'm going to restart pdns_recursor on lvs3
[01:21:49] ok.
[01:23:03] and again, to enable logging
[01:24:09] you know, it doesn't say it's listening on localhost
[01:24:20] see /var/log/daemon
[01:24:44] ok, tests complete. 127.0.0.1 in /etc/resolv.conf works (the hosts stay up), but 208.80.152.122 causes them to be marked down.
[01:25:13] where?
[01:25:15] lvs3
[01:25:18] ha
[01:25:21] test repeated, same results.
[01:25:23] rlly?
[01:25:33] i *just* enabled pdns to listen on localhost
[01:26:08] why the F would we have it listen on anything other than localhost? I don't get it.
[01:26:25] I see it listening on both at the moment
[01:26:32] that's b/c I hax0rd the config
[01:26:32] (127.0.0.1 and 208.80.152.122)
[01:26:35] ok.
[01:26:39] i added localhost
[01:26:48] which supports dschoon's theory
[01:27:16] so . . . we have options
[01:27:31] the catch is the allow_from line though - it still will only allow requests from localhost even though it's listening on both.
[01:27:33] i vote for the ones that involve drinking.
[01:27:36] yeah.
[01:27:49] dschoon: I have a head start!
[01:27:56] allow_from is just a giant blazing red flag made of airhorns
[01:28:05] well the config is just broken
[01:28:19] we either need to make it listen on localhost and query to localhost in resolv.conf
[01:28:26] yeah.
[01:28:32] or we need to make the allow_from not stupid
[01:28:36] broken = it's set to listen on its public IP but only allows queries from localhost.
[01:29:01] ma rk must have had some reason to have it listen on the public ip
[01:29:44] why don't we add the public IP to the allow_from, and send a memorandum to ma rk
[01:30:27] local-address=<%= flatten_ips(listen_addresses).sort.join(" ") %>
[01:30:38] I haven't looked at the puppet stuff yet
[01:30:43] i think you need a comma.
[01:30:46] ", "
[01:30:58] or at least, that's what it is in his last few commits.
[01:31:01] dschoon: that's the existing config
[01:31:12] oh.
[01:31:13] listen.
[01:31:14] comma in the join delimiter?
[01:31:15] nm!
[01:31:17] ignore me.
[01:31:24] comma was for allow_from
[01:31:27] (our best friend)
[01:31:38] (in the last few commits, for whatever amount of reliability that lends)
[01:31:40] allow-from=127.0.0.0/8, ::1/128, <%= allow_from.join(", ") %>
[01:31:51] the local-address on lvs3 (that you did by hand) uses a comma for the join
[01:32:11] # allow-from If set, only allow these comma separated netmasks to recurse
[01:32:13] comma there
[01:32:25] # local-address IP addresses to listen on, separated by spaces or commas
[01:32:31] local-address takes either. clever.
[01:32:37] ::sigh::
[01:32:52] quality is job 8. I mean 9.
[01:33:25] ok, well, from here I think it's just puppet stabbing
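
Putting the pieces of the discussion above together, the broken state on lvs3/lvs4 looks roughly like the sketch below. The directive names and the 208.80.152.122 address come from the log; the exact file contents and the /26 netmask in the suggested fix are illustrative, not copied from the hosts.

    # /etc/powerdns/recursor.conf (sketch of the broken state)
    local-address=208.80.152.122          # listening only on the public IP
    allow-from=127.0.0.0/8, ::1/128       # but only loopback may recurse

    # /etc/resolv.conf (sketch)
    nameserver 208.80.152.122             # queries hit the public IP and are refused

    # Either fix works: point resolv.conf (and local-address) at 127.0.0.1,
    # or add the host's own address to allow-from, e.g.
    # allow-from=127.0.0.0/8, ::1/128, 208.80.152.0/26   # netmask illustrative
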
[01:33:52] can I go back to hacking bicycles? :-)
[01:33:59] yes. thanks for your help.
[01:34:04] anytime
[01:34:12] i'll have my phone this time, text again if anything comes up
[01:34:28] btw, I slayed puppet on lvs3 only
[01:34:33] ok.
[01:34:35] including cron
[01:35:40] rockin.
[01:40:14] fwiw, the diffs i was looking at were:
[01:40:15] https://gerrit.wikimedia.org/r/#/c/13476/1/templates/powerdns/recursor.conf.erb
[01:40:17] https://gerrit.wikimedia.org/r/#/c/13476/1/manifests/dns.pp
[01:40:31] (this one makes no sense to me, but whatever. https://gerrit.wikimedia.org/r/#/c/13470/1/manifests/dns.pp )
[01:48:25] New patchset: Bhartshorne; "instructing powerdns pdns recursor to allow connections from its own public IP address (since this is what goes in resolv.conf)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13554
[01:48:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13554
[01:50:36] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13554
[01:54:30] New patchset: Bhartshorne; "die extra commas die" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13555
[01:55:02] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13555
[01:56:52] that's a damn pretty concatenation, mr maplebed
[02:18:08] everything all better now?
[02:18:59] hrm...not seeing anything in the server admin log. that doesn't bode well
[02:20:02] the man is typing furiously
[02:20:22] well, with one hand. the other holds a fine glass of victory cognac
[02:20:35] k...thanks for the update!
[02:20:39] guest is here
[02:27:21] roblaAFK: I updated the village pump, but neglected the SAL.
[02:28:28] !log corrected LVS pdns_recursor config error causing DNS queries to fail on LVS servers in gerrit r13554 and r13555.
[02:28:34] Logged the message, Master
[14:42:31] New patchset: Alex Monk; "Enhance account throttling" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12185
[15:47:11] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 59300 bytes in 8.550 seconds
[15:58:35] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[15:59:56] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[16:12:00] !log Temporarily added path 6939+ 14907+ to AVOID-PATHs on cr2-knams
[16:12:07] Logged the message, Master
[16:21:14] hi mark
[16:21:19] is that for the ipv6 packet loss?
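
For reference, the two merged changes above (r13554 and r13555) adjust the puppet template quoted earlier in the conversation. The sketch below shows the shape of the template after the fix, using the lines quoted in the discussion; it is not a verbatim copy of the merged diffs.

    # templates/powerdns/recursor.conf.erb (sketch, not the merged diff)
    # local-address accepts space- or comma-separated addresses;
    # allow-from takes comma-separated netmasks only, and must cover
    # whatever address ends up in /etc/resolv.conf.
    local-address=<%= flatten_ips(listen_addresses).sort.join(" ") %>
    allow-from=127.0.0.0/8, ::1/128, <%= allow_from.join(", ") %>
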
[16:23:11] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Puppet has not run in the last 10 hours
[16:23:11] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours
[16:34:26] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[16:35:47] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[16:36:05] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[17:03:54] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 7, down: 1, shutdown: 0BRPeering with AS64600 not established - BR
[17:04:21] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[17:05:24] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0
[17:24:45] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[17:37:12] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[18:48:33] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours
[18:59:50] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[19:09:53] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[19:20:50] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[20:05:17] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[20:06:38] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[20:25:32] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours
[20:25:32] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours
[20:27:29] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours
[20:28:32] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours
[20:28:59] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 7, down: 1, shutdown: 0BRPeering with AS64600 not established - BR
[20:29:26] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours
[20:29:26] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours
[20:32:35] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours
[20:32:35] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours
[20:32:35] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[20:32:35] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours
[20:33:29] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours
[20:33:29] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours
[20:34:14] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[20:34:32] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours
[20:35:26] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours
[20:35:26] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours
[20:35:26] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours
[20:36:11] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0
[20:36:29] PROBLEM - Puppet freshness on search35 is CRITICAL: Puppet has not run in the last 10 hours
[20:37:32] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours
[20:39:29] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours
[20:40:32] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours
[20:41:35] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours
[20:41:35] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours
[20:42:29] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours
[20:43:32] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours
[20:44:35] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours
[20:44:35] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours
[20:46:32] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours
[20:48:29] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours
[20:48:29] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours
[20:48:29] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours
[20:49:32] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours
[20:50:26] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours
[20:52:05] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[20:54:29] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours
[21:07:41] PROBLEM - Host search1004 is DOWN: PING CRITICAL - Packet loss = 100%
[21:14:35] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[21:37:59] PROBLEM - SSH on pdf3 is CRITICAL: Server answer:
[21:39:29] RECOVERY - SSH on pdf3 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[21:41:44] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[21:44:35] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[21:59:01] PROBLEM - Puppet freshness on lvs3 is CRITICAL: Puppet has not run in the last 10 hours
[22:15:13] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[22:15:49] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[22:17:10] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[22:35:37] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[22:37:25] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196,
[22:38:10] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0
[22:52:34] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[22:55:34] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[23:06:47] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours
[23:10:50] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours
[23:16:05] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[23:19:05] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[23:23:53] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[23:37:32] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor