[00:01:40] (03PS1) 10Mattflaschen: Set group as wikidev for /srv/mediawiki on singlenode mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/79955 [00:07:14] (03PS1) 10Bsitu: Enable Echo on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 [00:07:49] (03CR) 10Bsitu: [C: 04-2] Enable Echo on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 (owner: 10Bsitu) [00:12:43] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [00:18:15] (03PS2) 10Bsitu: Enable Echo on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 [00:19:43] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [00:19:43] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [00:19:43] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [00:19:43] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [00:19:43] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [00:19:44] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [00:21:51] (03CR) 10MZMcBride: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/78944 (owner: 10QChris) [00:22:31] Ryan_Lane: deploying, with greg-g's blessing [00:23:00] * greg-g is still here and nods :) [00:28:22] !log olivneh synchronized php-1.22wmf12/extensions/CoreEvents 'Updating CoreEvents to master for Ide8469db2 (1/2)' [00:28:27] Logged the message, Master [00:28:49] !log olivneh synchronized php-1.22wmf13/extensions/CoreEvents 'Updating CoreEvents to master for Ide8469db2 (2/2)' [00:28:54] Logged the message, Master [00:29:05] ori-l: sounds good [00:29:35] table name is generated from SchemaName_revId, 
so this data will go into a new table [00:30:01] because schema migrations are scientifically proven to be not fun [00:30:05] :D [00:30:32] s/will go/is going/ :) [00:32:12] !bug 45007 | Danny_B [00:32:12] Danny_B: https://bugzilla.wikimedia.org/45007 [00:34:23] thx [00:35:03] i actually think this is quite new issue - definitely during july and beginning of august it was updated */3 [00:35:49] i do maintenance quite regularly, so i guess i remember it correctly [00:37:00] just reused the existing one, new would have felt like a duplicate [00:37:08] but shrug [00:37:31] it could also be split off (not running / run more often) [00:39:44] 3 days is totally ok. i simply wonder it's simply somehow stucked now, either cron or the job itself [00:39:51] we'll see [00:40:33] otoh running it on cswikt would be quite handy atm, since we've done pretty significant maintenance recently so updated lists would be handy [00:42:47] maybe it's this: [00:42:59] update_special_pages_small: [00:43:00] ensure => absent; [00:43:25] because if this is mwdeploy user i dont see it on hume [00:43:56] and it's not obvious how to run it on just one language, and there is no logfile at that location .. s :p [00:44:27] gotta continue on ticket [00:45:03] and grab some food,, bbl [00:52:55] bon apetite, mutante [00:56:47] (03CR) 10MZMcBride: "This seems fine to me. 
is now empty and will presumably be speedily deleted sho" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79832 (owner: 10Nemo bis) [01:53:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:54:35] (03PS1) 10Demon: Remove old ircbot cruft [operations/puppet] - 10https://gerrit.wikimedia.org/r/79968 [01:54:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [01:58:53] (03PS2) 10Demon: Remove old ircbot and gitweb cruft [operations/puppet] - 10https://gerrit.wikimedia.org/r/79968 [02:31:40] !log LocalisationUpdate completed (1.22wmf13) at Tue Aug 20 02:31:40 UTC 2013 [02:31:48] Logged the message, Master [02:45:05] !log LocalisationUpdate completed (1.22wmf12) at Tue Aug 20 02:45:04 UTC 2013 [02:45:10] Logged the message, Master [03:07:41] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Aug 20 03:07:38 UTC 2013 [03:07:47] Logged the message, Master [03:16:05] (03CR) 10Faidon: [C: 032] Add IP addresses for Smart Cambodia. [operations/puppet] - 10https://gerrit.wikimedia.org/r/79953 (owner: 10Dr0ptp4kt) [03:16:38] dr0ptp4kt: ^ [03:27:29] paravoid, when is this vacation I hear about that you are supposed to go on? [03:27:29] is it now? [03:27:32] is it true what they say? [03:27:36] nope [03:27:42] phew, just checkin :) [03:27:49] :) [03:27:55] why, need anything? :) [03:28:12] not really, the kafka-mirror review, but its no hurry at all [03:28:16] and alex can review it just fine [03:28:25] I've already flagged it, it's second on my list now [03:28:32] :) [03:58:22] !log authdns-update: Google DKIM selector [03:58:29] Logged the message, Master [04:12:13] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [04:29:08] (03CR) 10Faidon: "I don't like this much. 
A package to provide an init script seems a little ugly to me (but I might be missing the details)." [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 (owner: 10Ottomata) [04:32:58] (03CR) 10Ottomata: "I'm fine with putting these files in the main kafka package, I actually thought you'd like this better. kafka-mirror will only be started" [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 (owner: 10Ottomata) [04:34:11] (03CR) 10Ottomata: "Ha, and almost all init scripts look very similar. Why don't we use upstart instead? ;) har har" [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 (owner: 10Ottomata) [04:47:53] I never said no to upstart :) [04:48:02] (but have fun doing all this logic with upstart...) [04:52:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:53:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [04:53:58] PROBLEM - Host rubidium is DOWN: CRITICAL - Host Unreachable (208.80.154.40) [04:54:08] PROBLEM - Host mexia is DOWN: PING CRITICAL - Packet loss = 100% [04:55:48] RECOVERY - Host mexia is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [04:56:48] RECOVERY - Host rubidium is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [05:02:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [05:11:34] PROBLEM - NTP on rubidium is CRITICAL: NTP CRITICAL: Offset unknown [05:14:54] RECOVERY - NTP on rubidium is OK: NTP OK: Offset -0.001206755638 secs [06:48:45] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours [06:50:29] (03CR) 10Yurik: "Adam, this change should have been generated by the vcl...py script we have in the 
maintenance (I am not sure if it was, but i suspect it " [operations/puppet] - 10https://gerrit.wikimedia.org/r/79953 (owner: 10Dr0ptp4kt) [06:54:55] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:05] PROBLEM - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:16] PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:25] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:25] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:25] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:25] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:25] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:27] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:30] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:30] PROBLEM - LVS HTTP IPv6 on wikivoyage-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:33] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:37] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:37] PROBLEM - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:37] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:37] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:37] PROBLEM - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:39] PROBLEM - 
LVS HTTPS IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:40] PROBLEM - LVS HTTPS IPv6 on bits-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:42] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:42] PROBLEM - LVS HTTPS IPv4 on wikidata-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:56:01] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:56:01] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:56:01] PROBLEM - LVS HTTPS IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:56:04] PROBLEM - LVS HTTPS IPv4 on foundation-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:56:04] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:56:04] PROBLEM - LVS HTTPS IPv6 on wikivoyage-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:56:06] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:56:06] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:56:11] RECOVERY - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.008 second response time [06:56:47] ugh that might me be [06:57:00] Aug 20 06:47:29 lvs1001 kernel: [12096597.920226] unregister_netdevice: waiting for eth2.1003 to become free. Usage count = 129 [06:57:06] yes [06:57:11] RECOVERY - LVS HTTP IPv6 on wikivoyage-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 42809 bytes in 0.005 second response time [06:57:13] that is because it has the ip address of the router [06:57:22] um [06:57:26] no, that's because you try to remove an interface in use [06:57:31] what did you do? [06:57:55] did you change something just in lvs1001? 
[06:58:21] PROBLEM - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:58:23] yes, just a second please [06:58:31] wait [06:58:34] I'll kill pybal [06:58:40] traffic with shift to the backuip [06:58:43] there is a tagge d interface that has the same address as [06:58:58] !log killing pybal on lvs1001 [06:59:01] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 94906 bytes in 3.017 second response time [06:59:01] RECOVERY - LVS HTTPS IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 602 bytes in 3.011 second response time [06:59:01] ok, yuo have got this [06:59:04] Logged the message, Master [06:59:04] RECOVERY - LVS HTTPS IPv4 on foundation-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 3.016 second response time [06:59:05] RECOVERY - LVS HTTPS IPv6 on wikivoyage-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 42809 bytes in 3.015 second response time [06:59:07] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 3.017 second response time [06:59:07] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 3.020 second response time [06:59:07] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 94906 bytes in 0.031 second response time [06:59:09] RECOVERY - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22222 bytes in 0.012 second response time [06:59:17] ae3-1003.cr2-eqiad.wikimedia.org. 
[06:59:21] apergos: before you do network changes you should definitely fail over the load balancer if one's active [06:59:21] RECOVERY - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.005 second response time [06:59:21] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.025 second response time [06:59:24] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.023 second response time [06:59:24] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.031 second response time [06:59:24] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 94906 bytes in 0.063 second response time [06:59:24] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.060 second response time [06:59:24] RECOVERY - LVS HTTPS IPv4 on wikivoyage-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 42809 bytes in 0.057 second response time [06:59:26] RECOVERY - LVS HTTPS IPv4 on mediawiki-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.058 second response time [06:59:27] RECOVERY - LVS HTTPS IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.039 second response time [06:59:27] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22222 bytes in 0.020 second response time [06:59:29] RECOVERY - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3838 bytes in 0.031 second response time [06:59:31] RECOVERY - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.028 second response time [06:59:31] apergos: okay, now you can take your time and fix this :) 
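[Editor's note: the failover paravoid performed here is just stopping pybal, which withdraws the BGP-announced service IPs so the router falls back to the backup LVS. A sketch of the safe maintenance sequence; host names and the ipvsadm check are illustrative, not from the log:]

```shell
# On the active LVS box (e.g. lvs1001) -- illustrative sketch.
# pybal announces the service IPs over BGP; stopping it withdraws them
# and the router falls back to the backup LVS automatically.
/etc/init.d/pybal stop

# Verify the backup box is now taking the traffic before touching
# anything: its connection counters should be climbing.
ipvsadm -L -n --stats

# Only now do the disruptive maintenance:
apt-get dist-upgrade
ntpdate-debian
reboot
```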
[06:59:31] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.030 second response time [06:59:31] RECOVERY - LVS HTTPS IPv6 on bits-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3839 bytes in 0.031 second response time [06:59:34] RECOVERY - LVS HTTPS IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.036 second response time [06:59:34] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.030 second response time [06:59:34] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.036 second response time [06:59:40] no need to operate under panic :) [06:59:41] it needed to be done, just in a way with a little less paging [06:59:43] RECOVERY - LVS HTTPS IPv4 on wikidata-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 602 bytes in 0.020 second response time [06:59:44] ok, thank you [06:59:50] and I probably sohuld have asked for help [06:59:54] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.022 second response time [06:59:54] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.032 second response time [07:00:16] i know you just wanted to spread around your migraine! [07:00:23] lvs1001 is overdue for an ntpdate fix and a reboot [07:00:27] anyways, looking in puppet it became clear that changing the ip in the node ips list does not actually fix the interface, it ifup at the end but [07:00:37] if the interface is already up then... 
[07:00:44] my suggestion is [07:00:50] LeslieCarr: I apologize a bunch of ties, please get some sleep [07:00:52] cleanup /e/n/interfaces [07:00:53] *time [07:01:06] apt-get dist-upgrade [07:01:09] so in puppet, the interface was corrected by leslie [07:01:11] ntpdate [07:01:12] reboot [07:01:19] and shows in /etc/network/interfaces as right already [07:01:24] I will do the rest of those now [07:02:37] (and please don't forget to !log, I spent a few minutes trying to figure out what might have triggered this :) [07:02:48] yes, that was my bad [07:02:56] happens to the best of us [07:07:17] uh what params do I give to ntpdate? [07:07:25] try ntpdate-debian [07:07:30] (03PS1) 10Faidon: Remove CT from icinga paging [operations/puppet] - 10https://gerrit.wikimedia.org/r/79979 [07:07:46] better [07:07:49] (03CR) 10Faidon: [C: 032] Remove CT from icinga paging [operations/puppet] - 10https://gerrit.wikimedia.org/r/79979 (owner: 10Faidon) [07:07:57] (03CR) 10Faidon: [V: 032] Remove CT from icinga paging [operations/puppet] - 10https://gerrit.wikimedia.org/r/79979 (owner: 10Faidon) [07:09:50] !log rebooting lvs1001 to fix eth2.1003 ip addr, after misguided attempt to simply ifdown/ifup [07:09:55] Logged the message, Master [07:10:45] PROBLEM - Host lvs1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.55) [07:11:20] what are the memory allocation problem lines I see on bootup? [07:11:23] rather a lot of them [07:11:35] RECOVERY - Host lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [07:11:43] also these: [ 50.298525] bnx2 0000:01:00.1: eth1: NIC Copper Link is Down [07:12:03] I don't see anything in dmesg [07:12:19] I se the interface is up now so I can ignore those [07:12:55] on the console, [07:12:57] Since the script you are attempting to invoke has been converted to an [07:12:57] Upstart job, you may also use the start(8) utility, e.g. 
start S20salt-minion [07:12:57] Memory allocation problem [07:13:03] and about 30 more of the last line [07:13:38] see end of /var/log/boot.log [07:13:59] weird [07:14:45] inet 208.80.154.78/26 brd 208.80.154.127 scope global eth2.1003 yay [07:17:19] so if I were going to 'do this right' (for some future next time)... how would I fail over the traffic? [07:17:41] /etc/init.d/pybal stop [07:18:01] pybal maintains bgp sessions [07:18:12] announcing the service IPs [07:18:35] once you kill it, the router automatically falls back to the backup box, via bgp [07:18:58] ok, that's good to know [07:19:28] make sure that all IPs are on that box now and that lvs1004 isn't getting any traffic [07:19:47] and kill pybal/dist-upgrade/ntpdate/reboot lvs1004 too if you feel confident :) [07:20:38] I don't but an experienced ops person is around in case I screw up ;-) [07:21:15] (03CR) 10Ori.livneh: "Minor terminology quibble: you're spawning subprocesses, not threads, and you're counting CPUs, not cores." [operations/puppet] - 10https://gerrit.wikimedia.org/r/79231 (owner: 10MaxSem) [07:23:40] (03CR) 10MaxSem: "Well, the parameter itself is called --threads;)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79231 (owner: 10MaxSem) [07:24:49] (03CR) 10Ori.livneh: "Yes, I saw. Would you be annoyed if I fixed that in core?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79231 (owner: 10MaxSem) [07:25:12] lol ori-l [07:25:47] I thought about doing it but didn't want to annoy you by making you update the patch to use a different command-line argument [07:26:40] also, another aside re: that patch, it'd be good to be able to compare performance before/after. 
there are wfProfile() calls on the relevant functions but they're not ending up in graphite, possibly because we don't have that set up for maintenance scripts on tin, but I don't really know [07:29:29] there's a nice way to profile it: time [07:29:48] and I did it on beta [07:30:10] heh [07:30:12] yes, you're right [07:30:26] i'm so used to thinking about profiling PHP code in the context of web requests that i didn't think of that [07:31:45] (03CR) 10TTO: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79770 (owner: 10Andrey Kiselev) [07:33:25] (03CR) 10Ori.livneh: [C: 031] Rebuild localisation cache in several threads [operations/puppet] - 10https://gerrit.wikimedia.org/r/79231 (owner: 10MaxSem) [07:37:58] !log reboot lvs1004 after dist-upgrade [07:38:03] Logged the message, Master [07:38:14] yay [07:38:43] PROBLEM - Host lvs1004 is DOWN: PING CRITICAL - Packet loss = 100% [07:38:55] ori-l: maint scripts were explicitly removed from graphite iirc [07:39:02] they were messing averages too much [07:39:25] you have a maint script running for two days and then averaging that with request times [07:39:37] aaron would know more, I remember him looking at it [07:40:12] contrary, I remeber him making all maint code being profiled, as opposed to 1/50th [07:40:22] anyway, whom do I need to bribe to review ^^^? :P [07:40:36] same 'Memory allocation problem' on lvs1004 [07:40:38] nice :-/ [07:41:03] RECOVERY - Host lvs1004 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [07:55:57] paravoid: (not urgent, when you get time) I wanted to ask your thoughts on https://rt.wikimedia.org/Ticket/Display.html?id=5616 best approach [07:56:46] blergh [07:57:10] maybe snmptrap has a bind address option? 
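[Editor's note: on ori-l's quibble about MaxSem's patch above (you're spawning subprocesses, not threads, and counting CPUs, not cores): in CPython, CPU-bound work like rebuilding the localisation cache parallelizes with processes, since threads would serialize on the GIL. A minimal sketch of the pattern; the `rebuild` function is a hypothetical stand-in, not the actual MediaWiki script:]

```python
import multiprocessing


def rebuild(lang):
    # Hypothetical stand-in for rebuilding one language's l10n cache.
    return lang.upper()


def rebuild_all(langs):
    # Subprocesses, not threads: CPython threads would serialize
    # CPU-bound work on the GIL.  cpu_count() reports logical CPUs,
    # not physical cores -- the distinction raised in the review.
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        return pool.map(rebuild, langs)


if __name__ == '__main__':
    print(rebuild_all(['en', 'fr', 'sv']))  # -> ['EN', 'FR', 'SV']
```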
[08:01:33] (03PS1) 10Faidon: authdns: add Ganglia plugin for gdnsd [operations/puppet] - 10https://gerrit.wikimedia.org/r/79981 [08:02:14] I didn't see it first time I looked, nor now on a recheck [08:02:36] this includes the snmpcmd options [08:03:30] (03CR) 10Faidon: [C: 032] authdns: add Ganglia plugin for gdnsd [operations/puppet] - 10https://gerrit.wikimedia.org/r/79981 (owner: 10Faidon) [08:09:33] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:10:23] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [08:11:45] looking to see if there's anything in the conf file that can be useful [08:24:43] (03PS1) 10Faidon: authdns: fix for Ganglia unicode string bug [operations/puppet] - 10https://gerrit.wikimedia.org/r/79982 [08:25:44] (03CR) 10Faidon: [C: 032] authdns: fix for Ganglia unicode string bug [operations/puppet] - 10https://gerrit.wikimedia.org/r/79982 (owner: 10Faidon) [08:47:13] (03CR) 10Ori.livneh: "(7 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79981 (owner: 10Faidon) [08:47:29] apergos: if you're up to it, all LVS could use kernel upgrade & reboot [08:47:32] and ntpdate [08:48:08] ok, I'll do that in a little (still looking into snmptrap stuff, there's a few email thread I've found discussing why the conf option does or does not work with v1 etc) [08:52:34] PROBLEM - Disk space on cp1047 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 12437 MB (3% inode=99%): /srv/sdb3 12453 MB (4% inode=99%): [08:57:01] snmp.conf appears to have a 'clientaddr' option [08:57:06] is that what you're looking at? [08:57:13] yes [08:57:24] I have tested with it. no effect [08:57:30] straced it? [08:57:36] nope [08:58:19] does it also not work for snmpget and friends? [08:58:28] haven't tried those. 
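[Editor's note: the strace suggestion made earlier in this exchange is a quick way to see which source address the net-snmp tools actually bind; target address and OID below are illustrative:]

```shell
# Watch the bind() syscall while running one of the net-snmp clients:
strace -e trace=bind snmpget -v 2c -c public 192.0.2.1 .1.3.6.1.2.1.1.3.0 2>&1 | grep 'bind('
# With a working "clientaddr" setting in snmp.conf, bind() should show
# the configured address rather than 0.0.0.0.
```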
[08:58:33] you should [08:58:52] they likely share a lot of the same code [08:59:04] (03CR) 10Faidon: "(6 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79981 (owner: 10Faidon) [08:59:34] PROBLEM - Disk space on cp1047 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 12437 MB (3% inode=99%): /srv/sdb3 12453 MB (4% inode=99%): [09:00:31] i think it should be reasonable to set clientaddr to $::ipaddress for all our systems [09:00:39] yeah that was my idea [09:00:47] binding to $::ipaddress [09:01:27] according to some google results, it does work for snmptrap but not snmpd [09:01:42] I am testing with snmptrap which is what we want [09:02:18] (03CR) 10Akosiaris: "Well. This is in reality an empty package. Just an init script. I think we should just incorporate this functionality in the original kafk" [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 (owner: 10Ottomata) [09:06:55] where are we going to use snmp traps ? [09:07:06] we use them to report successful puppet runs [09:09:23] (03CR) 10Ori.livneh: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79981 (owner: 10Faidon) [09:09:47] ah yes... I had an idea about replacing that with an nrpe command parsing /var/lib/puppet/state/last_run_summary.yaml [09:09:50] ori-l: it won't work with 3 anyway [09:09:58] ori-l: at least because of urllib2 [09:10:09] apergos: do you mind if I have a go ? [09:10:23] replacing it you mean? 
sure go ahead [09:10:36] paravoid: yes, but that too is a superficial incompatibility that is easy to gloss over with a except ImportError: [09:10:47] https://rt.wikimedia.org/Ticket/Display.html?id=5616 that's the ticket [09:11:01] i think that generally it's possible to write 2/3 compatible code by adopting a small set of nonintrusive habits [09:11:44] apergos: ok thanks [09:11:45] and you end up with more robust code if you get beaten up for thinking strings = bytes [09:12:26] the big reason it won't work with python3 is that gmond module-loader is py2 specific, but still [09:12:37] print('a', 'b') != print 'a', 'b' [09:12:38] in python2 [09:12:44] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [09:12:49] yes, but you are printing a single value in each statement [09:12:56] in this particular example [09:13:01] i wouldn't have suggested it otherwise [09:13:12] anyways, the review was in the spirit of "nifty python tips", not harassing over trivial stuff [09:13:22] you're free to take it or leave it, honestly [09:13:43] - print(' %(name)s: %(units)s %(value)s [%(description)s' % d) [09:13:46] + print((' %(name)s: %(units)s %(value)s [%(description)s' % d)) [09:13:48] apergos: this seems to work: [09:13:49] clientaddr 208.80.154.56:162 [09:13:50] clientaddrUsesPort yes [09:13:50] haha [09:13:52] that's 2to3 :) [09:14:21] i don't use 2to3, i write 2/3 compat code :P [09:15:38] (03CR) 10TMg: [C: 031] Dereference unused category from ArticleFeedbackToolv5 en.wiki config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79832 (owner: 10Nemo bis) [09:15:39] i rather like it when people comb through my code so i sometimes do it if my curiosity is piqued, but i'm usually careful not to attach a score if i'm just being pedantic or opinionated [09:15:50] mark: those lines in the snmp.conf file as is? 
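[Editor's note: the 2/3-compatible "habits" ori-l mentions above really are small; two of them come up in this conversation: calling print() with a single pre-formatted string, and a guarded import for stdlib modules renamed in Python 3 (urllib2, the case paravoid raises). A minimal sketch, not the actual gmond plugin:]

```python
try:
    # Python 2 name; used here only to show the guarded-import idiom.
    from urllib2 import urlopen
except ImportError:
    # Renamed in Python 3.
    from urllib.request import urlopen


def describe(d):
    # Formatting first, then printing one string: print(line) behaves
    # identically on Python 2 and 3, no __future__ import needed.
    line = ' %(name)s: %(units)s %(value)s [%(description)s]' % d
    print(line)
    return line
```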
[09:16:04] oh don't me wrong [09:16:07] just those two lines added to the stock snmp.conf yes [09:16:14] the review was very much welcome [09:16:14] because live testing gives [09:16:16] /etc/snmp/snmp.conf: line 7: Warning: Unknown token: clientaddrUsesPort. [09:16:50] k :) [09:17:11] apergos: just removed that [09:17:14] now I have: clientaddr 208.80.154.56 [09:17:16] and that works too [09:17:18] what did you test with? [09:17:50] bind(3, {sa_family=AF_INET, sin_port=htons(161), sin_addr=inet_addr("208.80.154.56")}, 16) = 0 [09:17:54] it was 0.0.0.0 before [09:18:14] RECOVERY - Puppet freshness on lvs1004 is OK: puppet ran at Tue Aug 20 09:18:08 UTC 2013 [09:18:37] I had a : in there [09:19:21] to pre precise I had 'clientaddr : 208.80.154.137' [09:19:24] *to be [09:19:38] anyways that's obviously the issue because now neon is picking them up [09:19:50] (03PS1) 10Faidon: authdns: more Ganglia plugin fixups [operations/puppet] - 10https://gerrit.wikimedia.org/r/79983 [09:20:15] ok [09:20:28] will you put a template snmp.conf that uses $::ipaddress and put that in base.pp? [09:20:33] yes, that's the plan [09:20:56] (03CR) 10Faidon: [C: 032] authdns: more Ganglia plugin fixups [operations/puppet] - 10https://gerrit.wikimedia.org/r/79983 (owner: 10Faidon) [09:24:35] so [09:24:40] suggestions on how to test the new DNS boxes? [09:24:52] I've done extensive perf testing with gdnsd before, so I'm not worried about that [09:25:02] I do worry about missing records or whatnot [09:25:25] I was thinking maybe write something to pcap real traffic, replay it and compare answers [09:25:35] but seems a bit too complicated, maybe even paranoid [09:28:49] something like tcpreplay but for dns ? 
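[Editor's note: the fix settled on above, templated per the base.pp plan, boils down to one line in the stock snmp.conf; the erb variable spelling is an assumption. Note that `clientaddrUsesPort` was rejected as an unknown token in the live test earlier, so only `clientaddr` is used:]

```
# /etc/snmp/snmp.conf (puppet erb template, sketch)
# Force the net-snmp client tools (snmptrap et al.) to bind the host's
# canonical address instead of 0.0.0.0, so neon sees traps arriving
# from the expected source IP.
clientaddr <%= @ipaddress %>
```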
[09:28:59] kind of [09:29:03] it's a bit more complicated than that [09:29:24] cause you have to compare answers [09:29:25] need to masquerade the source to not send arbitrary packets to random people [09:29:36] and then capture the response [09:29:51] also tie req/resp from the pcap to find the expected response [09:30:39] there also known differences in responses [09:30:53] so I'd need to filter out those [09:31:22] so for example, when asked for en.wikipedia.org A, PowerDNS will reply the CNAME to wikimedia-lb.wikimedia.org, but it'll also reply the A record for that CNAME [09:31:28] gdnsd won't do that, and rightly so [09:32:33] bind also does that [09:33:22] no it doesn't... i does however add authority and additional sections [09:33:43] yeah, that's configurable in both bind and gdnsd [09:33:46] (but with opposite defaults) [09:33:59] "minimal-responses yes;" in bind [09:34:24] 'include_optional_ns = true" in gdnsd [09:36:10] (03PS1) 10ArielGlenn: force snmp traps to be sent with canonical client ip addr [operations/puppet] - 10https://gerrit.wikimedia.org/r/79984 [09:41:23] (03CR) 10ArielGlenn: [C: 032] force snmp traps to be sent with canonical client ip addr [operations/puppet] - 10https://gerrit.wikimedia.org/r/79984 (owner: 10ArielGlenn) [09:44:35] rebooting role::poolcounter machines can be done easily or is there something that i should be aware of ? [09:47:22] why did we put the bacula director on the same machine as poolcounter? 
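[Editor's note: the replay-and-compare idea discussed here needs to normalize answers before diffing, to filter known server differences such as PowerDNS appending the A record for a CNAME target while gdnsd does not. A pure-Python sketch of just the comparison step, with answer sections modeled as (name, type, rdata) tuples; it assumes any CNAME chain appears in order:]

```python
def normalize(answer, qname, qtype):
    # Keep only the CNAME chain starting at the query name, plus direct
    # answers of the queried type; drop extras some servers append
    # (e.g. PowerDNS adding the A record for a CNAME target).
    keep = set()
    chain = {qname.lower()}
    for name, rtype, rdata in answer:
        if rtype == 'CNAME' and name.lower() in chain:
            keep.add((name.lower(), rtype, rdata.lower()))
            chain.add(rdata.lower())
    for name, rtype, rdata in answer:
        if name.lower() == qname.lower() and rtype == qtype:
            keep.add((name.lower(), rtype, rdata.lower()))
    return keep


def same_answer(old, new, qname, qtype):
    # True when two servers gave equivalent answers for the question.
    return normalize(old, qname, qtype) == normalize(new, qname, qtype)
```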
[09:48:26] RECOVERY - Puppet freshness on lvs1005 is OK: puppet ran at Tue Aug 20 09:48:22 UTC 2013 [09:49:20] akosiaris, looking at code, it should be safe, but I'd still recommend deplooling in MW config first to avoid losing work in process [09:49:36] RECOVERY - Puppet freshness on lvs1006 is OK: puppet ran at Tue Aug 20 09:49:28 UTC 2013 [09:49:59] MaxSem: ok thanks :-) [09:59:25] (03PS1) 10TTO: Set Wikibase sort order to alphabetic for ilowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79990 [10:13:02] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [10:14:02] speaking of which, ^^ needs fixage:) [10:14:11] ? [10:15:25] yesterday, mw1046 had r/o root partition [10:15:43] apparently, the warning above is caused by the same issue [10:17:15] I put inn a ticket already [10:17:17] bad hd [10:17:27] !Log reboot lvs1002 after dist-upgrade [10:17:33] Logged the message, Master [10:18:32] PROBLEM - Host lvs1002 is DOWN: CRITICAL - Host Unreachable (208.80.154.56) [10:20:02] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [10:20:02] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [10:20:02] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [10:20:32] virt1 is decommissioned supposedly and yet the box is still powered up, responds to pings. (but not ssh) [10:20:44] our decom processes are a complete mess [10:20:53] still are [10:20:54] you're telling me [10:21:02] RECOVERY - Host lvs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [10:21:05] I'll be looking into virt 1,3 and 4 next (since the lvses are now down [10:21:07] done! [10:21:17] all lvses are done? [10:21:18] Waiting up to 60 more seconds for network configuration... [10:21:18] [10:21:26] as far as puppet seeing them [10:21:36] not as far as reboots. 
that's a different track [10:21:41] ah [10:22:52] PROBLEM - Host upload-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:23:02] so not cool [10:25:19] Aug 20 10:19:43 lvs1002 kernel: [ 38.951673] ADDRCONF(NETDEV_UP): eth2.1019: link is not ready [10:25:23] apparently never became ready [10:25:55] any ideas? [10:26:10] paravoid: [10:26:38] hey just got the page [10:26:46] what did you do? [10:27:10] powercycled lvs1002 after apt-get dist-upgrade [10:27:17] just a dist-upgrade? [10:27:24] ntpdate [10:27:25] that's it [10:27:31] no interface changes? [10:27:33] nope [10:27:42] and I just got the page. nice [10:27:59] PROBLEM - Host misc-web-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [10:28:45] i assume someone's on its serial console? [10:28:56] me [10:28:56] the box is fine [10:28:57] sec [10:29:11] and it booted up, one can get on it via ssh fine [10:29:31] off [10:30:17] no IPs bound on the interfaces [10:30:25] well, on lo [10:31:40] are you fixing it, should I? [10:31:40] PROBLEM - Host upload-lb.eqiad.wikimedia.org is DOWN: CRITICAL - Plugin timed out after 15 seconds [10:31:49] IFACE=lo MODE=start sh -x /etc/network/if-up.d/wikimedia-lvs-realserver [10:32:01] RECOVERY - Host upload-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [10:32:03] RECOVERY - Host upload-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [10:32:05] it added them [10:32:06] wtf [10:32:10] RECOVERY - Host misc-web-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [10:32:12] why weren't they there? [10:33:43] Aug 20 10:22:05 lvs1002 lldpd[2505]: lldp_decode: unknown org tlv received on eth2 [10:33:43] I wonder if this had any relation [10:33:50] no [10:37:12] wtf paging [10:37:19] I just got a page from 12mins ago [10:38:39] are you looking at what could have gone wrong? and, should I keep on with the other lvs hosts or wait a bit?
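The failure above was that lvs1002 booted with no service IPs bound on lo until the if-up.d hook was re-run by hand. A post-boot sanity check could flag that directly; a rough Python sketch that parses `ip -o addr show dev lo` output (the output format is assumed and the VIP addresses are illustrative):

```python
def missing_vips(ip_addr_output, expected_vips):
    """Return expected service IPs that are not bound on lo, given the
    text output of `ip -o addr show dev lo` (format assumed)."""
    bound = set()
    for line in ip_addr_output.splitlines():
        fields = line.split()
        if "inet" in fields:  # e.g. "1: lo inet 208.80.154.224/32 scope global lo"
            bound.add(fields[fields.index("inet") + 1].split("/")[0])
    return sorted(set(expected_vips) - bound)

sample = ("1: lo    inet 127.0.0.1/8 scope host lo\n"
          "1: lo    inet 208.80.154.224/32 scope global lo")
assert missing_vips(sample, ["208.80.154.224", "208.80.154.242"]) == ["208.80.154.242"]
```

A non-empty result after boot would have caught the missing bindings before the pages went out.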
[10:38:49] don't touch them [10:40:38] apergos: do you have the console output still open? [10:40:46] no, I got off [10:41:19] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:41:36] I saw that it was waiting 60 additional seconds for network configuration (as pasted above) [10:41:56] but after that it proceeded and gave a login prompt [10:42:09] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [10:45:37] did you do anything else but reboot? [10:45:42] no [10:45:52] ok [10:46:17] I mean I looked at log files etc but that's it [10:46:19] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:47:19] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [10:48:06] weird [10:48:10] I wonder if it's a race [11:07:28] (03PS1) 10ArielGlenn: remove nonexistent hosts virt3 and virt4 from nagios checks [operations/puppet] - 10https://gerrit.wikimedia.org/r/79996 [11:08:49] (03CR) 10ArielGlenn: [C: 032] remove nonexistent hosts virt3 and virt4 from nagios checks [operations/puppet] - 10https://gerrit.wikimedia.org/r/79996 (owner: 10ArielGlenn) [11:10:24] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:12:14] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [11:16:24] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:18:14] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [11:31:06] (03PS1) 10ArielGlenn: create parent directory /etc/dsh for dsh node group files [operations/puppet] - 10https://gerrit.wikimedia.org/r/79997 [11:32:32] (03CR) 10ArielGlenn: [C: 032] create parent directory /etc/dsh for dsh node group files [operations/puppet] - 10https://gerrit.wikimedia.org/r/79997 (owner: 10ArielGlenn) [11:35:04] (03PS1) 10Faidon: authdns: also adjust descriptions on gdnsd.pyconf [operations/puppet] -
10https://gerrit.wikimedia.org/r/79998 [11:35:17] woo 2 left! [11:35:33] (03CR) 10Faidon: [C: 032] authdns: also adjust descriptions on gdnsd.pyconf [operations/puppet] - 10https://gerrit.wikimedia.org/r/79998 (owner: 10Faidon) [12:00:30] (03PS1) 10ArielGlenn: add virt3,4 to decommissioned list since they are nonexistant [operations/puppet] - 10https://gerrit.wikimedia.org/r/79999 [12:01:39] (03CR) 10ArielGlenn: [C: 032] add virt3,4 to decommissioned list since they are nonexistant [operations/puppet] - 10https://gerrit.wikimedia.org/r/79999 (owner: 10ArielGlenn) [12:07:36] (03CR) 10Jeroen De Dauw: [C: 031] Add DataTypes extension [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/76481 (owner: 10Aude) [12:25:02] (03CR) 10Mark Bergsma: [C: 031] Added support for escaping troublesome characters in tag content. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79745 (owner: 10Edenhill) [12:28:35] (03CR) 10Mark Bergsma: [C: 031] Added JSON formatter, field name identifers and type casting option. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79746 (owner: 10Edenhill) [12:30:08] (03CR) 10Mark Bergsma: [C: 031] Added 'output = null' for testing purposes. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79747 (owner: 10Edenhill) [12:32:41] (03CR) 10Mark Bergsma: [C: 031] When reading offline VSL files (-r ..) make a copy of each matched tags data since the data is volatile. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79748 (owner: 10Edenhill) [12:33:59] (03CR) 10Mark Bergsma: [C: 031] Handle "Var: Val" with empty " Val"s. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79749 (owner: 10Edenhill) [12:34:36] (03CR) 10Mark Bergsma: [C: 031] Indent fix and clarified comment. 
[operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79750 (owner: 10Edenhill) [12:36:02] (03CR) 10Mark Bergsma: [C: 032] Dont redeclare 'len' [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79751 (owner: 10Edenhill) [12:36:25] (03CR) 10Mark Bergsma: [C: 031] Decrease default log.level to 6 (info) [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79752 (owner: 10Edenhill) [12:44:36] (03PS1) 10ArielGlenn: correct ensure for /etc/dsh [operations/puppet] - 10https://gerrit.wikimedia.org/r/80002 [12:45:25] (03CR) 10ArielGlenn: [C: 032] correct ensure for /etc/dsh [operations/puppet] - 10https://gerrit.wikimedia.org/r/80002 (owner: 10ArielGlenn) [12:45:57] and it's still wrong third time's a charm [12:47:18] (03PS1) 10ArielGlenn: third time's a charm? [operations/puppet] - 10https://gerrit.wikimedia.org/r/80004 [12:47:44] (03CR) 10ArielGlenn: [C: 032] third time's a charm? [operations/puppet] - 10https://gerrit.wikimedia.org/r/80004 (owner: 10ArielGlenn) [12:53:23] (03CR) 10Mark Bergsma: [C: 032] Added support for escaping troublesome characters in tag content. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79745 (owner: 10Edenhill) [12:53:30] (03CR) 10Mark Bergsma: [V: 032] Added support for escaping troublesome characters in tag content. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79745 (owner: 10Edenhill) [12:54:14] (03CR) 10Mark Bergsma: [C: 032] Added JSON formatter, field name identifers and type casting option. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79746 (owner: 10Edenhill) [12:54:20] (03CR) 10Mark Bergsma: [V: 032] Added JSON formatter, field name identifers and type casting option. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79746 (owner: 10Edenhill) [12:54:30] (03CR) 10Mark Bergsma: [C: 032 V: 032] Added 'output = null' for testing purposes. 
[operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79747 (owner: 10Edenhill) [12:54:40] (03CR) 10Mark Bergsma: [C: 032 V: 032] When reading offline VSL files (-r ..) make a copy of each matched tags data since the data is volatile. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79748 (owner: 10Edenhill) [12:54:50] (03CR) 10Mark Bergsma: [C: 032 V: 032] Handle "Var: Val" with empty " Val"s. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79749 (owner: 10Edenhill) [12:55:01] (03CR) 10Mark Bergsma: [C: 032 V: 032] Indent fix and clarified comment. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79750 (owner: 10Edenhill) [12:55:11] (03CR) 10Mark Bergsma: [V: 032] Dont redeclare 'len' [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79751 (owner: 10Edenhill) [12:55:21] (03CR) 10Mark Bergsma: [C: 032 V: 032] Decrease default log.level to 6 (info) [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79752 (owner: 10Edenhill) [13:05:42] (03PS5) 10Mark Bergsma: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [13:27:41] (03CR) 10Edenhill: [C: 031] (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [13:37:07] (03PS1) 10ArielGlenn: fix up path of check-raid.py for sudoers [operations/puppet] - 10https://gerrit.wikimedia.org/r/80013 [13:41:16] (03CR) 10ArielGlenn: [C: 032] fix up path of check-raid.py for sudoers [operations/puppet] - 10https://gerrit.wikimedia.org/r/80013 (owner: 10ArielGlenn) [14:04:01] (03CR) 10Ottomata: "Cool, can do. I'd rather this be a separate init script, since they are very distinct services." 
[operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 (owner: 10Ottomata) [14:13:06] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [14:15:22] hey akosiaris [14:15:37] if I add that kafka-mirror .init script to the base kafka package [14:15:43] what's the best way to tell debhelper to install it? [14:15:51] i could put it in install, but then I guess it wouldn't set up rc links? [14:23:46] RECOVERY - RAID on snapshot1 is OK: OK: no RAID installed [14:25:26] RECOVERY - RAID on snapshot2 is OK: OK: no RAID installed [14:25:46] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [14:25:48] are all your varnish frontends 12.04/precise based? [14:26:36] RECOVERY - RAID on snapshot4 is OK: OK: no RAID installed [14:27:07] Snaps: I believe so. I can't say 100% for sure [14:27:10] but I believe so [14:27:54] !log starting multiple parallel swift->ceph copy jobs on terbium [14:27:59] Logged the message, Master [14:28:23] okay, so varnishkafka shouldn't depend on libraries not generally available on 12.04 then [14:29:22] Snaps: depends on the library [14:29:28] we can always backport [14:29:51] backporting has tradeoffs and for some we can't really do it [14:30:03] like don't ask for some libc6 feature :) [14:30:08] it's the libyajl JSON library, which is still on 1.x in precise, but in newer Ubuntus there's a 2.x version I use with varnishkafka.
[14:30:30] But it's not a problem to use yajl1 in varnishkafka, so I'll do that [14:30:36] RECOVERY - RAID on db31 is OK: OK: 1 logical device(s) checked [14:30:52] libvirt is a reverse dep [14:31:00] it's a different soname/package name [14:31:16] but if it's easy for you to back down a version then, yes, I'd prefer it [14:32:12] yajl looks nice [14:32:30] yeah, I like it, and it's reasonably fast [14:49:10] (03PS3) 10Petr Onderka: Implemented diff dumps [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/79808 [14:52:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [14:54:24] (03PS6) 10Mark Bergsma: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [14:57:00] (03CR) 10Edenhill: [C: 031] (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [14:57:23] (just added libyajl-dev as build dep) [14:57:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:45] (03CR) 10Edenhill: "(1 comment)" [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [14:59:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [14:59:27] for building it needs the dev packages for include files [15:02:15] ah, it says "*Build*-Depends".
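For context on the "escaping troublesome characters in tag content" changes being reviewed above: a JSON formatter has to escape backslashes, double quotes, and control characters in log fields before emitting them. A hedged Python illustration of the idea (varnishkafka itself does this in C via its own formatter; this is not its actual code):

```python
def escape_tag(s):
    """Escape characters that would break a JSON string: backslash,
    double quote, and ASCII control characters (emitted as \\uXXXX)."""
    out = []
    for ch in s:
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ord(ch) < 0x20:
            out.append("\\u%04x" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

assert escape_tag('GET /w "index" \t') == 'GET /w \\"index\\" \\u0009'
```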
nevermind me [15:02:46] (03CR) 10Faidon: "(1 comment)" [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [15:21:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:58] mark: sq41 is broke beyond repair but is 1 of the 2 upload squids. Do you wanna add another to upload? [15:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [15:23:19] what do you mean, 1 of the 2? [15:23:37] sq41 and sq42 role is upload [15:23:44] there were tens ;) [15:23:48] i'm sure a bunch have died by now [15:24:01] those ring a bell as a pair [15:24:03] and it's tampa, it's fine, we don't care anymore [15:24:15] RECOVERY - RAID on db9 is OK: OK: State is Optimal, checked 2 logical device(s) [15:24:25] they are a pair thx jeremyb [15:24:34] sq41..58 are all upload [15:25:02] i meant just for ganglia monitoring [15:25:15] PROBLEM - RAID on erzurumi is CRITICAL: Connection refused by host [15:25:52] oh [15:25:53] don't care [15:26:03] yeah sorry wasn't clear [15:26:09] likely those boxes will just get decommissioned in a month or 2 ;) [15:27:35] PROBLEM - Disk space on erzurumi is CRITICAL: Connection refused by host [15:28:05] PROBLEM - DPKG on erzurumi is CRITICAL: NRPE: Command check_dpkg not defined [15:28:12] (03PS1) 10Demon: Turn HTTPs on by default for beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80027 [15:28:30] (03CR) 10jenkins-bot: [V: 04-1] Turn HTTPs on by default for beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80027 (owner: 10Demon) [15:29:27] that is me on erzurumi [15:29:31] hrmmm, can't find what i was thinking of [15:29:32] ignore those [15:29:32] (03CR) 10Demon: "recheck" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80027 (owner: 10Demon) [15:29:43] ^d, any ideas how HTTPS for everyone will be implemented, 
in PHP or with Varnish rules? [15:31:21] <^d> It's done in PHP. There's a place in Wiki.php where if you're required to be on HTTPS but are on HTTP it'll OutputPage::redirect() you. [15:32:43] (03PS3) 10Akosiaris: Refactoring nrpe module (round 2/??) [operations/puppet] - 10https://gerrit.wikimedia.org/r/79329 [15:33:07] MaxSem: http://news.netcraft.com/archives/2013/06/25/ssl-intercepted-today-decrypted-tomorrow.html [15:33:31] "There is a defence against this, known as perfect forward secrecy (PFS). When PFS is used, the compromise of an SSL site's private key does not necessarily reveal the secrets of past private communication; connections to SSL sites which use PFS have a per-session key which is not revealed if the long-term private key is compromised. The security of PFS depends on both parties discarding the shared secret after the transaction is complete (or after a reasonable period to allow for session resumption)." [15:33:45] If someone has a free minute on mchenry, input on https://bugzilla.wikimedia.org/show_bug.cgi?id=42774#c21 would be nice. [15:34:28] Is Wikimedia planning on using PFS? [15:34:47] (03PS1) 10Cmjohnson: removing sq41 role/cache & ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/80028 [15:34:48] I guess we are, but the question is when [15:34:56] better ask Ryan:) [15:36:05] ^d, by forcing User::requiresHTTPS() to true? [15:36:08] (03CR) 10Cmjohnson: [C: 032 V: 032] "goodbye sq41" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80028 (owner: 10Cmjohnson) [15:36:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:36:45] (03PS4) 10Akosiaris: Refactoring nrpe module (round 2/??)
[operations/puppet] - 10https://gerrit.wikimedia.org/r/79329 [15:37:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [15:38:05] RECOVERY - DPKG on erzurumi is OK: All packages OK [15:38:15] RECOVERY - RAID on erzurumi is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [15:38:35] RECOVERY - Disk space on erzurumi is OK: DISK OK [15:38:56] still running hardy, yuck [15:39:07] (03PS1) 10Faidon: exim: switch sodium non-list mail to its own IP [operations/puppet] - 10https://gerrit.wikimedia.org/r/80029 [15:39:50] heya paravoid, how do I properly install multiple init scripts with a single binary package? [15:40:01] (03PS2) 10Faidon: exim: switch sodium non-list mail to its own IP [operations/puppet] - 10https://gerrit.wikimedia.org/r/80029 [15:40:04] ottomata: thas for the kafka thingy ? [15:40:11] that's* [15:40:17] yeah [15:40:36] dh_installinit seems to only work with one [15:40:40] per binary [15:40:46] no, it has arguments [15:40:50] you need to override its call [15:40:59] and call dh_installinit --whatever the option is [15:40:59] hmm, was reading the man, didn't see that….reading harder [15:41:07] you need to call it twice [15:41:21] I was wondering about that. is it a different process ? Or the same with some different args ? [15:41:25] --name=kafka-mirror [15:41:29] different process [15:41:30] and --name=kafka I guess [15:41:46] but having one init script spawning two daemons is not unheard of [15:41:50] its completely different, it fires up multiple consumers that feed into a single producer [15:42:07] (03CR) 10Faidon: [C: 032] exim: switch sodium non-list mail to its own IP [operations/puppet] - 10https://gerrit.wikimedia.org/r/80029 (owner: 10Faidon) [15:42:35] ok [15:42:52] <^d> MaxSem: Yep, basically. [15:42:57] i can't think of an init script i've seen that is used to spawn multiple processes [15:43:17] how would puppet deal with that? 
we'd have to manually set all the start, stop, restart, etc. commands? [15:43:20] so a kafka mirror-node will run both that process and the regular one... ok [15:43:27] maybe? [15:43:28] <^d> That function could later be extended to add per-group support or somesuch, like if we want to force Oversighters or something to use HTTPS. I think Tyler's going to look at that. [15:43:28] or maybe not [15:43:31] it's completely separate [15:43:41] it could run anywhere [15:43:53] (03Merged) 10jenkins-bot: Turn HTTPs on by default for beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80027 (owner: 10Demon) [15:44:22] ottomata: openvpn spawns multiple processes one per config [15:44:34] ^d, cool, thanks - I'll need to whack in a hook for mobile/zero users who are unable to use the site via HTTPS [15:44:54] <^d> Mmk. Feel free to toss me on the review list for such a thing. [15:44:54] hm, but that is not manual, right? it looks at the configs and starts them all [15:45:04] this would be specifying which process you want to start [15:45:05] apergos: err: /Stage[main]/Base::Puppet/File[/etc/snmp/snmp.conf]/ensure: change from absent to present failed: Could not set 'present on ensure: No such file or directory - /etc/snmp/snmp.conf.puppettmp_7199 at /etc/puppet/manifests/base.pp:107 [15:45:12] a broker, or a mirror maker [15:45:15] so [15:45:20] apergos: that's puppet language for "/etc/snmp doesn't exist" [15:45:32] so you are suggesting something like [15:45:40] what host is that? [15:45:55] /etc/init.d/kafka mirror start [15:45:55] /etc/init.d/kafka broker start [15:45:55] etc. ? [15:45:59] paravoid: [15:47:06] RECOVERY - mailman on sodium is OK: PROCS OK: 10 processes with args mailman [15:47:11] yes? [15:47:16] what host is that?
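The single-script pattern being debated above (`/etc/init.d/kafka mirror start` vs. `/etc/init.d/kafka broker start`) amounts to a dispatcher that validates a service name plus an action. A hypothetical Python sketch of just that argument handling (real init scripts are shell, and the daemon paths below are assumptions, not the real kafka package layout):

```python
# Hypothetical service map; the paths are illustrative assumptions.
SERVICES = {"broker": "/usr/sbin/kafka-server", "mirror": "/usr/sbin/kafka-mirror"}
ACTIONS = ("start", "stop", "restart", "status")

def parse_invocation(argv):
    """Validate `kafka <service> <action>` and return (service, action)."""
    if len(argv) != 3 or argv[1] not in SERVICES or argv[2] not in ACTIONS:
        raise SystemExit("usage: kafka {broker|mirror} {start|stop|restart|status}")
    return argv[1], argv[2]

assert parse_invocation(["kafka", "mirror", "start"]) == ("mirror", "start")
```

The puppet concern raised above is real: a multiplexed script means every service resource needs explicit start/stop/status commands, which is why the separate init scripts installed via `dh_installinit --name=...` (as suggested earlier in the channel) stay simpler.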
[15:47:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:27] sodium [15:47:36] RECOVERY - spamassassin on sodium is OK: PROCS OK: 4 processes with args spamd [15:47:43] but the issue is /etc/snmp doesn't exist [15:48:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [15:49:49] I wonder why it would not have it and other systems would... a package that creates it, perhaps? I'll have a look [15:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [15:57:53] (03PS4) 10Faidon: exim: add DKIM for wikimedia.org domains [operations/puppet] - 10https://gerrit.wikimedia.org/r/79754 [16:00:03] bah. lucid version of package doesn't provide the directory. meehh [16:07:39] (03PS1) 10ArielGlenn: lucid libsnmp doesn't create /etc/snmp so we do. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80033 [16:09:57] (03CR) 10ArielGlenn: [C: 032] lucid libsnmp doesn't create /etc/snmp so we do. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80033 (owner: 10ArielGlenn) [16:21:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.209 second response time [16:23:22] <^d> MaxSem: Merged your hook addition. Do you want me to stage it for merge to wmf branches with other https stuff for tomorrow? [16:24:54] ^d, thanks - I was intending it for HTTPS for everyone as mobile already uses secure login [16:25:27] so probably no hurry needed unless we want to move everyone to HTTPS within 1 week:) [16:25:45] <^d> Easy enough to do it now so we don't have to think about it later.
[16:26:31] sounds good then:) [16:26:44] bleh [16:26:53] gerrit really should have mid-air collision detection :( [16:27:04] (re: that hook) [16:28:02] (03PS1) 10CSteipp: Allow autoconfirmed to propose Consumers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80039 [16:36:09] (03PS3) 10Ottomata: Installing kafka-mirror init.d and default scripts. [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 [16:37:06] paravoid ^ [16:40:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:41:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [16:42:06] (03PS4) 10Petr Onderka: Implemented diff dumps [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/79808 [16:43:45] (03PS5) 10Petr Onderka: Implemented diff dumps [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/79808 [16:49:09] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours [16:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [16:55:09] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: No successful Puppet run in the last 10 hours [17:02:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [17:16:58] (03PS1) 10ArielGlenn: account for hosts where every disk is raid 0 (e.g. 
the ms-be hosts) [operations/puppet] - 10https://gerrit.wikimedia.org/r/80055 [17:21:39] binasher: dberror.log is useless with spam atm [17:21:57] (03PS3) 10Bsitu: Enable Echo and Thanks on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 [17:22:07] Aaron|home: yay spam [17:22:18] mostly the same 2 errors [17:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:30] oh, those [17:22:32] yeah [17:23:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [17:24:57] apergos: are you on it? [17:25:06] no, I"m on dinner [17:25:22] sorry, but it is after my 12 hours a day tick mark... [17:26:37] oh, no worry. get offline apergos! [17:26:41] gone! [17:27:02] grr EducationProgram extension [17:27:04] enwiki IndexPager::buildQueryInfo (EducationProgram\RevisionPager) 10.64.16.32 1176 Key 'rev_time' doesn't exist in table 'ep_revisions' (10.64.16.32) SELECT rev_id,rev_object_id,rev_object_identifier,rev_user_id,rev_type,rev_comment,rev_user_text,rev_minor_edit,rev_time,rev_deleted,rev_data FROM `ep_revisions` FORCE INDEX (rev_time) WHERE rev_type = 'EPCourses' AND rev_object_id = '125' ORDER BY rev_time [17:27:06] LIMIT 51 [17:27:39] (03PS4) 10Bsitu: Enable Echo and Thanks on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 [17:31:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:08] are we still using EducationProgram? 
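Triaging a dberror.log that is "useless with spam" from mostly the same two errors, as described above, is easier once lines are collapsed into signatures. A rough sketch, assuming free-form error lines like the EducationProgram one quoted in the channel:

```python
import re
from collections import Counter

def error_signatures(lines):
    """Collapse dberror-style lines into signatures by masking numbers
    and quoted values, so repeats of one error group together."""
    mask = lambda line: re.sub(r"\d+|'[^']*'", "_", line)
    return Counter(mask(line) for line in lines)

log = [
    "Key 'rev_time' doesn't exist in table 'ep_revisions' (10.64.16.32)",
    "Key 'rev_time' doesn't exist in table 'ep_revisions' (10.64.16.33)",
    "Lock wait timeout exceeded",
]
signature, count = error_signatures(log).most_common(1)[0]
assert count == 2
```

Sorting signatures by count surfaces the handful of distinct errors hiding behind the flood.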
[17:32:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [17:43:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [17:52:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [18:12:42] Ryan_Lane: Around? [18:21:17] bblack: noticing that showing cumulative counters in vhtcpd ganglia is hard to read. Should I make a new patch that saves deltas vs last poll instead? [18:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [18:23:43] running scap in a second [18:24:29] (03CR) 10Anomie: [C: 031] Allow autoconfirmed to propose Consumers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80039 (owner: 10CSteipp) [18:25:32] bsitu: if you get done with your deployment early, it would be great if we could sneak in https://gerrit.wikimedia.org/r/#/c/80039/ [18:25:51] !log updated Parsoid to f359548f04e739 [18:25:56] Logged the message, Master [18:26:04] anomie: sure [18:26:11] heh, this is such a "sneak in" kind of day... [18:26:29] anomie: I have one more config change to deploy, but that should be quick [18:31:16] (03PS1) 10Eloquence: Added new public key for myself. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80073 [18:34:58] !log bsitu Started syncing Wikimedia installation... 
: Update Echo to master [18:35:04] Logged the message, Master [18:36:19] (03CR) 10Dzahn: [C: 032] Added new public key for myself. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80073 (owner: 10Eloquence) [18:40:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [18:41:48] (03PS1) 10Eloquence: Remove redundant key comment, fix key type [operations/puppet] - 10https://gerrit.wikimedia.org/r/80075 [18:44:26] (03CR) 10Bsitu: [C: 032] Enable Echo and Thanks on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 (owner: 10Bsitu) [18:44:38] (03Merged) 10jenkins-bot: Enable Echo and Thanks on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 (owner: 10Bsitu) [18:45:10] (03CR) 10Dzahn: [C: 032] "yep, Erik sat next to me, also checked for old keys on all hosts using salt" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80075 (owner: 10Eloquence) [18:52:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [18:57:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:58:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [19:06:52] mw1046: rsync: mkstemp "/usr/local/apache/common-local/php-1.22wmf13/languages/messages/.MessagesKsh.php.9TJcvt" failed: Read-only file system (30) [19:07:05] a lot of this kind of error on mw1046 [19:07:11] !log bsitu Finished syncing Wikimedia installation... 
: Update Echo to master [19:07:16] Logged the message, Master [19:07:48] (03PS1) 10Demon: test2wiki to secure login [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80078 [19:09:30] anomie: ping [19:09:37] bsitu: pong [19:09:46] I am pushing out initialiseSetting.php [19:09:51] 'wgSecureLogin' => array( [19:09:52] - 'default' => false, [19:09:52] + 'default' => true, [19:09:52] 'loginwiki' => true, [19:10:16] is this your change? [19:10:46] No, mine is a change to CommonSettings.php, changing emailconfirmed (which hasn't existed since 2008) to autoconfirmed [19:11:30] ^d: Is that wgSecureLogin change yours? [19:11:38] bsitu: that change to InitialiseSettings should not go out until tomorrow ^d ^^^ [19:11:50] <^d> That should've been initialisesettings-labs. [19:12:28] oops, [19:12:29] yes [19:12:34] it's in the labs file [19:12:38] <^d> :) [19:12:45] whew [19:12:45] false alarm, sorry [19:12:51] <^d> Harmless for prod, feel free to sync :) [19:12:51] hah [19:13:43] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [19:15:06] ^d: I will sync it if it's harmless [19:15:47] if in labs, yeah, harmless [19:17:28] !log bsitu synchronized echowikis.dblist 'Add fr, hu, pt, pl and sv to Echo dblist' [19:17:33] Logged the message, Master [19:18:34] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Enable Echo and Thanks on fr, hu, pt, pl and sv wiki' [19:18:39] Logged the message, Master [19:19:25] greg-g: yeah, it's for beta-lab [19:19:33] !log bsitu synchronized wmf-config/InitialiseSettings-labs.php 'Turn HTTPs on by default for beta' [19:19:39] Logged the message, Master [19:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [19:24:13] anomie: I am done with the deploy now [19:24:21] bsitu: ok! 
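The rsync failures on mw1046 above ("Read-only file system") only surfaced mid-sync; a target host could be checked up front instead. A small sketch using the ST_RDONLY flag from statvfs (illustrative only, not part of scap):

```python
import os

def is_readonly(path):
    """True if the filesystem holding `path` is mounted read-only
    (ST_RDONLY flag from statvfs(3))."""
    return bool(os.statvfs(path).f_flag & os.ST_RDONLY)

# A writable scratch directory reports False; mw1046's root filesystem
# would have reported True before the sync started.
assert is_readonly("/tmp") is False
```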
[19:24:31] (03CR) 10Anomie: [C: 032] Allow autoconfirmed to propose Consumers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80039 (owner: 10CSteipp) [19:25:53] (03Merged) 10jenkins-bot: Allow autoconfirmed to propose Consumers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80039 (owner: 10CSteipp) [19:27:34] !log anomie synchronized wmf-config/CommonSettings.php 'Fix OAuth rights assignments' [19:27:39] Logged the message, Master [19:30:59] bsitu: Thanks [19:31:08] (03CR) 10Reedy: "Yay" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80039 (owner: 10CSteipp) [19:31:27] anomie: np, glad that you can utilize the window [19:31:42] (03CR) 10Greg Grossmeier: "Ping others on the review list." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/76481 (owner: 10Aude) [19:33:23] yeah, thanks bsitu [19:33:27] and thanks anomie [19:33:52] * ^d twiddles thumbs [19:33:53] "I just wanna say thanks to my mom, my dad, my sister, and all those who stood behind me on the way to this deployment." [19:33:56] <^d> lotsa code to sync. [19:36:23] * ^d prays to the scap gods [19:36:35] ^d: you calleD? [19:40:13] !log demon Started syncing Wikimedia installation... : [19:40:19] Logged the message, Master [19:45:56] notpeter / binasher: either of you got a sec? i need the LOCK TABLES privilege granted to user 'eventlog' on database 'log' on db1047. [19:46:34] <^d> If someone has a chance, scap & friends have been complaining about mw1046 having a r/o filesystem [19:47:04] hmph [19:47:13] the HTTPS warning banner can't be dismissed. [19:47:28] ^d: will take a look [19:47:32] ^d: your fault? :D ^ [19:47:48] Hi. I just read the centralnotice about https being made compulsory. I have a (nooby) question: for the same article, what would be the difference in the data transfer between http and https? (I'm assuming https will require more, but how much more?) [19:48:09] <^d> MatmaRex: About the banner?
No, I had nothing to do with that. [19:48:17] greg-g: ^^^ [19:48:24] Sid-G: It's relatively minimal overhead [19:48:33] ori-l: notpeter doesn't work here any more (technically does, but he's at burning man and his last day is also during burning man)… what's the use case for lock tables though? [19:49:06] Reedy: define "relatively minimal" [19:49:20] Not much [19:49:34] it's an extra couple of kbyte [19:49:42] Reedy: so, no statistics? [19:49:50] the biggest overhead IMHO for many will be the extra RTT during the SSL setup [19:49:54] Sid-G: http://stackoverflow.com/questions/548029/how-much-overhead-does-ssl-impose [19:49:56] http://stackoverflow.com/questions/548029/how-much-overhead-does-ssl-impose [19:49:59] gah [19:50:07] * Sid-G looks [19:50:09] * greg-g shakes fist at Reedy  [19:50:12] Sid-G: Order of magnitude: zero. [19:50:45] Sid-G: The majority of people won't notice the difference [19:51:12] greg-g: the HTTPS warning banner can't be dismissed, could you look into it? [19:51:23] About 10,800,000 results [19:51:29] That'll keep you in reading for a while [19:51:38] MatmaRex: "It's a banner!" [19:51:45] PROBLEM - Puppet freshness on virt2 is CRITICAL: No successful Puppet run in the last 10 hours [19:52:01] Almost purposeful [19:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [19:53:22] the centralnotice banner on en-wp seems to be hideable for me [19:53:42] * Reedy clicks [Hide] [19:53:54] MatmaRex: I have no idea how those work, basile set it up for me [19:54:05] MatmaRex: RESOLVED WORKSFORME [19:54:45] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [19:56:41] ok, so any gadgets using cookies will have to use the secure attribute in the cookies now? 
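The closing question above, about gadgets needing the secure attribute on their cookies once the site is HTTPS-only, can be illustrated with a small sketch. This is an assumption-laden toy using Python's standard library rather than the JavaScript gadget code actually being discussed, and the cookie name `gadget-prefs` is made up:

```python
# Minimal sketch (not gadget code): a cookie marked Secure is only
# sent by browsers over HTTPS connections, which is what the question
# above is about. The cookie name "gadget-prefs" is hypothetical.
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["gadget-prefs"] = "dark-mode"
cookie["gadget-prefs"]["secure"] = True   # never transmitted over plain HTTP
cookie["gadget-prefs"]["path"] = "/"

header = cookie.output(header="Set-Cookie:")
print(header)
```

Without the Secure flag, a cookie set by a gadget on an HTTPS page would still be sent along with any plain-HTTP request to the same domain.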
[19:58:19] mw1046 is depooled from pybal [19:59:34] binasher: i need to truncate a table that predates the current pruning policy but which continues to get inserts at a rapid clip [20:02:09] Reedy: try visiting some pages after you dismiss it [20:02:14] !log demon Started syncing Wikimedia installation... : [20:02:17] i dismissed it twice alreedy [20:02:52] It can be dismissed [20:02:54] !log granted "lock tables" to eventlog user on db1047 [20:02:56] It just doesn't stay dismissed [20:02:56] ori-l: ^^ [20:02:59] Logged the message, Master [20:03:17] binasher: thanks -- much obliged. didn't realize it was peter's last day already :( [20:03:27] Reedy…: [20:04:43] mutante: random question re RT: why do I always get the XSS warning when I do things? [20:05:18] (03PS1) 10Edenhill: Added support for libyajl version 1.x [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/80127 [20:10:23] greg-g: i have seen those in the past right after the upgrade before some http->https redirects were fixed but not since, something related to http/https? httpseverywhere? [20:11:05] I do use httpseverywhere... maybe [20:11:15] I mean, it 'works' I just have to do an extra click :/ [20:12:25] i know which error page you mean yea, i have seen it, but not anymore since some fix quite a while ago..hmm [20:12:37] who can i complain to about the broken banners? [20:12:57] i don't want to hide them with site css [20:13:44] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [20:14:56] actually, i just did that, so whatever. [20:16:47] greg-g: since it moved from old server to new server? local browser cache? 
it was on streber but now magnesium, but resolves to IP, 208.80.154.5 [20:17:02] <^demon> poor mw1046 :\ [20:17:28] mutante: good ideas, will futz in a bit, thanks, just making sure it wasn't just me ;) [20:17:31] it's got the hd errors [20:17:34] and a ticket [20:17:40] ( ^demon ) [20:17:42] apergos: that wasn't long enough to sleep [20:17:50] I didn't sleep, I went off in search of food [20:18:09] it was a longer search than usual, turned into a movie [20:18:09] <^demon> apergos: Would be nice if it was out of rotation then. [20:18:11] ah, then that's an ok amount of time to eat and such [20:18:14] <^demon> scap & friends complain :\ [20:18:27] ^demon, is your scap still running? [20:18:38] <^demon> Yeah, I disconnected and was stupidly not in a screen [20:19:48] * apergos wonders why we don't keep some small number of the right kind of disks on site (but is not enough of a hardware person to answer that question) [20:20:06] I guess dell is supposed to overnight them or something [20:20:44] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [20:21:13] those jenkins build failures one gets that are unrelated to the actual patch content, "Could not generate documentation: Definition 'nrpe::monitor_service' is already defined at /srv/org/wikimedia/doc/puppetsource/modules/nrpe/spec/fixtures/modules/nrpe/manifests/monitor_service.pp:21; cannot be redefined at /srv/org/wikimedia/doc/puppetsource/modules/nrpe/manifests/monitor_service.pp:21" , should i create a bug for them? 
and would you say [20:21:39] https://integration.wikimedia.org/ci/job/operations-puppet-doc/2145/console [20:22:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:18] that's bad that both paths can be traversed to the same files [20:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [20:23:27] and are then counted as separate [20:23:41] <^demon> apergos: Can mw1046 at least be removed from the dsh group for now? [20:24:29] as long as folks remember to put it back in when it comes back up, I guess [20:24:31] ...or can scap be tweaked to use salt instead of dsh?:P [20:24:32] beware, dsh groups are in private puppet [20:24:37] !log demon Finished syncing Wikimedia installation... : [20:24:42] Logged the message, Master [20:24:44] but might have to double check if it actually writes them [20:24:50] mutante, o rly? [20:25:03] eh, not private, public, but yeah, puppet [20:25:14] ;) [20:25:18] /puppet/files/dsh/ [20:25:21] ah I did fix the one dsh issue (with /etc/dsh not existing on some hosts), that won't affect scaps though, those hosts already had it of course [20:25:36] if that was manual it might be overwritten by puppet [20:25:48] I fixed it in puppet [20:25:52] ^demon, is that all?:) [20:26:17] <^demon> MaxSem: Just one sync-file, sec. [20:26:24] (03CR) 10Demon: [C: 032] test2wiki to secure login [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80078 (owner: 10Demon) [20:27:00] (03Merged) 10jenkins-bot: test2wiki to secure login [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80078 (owner: 10Demon) [20:28:15] !log demon synchronized wmf-config/InitialiseSettings.php 'test2wiki to secure login' [20:28:20] Logged the message, Master [20:29:17] <^demon> MaxSem: I'm done. [20:29:23] thanks!:) [21:16:46] scapping... 
[21:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:23:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.161 second response time [21:25:42] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [21:25:47] Logged the message, Master [21:31:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:32:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [21:50:11] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [21:50:17] Logged the message, Master [21:53:01] (03PS1) 10Asher: adding virt1 to decom [rt 5472] [operations/puppet] - 10https://gerrit.wikimedia.org/r/80142 [21:55:11] (03CR) 10Asher: [C: 032 V: 032] adding virt1 to decom [rt 5472] [operations/puppet] - 10https://gerrit.wikimedia.org/r/80142 (owner: 10Asher) [21:57:37] (03PS1) 10Dzahn: remove sq41, decom'ed per RT #5618 [operations/puppet] - 10https://gerrit.wikimedia.org/r/80143 [21:59:08] (03CR) 10Dzahn: [C: 032] "already in decom.pp in puppet and disabled in pybal" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80143 (owner: 10Dzahn) [21:59:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.167 second response time [22:02:02] (03PS1) 10Dr0ptp4kt: Adding carrier for baselining. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80144 [22:06:36] (03CR) 10Dr0ptp4kt: "Mark, Faidon, Asher: this is okay for deployment, provided your approval." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/80144 (owner: 10Dr0ptp4kt) [22:07:08] (03PS8) 10Yuvipanda: Route requests based on data from Redis [operations/puppet] - 10https://gerrit.wikimedia.org/r/78025 [22:07:11] http://icannwiki.com/index.php/All_New_gTLD_Applications [22:07:22] http://icannwiki.com/index.php/.mcd [22:10:47] anyone want to merge my key update? :) https://gerrit.wikimedia.org/r/#/c/79304/ [22:14:55] (03CR) 10Faidon: [C: 032] Adding carrier for baselining. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80144 (owner: 10Dr0ptp4kt) [22:15:28] Aaron|home: ack for switching ceph masters? [22:15:32] (not now) [22:15:39] er, filebackend masters [22:15:56] ack? [22:16:07] paravoid, thx for the review and +2 [22:16:11] well, syn [22:16:39] you wanted to check something yesterday but didn't have your key [22:17:28] oh, yeah, I didn't see anything crazy [22:17:49] oh paravoid you are here! [22:18:00] not for long [22:18:02] so...... I was wondering..... [22:18:03] dangit [22:18:04] what's up [22:18:05] paravoid doesn't sleep [22:18:37] basically: does the ability to give a user http vs https based on their IP need to be a change in the DNS level? Varnish? [22:18:59] redirects you mean? [22:19:09] mediawiki [22:19:23] csteipp: ^^ [22:19:51] (03CR) 10Dzahn: "could you make the old one "ensure =>" and add the new one additionally? yeah, i don't know if we want to keep all keys until the end of t" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79304 (owner: 10Jalexander) [22:20:21] paravoid: so nothing needed on your end to do it, something we need to do in our codebase that reads the varnish header or something... [22:20:24] thanks mutante [22:20:32] yes [22:20:41] * greg-g is just thinking outloud [22:20:54] well then, that's 3 different responses to how to get this done ;) [22:21:05] paravoid: Do you know if the country database is available to the apache servers? 
take x-forwarded-proto, ip and geoip database as input, possibly return a 302/301 with vary: x-forwarded-proto [22:21:31] didn't Tim say mediawiki already does geoip queries? [22:22:23] see Tim's mail to engineering@ [22:22:34] Hmm.. Anyone from CentralNotice around? [22:22:34] from Aug 13th [22:22:38] so i figured that it probably doesn't need a change in the geoip database, the IPs for China/Iran/etc are still the same, but the decision when to redirect is different [22:22:51] so would that rather be /puppet/templates/varnish$ vi geoip.inc.vcl.erb [22:22:57] no [22:23:19] that's for geolookup.wikimedia.org, which we use from javascript [22:23:27] mediawiki can do queries via php [22:23:42] there's a php extension, php5-geoip, that links with libgeoip [22:23:49] aha [22:23:50] and presumably the databases are already in the system [22:24:05] we have modules/geoip to install all kinds of different databases [22:24:11] the free one, the proprietary one etc. [22:24:11] * greg-g nods [22:24:15] hah [22:24:26] country or city level, v4, v6 [22:24:36] regions, as numbers, you name it [22:25:03] makes me wonder if China's SRAs (Hong Kong, Macau) are already a different country or you'd have to exclude again on city level [22:25:51] * ksnider wishes for an "X-Censored" header [22:25:58] they have different iso-3166-2 codes, so yes [22:26:43] er, -1 even [22:28:07] manifests/role/applicationserver.pp includes misc.pp [22:28:44] which in turn includes the proprietary database in production and the (free GeoLite) .deb in labs [22:29:12] so, as Tim said, the foundations are there [22:29:50] btw I think geoip isn't enough, I think we'll need custom IP blocks as well [22:30:07] s/custom/arbitrary/ [22:33:16] paravoid: like what? [22:33:37] what kind of arbitrary blocks are you thinking? [22:34:23] i'm in china where https is blocked. [22:34:33] what should i do? 
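The redirect logic paravoid sketches above (take X-Forwarded-Proto and a GeoIP country lookup as input, possibly answer a 302 with `Vary: X-Forwarded-Proto`) can be written down as a minimal sketch. Everything specific here is an assumption: the exempt-country set and the stubbed lookup table are illustrative stand-ins, not Wikimedia's actual policy or GeoIP data (124.66.15.78 is the Chinese address that appears later in this log):

```python
# Hedged sketch only: HTTPS_EXEMPT_COUNTRIES and the stubbed lookup
# table are illustrative, not the real policy or database.
HTTPS_EXEMPT_COUNTRIES = {"CN", "IR"}

def lookup_country(ip):
    # Stand-in for a real GeoIP query (e.g. via php5-geoip/libgeoip
    # on the apaches, as discussed above).
    return {"124.66.15.78": "CN"}.get(ip, "US")

def https_redirect(ip, x_forwarded_proto, location="https://en.wikipedia.org/"):
    """Return (status, headers) for a redirect, or None to serve as-is."""
    if x_forwarded_proto == "https":
        return None  # already on HTTPS
    if lookup_country(ip) in HTTPS_EXEMPT_COUNTRIES:
        return None  # HTTPS is blocked for this user; leave them on HTTP
    # Vary on the proto header so caches keep HTTP/HTTPS answers apart.
    return (302, {"Location": location, "Vary": "X-Forwarded-Proto"})

print(https_redirect("124.66.15.78", "http"))  # None: exempt country
print(https_redirect("192.0.2.1", "http"))     # 302 redirect with Vary
```

The `Vary: X-Forwarded-Proto` header matters precisely because the decision sits behind caching layers: without it, a cached redirect could be served to a request that was already on HTTPS.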
[22:34:55] (03PS2) 10Jalexander: Replace public key for jamesofur [operations/puppet] - 10https://gerrit.wikimedia.org/r/79304 [22:35:48] (03PS1) 10BryanDavis: Add *_delta stats for vhtcpd ganglia. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80151 [22:36:05] Anna_Frodesiak: What should you do for what? [22:36:31] hello Anna_Frodesiak. A) chinese language wikis will be excluded from the HTTPS requirement and B) it is our understanding that chinese users are able to use the https://login.wikimedia.org address to login currently, yes? [22:36:33] greg-g: what if there's a large well known address range that's not in the database? [22:36:33] any url starting with https is blocked here [22:36:45] HELLO was the use of content encoding impossible? [22:36:55] bd808: I had a look at your ganglia plugin as I wrote my own for gdnsd [22:36:56] i use enwp [22:37:05] Anna_Frodesiak: that isn't my understanding, can you confirm you can access that url I just typed out? [22:37:13] bd808: have a look if you're feeling up to it [22:37:21] paravoid: I hope it helped more than it hurt [22:37:29] I rewrote it [22:37:39] and simplified it a lot [22:37:57] yes i can access https://login.wikimedia.org/ [22:38:00] paravoid: I didn't know that was the case, but ok. [22:38:17] should i logout first? [22:38:19] greg-g: there isn't, as far as we know, yet [22:38:31] Anna_Frodesiak: great! then you'll be able to login tomorrow. Right now we're working on a solution for your situation where you reside in China but want to participate on English Wikipedia [22:38:50] Anna_Frodesiak: no, that's ok, just wanted to confirm you could access that site. Thank you very much. [22:39:17] i logged in fine [22:39:20] then i went to https://www.wikipedia.org/ [22:39:24] and i'm blocked [22:39:40] right [22:39:47] thanks for confirming that. [22:39:53] wait, what? [22:40:00] www is blocked but login isn't? [22:40:07] correct [22:40:21] do you want me to do it again to confirm? 
i can try to go to different wmf projects after login to see [22:41:48] s'ok, that's what I've heard [22:41:54] paravoid: weird huh :) [22:42:07] so i'm an admin at enwp. will i be booted forever? [22:43:23] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:43:47] Anna_Frodesiak: i'm curious, can you give me the ip address you get for login.wikimedia.org and www.wikipedia.org ? [22:44:16] how do i do that? [22:44:24] which os are you using ? [22:44:49] chrome [22:44:55] oh windows xp [22:45:17] http://www.whatismyip.com/ [22:45:18] this should work - http://www.rackspace.com/knowledge_center/article/nslookup-checking-dns-records-on-windows [22:45:29] MaxSem: i'm more curious about the dns resolution, not her ip in particular [22:45:34] oh you want my ip ok [22:45:35] eh, right:) [22:45:38] maybe one ip is blocked on port 443 and one isn't [22:45:51] what's to check leslie? [22:45:54] we know our IPs [22:46:00] i was curious what she was getting [22:46:06] maybe there's something really weird going on [22:46:17] 124.66.15.78 [22:46:35] like bad dns entries propagated by the isp [22:46:59] anyway, gnight! [22:47:31] Anna_Frodesiak: can you do http://www.rackspace.com/knowledge_center/article/nslookup-checking-dns-records-on-windows and do "nslookup login.wikimedia.org" and "nslookup www.wikipedia.org" ? [22:47:43] ok standby... [22:47:46] thank you [22:48:12] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [22:48:27] Run cmd and type "ipconfig /all" Then enter key [22:48:40] LeslieCarr: Sounds like a good reason to do a tech trip to China [22:48:54] Then post the output of the window [22:49:01] root______: Why? [22:49:35] root______: that doesn't help [22:49:50] i'm curious if incorrect ip's are being served [22:50:23] do i have to go to the windows commandline? 
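LeslieCarr's nslookup request above is essentially a DNS-tampering check: compare what the user's resolver returns against the known service address. A tiny sketch of that comparison, under assumptions: the expected address is the wikipedia-lb one that appears in the nslookup output pasted further down in this log (208.80.154.225), and the function name is made up:

```python
# Known-good address for www.wikipedia.org at the time, taken from the
# nslookup output pasted later in this log.
EXPECTED_ADDRS = {"208.80.154.225"}

def dns_looks_tampered(resolved_addrs):
    """True if none of the resolver's answers match a known-good IP."""
    return not (set(resolved_addrs) & EXPECTED_ADDRS)

# Anna's resolver returned the correct address, which points to blocking
# at the connection level (port 443) rather than DNS poisoning.
print(dns_looks_tampered(["208.80.154.225"]))  # False
print(dns_looks_tampered(["10.10.10.10"]))     # True
```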
it's probably just being blocked by the great firewall, but it's possible … [22:50:32] Anna_Frodesiak: Yup [22:50:35] That way you'd have the entire configuration and wouldn't need to ask for further info. [22:50:48] This is the reason [22:50:49] ok that doesn't let me copy paste for some reason [22:50:51] zh-vpn.wikimedia.org ?:p [22:51:04] root______: and how am i supposed to find out what her isp's dns servers are returning ? i don't want to know her server ip's, i want to know what her isp is returning [22:51:42] i type in manually nslookup -type=A www.wikipedia.org [22:51:46] is that right? [22:51:55] yep, that is correct [22:52:00] ok standby [22:52:00] Anna_Frodesiak: Left click in the top left corner, Edit -> Select all [22:52:02] Then press enter [22:53:57] ok here it is: [22:54:04] C:\>nslookup -type=A www.wikipedia.org [22:54:04] Server: dns1.hi169.net [22:54:04] Address: 221.11.132.2 [22:54:05] Non-authoritative answer: [22:54:05] Name: wikipedia-lb.eqiad.wikimedia.org [22:54:05] Address: 208.80.154.225 [22:54:05] Aliases: www.wikipedia.org, wikipedia-lb.wikimedia.org [22:54:24] cool, that is correct :) [22:54:52] so what's the bottom line? will i be unable to access enwp without https? [22:57:09] ? [22:57:33] Anna_Frodesiak: afaik you'll have to wait until tomorrow [22:57:57] < greg-g> Anna_Frodesiak: great! then you'll be able to login tomorrow. .. [22:58:28] i will be able to access enwp? [22:58:53] !log olivneh synchronized php-1.22wmf12/extensions/CoreEvents 'Updating CoreEvents to master (1/2)' [22:58:58] Logged the message, Master [22:59:18] !log olivneh synchronized php-1.22wmf13/extensions/CoreEvents 'Updating CoreEvents to master (2/2)' [22:59:23] Logged the message, Master [23:00:17] !log During sync-dir, SSH timeouts from srv281, mw1089, mw1173; rsync errors on mw1046 (RO fs; previously reported) [23:00:22] Logged the message, Master [23:00:47] thanks everyone for your help. 
i guess i'll just have to wait and find out, along with plenty of others [23:00:56] is mw1046 still in circulation? [23:01:30] ori-l, no [23:01:40] just still in a dsh group [23:01:48] where do you check? [23:02:04] by poking an op:) [23:02:27] we mortals can't look into pybal ourselves [23:02:34] Anna_Frodesiak: you should then be redirected to http based on your IP [23:04:11] it's not [23:04:17] and I think you can, MaxSem [23:04:19] grep mw1046 /h/w/conf/pybal/eqiad/apaches [23:05:52] (03PS1) 10Dzahn: remove mw1046 from dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/80158 [23:06:00] MaxSem: ^ [23:06:04] can we make this an ironclad rule? [23:06:27] leaving it in the dsh group after pulling it from circulation should be a violation of our norms [23:06:33] it's really stressful [23:06:37] mutante, thanks:) [23:06:56] especially because it's quite common for a number of hosts to be pulled, so i get the feeling sometimes that if i complain too loudly i just end up looking like a noob [23:07:02] Anna_Frodesiak: we are working on a solution so you can access enwp tomorrow right now. [23:07:21] not to mention the fact that the last two times i trained someone to deploy their reaction to the ssh timeouts was panic / terror [23:07:51] MaxSem: ori-l http://noc.wikimedia.org/pybal/eqiad/ [23:08:02] that too [23:08:29] where is /h/w/conf/pybal/eqiad/apaches ? [23:08:37] greg-g: thank you! [23:08:46] greg-g: what's the prognosis? [23:08:51] on fenari [23:09:11] Anna_Frodesiak: looking positive :) we have our lead dev on the problem right now [23:09:18] i don't think it's acceptable to leave it up to the deployer to scramble to figure out if the errors s/he is seeing are panic-worthy [23:09:21] splendid. thanks :) [23:09:30] by sshing to a different host or pulling up a web browser [23:09:55] (03CR) 10Dzahn: [C: 032] "deactivated in pybal. 
http://noc.wikimedia.org/pybal/eqiad/apaches" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80158 (owner: 10Dzahn) [23:09:57] i'm sorry to harp on this, but it really causes undue anxiety every time [23:10:20] anyways, and ops help w/ this would be appreciated. [23:10:22] Anna_Frodesiak: there's also some possibly helpful info at http://en.wikipedia.org/wiki/Internet_censorship_in_the_People's_Republic_of_China#Evasion [23:10:55] it suggests [[Tor (anonymity network)]] works if accessed via https [23:11:30] hard drive errors , August 20 (that's today) https://rt.wikimedia.org/Ticket/Display.html?id=5628 [23:12:19] paravoid: gdnsd plugin looks semi-familiar. :) I don't remember what plugin I "borrowed" most of those patterns from. [23:12:30] sorry about that. the great firewall booted me [23:12:59] mutante: it's not about turnaround time for the repair; i have no idea what sort of work is involved or how long it takes; i'm sure you guys do that well. it's really just about pulling it from the dsh group whenever you pull it from pybal, and ideally making it automated so it becomes a non-issue [23:13:26] superfluous errors help no one [23:13:41] Anna_Frodesiak: scratch that re. Tor - that's just to get the software. Situation looks complex :( [23:14:42] Bad idea to suggest illegality without fighting for freedom... 
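ori-l's request above (pull a host from the dsh groups whenever it is pulled from pybal, ideally automatically) amounts to a consistency check between two host lists, which could be run mechanically. A hedged sketch: parsing of the real pybal and dsh files is assumed away, the function name is made up, and the host data below just mirrors this log (mw1046 depooled, srv281 flaky):

```python
def stale_dsh_hosts(dsh_hosts, pybal_enabled):
    """Hosts still listed in a dsh group but not enabled in pybal.

    dsh_hosts: hostnames from e.g. the /puppet/files/dsh/ group files.
    pybal_enabled: hostname -> enabled flag, as one might scrape from
    http://noc.wikimedia.org/pybal/eqiad/ (parsing assumed, not shown).
    """
    return sorted(h for h in dsh_hosts if not pybal_enabled.get(h, False))

dsh = ["mw1045", "mw1046", "srv281"]       # mw1045 is an illustrative healthy host
pybal = {"mw1045": True, "mw1046": False}  # mw1046 depooled; srv281 absent entirely
print(stale_dsh_hosts(dsh, pybal))         # ['mw1046', 'srv281']
```

Hosts this check flags are exactly the ones that would otherwise produce the scary-but-harmless scap timeouts complained about above.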
[23:15:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:16:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.154 second response time [23:16:49] ori-l: nod, i see it in SAL, the depooling was an hour before next scap [23:19:17] people shouldn't panic though, it just skips the host [23:20:04] mutante, during scap, there was a shitload of messages from it, resulting in possibly useful messages being scrolled out [23:20:24] also, I have an impression that it slowed the scap [23:20:36] seems to have yesterday and today [23:20:44] (slowed the scap) [23:21:19] so, yes, please reduce the amount of unneeded error messages in scap, really not useful for our deployers [23:22:17] another fun problem is not to forget to add it back when fixed [23:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:53] deployers are also frequently wrestling with cache expiry, which is hard to reason about because it happens in so many levels (user's browser; user's proxy; bits; etc). an apache that is serving requests based on a stale codebase adds another complicating factor to worry about and reason through [23:23:17] oh wait, is this server just read-only and serving outdated php? 
RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [23:23:22] No, no, it's not [23:23:25] It's the opposite [23:23:25] wehw [23:23:26] it's not serving [23:23:32] It's down but still in the list to receive deployments [23:23:33] s/eh/he/ [23:23:39] gotcha, that's what I thought [23:23:46] Which, if anything, makes an outdated Apache scenario /less/ likely [23:23:50] but you can't tell that from the output [23:23:54] That's right [23:24:02] * greg-g nods [23:24:04] I'll clean up the dsh list based on the output of sync-file README before I deploy [23:24:11] In like a minute [23:24:35] There's a couple of fully down systems, and the readonly one [23:24:38] have you tested your README changes in labs? [23:25:28] added comment to 5628. sounds like it should be on a list [23:25:29] don't wanna deploy untested source [23:25:36] thanks mutante [23:25:45] thanks, sorry for ranting [23:28:54] Doing a sync-dir to wmf13 now, to push code out to the test wikis [23:29:10] mw1046 is throwing a bunch of scary-looking errors [23:29:29] greg-g: I'll test my VE changes in labs when labs's synchronization of VE changes stops being completely broken ;) [23:29:50] RoanKattouw: i just merged 80158, i guess puppet didn't get to remove them yet .. [23:30:06] RoanKattouw: fine, be that way :P [23:30:07] !g 80158 [23:30:08] https://gerrit.wikimedia.org/r/#q,80158,n,z [23:30:37] RoanKattouw: its hard disk died today [23:30:47] datacenter tech on it [23:31:33] mutante: Can I force-run puppet on tin to get the dsh list to update? [23:31:34] gotta still check the others? 
" srv281, mw1089, mw1173" [23:31:38] sync-dir is horrendously slow [23:31:57] Those are just SSH connection time-outs, I don't really care about those because they fail fast [23:31:59] RoanKattouw: yea, should work, well, i'm just expecting it does since the files are in puppet :) [23:32:00] 1046 is taking forever [23:32:14] hit Enter a couple times? [23:32:20] No, that doesn't work [23:32:22] rsync mkdir errors [23:32:38] !log Forcing puppet on tin to update dsh lists for mw1046 removal [23:32:39] running puppet on tin [23:32:41] ah, ok [23:32:42] Logged the message, Mr. Obvious [23:35:13] * RoanKattouw waits for puppet taking forever [23:35:36] want me to live fix them really quick? i dont care [23:35:37] Is the puppetmaster hideously overloaded again? [23:35:42] Yeah go for it [23:36:57] !log manually removed mw1046 from dsh groups mediawiki-installation,apaches,apaches-eqiad on tin [23:37:02] Logged the message, Master [23:37:07] nice how we have apache-eqiad AND apaches-eqiad [23:37:24] but the latter is empty [23:37:42] those are the groups you use, right [23:38:19] mediawiki-installation is the one that matters [23:38:20] Thanks man [23:38:37] !log Repeating previously aborted sync-dir of extensions/VisualEditor now that the dsh node list is fixed [23:38:38] mw-eqiad as well now.. [23:38:43] Logged the message, Mr. Obvious [23:38:44] we should have less groups :p [23:38:47] !log catrope synchronized php-1.22wmf13/extensions/VisualEditor 'Deploy new VE code to wmf13 first for testing' [23:38:52] np [23:38:52] Logged the message, Master [23:42:01] now to those timing out .. 
[23:42:06] !log powercycling srv281 [23:42:11] Logged the message, Master [23:45:02] !log powercycling frozen mw1089 [23:45:06] Logged the message, Master [23:45:24] RECOVERY - Host srv281 is UP: PING OK - Packet loss = 0%, RTA = 26.86 ms [23:47:22] !log powercycling frozen mw1173 [23:47:27] Logged the message, Master [23:47:34] RECOVERY - Host mw1089 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [23:48:24] PROBLEM - Apache HTTP on srv281 is CRITICAL: Connection refused [23:48:34] RECOVERY - Puppet freshness on mw1089 is OK: puppet ran at Tue Aug 20 23:48:28 UTC 2013 [23:49:34] RECOVERY - Host mw1173 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [23:49:34] PROBLEM - Apache HTTP on mw1089 is CRITICAL: Connection refused [23:49:47] ori-l: RoanKattouw , so those 3 that had timeouts were all down and now they are back, so i'm not removing them from dsh groups, the two mw hosts in eqiad were and are enabled, the srv281 is not because it's been flaky before [23:49:53] OK [23:50:07] afraid might have to sync one more time for those 2 [23:51:10] srv281 should die.. creating ticket [23:51:34] RECOVERY - Apache HTTP on mw1089 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.044 second response time [23:52:03] mutante: If you powercycled them, it should be fine [23:52:11] There is a script that syncs during Apache startup [23:52:17] unless you sync-dir'ed while they were down [23:52:24] PROBLEM - Apache HTTP on mw1173 is CRITICAL: Connection refused [23:52:25] ah, true , sure [23:52:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:52:36] which will fix those Apache monitoring reports in a few .. 
[23:52:41] that's why they start later [23:53:24] RECOVERY - Apache HTTP on mw1173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [23:54:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [23:56:31] !log catrope Started syncing Wikimedia installation... : Updating VisualEditor to master [23:56:36] Logged the message, Master