[00:01:40] (03PS1) 10Mattflaschen: Set group as wikidev for /srv/mediawiki on singlenode mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/79955 [00:07:14] (03PS1) 10Bsitu: Enable Echo on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 [00:07:49] (03CR) 10Bsitu: [C: 04-2] Enable Echo on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 (owner: 10Bsitu) [00:12:43] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [00:18:15] (03PS2) 10Bsitu: Enable Echo on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 [00:19:43] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [00:19:43] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [00:19:43] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [00:19:43] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [00:19:43] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [00:19:44] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [00:21:51] (03CR) 10MZMcBride: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/78944 (owner: 10QChris) [00:22:31] Ryan_Lane: deploying, with greg-g's blessing [00:23:00] * greg-g is still here and nods :) [00:28:22] !log olivneh synchronized php-1.22wmf12/extensions/CoreEvents 'Updating CoreEvents to master for Ide8469db2 (1/2)' [00:28:27] Logged the message, Master [00:28:49] !log olivneh synchronized php-1.22wmf13/extensions/CoreEvents 'Updating CoreEvents to master for Ide8469db2 (2/2)' [00:28:54] Logged the message, Master [00:29:05] ori-l: sounds good [00:29:35] table name is generated from SchemaName_revId, 
so this data will go into a new table [00:30:01] because schema migrations are scientifically proven to be not fun [00:30:05] :D [00:30:32] s/will go/is going/ :) [00:32:12] !bug 45007 | Danny_B [00:32:12] Danny_B: https://bugzilla.wikimedia.org/45007 [00:34:23] thx [00:35:03] i actually think this is quite new issue - definitely during july and beginning of august it was updated */3 [00:35:49] i do maintenance quite regularly, so i guess i remember it correctly [00:37:00] just reused the existing one, new would have felt like a duplicate [00:37:08] but shrug [00:37:31] it could also be split off (not running / run more often) [00:39:44] 3 days is totally ok. i simply wonder it's simply somehow stucked now, either cron or the job itself [00:39:51] we'll see [00:40:33] otoh running it on cswikt would be quite handy atm, since we've done pretty significant maintenance recently so updated lists would be handy [00:42:47] maybe it's this: [00:42:59] update_special_pages_small: [00:43:00] ensure => absent; [00:43:25] because if this is mwdeploy user i dont see it on hume [00:43:56] and it's not obvious how to run it on just one language, and there is no logfile at that location .. s :p [00:44:27] gotta continue on ticket [00:45:03] and grab some food,, bbl [00:52:55] bon apetite, mutante [00:56:47] (03CR) 10MZMcBride: "This seems fine to me. 
is now empty and will presumably be speedily deleted sho" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79832 (owner: 10Nemo bis) [01:53:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:54:35] (03PS1) 10Demon: Remove old ircbot cruft [operations/puppet] - 10https://gerrit.wikimedia.org/r/79968 [01:54:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [01:58:53] (03PS2) 10Demon: Remove old ircbot and gitweb cruft [operations/puppet] - 10https://gerrit.wikimedia.org/r/79968 [02:31:40] !log LocalisationUpdate completed (1.22wmf13) at Tue Aug 20 02:31:40 UTC 2013 [02:31:48] Logged the message, Master [02:45:05] !log LocalisationUpdate completed (1.22wmf12) at Tue Aug 20 02:45:04 UTC 2013 [02:45:10] Logged the message, Master [03:07:41] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Aug 20 03:07:38 UTC 2013 [03:07:47] Logged the message, Master [03:16:05] (03CR) 10Faidon: [C: 032] Add IP addresses for Smart Cambodia. [operations/puppet] - 10https://gerrit.wikimedia.org/r/79953 (owner: 10Dr0ptp4kt) [03:16:38] dr0ptp4kt: ^ [03:27:29] paravoid, when is this vacation I hear about that you are supposed to go on? [03:27:29] is it now? [03:27:32] is it true what they say? [03:27:36] nope [03:27:42] phew, just checkin :) [03:27:49] :) [03:27:55] why, need anything? :) [03:28:12] not really, the kafka-mirror review, but its no hurry at all [03:28:16] and alex can review it just fine [03:28:25] I've already flagged it, it's second on my list now [03:28:32] :) [03:58:22] !log authdns-update: Google DKIM selector [03:58:29] Logged the message, Master [04:12:13] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [04:29:08] (03CR) 10Faidon: "I don't like this much. 
A package to provide an init script seems a little ugly to me (but I might be missing the details)." [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 (owner: 10Ottomata) [04:32:58] (03CR) 10Ottomata: "I'm fine with putting these files in the main kafka package, I actually thought you'd like this better. kafka-mirror will only be started" [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 (owner: 10Ottomata) [04:34:11] (03CR) 10Ottomata: "Ha, and almost all init scripts look very similar. Why don't we use upstart instead? ;) har har" [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 (owner: 10Ottomata) [04:47:53] I never said no to upstart :) [04:48:02] (but have fun doing all this logic with upstart...) [04:52:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:53:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [04:53:58] PROBLEM - Host rubidium is DOWN: CRITICAL - Host Unreachable (208.80.154.40) [04:54:08] PROBLEM - Host mexia is DOWN: PING CRITICAL - Packet loss = 100% [04:55:48] RECOVERY - Host mexia is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [04:56:48] RECOVERY - Host rubidium is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [05:02:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [05:11:34] PROBLEM - NTP on rubidium is CRITICAL: NTP CRITICAL: Offset unknown [05:14:54] RECOVERY - NTP on rubidium is OK: NTP OK: Offset -0.001206755638 secs [06:48:45] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours [06:50:29] (03CR) 10Yurik: "Adam, this change should have been generated by the vcl...py script we have in the 
maintenance (I am not sure if it was, but i suspect it " [operations/puppet] - 10https://gerrit.wikimedia.org/r/79953 (owner: 10Dr0ptp4kt) [06:54:55] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:05] PROBLEM - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:16] PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:25] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:25] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:25] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:25] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:25] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:27] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:30] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:30] PROBLEM - LVS HTTP IPv6 on wikivoyage-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:33] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:37] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:37] PROBLEM - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:37] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:37] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:37] PROBLEM - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:39] PROBLEM - 
LVS HTTPS IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:40] PROBLEM - LVS HTTPS IPv6 on bits-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:55:42] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:55:42] PROBLEM - LVS HTTPS IPv4 on wikidata-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:56:01] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:56:01] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:56:01] PROBLEM - LVS HTTPS IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:56:04] PROBLEM - LVS HTTPS IPv4 on foundation-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:56:04] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:56:04] PROBLEM - LVS HTTPS IPv6 on wikivoyage-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:56:06] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:56:06] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [06:56:11] RECOVERY - LVS HTTP IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.008 second response time [06:56:47] ugh that might me be [06:57:00] Aug 20 06:47:29 lvs1001 kernel: [12096597.920226] unregister_netdevice: waiting for eth2.1003 to become free. Usage count = 129 [06:57:06] yes [06:57:11] RECOVERY - LVS HTTP IPv6 on wikivoyage-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 42809 bytes in 0.005 second response time [06:57:13] that is because it has the ip address of the router [06:57:22] um [06:57:26] no, that's because you try to remove an interface in use [06:57:31] what did you do? [06:57:55] did you change something just in lvs1001? 
[06:58:21] PROBLEM - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [06:58:23] yes, just a second please [06:58:31] wait [06:58:34] I'll kill pybal [06:58:40] traffic with shift to the backuip [06:58:43] there is a tagge d interface that has the same address as [06:58:58] !log killing pybal on lvs1001 [06:59:01] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 94906 bytes in 3.017 second response time [06:59:01] RECOVERY - LVS HTTPS IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 602 bytes in 3.011 second response time [06:59:01] ok, yuo have got this [06:59:04] Logged the message, Master [06:59:04] RECOVERY - LVS HTTPS IPv4 on foundation-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 3.016 second response time [06:59:05] RECOVERY - LVS HTTPS IPv6 on wikivoyage-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 42809 bytes in 3.015 second response time [06:59:07] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 3.017 second response time [06:59:07] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 3.020 second response time [06:59:07] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 94906 bytes in 0.031 second response time [06:59:09] RECOVERY - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22222 bytes in 0.012 second response time [06:59:17] ae3-1003.cr2-eqiad.wikimedia.org. 
[06:59:21] apergos: before you do network changes you should definitely fail over the load balancer if one's active [06:59:21] RECOVERY - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.005 second response time [06:59:21] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.025 second response time [06:59:24] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.023 second response time [06:59:24] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.031 second response time [06:59:24] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 94906 bytes in 0.063 second response time [06:59:24] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.060 second response time [06:59:24] RECOVERY - LVS HTTPS IPv4 on wikivoyage-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 42809 bytes in 0.057 second response time [06:59:26] RECOVERY - LVS HTTPS IPv4 on mediawiki-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.058 second response time [06:59:27] RECOVERY - LVS HTTPS IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.039 second response time [06:59:27] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22222 bytes in 0.020 second response time [06:59:29] RECOVERY - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3838 bytes in 0.031 second response time [06:59:31] RECOVERY - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.028 second response time [06:59:31] apergos: okay, now you can take your time and fix this :) 
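[Editor's note: the failover paravoid performed here is just stopping pybal, which withdraws the BGP-announced service IPs so the router falls back to the backup LVS. A sketch of the safe maintenance sequence; host names and the ipvsadm check are illustrative, not from the log:]

```shell
# On the active LVS box (e.g. lvs1001) -- illustrative sketch.
# pybal announces the service IPs over BGP; stopping it withdraws them
# and the router falls back to the backup LVS automatically.
/etc/init.d/pybal stop

# Verify the backup box is now taking the traffic before touching
# anything: its connection counters should be climbing.
ipvsadm -L -n --stats

# Only now do the disruptive maintenance:
apt-get dist-upgrade
ntpdate-debian
reboot
```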
[06:59:31] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.030 second response time [06:59:31] RECOVERY - LVS HTTPS IPv6 on bits-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3839 bytes in 0.031 second response time [06:59:34] RECOVERY - LVS HTTPS IPv6 on wikibooks-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.036 second response time [06:59:34] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.030 second response time [06:59:34] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.036 second response time [06:59:40] no need to operate under panic :) [06:59:41] it needed to be done, just in a way with a little less paging [06:59:43] RECOVERY - LVS HTTPS IPv4 on wikidata-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 602 bytes in 0.020 second response time [06:59:44] ok, thank you [06:59:50] and I probably sohuld have asked for help [06:59:54] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.022 second response time [06:59:54] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61988 bytes in 0.032 second response time [07:00:16] i know you just wanted to spread around your migraine! [07:00:23] lvs1001 is overdue for an ntpdate fix and a reboot [07:00:27] anyways, looking in puppet it became clear that changing the ip in the node ips list does not actually fix the interface, it ifup at the end but [07:00:37] if the interface is already up then... 
[07:00:44] my suggestion is [07:00:50] LeslieCarr: I apologize a bunch of ties, please get some sleep [07:00:52] cleanup /e/n/interfaces [07:00:53] *time [07:01:06] apt-get dist-upgrade [07:01:09] so in puppet, the interface was corrected by leslie [07:01:11] ntpdate [07:01:12] reboot [07:01:19] and shows in /etc/network/interfaces as right already [07:01:24] I will do the rest of those now [07:02:37] (and please don't forget to !log, I spent a few minutes trying to figure out what might have triggered this :) [07:02:48] yes, that was my bad [07:02:56] happens to the best of us [07:07:17] uh what params do I give to ntpdate? [07:07:25] try ntpdate-debian [07:07:30] (03PS1) 10Faidon: Remove CT from icinga paging [operations/puppet] - 10https://gerrit.wikimedia.org/r/79979 [07:07:46] better [07:07:49] (03CR) 10Faidon: [C: 032] Remove CT from icinga paging [operations/puppet] - 10https://gerrit.wikimedia.org/r/79979 (owner: 10Faidon) [07:07:57] (03CR) 10Faidon: [V: 032] Remove CT from icinga paging [operations/puppet] - 10https://gerrit.wikimedia.org/r/79979 (owner: 10Faidon) [07:09:50] !log rebooting lvs1001 to fix eth2.1003 ip addr, after misguided attempt to simply ifdown/ifup [07:09:55] Logged the message, Master [07:10:45] PROBLEM - Host lvs1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.55) [07:11:20] what are the memory allocation problem lines I see on bootup? [07:11:23] rather a lot of them [07:11:35] RECOVERY - Host lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [07:11:43] also these: [ 50.298525] bnx2 0000:01:00.1: eth1: NIC Copper Link is Down [07:12:03] I don't see anything in dmesg [07:12:19] I se the interface is up now so I can ignore those [07:12:55] on the console, [07:12:57] Since the script you are attempting to invoke has been converted to an [07:12:57] Upstart job, you may also use the start(8) utility, e.g. 
start S20salt-minion [07:12:57] Memory allocation problem [07:13:03] and about 30 more of the last line [07:13:38] see end of /var/log/boot.log [07:13:59] weird [07:14:45] inet 208.80.154.78/26 brd 208.80.154.127 scope global eth2.1003 yay [07:17:19] so if I were going to 'do this right' (for some future next time)... how would I fail over the traffic? [07:17:41] /etc/init.d/pybal stop [07:18:01] pybal maintains bgp sessions [07:18:12] announcing the service IPs [07:18:35] once you kill it, the router automatically falls back to the backup box, via bgp [07:18:58] ok, that's good to know [07:19:28] make sure that all IPs are on that box now and that lvs1004 isn't getting any traffic [07:19:47] and kill pybal/dist-upgrade/ntpdate/reboot lvs1004 too if you feel confident :) [07:20:38] I don't but an experienced ops person is around in case I screw up ;-) [07:21:15] (03CR) 10Ori.livneh: "Minor terminology quibble: you're spawning subprocesses, not threads, and you're counting CPUs, not cores." [operations/puppet] - 10https://gerrit.wikimedia.org/r/79231 (owner: 10MaxSem) [07:23:40] (03CR) 10MaxSem: "Well, the parameter itself is called --threads;)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79231 (owner: 10MaxSem) [07:24:49] (03CR) 10Ori.livneh: "Yes, I saw. Would you be annoyed if I fixed that in core?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79231 (owner: 10MaxSem) [07:25:12] lol ori-l [07:25:47] I thought about doing it but didn't want to annoy you by making you update the patch to use a different command-line argument [07:26:40] also, another aside re: that patch, it'd be good to be able to compare performance before/after. 
there are wfProfile() calls on the relevant functions but they're not ending up in graphite, possibly because we don't have that set up for maintenance scripts on tin, but I don't really know [07:29:29] there's a nice way to profile it: time [07:29:48] and I did it on beta [07:30:10] heh [07:30:12] yes, you're right [07:30:26] i'm so used to thinking about profiling PHP code in the context of web requests that i didn't think of that [07:31:45] (03CR) 10TTO: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79770 (owner: 10Andrey Kiselev) [07:33:25] (03CR) 10Ori.livneh: [C: 031] Rebuild localisation cache in several threads [operations/puppet] - 10https://gerrit.wikimedia.org/r/79231 (owner: 10MaxSem) [07:37:58] !log reboot lvs1004 after dist-upgrade [07:38:03] Logged the message, Master [07:38:14] yay [07:38:43] PROBLEM - Host lvs1004 is DOWN: PING CRITICAL - Packet loss = 100% [07:38:55] ori-l: maint scripts were explicitly removed from graphite iirc [07:39:02] they were messing averages too much [07:39:25] you have a maint script running for two days and then averaging that with request times [07:39:37] aaron would know more, I remember him looking at it [07:40:12] contrary, I remeber him making all maint code being profiled, as opposed to 1/50th [07:40:22] anyway, whom do I need to bribe to review ^^^? :P [07:40:36] same 'Memory allocation problem' on lvs1004 [07:40:38] nice :-/ [07:41:03] RECOVERY - Host lvs1004 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [07:55:57] paravoid: (not urgent, when you get time) I wanted to ask your thoughts on https://rt.wikimedia.org/Ticket/Display.html?id=5616 best approach [07:56:46] blergh [07:57:10] maybe snmptrap has a bind address option? 
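[Editor's note: on ori-l's quibble about MaxSem's patch above (you're spawning subprocesses, not threads, and counting CPUs, not cores): in CPython, CPU-bound work like rebuilding the localisation cache parallelizes with processes, since threads would serialize on the GIL. A minimal sketch of the pattern; the `rebuild` function is a hypothetical stand-in, not the actual MediaWiki script:]

```python
import multiprocessing


def rebuild(lang):
    # Hypothetical stand-in for rebuilding one language's l10n cache.
    return lang.upper()


def rebuild_all(langs):
    # Subprocesses, not threads: CPython threads would serialize
    # CPU-bound work on the GIL.  cpu_count() reports logical CPUs,
    # not physical cores -- the distinction raised in the review.
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        return pool.map(rebuild, langs)


if __name__ == '__main__':
    print(rebuild_all(['en', 'fr', 'sv']))  # -> ['EN', 'FR', 'SV']
```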
[08:01:33] (03PS1) 10Faidon: authdns: add Ganglia plugin for gdnsd [operations/puppet] - 10https://gerrit.wikimedia.org/r/79981 [08:02:14] I didn't see it first time I looked, nor now on a recheck [08:02:36] this includes the snmpcmd options [08:03:30] (03CR) 10Faidon: [C: 032] authdns: add Ganglia plugin for gdnsd [operations/puppet] - 10https://gerrit.wikimedia.org/r/79981 (owner: 10Faidon) [08:09:33] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:10:23] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [08:11:45] looking to see if there's anything in the conf file that can be useful [08:24:43] (03PS1) 10Faidon: authdns: fix for Ganglia unicode string bug [operations/puppet] - 10https://gerrit.wikimedia.org/r/79982 [08:25:44] (03CR) 10Faidon: [C: 032] authdns: fix for Ganglia unicode string bug [operations/puppet] - 10https://gerrit.wikimedia.org/r/79982 (owner: 10Faidon) [08:47:13] (03CR) 10Ori.livneh: "(7 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79981 (owner: 10Faidon) [08:47:29] apergos: if you're up to it, all LVS could use kernel upgrade & reboot [08:47:32] and ntpdate [08:48:08] ok, I'll do that in a little (still looking into snmptrap stuff, there's a few email thread I've found discussing why the conf option does or does not work with v1 etc) [08:52:34] PROBLEM - Disk space on cp1047 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 12437 MB (3% inode=99%): /srv/sdb3 12453 MB (4% inode=99%): [08:57:01] snmp.conf appears to have a 'clientaddr' option [08:57:06] is that what you're looking at? [08:57:13] yes [08:57:24] I have tested with it. no effect [08:57:30] straced it? [08:57:36] nope [08:58:19] does it also not work for snmpget and friends? [08:58:28] haven't tried those. 
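[Editor's note: the strace suggestion made earlier in this exchange is a quick way to see which source address the net-snmp tools actually bind; target address and OID below are illustrative:]

```shell
# Watch the bind() syscall while running one of the net-snmp clients:
strace -e trace=bind snmpget -v 2c -c public 192.0.2.1 .1.3.6.1.2.1.1.3.0 2>&1 | grep 'bind('
# With a working "clientaddr" setting in snmp.conf, bind() should show
# the configured address rather than 0.0.0.0.
```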
[08:58:33] you should [08:58:52] they likely share a lot of the same code [08:59:04] (03CR) 10Faidon: "(6 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79981 (owner: 10Faidon) [08:59:34] PROBLEM - Disk space on cp1047 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 12437 MB (3% inode=99%): /srv/sdb3 12453 MB (4% inode=99%): [09:00:31] i think it should be reasonable to set clientaddr to $::ipaddress for all our systems [09:00:39] yeah that was my idea [09:00:47] binding to $::ipaddress [09:01:27] according to some google results, it does work for snmptrap but not snmpd [09:01:42] I am testing with snmptrap which is what we want [09:02:18] (03CR) 10Akosiaris: "Well. This is in reality an empty package. Just an init script. I think we should just incorporate this functionality in the original kafk" [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 (owner: 10Ottomata) [09:06:55] where are we going to use snmp traps ? [09:07:06] we use them to report successful puppet runs [09:09:23] (03CR) 10Ori.livneh: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79981 (owner: 10Faidon) [09:09:47] ah yes... I had an idea about replacing that with an nrpe command parsing /var/lib/puppet/state/last_run_summary.yaml [09:09:50] ori-l: it won't work with 3 anyway [09:09:58] ori-l: at least because of urllib2 [09:10:09] apergos: do you mind if I have a go ? [09:10:23] replacing it you mean? 
sure go ahead [09:10:36] paravoid: yes, but that too is a superficial incompatibility that is easy to gloss over with a except ImportError: [09:10:47] https://rt.wikimedia.org/Ticket/Display.html?id=5616 that's the ticket [09:11:01] i think that generally it's possible to write 2/3 compatible code by adopting a small set of nonintrusive habits [09:11:44] apergos: ok thanks [09:11:45] and you end up with more robust code if you get beaten up for thinking strings = bytes [09:12:26] the big reason it won't work with python3 is that gmond module-loader is py2 specific, but still [09:12:37] print('a', 'b') != print 'a', 'b' [09:12:38] in python2 [09:12:44] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [09:12:49] yes, but you are printing a single value in each statement [09:12:56] in this particular example [09:13:01] i wouldn't have suggested it otherwise [09:13:12] anyways, the review was in the spirit of "nifty python tips", not harassing over trivial stuff [09:13:22] you're free to take it or leave it, honestly [09:13:43] - print(' %(name)s: %(units)s %(value)s [%(description)s' % d) [09:13:46] + print((' %(name)s: %(units)s %(value)s [%(description)s' % d)) [09:13:48] apergos: this seems to work: [09:13:49] clientaddr 208.80.154.56:162 [09:13:50] clientaddrUsesPort yes [09:13:50] haha [09:13:52] that's 2to3 :) [09:14:21] i don't use 2to3, i write 2/3 compat code :P [09:15:38] (03CR) 10TMg: [C: 031] Dereference unused category from ArticleFeedbackToolv5 en.wiki config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79832 (owner: 10Nemo bis) [09:15:39] i rather like it when people comb through my code so i sometimes do it if my curiosity is piqued, but i'm usually careful not to attach a score if i'm just being pedantic or opinionated [09:15:50] mark: those lines in the snmp.conf file as is? 
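[Editor's note: the 2/3-compatible "habits" ori-l mentions above really are small; two of them come up in this conversation: calling print() with a single pre-formatted string, and a guarded import for stdlib modules renamed in Python 3 (urllib2, the case paravoid raises). A minimal sketch, not the actual gmond plugin:]

```python
try:
    # Python 2 name; used here only to show the guarded-import idiom.
    from urllib2 import urlopen
except ImportError:
    # Renamed in Python 3.
    from urllib.request import urlopen


def describe(d):
    # Formatting first, then printing one string: print(line) behaves
    # identically on Python 2 and 3, no __future__ import needed.
    line = ' %(name)s: %(units)s %(value)s [%(description)s]' % d
    print(line)
    return line
```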
[09:16:04] oh don't me wrong [09:16:07] just those two lines added to the stock snmp.conf yes [09:16:14] the review was very much welcome [09:16:14] because live testing gives [09:16:16] /etc/snmp/snmp.conf: line 7: Warning: Unknown token: clientaddrUsesPort. [09:16:50] k :) [09:17:11] apergos: just removed that [09:17:14] now I have: clientaddr 208.80.154.56 [09:17:16] and that works too [09:17:18] what did you test with? [09:17:50] bind(3, {sa_family=AF_INET, sin_port=htons(161), sin_addr=inet_addr("208.80.154.56")}, 16) = 0 [09:17:54] it was 0.0.0.0 before [09:18:14] RECOVERY - Puppet freshness on lvs1004 is OK: puppet ran at Tue Aug 20 09:18:08 UTC 2013 [09:18:37] I had a : in there [09:19:21] to pre precise I had 'clientaddr : 208.80.154.137' [09:19:24] *to be [09:19:38] anyways that's obviously the issue because now neon is picking them up [09:19:50] (03PS1) 10Faidon: authdns: more Ganglia plugin fixups [operations/puppet] - 10https://gerrit.wikimedia.org/r/79983 [09:20:15] ok [09:20:28] will you put a template snmp.conf that uses $::ipaddress and put that in base.pp? [09:20:33] yes, that's the plan [09:20:56] (03CR) 10Faidon: [C: 032] authdns: more Ganglia plugin fixups [operations/puppet] - 10https://gerrit.wikimedia.org/r/79983 (owner: 10Faidon) [09:24:35] so [09:24:40] suggestions on how to test the new DNS boxes? [09:24:52] I've done extensive perf testing with gdnsd before, so I'm not worried about that [09:25:02] I do worry about missing records or whatnot [09:25:25] I was thinking maybe write something to pcap real traffic, replay it and compare answers [09:25:35] but seems a bit too complicated, maybe even paranoid [09:28:49] something like tcpreplay but for dns ? 
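[Editor's note: the fix settled on above, templated per the base.pp plan, boils down to one line in the stock snmp.conf; the erb variable spelling is an assumption. Note that `clientaddrUsesPort` was rejected as an unknown token in the live test earlier, so only `clientaddr` is used:]

```
# /etc/snmp/snmp.conf (puppet erb template, sketch)
# Force the net-snmp client tools (snmptrap et al.) to bind the host's
# canonical address instead of 0.0.0.0, so neon sees traps arriving
# from the expected source IP.
clientaddr <%= @ipaddress %>
```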
[09:28:59] kind of [09:29:03] it's a bit more complicated than that [09:29:24] cause you have to compare answers [09:29:25] need to masquerade the source to not send arbitrary packets to random people [09:29:36] and then capture the response [09:29:51] also tie req/resp from the pcap to find the expected response [09:30:39] there also known differences in responses [09:30:53] so I'd need to filter out those [09:31:22] so for example, when asked for en.wikipedia.org A, PowerDNS will reply the CNAME to wikimedia-lb.wikimedia.org, but it'll also reply the A record for that CNAME [09:31:28] gdnsd won't do that, and rightly so [09:32:33] bind also does that [09:33:22] no it doesn't... i does however add authority and additional sections [09:33:43] yeah, that's configurable in both bind and gdnsd [09:33:46] (but with opposite defaults) [09:33:59] "minimal-responses yes;" in bind [09:34:24] 'include_optional_ns = true" in gdnsd [09:36:10] (03PS1) 10ArielGlenn: force snmp traps to be sent with canonical client ip addr [operations/puppet] - 10https://gerrit.wikimedia.org/r/79984 [09:41:23] (03CR) 10ArielGlenn: [C: 032] force snmp traps to be sent with canonical client ip addr [operations/puppet] - 10https://gerrit.wikimedia.org/r/79984 (owner: 10ArielGlenn) [09:44:35] rebooting role::poolcounter machines can be done easily or is there something that i should be aware of ? [09:47:22] why did we put the bacula director on the same machine as poolcounter? 
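[Editor's note: the replay-and-compare idea discussed here needs to normalize answers before diffing, to filter known server differences such as PowerDNS appending the A record for a CNAME target while gdnsd does not. A pure-Python sketch of just the comparison step, with answer sections modeled as (name, type, rdata) tuples; it assumes any CNAME chain appears in order:]

```python
def normalize(answer, qname, qtype):
    # Keep only the CNAME chain starting at the query name, plus direct
    # answers of the queried type; drop extras some servers append
    # (e.g. PowerDNS adding the A record for a CNAME target).
    keep = set()
    chain = {qname.lower()}
    for name, rtype, rdata in answer:
        if rtype == 'CNAME' and name.lower() in chain:
            keep.add((name.lower(), rtype, rdata.lower()))
            chain.add(rdata.lower())
    for name, rtype, rdata in answer:
        if name.lower() == qname.lower() and rtype == qtype:
            keep.add((name.lower(), rtype, rdata.lower()))
    return keep


def same_answer(old, new, qname, qtype):
    # True when two servers gave equivalent answers for the question.
    return normalize(old, qname, qtype) == normalize(new, qname, qtype)
```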
[09:48:26] RECOVERY - Puppet freshness on lvs1005 is OK: puppet ran at Tue Aug 20 09:48:22 UTC 2013 [09:49:20] akosiaris, looking at code, it should be safe, but I'd still recommend deplooling in MW config first to avoid losing work in process [09:49:36] RECOVERY - Puppet freshness on lvs1006 is OK: puppet ran at Tue Aug 20 09:49:28 UTC 2013 [09:49:59] MaxSem: ok thanks :-) [09:59:25] (03PS1) 10TTO: Set Wikibase sort order to alphabetic for ilowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79990 [10:13:02] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [10:14:02] speaking of which, ^^ needs fixage:) [10:14:11] ? [10:15:25] yesterday, mw1046 had r/o root partition [10:15:43] apparently, the warning above is caused by the same issue [10:17:15] I put inn a ticket already [10:17:17] bad hd [10:17:27] !Log reboot lvs1002 after dist-upgrade [10:17:33] Logged the message, Master [10:18:32] PROBLEM - Host lvs1002 is DOWN: CRITICAL - Host Unreachable (208.80.154.56) [10:20:02] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [10:20:02] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [10:20:02] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [10:20:32] virt1 is decommissioned supposedly and yet the box is still powered up, responds to pings. (but not ssh) [10:20:44] our decom processes are a complete mess [10:20:53] still are [10:20:54] you're telling me [10:21:02] RECOVERY - Host lvs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [10:21:05] I'll be looking into virt 1,3 and 4 next (since the lvses are now down [10:21:07] done! [10:21:17] all lvses are done? [10:21:18] Waiting up to 60 more seconds for network configuration... [10:21:18] [10:21:26] as far as puppet seeing them [10:21:36] not as far as reboots. 
that's a different track [10:21:41] ah [10:22:52] PROBLEM - Host upload-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:23:02] so not cool [10:25:19] Aug 20 10:19:43 lvs1002 kernel: [ 38.951673] ADDRCONF(NETDEV_UP): eth2.1019: link is not ready [10:25:23] apparently never became ready [10:25:55] any ideas? [10:26:10] paravoid: [10:26:38] hey just got the page [10:26:46] what did you do? [10:27:10] powercycled lvs1002 after apt-get dist-upgrade [10:27:17] just a dist-upgrade? [10:27:24] ntpdate [10:27:25] that's it [10:27:31] no interface changes? [10:27:33] nope [10:27:42] and I just got the page. nice [10:27:59] PROBLEM - Host misc-web-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [10:28:45] i assume someone's on its serial console? [10:28:56] me [10:28:56] the box is fine [10:28:57] sec [10:29:11] and it booted up, one can get on it via ssh fine [10:29:31] off [10:30:17] no IPs bound on the interfaces [10:30:25] well, on lo [10:31:40] are you fixing it, should I? [10:31:40] PROBLEM - Host upload-lb.eqiad.wikimedia.org is DOWN: CRITICAL - Plugin timed out after 15 seconds [10:31:49] IFACE=lo MODE=start sh -x /etc/network/if-up.d/wikimedia-lvs-realserver [10:32:01] RECOVERY - Host upload-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [10:32:03] RECOVERY - Host upload-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [10:32:05] it added them [10:32:06] wtf [10:32:10] RECOVERY - Host misc-web-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [10:32:12] why weren't they there? [10:33:43] Aug 20 10:22:05 lvs1002 lldpd[2505]: lldp_decode: unknown org tlv received on eth2 [10:33:43] I wonder if this had any relation [10:33:50] no [10:37:12] wtf paging [10:37:19] I just got a page from 12mins ago [10:38:39] are you looking at what could have gone wrong? and, should I keep on with the other lvs hosts or wait a bit?
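The failure above was that lvs1002 booted with no service IPs bound on lo until the if-up.d hook was re-run by hand. A post-boot sanity check could flag that directly; a rough Python sketch that parses `ip -o addr show dev lo` output (the output format is assumed and the VIP addresses are illustrative):

```python
def missing_vips(ip_addr_output, expected_vips):
    """Return expected service IPs that are not bound on lo, given the
    text output of `ip -o addr show dev lo` (format assumed)."""
    bound = set()
    for line in ip_addr_output.splitlines():
        fields = line.split()
        if "inet" in fields:  # e.g. "1: lo inet 208.80.154.224/32 scope global lo"
            bound.add(fields[fields.index("inet") + 1].split("/")[0])
    return sorted(set(expected_vips) - bound)

sample = ("1: lo    inet 127.0.0.1/8 scope host lo\n"
          "1: lo    inet 208.80.154.224/32 scope global lo")
assert missing_vips(sample, ["208.80.154.224", "208.80.154.242"]) == ["208.80.154.242"]
```

A non-empty result after boot would have caught the missing bindings before the pages went out.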
[10:38:49] don't touch them [10:40:38] apergos: do you have the console output still open? [10:40:46] no, I got off [10:41:19] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:41:36] I saw that it was waiting 60 additional seconds for network configuration (as pasted above) [10:41:56] but after that it proceeded and gave a login prompt [10:42:09] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [10:45:37] did you do anything else but reboot? [10:45:42] no [10:45:52] ok [10:46:17] I mean I looked at log files etc but that's it [10:46:19] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:47:19] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [10:48:06] weird [10:48:10] I wonder if it's a race [11:07:28] (03PS1) 10ArielGlenn: remove nonexistent hosts virt3 and virt4 from nagios checks [operations/puppet] - 10https://gerrit.wikimedia.org/r/79996 [11:08:49] (03CR) 10ArielGlenn: [C: 032] remove nonexistent hosts virt3 and virt4 from nagios checks [operations/puppet] - 10https://gerrit.wikimedia.org/r/79996 (owner: 10ArielGlenn) [11:10:24] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:12:14] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [11:16:24] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:18:14] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [11:31:06] (03PS1) 10ArielGlenn: create parent directory /etc/dsh for dsh node group files [operations/puppet] - 10https://gerrit.wikimedia.org/r/79997 [11:32:32] (03CR) 10ArielGlenn: [C: 032] create parent directory /etc/dsh for dsh node group files [operations/puppet] - 10https://gerrit.wikimedia.org/r/79997 (owner: 10ArielGlenn) [11:35:04] (03PS1) 10Faidon: authdns: also adjust descriptions on gdnsd.pyconf [operations/puppet] -
10https://gerrit.wikimedia.org/r/79998 [11:35:17] woo 2 left! [11:35:33] (03CR) 10Faidon: [C: 032] authdns: also adjust descriptions on gdnsd.pyconf [operations/puppet] - 10https://gerrit.wikimedia.org/r/79998 (owner: 10Faidon) [12:00:30] (03PS1) 10ArielGlenn: add virt3,4 to decommissioned list since they are nonexistant [operations/puppet] - 10https://gerrit.wikimedia.org/r/79999 [12:01:39] (03CR) 10ArielGlenn: [C: 032] add virt3,4 to decommissioned list since they are nonexistant [operations/puppet] - 10https://gerrit.wikimedia.org/r/79999 (owner: 10ArielGlenn) [12:07:36] (03CR) 10Jeroen De Dauw: [C: 031] Add DataTypes extension [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/76481 (owner: 10Aude) [12:25:02] (03CR) 10Mark Bergsma: [C: 031] Added support for escaping troublesome characters in tag content. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79745 (owner: 10Edenhill) [12:28:35] (03CR) 10Mark Bergsma: [C: 031] Added JSON formatter, field name identifers and type casting option. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79746 (owner: 10Edenhill) [12:30:08] (03CR) 10Mark Bergsma: [C: 031] Added 'output = null' for testing purposes. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79747 (owner: 10Edenhill) [12:32:41] (03CR) 10Mark Bergsma: [C: 031] When reading offline VSL files (-r ..) make a copy of each matched tags data since the data is volatile. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79748 (owner: 10Edenhill) [12:33:59] (03CR) 10Mark Bergsma: [C: 031] Handle "Var: Val" with empty " Val"s. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79749 (owner: 10Edenhill) [12:34:36] (03CR) 10Mark Bergsma: [C: 031] Indent fix and clarified comment. 
[operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79750 (owner: 10Edenhill) [12:36:02] (03CR) 10Mark Bergsma: [C: 032] Dont redeclare 'len' [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79751 (owner: 10Edenhill) [12:36:25] (03CR) 10Mark Bergsma: [C: 031] Decrease default log.level to 6 (info) [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79752 (owner: 10Edenhill) [12:44:36] (03PS1) 10ArielGlenn: correct ensure for /etc/dsh [operations/puppet] - 10https://gerrit.wikimedia.org/r/80002 [12:45:25] (03CR) 10ArielGlenn: [C: 032] correct ensure for /etc/dsh [operations/puppet] - 10https://gerrit.wikimedia.org/r/80002 (owner: 10ArielGlenn) [12:45:57] and it's still wrong third time's a charm [12:47:18] (03PS1) 10ArielGlenn: third time's a charm? [operations/puppet] - 10https://gerrit.wikimedia.org/r/80004 [12:47:44] (03CR) 10ArielGlenn: [C: 032] third time's a charm? [operations/puppet] - 10https://gerrit.wikimedia.org/r/80004 (owner: 10ArielGlenn) [12:53:23] (03CR) 10Mark Bergsma: [C: 032] Added support for escaping troublesome characters in tag content. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79745 (owner: 10Edenhill) [12:53:30] (03CR) 10Mark Bergsma: [V: 032] Added support for escaping troublesome characters in tag content. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79745 (owner: 10Edenhill) [12:54:14] (03CR) 10Mark Bergsma: [C: 032] Added JSON formatter, field name identifers and type casting option. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79746 (owner: 10Edenhill) [12:54:20] (03CR) 10Mark Bergsma: [V: 032] Added JSON formatter, field name identifers and type casting option. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79746 (owner: 10Edenhill) [12:54:30] (03CR) 10Mark Bergsma: [C: 032 V: 032] Added 'output = null' for testing purposes. 
[operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79747 (owner: 10Edenhill) [12:54:40] (03CR) 10Mark Bergsma: [C: 032 V: 032] When reading offline VSL files (-r ..) make a copy of each matched tags data since the data is volatile. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79748 (owner: 10Edenhill) [12:54:50] (03CR) 10Mark Bergsma: [C: 032 V: 032] Handle "Var: Val" with empty " Val"s. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79749 (owner: 10Edenhill) [12:55:01] (03CR) 10Mark Bergsma: [C: 032 V: 032] Indent fix and clarified comment. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79750 (owner: 10Edenhill) [12:55:11] (03CR) 10Mark Bergsma: [V: 032] Dont redeclare 'len' [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79751 (owner: 10Edenhill) [12:55:21] (03CR) 10Mark Bergsma: [C: 032 V: 032] Decrease default log.level to 6 (info) [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/79752 (owner: 10Edenhill) [13:05:42] (03PS5) 10Mark Bergsma: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [13:27:41] (03CR) 10Edenhill: [C: 031] (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [13:37:07] (03PS1) 10ArielGlenn: fix up path of check-raid.py for sudoers [operations/puppet] - 10https://gerrit.wikimedia.org/r/80013 [13:41:16] (03CR) 10ArielGlenn: [C: 032] fix up path of check-raid.py for sudoers [operations/puppet] - 10https://gerrit.wikimedia.org/r/80013 (owner: 10ArielGlenn) [14:04:01] (03CR) 10Ottomata: "Cool, can do. I'd rather this be a separate init script, since they are very distinct services." 
[operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 (owner: 10Ottomata) [14:13:06] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [14:15:22] hey akosiaris [14:15:37] if I add that kafka-mirror .init script to the base kafka package [14:15:43] what's the best way to tell debhelper to install it? [14:15:51] i could put it in install, but then I guess it wouldn't set up rc links? [14:23:46] RECOVERY - RAID on snapshot1 is OK: OK: no RAID installed [14:25:26] RECOVERY - RAID on snapshot2 is OK: OK: no RAID installed [14:25:46] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [14:25:48] are all your varnish frontends 12.04/precise based? [14:26:36] RECOVERY - RAID on snapshot4 is OK: OK: no RAID installed [14:27:07] Snaps: I believe so. I can't say 100% for sure [14:27:10] but I believe so [14:27:54] !log starting multiple parallel swift->ceph copy jobs on terbium [14:27:59] Logged the message, Master [14:28:23] okay, so varnishkafka shouldn't depend on libraries not generally available on 12.04 then [14:29:22] Snaps: depends on the library [14:29:28] we can always backport [14:29:51] backporting has tradeoffs and for some we can't really do it [14:30:03] like don't ask for some libc6 feature :) [14:30:08] it's the libyajl JSON library, which is still on 1.x in precise, but in newer Ubuntus there's a 2.x version I use with varnishkafka.
[14:30:30] But it's not a problem to use yajl1 in varnishkafka, so I'll do that [14:30:36] RECOVERY - RAID on db31 is OK: OK: 1 logical device(s) checked [14:30:52] libvirt is a reverse dep [14:31:00] it's a different soname/package name [14:31:16] but if it's easy for you to back down a version then, yes, I'd prefer it [14:32:12] yajl looks nice [14:32:30] yeah, I like it, and it's reasonably fast [14:49:10] (03PS3) 10Petr Onderka: Implemented diff dumps [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/79808 [14:52:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [14:54:24] (03PS6) 10Mark Bergsma: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [14:57:00] (03CR) 10Edenhill: [C: 031] (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [14:57:23] (just added libyajl-dev as build dep) [14:57:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:45] (03CR) 10Edenhill: "(1 comment)" [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [14:59:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [14:59:27] for building it needs the dev packages for include files [15:02:15] ah, it says "*Build*-Depends".
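For context on the "escaping troublesome characters in tag content" changes being reviewed above: a JSON formatter has to escape backslashes, double quotes, and control characters in log fields before emitting them. A hedged Python illustration of the idea (varnishkafka itself does this in C via its own formatter; this is not its actual code):

```python
def escape_tag(s):
    """Escape characters that would break a JSON string: backslash,
    double quote, and ASCII control characters (emitted as \\uXXXX)."""
    out = []
    for ch in s:
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ord(ch) < 0x20:
            out.append("\\u%04x" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

assert escape_tag('GET /w "index" \t') == 'GET /w \\"index\\" \\u0009'
```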
nevermind me [15:02:46] (03CR) 10Faidon: "(1 comment)" [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [15:21:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:58] mark: sq41 is broke beyond repair but is 1 of the 2 upload squids. Do you wanna add another to upload? [15:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [15:23:19] what do you mean, 1 of the 2? [15:23:37] sq41 and sq42 role is upload [15:23:44] there were tens ;) [15:23:48] i'm sure a bunch have died by now [15:24:01] those ring a bell as a pair [15:24:03] and it's tampa, it's fine, we don't care anymore [15:24:15] RECOVERY - RAID on db9 is OK: OK: State is Optimal, checked 2 logical device(s) [15:24:25] they are a pair thx jeremyb [15:24:34] sq41..58 are all upload [15:25:02] i meant just for ganglia monitoring [15:25:15] PROBLEM - RAID on erzurumi is CRITICAL: Connection refused by host [15:25:52] oh [15:25:53] don't care [15:26:03] yeah sorry wasn't clear [15:26:09] likely those boxes will just get decommissioned in a month or 2 ;) [15:27:35] PROBLEM - Disk space on erzurumi is CRITICAL: Connection refused by host [15:28:05] PROBLEM - DPKG on erzurumi is CRITICAL: NRPE: Command check_dpkg not defined [15:28:12] (03PS1) 10Demon: Turn HTTPs on by default for beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80027 [15:28:30] (03CR) 10jenkins-bot: [V: 04-1] Turn HTTPs on by default for beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80027 (owner: 10Demon) [15:29:27] that is me on erzurumi [15:29:31] hrmmm, can't find what i was thinking of [15:29:32] ignore those [15:29:32] (03CR) 10Demon: "recheck" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80027 (owner: 10Demon) [15:29:43] ^d, any ideas how HTTPS for everyone will be implemented, 
in PHP or with Varnish rules? [15:31:21] <^d> It's done in PHP. There's a place in Wiki.php where if you're required to be on HTTPS but are on HTTP it'll OutputPage::redirect() you. [15:32:43] (03PS3) 10Akosiaris: Refactoring nrpe module (round 2/??) [operations/puppet] - 10https://gerrit.wikimedia.org/r/79329 [15:33:07] MaxSem: http://news.netcraft.com/archives/2013/06/25/ssl-intercepted-today-decrypted-tomorrow.html [15:33:31] "There is a defence against this, known as perfect forward secrecy (PFS). When PFS is used, the compromise of an SSL site's private key does not necessarily reveal the secrets of past private communication; connections to SSL sites which use PFS have a per-session key which is not revealed if the long-term private key is compromised. The security of PFS depends on both parties discarding the shared secret after the transaction is complete (or after a reasonable period to allow for session resumption)." [15:33:45] If someone has a free minute on mchenry, input on https://bugzilla.wikimedia.org/show_bug.cgi?id=42774#c21 would be nice. [15:34:28] Is Wikimedia planning on using PFS? [15:34:47] (03PS1) 10Cmjohnson: removing sq41 role/cache & ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/80028 [15:34:48] I guess we are, but the question is when [15:34:56] better ask Ryan:) [15:36:05] ^d, by forcing User::requiresHTTPS() to true? [15:36:08] (03CR) 10Cmjohnson: [C: 032 V: 032] "goodbye sq41" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80028 (owner: 10Cmjohnson) [15:36:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:36:45] (03PS4) 10Akosiaris: Refactoring nrpe module (round 2/??)
[operations/puppet] - 10https://gerrit.wikimedia.org/r/79329 [15:37:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [15:38:05] RECOVERY - DPKG on erzurumi is OK: All packages OK [15:38:15] RECOVERY - RAID on erzurumi is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [15:38:35] RECOVERY - Disk space on erzurumi is OK: DISK OK [15:38:56] still running hardy, yuck [15:39:07] (03PS1) 10Faidon: exim: switch sodium non-list mail to its own IP [operations/puppet] - 10https://gerrit.wikimedia.org/r/80029 [15:39:50] heya paravoid, how do I properly install multiple init scripts with a single binary package? [15:40:01] (03PS2) 10Faidon: exim: switch sodium non-list mail to its own IP [operations/puppet] - 10https://gerrit.wikimedia.org/r/80029 [15:40:04] ottomata: thas for the kafka thingy ? [15:40:11] that's* [15:40:17] yeah [15:40:36] dh_installinit seems to only work with one [15:40:40] per binary [15:40:46] no, it has arguments [15:40:50] you need to override its call [15:40:59] and call dh_installinit --whatever the option is [15:40:59] hmm, was reading the man, didn't see that….reading harder [15:41:07] you need to call it twice [15:41:21] I was wondering about that. is it a different process ? Or the same with some different args ? [15:41:25] --name=kafka-mirror [15:41:29] different process [15:41:30] and --name=kafka I guess [15:41:46] but having one init script spawning two daemons is not unheard of [15:41:50] its completely different, it fires up multiple consumers that feed into a single producer [15:42:07] (03CR) 10Faidon: [C: 032] exim: switch sodium non-list mail to its own IP [operations/puppet] - 10https://gerrit.wikimedia.org/r/80029 (owner: 10Faidon) [15:42:35] ok [15:42:52] <^d> MaxSem: Yep, basically. [15:42:57] i can't think of an init script i've seen that is used to spawn multiple processes [15:43:17] how would puppet deal with that? 
we'd have to manually set all the start, stop, restart, etc. commands? [15:43:20] so a kafka mirror-node will run both that process and the regular one... ok [15:43:27] maybe? [15:43:28] <^d> That function could later be extended to add per-group support or somesuch, like if we want to force Oversighters or something to use HTTPS. I think Tyler's going to look at that. [15:43:28] or maybe not [15:43:31] it's completely separate [15:43:41] it could run anywhere [15:43:53] (03Merged) 10jenkins-bot: Turn HTTPs on by default for beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80027 (owner: 10Demon) [15:44:22] ottomata: openvpn spawns multiple processes one per config [15:44:34] ^d, cool, thanks - I'll need to whack in a hook for mobile/zero users who are unable to use the site via HTTPS [15:44:54] <^d> Mmk. Feel free to toss me on the review list for such a thing. [15:44:54] hm, but that is not manual, right? it looks at the configs and starts them all [15:45:04] this would be specifying which process you want to start [15:45:05] apergos: err: /Stage[main]/Base::Puppet/File[/etc/snmp/snmp.conf]/ensure: change from absent to present failed: Could not set 'present on ensure: No such file or directory - /etc/snmp/snmp.conf.puppettmp_7199 at /etc/puppet/manifests/base.pp:107 [15:45:12] a broker, or a mirror maker [15:45:15] so [15:45:20] apergos: that's puppet language for "/etc/snmp doesn't exist" [15:45:32] so you are suggesting something like [15:45:40] what host is that? [15:45:55] /etc/init.d/kafka mirror start [15:45:55] /etc/init.d/kafka broker start [15:45:55] etc. ? [15:45:59] paravoid: [15:47:06] RECOVERY - mailman on sodium is OK: PROCS OK: 10 processes with args mailman [15:47:11] yes? [15:47:16] what host is that?
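The single-script pattern being debated above (`/etc/init.d/kafka mirror start` vs. `/etc/init.d/kafka broker start`) amounts to a dispatcher that validates a service name plus an action. A hypothetical Python sketch of just that argument handling (real init scripts are shell, and the daemon paths below are assumptions, not the real kafka package layout):

```python
# Hypothetical service map; the paths are illustrative assumptions.
SERVICES = {"broker": "/usr/sbin/kafka-server", "mirror": "/usr/sbin/kafka-mirror"}
ACTIONS = ("start", "stop", "restart", "status")

def parse_invocation(argv):
    """Validate `kafka <service> <action>` and return (service, action)."""
    if len(argv) != 3 or argv[1] not in SERVICES or argv[2] not in ACTIONS:
        raise SystemExit("usage: kafka {broker|mirror} {start|stop|restart|status}")
    return argv[1], argv[2]

assert parse_invocation(["kafka", "mirror", "start"]) == ("mirror", "start")
```

The puppet concern raised above is real: a multiplexed script means every service resource needs explicit start/stop/status commands, which is why the separate init scripts installed via `dh_installinit --name=...` (as suggested earlier in the channel) stay simpler.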
[15:47:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:27] sodium [15:47:36] RECOVERY - spamassassin on sodium is OK: PROCS OK: 4 processes with args spamd [15:47:43] but the issue is /etc/snmp doesn't exist [15:48:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [15:49:49] I wonder why it would not have it and other systems would... a package that creates it, perhaps? I'll have a look [15:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [15:57:53] (03PS4) 10Faidon: exim: add DKIM for wikimedia.org domains [operations/puppet] - 10https://gerrit.wikimedia.org/r/79754 [16:00:03] bah. lucid version of package doesn't provide the directory. meehh [16:07:39] (03PS1) 10ArielGlenn: lucid libsnmp doesn't create /etc/snmp so we do. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80033 [16:09:57] (03CR) 10ArielGlenn: [C: 032] lucid libsnmp doesn't create /etc/snmp so we do. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80033 (owner: 10ArielGlenn) [16:21:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.209 second response time [16:23:22] <^d> MaxSem: Merged your hook addition. Do you want me to stage it for merge to wmf branches with other https stuff for tomorrow? [16:24:54] ^d, thanks - I was intending it for HTTPS for everyone as mobile already uses secure login [16:25:27] so probably no hurry needed unless we want to move everyone to HTTPS within 1 week:) [16:25:45] <^d> Easy enough to do it now so we don't have to think about it later.
[16:26:31] sounds good then:) [16:26:44] bleh [16:26:53] gerrit really should have mid-air collision detection :( [16:27:04] (re: that hook) [16:28:02] (03PS1) 10CSteipp: Allow autoconfirmed to propose Consumers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80039 [16:36:09] (03PS3) 10Ottomata: Installing kafka-mirror init.d and default scripts. [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 [16:37:06] paravoid ^ [16:40:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:41:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [16:42:06] (03PS4) 10Petr Onderka: Implemented diff dumps [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/79808 [16:43:45] (03PS5) 10Petr Onderka: Implemented diff dumps [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/79808 [16:49:09] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours [16:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [16:55:09] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: No successful Puppet run in the last 10 hours [17:02:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [17:16:58] (03PS1) 10ArielGlenn: account for hosts where every disk is raid 0 (e.g. 
the ms-be hosts) [operations/puppet] - 10https://gerrit.wikimedia.org/r/80055 [17:21:39] binasher: dberror.log is useless with spam atm [17:21:57] (03PS3) 10Bsitu: Enable Echo and Thanks on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 [17:22:07] Aaron|home: yay spam [17:22:18] mostly the same 2 errors [17:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:30] oh, those [17:22:32] yeah [17:23:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [17:24:57] apergos: are you on it? [17:25:06] no, I"m on dinner [17:25:22] sorry, but it is after my 12 hours a day tick mark... [17:26:37] oh, no worry. get offline apergos! [17:26:41] gone! [17:27:02] grr EducationProgram extension [17:27:04] enwiki IndexPager::buildQueryInfo (EducationProgram\RevisionPager) 10.64.16.32 1176 Key 'rev_time' doesn't exist in table 'ep_revisions' (10.64.16.32) SELECT rev_id,rev_object_id,rev_object_identifier,rev_user_id,rev_type,rev_comment,rev_user_text,rev_minor_edit,rev_time,rev_deleted,rev_data FROM `ep_revisions` FORCE INDEX (rev_time) WHERE rev_type = 'EPCourses' AND rev_object_id = '125' ORDER BY rev_time [17:27:06] LIMIT 51 [17:27:39] (03PS4) 10Bsitu: Enable Echo and Thanks on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 [17:31:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:08] are we still using EducationProgram? 
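Triaging a dberror.log that is "useless with spam" from mostly the same two errors, as described above, is easier once lines are collapsed into signatures. A rough sketch, assuming free-form error lines like the EducationProgram one quoted in the channel:

```python
import re
from collections import Counter

def error_signatures(lines):
    """Collapse dberror-style lines into signatures by masking numbers
    and quoted values, so repeats of one error group together."""
    mask = lambda line: re.sub(r"\d+|'[^']*'", "_", line)
    return Counter(mask(line) for line in lines)

log = [
    "Key 'rev_time' doesn't exist in table 'ep_revisions' (10.64.16.32)",
    "Key 'rev_time' doesn't exist in table 'ep_revisions' (10.64.16.33)",
    "Lock wait timeout exceeded",
]
signature, count = error_signatures(log).most_common(1)[0]
assert count == 2
```

Sorting signatures by count surfaces the handful of distinct errors hiding behind the flood.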
[17:32:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [17:43:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [17:52:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [18:12:42] Ryan_Lane: Around? [18:21:17] bblack: noticing that showing cumulative counters in vhtcpd ganglia is hard to read. Should I make a new patch that saves deltas vs last poll instead? [18:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [18:23:43] running scap in a second [18:24:29] (03CR) 10Anomie: [C: 031] Allow autoconfirmed to propose Consumers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80039 (owner: 10CSteipp) [18:25:32] bsitu: if you get done with your deployment early, it would be great if we could sneak in https://gerrit.wikimedia.org/r/#/c/80039/ [18:25:51] !log updated Parsoid to f359548f04e739 [18:25:56] Logged the message, Master [18:26:04] anomie: sure [18:26:11] heh, this is such a "sneak in" kind of day... [18:26:29] anomie: I have one more config change to deploy, but that should be quick [18:31:16] (03PS1) 10Eloquence: Added new public key for myself. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80073 [18:34:58] !log bsitu Started syncing Wikimedia installation... 
: Update Echo to master [18:35:04] Logged the message, Master [18:36:19] (03CR) 10Dzahn: [C: 032] Added new public key for myself. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80073 (owner: 10Eloquence) [18:40:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [18:41:48] (03PS1) 10Eloquence: Remove redundant key comment, fix key type [operations/puppet] - 10https://gerrit.wikimedia.org/r/80075 [18:44:26] (03CR) 10Bsitu: [C: 032] Enable Echo and Thanks on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 (owner: 10Bsitu) [18:44:38] (03Merged) 10jenkins-bot: Enable Echo and Thanks on fr, hu, pt, pl and sv wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79956 (owner: 10Bsitu) [18:45:10] (03CR) 10Dzahn: [C: 032] "yep, Erik sat next to me, also checked for old keys on all hosts using salt" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80075 (owner: 10Eloquence) [18:52:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [18:57:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:58:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [19:06:52] mw1046: rsync: mkstemp "/usr/local/apache/common-local/php-1.22wmf13/languages/messages/.MessagesKsh.php.9TJcvt" failed: Read-only file system (30) [19:07:05] a lot of this kind of error on mw1046 [19:07:11] !log bsitu Finished syncing Wikimedia installation... 
: Update Echo to master [19:07:16] Logged the message, Master [19:07:48] (03PS1) 10Demon: test2wiki to secure login [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80078 [19:09:30] anomie: ping [19:09:37] bsitu: pong [19:09:46] I am pushing out initialiseSetting.php [19:09:51] 'wgSecureLogin' => array( [19:09:52] - 'default' => false, [19:09:52] + 'default' => true, [19:09:52] 'loginwiki' => true, [19:10:16] is this your change? [19:10:46] No, mine is a change to CommonSettings.php, changing emailconfirmed (which hasn't existed since 2008) to autoconfirmed [19:11:30] ^d: Is that wgSecureLogin change yours? [19:11:38] bsitu: that change to InitialiseSettings should not go out until tomorrow ^d ^^^ [19:11:50] <^d> That should've been initialisesettings-labs. [19:12:28] oops, [19:12:29] yes [19:12:34] it's in the labs file [19:12:38] <^d> :) [19:12:45] whew [19:12:45] false alarm, sorry [19:12:51] <^d> Harmless for prod, feel free to sync :) [19:12:51] hah [19:13:43] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [19:15:06] ^d: I will sync it if it's harmless [19:15:47] if in labs, yeah, harmless [19:17:28] !log bsitu synchronized echowikis.dblist 'Add fr, hu, pt, pl and sv to Echo dblist' [19:17:33] Logged the message, Master [19:18:34] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Enable Echo and Thanks on fr, hu, pt, pl and sv wiki' [19:18:39] Logged the message, Master [19:19:25] greg-g: yeah, it's for beta-lab [19:19:33] !log bsitu synchronized wmf-config/InitialiseSettings-labs.php 'Turn HTTPs on by default for beta' [19:19:39] Logged the message, Master [19:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [19:24:13] anomie: I am done with the deploy now [19:24:21] bsitu: ok! 
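The rsync failures on mw1046 above ("Read-only file system") only surfaced mid-sync; a target host could be checked up front instead. A small sketch using the ST_RDONLY flag from statvfs (illustrative only, not part of scap):

```python
import os

def is_readonly(path):
    """True if the filesystem holding `path` is mounted read-only
    (ST_RDONLY flag from statvfs(3))."""
    return bool(os.statvfs(path).f_flag & os.ST_RDONLY)

# A writable scratch directory reports False; mw1046's root filesystem
# would have reported True before the sync started.
assert is_readonly("/tmp") is False
```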
[19:24:31] (03CR) 10Anomie: [C: 032] Allow autoconfirmed to propose Consumers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80039 (owner: 10CSteipp) [19:25:53] (03Merged) 10jenkins-bot: Allow autoconfirmed to propose Consumers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80039 (owner: 10CSteipp) [19:27:34] !log anomie synchronized wmf-config/CommonSettings.php 'Fix OAuth rights assignments' [19:27:39] Logged the message, Master [19:30:59] bsitu: Thanks [19:31:08] (03CR) 10Reedy: "Yay" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80039 (owner: 10CSteipp) [19:31:27] anomie: np, glad that you can utilize the window [19:31:42] (03CR) 10Greg Grossmeier: "Ping others on the review list." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/76481 (owner: 10Aude) [19:33:23] yeah, thanks bsitu [19:33:27] and thanks anomie [19:33:52] * ^d twiddles thumbs [19:33:53] "I just wanna say thanks to my mom, my dad, my sister, and all those who stood behind me on the way to this deployment." [19:33:56] <^d> lotsa code to sync. [19:36:23] * ^d prays to the scap gods [19:36:35] ^d: you calleD? [19:40:13] !log demon Started syncing Wikimedia installation... : [19:40:19] Logged the message, Master [19:45:56] notpeter / binasher: either of you got a sec? i need the LOCK TABLES privilege granted to user 'eventlog' on database 'log' on db1047. [19:46:34] <^d> If someone has a chance, scap & friends have been complaining about mw1046 having a r/o filesystem [19:47:04] hmph [19:47:13] the HTTPS warning banner can't be dismissed. [19:47:28] ^d: will take a look [19:47:32] ^d: your fault? :D ^ [19:47:48] Hi. I just read the centralnotice about https being made compulsory. I have a (nooby) question: for the same article, what would be the difference in the data transfer between http and https? (I'm assuming https will require more, but how much more?) [19:48:09] <^d> MatmaRex: About the banner?
No, I had nothing to do with that. [19:48:17] greg-g: ^^^ [19:48:24] Sid-G: It's relatively minimal overhead [19:48:33] ori-l: notpeter doesn't work here any more (technically does, but he's at burning man and his last day is also during burning man)… what's the use case for lock tables though? [19:49:06] Reedy: define "relatively minimal" [19:49:20] Not much [19:49:34] it's an extra couple of kbyte [19:49:42] Reedy: so, no statistics? [19:49:50] the biggest overhead IMHO for many will be the extra RTT during the SSL setup [19:49:54] Sid-G: http://stackoverflow.com/questions/548029/how-much-overhead-does-ssl-impose [19:49:56] http://stackoverflow.com/questions/548029/how-much-overhead-does-ssl-impose [19:49:59] gah [19:50:07] * Sid-G looks [19:50:09] * greg-g shakes fist at Reedy  [19:50:12] Sid-G: Order of magnitude: zero. [19:50:45] Sid-G: The majority of people won't notice the difference [19:51:12] greg-g: the HTTPS warning banner can't be dismissed, could you look into it? [19:51:23] About 10,800,000 results [19:51:29] That'll keep you in reading for a while [19:51:38] MatmaRex: "It's a banner!" [19:51:45] PROBLEM - Puppet freshness on virt2 is CRITICAL: No successful Puppet run in the last 10 hours [19:52:01] Almost purposeful [19:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [19:53:22] the centralnotice banner on en-wp seems to be hideable for me [19:53:42] * Reedy clicks [Hide] [19:53:54] MatmaRex: I have no idea how those work, basile set it up for me [19:54:05] MatmaRex: RESOLVED WORKSFORME [19:54:45] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [19:56:41] ok, so any gadgets using cookies will have to use the secure attribute in the cookies now? 
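The closing question above, about gadgets needing the secure attribute on their cookies once the site is HTTPS-only, can be illustrated with a small sketch. This is an assumption-laden toy using Python's standard library rather than the JavaScript gadget code actually being discussed, and the cookie name `gadget-prefs` is made up:

```python
# Minimal sketch (not gadget code): a cookie marked Secure is only
# sent by browsers over HTTPS connections, which is what the question
# above is about. The cookie name "gadget-prefs" is hypothetical.
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["gadget-prefs"] = "dark-mode"
cookie["gadget-prefs"]["secure"] = True   # never transmitted over plain HTTP
cookie["gadget-prefs"]["path"] = "/"

header = cookie.output(header="Set-Cookie:")
print(header)
```

Without the Secure flag, a cookie set by a gadget on an HTTPS page would still be sent along with any plain-HTTP request to the same domain.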
[19:58:19] mw1046 is depooled from pybal [19:59:34] binasher: i need to truncate a table that predates the current pruning policy but which continues to get inserts at a rapid clip [20:02:09] Reedy: try visiting some pages after you dismiss it [20:02:14] !log demon Started syncing Wikimedia installation... : [20:02:17] i dismissed it twice alreedy [20:02:52] It can be dismissed [20:02:54] !log granted "lock tables" to eventlog user on db1047 [20:02:56] It just doesn't stay dismissed [20:02:56] ori-l: ^^ [20:02:59] Logged the message, Master [20:03:17] binasher: thanks -- much obliged. didn't realize it was peter's last day already :( [20:03:27] Reedy…: [20:04:43] mutante: random question re RT: why do I always get the XSS warning when I do things? [20:05:18] (03PS1) 10Edenhill: Added support for libyajl version 1.x [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/80127 [20:10:23] greg-g: i have seen those in the past right after the upgrade before some http->https redirects were fixed but not since, something related to http/https? httpseverywhere? [20:11:05] I do use httpseverywhere... maybe [20:11:15] I mean, it 'works' I just have to do an extra click :/ [20:12:25] i know which error page you mean yea, i have seen it, but not anymore since some fix quite a while ago..hmm [20:12:37] who can i complain to about the broken banners? [20:12:57] i don't want to hide them with site css [20:13:44] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [20:14:56] actually, i just did that, so whatever. [20:16:47] greg-g: since it moved from old server to new server? local browser cache? 
it was on streber but now magnesium, but resolves to IP, 208.80.154.5 [20:17:02] <^demon> poor mw1046 :\ [20:17:28] mutante: good ideas, will futz in a bit, thanks, just making sure it wasn't just me ;) [20:17:31] it's got the hd errors [20:17:34] and a ticket [20:17:40] ( ^demon ) [20:17:42] apergos: that wasn't long enough to sleep [20:17:50] I didn't sleep, I went off in search of food [20:18:09] it was a longer search than usual, turned into a movie [20:18:09] <^demon> apergos: Would be nice if it was out of rotation then. [20:18:11] ah, then that's an ok amount of time to eat and such [20:18:14] <^demon> scap & friends complain :\ [20:18:27] ^demon, is your scap still running? [20:18:38] <^demon> Yeah, I disconnected and was stupidly not in a screen [20:19:48] * apergos wonders why we don't keep some small number of the right kind of disks on site (but is not enough of a hardware person to answer that question) [20:20:06] I guess dell is supposed to overnight them or something [20:20:44] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [20:21:13] those jenkins build failures one gets that are unrelated to the actual patch content, "Could not generate documentation: Definition 'nrpe::monitor_service' is already defined at /srv/org/wikimedia/doc/puppetsource/modules/nrpe/spec/fixtures/modules/nrpe/manifests/monitor_service.pp:21; cannot be redefined at /srv/org/wikimedia/doc/puppetsource/modules/nrpe/manifests/monitor_service.pp:21" , should i create a bug for them? 
and would you say [20:21:39] https://integration.wikimedia.org/ci/job/operations-puppet-doc/2145/console [20:22:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:18] that's bad that both paths can be traversed to the same files [20:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [20:23:27] and are then counted as separate [20:23:41] <^demon> apergos: Can mw1046 at least be removed from the dsh group for now? [20:24:29] as long as folks remember to put it back in when it comes back up, I guess [20:24:31] ...or can scap be tweaked to use salt instead of dsh?:P [20:24:32] beware, dsh groups are in private puppet [20:24:37] !log demon Finished syncing Wikimedia installation... : [20:24:42] Logged the message, Master [20:24:44] but might have to double check if it actually writes them [20:24:50] mutante, o rly? [20:25:03] eh, not private, public, but yeah, puppet [20:25:14] ;) [20:25:18] /puppet/files/dsh/ [20:25:21] ah I did fix the one dsh issue (with /etc/dsh not existing on some hosts), that won't affect scaps though, those hosts already had it of course [20:25:36] if that was manual it might be overwritten by puppet [20:25:48] I fixed it in puppet [20:25:52] ^demon, is that all?:) [20:26:17] <^demon> MaxSem: Just one sync-file, sec. [20:26:24] (03CR) 10Demon: [C: 032] test2wiki to secure login [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80078 (owner: 10Demon) [20:27:00] (03Merged) 10jenkins-bot: test2wiki to secure login [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80078 (owner: 10Demon) [20:28:15] !log demon synchronized wmf-config/InitialiseSettings.php 'test2wiki to secure login' [20:28:20] Logged the message, Master [20:29:17] <^demon> MaxSem: I'm done. [20:29:23] thanks!:) [21:16:46] scapping... 
[21:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:23:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.161 second response time [21:25:42] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [21:25:47] Logged the message, Master [21:31:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:32:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [21:50:11] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [21:50:17] Logged the message, Master [21:53:01] (03PS1) 10Asher: adding virt1 to decom [rt 5472] [operations/puppet] - 10https://gerrit.wikimedia.org/r/80142 [21:55:11] (03CR) 10Asher: [C: 032 V: 032] adding virt1 to decom [rt 5472] [operations/puppet] - 10https://gerrit.wikimedia.org/r/80142 (owner: 10Asher) [21:57:37] (03PS1) 10Dzahn: remove sq41, decom'ed per RT #5618 [operations/puppet] - 10https://gerrit.wikimedia.org/r/80143 [21:59:08] (03CR) 10Dzahn: [C: 032] "already in decom.pp in puppet and disabled in pybal" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80143 (owner: 10Dzahn) [21:59:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.167 second response time [22:02:02] (03PS1) 10Dr0ptp4kt: Adding carrier for baselining. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80144 [22:06:36] (03CR) 10Dr0ptp4kt: "Mark, Faidon, Asher: this is okay for deployment, provided your approval." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/80144 (owner: 10Dr0ptp4kt) [22:07:08] (03PS8) 10Yuvipanda: Route requests based on data from Redis [operations/puppet] - 10https://gerrit.wikimedia.org/r/78025 [22:07:11] http://icannwiki.com/index.php/All_New_gTLD_Applications [22:07:22] http://icannwiki.com/index.php/.mcd [22:10:47] anyone want to merge my key update? :) https://gerrit.wikimedia.org/r/#/c/79304/ [22:14:55] (03CR) 10Faidon: [C: 032] Adding carrier for baselining. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80144 (owner: 10Dr0ptp4kt) [22:15:28] Aaron|home: ack for switching ceph masters? [22:15:32] (not now) [22:15:39] er, filebackend masters [22:15:56] ack? [22:16:07] paravoid, thx for the review and +2 [22:16:11] well, syn [22:16:39] you wanted to check something yesterday but didn't have your key [22:17:28] oh, yeah, I didn't see anything crazy [22:17:49] oh paravoid you are here! [22:18:00] not for long [22:18:02] so...... I was wondering..... [22:18:03] dangit [22:18:04] what's up [22:18:05] paravoid doesn't sleep [22:18:37] basically: does the ability to give a user http vs https based on their IP need to be a change in the DNS level? Varnish? [22:18:59] redirects you mean? [22:19:09] mediawiki [22:19:23] csteipp: ^^ [22:19:51] (03CR) 10Dzahn: "could you make the old one "ensure =>" and add the new one additionally? yeah, i don't know if we want to keep all keys until the end of t" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79304 (owner: 10Jalexander) [22:20:21] paravoid: so nothing needed on your end to do it, something we need to do in our codebase that reads the varnish header or something... [22:20:24] thanks mutante [22:20:32] yes [22:20:41] * greg-g is just thinking outloud [22:20:54] well then, that's 3 different responses to how to get this done ;) [22:21:05] paravoid: Do you know if the country database is available to the apache servers? 
take x-forwarded-proto, ip and geoip database as input, possibly return a 302/301 with vary: x-forwarded-proto [22:21:31] didn't Tim say mediawiki already does geoip queries? [22:22:23] see Tim's mail to engineering@ [22:22:34] Hmm.. Anyone from CentralNotice around? [22:22:34] from Aug 13th [22:22:38] so i figured that it probably doesn't need a change in the geoip database, the IPs for China/Iran/etc are still the same, but the decision when to redirect is different [22:22:51] so would that rather be /puppet/templates/varnish$ vi geoip.inc.vcl.erb [22:22:57] no [22:23:19] that's for geolookup.wikimedia.org, which we use from javascript [22:23:27] mediawiki can do queries via php [22:23:42] there's a php extension, php5-geoip, that links with libgeoip [22:23:49] aha [22:23:50] and presumably the databases are already in the system [22:24:05] we have modules/geoip to install all kinds of different databases [22:24:11] the free one, the proprietary one etc. [22:24:11] * greg-g nods [22:24:15] hah [22:24:26] country or city level, v4, v6 [22:24:36] regions, as numbers, you name it [22:25:03] makes me wonder if China's SRAs (Hong Kong, Macau) are already a different country or you'd have to exclude again on city level [22:25:51] * ksnider wishes for an "X-Censored" header [22:25:58] they have different iso-3166-2 codes, so yes [22:26:43] er, -1 even [22:28:07] manifests/role/applicationserver.pp includes misc.pp [22:28:44] which in turn includes the proprietary database in production and the (free GeoLite) .deb in labs [22:29:12] so, as Tim said, the foundations are there [22:29:50] btw I think geoip isn't enough, I think we'll need custom IP blocks as well [22:30:07] s/custom/arbitrary/ [22:33:16] paravoid: like what? [22:33:37] what kind of arbitrary blocks are you thinking? [22:34:23] i'm in china where https is blocked. [22:34:33] what should i do? 
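The redirect logic paravoid sketches above (take X-Forwarded-Proto and a GeoIP country lookup as input, possibly answer a 302 with `Vary: X-Forwarded-Proto`) can be written down as a minimal sketch. Everything specific here is an assumption: the exempt-country set and the stubbed lookup table are illustrative stand-ins, not Wikimedia's actual policy or GeoIP data (124.66.15.78 is the Chinese address that appears later in this log):

```python
# Hedged sketch only: HTTPS_EXEMPT_COUNTRIES and the stubbed lookup
# table are illustrative, not the real policy or database.
HTTPS_EXEMPT_COUNTRIES = {"CN", "IR"}

def lookup_country(ip):
    # Stand-in for a real GeoIP query (e.g. via php5-geoip/libgeoip
    # on the apaches, as discussed above).
    return {"124.66.15.78": "CN"}.get(ip, "US")

def https_redirect(ip, x_forwarded_proto, location="https://en.wikipedia.org/"):
    """Return (status, headers) for a redirect, or None to serve as-is."""
    if x_forwarded_proto == "https":
        return None  # already on HTTPS
    if lookup_country(ip) in HTTPS_EXEMPT_COUNTRIES:
        return None  # HTTPS is blocked for this user; leave them on HTTP
    # Vary on the proto header so caches keep HTTP/HTTPS answers apart.
    return (302, {"Location": location, "Vary": "X-Forwarded-Proto"})

print(https_redirect("124.66.15.78", "http"))  # None: exempt country
print(https_redirect("192.0.2.1", "http"))     # 302 redirect with Vary
```

The `Vary: X-Forwarded-Proto` header matters precisely because the decision sits behind caching layers: without it, a cached redirect could be served to a request that was already on HTTPS.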
[22:34:55] (03PS2) 10Jalexander: Replace public key for jamesofur [operations/puppet] - 10https://gerrit.wikimedia.org/r/79304 [22:35:48] (03PS1) 10BryanDavis: Add *_delta stats for vhtcpd ganglia. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80151 [22:36:05] Anna_Frodesiak: What should you do for what? [22:36:31] hello Anna_Frodesiak. A) chinese language wikis will be excluded from the HTTPS requirement and B) it is our understanding that chinese users are able to use the https://login.wikimedia.org address to login currently, yes? [22:36:33] greg-g: what if there's a large well known address range that's not in the database? [22:36:33] any url starting with https is blocked here [22:36:45] HELLO was the use of content encoding impossible? [22:36:55] bd808: I had a look at your ganglia plugin as I wrote my own for gdnsd [22:36:56] i use enwp [22:37:05] Anna_Frodesiak: that isn't my understanding, can you confirm you can access that url I just typed out? [22:37:13] bd808: have a look if you're feeling up to it [22:37:21] paravoid: I hope it helped more than it hurt [22:37:29] I rewrote it [22:37:39] and simplified it a lot [22:37:57] yes i can access https://login.wikimedia.org/ [22:38:00] paravoid: I didn't know that was the case, but ok. [22:38:17] should i logout first? [22:38:19] greg-g: there isn't, as far as we know, yet [22:38:31] Anna_Frodesiak: great! then you'll be able to login tomorrow. Right now we're working on a solution for your situation where you reside in China but want to participate on English Wikipedia [22:38:50] Anna_Frodesiak: no, that's ok, just wanted to confirm you could access that site. Thank you very much. [22:39:17] i logged in fine [22:39:20] then i went to https://www.wikipedia.org/ [22:39:24] and i'm blocked [22:39:40] right [22:39:47] thanks for confirming that. [22:39:53] wait, what? [22:40:00] www is blocked but login isn't? [22:40:07] correct [22:40:21] do you want me to do it again to confirm? 
i can try to go to different wmf projects after login to see [22:41:48] s'ok, that's what I've heard [22:41:54] paravoid: weird huh :) [22:42:07] so i'm an admin at enwp. will i be booted forever? [22:43:23] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:43:47] Anna_Frodesiak: i'm curious, can you give me the ip address you get for login.wikimedia.org and www.wikipedia.org ? [22:44:16] how do i do that? [22:44:24] which os are you using ? [22:44:49] chrome [22:44:55] oh windows xp [22:45:17] http://www.whatismyip.com/ [22:45:18] this should work - http://www.rackspace.com/knowledge_center/article/nslookup-checking-dns-records-on-windows [22:45:29] MaxSem: i'm more curious about the dns resolution, not her ip in particular [22:45:34] oh you want my ip ok [22:45:35] eh, right:) [22:45:38] maybe one ip is blocked on port 443 and one isn't [22:45:51] what's to check leslie? [22:45:54] we know our IPs [22:46:00] i was curious what she was getting [22:46:06] maybe there's something really weird going on [22:46:17] 124.66.15.78 [22:46:35] like bad dns entries propagated by the isp [22:46:59] anyway, gnight! [22:47:31] Anna_Frodesiak: can you do http://www.rackspace.com/knowledge_center/article/nslookup-checking-dns-records-on-windows and do "nslookup login.wikimedia.org" and "nslookup www.wikipedia.org" ? [22:47:43] ok standby... [22:47:46] thank you [22:48:12] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [22:48:27] Run cmd and type "ipconfig /all" Then enter key [22:48:40] LeslieCarr: Sounds like a good reason to do a tech trip to China [22:48:54] Then post the output of the window [22:49:01] root______: Why? [22:49:35] root______: that doesn't help [22:49:50] i'm curious if incorrect ip's are being served [22:50:23] do i have to go to the windows commandline? 
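LeslieCarr's nslookup request above is essentially a DNS-tampering check: compare what the user's resolver returns against the known service address. A tiny sketch of that comparison, under assumptions: the expected address is the wikipedia-lb one that appears in the nslookup output pasted further down in this log (208.80.154.225), and the function name is made up:

```python
# Known-good address for www.wikipedia.org at the time, taken from the
# nslookup output pasted later in this log.
EXPECTED_ADDRS = {"208.80.154.225"}

def dns_looks_tampered(resolved_addrs):
    """True if none of the resolver's answers match a known-good IP."""
    return not (set(resolved_addrs) & EXPECTED_ADDRS)

# Anna's resolver returned the correct address, which points to blocking
# at the connection level (port 443) rather than DNS poisoning.
print(dns_looks_tampered(["208.80.154.225"]))  # False
print(dns_looks_tampered(["10.10.10.10"]))     # True
```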
it's probably just being blocked by the great firewall, but it's possible … [22:50:32] Anna_Frodesiak: Yup [22:50:35] That way you'd have the entire configuration and wouldn't need to ask for further info. [22:50:48] This is the reason [22:50:49] ok that doesn't let me copy paste for some reason [22:50:51] zh-vpn.wikimedia.org ?:p [22:51:04] root______: and how am i supposed to find out what her isp's dns servers are returning ? i don't want to know her server ip's, i want to know what her isp is returning [22:51:42] i type in manually nslookup -type=A www.wikipedia.org [22:51:46] is that right? [22:51:55] yep, that is correct [22:52:00] ok standby [22:52:00] Anna_Frodesiak: Left click in the top left corner, Edit -> Select all [22:52:02] Then press enter [22:53:57] ok here it is: [22:54:04] C:\>nslookup -type=A www.wikipedia.org [22:54:04] Server: dns1.hi169.net [22:54:04] Address: 221.11.132.2 [22:54:05] Non-authoritative answer: [22:54:05] Name: wikipedia-lb.eqiad.wikimedia.org [22:54:05] Address: 208.80.154.225 [22:54:05] Aliases: www.wikipedia.org, wikipedia-lb.wikimedia.org [22:54:24] cool, that is correct :) [22:54:52] so what's the bottom line? will i be unable to access enwp without https? [22:57:09] ? [22:57:33] Anna_Frodesiak: afaik you'll have to wait until tomorrow [22:57:57] < greg-g> Anna_Frodesiak: great! then you'll be able to login tomorrow. .. [22:58:28] i will be able to access enwp? [22:58:53] !log olivneh synchronized php-1.22wmf12/extensions/CoreEvents 'Updating CoreEvents to master (1/2)' [22:58:58] Logged the message, Master [22:59:18] !log olivneh synchronized php-1.22wmf13/extensions/CoreEvents 'Updating CoreEvents to master (2/2)' [22:59:23] Logged the message, Master [23:00:17] !log During sync-dir, SSH timeouts from srv281, mw1089, mw1173; rsync errors on mw1046 (RO fs; previously reported) [23:00:22] Logged the message, Master [23:00:47] thanks everyone for your help. 
i guess i'll just have to wait and find out, along with plenty of others [23:00:56] is mw1046 still in circulation? [23:01:30] ori-l, no [23:01:40] just still in a dsh group [23:01:48] where do you check? [23:02:04] by poking an op:) [23:02:27] we mortals can't look into pybal ourselves [23:02:34] Anna_Frodesiak: you should then be redirected to http based on your IP [23:04:11] it's not [23:04:17] and I think you can, MaxSem [23:04:19] grep mw1046 /h/w/conf/pybal/eqiad/apaches [23:05:52] (03PS1) 10Dzahn: remove mw1046 from dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/80158 [23:06:00] MaxSem: ^ [23:06:04] can we make this an ironclad rule? [23:06:27] leaving it in the dsh group after pulling it from circulation should be a violation of our norms [23:06:33] it's really stressful [23:06:37] mutante, thanks:) [23:06:56] especially because it's quite common for a number of hosts to be pulled, so i get the feeling sometimes that if i complain too loudly i just end up looking like a noob [23:07:02] Anna_Frodesiak: we are working on a solution so you can access enwp tomorrow right now. [23:07:21] not to mention the fact that the last two times i trained someone to deploy their reaction to the ssh timeouts was panic / terror [23:07:51] MaxSem: ori-l http://noc.wikimedia.org/pybal/eqiad/ [23:08:02] that too [23:08:29] where is /h/w/conf/pybal/eqiad/apaches ? [23:08:37] greg-g: thank you! [23:08:46] greg-g: what's the prognosis? [23:08:51] on fenari [23:09:11] Anna_Frodesiak: looking positive :) we have our lead dev on the problem right now [23:09:18] i don't think it's acceptable to leave it up to the deployer to scramble to figure out if the errors s/he is seeing are panic-worthy [23:09:21] splendid. thanks :) [23:09:30] by sshing to a different host or pulling up a web browser [23:09:55] (03CR) 10Dzahn: [C: 032] "deactivated in pybal. 
http://noc.wikimedia.org/pybal/eqiad/apaches" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80158 (owner: 10Dzahn) [23:09:57] i'm sorry to harp on this, but it really causes undue anxiety every time [23:10:20] anyways, and ops help w/ this would be appreciated. [23:10:22] Anna_Frodesiak: there's also some possibly helpful info at http://en.wikipedia.org/wiki/Internet_censorship_in_the_People's_Republic_of_China#Evasion [23:10:55] it suggests [[Tor (anonymity network)]] works if accessed via https [23:11:30] hard drive errors , August 20 (that's today) https://rt.wikimedia.org/Ticket/Display.html?id=5628 [23:12:19] paravoid: gdnsd plugin looks semi-familiar. :) I don't remember what plugin I "borrowed" most of those patterns from. [23:12:30] sorry about that. the great firewall booted me [23:12:59] mutante: it's not about turnaround time for the repair; i have no idea what sort of work is involved or how long it takes; i'm sure you guys do that well. it's really just about pulling it from the dsh group whenever you pull it from pybal, and ideally making it automated so it becomes a non-issue [23:13:26] superfluous errors help no one [23:13:41] Anna_Frodesiak: scratch that re. Tor - that's just to get the software. Situation looks complex :( [23:14:42] Bad idea to suggest illegality without fighting for freedom... 
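ori-l's request above (pull a host from the dsh groups whenever it is pulled from pybal, ideally automatically) amounts to a consistency check between two host lists, which could be run mechanically. A hedged sketch: parsing of the real pybal and dsh files is assumed away, the function name is made up, and the host data below just mirrors this log (mw1046 depooled, srv281 flaky):

```python
def stale_dsh_hosts(dsh_hosts, pybal_enabled):
    """Hosts still listed in a dsh group but not enabled in pybal.

    dsh_hosts: hostnames from e.g. the /puppet/files/dsh/ group files.
    pybal_enabled: hostname -> enabled flag, as one might scrape from
    http://noc.wikimedia.org/pybal/eqiad/ (parsing assumed, not shown).
    """
    return sorted(h for h in dsh_hosts if not pybal_enabled.get(h, False))

dsh = ["mw1045", "mw1046", "srv281"]       # mw1045 is an illustrative healthy host
pybal = {"mw1045": True, "mw1046": False}  # mw1046 depooled; srv281 absent entirely
print(stale_dsh_hosts(dsh, pybal))         # ['mw1046', 'srv281']
```

Hosts this check flags are exactly the ones that would otherwise produce the scary-but-harmless scap timeouts complained about above.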
[23:15:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:16:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.154 second response time [23:16:49] ori-l: nod, i see it in SAL, the depooling was an hour before next scap [23:19:17] people shouldn't panic though, it just skips the host [23:20:04] mutante, during scap, there was a shitload of messages from it, resulting in possibly useful messages being scrolled out [23:20:24] also, I have an impression that it slowed the scap [23:20:36] seems to have yesterday and today [23:20:44] (slowed the scap) [23:21:19] so, yes, please reduce the amount of unneeded error messages in scap, really not useful for our deployers [23:22:17] another fun problem is not to forget to add it back when fixed [23:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:53] deployers are also frequently wrestling with cache expiry, which is hard to reason about because it happens in so many levels (user's browser; user's proxy; bits; etc). an apache that is serving requests based on a stale codebase adds another complicating factor to worry about and reason through [23:23:17] oh wait, is this server just read-only and serving outdated php? 
RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [23:23:22] No, no, it's not [23:23:25] It's the opposite [23:23:25] wehw [23:23:26] it's not serving [23:23:32] It's down but still in the list to receive deployments [23:23:33] s/eh/he/ [23:23:39] gotcha, that's what I thought [23:23:46] Which, if anything, makes an outdated Apache scenario /less/ likely [23:23:50] but you can't tell that from the output [23:23:54] That's right [23:24:02] * greg-g nods [23:24:04] I'll clean up the dsh list based on the output of sync-file README before I deploy [23:24:11] In like a minute [23:24:35] There's a couple of fully down systems, and the readonly one [23:24:38] have you tested your README changes in labs? [23:25:28] added comment to 5628. sounds like it should be on a list [23:25:29] don't wanna deploy untested source [23:25:36] thanks mutante [23:25:45] thanks, sorry for ranting [23:28:54] Doing a sync-dir to wmf13 now, to push code out to the test wikis [23:29:10] mw1046 is throwing a bunch of scary-looking errors [23:29:29] greg-g: I'll test my VE changes in labs when labs's synchronization of VE changes stops being completely broken ;) [23:29:50] RoanKattouw: i just merged 80158, i guess puppet didn't get to remove them yet .. [23:30:06] RoanKattouw: fine, be that way :P [23:30:07] !g 80158 [23:30:08] https://gerrit.wikimedia.org/r/#q,80158,n,z [23:30:37] RoanKattouw: its hard disk died today [23:30:47] datacenter tech on it [23:31:33] mutante: Can I force-run puppet on tin to get the dsh list to update? [23:31:34] gotta still check the others? 
" srv281, mw1089, mw1173" [23:31:38] sync-dir is horrendously slow [23:31:57] Those are just SSH connection time-outs, I don't really care about those because they fail fast [23:31:59] RoanKattouw: yea, should work, well, i'm just expecting it does since the files are in puppet :) [23:32:00] 1046 is taking forever [23:32:14] hit Enter a couple times? [23:32:20] No, that doesn't work [23:32:22] rsync mkdir errors [23:32:38] !log Forcing puppet on tin to update dsh lists for mw1046 removal [23:32:39] running puppet on tin [23:32:41] ah, ok [23:32:42] Logged the message, Mr. Obvious [23:35:13] * RoanKattouw waits for puppet taking forever [23:35:36] want me to live fix them really quick? i dont care [23:35:37] Is the puppetmaster hideously overloaded again? [23:35:42] Yeah go for it [23:36:57] !log manually removed mw1046 from dsh groups mediawiki-installation,apaches,apaches-eqiad on tin [23:37:02] Logged the message, Master [23:37:07] nice how we have apache-eqiad AND apaches-eqiad [23:37:24] but the latter is empty [23:37:42] those are the groups you use, right [23:38:19] mediawiki-installation is the one that matters [23:38:20] Thanks man [23:38:37] !log Repeating previously aborted sync-dir of extensions/VisualEditor now that the dsh node list is fixed [23:38:38] mw-eqiad as well now.. [23:38:43] Logged the message, Mr. Obvious [23:38:44] we should have less groups :p [23:38:47] !log catrope synchronized php-1.22wmf13/extensions/VisualEditor 'Deploy new VE code to wmf13 first for testing' [23:38:52] np [23:38:52] Logged the message, Master [23:42:01] now to those timing out .. 
[23:42:06] !log powercycling srv281 [23:42:11] Logged the message, Master [23:45:02] !log powercycling frozen mw1089 [23:45:06] Logged the message, Master [23:45:24] RECOVERY - Host srv281 is UP: PING OK - Packet loss = 0%, RTA = 26.86 ms [23:47:22] !log powercycling frozen mw1173 [23:47:27] Logged the message, Master [23:47:34] RECOVERY - Host mw1089 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [23:48:24] PROBLEM - Apache HTTP on srv281 is CRITICAL: Connection refused [23:48:34] RECOVERY - Puppet freshness on mw1089 is OK: puppet ran at Tue Aug 20 23:48:28 UTC 2013 [23:49:34] RECOVERY - Host mw1173 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [23:49:34] PROBLEM - Apache HTTP on mw1089 is CRITICAL: Connection refused [23:49:47] ori-l: RoanKattouw , so those 3 that had timeouts were all down and now they are back, so i'm not removing them from dsh groups, the two mw hosts in eqiad were and are enabled, the srv281 is not because it's been flaky before [23:49:53] OK [23:50:07] afraid might have to sync one more time for those 2 [23:51:10] srv281 should die.. creating ticket [23:51:34] RECOVERY - Apache HTTP on mw1089 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.044 second response time [23:52:03] mutante: If you powercycled them, it should be fine [23:52:11] There is a script that syncs during Apache startup [23:52:17] unless you sync-dir'ed while they were down [23:52:24] PROBLEM - Apache HTTP on mw1173 is CRITICAL: Connection refused [23:52:25] ah, true , sure [23:52:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:52:36] which will fix those Apache monitoring reports in a few .. 
[23:52:41] that's why they start later [23:53:24] RECOVERY - Apache HTTP on mw1173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [23:54:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [23:56:31] !log catrope Started syncing Wikimedia installation... : Updating VisualEditor to master [23:56:36] Logged the message, Master