[00:00:42] (03PS1) 10Ori.livneh: profiler-to-carbon: read entire socket buffer in one go [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101146 [00:01:29] (03CR) 10Ori.livneh: [C: 032 V: 032] profiler-to-carbon: read entire socket buffer in one go [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101146 (owner: 10Ori.livneh) [00:01:38] ori-l: Hold on [00:01:43] I'm very slightly going over my deploy window [00:02:03] RoanKattouw: it's not something that would affect / conflict with your deployment [00:02:33] that is, i'm not syncing / scapping anything [00:02:38] Oh OK [00:02:40] Carry on :) [00:03:05] mwalker: You actually signed up for the LD, please hold while I wrap up the preceding window [00:03:25] !log catrope updated /a/common/php-1.23wmf7 to {{Gerrit|If7c3c52ed}}: Update EducationProgram to wmf7 branch for cherry-picks [00:03:26] RoanKattouw: I'm waiting for jenkins to merge my commits anyways [00:03:34] could take a couple more hours [00:03:42] Logged the message, Master [00:03:52] heh [00:04:05] !log catrope synchronized php-1.23wmf7/extensions/EducationProgram 'Cherry-pick to fix ContextSource compatibility' [00:04:16] mwalker: OK I'm done, it's all yours [00:04:18] Logged the message, Master [00:04:28] AndyRussG: There's your deployment ---^^ [00:04:49] \o/ [00:05:15] Thanks a ton [00:06:49] RoanKattouw: thanks for taking care of the EduProgram issue [00:11:29] !log mwalker synchronized php-1.23wmf6/extensions/Collection/ 'Reverting to known good condition for collection extension' [00:11:46] Logged the message, Master [00:12:00] !log mwalker synchronized php-1.23wmf7/extensions/Collection/ 'Reverting to known good condition for collection extension' [00:12:15] Logged the message, Master [00:13:25] mwalker: How we doin' here? [00:13:37] marktraceur: looks like I'm stable [00:13:39] so; go for it! [00:13:40] 'kay [00:18:40] !log mholmquist synchronized php-1.23wmf6/extensions/MultimediaViewer/ 'Fix for another stray MultimediaViewer event' [00:18:51] gotta catch 'em all [00:18:55] Logged the message, Master [00:19:40] Oh, sorry [00:19:44] LIGHTENING DEPLOOOOYYYYYY [00:20:03] marktraceur reads the instructions [00:20:05] !log mholmquist synchronized php-1.23wmf7/extensions/MultimediaViewer/ 'Fix for another stray MultimediaViewer event' [00:20:11] Only halfway through [00:20:19] Logged the message, Master [00:20:25] When I'm confused and have lost half the parts [00:20:39] If there's anyone else they can go, I need to await the RL cache update [00:20:57] nah, you're it [00:21:28] 'kay [00:24:18] Oh, wait. No. Agh. [00:24:43] 8 minutes left, give me one sec. [00:25:09] !log mholmquist synchronized php-1.23wmf6/extensions/MultimediaViewer/ 'Fix for another stray MultimediaViewer event, take 2.' [00:25:24] Logged the message, Master [00:25:58] !log mholmquist synchronized php-1.23wmf7/extensions/MultimediaViewer/ 'Fix for another stray MultimediaViewer event, take 2.' [00:26:14] Logged the message, Master [00:26:16] Ignore me. [00:26:43] OK, that fixed it on mw.org (maybe? probably?), so I'm going to declare actually being done. 
[00:26:55] Yeah, enwiki is fixed too [00:27:14] * marktraceur praises Pikachu, Japanese god of thunder [00:55:23] (03PS1) 10Ori.livneh: Fix typo in profiler-to-carbon and tweak client behavior [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101158 [00:55:33] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix typo in profiler-to-carbon and tweak client behavior [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/101158 (owner: 10Ori.livneh) [01:03:35] (03CR) 10MarkTraceur: "You could reasonably ignore my blind adding of reviewers." [operations/puppet] - 10https://gerrit.wikimedia.org/r/101111 (owner: 10MarkTraceur) [01:10:06] (03CR) 10OliverKeyes: "Is there a rationale for this, or is it just 'this would be fun'? While Stat1 has a public IP we tend not to actively host projects there." [operations/puppet] - 10https://gerrit.wikimedia.org/r/101111 (owner: 10MarkTraceur) [01:10:43] (03PS2) 10MarkTraceur: Add nodejs to stat1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101111 [01:11:00] ori-l: ^^ [01:12:06] (03CR) 10MarkTraceur: "Oliver:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/101111 (owner: 10MarkTraceur) [01:15:08] (03CR) 10OliverKeyes: [C: 031] "Makes sense; +1ing out of principle." [operations/puppet] - 10https://gerrit.wikimedia.org/r/101111 (owner: 10MarkTraceur) [01:16:37] (03CR) 10Ori.livneh: [C: 04-1] "whitespace" [operations/puppet] - 10https://gerrit.wikimedia.org/r/101111 (owner: 10MarkTraceur) [01:16:44] Bah [01:17:08] Fucking whitespace inconsistency [01:17:10] i'd usually fix it myself but i'm a bit busy [01:17:32] (03PS3) 10MarkTraceur: Add nodejs to stat1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101111 [01:17:37] 's okay, I'm fast also [01:18:47] (03CR) 10Ori.livneh: [C: 032 V: 032] "approved by ops, patch looks fine" [operations/puppet] - 10https://gerrit.wikimedia.org/r/101111 (owner: 10MarkTraceur) [01:19:04] Shoop da whoop [01:19:32] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 301 seconds [01:19:33] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 305 seconds [01:19:38] * marktraceur doesn't really know what the deploy process is like for changes to puppet [01:20:48] marktraceur: i deployed it [01:21:14] the process is a bit elaborate, but it's documented well on wikitech [01:21:16] if you're curious [01:28:22] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 301 seconds [01:33:23] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 311 seconds [01:33:33] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 316 seconds [01:35:09] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 306 seconds [01:35:19] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 314 seconds [01:37:12] (03PS1) 10Gage: JG: add gage to icinga contactgroups & cgi authorization [operations/puppet] - 10https://gerrit.wikimedia.org/r/101171 [01:42:09] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay -0 seconds [01:42:19] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds [01:42:19] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [01:42:39] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay -1 seconds [01:51:29] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 1 seconds [01:51:39] RECOVERY - MySQL Slave Delay on db66 is OK: OK 
replication delay 0 seconds [01:53:31] (03CR) 10Dzahn: [C: 031] JG: add gage to icinga contactgroups & cgi authorization [operations/puppet] - 10https://gerrit.wikimedia.org/r/101171 (owner: 10Gage) [01:54:21] jgage: want me to merge or try yourself [01:55:29] looks good, i see you got the contact in private file [01:56:51] (03CR) 10Dzahn: "RT #6495" [operations/puppet] - 10https://gerrit.wikimedia.org/r/101171 (owner: 10Gage) [02:03:20] (03PS1) 10Springle: set db1017 as s5 analytics slave [operations/dns] - 10https://gerrit.wikimedia.org/r/101173 [02:05:31] (03CR) 10Springle: [C: 032] set db1017 as s5 analytics slave [operations/dns] - 10https://gerrit.wikimedia.org/r/101173 (owner: 10Springle) [02:05:46] (03CR) 10Dzahn: [C: 032] JG: add gage to icinga contactgroups & cgi authorization [operations/puppet] - 10https://gerrit.wikimedia.org/r/101171 (owner: 10Gage) [02:21:05] !log LocalisationUpdate completed (1.23wmf6) at Fri Dec 13 02:21:05 UTC 2013 [02:21:23] Logged the message, Master [02:41:23] !log LocalisationUpdate completed (1.23wmf7) at Fri Dec 13 02:41:23 UTC 2013 [02:41:40] Logged the message, Master [02:43:05] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [02:48:41] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Dec 13 02:48:41 UTC 2013 [02:48:55] Logged the message, Master [02:50:52] !log ongoing schema changes on slaves, indexing only, logging gerrit 85508, wb_terms gerrit 99660 [02:51:08] Logged the message, Master [03:01:45] (03PS1) 10Dzahn: install various Perl modules needed by Bugzilla [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 [03:09:13] (03PS2) 10Dzahn: install various Perl modules needed by Bugzilla [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 [03:13:10] (03PS3) 10Dzahn: install various Perl modules needed by Bugzilla [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 [03:20:06] PROBLEM - Puppet freshness on tungsten is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 12:18:59 AM UTC [03:31:56] PROBLEM - Host elastic1007 is DOWN: PING CRITICAL - Packet loss = 100% [03:34:16] RECOVERY - Host elastic1007 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [03:35:13] PROBLEM - NTP on elastic1007 is CRITICAL: NTP CRITICAL: Offset unknown [03:38:13] RECOVERY - NTP on elastic1007 is OK: NTP OK: Offset -0.0009071826935 secs [03:39:04] (03PS4) 10Dzahn: install various Perl modules needed by Bugzilla [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 [03:42:20] (03CR) 10Dzahn: [C: 031] "PS4: use libemail-sender-perl instead of libemail-send-perl" [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 (owner: 10Dzahn) [04:14:45] ./msg memoserv send apergos friendly memo for tomorrow because you are on duty: !change 101174 would be nice if you find the time :) tx :) [04:21:14] !memo is if you want to leave a memo for somebody on IRC to read later when they come online, try /query memoserv and type help in that new window, or /msg memoserv help to see the commands [04:21:15] Key was added [04:39:12] (03PS2) 10Tholam: Update favicon wiktionary/si.ico [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100949 [05:09:57] (03PS3) 10Tholam: Update favicon wiktionary/si.ico [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100949 [05:10:46] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100% [05:11:26] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [05:15:15] (03PS4) 10Tholam: Update 
favicon wiktionary/si.ico [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100949 [06:20:42] PROBLEM - Puppet freshness on tungsten is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 12:18:59 AM UTC [06:55:52] RECOVERY - Puppet freshness on tungsten is OK: puppet ran at Fri Dec 13 06:55:46 UTC 2013 [07:32:01] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [08:08:06] Good morning. Repeating a question from #wm-tech here as more appropriate channel, and pinging apergos. Question for you all. bits.wikimedia.org seems to be rejecting ICMPv6 packets, specifically Packet Too Large for PMTUD. I'm on an IPv6 tunnel with MTU of 1280, and connections to bits.wikimedia.org often hang for 20+ seconds after handshake. Who should I poke re. investigating whether bits rejects such packets? [08:08:45] morning [08:09:24] we will want mark or para void later in the day [08:10:08] I can try to look at it earlier but even if I am able to verify this I wouldn't have the knowledge to fix it [08:10:11] kenneaal: [08:11:02] * apergos adds it to today's list [08:11:44] Thanks for ping. That would be neat. I'll be around most of the day, and available for assisting in testing that. I also have a public RIPE Atlas probe on the network in question, connected through the same IPv6 tunnel, so it would be possible for WM personnel to test the case directly. Can also get a linux VM going with the same connectivity. [08:12:23] that's great, it would be very helpful. [08:13:17] Hehe. Well, vested self-interest in not having to hit reload every hour or so when bits.wm.org forgets my MTU. :P [08:14:19] I wouldn't be surprised if the server suppresses ICMPv6, but other servers (en.wikipedia.org, etc) seem to handle PMTUD correctly. So perhaps there is inconsistent configuration at work. [08:18:10] what ip address does bits resolve to for you? [08:28:21] (03CR) 10Alexandros Kosiaris: "I you do decide to go through all that, it would be best to also create a role class that has all the monitoring, firewall , backup, syste" [operations/puppet] - 10https://gerrit.wikimedia.org/r/100760 (owner: 10Matanya) [08:29:24] Sorry, please ping if speaking to me, I have a metric ton of windows open at the moment. Resolves to bits-lb.esams.wikimedia.org (2620:0:862:ed1a::a) [08:30:45] apergos: [08:30:48] yep [08:31:01] I'll ping if I need instant answer, no worries :-) [08:31:20] Paste of traceroute6: http://pste.me/35ulx/ [08:32:01] great, thank you [08:33:29] Also, given that it responds to ICMPv6 ping, PMTUD not happening properly makes even less sense. 
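The symptom kenneaal describes above (TCP handshake succeeds, then the transfer stalls for ~20 seconds) is the classic PMTUD black hole: the far end keeps sending segments larger than the 1280-byte tunnel MTU and never reacts to the ICMPv6 Packet Too Big coming back. A rough client-side way to demonstrate it from the affected host, sketched with placeholder URLs (any response from bits large enough to span several full-size segments will do; these are not specific known-good URLs):

    # Path MTU as seen from the client; should report pmtu 1280 via the tunnel
    tracepath6 bits.wikimedia.org

    # Repeated fetches; the black hole shows up as occasional ~20 s outliers on
    # bits while a comparison host that honours PMTUD (per the report above,
    # en.wikipedia.org) stays fast. URL_BITS and URL_TEXT are placeholders.
    for i in 1 2 3 4 5; do
      curl -6 -s -o /dev/null -w "bits %{time_total}s\n" "$URL_BITS"
      curl -6 -s -o /dev/null -w "text %{time_total}s\n" "$URL_TEXT"
    done

Small responses that fit in a single sub-1280-byte segment never trigger the problem, which is why plain ping keeps working even while page loads hang.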
[08:34:07] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [08:52:29] (03PS4) 10Alexandros Kosiaris: Modularize misc::install-server classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/89687 [08:53:26] (03PS1) 10ArielGlenn: access to pdf1,2,3 for mwalker, rt #6468 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101181 [08:53:57] (03CR) 10Alexandros Kosiaris: [C: 032] Modularize misc::install-server classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/89687 (owner: 10Alexandros Kosiaris) [08:54:17] (03PS2) 10ArielGlenn: access to pdf1,2,3 for mwalker, rt #6468 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101181 [08:55:25] (03CR) 10ArielGlenn: [C: 032] access to pdf1,2,3 for mwalker, rt #6468 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101181 (owner: 10ArielGlenn) [08:55:54] apergos: please don't merge yet on palladium [08:56:00] uh huh [08:56:09] I was just seeing this huge pile of changes [09:04:13] (03PS1) 10Alexandros Kosiaris: Migrate to new install-server module [operations/puppet] - 10https://gerrit.wikimedia.org/r/101182 [09:06:16] (03CR) 10Alexandros Kosiaris: [C: 032] Migrate to new install-server module [operations/puppet] - 10https://gerrit.wikimedia.org/r/101182 (owner: 10Alexandros Kosiaris) [09:07:43] apergos: ok I merged [09:07:48] thank you [09:25:08] PROBLEM - NTP on bast4001 is CRITICAL: NTP CRITICAL: No response from NTP server [09:27:48] PROBLEM - NTP on hooft is CRITICAL: NTP CRITICAL: No response from NTP server [09:29:14] * kenneaal checks. It's actually a two component polymer resin that hardens when the joint is crimped. So over 20kV/mm resistance. Should be fine. [09:29:21] Er... Totally not the right window. [09:31:03] :-D [09:31:08] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [09:54:13] (03PS1) 10Alexandros Kosiaris: Adding install-server::ubuntu-mirror class [operations/puppet] - 10https://gerrit.wikimedia.org/r/101187 [09:57:31] (03CR) 10Alexandros Kosiaris: [C: 032] Adding install-server::ubuntu-mirror class [operations/puppet] - 10https://gerrit.wikimedia.org/r/101187 (owner: 10Alexandros Kosiaris) [09:58:30] !log disabling puppet on carbon for a short time (debugging the new install-server module) [09:58:46] Logged the message, Master [10:01:05] PROBLEM - Squid on brewster is CRITICAL: Connection timed out [10:01:15] PROBLEM - HTTP on brewster is CRITICAL: Connection timed out [10:01:39] I am those [10:04:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [10:08:02] (03CR) 10ArielGlenn: [C: 04-1] "libemail-mime-modifier-perl should not be installed, as it is a virtual package, and puppet never realizes that virtual packages have been" [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 (owner: 10Dzahn) [10:09:57] * apergos looks arund for hashar [10:10:02] apergos: i am there [10:10:05] I am looking at https://rt.wikimedia.org/Ticket/Display.html?id=4824 [10:10:10] talk to me [10:10:25] ah yeah [10:10:36] replied to it in a hurry yesterday evening :/ [10:10:53] I should write down the ferm rules right now and get that fixed for good [10:11:12] do it today and I'll look at it today [10:11:36] the reason that ticket took so long was basically because of Augeas which nobody could help with :( [10:11:52] * hashar dig in ferm [10:13:00] do we have a way to apply a class by default on all instances? Aka the equivalent of production base ? 
:D [10:13:10] PROBLEM - NTP on brewster is CRITICAL: NTP CRITICAL: No response from NTP server [10:13:10] allll? [10:13:18] all instances of a project [10:13:19] I have no idea [10:13:32] well we just write a misc class for now =D [10:15:00] RECOVERY - NTP on brewster is OK: NTP OK: Offset -0.0014783144 secs [10:15:10] RECOVERY - HTTP on brewster is OK: HTTP OK: HTTP/1.1 200 OK - 3089 bytes in 0.187 second response time [10:15:57] akosiaris: oh oops [10:16:02] i just restarted lighttpd and squid on brewster [10:16:31] mark: no worries [10:17:00] RECOVERY - Squid on brewster is OK: TCP OK - 0.035 second response time on port 8080 [10:17:01] heh, I did the same thing, well 1/2 because I saw squid already running... [10:17:14] didn't connect the dots [10:17:30] * apergos goes to make an omlette and reclaim some brain cells from bz perl module dependency hell [10:17:34] back in about 10 mins [10:17:37] hmmm like I never said: (12:01:39 μμ) akosiaris: I am those [10:17:55] I didn't see the backread here, I got a page [10:17:58] I will blame me and will be clearer next time [10:18:25] you were clear [10:18:27] hmmm i never did [10:18:34] i never did get a page i mean [10:18:49] nimsoft alert ubuntu mirror... [10:19:10] and there's the recovery [10:19:33] brb [10:19:36] ok [10:19:42] !g Ie9be31bec57e70fc84bd59a5524e8f848bb61630 [10:19:42] https://gerrit.wikimedia.org/r/#q,Ie9be31bec57e70fc84bd59a5524e8f848bb61630,n,z [10:35:47] * hashar digs in ferm manual [10:36:06] gotta convert: iptables -t nat -I OUTPUT --dest $public_ip -j DNAT --to-dest $private_ip [10:37:50] * apergos sis down to hot omelette and the earlier gerrit changesets [10:38:29] *sitsss [10:38:31] grrr [10:41:38] (03PS1) 10Alexandros Kosiaris: Fix ferm rules for install-server, haproxy, backup [operations/puppet] - 10https://gerrit.wikimedia.org/r/101191 [10:43:10] (03CR) 10Alexandros Kosiaris: [C: 032] Fix ferm rules for install-server, haproxy, backup [operations/puppet] - 10https://gerrit.wikimedia.org/r/101191 (owner: 10Alexandros Kosiaris) [10:45:57] !g I0b02a46f350a99e2e0d29a2da72d6ef6932c8c22 [10:45:57] https://gerrit.wikimedia.org/r/#q,I0b02a46f350a99e2e0d29a2da72d6ef6932c8c22,n,z [10:46:36] (03PS1) 10Hashar: beta: public IP rewriting using DNAT [operations/puppet] - 10https://gerrit.wikimedia.org/r/101192 [10:48:30] hashar: its daddr, not dest [10:49:36] akosiaris: thx :-) [10:50:08] /etc/init.d/ferm start [10:50:08] * Starting Firewall ferm iptables-restore v1.4.4: invalid mask `46' specified [10:50:12] damn... [10:50:21] using my patch ? [10:50:21] it thinks IPv6 is IPv4.... :-( [10:50:23] nope [10:50:25] mine [10:50:25] ah [10:50:35] was wondering how you managed to get my patch tested so fast :-D [10:52:17] (03PS2) 10Hashar: beta: public IP rewriting using DNAT [operations/puppet] - 10https://gerrit.wikimedia.org/r/101192 [10:54:51] * hashar watches puppet running while listening to loud techno music [10:55:13] aaah it's ferm 2.1.2 ? [10:55:15] hmmmm [10:55:45] hey Could not stop Service[ferm]: :D [10:56:20] Ubuntu 10.04.4 LTS :-(... snif... [10:56:33] 10.04.4?? ouch [10:56:44] that is brewster isn't it ? [10:56:50] and carbon [10:57:02] ah I thought ori phased out that machine [10:57:28] he did a bunch of work to migrate graphite/gdash.. 
to eqiad [10:57:58] RECOVERY - NTP on bast4001 is OK: NTP OK: Offset -0.001607775688 secs [10:58:51] carbon is in eqiad [10:59:11] elements [10:59:36] I say next time we go for exoplanets :P [10:59:36] (03PS3) 10Hashar: beta: public IP rewriting using DNAT [operations/puppet] - 10https://gerrit.wikimedia.org/r/101192 [11:00:21] https://wikitech.wikimedia.org/wiki/Talk:Server_naming_conventions [11:00:22] (03PS4) 10Hashar: beta: public IP rewriting using DNAT [operations/puppet] - 10https://gerrit.wikimedia.org/r/101192 [11:00:32] put your edits where your mouth is [11:00:38] RECOVERY - NTP on hooft is OK: NTP OK: Offset 7.164478302e-05 secs [11:02:18] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [11:02:45] apergos: Ι 'll start by upload this http://dilbert.com/dyn/str_strip/000000000/00000000/0000000/100000/20000/4000/400/124456/124456.strip.gif [11:03:06] perfect [11:05:42] :D [11:06:55] yeah iptables rules!!! http://paste.debian.net/70669/ [11:07:29] your commit message has 208.80.153.243 twice... might wanna fix that [11:07:49] yeah will [11:07:56] got to add some patch from labs [11:10:58] (03PS5) 10Hashar: beta: public IP rewriting using DNAT [operations/puppet] - 10https://gerrit.wikimedia.org/r/101192 [11:13:47] (03PS1) 10Alexandros Kosiaris: Add linux-host-entries.ttyS1-9600 empty file [operations/puppet] - 10https://gerrit.wikimedia.org/r/101194 [11:15:35] (03CR) 10Hashar: [V: 031] "Patchset 5 tested on deployment-staging-cache-mobile02:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/101192 (owner: 10Hashar) [11:16:02] apergos: patch works for me on deployment-staging-cache-mobile02 (a puppet self instance) [11:16:18] I have commented on the patch listing a ferm rule config file and the result of iptables --list -t nat [11:16:31] great [11:16:42] I can't believe it only took me an hour to configure the rules, thanks to ferm! [11:17:18] ah ubuntu next LTS is in April 2013. [11:17:20] 2014 [11:17:24] going to be fun times [11:17:38] ugh [11:17:58] contint boxes are already volunteering for migration [11:18:02] though gallium will probably have to be reinstalled from scratch :D [11:18:59] and we still apparently have some 8.04 boxes:P [11:20:15] is that hardy ? [11:20:56] (03CR) 10Alexandros Kosiaris: [C: 032] Add linux-host-entries.ttyS1-9600 empty file [operations/puppet] - 10https://gerrit.wikimedia.org/r/101194 (owner: 10Alexandros Kosiaris) [11:20:59] yes [11:21:33] yep: https://en.wikipedia.org/wiki/Ubuntu_releases [11:22:12] "trusty" [11:23:02] apergos: can we proceed with https://gerrit.wikimedia.org/r/101192 ? [11:23:10] I am a little confused about how you can possibly get the right rules out... looking at the ferm::rule class, it has default table and chain, how do you get away with setting them in the rule text? [11:23:23] passing parameters to ferm::rule [11:23:35] maybe I have sent the wrong patch in gerrit ghmhm [11:23:55] https://gerrit.wikimedia.org/r/#/c/101192/5/manifests/misc/beta.pp,unified line 88-89 [11:24:00] table => 'nat', [11:24:05] chain => 'OUTPUT' [11:24:13] ohh that's better [11:24:19] maybe I was looking at an earlier version [11:24:23] yeah made that with patchset 5 [11:24:24] sorry :( [11:24:30] yep ps4 [11:25:22] ok lemme stare at it for a couple more minutes given it's the new patchset [11:25:25] sorry... 
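For reference, the rewrite hashar is porting above lands in the nat table's OUTPUT chain, which is why the ferm::rule in the patch has to pass table and chain explicitly and match on daddr rather than iptables' --dest. A hand-run equivalent of what the rule generates, with 208.80.153.243 taken from the commit-message discussion above and 10.4.0.50 as a purely illustrative private address:

    # Rewrite outbound traffic aimed at the public IP to the instance's private IP
    iptables -t nat -I OUTPUT --destination 208.80.153.243 \
        -j DNAT --to-destination 10.4.0.50

    # Inspect what ferm actually generated, same as the test noted on the patch
    iptables --list -t nat -n

(In ferm syntax the destination match is the daddr keyword, as akosiaris points out above.)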
[11:25:45] tis ok [11:25:58] the patch is applied on deployment-staging-cache-mobile02.pmtpa.wmflabs [11:26:11] if you want to play with it (like deleting all ferm rules, running puppet or looking at the generated conf [11:26:30] a potential issue we have with ferm::rule is that it does not purge the /etc/ferm/conf.d files :( [11:26:38] nothing we can really do about I am afraid [11:28:31] ughhh [11:29:10] you can pass ensure = absent to the rule [11:29:34] but if you are worried about someone else remembering that, you could add a comment to the beta stanza [11:29:35] your call [11:30:06] well puppet file {} as a purge parameter which would delete any file not managed by puppet [11:30:18] it does purge [11:30:18] no clue how it is going to work with multiple file{} statements in the same dir [11:30:28] file { '/etc/ferm/conf.d' : [11:30:28] ensure => directory, [11:30:28] owner => root, [11:30:28] group => adm, [11:30:28] mode => '0500', [11:30:29] recurse => true, [11:30:29] purge => true, [11:30:30] require => Package['ferm'], [11:30:30] notify => Service['ferm'], [11:30:30] potentially it might only keep the very last file {} :/ [11:30:31] } [11:30:37] ahhh [11:30:37] recurse = > true, purge => true [11:30:41] at the dir level nice [11:30:43] conf.d will be purge on every run [11:30:48] purged* [11:30:50] \o/ [11:30:53] noisy but nonetheless [11:31:03] noisy ? [11:31:08] so I guess it is work for me :-D [11:31:15] well we'll see the recreations on every run I suppose [11:31:19] nope [11:31:27] ? [11:31:30] only additions and deletions [11:32:01] files created in a purged directory still exist in the catalog [11:32:14] so puppet does not touch them if the exist in the system [11:32:27] oh ho [11:32:32] it will only purge files not existing in the catalog [11:32:52] amazing.. puppet doing something well :-D [11:46:15] hey [11:46:26] is there anyone i can ask my stupid db questions? ;) [11:47:32] ah ok...not stupid...dbtree also states db73 has replication lag [11:50:48] yes it does show that [11:51:35] (03CR) 10Aude: [C: 031] "looks good to me :) thanks hashar!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/101192 (owner: 10Hashar) [11:51:36] we have an open ticket, [11:51:40] the info on the ticket says that [11:51:51] "WikiExporter dumps from searchidx* (pmtpa lucene I guess) combined with wikidata write activity and long-running research queries... make for an unhappy slave." [11:51:55] (quoting sprin gle) [11:54:48] (03PS1) 10ArielGlenn: fix typo in one of the bastion host ips for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/101199 [11:55:08] nosy1: are you affected by that? [11:56:01] (03CR) 10ArielGlenn: [C: 032] fix typo in one of the bastion host ips for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/101199 (owner: 10ArielGlenn) [11:56:16] (03PS1) 10Alexandros Kosiaris: tftp is udp not tcp [operations/puppet] - 10https://gerrit.wikimedia.org/r/101200 [11:58:57] hello [11:59:07] hey [11:59:15] (03CR) 10Alexandros Kosiaris: [C: 032] tftp is udp not tcp [operations/puppet] - 10https://gerrit.wikimedia.org/r/101200 (owner: 10Alexandros Kosiaris) [11:59:21] apergos: yes currently [11:59:35] springle suggested we should use this db host [11:59:43] but i can also switch [11:59:57] apergos: do you know when the job should complete? 
[12:01:36] not a clue [12:01:50] ok...np...ill change the master here [12:01:51] we would have to ask the authors of those queries :-( [12:01:54] ok [12:02:33] paravoid: how difficult was it to backport ferm 2.2 to precise ? any chance we could do it for lucid ? (god i hate myself for asking this) [12:14:03] which lucid hosts are we keeping? :) [12:17:32] akosiaris: reprepro copy [12:17:46] akosiaris: its only dependency is perl [12:17:55] (and iptables) [12:55:21] hey hashar, when you have a moment. i need some help regarding the puppet config for production https://gerrit.wikimedia.org/r/#/c/101058/ [12:55:30] we want to make sure a runner picks up those 3 gwtoolset jobs … aaron helped me out with that config and i just want to verify that it's okay plus ... [12:55:38] how do i do the same for the beta cluster? [12:57:14] and lastly, csteipp would like some ops input on one of our configs https://bugzilla.wikimedia.org/show_bug.cgi?id=58417 [13:10:21] (03PS5) 10Dan-nl: Production configuration for GWToolset [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 [13:14:45] (03PS1) 10Alexandros Kosiaris: dhcp is called bootps in /etc/services [operations/puppet] - 10https://gerrit.wikimedia.org/r/101204 [13:18:10] (03CR) 10Alexandros Kosiaris: [C: 032] dhcp is called bootps in /etc/services [operations/puppet] - 10https://gerrit.wikimedia.org/r/101204 (owner: 10Alexandros Kosiaris) [13:27:16] PROBLEM - MySQL Processlist on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:27:16] PROBLEM - MySQL Processlist on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:16] PROBLEM - Disk space on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:16] PROBLEM - MySQL Processlist on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:16] PROBLEM - MySQL InnoDB on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:16] PROBLEM - MySQL InnoDB on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:16] PROBLEM - mysqld processes on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:17] PROBLEM - MySQL InnoDB on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:26] PROBLEM - MySQL Processlist on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:26] PROBLEM - RAID on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:36] PROBLEM - MySQL Recent Restart on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:46] PROBLEM - Disk space on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:47] PROBLEM - RAID on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:50] hmm [13:28:56] PROBLEM - mysqld processes on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:06] RECOVERY - Disk space on es1002 is OK: DISK OK [13:29:06] RECOVERY - MySQL Processlist on es1002 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 14 statistics [13:29:06] PROBLEM - mysqld processes on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:06] PROBLEM - MySQL Recent Restart on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:16] PROBLEM - Disk space on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:16] PROBLEM - RAID on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:29:26] PROBLEM - DPKG on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:56] PROBLEM - MySQL InnoDB on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:56] PROBLEM - puppet disabled on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:57] PROBLEM - DPKG on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:57] PROBLEM - RAID on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:06] PROBLEM - MySQL Recent Restart on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:07] traffic spikeon es1 [13:30:16] PROBLEM - MySQL disk space on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:16] PROBLEM - mysqld processes on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:26] PROBLEM - puppet disabled on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:36] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [13:30:36] RECOVERY - RAID on es1002 is OK: OK: optimal, 1 logical, 2 physical [13:30:46] RECOVERY - mysqld processes on es1002 is OK: PROCS OK: 1 process with command name mysqld [13:30:56] PROBLEM - puppet disabled on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:56] PROBLEM - Disk space on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:56] RECOVERY - MySQL Recent Restart on es1003 is OK: OK seconds since restart [13:30:56] RECOVERY - mysqld processes on es1003 is OK: PROCS OK: 1 process with command name mysqld [13:31:06] RECOVERY - Disk space on es1003 is OK: DISK OK [13:31:06] RECOVERY - RAID on es1003 is OK: OK: optimal, 1 logical, 2 physical [13:31:16] PROBLEM - MySQL disk space on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:31:23] !log reduce max_connections on es100[1-4]. try to survive sudden spike [13:31:36] RECOVERY - Disk space on es1001 is OK: DISK OK [13:31:40] Logged the message, Master [13:31:46] RECOVERY - puppet disabled on es1001 is OK: OK [13:31:46] RECOVERY - DPKG on es1001 is OK: All packages OK [13:31:47] RECOVERY - RAID on es1001 is OK: OK: optimal, 1 logical, 2 physical [13:31:56] RECOVERY - MySQL Recent Restart on es1001 is OK: OK seconds since restart [13:32:06] RECOVERY - MySQL disk space on es1001 is OK: DISK OK [13:32:06] RECOVERY - mysqld processes on es1001 is OK: PROCS OK: 1 process with command name mysqld [13:32:16] PROBLEM - Disk space on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:32:16] PROBLEM - MySQL Processlist on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:32:16] PROBLEM - MySQL InnoDB on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:32:46] RECOVERY - puppet disabled on es1004 is OK: OK [13:32:46] RECOVERY - Disk space on es1004 is OK: DISK OK [13:32:56] PROBLEM - MySQL InnoDB on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:33:06] RECOVERY - MySQL disk space on es1004 is OK: DISK OK [13:33:06] RECOVERY - mysqld processes on es1004 is OK: PROCS OK: 1 process with command name mysqld [13:33:16] RECOVERY - RAID on es1004 is OK: OK: optimal, 1 logical, 2 physical [13:33:16] RECOVERY - DPKG on es1004 is OK: All packages OK [13:33:16] RECOVERY - puppet disabled on es1003 is OK: OK [13:33:16] PROBLEM - MySQL Processlist on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:33:26] RECOVERY - MySQL Recent Restart on es1004 is OK: OK seconds since restart [13:33:47] PROBLEM - RAID on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:33:56] PROBLEM - mysqld processes on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:33:56] PROBLEM - MySQL disk space on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:33:56] PROBLEM - MySQL Recent Restart on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:33:56] PROBLEM - puppet disabled on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:33:56] PROBLEM - DPKG on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:34:04] (03PS6) 10Hashar: beta: public IP rewriting using DNAT [operations/puppet] - 10https://gerrit.wikimedia.org/r/101192 [13:34:16] PROBLEM - MySQL InnoDB on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:34:26] PROBLEM - MySQL Processlist on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:34:46] PROBLEM - Disk space on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:04] PROBLEM - puppet disabled on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:04] PROBLEM - DPKG on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:04] PROBLEM - RAID on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:04] PROBLEM - MySQL Recent Restart on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:14] PROBLEM - MySQL disk space on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:24] PROBLEM - MySQL InnoDB on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:24] PROBLEM - MySQL Processlist on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:24] PROBLEM - mysqld processes on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:36:04] RECOVERY - MySQL disk space on es1001 is OK: DISK OK [13:36:06] !log killing queries on es100[1-4] [13:36:14] RECOVERY - mysqld processes on es1001 is OK: PROCS OK: 1 process with command name mysqld [13:36:14] RECOVERY - Disk space on es1002 is OK: DISK OK [13:36:14] RECOVERY - MySQL disk space on es1002 is OK: DISK OK [13:36:22] Logged the message, Master [13:36:24] PROBLEM - MySQL InnoDB on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:36:24] RECOVERY - DPKG on es1001 is OK: All packages OK [13:36:24] PROBLEM - MySQL Processlist on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:36:34] RECOVERY - Disk space on es1001 is OK: DISK OK [13:36:34] RECOVERY - MySQL Recent Restart on es1001 is OK: OK seconds since restart [13:36:40] dan-nl: hello :-) [13:36:48] dan-nl: so jobs-loops.sh.erb is hmm [13:36:50] HORRIBLE [13:37:24] PROBLEM - MySQL Processlist on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:24] PROBLEM - MySQL InnoDB on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:24] PROBLEM - MySQL Processlist on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:24] PROBLEM - MySQL Processlist on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:24] PROBLEM - MySQL InnoDB on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:44] PROBLEM - MySQL InnoDB on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:37:46] hashar: ja, don't understand it [13:37:53] huh [13:37:57] the "Internal error" [13:38:03] the "Internal error" pages doesn't expand {{SITENAME}}. [13:38:06] page* [13:38:14] PROBLEM - mysqld processes on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:14] PROBLEM - MySQL Recent Restart on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:14] PROBLEM - Disk space on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:14] PROBLEM - RAID on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:14] PROBLEM - Disk space on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:14] PROBLEM - MySQL disk space on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:15] (03CR) 10Hashar: [C: 031] "sounds good to me." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/101058 (owner: 10Dan-nl) [13:39:24] PROBLEM - MySQL Processlist on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:24] PROBLEM - MySQL InnoDB on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:24] PROBLEM - mysqld processes on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:24] PROBLEM - MySQL Processlist on es1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:24] PROBLEM - MySQL disk space on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:31] dan-nl: your change looks fine anyway [13:39:34] PROBLEM - DPKG on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:35] aaron said that we should place each GWT job with $wgJobTypesExcludedFromDefaultQueue in CommonSettings.php and because of that we need to add a runner in that puppet file [13:39:44] PROBLEM - Disk space on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:44] PROBLEM - MySQL Recent Restart on es1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:13] (03CR) 10ArielGlenn: [C: 032] beta: public IP rewriting using DNAT [operations/puppet] - 10https://gerrit.wikimedia.org/r/101192 (owner: 10Hashar) [13:40:24] RECOVERY - DPKG on es1001 is OK: All packages OK [13:40:24] RECOVERY - mysqld processes on es1002 is OK: PROCS OK: 1 process with command name mysqld [13:40:34] RECOVERY - puppet disabled on es1002 is OK: OK [13:40:34] RECOVERY - Disk space on es1001 is OK: DISK OK [13:40:34] RECOVERY - MySQL Recent Restart on es1001 is OK: OK seconds since restart [13:40:44] RECOVERY - RAID on es1002 is OK: OK: optimal, 1 logical, 2 physical [13:40:44] RECOVERY - RAID on es1001 is OK: OK: optimal, 1 logical, 2 physical [13:40:44] RECOVERY - puppet disabled on es1001 is OK: OK [13:40:45] RECOVERY - MySQL Recent Restart on es1002 is OK: OK seconds since restart [13:40:54] RECOVERY - DPKG on es1002 is OK: All packages OK [13:41:04] RECOVERY - MySQL Recent Restart on es1003 is OK: OK seconds since restart [13:41:04] RECOVERY - mysqld processes on es1003 is OK: PROCS OK: 1 process with command name mysqld [13:41:04] RECOVERY - Disk space on es1003 is OK: DISK OK [13:41:04] RECOVERY - MySQL disk space on es1001 is OK: DISK OK [13:41:04] RECOVERY - RAID on es1003 is OK: OK: optimal, 1 logical, 2 physical [13:41:14] RECOVERY - mysqld processes on es1001 is OK: PROCS OK: 1 process with command name mysqld [13:41:14] PROBLEM - MySQL disk space on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:41:24] PROBLEM - mysqld processes on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:04] RECOVERY - Disk space on es1002 is OK: DISK OK [13:42:14] RECOVERY - MySQL disk space on es1004 is OK: DISK OK [13:42:14] RECOVERY - mysqld processes on es1004 is OK: PROCS OK: 1 process with command name mysqld [13:42:14] RECOVERY - MySQL disk space on es1002 is OK: DISK OK [13:42:51] hashar: so i wanted to understand that setting a bit more and make sure it's correct … sounds like it's correct … [13:42:54] PROBLEM - puppet disabled on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:43:24] PROBLEM - MySQL Processlist on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:43:24] PROBLEM - MySQL InnoDB on es1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:43:24] PROBLEM - MySQL Processlist on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:43:24] PROBLEM - MySQL InnoDB on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:43:25] hashar: do you need to deal with these icinga-wm messages? [13:43:33] dan-nl: yeah so you basically pass to that shell function a bunch of space separated jobs [13:43:35] if so i can explain later [13:43:44] RECOVERY - puppet disabled on es1004 is OK: OK [13:43:51] they ends up being passed to runJobs.php --types="jobtype1 jobtype2 ..." [13:44:04] I am not part of ops [13:44:04] hashar: so understanding it better … it creates one job runner or many? [13:44:17] I guess es10** boxes are elastic search so that would be manybubbles|away :-D [13:44:18] oh, i thought you were … sorry [13:44:33] I'm not away right now [13:44:37] was looking at stuff. [13:44:39] what is up? [13:44:43] there is a bunch of es1004 es1002 spam above [13:44:54] es1003 and es1001 as well [13:44:59] es is external storage, I believe [13:45:01] assuming they are elastic search box aren't they ? [13:45:04] elastic10XX is elasticsearch [13:45:06] oh my [13:45:12] yeah [13:45:15] unfortunate [13:45:30] these are es, sprin gle is looking at them [13:45:34] RECOVERY - MySQL InnoDB on es1001 is OK: OK longest blocking idle transaction sleeps for 0 seconds [13:45:36] lets rename elastic search to Recherche Elastic and get the box named RE :D [13:45:42] er ext storage I mean [13:46:14] RECOVERY - MySQL Processlist on es1001 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics [13:46:14] RECOVERY - MySQL InnoDB on es1002 is OK: OK longest blocking idle transaction sleeps for 0 seconds [13:46:15] RECOVERY - MySQL Processlist on es1004 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics [13:46:15] RECOVERY - MySQL InnoDB on es1003 is OK: OK longest blocking idle transaction sleeps for 0 seconds [13:46:15] RECOVERY - MySQL Processlist on es1002 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics [13:46:15] RECOVERY - MySQL InnoDB on es1004 is OK: OK longest blocking idle transaction sleeps for 0 seconds [13:46:15] RECOVERY - MySQL Processlist on es1003 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics [13:46:45] aaand not working... [13:46:45] hashar: in regards to that puppet config … does that create one job runner or is it a config for any job runner in production? [13:46:46] giant spike for text caches in the last little while [13:46:51] Uh oh, what just broke? 
Getting error pages [13:47:23] dan-nl: the shell script is used on all job runners [13:47:37] anomie: springl e is on it, i think [13:47:49] dan-nl: in labs that is jobrunner008 instance which run most of the jobs and the videoscaler005 instance (which only run video scaling, aka TimedMediahandler extension jobs) [13:48:10] hashar is there a similar puppet config for the beta cluster? [13:48:13] Now working again here [13:48:27] dan-nl: yeah beta use the same configuration as production. [13:48:57] oh, okay, so as soon as that gerrit commit is made it will apply to both environments? [13:49:10] "If your wiki is testing the new search tool ("[[mw:Search|CirrusSearch]]"), you can now test it by adding "New search" in your [[$prefbeta|Beta features preferences]]." [13:49:19] will it matter if a job is in $wgJobTypesExcludedFromDefaultQueue or not? [13:49:21] can someone tell me one wiki where that is happening? [13:52:36] hashar: is it okay to merge that puppet config change now or does GWToolset need to be on production first? [13:52:39] Nikerabbit: [[mw:Search]] [13:53:02] Nikerabbit: it is on by default in mediawiki.org. enwikisource has it as an option [13:53:53] dan-nl: nextJobs.php would look for a gwtoolset job by querying the job queue for the job type [13:54:09] dan-nl: since there are no jobs, I guess it will yield 0 jobs and keep proceeding [13:54:21] but not sure, gotta look at the runJobs.php code [13:54:59] manybubbles: I was trying to check the translation of "New search", but could not find it present on any wiki I tried randomly [13:55:13] hashar cool, good to know… so the only reason we need to add that config is because we're placing the GWToolset jobs in the $wgJobTypesExcludedFromDefaultQueue array? otherwise we wouldn't need that config change? [13:55:18] itwiki should have it [13:55:25] wait, sorry, no [13:55:32] it has it as a beta feature [13:55:35] if that is what you need [13:56:04] Nikerabbit: There it is: https://it.wikipedia.org/wiki/Speciale:Preferenze#mw-prefsection-betafeatures [13:56:55] manybubbles: yep, thanks [13:57:14] manybubbles: added notes for future translators [13:57:24] Nikerabbit: Thanks! [13:59:25] anyone on the ops team, yesterday csteipp mentioned a concern in regards to a setting within GWToolset. it's a throttle for how many media file jobs are placed in the job queue each minute. he wanted someone on the ops team to review the concept and decide what the throttle should be set at in WMF for the Commons server … is there anyone online who could help me with this? [14:02:15] hashar: do you know if there's a way to limit the nr of gwtoolset jobs run at once? for example, could a job runner check to see how many gwtoolsetMediafileJobs are already running and not pick up another if a threshold was set? [14:06:37] !log just ran aptitude remove apache2-mpm-prefork apache2-utils apache2.2-bin apache2.2-common libapache2-mod-php5 on bast1001. Autoremoved texlive-* packages and timidity. All seemed to be installed because of wikimedia-task-appserver (also removed) [14:06:54] Logged the message, Master [14:09:38] (03PS2) 10Alexandros Kosiaris: let bastion hosts have base::firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn) [14:10:11] (03CR) 10jenkins-bot: [V: 04-1] let bastion hosts have base::firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn) [14:11:17] !log applied the new installserver role to brewster, carbon, bast4001 and hooft.
This implies installation of ferm and usage of base::firewall which means default firewall policy will now be DROP. Punched necessary holes for services to work. [14:11:33] Logged the message, Master [14:11:37] Anybody into certificates around? https://links.email.donate.wikimedia.org/ triggers "Invalid cert" warning [14:11:39] https://bugzilla.wikimedia.org/show_bug.cgi?id=58373 [14:12:23] (03PS1) 10Mark Bergsma: Derive the corresponding original URL and store it with the thumb [operations/puppet] - 10https://gerrit.wikimedia.org/r/101207 [14:13:36] anyone know if there's a way to limit the nr of GWToolset jobs run at once? for example, could a job runner check to see how many gwtoolsetMediafileJobs are already running and not pick up another if a threshold was set? [14:17:04] (03PS3) 10Alexandros Kosiaris: let bastion hosts have base::firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn) [14:20:05] (03CR) 10Mark Bergsma: [C: 032] Derive the corresponding original URL and store it with the thumb [operations/puppet] - 10https://gerrit.wikimedia.org/r/101207 (owner: 10Mark Bergsma) [14:20:51] anyone know which ganglia grid shows all of the job runners on production? [14:27:39] I'm not ignoring you dan-nl, just still looking into the es spike [14:31:18] mw1001-1016 for eqiad, these are the ones in operation [14:31:19] http://ganglia.wikimedia.org/latest/?c=Jobrunners%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [14:31:52] if you look at the two jobrunner overviews on the main ganglia page you can see pmtpa is not doing much [14:31:57] dan-nl: ^^ [14:32:41] apergos: np i understand [14:34:00] apergos cool, thanks is there a way for me to monitor the runJobs.log file for all of those runner instances? [14:35:50] so on fluorine [14:36:01] /a/mw-log [14:36:15] we have aggregated logs for a lot of this stuff including runJobs [14:36:53] https://wikitech.wikimedia.org/wiki/Log_files this has a good overview of what's where [14:37:09] thanks! [14:37:12] sure [14:37:26] as far as the job queue question you were asking earlier [14:38:10] yes [14:38:13] the runners generally pick up one type at a time [14:38:22] so 'do priority types til there are none' [14:38:27] 'now work through the rest' [14:38:34] doing a set number at once [14:38:48] hmm I haven't looked at the architecture in awhile but that is how it was [14:39:04] so it would not be a lot of work to add that in for your job type, I believe [14:39:16] aaron used to be the go to person for that [14:39:18] ahh [14:39:27] default policy of ferm is DROP !! :-D [14:39:33] yes [14:39:57] you even pointed that out on an earlier changeset, how the defaultwas accept and it wasn't doing what you needed [14:40:05] yup true [14:40:06] I mean, lnked in the channel earlier today [14:40:07] forgot about it [14:40:09] heh [14:40:15] E_BRAINFULL [14:40:15] so I have applied my DNAT on the beta apches [14:40:16] k, i think chris' concern is that gwtoolset may add 20 + media file jobs at once. if each media file is 1gb and each runner picks up one of those jobs all 16 runners could, in theory be running a job that downloads a 1gb file [14:40:27] and now they are happily rejecting connections to port 80 :-D [14:40:32] ahahahahah [14:40:35] BUT, I can ssh from the bastion [14:40:41] lucky you :-D [14:40:53] I should monkey patch a ferm rule to allow port 80 for apaches :D [14:40:53] ah firewalls [14:40:58] the gift that keeps on giving... 
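To make hashar's earlier description concrete: the runners are essentially a shell loop that keeps handing a space-separated list of job types to MediaWiki's runJobs.php, and their output is aggregated on fluorine under /a/mw-log. A much-simplified sketch of that idea, not the real jobs-loop.sh.erb; the GWToolset job-type names, the wiki name and the limits are illustrative stand-ins:

    #!/bin/bash
    # Hypothetical stand-in for the generated jobs-loop script.
    # Placeholder names for the job types excluded from the default queue above.
    types="gwtoolsetUploadMetadataJob gwtoolsetUploadMediafileJob"

    while true; do
        # Per hashar: the list ends up as runJobs.php --types="jobtype1 jobtype2 ..."
        mwscript runJobs.php --wiki=commonswiki --types="$types" --maxjobs=50
        sleep 5
    done

To watch what the runners are actually doing, the aggregated log apergos points at can be followed on fluorine with, for example, tail -f /a/mw-log/runJobs.log | grep -i gwtoolset.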
[14:41:06] andd notice: /Stage[main]/Misc::Udp2log::Iptables_drops/Iptables_add_service[udp2log_drop_udp]/Iptables_add_rule[udp2log_drop_udp_udp]/Augeas[iptables udp2log_drop_udp_udp source]/returns: executed successfully [14:41:16] deployment-bastio now has both Augeas and Ferm maintained rules [14:42:48] yuck [14:46:40] apergos: csteipp was hoping that someone from ops might comment on the max setting of 20 gwtoolset media file jobs per minute … is that okay for production job runners or not … [14:46:53] I have that bug open [14:47:08] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [14:47:13] the deal is whether they complete [14:47:23] but if we mean 'at most 20 at any time' [14:47:49] apergos: ah, cool [14:48:03] the idea is that each person could create a batch upload job ... [14:48:20] that batch upload job could potentially add 20 media file jobs per minute [14:48:49] if 10 people set-up batch jobs concurrently that could potentially be 200 media file jobs per minute [14:48:58] the thing will be the scalers [14:49:02] uh [14:49:15] there are 16 job runners as i understand it and each would potentially pick up one at a time [14:49:18] again it's a matter of concurrency [14:49:45] right the possibility is there, but there's no way to say for certain that it would happen [14:50:04] that's why i was wondering if we could limit the number of concurrent gwtoolset media file jobs [14:50:19] we have from 70 at lowest to about 110 peak req / sec to the scalers [14:50:33] that's across all 8 boxes, i.e. total [14:50:59] these upload jobs [14:51:11] do they do more than shovel the file into swift? [14:51:19] i.e. do they scale at some resolution(s)? [14:52:43] if the sole thing they do is store the original in the backend, then we need to lok at the swift numbers only [14:53:36] (03PS1) 10Hashar: beta: ferm on appservers must allow port 80 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101209 [14:53:42] (03PS1) 10Hashar: role::parsoid::beta must allow port 8080 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101210 [14:55:15] apergos: they only download the media files from the external server and store them [14:55:22] ok [14:55:30] so swift will be the bottleneck [14:57:23] https://wikitech.wikimedia.org/wiki/PoolCounter might be able to limit the number of simultaneous uploads [14:57:35] but honestly, I am not sure it could nor whether it is wanted [14:57:46] + I have no idea how PoolCounter works [14:57:58] I think the place to do this is in the code with the rest of the job runner logic [14:58:14] as far as how many are run at once [14:58:22] apergos: I monkey patched some rules for beta app servers and parsoid server to respectively allow port 80 and port 8000 (see the two patches above) [14:59:06] k, i'll need to hash it out with aaron then … [14:59:12] thanks for thinking with me ... [14:59:42] I see them hashar and have them open already [15:04:16] and I don't think MediaWiki job system has a way to enforce a limit of # of jobs being run [15:04:31] http://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=swift_PUT_hits&s=by+name&c=Swift+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=swift+backend+eqiad&hide-hf=false&sh=1&z=small&hc=4 this is the swift backend PUT requests [15:05:03] iirc the job runners will check what's already running on that host and not start up more than X new ones (of a type) [15:05:12] that's only per server [15:05:48] right i was hoping we could check across all runners ... 
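The per-host check apergos recalls above amounts to counting how many runners of a given type are already alive on the box before starting another; something in the spirit of the following, which is illustrative only (not the actual jobs-loop code; the process-name pattern, the limit of 3 and the job type are made up) and still gives no cluster-wide cap:

    # Skip this round if 3 gwtoolset runners are already busy on this host.
    if [ "$(pgrep -cf 'runJobs.php.*gwtoolset')" -ge 3 ]; then
        sleep 30
    else
        mwscript runJobs.php --wiki=commonswiki --types="gwtoolsetUploadMediafileJob" --maxjobs=10
    fi

A limit across all sixteen runners would need support at the job-queue level itself, which is exactly the gap being discussed here.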
[15:05:58] afaik there's no way to do that [15:06:04] as hashar says [15:06:25] though pool counter might do it [15:08:17] to do what? [15:08:43] making sure we don't overload swift by having thousands of jobs attempting to write to it [15:09:05] dan-nl is working on an extension that lets volunteers mass import files from museums, libraries etc.. [15:09:46] so there is a possibility someone would attempt to import a million images with all jobs being released at the same time and thus overloading swift [15:09:54] (that is what I understood about the problem) [15:10:06] jobs? what jobs? [15:10:29] the extension lets you submit a batch of files to upload, it then creates MediaWiki async jobs that are inserted in the jobqueue [15:10:35] the jobrunner would then process them [15:11:10] (03CR) 10ArielGlenn: [C: 032] beta: ferm on appservers must allow port 80 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101209 (owner: 10Hashar) [15:11:11] files to upload how? by URL? [15:11:47] yes, gwtoolset will create several media file jobs … i've placed a throttle that lets a user decide between 1 - 20 at a time [15:11:55] then yes it uses uploadbyurl [15:12:28] hashar: https://gerrit.wikimedia.org/r/#/c/101210/1/manifests/role/parsoid.pp has 8000 in the rule and 8080 and 8000 in the commit message, which do you need? [15:12:29] are the uploads paced? [15:13:29] apergos: 8000 (eight zero zero zero) [15:13:53] please fix your commit message then [15:13:55] apergos: uploading new patchset [15:13:59] paced? there's a metadatajob that creates the media file jobs … i've scheduled it to run once per minute … each time it runs it places N media file jobs into the queue, default is 10, but user can set between 1 - 20 [15:14:05] (03PS2) 10Hashar: role::parsoid::beta must allow port 8000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101210 [15:14:10] apergos: apaches fixed! [15:14:16] good [15:14:24] step by step [15:14:32] lovely ferm [15:14:42] so 20 uploads/min per "batch upload"? [15:14:47] (up to) [15:14:53] yes, that's the current max potential [15:14:59] yeah that's fine [15:15:09] (03CR) 10ArielGlenn: [C: 032] role::parsoid::beta must allow port 8000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101210 (owner: 10Hashar) [15:15:32] don't worry about it [15:15:49] would that scale with multiple users? for example this example thus far considers one user … what about 10 users at a time kicking off their batch uploads [15:15:56] that's 200/min [15:16:06] yes [15:16:06] it's nothing :) [15:16:13] cool, that's good to know :) [15:17:07] paravoid, do you mind commenting on https://gerrit.wikimedia.org/r/#/c/101008/ to that effect or in the bug https://bugzilla.wikimedia.org/show_bug.cgi?id=58417 ? [15:17:21] sure [15:17:45] apergos: fixed! thank you very much :-] [15:17:49] those 20 in a batch are not serial though, if I understand correctly? that is, they could be launched at the same time? [15:18:01] hashar: it's been a long haul :-) [15:18:14] any other services we should look out for over there? [15:18:24] apergos: it's still 20 uploads [15:19:17] I think mediawiki will break before swift [15:19:18] we'll see :P [15:19:20] apergos: that's correct … there's no order [15:19:23] great :-D [15:19:37] well if you are signing off I will sleep easy :-D [15:22:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This will cause a ton of problems with fenari, don't merge yet."
[operations/puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn) [15:22:27] I mean, the maximum bandwidth we can currently write is 130MB/s [15:22:50] about a gigabit [15:23:04] apergos: and parsoid is working. Thank you ! [15:23:10] sweet! [15:23:20] apergos: if anything is missing folks will complain, file a bug and we can open rules. [15:23:30] do we expect to have many museums that will want to write to us at 1Gbps? :) [15:23:32] ok, at least the big things are covered [15:24:03] not sure … i think once GLAMs start to upload videos that's when bandwidth might become an issue ... [15:24:19] i'll chat with aaron … i'd like to see the ability to limit how many gwtoolset media file jobs are picked up as a whole [15:24:22] the museums won't, it's those pesky wiki{m,p}edia interns that will find some fast pipe to shovel stuff over :-D [15:24:30] but not out of the gate I imagine [15:24:36] :) [15:24:52] there's a limit on who can use the tool … a user must be in the gwtoolset group [15:24:53] we can always throttle the download-by-url proxy [15:25:15] so the commons admins will have some control over who can use the extension [15:25:24] so gradual ramp up anyways [15:25:30] yes [15:36:59] paravoid: would you have time to look at an ipv6 issue? this is someone with an mtu of 1280 with a problem with bits connections, he connects to bits in esams [15:37:24] (they are in the channel if you are available) [15:37:41] sure [15:40:50] kenneaal: may I introduce you to paravoid [15:40:57] who has actual network chops [15:41:54] what's the issue? [15:42:56] the reported symptoms were that connections to bits often hang for up to 20 seconds after handshake... a traceroute is here: http://pste.me/35ulx/ and he says he has a public RIPE atlas probe on the network going through the same ipv6 tunnel he is using, if we wanted to do direct testing [15:46:49] kenneaal: are you here? [15:50:54] !log reenabling ospfv3 on eqiad-esams [15:50:58] (this is unrelated) [15:51:04] (03PS1) 10Alexandros Kosiaris: Temporarily punch holes for hooft and bast4001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101216 [15:51:11] Logged the message, Master [15:51:41] (03CR) 10jenkins-bot: [V: 04-1] Temporarily punch holes for hooft and bast4001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101216 (owner: 10Alexandros Kosiaris) [15:52:28] akosiaris: no need for @ in @ferm::rule [15:52:34] akosiaris: ferm::rule has a @file [15:52:41] akosiaris: also, broken tabs :) [15:53:04] meh [15:53:07] ok [15:53:09] me fix [15:53:28] (03CR) 10BryanDavis: Production configuration for GWToolset (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 (owner: 10Dan-nl) [15:59:07] (03PS6) 10Dan-nl: Production configuration for GWToolset [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 [15:59:13] (03PS2) 10Alexandros Kosiaris: Temporarily punch holes for hooft and bast4001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101216 [15:59:48] (03CR) 10Dan-nl: "correcting the commit message. sorry about that bryan."
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 (owner: 10Dan-nl) [16:03:20] (03CR) 10Alexandros Kosiaris: [C: 032] Temporarily punch holes for hooft and bast4001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/101216 (owner: 10Alexandros Kosiaris) [16:09:11] (03PS1) 10Alexandros Kosiaris: Also open ganglia TCP ports [operations/puppet] - 10https://gerrit.wikimedia.org/r/101217 [16:10:11] (03CR) 10Aklapper: "Re libmime-(tools-)-perl: http://www.bugzilla.org/docs/4.4/en/html/installation.html#install-perlmodules states "MIME::Parser (5.406) for " [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 (owner: 10Dzahn) [16:19:01] (03PS1) 10ArielGlenn: comment out dysprosium in decomm list so configs cleanup won't [operations/puppet] - 10https://gerrit.wikimedia.org/r/101218 [16:20:17] (03CR) 10ArielGlenn: [C: 032] comment out dysprosium in decomm list so configs cleanup won't [operations/puppet] - 10https://gerrit.wikimedia.org/r/101218 (owner: 10ArielGlenn) [16:26:42] (03CR) 10Odder: [C: 031] Update favicon wiktionary/si.ico [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/100949 (owner: 10Tholam) [16:27:39] out of here, see you monday [16:28:50] (03PS1) 10Manybubbles: Cirrus config updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101219 [16:29:36] (03CR) 10Manybubbles: [C: 04-1] "No merging until the deployment window!" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101219 (owner: 10Manybubbles) [16:29:59] (03CR) 10Alexandros Kosiaris: [C: 032] Also open ganglia TCP ports [operations/puppet] - 10https://gerrit.wikimedia.org/r/101217 (owner: 10Alexandros Kosiaris) [16:32:29] (03PS2) 10Manybubbles: Cirrus config updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101219 [16:33:07] (03CR) 10Manybubbles: [C: 04-1] "Updated deployment notes but still no merging until the deployment window!" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101219 (owner: 10Manybubbles) [16:38:25] (03PS1) 10John F. Lewis: Redirect kr.wikimedia [operations/apache-config] - 10https://gerrit.wikimedia.org/r/101220 [16:38:31] (03CR) 10jenkins-bot: [V: 04-1] Redirect kr.wikimedia [operations/apache-config] - 10https://gerrit.wikimedia.org/r/101220 (owner: 10John F. Lewis) [16:48:13] (03CR) 10Aklapper: "Jenkins output says "Invalid command 'RewriteEngine', perhaps misspelled or defined by a module not included in the server configuration" " [operations/apache-config] - 10https://gerrit.wikimedia.org/r/101220 (owner: 10John F. Lewis) [16:53:00] paravoid: Many apologies, I ended up on an extended store run. If you're still around, I'm here now. [16:53:13] I am, although multitasking [16:54:13] No worries, please ping if you take a while to answer. As I mentioned to aper.gos earlier, I believe bits.wikimedia.org may not be responding to ICMPv6 Packet Too Big messages properly. [16:55:21] that's fairly unlikely to happen on our side [16:55:25] but can we get some more data first? [16:55:34] The reason I believe so is that bits.wm.org connections will occasionally hang for around 20 seconds after the TCP handshake, once my host initiates the GET request; the connection stalls for that whole time. I believe it may be transmitting a 1280+ byte packet and not responding to the resulting PMTUD message. [16:55:45] (03CR) 10Chad: [C: 04-1] Cirrus config updates (032 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101219 (owner: 10Manybubbles) [16:56:11] what's your MTU?
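A diagnostic along the lines of what kenneaal describes could time the TCP handshake and the wait for the first response byte separately, so a ~20-second stall after the GET shows up clearly. This is a hypothetical sketch, not something run in the channel; the host, plain-HTTP port 80, and the request path are assumptions:

    import socket, time

    HOST, PORT = "bits.wikimedia.org", 80   # assumed target and plain-HTTP port

    def probe_once(timeout=30):
        # Force IPv6, since the reported problem is specific to the v6 path.
        addr = socket.getaddrinfo(HOST, PORT, socket.AF_INET6,
                                  socket.SOCK_STREAM)[0][-1]
        s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
        s.settimeout(timeout)
        t0 = time.time()
        s.connect(addr)
        t_connect = time.time() - t0
        # Any URL works for timing; a response large enough to need full-size
        # packets makes a PMTU black hole easier to see.
        s.sendall(b"GET / HTTP/1.1\r\nHost: bits.wikimedia.org\r\n"
                  b"Connection: close\r\n\r\n")
        t1 = time.time()
        s.recv(4096)              # the reported ~20 s stall would show up here
        t_first_byte = time.time() - t1
        s.close()
        return t_connect, t_first_byte

    for _ in range(5):
        print("connect %.3fs, first byte after %.3fs" % probe_once())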
[16:56:14] We can. I can offer access to a VM on the network, or the probe ID of a RIPE Atlas on my network, if that helps. [16:56:19] MTU for the link is 1280. [16:56:36] is 2001:16d8:ee00:8146:20c:29ff:fe99:4669 your ip? [16:56:43] (03CR) 10Manybubbles: Cirrus config updates (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101219 (owner: 10Manybubbles) [16:56:51] a VM on your network would be best [16:57:15] From 2001:16d8:ee00:146::1 icmp_seq=1 Packet too big: mtu=1428 [16:57:15] That's one of my hosts, yes. [16:57:30] that's the port80 sixxs node [16:57:33] Sec, let me generate an account for you. [16:58:14] I can transmit > 1280 sized packets to that IP [16:58:31] (03CR) 10Chad: Cirrus config updates (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101219 (owner: 10Manybubbles) [16:58:51] 1428 passes fine [16:59:03] Sorry, that was my bad. I increased the MTU from 1280 (the default) to 1428 yesterday as part of debugging the connection. [16:59:17] and everything works now? [16:59:38] No, same problem persists. [17:00:38] PMed VM login details. [17:10:45] (03PS3) 10Manybubbles: Cirrus config updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101219 [17:12:02] (03CR) 10Chad: [C: 031] Cirrus config updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101219 (owner: 10Manybubbles) [18:32:34] PROBLEM - DPKG on mw1017 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:33:38] !log install PHP5 5.3.10-1ubuntu3.9+wmf1 on mw1017 (test.wikipedia.org) [18:33:52] Logged the message, Master [18:41:29] (03PS1) 10Alexandros Kosiaris: Rename akosiaris => alexandros kosiaris [operations/puppet] - 10https://gerrit.wikimedia.org/r/101244 [18:42:43] (03CR) 10Alexandros Kosiaris: [C: 032] Rename akosiaris => alexandros kosiaris [operations/puppet] - 10https://gerrit.wikimedia.org/r/101244 (owner: 10Alexandros Kosiaris) [18:47:55] hola [18:48:13] hi [18:48:21] I was wondering how I get an RT account. I was looking at the analytics onboarding wiki [18:48:26] and it looks like i need one [18:48:35] https://www.mediawiki.org/wiki/Analytics/Onboarding [18:52:43] (03CR) 10Ori.livneh: [C: 04-1] "Might as well echo a time delta and report it to Graphite as well. This is how I did it in scap:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/100913 (owner: 10Anomie) [18:56:52] nuria: could you send a quick mail with that request to the address in the channel topic? [18:57:19] yes, will do. [18:59:07] nuria: thx [19:02:42] (03CR) 10Dzahn: "Ariel, for this you should ignore what is installed on kaulen altogether. the kaulen installation never used the system packages, it used " [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 (owner: 10Dzahn) [19:14:29] paravoid: gdash & graphite running on apache now; graphite w/LDAP [19:14:34] should be done with the migration today or tomorrow, finally [19:18:38] (03CR) 10Ori.livneh: "..you could add this to mw-deployment-vars:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/100913 (owner: 10Anomie) [19:24:35] ACKNOWLEDGEMENT - DPKG on mw1017 is CRITICAL: DPKG CRITICAL dpkg reports broken packages alexandros kosiaris caused by manual PHP5 upgrade. Will be fixed after deployment of new packages to brewster. Pending deployment to production.
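For the path-MTU side of the debugging above, one rough approach is to binary-search the largest ICMPv6 echo payload that still gets a reply with PMTU discovery forced on. This sketch is illustrative only: it assumes Linux iputils (where the command is ping6, or "ping -6" on newer systems), the bits.wikimedia.org target from the conversation, and the 1280/1500-byte MTUs mentioned above as the search bounds:

    import subprocess

    TARGET = "bits.wikimedia.org"    # assumption: same destination as above

    def ping_ok(payload_bytes):
        # iputils flags: -M do forbids local fragmentation, -s sets the ICMPv6
        # payload size, -w 2 gives up after two seconds.
        cmd = ["ping6", "-c", "1", "-w", "2", "-M", "do",
               "-s", str(payload_bytes), TARGET]
        return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL) == 0

    def probe_path_mtu(lo=1232, hi=1452):
        # Payload bounds correspond to the 1280- and 1500-byte MTUs discussed
        # above (path MTU = payload + 8 bytes ICMPv6 + 40 bytes IPv6 header).
        best = None
        while lo <= hi:
            mid = (lo + hi) // 2
            if ping_ok(mid):
                best, lo = mid, mid + 1
            else:
                hi = mid - 1
        return None if best is None else best + 48

    print("estimated path MTU:", probe_path_mtu())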
[19:33:07] paravoid, we'll merge https://gerrit.wikimedia.org/r/#/c/101052/ if you have no objections [19:33:27] greg-g: fix for the broken https://www.mediawiki.org/wiki/Talk:Sandbox is https://gerrit.wikimedia.org/r/#/c/101253 , just need to sync_dir extensions/Flow. When can someone Lightning Deploy? (We have pastries!) [19:34:00] spagewmf: at any point, is kaldari available for it? [19:34:04] (I'll take a pastry ;) ) [19:37:38] is Kaldari required for this? I have deploy capability from my stint in E3... [19:38:28] spagewmf: are you comfortable with it? then yeah, you can [19:38:44] comfortable is a relative term. OK, doing it, thanks. [19:38:48] :) [19:45:43] hmmm [19:45:45] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Filtered_RecentChanges_errors [19:45:50] this might be a db performance issue. [19:46:17] (long queries in commonly-used UI) [19:46:31] springle-away: ^ [19:53:32] gwicke: no objections, but let's not use that for private wikis [19:53:36] gwicke: make https to http instead [19:54:00] paravoid, yup [19:54:02] oh hm, will you get redirects? [19:54:09] greg-g: It's fixed, logmsgbot missed the log. [19:54:17] maybe it won't work actually :) [19:54:39] paravoid, we should check if the backend is happy with http for those wikis [19:54:43] spagewmf: that's odd [19:54:48] !log Flow team deployed bug 58455 fix to 1.23wmf7 [19:54:48] spagewmf: but glad it's fixed [19:55:03] oh, logmsgbot was gone [19:55:06] Logged the message, Master [19:55:40] greg-g logmsgbot rejoined 11:52. Something about going out for a snack [20:00:56] (03CR) 10Dzahn: "libmime-tools-perl has formerly been libmime-perl and was listed as such in the mozilla wiki requirements page" [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 (owner: 10Dzahn) [20:03:14] (03CR) 10Dzahn: "and libmime-perl is in the Required section on https://wiki.mozilla.org/Bugzilla:Prerequisites#Ubuntu" [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 (owner: 10Dzahn) [20:06:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:01:27 PM UTC [20:08:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:01:27 PM UTC [20:10:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:01:27 PM UTC [20:12:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:01:27 PM UTC [20:14:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:01:27 PM UTC [20:14:55] (03PS5) 10Dzahn: install various Perl modules needed by Bugzilla [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 [20:16:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:01:27 PM UTC [20:18:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:01:27 PM UTC [20:20:27] (03CR) 10Dzahn: "not installing virtual libemail-mime-modifier-perl anymore, installing all the real packages that would come from it instead. 
also sorted " [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 (owner: 10Dzahn) [20:20:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:01:27 PM UTC [20:22:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:01:27 PM UTC [20:24:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:01:27 PM UTC [20:26:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:01:27 PM UTC [20:27:17] (03CR) 10Ori.livneh: [C: 04-1] "Unscored reviews are easy to miss in Gerrit, so I'm updating my review to a -1, even though it's more of a -0.25." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96403 (owner: 10Dzahn) [20:28:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:01:27 PM UTC [20:29:23] (03PS1) 10Ottomata: Using custom ganglia module instead of Logster. [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/101431 [20:30:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:01:27 PM UTC [20:30:43] RECOVERY - Puppet freshness on sq80 is OK: puppet ran at Fri Dec 13 20:30:40 UTC 2013 [20:32:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:30:40 PM UTC [20:34:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 08:30:40 PM UTC [20:41:39] (03CR) 10Ori.livneh: [C: 04-1] "There's a nice pattern for tail -f in Python at , which is part of dabeaz's awesome "Generator" (035 comments) [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/101431 (owner: 10Ottomata) [20:43:31] !IE6 [20:43:31] April 8 2014 - celebrate end of extended support [20:54:55] does that also make all IE6 auto-destroy? [20:56:09] (03PS1) 10Jgreen: add *-dev hostname cnames for testing on lutetium [operations/dns] - 10https://gerrit.wikimedia.org/r/101432 [20:57:23] (03CR) 10Jgreen: [C: 032 V: 031] add *-dev hostname cnames for testing on lutetium [operations/dns] - 10https://gerrit.wikimedia.org/r/101432 (owner: 10Jgreen) [21:00:47] RECOVERY - Puppet freshness on sq80 is OK: puppet ran at Fri Dec 13 21:00:45 UTC 2013 [21:09:17] lol, a ticket that uses "please do the needful." [21:21:45] PROBLEM - Puppet freshness on mchenry is CRITICAL: Last successful Puppet run was Fri 13 Dec 2013 06:20:43 PM UTC [21:28:37] (03CR) 10Faidon Liambotis: "I don't understand why you're beating yourself with all this work, both with logster and with this, instead of just hacking a few lines in" [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/101431 (owner: 10Ottomata) [21:53:42] looks like analytics1012 has been down for 29 days. i'm going to ack the alarm in icinga to test that i'm able to ack. [21:54:03] ACKNOWLEDGEMENT - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100% Jeff Gage testing ACK ability [21:54:08] cool [21:58:23] hah, thanks [21:58:29] yeah that thing is very unhappy [21:58:37] waiting for some firmware fix or something :( [21:58:42] Hi gage! [22:00:11] paravoid, ahhh! 
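The generator-based tail -f pattern that Ori's review comment above refers to (from David Beazley's generator material) looks roughly like the following; the stats-file path used in the usage example is hypothetical, not the actual varnishkafka default:

    import os
    import time

    def follow(path, poll_interval=1.0):
        """Yield lines appended to `path`, starting at the current end of file."""
        with open(path) as f:
            f.seek(0, os.SEEK_END)
            while True:
                line = f.readline()
                if not line:
                    time.sleep(poll_interval)   # nothing new yet; poll again
                    continue
                yield line

    if __name__ == "__main__":
        # Hypothetical stats-file location, for illustration only.
        for line in follow("/var/cache/varnishkafka/varnishkafka.stats.json"):
            print(len(line), "bytes")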
[22:00:18] i don't know why i'm beating myself up with all that either [22:00:35] originally logster because it was less coupled [22:00:55] kept monitoring code out of varnishkafka [22:01:06] and logster made parsing the json files generic to any monitoring solution, which was nice [22:01:09] but mHHhhhhh [22:01:35] i could look into statsd i guess... [22:01:57] Snaps and I talked about it, he didn't seem excited about that, but maybe I'm wrong there [22:02:34] ? [22:02:46] statsd in vk? [22:05:03] (03CR) 10Mattflaschen: "We're targeting Wed., the 18th 00:00 UTC (Tue., the 17th, 16:00 PST) for deployment." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [22:06:12] yup [22:08:53] that would mean transforming from json to statsd format inside vk. can't we do that in logster? [22:09:03] or some other external tool [22:09:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [22:10:12] logster was giving me a giant headache because of an apparent ganglia positive slope bug [22:10:29] so, i went with ori-l's suggestion and decided to just compute the rate of change manually in a custom ganglia module [22:10:44] still working on this: [22:10:44] https://gerrit.wikimedia.org/r/#/c/101431/ [22:11:00] paravoid says "why don't you just add statsd support!" [22:11:07] I say "iunno, maybe we should" [22:11:09] should we? [22:12:09] preferably not in varnishkafka, but in an external tool that tails the json stats file. we could even spawn that program from vk so it can just read from stdin [22:12:27] sure, that works [22:12:28] (03PS3) 10Nemo bis: Per bug #48012. Patch for worker.py. It checks for external programs existence in the initialization part. [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/63390 (owner: 10Sanja pavlovic) [22:12:46] oof, no thanks :) [22:12:55] i mean, what I just wrote is an external tool that tails the json file [22:13:26] (03CR) 10Ottomata: "Because Snaps doesn't like it(?)" [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/101431 (owner: 10Ottomata) [22:13:34] (03CR) 10Ottomata: ":)" [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/101431 (owner: 10Ottomata) [22:14:03] good! what's wrong with it? [22:14:37] Snaps: I don't have a strong opinion about this, but [22:14:48] Generally I understand the reservations about encumbering a nice UNIXy tool with gratuitous features; there's something to be said for doing one thing and doing it well [22:15:15] but StatsD is an emergent standard with a lot of support, and it's almost laughably trivial [22:15:26] see https://github.com/b/statsd_spec [22:16:01] yeah, and I don't think there is much to gain from embedding that functionality in vk. It is more work than making a standalone prog. [22:16:02] just sock.send('varnishkafka.myfoo:15|ms') [22:17:07] a standalone statsd tailer would be a lot slimmer than a full-blown ganglia plugin, too [22:17:23] because statsd would be keeping state for you [22:17:32] so you're absolved from having to track deltas [22:17:52] a bit like RRD COUNTERs, except not broken :) [22:18:38] I hear you [22:20:29] the stats data is in json format, which means vk will need to parse that json and transform it into the statsd format. It is absolutely doable, but it's much less work doing it in a high-level language with fancy json parsing. That's all I'm saying :) [22:20:47] and that would also allow for other stats outputs than statsd, such as ganglia or whatever.
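A sketch of the standalone statsd tailer being discussed: it reads one JSON stats object per line from stdin (so varnishkafka could spawn it, or it could be fed by a tail of the stats file) and forwards a few numeric fields to statsd over UDP. The field names, metric prefix, statsd address, and the one-object-per-line assumption are illustrative, not the real varnishkafka stats schema:

    import json
    import socket
    import sys

    STATSD_ADDR = ("127.0.0.1", 8125)    # assumption: local statsd daemon
    PREFIX = "varnishkafka"              # illustrative metric prefix
    FIELDS = ("txmsgs", "txerrs")        # hypothetical counter fields

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        try:
            stats = json.loads(line)
        except ValueError:
            continue                     # skip partially written lines
        for field in FIELDS:
            if field in stats:
                # Running totals go out as gauges ("|g"); deltas could instead
                # be computed here and sent as counters ("|c").
                packet = "%s.%s:%d|g" % (PREFIX, field, stats[field])
                sock.sendto(packet.encode("ascii"), STATSD_ADDR)

Sending the running totals as gauges keeps the tailer itself stateless; rates can then be derived on the statsd/graphite side rather than in a ganglia module, which is the point made above about not tracking deltas by hand.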
[22:24:03] But it's up to you, so don't let my opinion ruin the day :) [22:27:20] yeah, i'm persuaded [22:27:34] but it's up to ottomata & paravoid i guess [22:29:25] jgage: nice, icinga ACK test successful [22:30:45] if you happen to see an open ticket with a matching host name, you could mention/link it in the ACK message even [22:30:57] cool ok [22:31:28] #6238 in this case [22:31:54] yes [22:32:35] jgage: btw, on a related note, when you do stuff on servers, there is the !log feature here, you should use it for any non-gerrit changes on prod [22:32:51] and it ends up on https://wikitech.wikimedia.org/wiki/SAL [22:32:55] yeah, noticed that [22:33:02] and, if it wasn't broken right now, even on Twitter [22:33:04] i had fun peeking at the SAL before i was hired [22:33:05] and/or identi.ca [22:33:13] k, cool! [22:33:44] oh, SAL -> Twitter is even broken? [22:33:50] yea [22:33:56] hah [22:34:00] last time i checked [22:34:29] i'm not really a twitter guy, but reading the SAL on wikitech is a good habit [22:34:34] to see what others did last night [22:53:25] greg-g: isn't it identica -> twitter that's actually broken for that [22:53:58] p858snake|l: don't think it ever went straight to identi.ca. there was a point in time when the identi.ca bit broke but the twitter bit didn't [22:54:14] I mean "through identi.ca" not "straight to" [22:54:31] hmm maybe not, my client is saying wikimedia something as the client [22:55:12] yeah [23:34:41] (03CR) 10Arav93: "Is this fixed or should I merely change those mentioned in the comment?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598 (owner: 10Arav93) [23:36:19] we are just overhauling our repo layout and are creating a deploy repo along the lines of https://www.mediawiki.org/wiki/Parsoid/Packaging [23:37:04] the debian/ subdir will hold the upstart and systemd configs, besides eventually an actual debianization [23:37:12] so it needs to be controlled by ops [23:37:40] should we add all of puppet as a submodule and symlink debian to some subdir in there? [23:38:04] or should we create a new ops-controlled repository for this? [23:46:14] (03CR) 10Aklapper: "I tested on a fresh Labs instance (boogswolibs) with a copy of the kaulen setup (means: custom /lib still existing in /bugzilla)." [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 (owner: 10Dzahn) [23:48:56] (03CR) 10Dzahn: "thanks for testing! it was worth a try because of the comment above starting "PS4: use libemail-sender-perl instead of libemail-send-perl"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/101174 (owner: 10Dzahn)