[07:59:37] !log dump es1007 to db1004, tokudb external storage page compression test. ok to kill in emergency
[07:59:42] Logged the message, Master
[08:30:57] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[08:46:28] <_joe_> springle: cool
[08:46:45] <_joe_> springle: did toku fix its issues?
[08:48:12] (PS2) JanZerebecki: puppetize icinga tmpfs mount in prod [puppet] - https://gerrit.wikimedia.org/r/158555 (owner: Dzahn)
[08:49:35] (CR) JanZerebecki: [C: 1] "Its fine to use the tmpfs mount on labs, too." [puppet] - https://gerrit.wikimedia.org/r/158555 (owner: Dzahn)
[08:51:49] (CR) JanZerebecki: [C: 1] fix cert mismatch on mail.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/154223 (https://bugzilla.wikimedia.org/44731) (owner: Jeremyb)
[08:56:39] (CR) JanZerebecki: "Because of the HTTPS certificate requirements the main cluster seems to be the logical choice for this redirect, otherwise one would need " [dns] - https://gerrit.wikimedia.org/r/154222 (https://bugzilla.wikimedia.org/44731) (owner: Jeremyb)
[08:59:06] (CR) JanZerebecki: [C: 1] ssl_ciphersuite - change Header add to Header set [puppet] - https://gerrit.wikimedia.org/r/155016 (owner: Chmarkine)
[09:09:24] (PS1) Giuseppe Lavagetto: mediawiki: remove submodule [puppet] - https://gerrit.wikimedia.org/r/158601
[09:09:26] (PS1) Giuseppe Lavagetto: beta: use HHVM everywhere, get rid of mod_php [puppet] - https://gerrit.wikimedia.org/r/158602
[09:16:08] (CR) Giuseppe Lavagetto: [C: 2] "This submodule is not used anymore."
[puppet] - https://gerrit.wikimedia.org/r/158601 (owner: Giuseppe Lavagetto)
[09:27:57] (PS2) Giuseppe Lavagetto: beta: use HHVM everywhere, get rid of mod_php [puppet] - https://gerrit.wikimedia.org/r/158602
[09:38:19] (PS3) Giuseppe Lavagetto: beta: use HHVM everywhere, get rid of mod_php [puppet] - https://gerrit.wikimedia.org/r/158602
[10:12:28] (CR) JanZerebecki: "Yes, please abandon. The monitoring script in here actually parses the JSON which is better, but unless anyone else cares I only want to r" [puppet] - https://gerrit.wikimedia.org/r/136095 (owner: Christopher Johnson (WMDE))
[10:27:47] _joe_: toku still has foibles :)
[10:28:31] works will in tandem with innodb for certain use cases, but would never handle everything by itself
[10:33:40] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[10:49:58] (PS1) Giuseppe Lavagetto: puppet-compiler: alleviate disk space problems [puppet] - https://gerrit.wikimedia.org/r/158612
[10:51:26] (CR) Giuseppe Lavagetto: [C: 2] puppet-compiler: alleviate disk space problems [puppet] - https://gerrit.wikimedia.org/r/158612 (owner: Giuseppe Lavagetto)
[10:59:28] (PS1) Giuseppe Lavagetto: puppet-compiler: fixups [puppet] - https://gerrit.wikimedia.org/r/158613
[10:59:38] <_joe_> dam puppet
[11:00:38] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[11:01:09] (CR) Giuseppe Lavagetto: [C: 2] puppet-compiler: fixups [puppet] - https://gerrit.wikimedia.org/r/158613 (owner: Giuseppe Lavagetto)
[11:02:38] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[11:08:40] <_joe_> I'm away
[12:04:52] !log running extensions/GlobalCssJs/removeOldManualUserPages.php for [[m:GlobalCssJs]]
[12:04:59] Logged the message, Master
[12:29:10] legoktm: s3 content model fields are complete
[12:29:29] :DDD
[12:29:34] yay, thanks :)
[12:29:35] do you want me to prioritise another shard, or can I go back to slave-by-slave for the rest?
[12:30:01] I think slave-by-slave should be good for now
[12:30:07] ok
[12:32:57] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[12:44:31] (PS4) Giuseppe Lavagetto: beta: use HHVM everywhere, get rid of mod_php [puppet] - https://gerrit.wikimedia.org/r/158602
[12:59:30] PROBLEM - Host lutetium is DOWN: PING CRITICAL - Packet loss = 100%
[13:00:06] <_joe_> let's see
[13:00:17] RECOVERY - Host lutetium is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms
[13:00:24] !log lutetium dist-upgrade and reboot
[13:00:34] <_joe_> Jeff_Green: :)
[13:00:40] <_joe_> I got paged
[13:00:46] <_joe_> is that normal?
[13:00:48] Logged the message, Master
[13:00:50] what does lutetium do?
[13:04:08] (PS1) Giuseppe Lavagetto: Add patch to make RUSAGE_THREAD available to profiling. [debs/hhvm] - https://gerrit.wikimedia.org/r/158616
[13:08:41] sorry, I tried to mark it down for maintenance in icinga but I guess it didn't take
[13:09:11] it looks like it took, though. wtf icinga.
[13:09:48] lutetium is the all-purpose/QA box in fundraising, I ran a dist-upgrade and rebooted for the kernel update
[13:16:34] heya akosiaris, yt?
[13:16:58] ok
[13:17:08] is it a SPOF & does it need to be a critical, paging check?
[13:20:58] yeah, it's the box the fundraisers all use for various things requiring shell access
[13:21:23] i.e.
it hosts db's they can query for reporting, test versions of civicrm, etc
[13:21:41] so if it's down it's likely to inconvenience several people
[13:22:22] that said, I think this is an icinga (or user=me not understanding icinga) problem
[13:22:48] why did it page when it was marked down for maintenance?
[13:59:18] <_joe_> mmmh grrrit-wm seems unhealthy
[14:02:33] (PS2) Giuseppe Lavagetto: Add patch to make RUSAGE_THREAD available to profiling. [debs/hhvm] - https://gerrit.wikimedia.org/r/158616
[14:02:36] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Add patch to make RUSAGE_THREAD available to profiling. [debs/hhvm] - https://gerrit.wikimedia.org/r/158616 (owner: Giuseppe Lavagetto)
[14:02:38] (PS1) Giuseppe Lavagetto: fix lintian overrides [debs/hhvm] - https://gerrit.wikimedia.org/r/158621
[14:02:39] (CR) Giuseppe Lavagetto: [C: 2 V: 2] fix lintian overrides [debs/hhvm] - https://gerrit.wikimedia.org/r/158621 (owner: Giuseppe Lavagetto)
[14:03:16] hey, so, i'd like to get cdh5 working with trusty
[14:03:17] but.
[14:03:18] http://archive-primary.cloudera.com/cdh5/ubuntu/
[14:03:21] packages not available
[14:04:56] can I uh...somehow import them into our apt( via updates?) to make the precise ones just install on trusty?
[14:05:04] I will test, but I think it is likely they will just work as is
[14:05:18] (PS1) Reedy: Use sync-dir to copy out l10n json files, build cdbs on hosts [puppet] - https://gerrit.wikimedia.org/r/158623
[14:05:20] (PS1) Reedy: Remove sync-l10nupdate(-1)? [puppet] - https://gerrit.wikimedia.org/r/158624
[14:07:16] (PS5) BBlack: Move all geoip-based resolution to DYNA [dns] - https://gerrit.wikimedia.org/r/158382
[14:08:40] (PS2) Reedy: Use sync-dir to copy out l10n json files, build cdbs on hosts [puppet] - https://gerrit.wikimedia.org/r/158623
[14:33:19] is GDubuc on irc?
[14:33:57] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[14:36:16] Steinsplitter: Not currently it would seem
[14:36:39] k, thanks.
[14:39:02] (PS2) JanZerebecki: Avoid referencing private contacts in icinga::monitor on labs. [puppet] - https://gerrit.wikimedia.org/r/158355
[14:40:39] Steinsplitter: gi11es is here, yes.
[14:40:59] marktraceur: thanks
[14:41:01] He may want to hilight on gdubuc :)
[14:41:09] Oh, I think he went to an appointment actually
[14:41:12] See -multimedia
[14:41:51] i see :D thanks.
[14:56:36] (PS1) Nuria: Setting celery concurrency to 24 [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/158629
[14:56:38] (CR) jenkins-bot: [V: -1] Setting celery concurrency to 24 [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/158629 (owner: Nuria)
[14:59:32] ottomata: what has importing them into our apt respository to do with if they were made for precise or trusty?
[14:59:42] (PS2) Nuria: Setting celery concurrency to 24 [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/158629
[15:02:45] jzerebecki: i don't know, but, if I can install the precise packages and they work on trusty, just wondering if we have to repackage them or if I can just override and trusty install the precise packages
[15:03:01] (there's a lot I don't know about apt and deb packages and releases)
[15:03:09] (distributions)
[15:03:40] * YuviPanda waves at _joe_
[15:03:48] want to merge the vumi patch?
[15:13:16] (CR) Chad: "Let's go ahead and get the dns merged."
[dns] - https://gerrit.wikimedia.org/r/156214 (owner: Dzahn)
[15:15:09] <_joe_> YuviPanda: err I'm actually going off now
[15:15:13] ah ok
[15:15:16] <_joe_> I'm working since 7 AM
[15:15:19] I'll bug someone else, or push it to Monda
[15:15:23] y
[15:15:25] <_joe_> I may be back later
[15:15:28] heh
[15:15:31] hopefully not :)
[15:15:34] * YuviPanda eyes ottomata
[15:15:41] (PS3) Reedy: Use sync-dir to copy out l10n json files, build cdbs on hosts [puppet] - https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443)
[15:15:42] <_joe_> ehehe otto is your man
[15:15:57] (PS2) Reedy: Remove sync-l10nupdate(-1)? [puppet] - https://gerrit.wikimedia.org/r/158624
[15:16:00] ehhhh?
[15:16:16] ottomata: want to merge the patch removing all the mobile::vumi code?
[15:16:22] (CR) Faidon Liambotis: "\o/ I've been wanting to do this since the beginning, so glad I see this! +2 in principle, I'll try to look at it more closely on Sunday, " [dns] - https://gerrit.wikimedia.org/r/158382 (owner: BBlack)
[15:16:23] you +1'd it yesterday ;)
[15:16:37] ah, um, yeah, i'm cool with that, uhhhhhhhhHHHHH but you know, it IS Friday and all
[15:17:11] heh
[15:17:19] Monday then :)
[15:20:05] The rule is "no deploys on Friday"
[15:20:09] No rule about Saturday
[15:20:13] :3
[15:22:00] <^d> Is gerrit really freaking slow for everyone else?
[15:22:35] (CR) BryanDavis: [C: 1] "files/misc/l10nupdate/sync-l10nupdate-1 is just a poor man's sync dir. This looks sane to me." [puppet] - https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) (owner: Reedy)
[15:23:05] <^d> Hmm, gerrit's wigging out on cpu.
[15:24:22] ottomata: I don't know about anything technical that would prevent those packages from being put into our repos, just because they were done for a different release. I did something like that though not at wikimedia. where come the hadoop packages we currently use from?
[15:25:22] from cloudera, we mirror them via reprepro updates
[15:26:04] heading to cafe, bbl
[15:26:21] (CR) BryanDavis: "Oops. sync-dir needs an ssh-agent to operate." [puppet] - https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) (owner: Reedy)
[15:26:35] (PS1) JanZerebecki: Puppetize icinga log file permission fix. [puppet] - https://gerrit.wikimedia.org/r/158633
[15:29:22] (CR) JanZerebecki: "Could you check that that is what is actually the current state on neon or is better and works with ircecho?" [puppet] - https://gerrit.wikimedia.org/r/158633 (owner: JanZerebecki)
[15:30:22] (CR) BryanDavis: "See https://github.com/wikimedia/operations-puppet/blob/production/modules/beta/files/wmf-beta-scap for a working example of a shell scrip" [puppet] - https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) (owner: Reedy)
[15:36:01] <^d> Hmm, better.
[15:47:48] (CR) Mark Bergsma: [C: -1] "Getting there. :) A few more comments inline." (5 comments) [puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas)
[15:49:27] (PS3) JanZerebecki: puppetize icinga tmpfs mount in prod [puppet] - https://gerrit.wikimedia.org/r/158555 (owner: Dzahn)
[15:51:10] (CR) JanZerebecki: "It seems the mount resource type doesn't ensure there is a directory for the target first."
[puppet] - https://gerrit.wikimedia.org/r/158555 (owner: Dzahn)
[15:52:00] (CR) Reedy: [C: -1] "-1 per Bryan" [puppet] - https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) (owner: Reedy)
[15:57:47] (PS1) Mark Bergsma: Allocate sandbox vlans for codfw and ulsfo [dns] - https://gerrit.wikimedia.org/r/158636
[16:00:42] things seem exceptionally slow on enwiki right now
[16:05:37] (CR) JanZerebecki: [C: 1] "Works in labs:" [puppet] - https://gerrit.wikimedia.org/r/158555 (owner: Dzahn)
[16:09:14] and css is intermittently not loading
[16:16:56] (PS1) BBlack: Turn off include_optional_ns for gdnsd [puppet] - https://gerrit.wikimedia.org/r/158637
[16:19:47] (PS2) BBlack: Turn off include_optional_ns for gdnsd [puppet] - https://gerrit.wikimedia.org/r/158637
[16:23:08] ottomata: you can continute to do the mirroring like you already do. so it just depends on the packages themselves working on trusty.
[16:24:25] aye, i probably just have to tell apt to pin to the precise release in trusty, or something...
[16:33:39] Eeep, bits issues
[16:33:41] ottomata: you shouldn't need to, pinning is usually used if you want an lower version package than what is available form any repo. but if for those packages that repo contains the only version, that is the one that gets selected.
[16:34:05] marktraceur: ?
[16:34:08] Hm, back
[16:34:19] greg-g: Just loaded a file page and didn't get the vector CSS
[16:34:41] marktraceur: please join complainers on #wikimedia-tech
[16:34:57] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[16:40:37] PROBLEM - Varnish HTTP bits on cp1069 is CRITICAL: Connection timed out
[16:41:23] bblack: ^^^ and people are complaining of no css/js right now
[16:41:28] RECOVERY - Varnish HTTP bits on cp1069 is OK: HTTP OK: HTTP/1.1 200 OK - 212 bytes in 4.141 second response time
[16:41:37] (PS1) Manybubbles: Collect elasticsearch metrics less frequently [puppet] - https://gerrit.wikimedia.org/r/158639
[16:42:37] PROBLEM - Varnish HTTP bits on cp1057 is CRITICAL: Connection timed out
[16:43:16] bblack: things going on that we should know about? ^^^
[16:43:27] RECOVERY - Varnish HTTP bits on cp1057 is OK: HTTP OK: HTTP/1.1 200 OK - 211 bytes in 2.167 second response time
[16:44:19] anyone, bueler?
[16:49:24] (PS10) Chad: public_html directory service, see RT #6862 [puppet] - https://gerrit.wikimedia.org/r/149890 (owner: ArielGlenn)
[16:49:32] not that I'm aware of, no
[16:50:08] I did notice some off issues of packet loss / latency on some ssh connections from my place to tampa this morning, and wrote it off as something in AT&T or my DSL
[16:50:09] (CR) Chad: "PS10 rebases, removes redirects from redirects.dat since we don't really support that sort of rewriting there.
Put it in noc.wm.org's apac" [puppet] - https://gerrit.wikimedia.org/r/149890 (owner: ArielGlenn)
[16:50:16] could be there's some kind of traffic or network event going on
[16:51:04] nothing looks super-crazy in ganglia
[16:52:38] PROBLEM - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.031 second response time
[16:52:59] greg-g: I just did a manual fetch for CSS on bits, and instead of the usual stuff I get (after a long timeout):
[16:53:05] /* This file is the Web entry point for MediaWiki's ResourceLoader: . In this request, no modules were requested. Max made me put this here. */
[16:53:12] (that's the whole content of the response)
[16:54:08] (CR) Chad: [C: 1] Collect elasticsearch metrics less frequently [puppet] - https://gerrit.wikimedia.org/r/158639 (owner: Manybubbles)
[16:54:26] cp1056 is out of sessions
[16:54:28] PROBLEM - Varnish HTTP bits on cp1070 is CRITICAL: Connection timed out
[16:54:37] PROBLEM - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[16:55:24] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Bits%20caches%20eqiad&h=cp1056.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1409936098&v=200000&m=varnish.n_sess_mem&vl=N&ti=N%20struct%20sess_mem&z=large
[16:55:27] RECOVERY - Varnish HTTP bits on cp1070 is OK: HTTP OK: HTTP/1.1 200 OK - 212 bytes in 2.500 second response time
[16:55:28] RECOVERY - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3874 bytes in 2.512 second response time
[16:56:01] yeah that can't be the root of the problem though. the spike is very recent
[16:56:15] what spike?
[16:56:19] something is leaking sessions
[16:56:57] spike in total sessions I mean (as in, it's not a problem with limits, it's a problem with whatever's consuming sessions)
[16:57:02] yeah
[16:57:47] RECOVERY - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3879 bytes in 3.602 second response time
[16:57:54] might be packet loss indeed
[17:02:37] PROBLEM - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[17:03:47] PROBLEM - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.034 second response time
[17:04:20] :(
[17:04:28] RECOVERY - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3874 bytes in 0.006 second response time
[17:04:47] <_joe_> it seems it's a few servers hitting 502s
[17:05:29] yeah, it doesn't seem like packet loss
[17:05:37] RECOVERY - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3891 bytes in 0.062 second response time
[17:05:59] <_joe_> no
[17:06:35] <_joe_> greg-g: did we release something around 22:00 UTC yesterday?
[17:06:48] * greg-g looks at sal
[17:07:07] <_joe_> http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Bits+caches+eqiad&h=cp1057.eqiad.wmnet&jr=&js=&v=200000&m=varnish.n_sess_mem&vl=N&ti=N+struct+sess_mem
[17:07:08] 22:52 logmsgbot: reedy Finished scap: consistency (duration: 20m 44s)
[17:07:11] 22:31 logmsgbot: reedy Started scap: consistency
[17:07:13] 22:28 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Thu Sep 4 22:27:45 UTC 2014 (duration 54m 38s)
[17:07:37] "should" have beena no-op
[17:08:12] <_joe_> no, this is from earlier yesterday. Something is creating a ton of "garbage" in varnish-bits
[17:08:20] and a bit earlier than that?
[17:08:51] well, new deploy at 18:00-19:00
[17:09:09] not sure what "20:56 MaxSem: Running cleanupPageProps.php from terbium, now for realz" is
[17:09:20] by "new deploy" I mean the train
[17:09:25] (CR) CSteipp: "I'd like to see us enable these checks." [mediawiki-config] - https://gerrit.wikimedia.org/r/158559 (https://bugzilla.wikimedia.org/49169) (owner: Physikerwelt)
[17:09:35] it hasn't been an issue today
[17:09:44] until now, past hour
[17:09:46] or less
[17:09:55] afaik
[17:09:59] <_joe_> aude: something is filling the bits varnish cache
[17:10:03] greg-g, yu dunno what running a script means? :P
[17:10:09] <_joe_> since yesterday around 12:00Z I'd say
[17:10:13] oh
[17:10:20] <_joe_> so it may be our software
[17:10:24] <_joe_> or something external
[17:10:38] <_joe_> aude: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Bits+caches+eqiad&h=cp1057.eqiad.wmnet&jr=&js=&v=200000&m=varnish.n_sess_mem&vl=N&ti=N+struct+sess_mem
[17:10:52] MaxSem: well, what it did/the purpose :P
[17:10:55] oooo
[17:12:58] So there was a change to docroots recently -- https://gerrit.wikimedia.org/r/#/c/158278/
[17:13:24] Not sure if that is related or not
[17:15:37] how would it be related?
[17:16:28] ori: Not sure. Just a change to some bit s stuff. I panicked for a minute when I checked https://bugzilla.wikimedia.org/show_bug.cgi?id=70445 on a mw* box but then realized that docroot is beta only
[17:16:45] ori: I asked the question only because I saw outage in -operations and "mwXXXX symlinks not right" in -core
[17:22:52] related or not, GeoIP is reported down as only one on status.wikimedia.org https://status.wikimedia.org/8777/156490/GeoIP-lookup
[17:25:41] (CR) Aaron Schulz: "Does the code still shell out every time even if the rendering is cached?
AFAIK that was the problem but if that was fixed than this is fi" [mediawiki-config] - https://gerrit.wikimedia.org/r/158559 (https://bugzilla.wikimedia.org/49169) (owner: Physikerwelt)
[17:28:47] se4598: remember, status.wikimedia.org uses an invalid security certificate. The certificate is only valid for the following names: status.io.watchmouse.com, www.status.io.watchmouse.com, api.io.watchmouse.com, proxy.io.watchmouse.com (Error code: ssl_error_bad_cert_domain)
[17:30:02] bblack: _joe_ can you guesstimate serious of the issue? I want to communicate out via tweet
[17:30:06] if it is
[17:32:19] (Abandoned) Nuria: Setting celery concurrency to 24 [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/158629 (owner: Nuria)
[17:34:03] greg-g: "Wikimedia sites are having intermittent issues with JavaScript & CSS. Our engineers are working on it!" <-- accurate enough?
[17:35:20] I was going to add a link to http://status.wikimedia.org/ but strangely it only show GeoIP issues, not JS/CSS.
[17:36:01] * Nemo_bis can't remember a single time in years status.wm.o provided some truthful information
[17:36:30] :/
[17:37:14] also seems that WikiLove might be broken since yesterday....
[17:37:35] are users still seeing issues then?
[17:37:42] (PS1) BryanDavis: Fix relative symlinks for bits/static-master [mediawiki-config] - https://gerrit.wikimedia.org/r/158643 (https://bugzilla.wikimedia.org/70445)
[17:37:42] VP/T reports no love over the last 24 hours and "{"servedby":"mw1141","error":{"code":"mustposttoken","info":"The 'token' parameter must be POSTed"}"
[17:38:13] Thats the api change bug. anomie ^
[17:39:29] seems the bits issues have calmed down.
[17:40:12] guillom: good enough
[17:40:21] guillom: thank you sir
[17:41:06] greg-g: sure; although, if thedj says it's over, should I still send it? I don't have much visibility into the impact and scope of the issue
[17:42:00] bd808, thedj: Is that WikiLove? Why was it ever using an edit token via GET?
That rather defeats the purpose of an edit token.
[17:43:28] (CR) Ori.livneh: [C: 2] Fix relative symlinks for bits/static-master [mediawiki-config] - https://gerrit.wikimedia.org/r/158643 (https://bugzilla.wikimedia.org/70445) (owner: BryanDavis)
[17:43:33] (Merged) jenkins-bot: Fix relative symlinks for bits/static-master [mediawiki-config] - https://gerrit.wikimedia.org/r/158643 (https://bugzilla.wikimedia.org/70445) (owner: BryanDavis)
[17:43:39] greg-g, guillom: frontend perf graph is going back to normal -- http://gdash.wikimedia.org/dashboards/frontend/
[17:43:52] ok, thanks bd808.
[17:43:59] I won't send, then
[17:44:16] instead I'll have dinner :)
[17:44:41] guillom: good choice
[17:45:48] !log ori Synchronized docroot and w: I55a01a712: Fix relative symlinks for bits/static-master (duration: 00m 13s)
[17:45:54] Logged the message, Master
[17:46:07] greg-g: I'll be around on gtalk if needed.
[17:46:11] (PS1) Nuria: Setting max celery concurrency to 24 [puppet] - https://gerrit.wikimedia.org/r/158644
[17:46:28] frontend graph has been back to normal for 30 mins I think
[17:46:39] which is why I was wondering if clients are still affected, my impression was not :)
[17:47:58] (CR) Ori.livneh: [C: 1] beta: use HHVM everywhere, get rid of mod_php [puppet] - https://gerrit.wikimedia.org/r/158602 (owner: Giuseppe Lavagetto)
[17:48:07] (PS1) BryanDavis: Fix spelling in symlink [mediawiki-config] - https://gerrit.wikimedia.org/r/158645 (https://bugzilla.wikimedia.org/70445)
[17:48:22] guillom: thanks :)
[17:49:51] (CR) Ori.livneh: [C: 2] Fix spelling in symlink [mediawiki-config] - https://gerrit.wikimedia.org/r/158645 (https://bugzilla.wikimedia.org/70445) (owner: BryanDavis)
[17:49:53] (CR) BryanDavis: [C: 2] Fix spelling in symlink (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/158645 (https://bugzilla.wikimedia.org/70445) (owner: BryanDavis)
[17:49:55] (Merged) jenkins-bot: Fix
spelling in symlink [mediawiki-config] - https://gerrit.wikimedia.org/r/158645 (https://bugzilla.wikimedia.org/70445) (owner: BryanDavis)
[17:50:11] ori: you got that on tin?
[17:50:24] also thanks
[17:50:37] thanks, yeah
[17:50:44] !log ori Synchronized docroot and w: Iaa7518613: Fix spelling in symlink (duration: 00m 15s)
[17:50:50] Logged the message, Master
[17:53:32] bd808, thedj: Ah, that explains it. The wikilove action is using a DerivativeRequest to re-enter the API (sigh), and it set $wasPosted = false (double sigh).
[17:54:50] anomie: yikes, that thing loos somewhat arcane to begin with...
[17:54:55] looks
[17:55:20] anomie: shall i file a bugreport to track this ?
[17:55:34] thedj: Sure. I'm testing a patch now, but always good to have a but.
[17:55:36] bug
[17:56:14] It's facepalm friday
[17:56:59] https://bugzilla.wikimedia.org/show_bug.cgi?id=70448
[17:57:02] there you go
[17:58:23] thedj: https://gerrit.wikimedia.org/r/158647
[17:59:43] anomie: worse, this is actually the example for how to do this...
[17:59:44] https://www.mediawiki.org/wiki/API:Calling_internally
[17:59:58] so there might be more.. :(
[18:04:08] thedj: I don't see any more that are using an action that needs posting. Some action=query, one action=post, one action=wbsearchentities.
[18:05:04] anomie: testing patch...
[18:06:34] come on vagrant...
[18:13:16] anomie: confirmed and +2
[18:14:02] (CR) Krinkle: "extenstions ?" [mediawiki-config] - https://gerrit.wikimedia.org/r/158643 (https://bugzilla.wikimedia.org/70445) (owner: BryanDavis)
[18:14:16] (CR) Krinkle: "Fixed in Iaa7518613."
[mediawiki-config] - https://gerrit.wikimedia.org/r/158643 (https://bugzilla.wikimedia.org/70445) (owner: BryanDavis)
[18:15:24] ^d: ori: Reedy: I so hope Phabricator brings back the thing we had in SVN CodeReview to see which commits point back at this one
[18:15:30] will save a lot of time and discovert
[18:16:40] <^d> <3
[18:18:28] Is that a generic wish to the confidence fairy or is there some hint it may actually happen?
[18:20:49] add that on fab.wmflabs.org as a request , for now
[18:22:01] http://krugman.blogs.nytimes.com/2013/11/08/ecb-thinking-explained/
[18:22:13] (CR) Ottomata: [C: 2 V: 2] Setting max celery concurrency to 24 [puppet] - https://gerrit.wikimedia.org/r/158644 (owner: Nuria)
[18:23:54] Nemo_bis: paywall ? do i need to go to bugmenot.com first, does that still work? heh
[18:24:35] no, it doesn't "This site has been barred from the bugmenot system."
[18:25:15] Nemo_bis: ooh, i just get the paywall login because of browser extensions blocking stuff
[18:25:23] funny
[18:25:24] nevermind, can read
[18:26:48] NoScript + ABP + privacy badger don't give me that effect btw
[18:29:58] (CR) Dzahn: [C: 1] puppetize icinga tmpfs mount in prod [puppet] - https://gerrit.wikimedia.org/r/158555 (owner: Dzahn)
[18:30:28] (that's not giving myself the +1, it's a new PS by another author, wiki style)
[18:31:32] Nemo_bis: i think i'm not sending referer at all and that may be it
[18:33:04] Hm, I have network.http.sendRefererHeader = 2
[18:33:06] (CR) Ottomata: "I can't remember the exact details, but I'm remembering a problem with counter metrics that are sent less frequently than gmetad's collec" [puppet] - https://gerrit.wikimedia.org/r/158639 (owner: Manybubbles)
[18:34:14] !log Depooled cp1056 for testing
[18:34:19] Logged the message, Master
[18:35:57] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[18:36:51] (PS1) RobH: provision ms-be/fe hosts in codfw
[dns] - https://gerrit.wikimedia.org/r/158657
[18:39:26] (CR) RobH: [C: 2] provision ms-be/fe hosts in codfw [dns] - https://gerrit.wikimedia.org/r/158657 (owner: RobH)
[18:39:36] <^d> mutante: Mind giving https://gerrit.wikimedia.org/r/#/q/status:open+topic:people.wikimedia,n,z another round of review? This is too easy to finish to let it go another week :p
[18:44:27] (PS2) Manybubbles: Collect elasticsearch metrics less frequently [puppet] - https://gerrit.wikimedia.org/r/158639
[18:44:43] (CR) Manybubbles: "What about this way then?" [puppet] - https://gerrit.wikimedia.org/r/158639 (owner: Manybubbles)
[18:45:05] (CR) jenkins-bot: [V: -1] Collect elasticsearch metrics less frequently [puppet] - https://gerrit.wikimedia.org/r/158639 (owner: Manybubbles)
[18:45:17] <^d> manybubbles: jenkins says no, not that way.
[18:45:18] <^d> :)
[18:46:07] (PS3) Manybubbles: Collect elasticsearch metrics less frequently [puppet] - https://gerrit.wikimedia.org/r/158639
[18:46:12] pep-hate
[18:58:53] (CR) Ottomata: "Not sure! It might have the same effect." [puppet] - https://gerrit.wikimedia.org/r/158639 (owner: Manybubbles)
[19:00:08] (CR) Manybubbles: [C: -1] "It'll just return the same number for 60 seconds...... Actually, that'll creates spikes in all the rate measures....." [puppet] - https://gerrit.wikimedia.org/r/158639 (owner: Manybubbles)
[19:08:40] (PS1) MaxSem: Monitor mediawiki-installation DSH group [puppet] - https://gerrit.wikimedia.org/r/158665
[19:26:11] (PS1) John F. Lewis: mailman: Korean encoding fixes [puppet] - https://gerrit.wikimedia.org/r/158668 (https://bugzilla.wikimedia.org/70180)
[19:26:45] Coren ^^ +2? :)
[19:26:58] JohnLewis: Looking at it
[19:27:30] Coren: https://gerrit.wikimedia.org/r/#/c/157708/ for quick reference to the similar one from a few days ago
[19:27:46] Yeah, I recognize.
[19:27:49] (CR) coren: [C: 2 V: 2] "Moar encoding tweaks."
[puppet] - https://gerrit.wikimedia.org/r/158668 (https://bugzilla.wikimedia.org/70180) (owner: John F. Lewis)
[19:28:13] Once that goes live; I can close a high prio bug, Thanks Coren :D
[19:28:26] np
[19:37:08] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[19:37:08] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[19:45:59] (PS2) Ori.livneh: Add shell helper for the Puppet catalog compiler [puppet] - https://gerrit.wikimedia.org/r/158435
[19:51:10] _joe_: https://asciinema.org/a/11986
[19:52:08] <_joe_> ori: _cool_
[20:05:00] ^d, I can't really make much sense of this data, or rather, I can't seem to find see any real difference
[20:05:08] (PS1) BBlack: add 120s hit_for_pass on <=0s objects for bits [puppet] - https://gerrit.wikimedia.org/r/158671
[20:05:14] the way we collected makes it kind of hard too, since we can't just line it up time-wise very well
[20:05:46] i'm trying to compare a few of the metrics, things like, r/s vs average queue size or await
[20:05:58] i can't see much difference (we weren't really expecting much anwyay)
[20:06:14] on Monday, let's try to figure out a way to compare query performance
[20:07:19] (PS2) BBlack: add 120s hit_for_pass on <=0s objects for bits [puppet] - https://gerrit.wikimedia.org/r/158671
[20:08:52] (CR) BBlack: [C: 2] add 120s hit_for_pass on <=0s objects for bits [puppet] - https://gerrit.wikimedia.org/r/158671 (owner: BBlack)
[20:11:08] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[20:11:08] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[20:18:54] !log restarted cp1056 bits cache and re-enabled in pybal
[20:19:00] Logged the message, Master
[20:20:16] !log coms folks still accessing blog data on holmium, powering back up
[20:20:21] Logged the message, RobH
[20:22:17] RECOVERY - Host holmium is UP: PING OK - Packet loss = 0%, RTA = 2.22 ms
[20:30:07] <^d> ottomata: Sounds good
[20:36:05] ok, jzerebecki, i think i just had to add an extra apt source on trusty to tell it to use precise, that's all!
[20:36:57] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[20:47:29] jzerebecki: Congrats on the new job. I look forward to you fixing lots more stuff. :)
[20:49:50] eh. i have 404s on bits now..
[20:50:36] ori: hiya
[20:51:13] i need to make vagrant render an apt/sources.list.d file for precise as well as trusty
[20:51:20] in order to install the cdh packages
[20:51:32] (CR) Dzahn: [C: 2] "true, the script itself is alreayd in the repo, it just wasn't really used. and lgtm to me:" [puppet] - https://gerrit.wikimedia.org/r/158665 (owner: MaxSem)
[20:51:43] what's the best way tot do that, just render the file manually somewhere?
[20:51:45] (PS1) RobH: setting mgmt ips for codfw db servers [dns] - https://gerrit.wikimedia.org/r/158769
[20:51:50] the vagrant apt module doesn't seem to be very flexible..
[20:52:59] I think technically vagrant mediawiki doesn't support precise anymore
[20:53:38] (CR) RobH: [C: 2] setting mgmt ips for codfw db servers [dns] - https://gerrit.wikimedia.org/r/158769 (owner: RobH)
[20:54:30] master doesn't. There is the precise-compat branch
[20:55:12] cdh needs to get on the trusty bandwagon :)
[20:55:37] hrmm
[20:55:46] maybe we should have servers not in site.pp not go into misc ganglia
[20:55:51] but into a new 'default' group
[20:56:06] cuz as i spin up new systems without service (just os) they're going to land in codfw misc...
[20:56:10] (03CR) 10MaxSem: "Is there an option to run this check infrequently, e.g. once in an hour would be perfectly fine?" [puppet] - 10https://gerrit.wikimedia.org/r/158665 (owner: 10MaxSem) [20:56:11] mutante: what do you think? ^ [20:56:29] I ask cuz you most recently worked with me on ganglia and codfw, so figured you'd be a good person to ask [20:56:37] bd808: true! [20:56:47] it'd also show any hosts not defined in site.pp quite easily [20:57:24] (03CR) 10Dzahn: "yep, even the check command was here, just needed right path" [puppet] - 10https://gerrit.wikimedia.org/r/158665 (owner: 10MaxSem) [20:58:21] (03CR) 10Dzahn: "check_interval" [puppet] - 10https://gerrit.wikimedia.org/r/158665 (owner: 10MaxSem) [20:59:47] RobH: they land in the misc group even before you add them to site.pp ? [20:59:59] i think thats what the default does iirc [21:00:32] hmm, yes, default includes standard, standard includes ganglia [21:00:39] ottomata: um, yeah, just render it into place [21:01:16] so perhaps the default should include a different group [21:01:18] RobH: hmm.. looking.. [21:01:20] seems odd to put it in misc [21:01:29] dunno, not a big deal, but figured its worth chatting about =] [21:02:02] (03PS1) 10Aaron Schulz: Assigned rfb100[12] to the redis role [puppet] - 10https://gerrit.wikimedia.org/r/158772 [21:02:04] (03PS1) 10MaxSem: Check DSH groups once in 60 mins [puppet] - 10https://gerrit.wikimedia.org/r/158773 [21:02:09] ok ori, thanks [21:02:42] RobH: if we can easily change it in standard, a group seems to make sense, whatever the name, but it somehow tells you this server isn't puppetized yet.. [21:02:44] (03CR) 10jenkins-bot: [V: 04-1] Assigned rfb100[12] to the redis role [puppet] - 10https://gerrit.wikimedia.org/r/158772 (owner: 10Aaron Schulz) [21:03:11] mutante: well, it is puppetized [21:03:17] its just not in site.pp specifically [21:03:23] RobH: yea, it would be nicer than throwing actual running misc. 
servers into the same group [21:03:28] post-install group? [21:03:45] cuz thats the ONLY time a system should be calling into puppet and ganglia and not be in site.pp [21:03:49] shrug.. pre-puppet [21:03:54] its not pre puppet though [21:03:58] true [21:03:59] puppet is signed and running ;] [21:04:05] so is salt [21:04:10] spare/unused [21:04:13] not spare [21:04:17] its allocated to a task [21:04:26] it just hasnt had service deployment [21:04:37] if its actual spare, its wiped and powered down, and doesnt hit ganglia [21:05:07] so is there any reason other than post install that a system would be calling into puppet but not in site.pp? [21:05:08] 'needsrole' :p i dunno [21:05:20] cuz the more i think about it, i dont think there is [21:05:29] its the only legitimate reason for them to be in that 'group' [21:05:48] and it should exist for all datacenters, cuz i do this all the time [21:05:51] ok, so since we have "decom", this is "com" [21:05:59] we have a decom ganglia group? [21:06:05] its not commissioned until its live ;] [21:06:12] (see, its horrible to name things!) [21:06:26] but, we agree the group should exist [21:06:30] and i need to learn ganglia stuff [21:06:35] so i'll add and make you reviewer, sound good? [21:07:33] yea, reviewer is fine [21:08:21] "idle", they are all ready and sitting there but dont really do stuff [21:08:38] and if it is in there for too long, icinga says "warning, wasting energy" :p [21:08:45] yea, that is more generic than post-install [21:08:50] i like it. [21:09:05] idle servers are servers not in site.pp, that have been allocated but have not had services deployed [21:09:20] spare are servers that are not allocated, and should be wiped and powered down. [21:09:24] seems legit to me. [21:09:28] let me check the new monitoring for dsh though.. 
[21:09:37] was in the middle of that [21:10:04] * RobH will be spinning up a few dozen database servers soon which is what prompted this [21:10:16] having them all land in codfw misc ganglia was going to annoy me [21:10:21] (03PS1) 10Ottomata: Prefix variable with @ in hue.ini.erb [puppet/cdh] - 10https://gerrit.wikimedia.org/r/158776 [21:10:33] (03CR) 10Ottomata: [C: 032 V: 032] Prefix variable with @ in hue.ini.erb [puppet/cdh] - 10https://gerrit.wikimedia.org/r/158776 (owner: 10Ottomata) [21:10:35] and i wasnt about to decide how the services are being deployed for springle [21:10:37] heh [21:12:13] RobH: \o/ [21:13:44] springle: i put in new tickets for the os deployment on the codfw db boxes [21:13:54] and the related blocking ticket for onsite work, etc... [21:14:00] you are requestor on all of them [21:14:27] my plan was to do the bulk of the old tampa server reinstalls with the os without issue (as i have a template to go by for partitioning and the like) [21:14:43] but the oddball db2020 with a disk shelf, plus the new hp boxes, i'll have to review with you once we have access [21:15:57] (03PS2) 10Ori.livneh: mediawiki: /usr/local/apache/common-local => /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/158317 [21:16:33] RobH: sounds fine [21:20:22] RobH: just being friendly here; you're doing great work from what I see :) (just thought I should mention it ;)) [21:20:43] thx =] [21:24:07] ori, would you rather me just do it manually, or add this to the vagrant apt module? [21:24:15] manually in its own role? [21:24:27] i could add a silly parameter to the apt class [21:24:44] $use_precise = false, [21:24:46] meh [21:25:39] ottomata: we used to have https://gerrit.wikimedia.org/r/#/c/144336/2/puppet/modules/apt/manifests/repo.pp [21:25:50] i deleted it in that commit because i thought i had gone too far in abstracting things [21:25:50] ja what happened to that? 
:) [21:25:56] you can resurrect it if you like [21:26:09] i trust your judgment on this stuff, whatever seems most appropriate to you is fine by me [21:27:32] well, seeing as i'm mainly just doing this to get cdh working in trusty until cloudera releases a trusty version [21:27:37] i'm ok with doing it halfway [21:27:44] i'll stick with the manual role then [21:37:01] greg-g: btw, labs graphite is now on graphite.wmflabs.org [21:37:05] labmon.wmflabs.org has gone awaaaaayyyyy [21:37:09] I'll kill that instance soon [21:37:25] ah ori, https://gerrit.wikimedia.org/r/#/c/158791/2 [21:37:27] YuviPanda: weee [21:37:36] greg-g: adding checks for disk space now [21:37:37] this works, but isn't in autoload layout...because roles are now in a module [21:38:08] Ah, I suppose i can just put it in analytics role itself... [21:39:00] greg-g: I think the disk space will go to critical instantly, btw. [21:43:56] YuviPanda: :( then maybe not ;) [21:47:37] (03PS1) 10Yuvipanda: labmon: Monitor for low disk space in /var for betalabs [puppet] - 10https://gerrit.wikimedia.org/r/158796 [21:47:43] (03PS2) 10Aaron Schulz: Assigned rfb100[12] to the redis role [puppet] - 10https://gerrit.wikimedia.org/r/158772 [21:47:52] greg-g: why not? should make it more pressing to free up the space :) [21:47:56] it shouldn't be too hard either [21:48:04] greg-g: I've reduced the thresholds [21:48:35] (03PS2) 10Yuvipanda: labmon: Monitor for low disk space in /var for betalabs [puppet] - 10https://gerrit.wikimedia.org/r/158796 [21:49:41] mutante: ^ [21:49:56] greg-g: it'll also help us fine tune the messages being given by icinga to be more useful [21:50:02] so it can report which hosts are having issues, etc [21:50:45] chasemp: ^ [21:56:29] (03CR) 10Dzahn: [C: 032] labmon: Monitor for low disk space in /var for betalabs [puppet] - 10https://gerrit.wikimedia.org/r/158796 (owner: 10Yuvipanda) [21:57:19] YuviPanda: give it some time though, neon is already adding dsh group monitoring.. 
gotta debug [21:57:32] but since i'm running puppet there anyways.. i'll tell you [21:57:50] mutante: cool [21:57:57] I'll be around for another 30m or so, and then back only on Wednesday [22:00:00] greg-g: because I reduced the limits (warn on 512M, crit on 256M) it will flap only for deployment-bastion [22:00:06] MaxSem: aah, so first it took neon a while, and we have a typo [22:00:08] so I'll see it flap and then clean it out [22:00:35] MaxSem: Service check command 'check_dsh_group' not found. because it's command_name check_dsh_groups [22:00:52] fixing, it just takes a while for neon again [22:02:38] (03PS1) 10Dzahn: fix typo in dsh group monitoring command name [puppet] - 10https://gerrit.wikimedia.org/r/158801 [22:03:25] (03CR) 10Dzahn: [C: 032] fix typo in dsh group monitoring command name [puppet] - 10https://gerrit.wikimedia.org/r/158801 (owner: 10Dzahn) [22:04:00] YuviPanda: ^ after this is fixed it should also add your check [22:04:07] cooool [22:04:09] * YuviPanda waits [22:04:33] mutante: I've to also fix/add to check_graphite functionality that will make it report exactly which metric is fucked up [22:04:55] YuviPanda: you saw the part where i changed the docs/comment ? 
[22:04:58] about the from parameter [22:05:03] ya [22:05:07] seems sensible [22:05:08] maybe that is in other places as well [22:05:11] I didn't realize the min thing [22:05:12] yeah [22:05:15] i didnt look everywhere [22:05:21] just fixed the obvious one you mentioned [22:05:47] the check for swift used "minutes" [22:05:50] which also didnt work [22:06:24] or maybe let's just add the options to make them work like aliases [22:06:37] it's somehow not consistent when "h" and "w" are ok, but "m" is not [22:07:45] yeah [22:07:51] I guess they could've also used m for month [22:08:01] so decided to fuck us there instead of fucking us elsewhere [22:08:07] haha, right [22:09:15] (03PS1) 10Kaldari: Enable Wikigrok prototype for beta labs (enwiki only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158802 [22:09:56] (03PS2) 10Kaldari: Enable Wikigrok prototype for beta labs (enwiki only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158802 [22:10:19] (03CR) 10Kaldari: [C: 032] Enable Wikigrok prototype for beta labs (enwiki only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158802 (owner: 10Kaldari) [22:10:59] (03Abandoned) 10Kaldari: Enable Wikigrok prototype for beta labs (enwiki only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158802 (owner: 10Kaldari) [22:13:29] (03PS3) 10Dzahn: add people.wm.org -> misc varnish, public_html's [dns] - 10https://gerrit.wikimedia.org/r/156214 [22:14:29] (03PS1) 10Kaldari: Enable Wikigrok prototype for beta labs (enwiki only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158804 [22:16:33] mutante: did it run? [22:17:37] YuviPanda: oh he, this is neon :) just 4 runs or so done [22:17:45] it's still busy correcting like 130 errors [22:17:48] heh [22:17:53] mutante: why is it so slow? [22:17:58] you might not be able to see it .. 
but i can update the ticket [22:18:03] wait:) [22:18:23] heh [22:18:34] mutante: oh wait, I remember its RAM or something is being upgraded soon [22:18:40] (03PS3) 10Aaron Schulz: Assigned rfb100[12] to the redis role [puppet] - 10https://gerrit.wikimedia.org/r/158772 [22:18:57] YuviPanda: puppet runs on icinga/nagios have always been among the slowest ones though [22:19:03] it's creating a ton of resources [22:19:13] (03CR) 10Kaldari: [C: 032] Enable Wikigrok prototype for beta labs (enwiki only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158804 (owner: 10Kaldari) [22:19:18] (03Merged) 10jenkins-bot: Enable Wikigrok prototype for beta labs (enwiki only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158804 (owner: 10Kaldari) [22:19:25] like, there are 10k service checks [22:19:31] (03PS4) 10Ori.livneh: Assigned rbf100[12] to the redis role [puppet] - 10https://gerrit.wikimedia.org/r/158772 (owner: 10Aaron Schulz) [22:19:34] we just added 230 .. shrug [22:20:01] YuviPanda: on every run it fixes some but not all of them.. [22:20:26] heh [22:20:27] say you have a wrong commandline, and there is a service defined by puppet for every host in mediawiki-installation to use it [22:20:31] I guess it's all the resource collection [22:20:32] right [22:20:34] that's forever [22:20:37] yes [22:21:27] heh, but it doesnt actually break the web interface like back in the days [22:21:46] because it checks for errors before ever restartin [22:22:24] Reedy: I'm going to do a quick config deployment of InitialiseSettings-labs.php. Won't effect anything else. Hope that's OK. 
[22:22:34] Fine by me :) [22:25:01] !log kaldari updated /a/common to {{Gerrit|I6039956eb}}: Enable Wikigrok prototype for beta labs (enwiki only) [22:25:08] Logged the message, Master [22:25:42] !log kaldari Synchronized wmf-config/InitialiseSettings-labs.php: enabling wikigrok on beta labs (en only) (duration: 00m 05s) [22:25:48] Logged the message, Master [22:26:09] (03CR) 10Dzahn: [C: 032] add people.wm.org -> misc varnish, public_html's [dns] - 10https://gerrit.wikimedia.org/r/156214 (owner: 10Dzahn) [22:29:31] (03CR) 10Reedy: [C: 04-1] add cawikimedia to dblists (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158284 (owner: 10Dzahn) [22:32:40] (03CR) 10RobH: [C: 031] "seems pretty straightforward, since this isn't what really pushes them into service use." [puppet] - 10https://gerrit.wikimedia.org/r/158772 (owner: 10Aaron Schulz) [22:33:13] CUSTOM - mediawiki-installation DSH group on mw1033 is OK: OK [22:33:18] MaxSem: ^ :) thanks [22:33:35] :) [22:33:44] ^d: ping people.wikimedia.org [22:34:45] mutante: MaxSem how do you let it come up with 'custom' instead of being attached to a host? [22:34:48] that would be nice [22:34:55] since the betalabs stuff isn't really attached to labmon [22:34:59] YuviPanda: that is when i use the feature in the web ui [22:35:07] oh? [22:35:12] you can reconfigure things in the web ui? [22:35:27] not really reconfigure, i'll show you [22:35:35] go to https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon for example [22:35:45] check one of the boxes next to a service [22:35:51] click the drop-down menu [22:36:06] yah? [22:36:10] select "send custom notification.." 
[22:36:49] you _may_ need more icinga perms to send commands [22:37:07] you could be a login user but not have rights to execute things [22:37:16] but it's in puppet (tm) [22:37:24] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [22:37:42] mutante: hmm, I pressed a few options, etc, nothing seems to have happened [22:37:48] but if it's a one off thing and not permanent... [22:37:49] oh well [22:37:55] mutante: did puppet get to the part with my new check? [22:37:58] * YuviPanda has to go soonish [22:38:01] YuviPanda: you want the same thing to be able to ACK things, but we can do later [22:38:12] like..when you become ops we will for sure [22:38:29] heh [22:38:37] * YuviPanda plans on running away in the next few months [22:39:02] away from whom/what? [22:39:47] YuviPanda: not yet on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon [22:40:28] MaxSem: from becoming opsen, obviously [22:41:46] (03CR) 10BryanDavis: "Conceptual +1 because tabs suck. PSR-2 even agrees . On the other hand PSR-2 doesn't use 1TBS so I'm lo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158313 (owner: 10Dzahn) [22:42:49] (03CR) 10Yuvipanda: [C: 031] "TABS SUCK TIME FOR CHANGE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158313 (owner: 10Dzahn) [22:43:18] (03PS1) 10Reedy: Add ca.wikimedia.org to wikimedia-chapter apache site [apache-config] - 10https://gerrit.wikimedia.org/r/158808 [22:43:19] mutante: ^^ [22:43:40] (03CR) 10EBernhardson: "i also agree conceptually, but all php should follow https://www.mediawiki.org/wiki/Manual:Coding_conventions/PHP" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158313 (owner: 10Dzahn) [22:44:22] Reedy: wrong repo :p [22:44:28] it's part of mediawiki module now [22:44:37] and i'm not even sure about the deployment part [22:44:56] also, bug about jenkins checking them [22:45:10] Can the apache-config repo be made read-only or something? 
[22:45:28] greg-g: how long does it usually take beta labs to pick up config changes? Do I need to sync changes with beta labs explicitily? [22:45:55] (03CR) 10Dzahn: [C: 04-1] "the apache config got moved into the mediawiki module. we should probably close this repo" [apache-config] - 10https://gerrit.wikimedia.org/r/158808 (owner: 10Reedy) [22:47:08] (03CR) 10Dzahn: ".. or move back.. dunno, for some changes it's ambigious now which team deploys them" [apache-config] - 10https://gerrit.wikimedia.org/r/158808 (owner: 10Reedy) [22:48:57] let's take a server OUT of mediawiki-installation dsh [22:49:08] just to see the monitor working [22:50:38] mutante: I've to go now, i'll check on the check tomorrow [22:50:39] * YuviPanda waves [22:52:03] YuviPanda|zzzz: yep, cu! [22:52:07] <^d> mutante: Gah, missed scrollback. [22:52:20] ^d: not that much, just added the DNS so far [22:52:31] <^d> \o/ [22:58:25] (03CR) 10Dzahn: add cawikimedia to dblists (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158284 (owner: 10Dzahn) [22:58:47] Reedy: i think that is my vim adding an EOL and it might be right about that :) [22:59:54] same thing happens on many patches, basically our editors are fighting over the last line [23:02:35] (03CR) 10Dzahn: add cawikimedia to dblists (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158284 (owner: 10Dzahn) [23:03:37] (03CR) 10Dzahn: "is this going to be an echowiki? i have no idea" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158284 (owner: 10Dzahn) [23:05:52] (03CR) 10Ori.livneh: [C: 032] Assigned rbf100[12] to the redis role [puppet] - 10https://gerrit.wikimedia.org/r/158772 (owner: 10Aaron Schulz) [23:06:16] (03PS1) 10Dzahn: remove fenari from mw-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/158813 [23:07:27] (03CR) 10Chad: "If we shut this down before noc.wm.o is moved the /conf/ files will be stale. 
Per discussion with Mark this morning, we should serve it as" [puppet] - 10https://gerrit.wikimedia.org/r/158813 (owner: 10Dzahn) [23:08:40] chrismcmahon: Do you happen to know how long it usually takes for beta labs to pick up config changes? Do I need to sync changes with beta labs explicitily? [23:09:04] (03CR) 10Ori.livneh: "re: echo: sure, why not?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158284 (owner: 10Dzahn) [23:09:24] PROBLEM - puppet last run on rbf1002 is CRITICAL: CRITICAL: Puppet has 1 failures [23:09:40] (03CR) 10Dzahn: "ok, yes, but do you know any reason why it would still need to have a mediawiki installed on it? also, i just want to test the monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/158813 (owner: 10Dzahn) [23:09:47] <^d> kaldari: 5 minutes, roughly. [23:09:51] <^d> And no, you don't have to manually. [23:10:01] kaldari: it should take around 3 minutes [23:10:03] PROBLEM - puppet last run on rbf1001 is CRITICAL: CRITICAL: Puppet has 1 failures [23:10:47] kaldari: the update job croaks sometimes, but I'm not aware that it's toast just now [23:12:56] ^d, chrismcmahon: hmm, doesn't seem to have worked. I deployed https://gerrit.wikimedia.org/r/#/c/158804/1/wmf-config/InitialiseSettings-labs.php about an hour ago, but mw.config.get( 'wgMFEnableWikiGrok' ) still returns false. [23:13:35] kaldari: yurgh, OK. I'll poke Jenkins and ask bd808 if he knows anything [23:13:52] chrismcmahon: thanks! [23:14:14] kaldari, your change doesn't set $wgMFEnableWikiGrok = $wmgMFEnableWikiGrok [23:14:34] or, y'know, check the code change MaxSem :-) [23:14:42] WikiGrok? [23:14:43] MaxSem: isn't there some magic that handles that? [23:14:50] nah [23:20:40] chrismcmahon: nevermind. 
I didn't see the mapping since they were in mobile.php instead of CommonSettings.php, so I assumed there was some magic :P [23:21:03] on deployment-bastion -- echo 'var_dump( $wgMFEnableWikiGrok );' | mwscript eval.php --wiki=enwiki == bool(false) -- and the config is merged [23:21:15] kaldari: np, thanks MaxSem for the review [23:21:35] (03PS1) 10Ori.livneh: redis: make /srv/redis the default directory [puppet] - 10https://gerrit.wikimedia.org/r/158815 [23:22:03] kaldari: You should be able to poke around on deployment-bastion to check things like this. If you don't have shell access, I can add you to the project [23:24:31] bd808: I'm interested, what do you mean by "poke around on deployment-bastion" ? [23:24:50] (03PS1) 10Kaldari: Map config var for $wgMFEnableWikiGrok [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158816 [23:25:07] chrismcmahon: Just like debugging in prod you can see the files and run mwscript commands on deployment-bastion [23:25:18] (03CR) 10Kaldari: [C: 032] Map config var for $wgMFEnableWikiGrok [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158816 (owner: 10Kaldari) [23:25:22] (03Merged) 10jenkins-bot: Map config var for $wgMFEnableWikiGrok [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158816 (owner: 10Kaldari) [23:25:30] log files, live code checkout and interactive tools are all there [23:26:13] (03CR) 10Ori.livneh: [C: 032] "straightforward" [puppet] - 10https://gerrit.wikimedia.org/r/158815 (owner: 10Ori.livneh) [23:26:46] (03CR) 10MaxSem: "It needs mw so that it could publish its config at http://noc.wikimedia.org/conf/" [puppet] - 10https://gerrit.wikimedia.org/r/158813 (owner: 10Dzahn) [23:26:47] bd808: yeah, I only ever read the log files [23:26:52] See my mwscript eval.php command above as an example. 
And I checked the git log on /a/common to ensure that the change in question had been checked out [23:27:29] mwscript eval.php can make many issues more transparent [23:27:56] !log kaldari updated /a/common to {{Gerrit|Iec209bde0}}: Map config var for $wgMFEnableWikiGrok [23:27:57] (03PS1) 10Ori.livneh: Typo fix for Id1143c335 [puppet] - 10https://gerrit.wikimedia.org/r/158817 [23:28:02] Logged the message, Master [23:28:07] (03CR) 10Ori.livneh: [C: 032 V: 032] Typo fix for Id1143c335 [puppet] - 10https://gerrit.wikimedia.org/r/158817 (owner: 10Ori.livneh) [23:28:07] !log kaldari Synchronized wmf-config/InitialiseSettings-labs.php: enabling wikigrok on beta labs (en only) (duration: 00m 04s) [23:28:13] Logged the message, Master [23:28:30] !log kaldari Synchronized wmf-config/mobile-labs.php: enabling wikigrok on beta labs (en only) (duration: 00m 03s) [23:28:36] Logged the message, Master [23:30:09] RECOVERY - puppet last run on rbf1001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [23:31:29] RECOVERY - puppet last run on rbf1002 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [23:34:10] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Epic puppet fail [23:34:15] (03CR) 10Dzahn: [C: 031] "ok, yea, makes sense. let me just remove it for an hour since nobody is going to deploy now?" [puppet] - 10https://gerrit.wikimedia.org/r/158813 (owner: 10Dzahn) [23:36:09] PROBLEM - puppet last run on rbf1001 is CRITICAL: CRITICAL: Epic puppet fail [23:37:05] rbfs are me [23:37:13] (03CR) 10Dzahn: "works! Monitor for low disk space on /var for beta labs" [puppet] - 10https://gerrit.wikimedia.org/r/158796 (owner: 10Yuvipanda) [23:37:19] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [23:39:19] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[23:41:15] (03PS1) 10Nuria: [WIP] Adding caching headers for wikimetrics public directory [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/158819 [23:46:09] (03CR) 10MaxSem: [C: 031] "Yeah, go for it." [puppet] - 10https://gerrit.wikimedia.org/r/158813 (owner: 10Dzahn) [23:47:35] MaxSem: you know, meanwhile, found an actual alarm :) [23:47:44] so that tells us already, heh, mw1163 [23:47:49] as pointed out by bd808 [23:47:58] ololo [23:48:02] but it is in scheduled downtime! [23:48:05] :>) [23:48:40] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mw1163&service=mediawiki-installation+DSH+group [23:48:47] now gotta have a check for hosts supposed to be down not be present in pybal;) [23:49:17] hah, well, present is ok, but not enabled :) [23:49:31] mutante: Better thing to bookmark -- https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?nostatusheader&host=all&sorttype=2&sortoption=3&search_string=dsh [23:49:37] sorted by status [23:49:55] bd808: good point, :) [23:50:14] and now is when i still connect to fenari [23:50:24] Can we put that in the sidebar in wikitech or something? 
[23:50:48] we should just make it pester everyone in this channel [23:51:21] eqiad/apaches:{'host': 'mw1163.eqiad.wmnet', 'weight': 12, 'enabled': False } [23:51:29] gotta check for the True/False [23:52:21] MaxSem: and then check for existing RT ticket with the hostname in subject :) [23:52:25] that's what i do as a human [23:52:42] which gives us https://rt.wikimedia.org/Ticket/Display.html?id=8243 and the details:) [23:53:06] RAM broken [23:53:09] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [23:54:00] but it's all good, because it's disabled, notifications disabled as well by scheduling it, all as it should be [23:55:10] RECOVERY - puppet last run on rbf1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [23:55:54] * mutante just ACKs it and links the ticket, then you can tell on web ui [23:56:15] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1163 is CRITICAL: Host mw1163 is not in mediawiki-installation dsh group daniel_zahn RT #8243 - broken RAM - disabled in pybal [23:57:02] now, if you hover over the little comment icon, you can tell
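[editor's note] The "gotta check for the True/False" idea above could be sketched roughly as follows. This is only an illustration, not the check that was actually deployed: it assumes each pool entry is a Python-style dict literal, optionally prefixed with a "path:" label as in the line pasted at 23:51:21. The helper names parse_pool_line and disabled_hosts are hypothetical, and mw1164 in the sample is an invented host used for contrast.

```python
import ast


def parse_pool_line(line):
    """Return the dict described by one pool line, or None if there is none.

    Assumed format (taken from the pasted example, not from pybal docs):
    an optional "path:" prefix followed by a Python dict literal.
    """
    text = line.strip()
    # Skip blank lines and comment lines.
    if not text or text.startswith("#"):
        return None
    brace = text.find("{")
    if brace == -1:
        return None
    # literal_eval only evaluates literals, so it is safe on untrusted text.
    return ast.literal_eval(text[brace:])


def disabled_hosts(lines):
    """List hosts whose entry is present but has 'enabled' set to False."""
    hosts = []
    for line in lines:
        entry = parse_pool_line(line)
        if entry is not None and not entry.get("enabled", False):
            hosts.append(entry["host"])
    return hosts


sample = [
    # First entry is the line quoted in the log; the second is hypothetical.
    "eqiad/apaches:{'host': 'mw1163.eqiad.wmnet', 'weight': 12, 'enabled': False }",
    "eqiad/apaches:{'host': 'mw1164.eqiad.wmnet', 'weight': 12, 'enabled': True }",
]
print(disabled_hosts(sample))
```

A monitoring check built on this could compare the disabled list against hosts in scheduled downtime, which is the "present is ok, but not enabled" condition discussed at 23:49:17.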