[00:02:43] Are the mailman queue'd messages still waiting to go out? I haven't seen anything from the lists since it went down 5 hrs ago. (Sorry if this is an ongoing battle... ("I can't read your crazy moonlanguage" - The Tick)) [00:07:13] If any ops are still around: Could somebody please take a look at https://bugzilla.wikimedia.org/show_bug.cgi?id=67805#c22 for the memcached error and connection failures? [00:07:20] * andre__ tries IRC before hitting the mailing list [00:09:35] I suspect it just hasn't had nutcracker updated on it [00:09:41] tin [00:09:42] nobody 23483 0.0 0.0 18732 1704 ? Sl Apr25 8:59 /usr/local/bin/nutcracker -m 65536 -a 127.0.0.1 -c /usr/local/apache/common-local/wmf-config/twemproxy-eqiad.yaml [00:09:47] mw1017 [00:09:48] 112 14857 0.1 0.0 21652 2144 ? Ssl Jul07 13:11 /usr/sbin/nutcracker --mbuf-size=65536 --stats-port=22223 [00:10:04] * jgage takes a look [00:10:11] Yeah [00:10:15] tin still is on "twemproxy" [00:10:21] apaches are on "nutcracker" [00:10:39] ok, cool [00:10:47] jgage: i have a patch: https://gerrit.wikimedia.org/r/#/c/146288/ [00:10:54] thanks ori [00:11:00] At a quick guess, terbium is probably in a similar state [00:11:02] * Reedy checks [00:11:22] nope, just tin [00:11:49] (03CR) 10Gage: [C: 032] tin: include ::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/146288 (owner: 10Ori.livneh) [00:11:58] Reedy: https://gerrit.wikimedia.org/r/#/c/146305/ needs manual rebase [00:12:57] wtf gerrit [00:13:04] or git [00:13:15] jgage: would you like me to deploy it? [00:13:55] ori i just +2'd it but puppet doesn't see it, i'm unclear what state this patch is in [00:14:15] oh got it [00:15:06] running puppet agent on tin now [00:16:01] meh Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: File[/etc/cluster] is already declared in file /etc/puppet/modules/mediawiki/manifests/init.pp:13; cannot redeclare at /etc/puppet/modules/apachesync/manifests/init.pp:10 on node tin.eqiad.wmnet [00:16:27] i suppose that's why the toplevel class wasn't declared originally eh [00:17:31] i'll fix [00:17:36] thanks [00:18:10] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Complete puppet failure [00:19:15] (03PS1) 10Ori.livneh: dedupe File['/etc/cluster'] [operations/puppet] - 10https://gerrit.wikimedia.org/r/146358 [00:19:21] ^ jgage [00:19:29] ACKNOWLEDGEMENT - puppet last run on tin is CRITICAL: CRITICAL: Complete puppet failure Jeff Gage working on this in irc [00:19:44] ori, taking a look.. [00:20:28] (03CR) 10Gage: [C: 032] dedupe File['/etc/cluster'] [operations/puppet] - 10https://gerrit.wikimedia.org/r/146358 (owner: 10Ori.livneh) [00:25:38] waiting on puppet.. [00:28:01] Notice: /Stage[main]/Nutcracker/Service[nutcracker]/ensure: ensure changed 'stopped' to 'running' [00:28:22] yay [00:28:26] andre__: nutcracker is now updated on tin [00:28:33] lovely! thanks so much! [00:28:36] jgage, ^ [00:28:42] mmm [00:28:50] When does localisation update occur? [00:29:11] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [00:29:25] Reedy: 02:00 UTC ish [00:29:37] Reedy: I.e. 90 minutes' time. [00:29:39] Just in tim then [00:37:31] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Jul 15 00:37:27 UTC 2014 [00:47:26] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 12 hours old. 
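For context on the "Duplicate declaration: File[/etc/cluster]" failure above: Puppet refuses to compile a catalog when two classes each declare the same resource, which is what happened once tin started including ::mediawiki alongside the old apachesync class. A minimal sketch of the pattern and of the kind of dedupe the follow-up patch applies (illustrative class bodies only, not the actual manifests):

    # Before (any node that includes both classes fails to compile):
    #
    #   class mediawiki  { file { '/etc/cluster': content => "${::site}\n" } }
    #   class apachesync { file { '/etc/cluster': content => "${::site}\n" } }
    #
    # After: exactly one class owns the resource; the other just pulls it in.
    class mediawiki {
        file { '/etc/cluster':
            content => "${::site}\n",
        }
    }

    class apachesync {
        require ::mediawiki    # File['/etc/cluster'] is now declared in one place only
    }

Either class could own the file; the point is that only one of them does.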
[00:49:25] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 0 hours old. [01:08:32] (03PS1) 10BryanDavis: beta: Stop using obsolete 'maxclients' param for mediawiki::web [operations/puppet] - 10https://gerrit.wikimedia.org/r/146367 [01:14:52] jgage: ^^ Are the mailing lists meant to be back up? They appear to still be down. [01:16:15] Hmm. Gage may not be around. :-( [01:17:14] (the most recent message at wikitech-l, was at 12:41 pacific. Nothing since then, from any mailing list, of the many I'm subscribed to. http://lists.wikimedia.org/pipermail/wikitech-l/2014-July/077596.html [01:26:44] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [01:34:51] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [01:41:54] (03PS5) 10BryanDavis: Manage /usr/local/apache from ::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/144599 [01:44:36] (03CR) 10BryanDavis: "Cherry-picked and applied in beta." [operations/puppet] - 10https://gerrit.wikimedia.org/r/146367 (owner: 10BryanDavis) [01:54:40] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [01:55:40] PROBLEM - puppet last run on search1022 is CRITICAL: CRITICAL: Puppet has 1 failures [02:10:25] PROBLEM - Host db1035 is DOWN: PING CRITICAL - Packet loss = 100% [02:13:35] PROBLEM - MySQL Replication Heartbeat on db71 is CRITICAL: CRIT replication delay 308 seconds [02:14:35] RECOVERY - puppet last run on search1022 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [02:18:08] (03CR) 1020after4: [C: 031] "Cool." [operations/puppet] - 10https://gerrit.wikimedia.org/r/146367 (owner: 10BryanDavis) [02:27:07] !log powercycle db1035 unresponsive [02:27:15] Logged the message, Master [02:30:06] !log LocalisationUpdate completed (1.24wmf12) at 2014-07-15 02:29:02+00:00 [02:30:12] Logged the message, Master [02:30:28] RECOVERY - Host db1035 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [02:34:06] !log springle Synchronized wmf-config/db-eqiad.php: depool db1035, crashed (duration: 00m 13s) [02:34:11] Logged the message, Master [02:42:13] PROBLEM - MySQL Slave Running on db1035 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 14635079 for key old_id on query. 
Default [02:43:00] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 2073 seconds [02:48:40] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [02:50:30] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:20] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.017 second response time [02:56:10] PROBLEM - Host db1035 is DOWN: PING CRITICAL - Packet loss = 100% [03:01:07] !log LocalisationUpdate completed (1.24wmf13) at 2014-07-15 03:00:03+00:00 [03:01:12] Logged the message, Master [03:03:42] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [03:05:52] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [03:13:02] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 01:12:48 UTC [03:19:36] (03CR) 10BryanDavis: "It looks like one way we could use this in beta would be by creating files/apache/sites/beta/*.conf or maybe even mirroring the files/apac" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146082 (owner: 10Giuseppe Lavagetto) [03:33:10] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Tue Jul 15 03:33:03 UTC 2014 [03:34:45] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 15 03:33:38 UTC 2014 (duration 33m 37s) [03:34:51] Logged the message, Master [03:49:30] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [03:49:30] RECOVERY - Host db1035 is UP: PING WARNING - Packet loss = 86%, RTA = 0.52 ms [03:49:30] RECOVERY - MySQL Slave Running on db1035 is OK: OK replication [03:54:00] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [04:03:31] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 6904 seconds [04:03:31] PROBLEM - MySQL Slave Running on db1035 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 14635079 for key old_id on query. Default [04:34:06] PROBLEM - MySQL Slave Delay on db71 is CRITICAL: CRIT replication delay 5221 seconds [04:35:36] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db71 is CRITICAL: CRIT replication delay 4601 seconds Sean Pringle changed to replicate from db1019. recovering... [04:35:37] ACKNOWLEDGEMENT - MySQL Slave Delay on db71 is CRITICAL: CRIT replication delay 4686 seconds Sean Pringle changed to replicate from db1019. recovering... [04:40:00] I love that you actually acknowledge things in icinga, springle [04:40:04] wish more did [04:40:07] :) [04:41:07] RECOVERY - MySQL Slave Delay on db71 is OK: OK replication delay 0 seconds [04:41:37] RECOVERY - MySQL Replication Heartbeat on db71 is OK: OK replication delay -0 seconds [04:46:16] :) [04:47:17] after that it caught up fast anyway [04:47:35] sneaky server [04:48:43] !log db1035 crash cycle. down for memtest and stuff [04:48:49] Logged the message, Master [05:15:51] <_joe_> hey springle [05:15:56] <_joe_> good morning [05:27:43] hi _joe_ [05:28:03] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [05:47:19] (03CR) 10TTO: "If you're going to do this you may as well get rid of the extension from the cluster altogether." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146031 (owner: 10Legoktm) [05:47:40] (03CR) 10Legoktm: "Isn't it already gone?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146031 (owner: 10Legoktm) [05:49:03] IRC went down? [05:49:10] Er, irc.wikimedia.org [06:17:27] (03CR) 10TTO: "It still shows up on testwiki:Special:Version, so I suppose it isn't." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146031 (owner: 10Legoktm) [06:21:21] <_joe_> Bsadowski1: is irc still not working for you? [06:22:24] <_joe_> because the daemon is up and running on argon [06:23:05] <_joe_> and I can connect correctly [06:28:49] ACKNOWLEDGEMENT - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has 6 failures Giuseppe Lavagetto osmium is the hhvm test host, puppet is disabled AFAIK [06:29:26] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:16] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:25] (03PS1) 10Giuseppe Lavagetto: mediawiki: remove twemproxy::decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/146391 [06:36:56] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: remove twemproxy::decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/146391 (owner: 10Giuseppe Lavagetto) [06:42:54] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:43:24] RECOVERY - puppet last run on mw1151 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:43:41] <_joe_> :) [06:44:02] <_joe_> I like icinga free of unacknowledged alarms [06:46:24] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:46:24] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:48:14] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:54:47] <_joe_> !log killed jenkins stale process on gallium, stuck in a futex while shutting down [06:54:52] Logged the message, Master [07:22:07] <_joe_> !log stopping mailman on sodium for repairing [07:22:12] Logged the message, Master [07:27:51] <_joe_> !log restarted mailman on sodium [07:27:57] Logged the message, Master [07:28:51] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:37:43] _joe_: morning [07:37:51] <_joe_> eh [07:37:55] <_joe_> "good morning" [07:38:02] poke me when you have a moment [07:38:16] <_joe_> matanya: around half of next week? [07:38:17] <_joe_> :P [07:38:30] <_joe_> matanya: joking aside, mailman will take me some time [07:38:36] I know [07:38:59] does Halloween sound like a good timing ? [07:42:43] <_joe_> godog: ping [07:42:44] _joe_: ping detected, please leave a message! [08:31:54] anyone around who can look at the mailing lists? They appear to be swallowing posts at the moment [08:33:58] <_joe_> Jamesofur: it should be recovering [08:34:03] <_joe_> but we're checking [08:34:08] thank ye [08:34:38] I'm likely heading to bed but I'll let the community member who was reaching out (and had me forward an email that also got swallowed) so that he knows. 
I appreciate it [08:35:44] <_joe_> Jamesofur: sorry for the inconvenience :( [08:36:27] yeah, sucks :( they seem to be having more issues in the recent weeks to months I wonder if the boxes are having issues but I appreciate you looking into it [08:39:13] <_joe_> Jamesofur: this was all triggered by a crash of the mailman server the other day [08:39:21] <_joe_> we're still recovering from that [08:39:22] <_joe_> :( [08:40:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:40:07] ouch, was that when the root partition filled up a month ago or is this a new thing? [08:41:36] <_joe_> Jamesofur: a new thing, that caused quite a few problems [08:41:48] :-/ [08:42:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:42:40] ok, I'm off to bed, good look _joe_! [08:43:39] <_joe_> thanks [08:44:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:46:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:48:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:50:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:50:24] !log restart mailman on sodium after inodes freed [08:50:30] Logged the message, Master [08:52:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:54:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:56:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:56:21] <_joe_> this is bogus btw [08:57:45] RECOVERY - Puppet freshness on lvs4001 is OK: puppet ran at Tue Jul 15 08:57:35 UTC 2014 [08:59:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:57:35 UTC [09:09:16] <_joe_> !log restarting mailman on sodium, again, for testing [09:09:22] Logged the message, Master [09:17:58] RECOVERY - Puppet freshness on lvs4001 is OK: puppet ran at Tue Jul 15 09:17:55 UTC 2014 [09:42:33] <_joe_> oh ok thanks godog [09:42:36] <_joe_> :) [09:54:34] (03PS2) 10Giuseppe Lavagetto: monitoring-git: fix icinga message [operations/puppet] - 10https://gerrit.wikimedia.org/r/146078 [10:05:53] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667 [10:10:08] (03CR) 10Giuseppe Lavagetto: [C: 032] monitoring-git: fix icinga message [operations/puppet] - 10https://gerrit.wikimedia.org/r/146078 (owner: 10Giuseppe Lavagetto) [10:18:06] (03CR) 10Filippo Giunchedi: [C: 031] twemproxy: remove leftovers post-decom (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144757 (owner: 10Ori.livneh) [10:20:53] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [10:21:43] PROBLEM - Apache HTTP on mw1017 is CRITICAL: Connection refused [10:22:41] (03CR) 10Filippo Giunchedi: [C: 031] jobrunner: provision on mw1001; fix template [operations/puppet] - 10https://gerrit.wikimedia.org/r/146177 (owner: 10Ori.livneh) [10:22:49] (03PS1) 10Aude: Add wikidata wb_property_info table to dumps [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/146419 
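Given how often mailman on sodium has to be restarted by hand in this log, a hedged sketch of what keeping it supervised and alerting might look like in this puppet tree (the service name, qrunner count and thresholds are assumptions, not the real sodium manifest):

    service { 'mailman':
        ensure => running,
        enable => true,
    }

    # Alert if the qrunner daemons die again; the 8:10 range is a guess at the
    # expected number of mailman qrunner processes.
    nrpe::monitor_service { 'mailman_qrunners':
        description  => 'mailman qrunner processes',
        nrpe_command => '/usr/lib/nagios/plugins/check_procs -c 8:10 --ereg-argument-array=qrunner',
    }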
[10:23:48] <_joe_> mw1017 is me [10:24:01] (03PS1) 10Aude: Fix typos in wikidata table descriptions [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/146420 [10:24:51] (03CR) 10Filippo Giunchedi: [C: 031] "looks good to me!" [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [10:25:41] (03CR) 10Filippo Giunchedi: Packaging for debian using pkg-php-tools/dh_php5. (031 comment) [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [10:26:43] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.154 second response time [10:30:48] (03CR) 10QChris: "Needed change" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145980 (owner: 10QChris) [10:38:42] (03PS2) 10Filippo Giunchedi: beta: Stop using obsolete 'maxclients' param for mediawiki::web [operations/puppet] - 10https://gerrit.wikimedia.org/r/146367 (owner: 10BryanDavis) [10:38:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] beta: Stop using obsolete 'maxclients' param for mediawiki::web [operations/puppet] - 10https://gerrit.wikimedia.org/r/146367 (owner: 10BryanDavis) [10:46:50] ori _joe_ https://gerrit.wikimedia.org/r/#/c/146177/ looks good, ready to be merged? [10:49:28] <_joe_> godog: yes, this evening when both ori and aaron are around :) [10:49:46] yep [10:51:16] (03PS2) 10Aude: Add wikidata wb_property_info table to dumps [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/146419 (https://bugzilla.wikimedia.org/68024) [10:51:28] (03PS2) 10Aude: Fix typos in wikidata table descriptions [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/146420 [11:14:34] (03CR) 10QChris: [C: 031] kafka process monitoring: make it send pages [operations/puppet] - 10https://gerrit.wikimedia.org/r/145711 (owner: 10Dzahn) [11:47:26] Where does mw config file for betawiki(s) stays? I need to import some artciles to es betawiki and there seems config preventing it. [11:51:37] kart_: It's the same place as the production config [11:51:50] https://noc.wikimedia.org/conf/ / https://git.wikimedia.org/tree/operations%2Fmediawiki-config.git [12:53:05] (03PS2) 10Hoo man: add index.html pages for various directories on dataset hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/144640 (owner: 10ArielGlenn) [12:53:22] (03CR) 10Hoo man: "Fixed syntax error" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144640 (owner: 10ArielGlenn) [12:54:15] (03CR) 10jenkins-bot: [V: 04-1] add index.html pages for various directories on dataset hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/144640 (owner: 10ArielGlenn) [12:59:50] apergos: Around [12:59:51] ? [13:20:53] qchris yayyyyy [13:20:54] hello! [13:20:57] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.38 [13:21:16] ottomata: Heya [13:21:45] oops, this is not analytics chat [13:21:49] someone rearranged my chat tabs! [13:21:53] but hi over here anyway! [13:22:03] Evil tab mangling monsters! [13:22:19] (03PS1) 10Jgreen: correct parameters for spamassassin class in otrs role class [operations/puppet] - 10https://gerrit.wikimedia.org/r/146450 [13:26:03] hoo: don't worry about it, or, worry if you like but I'll be staring at that soon enough, before it gets merged [13:26:11] also, I am around (was getting lunch) [13:26:41] apergos: Ok... 
I would like to have that merged, but it's not a big blocked towards Wikidata json dumps [13:26:57] I've got a bash script hacked up to generate the dumps, but I doubt it's production read [13:26:58] y [13:27:03] needs polishing etc. [13:27:28] is it committed anywhere yet or you want to work o it some more first? [13:27:56] it's not committed anywhere, yet... but I would like to do that [13:28:09] can you give me pointers where it should sit? [13:28:50] do you have other scripts or maintenance stuff for wikidata anyplace? [13:29:14] we have cron jobs in the maintenance.pp, but despite of that no [13:29:27] hrm [13:30:43] I guess add it to the snapshot module, there's already a cro job for central auth dumps in there [13:31:02] Ok, will have a look at that [13:31:31] how should I do the removal of old dumps? Should/ can that be in the same bash script or ...? [13:31:34] under files, then in the manifests dir there's a class for it [13:32:03] oh, I would think so, just clean up after you complete a successful round [13:32:06] something like that [13:33:04] Ok, so add one, remove one... that makes sense [13:33:24] Guess we want to keep around a few, but that's not going to be an issue as they are small [13:33:36] sure [13:33:47] you could settle on an arbitrary number like 10 [13:33:54] (chosen radomly) [13:33:56] *randomly [13:34:06] Lydia_WMDE might have an opinion on htat [13:34:10] if you run it once a week that's a couple months worth [13:34:15] ok great [13:34:30] (03CR) 10Jgreen: [C: 032 V: 031] correct parameters for spamassassin class in otrs role class [operations/puppet] - 10https://gerrit.wikimedia.org/r/146450 (owner: 10Jgreen) [13:35:04] number of json dumps we keep? [13:35:32] not possible to keep them all? [13:35:47] i guess it is ok with 10 for now [13:35:52] Lydia_WMDE: mh... we want one dump pre week? [13:35:53] * per [13:36:02] no we do't keep dumps forever [13:36:10] ok [13:36:20] in theory the new one should have all the data of the old one anyways [13:36:24] k [13:36:43] in terms of how often: more often is better :D [13:36:54] apergos: Well, guess that could be interesting to see how Wikidata grows (as the jsondumps only have the current data) [13:37:03] ok well our regular dump are every 10 days or so, so once a week is more often than that [13:37:42] Every ten days is the least we should do, I would prefer weekly. Lydia_WMDE ? [13:37:58] (03PS2) 10Ottomata: kafka process monitoring: make it send pages [operations/puppet] - 10https://gerrit.wikimedia.org/r/145711 (owner: 10Dzahn) [13:38:02] ok then let's do once a week [13:38:06] :) [13:38:19] sweet! [13:38:21] (03CR) 10Ottomata: [C: 032 V: 032] "Process monitoring pages, yes! We are ready for that. Thanks Daniel!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145711 (owner: 10Dzahn) [13:38:32] !log elastic1017 had a load average of 60 - was thashing in io. bounced Elasticsearch. lets see if it recovers on its own [13:38:38] Logged the message, Master [13:38:47] (03CR) 10Hoo man: [C: 04-1] add index.html pages for various directories on dataset hosts (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144640 (owner: 10ArielGlenn) [13:39:29] <^d> manybubbles: 17, seriously? Hmm. [13:39:41] ^d: yeah - going to back out japanese, I think [13:39:54] <^d> Ouch, we are kind of red. 
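On the "kafka process monitoring: make it send pages" merge above: in this repository, whether a check pages is normally just a flag on the monitoring resource, so the change amounts to something like the following (the resource name, command and exact flag here are illustrative, not the actual kafka check):

    nrpe::monitor_service { 'kafka_broker_process':
        description  => 'Kafka Broker Server',
        nrpe_command => '/usr/lib/nagios/plugins/check_procs -c 1:1 --ereg-argument-array="java.+kafka.Kafka"',
        critical     => true,    # assumed flag: routes the alert to the paging contact group
    }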
[13:40:37] (03PS1) 10Manybubbles: Switch jawiki back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146455 [13:40:54] (03CR) 10Chad: [C: 031] Switch jawiki back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146455 (owner: 10Manybubbles) [13:40:58] (03CR) 10Manybubbles: [C: 032] Switch jawiki back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146455 (owner: 10Manybubbles) [13:41:03] yeah [13:41:05] (03Merged) 10jenkins-bot: Switch jawiki back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146455 (owner: 10Manybubbles) [13:41:47] more unstashed changes on tin.... [13:41:53] someone in themiddle of something? [13:42:55] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: jawiki back to lsearchd (duration: 00m 05s) [13:42:59] Logged the message, Master [13:43:08] lets see if that helps [13:44:48] load is pretty high because all the shards are franticly trying to resassign after I bounced elastic1017 [13:44:55] probably should have turned of jawiki first [13:45:01] and maybe I wouldn't have had to [13:52:49] !log after switching jawiki back to lsearchd by default load is mostly recovered. the cluster is still healing from bouncing elastic1017 and that'll take a while. the load will be a bit high during that but searches are coming back in a reasonably amount of time again [13:52:53] Logged the message, Master [14:00:08] (03PS3) 10Tim Landscheidt: Tools: Remove lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/124001 [14:00:45] (03CR) 10Tim Landscheidt: [C: 04-1] "Needs to be tested on Toolsbeta first." [operations/puppet] - 10https://gerrit.wikimedia.org/r/124001 (owner: 10Tim Landscheidt) [14:09:30] growl gitdeployyyyyy go! [14:09:31] go! [14:09:32] nope. [14:10:10] sad_trombone.wav [14:11:15] anybody know where salt stores its grain info? [14:13:57] hmm, no, this has to be a puppet problem [14:13:57] hmm [14:16:10] (03PS1) 10Ottomata: Troubleshoot why refinery role is not applied on analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146458 [14:16:19] (03PS2) 10Ottomata: Troubleshoot why refinery role is not applied on analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146458 [14:17:02] ori: this bug https://bugzilla.wikimedia.org/show_bug.cgi?id=63981 seems to affect all of beta labs. If it can't be fixed pretty soon, can whatever caused it be reverted? [14:17:17] OHHHH, DOH [14:18:37] (03PS3) 10Ottomata: Deploy analytics/refinery repository to analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146458 [14:18:45] (03PS4) 10Ottomata: Deploy analytics/refinery repository to analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146458 [14:19:11] (03CR) 10Ottomata: [C: 032 V: 032] Deploy analytics/refinery repository to analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146458 (owner: 10Ottomata) [14:24:08] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Complete puppet failure [14:24:58] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures [14:26:08] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [14:28:50] apergos: Still around? Can you create the target dir for the dumps? 
[14:28:55] (03PS1) 10Ottomata: Ensuring stats user and group exist for refinery deployments [operations/puppet] - 10https://gerrit.wikimedia.org/r/146459 [14:29:09] I can indeed [14:29:12] (03CR) 10Ottomata: [C: 032 V: 032] Ensuring stats user and group exist for refinery deployments [operations/puppet] - 10https://gerrit.wikimedia.org/r/146459 (owner: 10Ottomata) [14:29:36] apergos: Ok... I would suggest to go for xmldatadumps/public/other/wikidataJson [14:29:40] (03PS1) 10Dzahn: wikitech - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146461 [14:29:41] I guess that name is ok [14:30:00] if you have a better idea, go ahead :P [14:30:34] (03CR) 10jenkins-bot: [V: 04-1] wikitech - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146461 (owner: 10Dzahn) [14:30:36] not just wikidata? [14:30:49] (03PS1) 10Ottomata: Fix hasty syntax error [operations/puppet] - 10https://gerrit.wikimedia.org/r/146462 [14:31:04] hoo: [14:31:06] (03CR) 10Ottomata: [C: 032 V: 032] Fix hasty syntax error [operations/puppet] - 10https://gerrit.wikimedia.org/r/146462 (owner: 10Ottomata) [14:31:12] guess we can also do that... just thought it might be confusing as we also have wikidata xml dumps [14:31:15] but I guess that's ok [14:31:30] the xml dumps all live in another area [14:31:33] I think it'll be ok [14:32:04] dir is there now [14:32:12] indeed :) [14:32:24] (03PS1) 10Dzahn: gerrit - disabled DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146464 [14:32:59] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [14:34:15] (03CR) 10Matanya: [C: 031] gerrit - disabled DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146464 (owner: 10Dzahn) [14:40:12] (03PS1) 10Dzahn: dynamicproxy - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146466 [14:43:33] (03PS2) 10Dzahn: bugzilla - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146461 [14:43:48] (03PS2) 10Dzahn: gerrit - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146464 [14:44:29] (03CR) 10jenkins-bot: [V: 04-1] bugzilla - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146461 (owner: 10Dzahn) [14:55:45] chrismcmahonbrb: i'll fix it [14:57:18] (03PS25) 10Andrew Bogott: etherpad: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [14:57:18] aww manybubbles [14:57:30] Nemo_bis: exciting morning [14:57:44] thanks ori [14:57:46] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 12:56:48 UTC [15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140715T1500) [15:00:25] * anomie observes no patches for SWAT [15:01:37] (03PS1) 10Hoo man: Introduce snapshot::wikidatajsondump [operations/puppet] - 10https://gerrit.wikimedia.org/r/146470 [15:01:51] apergos: https://gerrit.wikimedia.org/r/146470 [15:02:01] Don't be to hard on it... I've kept it simple [15:04:17] anomie: if you want, some trivial maintenance https://gerrit.wikimedia.org/r/#/c/145861/ [15:05:47] Nemo_bis: Sure. Put it on the Deployments page, please [15:06:04] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [15:08:09] no worries, thanks! 
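A rough sketch of the shape a snapshot::wikidatajsondump class can take, per the discussion above: a dump script shipped from the module plus a weekly cron, with the script itself pruning all but the last ~10 runs (the script name, schedule, user and paths are assumptions, not hoo's actual patch):

    class snapshot::wikidatajsondump {
        file { '/usr/local/bin/dumpwikidatajson.sh':
            ensure => present,
            mode   => '0555',
            source => 'puppet:///modules/snapshot/dumpwikidatajson.sh',
        }

        # Weekly run; output goes under .../xmldatadumps/public/other/wikidata,
        # and the script is expected to delete everything but the newest ~10 dumps.
        cron { 'wikidatajson-dump':
            ensure  => present,
            command => '/usr/local/bin/dumpwikidatajson.sh',
            user    => 'datasets',
            weekday => 1,
            hour    => 3,
            minute  => 15,
            require => File['/usr/local/bin/dumpwikidatajson.sh'],
        }
    }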
[15:08:47] anomie: done [15:09:13] Nemo_bis: You forgot to put yourself as the requesting developer, as indicated [15:09:27] (03CR) 10Anomie: [C: 032] "SWAT deploy" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145861 (owner: 10Nemo bis) [15:09:36] (03Merged) 10jenkins-bot: Remove dead ULS variable after I49e812eae32266f165591c75fd67b86ca06b13f0 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145861 (owner: 10Nemo bis) [15:10:24] Who left uncommitted changes in /a/common on tin? [15:12:04] springle: Are these changes in /a/common on tin yours? [15:12:52] anomie: he's asleep [15:12:59] anomie: I believe they are his, yeah [15:13:03] (g'morning, btw) [15:13:10] greg-g: So who can fix tin so I can SWAT? [15:13:15] I had to jam something out that super fast this morning to I stashed them, rebased, and then unstashed them [15:13:20] it was horrible and I'm ashamed [15:13:38] what are the changes? [15:13:43] db config stuff [15:13:47] :/ [15:13:48] greg-g: Appears to be disabling db1035 [15:14:04] anomie: commit it [15:14:11] 04:48 springle: db1035 crash cycle. down for memtest and stuff [15:14:14] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 13:13:25 UTC [15:14:38] brb [15:16:44] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Jul 15 15:16:37 UTC 2014 [15:16:47] !log anomie updated /a/common to {{Gerrit|I7ca6a16d5}}: Switch jawiki back to lsearchd [15:16:52] Logged the message, Master [15:17:12] That's wrong, logmsgbot [15:18:33] !log anomie actually committed a live hack someone left on tin (removing db1035) [15:18:38] Logged the message, Master [15:19:22] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /a/common/). [15:19:26] !log anomie Synchronized wmf-config/: SWAT: Remove dead ULS variable [[gerrit:145861]] (duration: 00m 10s) [15:19:31] Logged the message, Master [15:19:57] Nemo_bis: ^ Test please [15:20:47] (03CR) 10Andrew Bogott: "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Puppet::Parser::AST::Resource failed with error ArgumentError: " [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [15:21:31] (03PS3) 10Andrew Bogott: Updating debian package files [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/68935 (owner: 10AzaToth) [15:21:33] (03PS1) 10Ottomata: Add Camus cron job on analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146474 [15:22:48] (03CR) 10Ottomata: [C: 032 V: 032] Add Camus cron job on analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146474 (owner: 10Ottomata) [15:24:47] (03PS4) 10Andrew Bogott: Updating debian package files [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/68935 (owner: 10AzaToth) [15:26:19] Nemo_bis: Did you test that ULS still works after your change? 
[15:26:21] (03PS1) 10Manybubbles: Raise Elasticsearch filter cache size in prod [operations/puppet] - 10https://gerrit.wikimedia.org/r/146475 [15:26:57] (03PS2) 10Manybubbles: Raise Elasticsearch filter cache size in prod [operations/puppet] - 10https://gerrit.wikimedia.org/r/146475 [15:27:37] (03CR) 10Andrew Bogott: [C: 032] Updating debian package files [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/68935 (owner: 10AzaToth) [15:30:10] (03PS3) 10Manybubbles: Raise Elasticsearch filter cache size in prod [operations/puppet] - 10https://gerrit.wikimedia.org/r/146475 [15:31:26] (03CR) 10Andrew Bogott: [C: 032] labs_vagrant: Install to /srv/vagrant [operations/puppet] - 10https://gerrit.wikimedia.org/r/145974 (owner: 10BryanDavis) [15:32:25] !log setting filter cache size to 20% on elastic1001 to see if it takes/helps us [15:32:30] Logged the message, Master [15:34:57] anomie: just did, it does [15:35:45] Nemo_bis: Thanks [15:36:19] andrewbogott: lots of activity there for a rebase [15:36:49] AzaToth: Yeah, I updated the license files and merged. [15:37:00] (I checked in with Domas and Ryan about licensing) [15:37:29] ah [15:37:45] that little evil detail [15:37:58] (03CR) 10Manybubbles: "Dynamically applied to production - we'll see if it helps us any." [operations/puppet] - 10https://gerrit.wikimedia.org/r/146475 (owner: 10Manybubbles) [15:41:18] <_joe_> manybubbles: can I go on with an update of the appservers, or will this interfere with your work? [15:41:29] _joe_: me? not a bit [15:41:44] <_joe_> manybubbles: ok thanks [15:42:09] !log setting the filter cache on one node in the cluster set it on all. yay, I guess. Anyway, I'm going to let it soak for a while. [15:42:14] Logged the message, Master [15:47:21] <_joe_> !log starting rolling update of all appservers to apache2 2.2.22-1ubuntu1.6, half of them are on 2.2.22-1ubuntu1.5 now [15:47:28] Logged the message, Master [15:51:29] !log elasticsearch1017 is freaking out again - maybe there is something wrong with it. odds aren't good it picked up the same shard again after restart and that shard is somehow poison just for it and not the other two nodes with the same shard.... [15:51:35] Logged the message, Master [15:52:19] (03PS3) 10Andrew Bogott: labs_vagrant: cleanup sudoers config [operations/puppet] - 10https://gerrit.wikimedia.org/r/145975 (owner: 10BryanDavis) [15:52:39] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Tue Jul 15 15:52:37 UTC 2014 [15:53:40] <_joe_> !log mw101[0-9] updated [15:53:46] Logged the message, Master [15:56:58] <_joe_> !log mw1020-mw1059 updated [15:57:04] Logged the message, Master [15:57:59] !log restarting Elasticsearch on elastic1017 - its thrashing the disk again. I'm still not 100% sure why [15:58:04] Logged the message, Master [16:00:00] would anyone mind brainstorming with me about elastic1017? Its being sad. [16:00:46] <_joe_> !log mw1060-mw1099 updated [16:00:50] Logged the message, Master [16:04:55] such an unlucky machine [16:06:11] <_joe_> manybubbles: if ori does not show up soon, I may be available to help shortly [16:10:13] <_joe_> !log mw1100 and onwards updated [16:10:18] Logged the message, Master [16:19:50] _joe_: i'm here [16:22:50] _joe_: should we maybe split up the apache module patch into two parts [16:22:58] the first removing the cruft, the other adding the include ::apache? 
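For the filter-cache change (146475) being tried out here, and the recovery throttling that follows later in the log: both end up as elasticsearch.yml settings rendered by the elasticsearch puppet module, roughly like this (the class parameter names are hypothetical stand-ins; the commented keys are the underlying Elasticsearch settings):

    # Hypothetical parameter names -- the real module may spell these differently.
    class { '::elasticsearch':
        # -> indices.cache.filter.size (default 10%)
        filter_cache_size           => '20%',
        # -> indices.recovery.concurrent_streams
        recovery_concurrent_streams => 2,
        # -> indices.recovery.max_bytes_per_sec, to stop recovery thrashing the disks
        recovery_max_bytes_per_sec  => '20mb',
    }

Pushing the same values through the cluster settings API first, as done here, lets them take effect without a rolling restart; the puppet change just makes them survive one.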
[16:24:12] <_joe_> ori: I think it's fairly straightforward at this point [16:24:16] <_joe_> I mean, we "tested" that [16:24:29] ok, probably true [16:24:42] <_joe_> but that would probably be more correct commit-wise [16:24:53] i'll split it up [16:25:03] <_joe_> ok :) [16:25:27] <_joe_> !log all mw servers updated [16:25:32] Logged the message, Master [16:29:07] (03PS3) 10Aaron Schulz: jobrunner: provision on mw1001; fix template [operations/puppet] - 10https://gerrit.wikimedia.org/r/146177 (owner: 10Ori.livneh) [16:31:04] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.01 [16:31:17] (03PS3) 10Ori.livneh: mediawiki: remove module-local apache service [operations/puppet] - 10https://gerrit.wikimedia.org/r/146162 (owner: 10Giuseppe Lavagetto) [16:33:37] (03CR) 10Ori.livneh: [C: 031] mediawiki: remove module-local apache service [operations/puppet] - 10https://gerrit.wikimedia.org/r/146162 (owner: 10Giuseppe Lavagetto) [16:33:53] (03PS1) 10Ori.livneh: mediawiki: use canonical apache module [operations/puppet] - 10https://gerrit.wikimedia.org/r/146485 [16:36:46] (03PS3) 10Ori.livneh: twemproxy: remove leftovers post-decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/144757 [16:38:53] (03CR) 10Giuseppe Lavagetto: [C: 032] twemproxy: remove leftovers post-decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/144757 (owner: 10Ori.livneh) [16:39:24] (03PS3) 10Ori.livneh: add 'puppet-run' bash alias to my .bash_profile [operations/puppet] - 10https://gerrit.wikimedia.org/r/146132 [16:39:29] <_joe_> ori: http://puppet-compiler.wmflabs.org/145/change/146162/html/mw1212.eqiad.wmnet.html [16:39:36] <_joe_> doesn't seem right [16:40:00] (03CR) 10Ori.livneh: [C: 032 V: 032] "(trivial, zero impact on prod)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146132 (owner: 10Ori.livneh) [16:40:25] _joe_: it is right; we're not introducing the apache module yet in that patch [16:40:38] so we can't reference the apache service [16:40:49] but it's ok, but it's only until we deploy the follow-up patch, and those files already exist on all target nodes [16:40:53] <_joe_> ok yes [16:41:07] <_joe_> and apache is running [16:41:11] nod [16:41:23] <_joe_> which was my main fear [16:42:00] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: remove module-local apache service [operations/puppet] - 10https://gerrit.wikimedia.org/r/146162 (owner: 10Giuseppe Lavagetto) [16:42:18] (03PS2) 10Giuseppe Lavagetto: mediawiki: use canonical apache module [operations/puppet] - 10https://gerrit.wikimedia.org/r/146485 (owner: 10Ori.livneh) [16:44:44] (03PS3) 10Andrew Bogott: wikitech-make ServerAlias configurable as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/145610 (owner: 10Dzahn) [16:45:16] there's also a design flaw with trebuchet that is going to complicate the jobrunner deployment somewhat [16:45:39] adding deployment::target { 'jobrunner': } doesn't actually deploy to the node during the puppet run [16:45:41] (03CR) 10Andrew Bogott: [C: 032] wikitech-make ServerAlias configurable as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/145610 (owner: 10Dzahn) [16:45:50] it only marks it as a deployment target for future deployments, which are still presumed to be manually initiated [16:46:26] so i should probably commit a separate patch to add the deployment::target to the relevant nodes [16:46:38] do a jobrunner deploy, and only then push the followup patch [16:46:43] <_joe_> ori: make it tagged [16:47:01] <_joe_> so that 
we can safely run once with salt as a tagged run [16:47:15] i don't follow [16:47:45] <_joe_> if you tag the class that includes the new deployment::target in puppet [16:48:14] <_joe_> I can then run puppet in parallel on all appservers if needed [16:48:19] oh, i see what you're saying [16:48:24] it's not all appservers tho [16:48:25] just the jobrunners [16:48:35] <_joe_> yes so who cares :) [16:48:38] hiii all! my tech talk on analytics infrastructures stuff starts in 10 mins here [16:48:38] https://plus.google.com/u/0/events/c53ho5esd0luccd09a1c30rlrmg?cfem=1 [16:48:42] come! :) [16:48:45] <_joe_> still it should be a best practice [16:48:57] <_joe_> ottomata: sorry I won't be there :( [16:49:03] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Complete puppet failure [16:49:11] s'ok, just a reminder to whomever can and wants to make it [16:49:34] ottomata regarding the task for data, i need some help [16:49:39] should i proceed on that? [16:49:59] dogeydogey: i'm prepping for a talk, will discuss after [16:50:02] join the talk if you like! [16:50:21] k [16:50:44] <_joe_> ori: merging the other patch btw [16:50:51] coooool [16:51:36] _joe_: https://gerrit.wikimedia.org/r/#/c/146142/ is also related (and trivial) [16:51:41] (03PS3) 10Giuseppe Lavagetto: mediawiki: use canonical apache module [operations/puppet] - 10https://gerrit.wikimedia.org/r/146485 (owner: 10Ori.livneh) [16:52:37] <_joe_> ori: fair enough, but focus on the jobrunners please :) [16:53:15] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: use canonical apache module [operations/puppet] - 10https://gerrit.wikimedia.org/r/146485 (owner: 10Ori.livneh) [16:54:17] <_joe_> here we go. [16:54:28] weeeeee [16:54:43] (03PS2) 10Andrew Bogott: Tools: Remove redundant package requirement [operations/puppet] - 10https://gerrit.wikimedia.org/r/129239 (owner: 10Tim Landscheidt) [16:55:05] <_joe_> ori: do you have a set of urls I should test on a test host after the puppet change? [16:55:14] <_joe_> or should I use my usual one? [16:55:34] usual one should be fine [16:55:49] (03PS1) 10Ori.livneh: add jobrunner deployment target to job runners [operations/puppet] - 10https://gerrit.wikimedia.org/r/146492 [16:56:03] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [16:56:10] <_joe_> seems fine [16:56:19] \o/ [16:56:32] (03CR) 10Andrew Bogott: [C: 032] Tools: Remove redundant package requirement [operations/puppet] - 10https://gerrit.wikimedia.org/r/129239 (owner: 10Tim Landscheidt) [16:57:03] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:57:03] <_joe_> ori: next step is https://gerrit.wikimedia.org/r/#/c/146082/ ;) [16:57:44] (03PS2) 10Ori.livneh: mediawiki: move File['/usr/local/apache'] from web.pp -> sync.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/146142 [16:57:48] yes, that is friggin awesome btw [16:58:12] <_joe_> back in 5 [17:00:54] (03PS1) 10Ori.livneh: Fold mediawiki::web::config back into mediawiki::web [operations/puppet] - 10https://gerrit.wikimedia.org/r/146495 [17:17:54] !log lowered Elasticsearch concurrent recovery streams to 2 (from 3) and total write rate across those streams to 20MB/sec (from 4MB/sec). This should prevent io thrash on recovery which looked to cause echo distruptions in service while recovering from some other disruption. [17:18:00] Logged the message, Master [17:18:13] lowered from 4 to 20 ? 
:) [17:19:35] (03PS1) 10Ori.livneh: apache module: test config before attempting restart [operations/puppet] - 10https://gerrit.wikimedia.org/r/146497 [17:20:33] (03PS4) 10Manybubbles: Make permanent some Elasticsearch config [operations/puppet] - 10https://gerrit.wikimedia.org/r/146475 [17:21:37] _joe_: so i think https://gerrit.wikimedia.org/r/#/c/146492/ should be next [17:21:59] <_joe_> ori: ok sorry, juggling between things as usual [17:22:06] np at all, you're great [17:25:18] (03PS2) 10Giuseppe Lavagetto: add jobrunner deployment target to job runners [operations/puppet] - 10https://gerrit.wikimedia.org/r/146492 (owner: 10Ori.livneh) [17:25:28] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] add jobrunner deployment target to job runners [operations/puppet] - 10https://gerrit.wikimedia.org/r/146492 (owner: 10Ori.livneh) [17:25:55] ok, so now i'll force a puppet run on the job runners, and then do a trebuchet deploy [17:28:53] (03PS4) 10Ori.livneh: jobrunner: provision on mw1001; fix template [operations/puppet] - 10https://gerrit.wikimedia.org/r/146177 [17:29:18] !log elastic1017 went nuts again. just shutting elasticsearch off on it for now [17:29:24] Logged the message, Master [17:29:46] (03PS1) 10Jkrauska: Lowers TTL on corp ns1 record to faciliate changeover [operations/dns] - 10https://gerrit.wikimedia.org/r/146498 [17:30:16] cmjohnson1: can you do another dns deploy for me? [17:30:30] sure [17:31:08] (03CR) 10Cmjohnson: [C: 032] Lowers TTL on corp ns1 record to faciliate changeover [operations/dns] - 10https://gerrit.wikimedia.org/r/146498 (owner: 10Jkrauska) [17:31:10] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.39 [17:34:17] (03CR) 10Cmjohnson: [C: 04-1] "These are still in use on the 12th floor in Tampa. This can merge once we complete are out of Tampa." [operations/dns] - 10https://gerrit.wikimedia.org/r/143202 (owner: 10Dzahn) [17:34:27] (03CR) 10Cmjohnson: "These are still in use on the 12th floor in Tampa. This can merge once we complete are out of Tampa." [operations/dns] - 10https://gerrit.wikimedia.org/r/143202 (owner: 10Dzahn) [17:34:49] cmjohnson1: thanks - see you again in an hour when that old TTL expires? :) [17:35:07] heh [17:37:47] !log updated jobrunner to bef32b9120 [17:37:52] Logged the message, Master [17:38:55] <_joe_> ori: 146177 ? Are we GTG? [17:39:02] yep [17:40:03] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: provision on mw1001; fix template [operations/puppet] - 10https://gerrit.wikimedia.org/r/146177 (owner: 10Ori.livneh) [17:40:23] !log my last attempt to lower the concurrent traffic for recovery was a failure - tried again and succeeded. that seems to have fixed the echo service disruption from taking elastic1017 out of service [17:40:24] <_joe_> ori: merged [17:40:29] Logged the message, Master [17:40:44] running puppet on mw1001 [17:41:00] (03CR) 10Cmjohnson: [C: 031] Repurposing osm-db100{1,2} as labsdb100{6,7} [operations/puppet] - 10https://gerrit.wikimedia.org/r/137691 (owner: 10Alexandros Kosiaris) [17:41:02] (03PS1) 10Ori.livneh: beta: connect to memcached on port 11212 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146501 [17:41:10] <_joe_> ok, what should we look for after that starts? 
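Tying the jobrunner steps together -- the deployment::target marker added earlier, the /etc/jobrunner/ config directory _joe_ suggests a little further down, and the upstart-managed service checked on mw1001 -- a runner host ends up with roughly this (a sketch; only deployment::target and the paths come from the log, the rest of the wiring is assumed):

    class mediawiki::jobrunner {    # assumed class name
        # Mark the host as a trebuchet target; the code itself is still pushed
        # manually with git-deploy from tin.
        deployment::target { 'jobrunner': }

        file { '/etc/jobrunner':
            ensure => directory,
        }
        file { '/etc/jobrunner/jobrunner.ini':
            ensure  => present,
            content => template('mediawiki/jobrunner/jobrunner.ini.erb'),    # assumed template path
        }

        # Upstart job; the "No jobs available..." lines in
        # /var/log/upstart/jobrunner.log are the healthy idle state.
        service { 'jobrunner':
            ensure   => running,
            provider => 'upstart',
            require  => File['/etc/jobrunner/jobrunner.ini'],
        }
    }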
[17:41:13] (03PS2) 10Ori.livneh: beta: connect to memcached on port 11212 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146501 [17:41:32] the service should be running and /var/log/upstart/jobrunner.log should look sane [17:41:46] puppet applied correctly [17:41:55] <_joe_> 2014-07-15T17:41:45+0000: No jobs available... [17:41:59] (03CR) 10Ori.livneh: [C: 032] beta: connect to memcached on port 11212 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146501 (owner: 10Ori.livneh) [17:42:01] <_joe_> seems 'correct' [17:42:06] yeah, that's fine [17:42:09] <_joe_> :) [17:42:19] yay [17:42:23] should we apply it on the rest? [17:42:27] i think it's safe [17:42:34] AaronSchulz: ^ [17:43:05] <_joe_> so, the plan would be: you move gradually jobs from the old job queue to a new one? [17:43:18] <_joe_> did I get this right? [17:43:47] yep [17:45:45] (03PS1) 10Giuseppe Lavagetto: jobrunner: install on all jobrunners [operations/puppet] - 10https://gerrit.wikimedia.org/r/146502 [17:47:04] <_joe_> AaronSchulz: btw, took a look at jobrunner, and it looks great :) [17:48:02] <_joe_> I think it's a lot better than job-loop as-is, it has the added benefit that it's not a nightmare to modify [17:48:43] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: install on all jobrunners [operations/puppet] - 10https://gerrit.wikimedia.org/r/146502 (owner: 10Giuseppe Lavagetto) [17:49:32] hehe [17:49:49] <_joe_> well, great >> a lot better than job-loop :P [17:50:43] <_joe_> running puppet an all jobrunners [17:51:04] <_joe_> when it's done, we've finished deploying the jobrunner service in prod [17:51:27] <_joe_> then just tell me when you're switching some jobs over, I'll keep an eye on it [17:55:31] matanya: you there? [17:56:01] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.61 [17:56:21] <_joe_> btw, can we change /etc/jobrunner.ini in /etc/jobrunner/jobrunner.ini before paravoid gets to see that? [17:56:25] <_joe_> we should have time [17:56:51] <_joe_> :) [17:57:45] heheh [17:57:48] yes let's [17:57:51] i'll submit a patch [17:57:54] <_joe_> ori, AaronSchulz jobrunner is running on all jobrunners AFAICS [17:58:04] <_joe_> so help yourselves [17:58:09] <_joe_> ori: I can do that btw [17:58:15] !log _joe_ deployed jobrunner to all job runners [17:58:20] Logged the message, Master [17:58:22] _joe_: you have other changes you could review ;) [17:58:34] <_joe_> ori: I have dinner I could review [17:58:38] that's also true [17:58:42] have dinner [17:58:43] <_joe_> given it's 8 PM [17:58:45] <_joe_> :) [17:58:46] things look good, yeah [17:58:49] <_joe_> see you later [17:58:51] thanks a lot! [17:59:08] really appreciate it as ever [17:59:15] <_joe_> me too [17:59:40] <_joe_> btw, what about releasing hhvm to one jobrunner this week? [17:59:55] <_joe_> I just need to rebuild the package basically [18:00:04] Reedy, greg-g: The time is nigh to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140715T1800) [18:00:05] <_joe_> if the jobrunner works well, why not? [18:00:42] sure [18:00:44] it'd be awesome [18:00:55] we'd need to reprovision an instance with trusty [18:01:51] <_joe_> ori: that's true, it will take some time [18:02:01] <_joe_> not sure if I can make it this week [18:02:32] could you delegate that part to someone else? (decom a jobrunner, do a fresh trusty install)? [18:03:00] <_joe_> I'll se what I can do in that respect [18:04:13] cool [18:04:19] dinnerrrrr! 
:P [18:04:22] bbiab myself [18:04:48] re channel subject Lists: backlog cleared -- does this imply there was something broken with email list delivery this AM? [18:06:25] dogeydogey: yt? [18:07:25] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:08:15] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.018 second response time [18:08:21] yt? [18:08:24] ottomata what is that? [18:08:42] you there [18:08:46] "you there" :) [18:08:51] oh yeah [18:08:52] so, 1. you wantd to talk about something? [18:08:53] and [18:08:54] 2.! [18:09:01] mary wants to volunteer [18:09:02] the email about datasets [18:09:09] who is mary? [18:09:11] ottomata: now i am [18:09:11] and I'm looking for you and matanya do help guide? [18:09:21] perfect! mary, say hi to dogeydogey and matanya [18:09:25] they are ops volunteers [18:09:48] who can probably answer your volunteer questions better than I [18:09:54] I can possibly help you find tasks to work on [18:10:01] but they know what needs to be done to get you in :) [18:10:06] hola dogeydogey matanya [18:10:14] hi mary! [18:10:17] oh yeah, sure, I can help, hi mary, i'm pretty new too but i'd be glad to help [18:10:25] welcome :) [18:10:40] (03PS6) 10BryanDavis: Manage /usr/local/apache from ::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/144599 [18:10:41] <_joe_> I think matanya has a beautiful list of chores (welcome, btw) [18:10:55] mary dogeydogey this : https://etherpad.wikimedia.org/p/Puppet3 is the best newbie task i can offer [18:11:01] greg-g: looks like you maintain https://wikitech.wikimedia.org/wiki/Incident_documentation [18:11:13] _joe_: i apologize for lists thingy [18:11:41] but new still like https://wikitech.wikimedia.org/wiki/Incident_documentation/2014-07-14-Lists doesn't show up on the main page [18:11:44] new stuff [18:11:57] cajoel: just missing a link [18:11:59] <_joe_> cajoel: oh I forgot to link it [18:12:01] <_joe_> sigh [18:12:01] you can add it [18:12:23] it is a wiki after all ... :D [18:12:32] (03CR) 10BryanDavis: "Yet another manual rebase needed because of ongoing work in ::mediawiki by Ori and/or Giuseppe. It would be swell to see this merged so th" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144599 (owner: 10BryanDavis) [18:12:50] mary: if you need/want any guidance, i would love to help, just let me know [18:13:05] not just a new link [18:13:24] couldn't that be done with a subpage index? [18:13:36] if the pages are all named similarly? [18:13:40] it's there [18:13:44] and some background of your knowledge would help as well, to better route you. in PM if you prefer [18:13:45] just named differently [18:13:49] wondering if there's a way to avoid the manual step [18:13:49] https://wikitech.wikimedia.org/wiki/Incident_documentation/2014-07-14-Lists [18:13:52] notice the dashes [18:13:56] * greg-g moves [18:13:57] matanya: looking around for a 'getting started' doc (and links to repos etc ) [18:14:28] got it [18:14:29] cajoel: matanya it uses SMW searches [18:14:46] mary: Sadly, that doesn't really exists. we have https://wikitech.wikimedia.org/wiki/Get_involved [18:14:53] mary lots of good info here: https://wikitech.wikimedia.org/wiki/Main_Page [18:15:08] mary can find repos here: http://git.wikimedia.org/project/operations [18:15:16] greg-g: yeah, well. still a wiki [18:15:17] so, mary, were you specifically interested in analytics related ops tasks [18:15:19] or ops stuff in general? 
[18:15:41] matanya: sure, just saying it's not "just add a link" ;) [18:15:56] :) [18:15:58] ottomata: things around hadoop (&| hive, ....) mostly [18:16:22] mary: fixing cdh4 repo sounds fun ? [18:16:27] cdh4 is no more!!!! :) [18:16:29] cdh [18:16:29] . [18:16:30] :) [18:16:32] :) [18:16:34] let's check out todos there! [18:16:56] matanya, ottomata : sure :) [18:17:09] ottomata: the cdh4 module is still there, and still not puppet3 compatible [18:17:46] cdh4 has been removed as a submoulde from operations/puppet [18:17:51] there is now a cdh module [18:18:00] mary: just wondering, how did you get to find ops ? [18:18:02] if there are parts of that are not puppet3 compatible, then we shoudl fix those for sure! [18:18:21] ottomata: so please fix the list on https://etherpad.wikimedia.org/p/Puppet3 [18:18:34] effing smw [18:18:45] mary: http://git.wikimedia.org/blob/operations%2Fpuppet%2Fcdh.git/69b6d3e853d248c5977a6909eaceab72a9620284/TODO.md [18:19:38] matanya: via subbu:) [18:19:59] matanya, i know him for over 10 years. [18:21:02] good, very good [18:21:37] <_joe_> greg-g: should I do something about my latest incident report? [18:21:39] ottomata: when is hadoop tech talk ? [18:22:42] oh, i missed it. what a shame [18:23:39] ottomata: ty - hopefully it doesn't need wizard level skills (puppet) [18:24:34] mary, puppet is ruby. [18:25:09] actually, puppet is DSL, if we are strictly talking [18:25:20] yes, ruby dsl. :) [18:25:21] subbu: last time i used puppet was about a year ago - much of the (local) cluster management is now done via CM4 :) [18:25:47] matanya: https://plus.google.com/u/0/events/c53ho5esd0luccd09a1c30rlrmg [18:25:57] _joe_: other than fix the action items, nope :) [18:26:06] oh, great. thanks ottomata [18:28:09] _joe_: i broke, i fix. how many mailman procs should be on sodium ? [18:28:27] Krenair: i gave you the right, in case you missed it. [18:39:17] (03CR) 10JanZerebecki: [C: 031] dynamicproxy - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146466 (owner: 10Dzahn) [18:39:33] (03PS1) 10Chmarkine: update SSL ciphers for contacts.wm.org to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) [18:41:03] matanya, I saw, ty [18:41:32] Reedy: around? [18:42:13] (03CR) 10Matanya: update SSL ciphers for contacts.wm.org to support PFS (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) (owner: 10Chmarkine) [18:43:11] (03CR) 10Matanya: [C: 031] dynamicproxy - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146466 (owner: 10Dzahn) [18:44:13] (03CR) 10Matanya: [C: 031] retab role/gerrit.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/146087 (owner: 10Dzahn) [18:46:30] (03PS2) 10Chmarkine: update SSL ciphers for contacts.wm.org to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) [18:47:53] who is Chamrkine ? [18:48:16] (03CR) 10Matanya: update SSL ciphers for contacts.wm.org to support PFS (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) (owner: 10Chmarkine) [18:53:13] PROBLEM - ElasticSearch health check on elastic1018 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.40 [18:53:21] !log bouncing elastic1018 to pick up new merge policy. 
hopefully that'll help with io thrashing [18:53:27] Logged the message, Master [19:00:15] ottomata: re oozie + extjs 2.2 - can the rpm be added to local repos ? (can be copied from horontoworks repo (for ex: http://private-repo-1.hortonworks.com.s3.amazonaws.com/HDP-1.1.0.15/repos/centos5/oozie/extjs-2.2-1.noarch.rpm) [19:00:21] (03PS3) 10Chmarkine: update SSL ciphers for contacts.wm.org to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) [19:01:49] ) [19:02:28] (03CR) 10Matanya: [C: 031] update SSL ciphers for contacts.wm.org to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) (owner: 10Chmarkine) [19:03:12] mary: debs, if at all, not rpms [19:03:46] wmf runs on ubuntu [19:04:50] mary, if you can make or find a working extjs-2.2 .deb package, you would be my hero! [19:05:00] and ya, we could add it to the WMF repo [19:05:07] apt.wikimedia.org [19:06:45] matanya: ah - k - debs; i suppose alien should be able to do it [19:06:45] checking [19:10:35] btw, dogeydogey spent some time trying to do this for me too [19:14:24] 15:11 <@ James_F> greg-g: Speaking of, do we plan to upgrade SMW on wikitech-wiki at some point? It's over a year old… [19:14:27] andrewbogott: ^^^^^^ [19:14:53] greg-g: I would love to. Modern versions of SMW can only be installed via composer. [19:14:57] :( :( [19:14:59] nvm [19:15:06] I'm hoping that that will somehow become… possible? in the future. [19:15:07] so, how's the transition to wikidata or whatever going? [19:15:10] You could do a build thing like we do for Wikidata :P [19:15:11] only installed via composer? [19:15:12] :P [19:15:31] might be easiest to use composer but it shouldn't be absolutely necessary... [19:15:33] You wouldn't need to update it as often as we do [19:15:43] andrewbogott: Could we instead just delete the SMW use on wikitech instead? [19:15:50] I welcome suggestions and/or implementations. There's a wikitech role in vagrant. [19:16:03] James_F: Not without breaking very many things [19:16:21] yeah :/ [19:16:23] I actually don't hate SMW or our use of it. I'm just sad that that I can't upgrade. [19:16:38] non-upgradable software == bad bad bad [19:16:39] Darn. [19:16:59] It might be possible to install by hand, or to install on a VM with composer and then transfer, or something... [19:17:12] But, I dunno, whenever I look at it I become confused and alarmed. [19:17:37] also not good [19:17:39] :) [19:18:13] hoo: Are there docs someplace for your wikidata process? [19:18:28] there are occasional emails about composer+prod so I've been hoping that someone is making progress there. [19:18:39] andrewbogott: Sure! https://wikitech.wikimedia.org/wiki/How_to_deploy_Wikidata_code [19:18:52] Maybe you can derive from that [19:19:05] lot of specific stuff there, though [19:19:20] I'll look, thanks [19:20:59] dogeydogey: got a bit of time to look over some cleanup commits? [19:21:01] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [19:21:13] ori sure [19:21:51] andrewbogott: i'll try to poke at etherpad module, but not so quickly [19:21:59] https://gerrit.wikimedia.org/r/#/c/146142/ , https://gerrit.wikimedia.org/r/#/c/146495/ and https://gerrit.wikimedia.org/r/#/c/146497/ would be helpful [19:27:27] ori: is the new job queue live? 
there's a centralauth rename job on lmowiki that is stuck [19:27:34] greg-g, andrewbogott, James_F: https://www.mediawiki.org/wiki/Requests_for_comment/Composer_managed_libraries_for_use_on_WMF_cluster [19:27:35] AaronSchulz: ^ [19:27:37] could you look? [19:27:41] LocalRenameUserJob: 0 queued; 1 claimed (1 active, 0 abandoned); 0 delayed [19:28:24] it's been like that for a few minutes now, job should have finished in like 10 seconds [19:28:43] and I know part of the job has executed, but it didn't finish [19:29:53] AaronSchulz: ack? [19:30:45] gah, in the middle of some commit [19:31:31] * AaronSchulz though that was for puppet reviews for a second [19:34:04] ottomata: can you help me a sec ? how many procs of mailman on sodium ? and what is the cmd line ? [19:34:38] ori: chasemp do we have any custom patches on top of upstream diamond in our package? [19:34:51] yes [19:35:06] chasemp: where are they at? [19:35:12] wondering if I should just roll my patches into those as well [19:35:33] not sure what you mean on the second, like amend those commits with your new stuff? [19:35:43] <_joe_> AaronSchulz: I have some output from the jobrunner [19:36:15] chasemp: no, I mean, 1. I assume that means we've a python-diamond repository that we build the package from, 2. I could just get my patches to Diamond from github into our repo and get those merged and deployed [19:36:24] https://github.com/wikimedia/python-diamond [19:36:27] not much activity there, and there's a *lot* of pull requests pending [19:36:32] chasemp: oh, is it maintained on github? [19:36:40] no but all of our stuff is mirrored there [19:36:49] what's the gerrit name? [19:37:03] operations/deb/python-diamond [19:37:04] I think? [19:37:24] but I guess that one [19:37:31] doesn't mirror correctly [19:37:38] I have no idea of that setup honestly [19:37:45] hmm, right [19:37:45] _joe_: is it a MW exception/fatal or jobrunner specific? [19:37:59] matanya: root@sodium:~# ps aux | grep mailman | grep -v grep | wc -l [19:37:59] 9 [19:38:17] YuviPanda: so backstory, my patches are essentially a custom handler and logic to support [19:38:23] thanks ottomata, i need the cmd line as well :) [19:38:24] flushing by collector groups [19:38:36] which upstream didn't see value in, and I needed at one time for debug and sanity [19:38:50] been considering just killing it if it's not that valuable here / anymore [19:38:53] but atm that's the deal [19:39:04] hmm, right [19:39:07] we pass the collector name to the handler and it flushes a queue specific to that [19:39:10] etc etc [19:39:41] none of my patches are really needed either - exim not having to sudo as root would be nice, but not mandatory. And graphite is on hold while the machine goes through the procurement queues... [19:39:52] it's just that... slightly unresponsive upstreams make me a bit queasy :) [19:40:19] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2033: active_shards: 6098: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:40:19] RECOVERY - ElasticSearch health check on elastic1018 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2033: active_shards: 6098: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:40:28] I don't know the ppl other than kormoc and I think him starting that new twitter job has bogged him down a lot [19:40:48] but if you want to integrate locally in anticipation of upstream following suit [19:40:51] I don't mind [19:40:58] but that stuff all has a rolling cost [19:41:00] matanya, its there no? [19:41:02] ps aux | grep mailman | grep -v grep | wc -l [19:41:29] ottomata: i mean the output of ps, with running procs [19:41:35] oh [19:41:52] isn't that called the cmd line ? [19:42:01] https://gist.github.com/ottomata/7a1a79c5f18db5e82214 [19:42:03] AaronSchulz: if I can help you out with the debugging in any way, let me know [19:42:49] thanks a lot ottomata now i can push a monitoring patch [19:43:38] chasemp: yeah, don't want to accidentally end up with a fork [19:47:44] YuviPanda: it hasn't been that long has it? I would give it a few weeks honestly [19:48:04] ottomata, I have a quick question re deb repos in operations/: did you just ask for a new gerrit repo with ops-only merge rights? [19:48:14] small project like this I don't see local patching for our needs as a big minus, especially if we are pursuing integration upstream [19:48:33] chasemp: no hasn't been, I was just looking at our current setup (debs et all). Considering ours isn't significantly patched, I'd give it a month. [19:49:00] if, however, we had 10 biggish patches on top, then that's different :) so was just checking [19:49:07] gotcha [19:49:32] chasemp: there's still discussion about ssd vs spinning disk in the RT, so do chime in if you have opinions :) [19:49:42] not sure where that is [19:49:46] that ticket specifically [19:51:50] dogeydogey: are you still working on the deb pkg for extjs-2.2.x ? [19:52:00] mary nope [19:52:52] chasemp: https://rt.wikimedia.org/Ticket/Display.html?id=7814 [19:53:03] dogeydogey: any existing packaing config files would be useful :) [19:53:06] packaging [19:53:35] mary i don't have anything :( [19:54:16] (03PS1) 10Matanya: mailman: monitor number of running processes [operations/puppet] - 10https://gerrit.wikimedia.org/r/146526 [19:55:39] dogeydogey: k - trying to build a package using existing packaging info for 3.3.x [19:55:55] !log Applied extensions/UploadWizard/UploadWizard.sql to rowiki (re bug 59242) [19:55:59] Logged the message, Master [19:57:34] greg-g: Crap [19:57:36] Sorry [20:00:42] (03PS1) 10Reedy: Commit db1035-related live hack [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146527 [20:00:44] (03PS1) 10Reedy: Non Wikipedias to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146528 [20:01:38] (03PS2) 10Reedy: Commit db1035-related live hack [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146527 [20:01:43] (03CR) 10Reedy: [C: 032] Commit db1035-related live hack [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146527 (owner: 10Reedy) [20:01:51] (03Merged) 10jenkins-bot: Commit db1035-related live hack [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146527 (owner: 10Reedy) [20:02:02] Reedy: luckily we're good for another hour [20:02:26] Flow not using their window? [20:02:29] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [20:02:37] chasemp: can you review a patch please ? 
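The "flush by collector groups" patch chasemp describes above boils down to keeping one buffer per collector and flushing each buffer on its own, rather than one global queue. Below is a minimal, self-contained sketch of that idea; it deliberately does not use Diamond's real handler API, so the class name, the path-parsing rule, and the batch size are illustrative assumptions, not what the local patches actually do.

    from collections import defaultdict

    class PerCollectorQueueHandler(object):
        """Buffer metrics per collector group and flush each group independently."""

        def __init__(self, batch_size=100):
            self.batch_size = batch_size        # assumed flush threshold
            self.queues = defaultdict(list)     # collector name -> pending metrics

        def process(self, metric_path, value):
            # Assumed path layout, e.g. 'servers.mw1001.cpu.total.idle' -> group 'cpu'.
            parts = metric_path.split('.')
            collector = parts[2] if len(parts) > 2 else metric_path
            queue = self.queues[collector]
            queue.append((metric_path, value))
            if len(queue) >= self.batch_size:
                self.flush(collector)

        def flush(self, collector):
            # Stand-in for the real send (graphite line protocol, statsd, ...).
            for path, value in self.queues[collector]:
                print('%s %s' % (path, value))
            self.queues[collector] = []

Carrying a small local patch like this is the "rolling cost" being weighed in the conversation: cheap while upstream is quiet, but easy to drift into an accidental fork.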
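matanya's monitoring change (Gerrit 146526) is, per the conversation, essentially the "ps aux | grep mailman | grep -v grep | wc -l" one-liner turned into a check. A rough Python equivalent might look like the following; the expected count of 9 is simply what ottomata saw on sodium in this log, and the exit codes are assumptions rather than what the merged puppet check necessarily does.

    import subprocess
    import sys

    EXPECTED = 9  # number of mailman processes ottomata counted on sodium

    ps_output = subprocess.check_output(['ps', 'aux']).decode('utf-8', 'replace')
    count = sum(1 for line in ps_output.splitlines()
                if 'mailman' in line and 'grep' not in line)

    if count == EXPECTED:
        print('OK: %d mailman processes running' % count)
        sys.exit(0)
    print('WARNING: %d mailman processes running (expected %d)' % (count, EXPECTED))
    sys.exit(1)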
[20:02:44] they arne't until 2pm Pacific/9 UTC [20:03:03] matanya: link? probably get to in the morning most likely tho. but sure [20:03:03] (03CR) 10Anomie: "I was hoping whoever was responsible would come along and give it a better commit message. Or just remove it, if they intended to undo it " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146527 (owner: 10Reedy) [20:03:05] (03PS1) 10Jkrauska: Change ns1 ip to new record [operations/dns] - 10https://gerrit.wikimedia.org/r/146529 [20:03:18] chasemp: https://gerrit.wikimedia.org/r/#/c/146526/ [20:04:06] ah two things actually, I'm rush in gerrit / ldap and I don't know much about mailman so probably not the right person :) [20:04:16] (03CR) 10Reedy: "Guess it was fairly major and Sean was more concentrated on fixing stuff up" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146527 (owner: 10Reedy) [20:04:24] cmjohnson1: hello! [20:04:25] https://gerrit.wikimedia.org/r/146529 [20:04:40] hi cajoel [20:04:59] there will only be one more fixup after that goes out. [20:05:06] (03CR) 10Cmjohnson: [C: 032] Change ns1 ip to new record [operations/dns] - 10https://gerrit.wikimedia.org/r/146529 (owner: 10Jkrauska) [20:05:07] thanks anyway chasemp [20:05:36] cajoel...merged [20:05:53] does dns auto-deploy? [20:07:51] I see it live... [20:08:00] so you must have done that too. [20:09:13] anybody got experience tuning drives with hdparm? [20:09:39] manybubbles noticed that 3 of the elastic nodes have a different multiple sector transfer setting [20:09:59] and those 3 have more iowait...they also have less memory so less cache and more io to do in general [20:10:12] we're thinking of experimenting with changing this setting, but I wanted to ask if anyone had experience with it first [20:10:24] !log restarted logstash on logstash1001; log volume looked to be down from "normal" [20:10:29] Logged the message, Master [20:11:25] (03PS2) 10Reedy: Non Wikipedias to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146528 [20:12:00] !log log volume up after logstash restart [20:12:08] Logged the message, Master [20:12:24] * greg-g turns up the jams [20:12:24] cajoel yep i merged and updated dns [20:12:57] !log Reloading Zuul to deploy If2312bcf18bdbe8dee [20:13:02] Logged the message, Master [20:16:05] (03CR) 10Reedy: [C: 032] Non Wikipedias to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146528 (owner: 10Reedy) [20:16:11] (03Merged) 10jenkins-bot: Non Wikipedias to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146528 (owner: 10Reedy) [20:17:21] (03PS1) 10Ori.livneh: jobrunner: inform trebuchet of service name [operations/puppet] - 10https://gerrit.wikimedia.org/r/146587 [20:17:26] ori: so it's just git deploy start, git pull, git deploy sync, on tin in /srv/deployment/jobrunner/jobrunner ? [20:18:05] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.24wmf13 [20:18:09] Logged the message, Master [20:18:13] AaronSchulz: yep [20:18:22] AaronSchulz: go for it [20:18:41] the service won't get restarted automatically (i just submitted a change for that above), but i'll do it via salt once you're done [20:19:26] Missing the following configuration item: user.name [20:19:49] ~/.gitconfig [20:20:04] any convention on names? 
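On the hdparm question above: the "multiple sector transfer" setting manybubbles noticed is what hdparm reports as multcount, readable per drive with `hdparm -m`. A sketch for comparing it across the local drives of one host follows; the device list is a placeholder, the regex assumes output shaped like "multcount = 16 (on)", and it needs root.

    import re
    import subprocess

    DEVICES = ['/dev/sda', '/dev/sdb']   # placeholder: the drives on an elastic10xx host

    def multcount(device):
        out = subprocess.check_output(['hdparm', '-m', device]).decode('utf-8', 'replace')
        match = re.search(r'multcount\s*=\s*(\d+)', out)
        return int(match.group(1)) if match else None

    settings = dict((dev, multcount(dev)) for dev in DEVICES)
    print(settings)
    if len(set(settings.values())) > 1:
        print('WARNING: multiple sector transfer (multcount) differs across drives')

The write form of the same flag (e.g. hdparm -m16 /dev/sdX) is what the proposed experiment would change.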
[20:20:05] $ git config --global user.name "John Doe" [20:20:05] $ git config --global user.email johndoe@example.com [20:20:41] must be a pokemon [20:20:54] but no water pokemon [20:21:01] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0133333333333 [20:21:03] but you're free to choose within those constraints [20:21:47] also just enter 'y' on all the prompts, they're confusing as hell [20:21:48] !log reedy Synchronized docroot and w: (no message) (duration: 00m 21s) [20:21:53] Logged the message, Master [20:22:02] hmm, so my password is now y? [20:22:11] yep [20:22:31] huh, no auto IRC or SAL entries [20:22:45] YuviPanda: i imagine it's old news but https://www.youtube.com/watch?v=EiIxkOah09E&list=UUpU9EZn1Ll9kPpSuBsn4VyA was hilarious [20:22:50] AaronSchulz: yep [20:23:02] !log Deployed /srv/jobrunner to 31e54c564d369e89613db48977eec0a5891b6498 [20:23:06] Logged the message, Master [20:23:14] i'll restart the service [20:23:32] ori: :D not *that* old :) might be realistic soon, tho. [20:24:45] !log restarted jobrunner on all jobrunners [20:24:49] Logged the message, Master [20:24:56] AaronSchulz: salt -G 'cluster:jobrunner' cmd.run 'service jobrunner restart' [20:24:58] incase you were curious [20:25:15] * YuviPanda should setup salt for toollabs [20:25:23] ori: I wish I could tail the upstart log :? [20:26:02] i'll fix [20:26:50] dogeydogey, ottomata - i have a crude deb pkg (derived from current debian pkg files for 3.x) - is there a recommended way to share it with you ? [20:28:12] ori, AaronSchulz: the job got un-stuck, thanks! [20:28:20] hoo: ^ [20:28:29] not sure I did anything [20:29:26] dunno :P [20:29:38] :P [20:29:39] Nice [20:31:03] <_joe_> ori: hey I'm here [20:31:22] <_joe_> need help with the jobrunners? [20:31:51] oo, mary, ummmm, you can zip and email to me? or put it up online somewhere? [20:32:01] github? or whatever [20:32:06] _joe_: yes, actually [20:32:32] things look good, but i'd like aaron to be able to view the log files. i chmodded the file right now as a hack, but we should have some persistent solution [20:32:47] probably redirect output to /var/log/mediawiki/jobrunner.log, and logrotate that [20:32:47] <_joe_> mmmh ok, let me check [20:33:10] <_joe_> yes that would be better IMO [20:33:32] if you can hang around for a minute i'll submit a patch for that [20:33:34] <_joe_> but we may also change the way upstart works [20:33:48] <_joe_> also, please don't chmod files [20:33:50] there's also https://gerrit.wikimedia.org/r/#/c/146587/ [20:34:12] _joe_: ok, i'll chmod them back to their old mode [20:34:29] <_joe_> ori: not necessary, upstart will do that [20:34:34] <_joe_> when rotating [20:34:40] all i did was /var/log/mediawiki/jobrunner.log 0640 -> 0644 [20:34:43] nod [20:34:44] <_joe_> new files will have default permissions :) [20:34:55] hence "we should have some persistent solution" ;) [20:35:11] anyways, writing a patch [20:35:11] sec [20:35:49] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: inform trebuchet of service name [operations/puppet] - 10https://gerrit.wikimedia.org/r/146587 (owner: 10Ori.livneh) [20:38:04] ori: why isn't the password var resolving? [20:38:32] * ori looks [20:38:51] <_joe_> AaronSchulz: where? puppet? 
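The salt one-liner ori pastes above can also be driven from salt's Python API on the master. A sketch only, assuming the salt release in use at the time still takes expr_form='grain' (later releases renamed it tgt_type) and that the caller is permitted to publish jobs:

    import salt.client

    client = salt.client.LocalClient()
    results = client.cmd(
        'cluster:jobrunner',              # same grain target as -G on the CLI
        'cmd.run',
        ['service jobrunner restart'],
        expr_form='grain',
    )
    for minion, output in sorted(results.items()):
        print('%s: %s' % (minion, output))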
[20:38:54] blech [20:39:22] <_joe_> oh ok [20:39:25] <_joe_> lemme see [20:39:41] needs include passwords::redis [20:40:30] (03PS1) 10Ori.livneh: jobrunner: include passwords:redis, to populate template var [operations/puppet] - 10https://gerrit.wikimedia.org/r/146592 [20:40:39] ^ _joe_ [20:40:47] <_joe_> ori: ok [20:41:40] <_joe_> ori: lemme check please [20:41:59] nod [20:42:42] (03PS2) 10Giuseppe Lavagetto: jobrunner: include passwords:redis, to populate template var [operations/puppet] - 10https://gerrit.wikimedia.org/r/146592 (owner: 10Ori.livneh) [20:42:53] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] jobrunner: include passwords:redis, to populate template var [operations/puppet] - 10https://gerrit.wikimedia.org/r/146592 (owner: 10Ori.livneh) [20:44:02] <_joe_> ori: running puppet to reload the jobrunners with the new config [20:44:11] thank you very much [20:44:36] ottomata: a bit more nudging please. what is the output of : ls -la /var/lib/mailman/qfiles/ on sodium [20:44:47] and of /var/lib/mailman/bin/show_qfiles in [20:44:59] ottomata: done ( /msg) [20:46:18] _joe_: is the loop actually running as apache? [20:46:32] <_joe_> AaronSchulz: now it should start working. [20:46:40] <_joe_> AaronSchulz: don't remember, lemme check [20:46:41] upstart log looked sad :( [20:47:04] <_joe_> it is [20:47:41] <_joe_> AaronSchulz: should I stop it from running? [20:48:01] <_joe_> I can kill it, disable puppet, and we can look into it [20:48:16] <_joe_> AaronSchulz: it's run as that user, but how is that check done? [20:48:25] <_joe_> I mean in the code [20:48:31] in PHP code, I think MWScript.php [20:49:21] $info = posix_getgrgid( $gid ); [20:49:22] if ( $info && in_array( $info['name'], array( 'sudo', 'wikidev', 'root' ) ) ) { [20:49:28] checks that for each group the user has [20:49:31] <_joe_> grid [20:49:33] <_joe_> ok [20:49:41] right, group not name [20:49:43] <_joe_> the job runs as apache:root [20:50:02] <_joe_> probably [20:50:04] <_joe_> lemme check [20:51:04] <_joe_> no it runs as apache:wikidev afaict [20:51:12] <_joe_> it should [20:51:22] <_joe_> AaronSchulz: is it ok for it to run? [20:52:02] you mean right now? It just spams the log, nothing else [20:52:20] odd that apache:wikidev wouldn't work though [20:53:54] ori: wouldn't apache:apache make more sense? [20:54:00] oh, wait, duh of course wikidev won't work [20:54:05] * AaronSchulz read that backwards [20:54:18] <_joe_> wikidev won't work? [20:54:19] I don't think anything automatic should be wikidev [20:54:53] that check excludes root, wikidev, and sudo [20:54:56] <_joe_> AaronSchulz: I thought that was a standard [20:55:04] <_joe_> oh lol [20:55:13] <_joe_> ok so change its group to apache? [20:55:18] <_joe_> let me write a patch [20:56:07] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [20:57:00] (03CR) 10JanZerebecki: [C: 031] "Nitpick: For consistency reasons you could also remove kEDH+AESGCM from the cipher list like said in the comments on the commit this was b" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) (owner: 10Chmarkine) [20:58:29] (03PS1) 10Giuseppe Lavagetto: jobrunner: run as the apache user [operations/puppet] - 10https://gerrit.wikimedia.org/r/146598 [20:58:38] chasemp: a bit more nudging please. 
what is the output of : ls -la /var/lib/mailman/qfiles/ on sodium [20:58:38] and of /var/lib/mailman/bin/show_qfiles in [20:59:10] (03CR) 10Aaron Schulz: [C: 031] jobrunner: run as the apache user [operations/puppet] - 10https://gerrit.wikimedia.org/r/146598 (owner: 10Giuseppe Lavagetto) [21:00:04] spagewmf: The time is nigh to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140715T2100) [21:00:07] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: run as the apache user [operations/puppet] - 10https://gerrit.wikimedia.org/r/146598 (owner: 10Giuseppe Lavagetto) [21:03:45] everybody seems to be very busy. maybe Jeff_Green or RobH can help with this request ? [21:03:48] <_joe_> AaronSchulz: mw1001 looks better [21:04:14] matanya: [21:04:27] /var/lib/mailman/bin/show_qfiles needs an argument [21:04:39] Example: show_qfiles qfiles/shunt/*.pck [21:04:46] the argument is "in" [21:05:33] root@sodium:~# /var/lib/mailman/bin/show_qfiles in [21:05:33] ====================> in [21:05:33] Traceback (most recent call last): [21:05:33] File "/var/lib/mailman/bin/show_qfiles", line 95, in [21:05:33] main() [21:05:34] File "/var/lib/mailman/bin/show_qfiles", line 81, in main [21:05:34] fp = open(filename) [21:05:35] IOError: [Errno 2] No such file or directory: 'in' [21:05:35] no it needs files [21:06:12] oh, i understand what you mean [21:06:20] ah yes same result for me actually [21:06:27] was trying to figure out what I was doing wrong [21:07:23] sorry, i meant the argument is the in the output of ls -la /var/lib/mailman/qfiles/ [21:07:42] i.e. one of the qfiles lists [21:08:02] too late to be clear :/ [21:08:31] there's nothing in the in/ dir [21:08:50] so /var/lib/mailman/qfiles/ is empty ? [21:08:53] three is in the archive folder [21:09:01] but the output is a lot of ppl's emails I probably can't release [21:09:16] there is^ [21:09:26] this is weird [21:09:27] /var/lib/mailman/qfiles/in is empty [21:09:39] is very populated /var/lib/mailman/qfiles/bad [21:10:24] chasemp: and if you run /var/lib/mailman/bin/show_qfiles on /var/lib/mailman/qfiles/bad ? [21:11:15] 1219 entries in bad/ [21:11:21] and it dumps all teh raw emails [21:11:47] <_joe_> the in and out dir should be empty most of the times [21:11:57] <_joe_> chasemp: pickled [21:11:57] so i need only the number output [21:12:23] <_joe_> matanya: check if the in or out queue exceed a certain threshold [21:12:29] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:12:55] I see in processing, or at least files come and go [21:12:59] assuming taht's a staging dir [21:13:01] _joe_: that is what i'm trying to do, but very hard without shell to check a *threshold* [21:13:29] (03PS1) 10Ori.livneh: Add logrotation for /var/log/mediawiki/* [operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 [21:14:42] so chasemp how did you get the number? only ran the command ? [21:15:09] matanya: what number do you mean? entries in bad/ [21:15:12] ls | wc -l [21:15:13] yes [21:15:19] !log to test r146607, locally modified upstart conf for jobrunner on mw1001 to log to /var/log/mediawiki, and restarted service [21:15:25] Logged the message, Master [21:15:44] _joe_: https://gerrit.wikimedia.org/r/#/c/146607/ [21:15:55] isn't it super late for you btw? 
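The snippet AaronSchulz quotes above is why the jobrunner had to move from apache:wikidev to apache:apache: MWScript refuses to run when any of the process's groups is a privileged one. A stdlib-only Python illustration of that rule (not MediaWiki's actual code) looks roughly like this:

    import grp
    import os

    FORBIDDEN = {'sudo', 'wikidev', 'root'}

    group_names = set()
    for gid in os.getgroups() + [os.getgid()]:
        try:
            group_names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            pass

    privileged = group_names & FORBIDDEN
    if privileged:
        raise SystemExit('refusing to run with privileged group(s): %s'
                         % ', '.join(sorted(privileged)))
    print('ok to run as groups: %s' % ', '.join(sorted(group_names)))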
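_joe_'s suggestion above — check whether the in or out queue exceeds a threshold — and matanya's ls | wc -l both reduce to counting the pickled .pck files per queue directory. Here is one way that could look as a Nagios-style check; the directory names come from the conversation on sodium, while the thresholds are made-up placeholders and not anything from the merged monitoring patch.

    import glob
    import os
    import sys

    QFILES = '/var/lib/mailman/qfiles'
    # (warn, crit) pairs are placeholders, not production values.
    THRESHOLDS = {'in': (10, 50), 'out': (10, 50), 'bad': (100, 1000)}

    worst = 0
    parts = []
    for queue in sorted(THRESHOLDS):
        warn, crit = THRESHOLDS[queue]
        count = len(glob.glob(os.path.join(QFILES, queue, '*.pck')))
        parts.append('%s=%d' % (queue, count))
        if count >= crit:
            worst = max(worst, 2)
        elif count >= warn:
            worst = max(worst, 1)

    print('MAILMAN QUEUES %s: %s'
          % ({0: 'OK', 1: 'WARNING', 2: 'CRITICAL'}[worst], ' '.join(parts)))
    sys.exit(worst)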
[21:15:55] so you didn't use the /var/lib/mailman/bin/show_qfiles command [21:16:18] ori: you lost the right to ask that a long time ago :P [21:16:31] matanya: you can only use that command on a pck file [21:16:35] <_joe_> ori: it is [21:16:37] if I point it at the dir it bails, I can do bad/* [21:16:43] but it's far too much output to be useful [21:16:51] I don't know what you are looking for here [21:16:54] (03PS2) 10Ori.livneh: Add logrotation for /var/log/mediawiki/* [operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 [21:16:58] <_joe_> ori: the commit is missing the logrotate file [21:16:58] so i'll stick to ls |wc -l thanks! [21:17:10] <_joe_> maybe not anymore [21:17:19] _joe_: i just caught that, sorry. yeah, the updated patch includes it. [21:19:19] <_joe_> ori: logrotate will mess with upstart I guess [21:20:20] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time [21:20:26] greg-g: ping [21:20:26] aude: You sent me a contentless ping. This is a contentless pong. Please provide a bit of information about what you want and I will respond when I am around. [21:20:31] bah [21:20:51] we forgot to update wmf13 submodule for our fix last week for entity search [21:20:56] so it's broke again [21:21:05] broken* [21:21:12] would anyone care if i do that now? [21:21:35] * aude check deployment calendar [21:22:08] aude: bah, you're right, it was all my fault... for wmf13 I forgot the submodule update [21:22:13] it happens [21:22:45] I usually do git status after I'm done to verify but forgot this time [21:22:57] (it shows you uncommited changes if you don't submodule update [21:22:59] ) [21:23:00] * aude waits a few minutes, but can't really stay awake for swat [21:23:22] aude: doit [21:23:26] ok :) [21:26:16] !log aude Synchronized php-1.24wmf13/extensions/Wikidata: Update submodule to fix entity search issue on Wikidata (duration: 00m 21s) [21:26:22] Logged the message, Master [21:26:48] done [21:27:19] thanks, aude [21:27:47] sure [21:27:56] glad it was easy [21:34:25] <_joe_> !log disabling puppet on mw1001, tests [21:34:30] Logged the message, Master [21:45:27] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This can't work." [operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 (owner: 10Ori.livneh) [21:48:07] (03PS1) 10Chad: Disabling Special:Random integration with Cirrus for now [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146615 [21:52:59] (03CR) 10Ori.livneh: "hmm! good to know. what about telling logrotate to copytruncate?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 (owner: 10Ori.livneh) [22:31:59] (03PS1) 10BryanDavis: labs_vagrant: Ensure that lvm volume is mounted first [operations/puppet] - 10https://gerrit.wikimedia.org/r/146632 [22:45:24] (03CR) 10Andrew Bogott: [C: 04-1] labs_vagrant: Ensure that lvm volume is mounted first (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146632 (owner: 10BryanDavis) [22:57:26] (03PS9) 10Withoutaname: Reduce string URLs to defined constant [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131914 (https://bugzilla.wikimedia.org/48618) [22:57:32] (03CR) 10BryanDavis: labs_vagrant: Ensure that lvm volume is mounted first (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146632 (owner: 10BryanDavis) [23:00:04] RoanKattouw, mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140715T2300) [23:00:14] sure [23:00:34] (03PS15) 10Withoutaname: Delete ve.wikimedia.org and leave redirect [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131907 (https://bugzilla.wikimedia.org/55737) [23:01:57] <^d> MaxSem: I can do my own I listed. [23:03:12] (03CR) 10Andrew Bogott: labs_vagrant: Ensure that lvm volume is mounted first (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146632 (owner: 10BryanDavis) [23:06:53] (03PS2) 10BryanDavis: labs_vagrant: Ensure that lvm volume is mounted first [operations/puppet] - 10https://gerrit.wikimedia.org/r/146632 [23:07:36] (03CR) 10BryanDavis: labs_vagrant: Ensure that lvm volume is mounted first (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146632 (owner: 10BryanDavis) [23:13:46] !log maxsem Synchronized php-1.24wmf13/extensions/CirrusSearch/: https://gerrit.wikimedia.org/r/#q,146471,n,z (duration: 00m 04s) [23:13:52] Logged the message, Master [23:14:45] !log maxsem Synchronized php-1.24wmf13/includes/specials/SpecialVersion.php: (no message) (duration: 00m 04s) [23:14:51] Logged the message, Master [23:15:31] Reedy, your fix is live [23:16:29] !log maxsem Synchronized php-1.24wmf12/extensions/CirrusSearch/: https://gerrit.wikimedia.org/r/#q,146471,n,z (duration: 00m 05s) [23:16:35] Logged the message, Master [23:16:59] (03CR) 10MaxSem: [C: 032] Disabling Special:Random integration with Cirrus for now [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146615 (owner: 10Chad) [23:17:07] (03Merged) 10jenkins-bot: Disabling Special:Random integration with Cirrus for now [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146615 (owner: 10Chad) [23:19:26] thanks MaxSem [23:19:36] not yet [23:20:07] (03PS1) 10Jkrauska: Final change for corp dns - change ttl back to 1h [operations/dns] - 10https://gerrit.wikimedia.org/r/146647 [23:20:26] !log maxsem Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/146615/ (duration: 00m 04s) [23:20:30] Logged the message, Master [23:20:42] ^d and manybubbles, deployed [23:20:44] cmjohnson1:final dns change [23:20:44] https://gerrit.wikimedia.org/r/146647 [23:20:48] <^d> MaxSem: ty! [23:22:05] done, apparently [23:22:26] <^d> Oh yeah [23:22:29] <^d> Ganglia shows it [23:26:42] jgage: can you do a dns deploy? 
[23:28:24] <^d> MaxSem: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=elastic1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=Elasticsearch+cluster+eqiad :P [23:28:59] oh come on, it was just at 40% [23:29:15] and you disabled an useful feature! :P [23:33:04] <^d> MaxSem: Well, it just went back to doing what it did before :p [23:36:00] <^d> Oh well, it was a fun experiment for a bit. Maybe we'll play with it again soon. [23:39:10] ^d: so the Japanese click Special:Random much more than others? [23:41:03] noted on https://bugzilla.wikimedia.org/show_bug.cgi?id=65366 [23:41:19] <^d> No. It was on all wikis, even those in secondary. [23:41:22] <^d> So mostly enwiki. [23:41:30] <^d> But jawiki was causing its own load issues. [23:41:33] oki [23:41:57] <^d> Turning off random gives us some breathing room. [23:51:03] <^d> Nemo_bis: RandomLoad.png. [23:51:08] <^d> Filename made me chuckle. [23:55:37] (03PS1) 10Phuedx: Disable the anonymous signup invite experiment [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146651 [23:55:47] (03CR) 10Phuedx: [C: 04-1] Disable the anonymous signup invite experiment [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146651 (owner: 10Phuedx)