[00:02:43] Are the mailman queue'd messages still waiting to go out? I haven't seen anything from the lists since it went down 5 hrs ago. (Sorry if this is an ongoing battle... ("I can't read your crazy moonlanguage" - The Tick)) [00:07:13] If any ops are still around: Could somebody please take a look at https://bugzilla.wikimedia.org/show_bug.cgi?id=67805#c22 for the memcached error and connection failures? [00:07:20] * andre__ tries IRC before hitting the mailing list [00:09:35] I suspect it just hasn't had nutcracker updated on it [00:09:41] tin [00:09:42] nobody 23483 0.0 0.0 18732 1704 ? Sl Apr25 8:59 /usr/local/bin/nutcracker -m 65536 -a 127.0.0.1 -c /usr/local/apache/common-local/wmf-config/twemproxy-eqiad.yaml [00:09:47] mw1017 [00:09:48] 112 14857 0.1 0.0 21652 2144 ? Ssl Jul07 13:11 /usr/sbin/nutcracker --mbuf-size=65536 --stats-port=22223 [00:10:04] * jgage takes a look [00:10:11] Yeah [00:10:15] tin still is on "twemproxy" [00:10:21] apaches are on "nutcracker" [00:10:39] ok, cool [00:10:47] jgage: i have a patch: https://gerrit.wikimedia.org/r/#/c/146288/ [00:10:54] thanks ori [00:11:00] At a quick guess, terbium is probably in a similar state [00:11:02] * Reedy checks [00:11:22] nope, just tin [00:11:49] (03CR) 10Gage: [C: 032] tin: include ::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/146288 (owner: 10Ori.livneh) [00:11:58] Reedy: https://gerrit.wikimedia.org/r/#/c/146305/ needs manual rebase [00:12:57] wtf gerrit [00:13:04] or git [00:13:15] jgage: would you like me to deploy it? [00:13:55] ori i just +2'd it but puppet doesn't see it, i'm unclear what state this patch is in [00:14:15] oh got it [00:15:06] running puppet agent on tin now [00:16:01] meh Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: File[/etc/cluster] is already declared in file /etc/puppet/modules/mediawiki/manifests/init.pp:13; cannot redeclare at /etc/puppet/modules/apachesync/manifests/init.pp:10 on node tin.eqiad.wmnet [00:16:27] i suppose that's why the toplevel class wasn't declared originally eh [00:17:31] i'll fix [00:17:36] thanks [00:18:10] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Complete puppet failure [00:19:15] (03PS1) 10Ori.livneh: dedupe File['/etc/cluster'] [operations/puppet] - 10https://gerrit.wikimedia.org/r/146358 [00:19:21] ^ jgage [00:19:29] ACKNOWLEDGEMENT - puppet last run on tin is CRITICAL: CRITICAL: Complete puppet failure Jeff Gage working on this in irc [00:19:44] ori, taking a look.. [00:20:28] (03CR) 10Gage: [C: 032] dedupe File['/etc/cluster'] [operations/puppet] - 10https://gerrit.wikimedia.org/r/146358 (owner: 10Ori.livneh) [00:25:38] waiting on puppet.. [00:28:01] Notice: /Stage[main]/Nutcracker/Service[nutcracker]/ensure: ensure changed 'stopped' to 'running' [00:28:22] yay [00:28:26] andre__: nutcracker is now updated on tin [00:28:33] lovely! thanks so much! [00:28:36] jgage, ^ [00:28:42] mmm [00:28:50] When does localisation update occur? [00:29:11] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [00:29:25] Reedy: 02:00 UTC ish [00:29:37] Reedy: I.e. 90 minutes' time. [00:29:39] Just in tim then [00:37:31] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Jul 15 00:37:27 UTC 2014 [00:47:26] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 12 hours old. 
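For context on the "Duplicate declaration: File[/etc/cluster]" failure above: Puppet refuses to compile a catalog when two classes each declare the same resource, which is what happened once tin started including ::mediawiki alongside the old apachesync class. A minimal sketch of the pattern and of the kind of dedupe the follow-up patch applies (illustrative class bodies only, not the actual manifests):

    # Before (any node that includes both classes fails to compile):
    #
    #   class mediawiki  { file { '/etc/cluster': content => "${::site}\n" } }
    #   class apachesync { file { '/etc/cluster': content => "${::site}\n" } }
    #
    # After: exactly one class owns the resource; the other just pulls it in.
    class mediawiki {
        file { '/etc/cluster':
            content => "${::site}\n",
        }
    }

    class apachesync {
        require ::mediawiki    # File['/etc/cluster'] is now declared in one place only
    }

Either class could own the file; the point is that only one of them does.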
[00:49:25] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 0 hours old. [01:08:32] (03PS1) 10BryanDavis: beta: Stop using obsolete 'maxclients' param for mediawiki::web [operations/puppet] - 10https://gerrit.wikimedia.org/r/146367 [01:14:52] jgage: ^^ Are the mailing lists meant to be back up? They appear to still be down. [01:16:15] Hmm. Gage may not be around. :-( [01:17:14] (the most recent message at wikitech-l, was at 12:41 pacific. Nothing since then, from any mailing list, of the many I'm subscribed to. http://lists.wikimedia.org/pipermail/wikitech-l/2014-July/077596.html [01:26:44] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [01:34:51] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [01:41:54] (03PS5) 10BryanDavis: Manage /usr/local/apache from ::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/144599 [01:44:36] (03CR) 10BryanDavis: "Cherry-picked and applied in beta." [operations/puppet] - 10https://gerrit.wikimedia.org/r/146367 (owner: 10BryanDavis) [01:54:40] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [01:55:40] PROBLEM - puppet last run on search1022 is CRITICAL: CRITICAL: Puppet has 1 failures [02:10:25] PROBLEM - Host db1035 is DOWN: PING CRITICAL - Packet loss = 100% [02:13:35] PROBLEM - MySQL Replication Heartbeat on db71 is CRITICAL: CRIT replication delay 308 seconds [02:14:35] RECOVERY - puppet last run on search1022 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [02:18:08] (03CR) 1020after4: [C: 031] "Cool." [operations/puppet] - 10https://gerrit.wikimedia.org/r/146367 (owner: 10BryanDavis) [02:27:07] !log powercycle db1035 unresponsive [02:27:15] Logged the message, Master [02:30:06] !log LocalisationUpdate completed (1.24wmf12) at 2014-07-15 02:29:02+00:00 [02:30:12] Logged the message, Master [02:30:28] RECOVERY - Host db1035 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [02:34:06] !log springle Synchronized wmf-config/db-eqiad.php: depool db1035, crashed (duration: 00m 13s) [02:34:11] Logged the message, Master [02:42:13] PROBLEM - MySQL Slave Running on db1035 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 14635079 for key old_id on query. 
Default [02:43:00] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 2073 seconds [02:48:40] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [02:50:30] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:20] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.017 second response time [02:56:10] PROBLEM - Host db1035 is DOWN: PING CRITICAL - Packet loss = 100% [03:01:07] !log LocalisationUpdate completed (1.24wmf13) at 2014-07-15 03:00:03+00:00 [03:01:12] Logged the message, Master [03:03:42] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [03:05:52] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [03:13:02] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 01:12:48 UTC [03:19:36] (03CR) 10BryanDavis: "It looks like one way we could use this in beta would be by creating files/apache/sites/beta/*.conf or maybe even mirroring the files/apac" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146082 (owner: 10Giuseppe Lavagetto) [03:33:10] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Tue Jul 15 03:33:03 UTC 2014 [03:34:45] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 15 03:33:38 UTC 2014 (duration 33m 37s) [03:34:51] Logged the message, Master [03:49:30] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [03:49:30] RECOVERY - Host db1035 is UP: PING WARNING - Packet loss = 86%, RTA = 0.52 ms [03:49:30] RECOVERY - MySQL Slave Running on db1035 is OK: OK replication [03:54:00] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [04:03:31] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 6904 seconds [04:03:31] PROBLEM - MySQL Slave Running on db1035 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 14635079 for key old_id on query. Default [04:34:06] PROBLEM - MySQL Slave Delay on db71 is CRITICAL: CRIT replication delay 5221 seconds [04:35:36] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db71 is CRITICAL: CRIT replication delay 4601 seconds Sean Pringle changed to replicate from db1019. recovering... [04:35:37] ACKNOWLEDGEMENT - MySQL Slave Delay on db71 is CRITICAL: CRIT replication delay 4686 seconds Sean Pringle changed to replicate from db1019. recovering... [04:40:00] I love that you actually acknowledge things in icinga, springle [04:40:04] wish more did [04:40:07] :) [04:41:07] RECOVERY - MySQL Slave Delay on db71 is OK: OK replication delay 0 seconds [04:41:37] RECOVERY - MySQL Replication Heartbeat on db71 is OK: OK replication delay -0 seconds [04:46:16] :) [04:47:17] after that it caught up fast anyway [04:47:35] sneaky server [04:48:43] !log db1035 crash cycle. down for memtest and stuff [04:48:49] Logged the message, Master [05:15:51] <_joe_> hey springle [05:15:56] <_joe_> good morning [05:27:43] hi _joe_ [05:28:03] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [05:47:19] (03CR) 10TTO: "If you're going to do this you may as well get rid of the extension from the cluster altogether." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146031 (owner: 10Legoktm) [05:47:40] (03CR) 10Legoktm: "Isn't it already gone?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146031 (owner: 10Legoktm) [05:49:03] IRC went down? [05:49:10] Er, irc.wikimedia.org [06:17:27] (03CR) 10TTO: "It still shows up on testwiki:Special:Version, so I suppose it isn't." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146031 (owner: 10Legoktm) [06:21:21] <_joe_> Bsadowski1: is irc still not working for you? [06:22:24] <_joe_> because the daemon is up and running on argon [06:23:05] <_joe_> and I can connect correctly [06:28:49] ACKNOWLEDGEMENT - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has 6 failures Giuseppe Lavagetto osmium is the hhvm test host, puppet is disabled AFAIK [06:29:26] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:16] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:25] (03PS1) 10Giuseppe Lavagetto: mediawiki: remove twemproxy::decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/146391 [06:36:56] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: remove twemproxy::decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/146391 (owner: 10Giuseppe Lavagetto) [06:42:54] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:43:24] RECOVERY - puppet last run on mw1151 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:43:41] <_joe_> :) [06:44:02] <_joe_> I like icinga free of unacknowledged alarms [06:46:24] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:46:24] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:48:14] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:54:47] <_joe_> !log killed jenkins stale process on gallium, stuck in a futex while shutting down [06:54:52] Logged the message, Master [07:22:07] <_joe_> !log stopping mailman on sodium for repairing [07:22:12] Logged the message, Master [07:27:51] <_joe_> !log restarted mailman on sodium [07:27:57] Logged the message, Master [07:28:51] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:37:43] _joe_: morning [07:37:51] <_joe_> eh [07:37:55] <_joe_> "good morning" [07:38:02] poke me when you have a moment [07:38:16] <_joe_> matanya: around half of next week? [07:38:17] <_joe_> :P [07:38:30] <_joe_> matanya: joking aside, mailman will take me some time [07:38:36] I know [07:38:59] does Halloween sound like a good timing ? [07:42:43] <_joe_> godog: ping [07:42:44] _joe_: ping detected, please leave a message! [08:31:54] anyone around who can look at the mailing lists? They appear to be swallowing posts at the moment [08:33:58] <_joe_> Jamesofur: it should be recovering [08:34:03] <_joe_> but we're checking [08:34:08] thank ye [08:34:38] I'm likely heading to bed but I'll let the community member who was reaching out (and had me forward an email that also got swallowed) so that he knows. 
I appreciate it [08:35:44] <_joe_> Jamesofur: sorry for the inconvenience :( [08:36:27] yeah, sucks :( they seem to be having more issues in the recent weeks to months I wonder if the boxes are having issues but I appreciate you looking into it [08:39:13] <_joe_> Jamesofur: this was all triggered by a crash of the mailman server the other day [08:39:21] <_joe_> we're still recovering from that [08:39:22] <_joe_> :( [08:40:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:40:07] ouch, was that when the root partition filled up a month ago or is this a new thing? [08:41:36] <_joe_> Jamesofur: a new thing, that caused quite a few problems [08:41:48] :-/ [08:42:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:42:40] ok, I'm off to bed, good look _joe_! [08:43:39] <_joe_> thanks [08:44:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:46:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:48:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:50:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:50:24] !log restart mailman on sodium after inodes freed [08:50:30] Logged the message, Master [08:52:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:54:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:56:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:37:55 UTC [08:56:21] <_joe_> this is bogus btw [08:57:45] RECOVERY - Puppet freshness on lvs4001 is OK: puppet ran at Tue Jul 15 08:57:35 UTC 2014 [08:59:05] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 08:57:35 UTC [09:09:16] <_joe_> !log restarting mailman on sodium, again, for testing [09:09:22] Logged the message, Master [09:17:58] RECOVERY - Puppet freshness on lvs4001 is OK: puppet ran at Tue Jul 15 09:17:55 UTC 2014 [09:42:33] <_joe_> oh ok thanks godog [09:42:36] <_joe_> :) [09:54:34] (03PS2) 10Giuseppe Lavagetto: monitoring-git: fix icinga message [operations/puppet] - 10https://gerrit.wikimedia.org/r/146078 [10:05:53] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667 [10:10:08] (03CR) 10Giuseppe Lavagetto: [C: 032] monitoring-git: fix icinga message [operations/puppet] - 10https://gerrit.wikimedia.org/r/146078 (owner: 10Giuseppe Lavagetto) [10:18:06] (03CR) 10Filippo Giunchedi: [C: 031] twemproxy: remove leftovers post-decom (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144757 (owner: 10Ori.livneh) [10:20:53] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [10:21:43] PROBLEM - Apache HTTP on mw1017 is CRITICAL: Connection refused [10:22:41] (03CR) 10Filippo Giunchedi: [C: 031] jobrunner: provision on mw1001; fix template [operations/puppet] - 10https://gerrit.wikimedia.org/r/146177 (owner: 10Ori.livneh) [10:22:49] (03PS1) 10Aude: Add wikidata wb_property_info table to dumps [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/146419 
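Given how often mailman on sodium has to be restarted by hand in this log, a hedged sketch of what keeping it supervised and alerting might look like in this puppet tree (the service name, qrunner count and thresholds are assumptions, not the real sodium manifest):

    service { 'mailman':
        ensure => running,
        enable => true,
    }

    # Alert if the qrunner daemons die again; the 8:10 range is a guess at the
    # expected number of mailman qrunner processes.
    nrpe::monitor_service { 'mailman_qrunners':
        description  => 'mailman qrunner processes',
        nrpe_command => '/usr/lib/nagios/plugins/check_procs -c 8:10 --ereg-argument-array=qrunner',
    }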
[10:23:48] <_joe_> mw1017 is me [10:24:01] (03PS1) 10Aude: Fix typos in wikidata table descriptions [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/146420 [10:24:51] (03CR) 10Filippo Giunchedi: [C: 031] "looks good to me!" [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [10:25:41] (03CR) 10Filippo Giunchedi: Packaging for debian using pkg-php-tools/dh_php5. (031 comment) [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [10:26:43] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.154 second response time [10:30:48] (03CR) 10QChris: "Needed change" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145980 (owner: 10QChris) [10:38:42] (03PS2) 10Filippo Giunchedi: beta: Stop using obsolete 'maxclients' param for mediawiki::web [operations/puppet] - 10https://gerrit.wikimedia.org/r/146367 (owner: 10BryanDavis) [10:38:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] beta: Stop using obsolete 'maxclients' param for mediawiki::web [operations/puppet] - 10https://gerrit.wikimedia.org/r/146367 (owner: 10BryanDavis) [10:46:50] ori _joe_ https://gerrit.wikimedia.org/r/#/c/146177/ looks good, ready to be merged? [10:49:28] <_joe_> godog: yes, this evening when both ori and aaron are around :) [10:49:46] yep [10:51:16] (03PS2) 10Aude: Add wikidata wb_property_info table to dumps [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/146419 (https://bugzilla.wikimedia.org/68024) [10:51:28] (03PS2) 10Aude: Fix typos in wikidata table descriptions [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/146420 [11:14:34] (03CR) 10QChris: [C: 031] kafka process monitoring: make it send pages [operations/puppet] - 10https://gerrit.wikimedia.org/r/145711 (owner: 10Dzahn) [11:47:26] Where does mw config file for betawiki(s) stays? I need to import some artciles to es betawiki and there seems config preventing it. [11:51:37] kart_: It's the same place as the production config [11:51:50] https://noc.wikimedia.org/conf/ / https://git.wikimedia.org/tree/operations%2Fmediawiki-config.git [12:53:05] (03PS2) 10Hoo man: add index.html pages for various directories on dataset hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/144640 (owner: 10ArielGlenn) [12:53:22] (03CR) 10Hoo man: "Fixed syntax error" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144640 (owner: 10ArielGlenn) [12:54:15] (03CR) 10jenkins-bot: [V: 04-1] add index.html pages for various directories on dataset hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/144640 (owner: 10ArielGlenn) [12:59:50] apergos: Around [12:59:51] ? [13:20:53] qchris yayyyyy [13:20:54] hello! [13:20:57] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.38 [13:21:16] ottomata: Heya [13:21:45] oops, this is not analytics chat [13:21:49] someone rearranged my chat tabs! [13:21:53] but hi over here anyway! [13:22:03] Evil tab mangling monsters! [13:22:19] (03PS1) 10Jgreen: correct parameters for spamassassin class in otrs role class [operations/puppet] - 10https://gerrit.wikimedia.org/r/146450 [13:26:03] hoo: don't worry about it, or, worry if you like but I'll be staring at that soon enough, before it gets merged [13:26:11] also, I am around (was getting lunch) [13:26:41] apergos: Ok... 
I would like to have that merged, but it's not a big blocked towards Wikidata json dumps [13:26:57] I've got a bash script hacked up to generate the dumps, but I doubt it's production read [13:26:58] y [13:27:03] needs polishing etc. [13:27:28] is it committed anywhere yet or you want to work o it some more first? [13:27:56] it's not committed anywhere, yet... but I would like to do that [13:28:09] can you give me pointers where it should sit? [13:28:50] do you have other scripts or maintenance stuff for wikidata anyplace? [13:29:14] we have cron jobs in the maintenance.pp, but despite of that no [13:29:27] hrm [13:30:43] I guess add it to the snapshot module, there's already a cro job for central auth dumps in there [13:31:02] Ok, will have a look at that [13:31:31] how should I do the removal of old dumps? Should/ can that be in the same bash script or ...? [13:31:34] under files, then in the manifests dir there's a class for it [13:32:03] oh, I would think so, just clean up after you complete a successful round [13:32:06] something like that [13:33:04] Ok, so add one, remove one... that makes sense [13:33:24] Guess we want to keep around a few, but that's not going to be an issue as they are small [13:33:36] sure [13:33:47] you could settle on an arbitrary number like 10 [13:33:54] (chosen radomly) [13:33:56] *randomly [13:34:06] Lydia_WMDE might have an opinion on htat [13:34:10] if you run it once a week that's a couple months worth [13:34:15] ok great [13:34:30] (03CR) 10Jgreen: [C: 032 V: 031] correct parameters for spamassassin class in otrs role class [operations/puppet] - 10https://gerrit.wikimedia.org/r/146450 (owner: 10Jgreen) [13:35:04] number of json dumps we keep? [13:35:32] not possible to keep them all? [13:35:47] i guess it is ok with 10 for now [13:35:52] Lydia_WMDE: mh... we want one dump pre week? [13:35:53] * per [13:36:02] no we do't keep dumps forever [13:36:10] ok [13:36:20] in theory the new one should have all the data of the old one anyways [13:36:24] k [13:36:43] in terms of how often: more often is better :D [13:36:54] apergos: Well, guess that could be interesting to see how Wikidata grows (as the jsondumps only have the current data) [13:37:03] ok well our regular dump are every 10 days or so, so once a week is more often than that [13:37:42] Every ten days is the least we should do, I would prefer weekly. Lydia_WMDE ? [13:37:58] (03PS2) 10Ottomata: kafka process monitoring: make it send pages [operations/puppet] - 10https://gerrit.wikimedia.org/r/145711 (owner: 10Dzahn) [13:38:02] ok then let's do once a week [13:38:06] :) [13:38:19] sweet! [13:38:21] (03CR) 10Ottomata: [C: 032 V: 032] "Process monitoring pages, yes! We are ready for that. Thanks Daniel!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145711 (owner: 10Dzahn) [13:38:32] !log elastic1017 had a load average of 60 - was thashing in io. bounced Elasticsearch. lets see if it recovers on its own [13:38:38] Logged the message, Master [13:38:47] (03CR) 10Hoo man: [C: 04-1] add index.html pages for various directories on dataset hosts (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144640 (owner: 10ArielGlenn) [13:39:29] <^d> manybubbles: 17, seriously? Hmm. [13:39:41] ^d: yeah - going to back out japanese, I think [13:39:54] <^d> Ouch, we are kind of red. 
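On the "kafka process monitoring: make it send pages" merge above: in this repository, whether a check pages is normally just a flag on the monitoring resource, so the change amounts to something like the following (the resource name, command and exact flag here are illustrative, not the actual kafka check):

    nrpe::monitor_service { 'kafka_broker_process':
        description  => 'Kafka Broker Server',
        nrpe_command => '/usr/lib/nagios/plugins/check_procs -c 1:1 --ereg-argument-array="java.+kafka.Kafka"',
        critical     => true,    # assumed flag: routes the alert to the paging contact group
    }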
[13:40:37] (03PS1) 10Manybubbles: Switch jawiki back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146455 [13:40:54] (03CR) 10Chad: [C: 031] Switch jawiki back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146455 (owner: 10Manybubbles) [13:40:58] (03CR) 10Manybubbles: [C: 032] Switch jawiki back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146455 (owner: 10Manybubbles) [13:41:03] yeah [13:41:05] (03Merged) 10jenkins-bot: Switch jawiki back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146455 (owner: 10Manybubbles) [13:41:47] more unstashed changes on tin.... [13:41:53] someone in themiddle of something? [13:42:55] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: jawiki back to lsearchd (duration: 00m 05s) [13:42:59] Logged the message, Master [13:43:08] lets see if that helps [13:44:48] load is pretty high because all the shards are franticly trying to resassign after I bounced elastic1017 [13:44:55] probably should have turned of jawiki first [13:45:01] and maybe I wouldn't have had to [13:52:49] !log after switching jawiki back to lsearchd by default load is mostly recovered. the cluster is still healing from bouncing elastic1017 and that'll take a while. the load will be a bit high during that but searches are coming back in a reasonably amount of time again [13:52:53] Logged the message, Master [14:00:08] (03PS3) 10Tim Landscheidt: Tools: Remove lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/124001 [14:00:45] (03CR) 10Tim Landscheidt: [C: 04-1] "Needs to be tested on Toolsbeta first." [operations/puppet] - 10https://gerrit.wikimedia.org/r/124001 (owner: 10Tim Landscheidt) [14:09:30] growl gitdeployyyyyy go! [14:09:31] go! [14:09:32] nope. [14:10:10] sad_trombone.wav [14:11:15] anybody know where salt stores its grain info? [14:13:57] hmm, no, this has to be a puppet problem [14:13:57] hmm [14:16:10] (03PS1) 10Ottomata: Troubleshoot why refinery role is not applied on analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146458 [14:16:19] (03PS2) 10Ottomata: Troubleshoot why refinery role is not applied on analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146458 [14:17:02] ori: this bug https://bugzilla.wikimedia.org/show_bug.cgi?id=63981 seems to affect all of beta labs. If it can't be fixed pretty soon, can whatever caused it be reverted? [14:17:17] OHHHH, DOH [14:18:37] (03PS3) 10Ottomata: Deploy analytics/refinery repository to analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146458 [14:18:45] (03PS4) 10Ottomata: Deploy analytics/refinery repository to analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146458 [14:19:11] (03CR) 10Ottomata: [C: 032 V: 032] Deploy analytics/refinery repository to analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146458 (owner: 10Ottomata) [14:24:08] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Complete puppet failure [14:24:58] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures [14:26:08] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [14:28:50] apergos: Still around? Can you create the target dir for the dumps? 
[14:28:55] (03PS1) 10Ottomata: Ensuring stats user and group exist for refinery deployments [operations/puppet] - 10https://gerrit.wikimedia.org/r/146459 [14:29:09] I can indeed [14:29:12] (03CR) 10Ottomata: [C: 032 V: 032] Ensuring stats user and group exist for refinery deployments [operations/puppet] - 10https://gerrit.wikimedia.org/r/146459 (owner: 10Ottomata) [14:29:36] apergos: Ok... I would suggest to go for xmldatadumps/public/other/wikidataJson [14:29:40] (03PS1) 10Dzahn: wikitech - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146461 [14:29:41] I guess that name is ok [14:30:00] if you have a better idea, go ahead :P [14:30:34] (03CR) 10jenkins-bot: [V: 04-1] wikitech - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146461 (owner: 10Dzahn) [14:30:36] not just wikidata? [14:30:49] (03PS1) 10Ottomata: Fix hasty syntax error [operations/puppet] - 10https://gerrit.wikimedia.org/r/146462 [14:31:04] hoo: [14:31:06] (03CR) 10Ottomata: [C: 032 V: 032] Fix hasty syntax error [operations/puppet] - 10https://gerrit.wikimedia.org/r/146462 (owner: 10Ottomata) [14:31:12] guess we can also do that... just thought it might be confusing as we also have wikidata xml dumps [14:31:15] but I guess that's ok [14:31:30] the xml dumps all live in another area [14:31:33] I think it'll be ok [14:32:04] dir is there now [14:32:12] indeed :) [14:32:24] (03PS1) 10Dzahn: gerrit - disabled DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146464 [14:32:59] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [14:34:15] (03CR) 10Matanya: [C: 031] gerrit - disabled DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146464 (owner: 10Dzahn) [14:40:12] (03PS1) 10Dzahn: dynamicproxy - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146466 [14:43:33] (03PS2) 10Dzahn: bugzilla - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146461 [14:43:48] (03PS2) 10Dzahn: gerrit - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146464 [14:44:29] (03CR) 10jenkins-bot: [V: 04-1] bugzilla - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146461 (owner: 10Dzahn) [14:55:45] chrismcmahonbrb: i'll fix it [14:57:18] (03PS25) 10Andrew Bogott: etherpad: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [14:57:18] aww manybubbles [14:57:30] Nemo_bis: exciting morning [14:57:44] thanks ori [14:57:46] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 12:56:48 UTC [15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140715T1500) [15:00:25] * anomie observes no patches for SWAT [15:01:37] (03PS1) 10Hoo man: Introduce snapshot::wikidatajsondump [operations/puppet] - 10https://gerrit.wikimedia.org/r/146470 [15:01:51] apergos: https://gerrit.wikimedia.org/r/146470 [15:02:01] Don't be to hard on it... I've kept it simple [15:04:17] anomie: if you want, some trivial maintenance https://gerrit.wikimedia.org/r/#/c/145861/ [15:05:47] Nemo_bis: Sure. Put it on the Deployments page, please [15:06:04] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [15:08:09] no worries, thanks! 
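A rough sketch of the shape a snapshot::wikidatajsondump class can take, per the discussion above: a dump script shipped from the module plus a weekly cron, with the script itself pruning all but the last ~10 runs (the script name, schedule, user and paths are assumptions, not hoo's actual patch):

    class snapshot::wikidatajsondump {
        file { '/usr/local/bin/dumpwikidatajson.sh':
            ensure => present,
            mode   => '0555',
            source => 'puppet:///modules/snapshot/dumpwikidatajson.sh',
        }

        # Weekly run; output goes under .../xmldatadumps/public/other/wikidata,
        # and the script is expected to delete everything but the newest ~10 dumps.
        cron { 'wikidatajson-dump':
            ensure  => present,
            command => '/usr/local/bin/dumpwikidatajson.sh',
            user    => 'datasets',
            weekday => 1,
            hour    => 3,
            minute  => 15,
            require => File['/usr/local/bin/dumpwikidatajson.sh'],
        }
    }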
[15:08:47] anomie: done [15:09:13] Nemo_bis: You forgot to put yourself as the requesting developer, as indicated [15:09:27] (03CR) 10Anomie: [C: 032] "SWAT deploy" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145861 (owner: 10Nemo bis) [15:09:36] (03Merged) 10jenkins-bot: Remove dead ULS variable after I49e812eae32266f165591c75fd67b86ca06b13f0 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145861 (owner: 10Nemo bis) [15:10:24] Who left uncommitted changes in /a/common on tin? [15:12:04] springle: Are these changes in /a/common on tin yours? [15:12:52] anomie: he's asleep [15:12:59] anomie: I believe they are his, yeah [15:13:03] (g'morning, btw) [15:13:10] greg-g: So who can fix tin so I can SWAT? [15:13:15] I had to jam something out that super fast this morning to I stashed them, rebased, and then unstashed them [15:13:20] it was horrible and I'm ashamed [15:13:38] what are the changes? [15:13:43] db config stuff [15:13:47] :/ [15:13:48] greg-g: Appears to be disabling db1035 [15:14:04] anomie: commit it [15:14:11] 04:48 springle: db1035 crash cycle. down for memtest and stuff [15:14:14] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Tue 15 Jul 2014 13:13:25 UTC [15:14:38] brb [15:16:44] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Jul 15 15:16:37 UTC 2014 [15:16:47] !log anomie updated /a/common to {{Gerrit|I7ca6a16d5}}: Switch jawiki back to lsearchd [15:16:52] Logged the message, Master [15:17:12] That's wrong, logmsgbot [15:18:33] !log anomie actually committed a live hack someone left on tin (removing db1035) [15:18:38] Logged the message, Master [15:19:22] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /a/common/). [15:19:26] !log anomie Synchronized wmf-config/: SWAT: Remove dead ULS variable [[gerrit:145861]] (duration: 00m 10s) [15:19:31] Logged the message, Master [15:19:57] Nemo_bis: ^ Test please [15:20:47] (03CR) 10Andrew Bogott: "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Puppet::Parser::AST::Resource failed with error ArgumentError: " [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [15:21:31] (03PS3) 10Andrew Bogott: Updating debian package files [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/68935 (owner: 10AzaToth) [15:21:33] (03PS1) 10Ottomata: Add Camus cron job on analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146474 [15:22:48] (03CR) 10Ottomata: [C: 032 V: 032] Add Camus cron job on analytics1027 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146474 (owner: 10Ottomata) [15:24:47] (03PS4) 10Andrew Bogott: Updating debian package files [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/68935 (owner: 10AzaToth) [15:26:19] Nemo_bis: Did you test that ULS still works after your change? 
[15:26:21] (03PS1) 10Manybubbles: Raise Elasticsearch filter cache size in prod [operations/puppet] - 10https://gerrit.wikimedia.org/r/146475 [15:26:57] (03PS2) 10Manybubbles: Raise Elasticsearch filter cache size in prod [operations/puppet] - 10https://gerrit.wikimedia.org/r/146475 [15:27:37] (03CR) 10Andrew Bogott: [C: 032] Updating debian package files [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/68935 (owner: 10AzaToth) [15:30:10] (03PS3) 10Manybubbles: Raise Elasticsearch filter cache size in prod [operations/puppet] - 10https://gerrit.wikimedia.org/r/146475 [15:31:26] (03CR) 10Andrew Bogott: [C: 032] labs_vagrant: Install to /srv/vagrant [operations/puppet] - 10https://gerrit.wikimedia.org/r/145974 (owner: 10BryanDavis) [15:32:25] !log setting filter cache size to 20% on elastic1001 to see if it takes/helps us [15:32:30] Logged the message, Master [15:34:57] anomie: just did, it does [15:35:45] Nemo_bis: Thanks [15:36:19] andrewbogott: lots of activity there for a rebase [15:36:49] AzaToth: Yeah, I updated the license files and merged. [15:37:00] (I checked in with Domas and Ryan about licensing) [15:37:29] ah [15:37:45] that little evil detail [15:37:58] (03CR) 10Manybubbles: "Dynamically applied to production - we'll see if it helps us any." [operations/puppet] - 10https://gerrit.wikimedia.org/r/146475 (owner: 10Manybubbles) [15:41:18] <_joe_> manybubbles: can I go on with an update of the appservers, or will this interfere with your work? [15:41:29] _joe_: me? not a bit [15:41:44] <_joe_> manybubbles: ok thanks [15:42:09] !log setting the filter cache on one node in the cluster set it on all. yay, I guess. Anyway, I'm going to let it soak for a while. [15:42:14] Logged the message, Master [15:47:21] <_joe_> !log starting rolling update of all appservers to apache2 2.2.22-1ubuntu1.6, half of them are on 2.2.22-1ubuntu1.5 now [15:47:28] Logged the message, Master [15:51:29] !log elasticsearch1017 is freaking out again - maybe there is something wrong with it. odds aren't good it picked up the same shard again after restart and that shard is somehow poison just for it and not the other two nodes with the same shard.... [15:51:35] Logged the message, Master [15:52:19] (03PS3) 10Andrew Bogott: labs_vagrant: cleanup sudoers config [operations/puppet] - 10https://gerrit.wikimedia.org/r/145975 (owner: 10BryanDavis) [15:52:39] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Tue Jul 15 15:52:37 UTC 2014 [15:53:40] <_joe_> !log mw101[0-9] updated [15:53:46] Logged the message, Master [15:56:58] <_joe_> !log mw1020-mw1059 updated [15:57:04] Logged the message, Master [15:57:59] !log restarting Elasticsearch on elastic1017 - its thrashing the disk again. I'm still not 100% sure why [15:58:04] Logged the message, Master [16:00:00] would anyone mind brainstorming with me about elastic1017? Its being sad. [16:00:46] <_joe_> !log mw1060-mw1099 updated [16:00:50] Logged the message, Master [16:04:55] such an unlucky machine [16:06:11] <_joe_> manybubbles: if ori does not show up soon, I may be available to help shortly [16:10:13] <_joe_> !log mw1100 and onwards updated [16:10:18] Logged the message, Master [16:19:50] _joe_: i'm here [16:22:50] _joe_: should we maybe split up the apache module patch into two parts [16:22:58] the first removing the cruft, the other adding the include ::apache? 
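For the filter-cache change (146475) being tried out here, and the recovery throttling that follows later in the log: both end up as elasticsearch.yml settings rendered by the elasticsearch puppet module, roughly like this (the class parameter names are hypothetical stand-ins; the commented keys are the underlying Elasticsearch settings):

    # Hypothetical parameter names -- the real module may spell these differently.
    class { '::elasticsearch':
        # -> indices.cache.filter.size (default 10%)
        filter_cache_size           => '20%',
        # -> indices.recovery.concurrent_streams
        recovery_concurrent_streams => 2,
        # -> indices.recovery.max_bytes_per_sec, to stop recovery thrashing the disks
        recovery_max_bytes_per_sec  => '20mb',
    }

Pushing the same values through the cluster settings API first, as done here, lets them take effect without a rolling restart; the puppet change just makes them survive one.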
[16:24:12] <_joe_> ori: I think it's fairly straightforward at this point [16:24:16] <_joe_> I mean, we "tested" that [16:24:29] ok, probably true [16:24:42] <_joe_> but that would probably be more correct commit-wise [16:24:53] i'll split it up [16:25:03] <_joe_> ok :) [16:25:27] <_joe_> !log all mw servers updated [16:25:32] Logged the message, Master [16:29:07] (03PS3) 10Aaron Schulz: jobrunner: provision on mw1001; fix template [operations/puppet] - 10https://gerrit.wikimedia.org/r/146177 (owner: 10Ori.livneh) [16:31:04] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.01 [16:31:17] (03PS3) 10Ori.livneh: mediawiki: remove module-local apache service [operations/puppet] - 10https://gerrit.wikimedia.org/r/146162 (owner: 10Giuseppe Lavagetto) [16:33:37] (03CR) 10Ori.livneh: [C: 031] mediawiki: remove module-local apache service [operations/puppet] - 10https://gerrit.wikimedia.org/r/146162 (owner: 10Giuseppe Lavagetto) [16:33:53] (03PS1) 10Ori.livneh: mediawiki: use canonical apache module [operations/puppet] - 10https://gerrit.wikimedia.org/r/146485 [16:36:46] (03PS3) 10Ori.livneh: twemproxy: remove leftovers post-decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/144757 [16:38:53] (03CR) 10Giuseppe Lavagetto: [C: 032] twemproxy: remove leftovers post-decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/144757 (owner: 10Ori.livneh) [16:39:24] (03PS3) 10Ori.livneh: add 'puppet-run' bash alias to my .bash_profile [operations/puppet] - 10https://gerrit.wikimedia.org/r/146132 [16:39:29] <_joe_> ori: http://puppet-compiler.wmflabs.org/145/change/146162/html/mw1212.eqiad.wmnet.html [16:39:36] <_joe_> doesn't seem right [16:40:00] (03CR) 10Ori.livneh: [C: 032 V: 032] "(trivial, zero impact on prod)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146132 (owner: 10Ori.livneh) [16:40:25] _joe_: it is right; we're not introducing the apache module yet in that patch [16:40:38] so we can't reference the apache service [16:40:49] but it's ok, but it's only until we deploy the follow-up patch, and those files already exist on all target nodes [16:40:53] <_joe_> ok yes [16:41:07] <_joe_> and apache is running [16:41:11] nod [16:41:23] <_joe_> which was my main fear [16:42:00] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: remove module-local apache service [operations/puppet] - 10https://gerrit.wikimedia.org/r/146162 (owner: 10Giuseppe Lavagetto) [16:42:18] (03PS2) 10Giuseppe Lavagetto: mediawiki: use canonical apache module [operations/puppet] - 10https://gerrit.wikimedia.org/r/146485 (owner: 10Ori.livneh) [16:44:44] (03PS3) 10Andrew Bogott: wikitech-make ServerAlias configurable as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/145610 (owner: 10Dzahn) [16:45:16] there's also a design flaw with trebuchet that is going to complicate the jobrunner deployment somewhat [16:45:39] adding deployment::target { 'jobrunner': } doesn't actually deploy to the node during the puppet run [16:45:41] (03CR) 10Andrew Bogott: [C: 032] wikitech-make ServerAlias configurable as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/145610 (owner: 10Dzahn) [16:45:50] it only marks it as a deployment target for future deployments, which are still presumed to be manually initiated [16:46:26] so i should probably commit a separate patch to add the deployment::target to the relevant nodes [16:46:38] do a jobrunner deploy, and only then push the followup patch [16:46:43] <_joe_> ori: make it tagged [16:47:01] <_joe_> so that 
we can safely run once with salt as a tagged run [16:47:15] i don't follow [16:47:45] <_joe_> if you tag the class that includes the new deployment::target in puppet [16:48:14] <_joe_> I can then run puppet in parallel on all appservers if needed [16:48:19] oh, i see what you're saying [16:48:24] it's not all appservers tho [16:48:25] just the jobrunners [16:48:35] <_joe_> yes so who cares :) [16:48:38] hiii all! my tech talk on analytics infrastructures stuff starts in 10 mins here [16:48:38] https://plus.google.com/u/0/events/c53ho5esd0luccd09a1c30rlrmg?cfem=1 [16:48:42] come! :) [16:48:45] <_joe_> still it should be a best practice [16:48:57] <_joe_> ottomata: sorry I won't be there :( [16:49:03] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Complete puppet failure [16:49:11] s'ok, just a reminder to whomever can and wants to make it [16:49:34] ottomata regarding the task for data, i need some help [16:49:39] should i proceed on that? [16:49:59] dogeydogey: i'm prepping for a talk, will discuss after [16:50:02] join the talk if you like! [16:50:21] k [16:50:44] <_joe_> ori: merging the other patch btw [16:50:51] coooool [16:51:36] _joe_: https://gerrit.wikimedia.org/r/#/c/146142/ is also related (and trivial) [16:51:41] (03PS3) 10Giuseppe Lavagetto: mediawiki: use canonical apache module [operations/puppet] - 10https://gerrit.wikimedia.org/r/146485 (owner: 10Ori.livneh) [16:52:37] <_joe_> ori: fair enough, but focus on the jobrunners please :) [16:53:15] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: use canonical apache module [operations/puppet] - 10https://gerrit.wikimedia.org/r/146485 (owner: 10Ori.livneh) [16:54:17] <_joe_> here we go. [16:54:28] weeeeee [16:54:43] (03PS2) 10Andrew Bogott: Tools: Remove redundant package requirement [operations/puppet] - 10https://gerrit.wikimedia.org/r/129239 (owner: 10Tim Landscheidt) [16:55:05] <_joe_> ori: do you have a set of urls I should test on a test host after the puppet change? [16:55:14] <_joe_> or should I use my usual one? [16:55:34] usual one should be fine [16:55:49] (03PS1) 10Ori.livneh: add jobrunner deployment target to job runners [operations/puppet] - 10https://gerrit.wikimedia.org/r/146492 [16:56:03] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [16:56:10] <_joe_> seems fine [16:56:19] \o/ [16:56:32] (03CR) 10Andrew Bogott: [C: 032] Tools: Remove redundant package requirement [operations/puppet] - 10https://gerrit.wikimedia.org/r/129239 (owner: 10Tim Landscheidt) [16:57:03] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:57:03] <_joe_> ori: next step is https://gerrit.wikimedia.org/r/#/c/146082/ ;) [16:57:44] (03PS2) 10Ori.livneh: mediawiki: move File['/usr/local/apache'] from web.pp -> sync.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/146142 [16:57:48] yes, that is friggin awesome btw [16:58:12] <_joe_> back in 5 [17:00:54] (03PS1) 10Ori.livneh: Fold mediawiki::web::config back into mediawiki::web [operations/puppet] - 10https://gerrit.wikimedia.org/r/146495 [17:17:54] !log lowered Elasticsearch concurrent recovery streams to 2 (from 3) and total write rate across those streams to 20MB/sec (from 4MB/sec). This should prevent io thrash on recovery which looked to cause echo distruptions in service while recovering from some other disruption. [17:18:00] Logged the message, Master [17:18:13] lowered from 4 to 20 ? 
:) [17:19:35] (03PS1) 10Ori.livneh: apache module: test config before attempting restart [operations/puppet] - 10https://gerrit.wikimedia.org/r/146497 [17:20:33] (03PS4) 10Manybubbles: Make permanent some Elasticsearch config [operations/puppet] - 10https://gerrit.wikimedia.org/r/146475 [17:21:37] _joe_: so i think https://gerrit.wikimedia.org/r/#/c/146492/ should be next [17:21:59] <_joe_> ori: ok sorry, juggling between things as usual [17:22:06] np at all, you're great [17:25:18] (03PS2) 10Giuseppe Lavagetto: add jobrunner deployment target to job runners [operations/puppet] - 10https://gerrit.wikimedia.org/r/146492 (owner: 10Ori.livneh) [17:25:28] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] add jobrunner deployment target to job runners [operations/puppet] - 10https://gerrit.wikimedia.org/r/146492 (owner: 10Ori.livneh) [17:25:55] ok, so now i'll force a puppet run on the job runners, and then do a trebuchet deploy [17:28:53] (03PS4) 10Ori.livneh: jobrunner: provision on mw1001; fix template [operations/puppet] - 10https://gerrit.wikimedia.org/r/146177 [17:29:18] !log elastic1017 went nuts again. just shutting elasticsearch off on it for now [17:29:24] Logged the message, Master [17:29:46] (03PS1) 10Jkrauska: Lowers TTL on corp ns1 record to faciliate changeover [operations/dns] - 10https://gerrit.wikimedia.org/r/146498 [17:30:16] cmjohnson1: can you do another dns deploy for me? [17:30:30] sure [17:31:08] (03CR) 10Cmjohnson: [C: 032] Lowers TTL on corp ns1 record to faciliate changeover [operations/dns] - 10https://gerrit.wikimedia.org/r/146498 (owner: 10Jkrauska) [17:31:10] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.39 [17:34:17] (03CR) 10Cmjohnson: [C: 04-1] "These are still in use on the 12th floor in Tampa. This can merge once we complete are out of Tampa." [operations/dns] - 10https://gerrit.wikimedia.org/r/143202 (owner: 10Dzahn) [17:34:27] (03CR) 10Cmjohnson: "These are still in use on the 12th floor in Tampa. This can merge once we complete are out of Tampa." [operations/dns] - 10https://gerrit.wikimedia.org/r/143202 (owner: 10Dzahn) [17:34:49] cmjohnson1: thanks - see you again in an hour when that old TTL expires? :) [17:35:07] heh [17:37:47] !log updated jobrunner to bef32b9120 [17:37:52] Logged the message, Master [17:38:55] <_joe_> ori: 146177 ? Are we GTG? [17:39:02] yep [17:40:03] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: provision on mw1001; fix template [operations/puppet] - 10https://gerrit.wikimedia.org/r/146177 (owner: 10Ori.livneh) [17:40:23] !log my last attempt to lower the concurrent traffic for recovery was a failure - tried again and succeeded. that seems to have fixed the echo service disruption from taking elastic1017 out of service [17:40:24] <_joe_> ori: merged [17:40:29] Logged the message, Master [17:40:44] running puppet on mw1001 [17:41:00] (03CR) 10Cmjohnson: [C: 031] Repurposing osm-db100{1,2} as labsdb100{6,7} [operations/puppet] - 10https://gerrit.wikimedia.org/r/137691 (owner: 10Alexandros Kosiaris) [17:41:02] (03PS1) 10Ori.livneh: beta: connect to memcached on port 11212 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146501 [17:41:10] <_joe_> ok, what should we look for after that starts? 
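Tying the jobrunner steps together -- the deployment::target marker added earlier, the /etc/jobrunner/ config directory _joe_ suggests a little further down, and the upstart-managed service checked on mw1001 -- a runner host ends up with roughly this (a sketch; only deployment::target and the paths come from the log, the rest of the wiring is assumed):

    class mediawiki::jobrunner {    # assumed class name
        # Mark the host as a trebuchet target; the code itself is still pushed
        # manually with git-deploy from tin.
        deployment::target { 'jobrunner': }

        file { '/etc/jobrunner':
            ensure => directory,
        }
        file { '/etc/jobrunner/jobrunner.ini':
            ensure  => present,
            content => template('mediawiki/jobrunner/jobrunner.ini.erb'),    # assumed template path
        }

        # Upstart job; the "No jobs available..." lines in
        # /var/log/upstart/jobrunner.log are the healthy idle state.
        service { 'jobrunner':
            ensure   => running,
            provider => 'upstart',
            require  => File['/etc/jobrunner/jobrunner.ini'],
        }
    }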
[17:41:13] (03PS2) 10Ori.livneh: beta: connect to memcached on port 11212 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146501 [17:41:32] the service should be running and /var/log/upstart/jobrunner.log should look sane [17:41:46] puppet applied correctly [17:41:55] <_joe_> 2014-07-15T17:41:45+0000: No jobs available... [17:41:59] (03CR) 10Ori.livneh: [C: 032] beta: connect to memcached on port 11212 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146501 (owner: 10Ori.livneh) [17:42:01] <_joe_> seems 'correct' [17:42:06] yeah, that's fine [17:42:09] <_joe_> :) [17:42:19] yay [17:42:23] should we apply it on the rest? [17:42:27] i think it's safe [17:42:34] AaronSchulz: ^ [17:43:05] <_joe_> so, the plan would be: you move gradually jobs from the old job queue to a new one? [17:43:18] <_joe_> did I get this right? [17:43:47] yep [17:45:45] (03PS1) 10Giuseppe Lavagetto: jobrunner: install on all jobrunners [operations/puppet] - 10https://gerrit.wikimedia.org/r/146502 [17:47:04] <_joe_> AaronSchulz: btw, took a look at jobrunner, and it looks great :) [17:48:02] <_joe_> I think it's a lot better than job-loop as-is, it has the added benefit that it's not a nightmare to modify [17:48:43] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: install on all jobrunners [operations/puppet] - 10https://gerrit.wikimedia.org/r/146502 (owner: 10Giuseppe Lavagetto) [17:49:32] hehe [17:49:49] <_joe_> well, great >> a lot better than job-loop :P [17:50:43] <_joe_> running puppet an all jobrunners [17:51:04] <_joe_> when it's done, we've finished deploying the jobrunner service in prod [17:51:27] <_joe_> then just tell me when you're switching some jobs over, I'll keep an eye on it [17:55:31] matanya: you there? [17:56:01] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.61 [17:56:21] <_joe_> btw, can we change /etc/jobrunner.ini in /etc/jobrunner/jobrunner.ini before paravoid gets to see that? [17:56:25] <_joe_> we should have time [17:56:51] <_joe_> :) [17:57:45] heheh [17:57:48] yes let's [17:57:51] i'll submit a patch [17:57:54] <_joe_> ori, AaronSchulz jobrunner is running on all jobrunners AFAICS [17:58:04] <_joe_> so help yourselves [17:58:09] <_joe_> ori: I can do that btw [17:58:15] !log _joe_ deployed jobrunner to all job runners [17:58:20] Logged the message, Master [17:58:22] _joe_: you have other changes you could review ;) [17:58:34] <_joe_> ori: I have dinner I could review [17:58:38] that's also true [17:58:42] have dinner [17:58:43] <_joe_> given it's 8 PM [17:58:45] <_joe_> :) [17:58:46] things look good, yeah [17:58:49] <_joe_> see you later [17:58:51] thanks a lot! [17:59:08] really appreciate it as ever [17:59:15] <_joe_> me too [17:59:40] <_joe_> btw, what about releasing hhvm to one jobrunner this week? [17:59:55] <_joe_> I just need to rebuild the package basically [18:00:04] Reedy, greg-g: The time is nigh to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140715T1800) [18:00:05] <_joe_> if the jobrunner works well, why not? [18:00:42] sure [18:00:44] it'd be awesome [18:00:55] we'd need to reprovision an instance with trusty [18:01:51] <_joe_> ori: that's true, it will take some time [18:02:01] <_joe_> not sure if I can make it this week [18:02:32] could you delegate that part to someone else? (decom a jobrunner, do a fresh trusty install)? [18:03:00] <_joe_> I'll se what I can do in that respect [18:04:13] cool [18:04:19] dinnerrrrr! 
:P [18:04:22] bbiab myself [18:04:48] re channel subject Lists: backlog cleared -- does this imply there was something broken with email list delivery this AM? [18:06:25] dogeydogey: yt? [18:07:25] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:08:15] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.018 second response time [18:08:21] yt? [18:08:24] ottomata what is that? [18:08:42] you there [18:08:46] "you there" :) [18:08:51] oh yeah [18:08:52] so, 1. you wantd to talk about something? [18:08:53] and [18:08:54] 2.! [18:09:01] mary wants to volunteer [18:09:02] the email about datasets [18:09:09] who is mary? [18:09:11] ottomata: now i am [18:09:11] and I'm looking for you and matanya do help guide? [18:09:21] perfect! mary, say hi to dogeydogey and matanya [18:09:25] they are ops volunteers [18:09:48] who can probably answer your volunteer questions better than I [18:09:54] I can possibly help you find tasks to work on [18:10:01] but they know what needs to be done to get you in :) [18:10:06] hola dogeydogey matanya [18:10:14] hi mary! [18:10:17] oh yeah, sure, I can help, hi mary, i'm pretty new too but i'd be glad to help [18:10:25] welcome :) [18:10:40] (03PS6) 10BryanDavis: Manage /usr/local/apache from ::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/144599 [18:10:41] <_joe_> I think matanya has a beautiful list of chores (welcome, btw) [18:10:55] mary dogeydogey this : https://etherpad.wikimedia.org/p/Puppet3 is the best newbie task i can offer [18:11:01] greg-g: looks like you maintain https://wikitech.wikimedia.org/wiki/Incident_documentation [18:11:13] _joe_: i apologize for lists thingy [18:11:41] but new still like https://wikitech.wikimedia.org/wiki/Incident_documentation/2014-07-14-Lists doesn't show up on the main page [18:11:44] new stuff [18:11:57] cajoel: just missing a link [18:11:59] <_joe_> cajoel: oh I forgot to link it [18:12:01] <_joe_> sigh [18:12:01] you can add it [18:12:23] it is a wiki after all ... :D [18:12:32] (03CR) 10BryanDavis: "Yet another manual rebase needed because of ongoing work in ::mediawiki by Ori and/or Giuseppe. It would be swell to see this merged so th" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144599 (owner: 10BryanDavis) [18:12:50] mary: if you need/want any guidance, i would love to help, just let me know [18:13:05] not just a new link [18:13:24] couldn't that be done with a subpage index? [18:13:36] if the pages are all named similarly? [18:13:40] it's there [18:13:44] and some background of your knowledge would help as well, to better route you. in PM if you prefer [18:13:45] just named differently [18:13:49] wondering if there's a way to avoid the manual step [18:13:49] https://wikitech.wikimedia.org/wiki/Incident_documentation/2014-07-14-Lists [18:13:52] notice the dashes [18:13:56] * greg-g moves [18:13:57] matanya: looking around for a 'getting started' doc (and links to repos etc ) [18:14:28] got it [18:14:29] cajoel: matanya it uses SMW searches [18:14:46] mary: Sadly, that doesn't really exists. we have https://wikitech.wikimedia.org/wiki/Get_involved [18:14:53] mary lots of good info here: https://wikitech.wikimedia.org/wiki/Main_Page [18:15:08] mary can find repos here: http://git.wikimedia.org/project/operations [18:15:16] greg-g: yeah, well. still a wiki [18:15:17] so, mary, were you specifically interested in analytics related ops tasks [18:15:19] or ops stuff in general? 
[18:15:41] matanya: sure, just saying it's not "just add a link" ;) [18:15:56] :) [18:15:58] ottomata: things around hadoop (&| hive, ....) mostly [18:16:22] mary: fixing cdh4 repo sounds fun ? [18:16:27] cdh4 is no more!!!! :) [18:16:29] cdh [18:16:29] . [18:16:30] :) [18:16:32] :) [18:16:34] let's check out todos there! [18:16:56] matanya, ottomata : sure :) [18:17:09] ottomata: the cdh4 module is still there, and still not puppet3 compatible [18:17:46] cdh4 has been removed as a submoulde from operations/puppet [18:17:51] there is now a cdh module [18:18:00] mary: just wondering, how did you get to find ops ? [18:18:02] if there are parts of that are not puppet3 compatible, then we shoudl fix those for sure! [18:18:21] ottomata: so please fix the list on https://etherpad.wikimedia.org/p/Puppet3 [18:18:34] effing smw [18:18:45] mary: http://git.wikimedia.org/blob/operations%2Fpuppet%2Fcdh.git/69b6d3e853d248c5977a6909eaceab72a9620284/TODO.md [18:19:38] matanya: via subbu:) [18:19:59] matanya, i know him for over 10 years. [18:21:02] good, very good [18:21:37] <_joe_> greg-g: should I do something about my latest incident report? [18:21:39] ottomata: when is hadoop tech talk ? [18:22:42] oh, i missed it. what a shame [18:23:39] ottomata: ty - hopefully it doesn't need wizard level skills (puppet) [18:24:34] mary, puppet is ruby. [18:25:09] actually, puppet is DSL, if we are strictly talking [18:25:20] yes, ruby dsl. :) [18:25:21] subbu: last time i used puppet was about a year ago - much of the (local) cluster management is now done via CM4 :) [18:25:47] matanya: https://plus.google.com/u/0/events/c53ho5esd0luccd09a1c30rlrmg [18:25:57] _joe_: other than fix the action items, nope :) [18:26:06] oh, great. thanks ottomata [18:28:09] _joe_: i broke, i fix. how many mailman procs should be on sodium ? [18:28:27] Krenair: i gave you the right, in case you missed it. [18:39:17] (03CR) 10JanZerebecki: [C: 031] dynamicproxy - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146466 (owner: 10Dzahn) [18:39:33] (03PS1) 10Chmarkine: update SSL ciphers for contacts.wm.org to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) [18:41:03] matanya, I saw, ty [18:41:32] Reedy: around? [18:42:13] (03CR) 10Matanya: update SSL ciphers for contacts.wm.org to support PFS (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) (owner: 10Chmarkine) [18:43:11] (03CR) 10Matanya: [C: 031] dynamicproxy - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146466 (owner: 10Dzahn) [18:44:13] (03CR) 10Matanya: [C: 031] retab role/gerrit.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/146087 (owner: 10Dzahn) [18:46:30] (03PS2) 10Chmarkine: update SSL ciphers for contacts.wm.org to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) [18:47:53] who is Chamrkine ? [18:48:16] (03CR) 10Matanya: update SSL ciphers for contacts.wm.org to support PFS (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) (owner: 10Chmarkine) [18:53:13] PROBLEM - ElasticSearch health check on elastic1018 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.40 [18:53:21] !log bouncing elastic1018 to pick up new merge policy. 
hopefully that'll help with io thrashing [18:53:27] Logged the message, Master [19:00:15] ottomata: re oozie + extjs 2.2 - can the rpm be added to local repos ? (can be copied from horontoworks repo (for ex: http://private-repo-1.hortonworks.com.s3.amazonaws.com/HDP-1.1.0.15/repos/centos5/oozie/extjs-2.2-1.noarch.rpm) [19:00:21] (03PS3) 10Chmarkine: update SSL ciphers for contacts.wm.org to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) [19:01:49] ) [19:02:28] (03CR) 10Matanya: [C: 031] update SSL ciphers for contacts.wm.org to support PFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) (owner: 10Chmarkine) [19:03:12] mary: debs, if at all, not rpms [19:03:46] wmf runs on ubuntu [19:04:50] mary, if you can make or find a working extjs-2.2 .deb package, you would be my hero! [19:05:00] and ya, we could add it to the WMF repo [19:05:07] apt.wikimedia.org [19:06:45] matanya: ah - k - debs; i suppose alien should be able to do it [19:06:45] checking [19:10:35] btw, dogeydogey spent some time trying to do this for me too [19:14:24] 15:11 <@ James_F> greg-g: Speaking of, do we plan to upgrade SMW on wikitech-wiki at some point? It's over a year old… [19:14:27] andrewbogott: ^^^^^^ [19:14:53] greg-g: I would love to. Modern versions of SMW can only be installed via composer. [19:14:57] :( :( [19:14:59] nvm [19:15:06] I'm hoping that that will somehow become… possible? in the future. [19:15:07] so, how's the transition to wikidata or whatever going? [19:15:10] You could do a build thing like we do for Wikidata :P [19:15:11] only installed via composer? [19:15:12] :P [19:15:31] might be easiest to use composer but it shouldn't be absolutely necessary... [19:15:33] You wouldn't need to update it as often as we do [19:15:43] andrewbogott: Could we instead just delete the SMW use on wikitech instead? [19:15:50] I welcome suggestions and/or implementations. There's a wikitech role in vagrant. [19:16:03] James_F: Not without breaking very many things [19:16:21] yeah :/ [19:16:23] I actually don't hate SMW or our use of it. I'm just sad that that I can't upgrade. [19:16:38] non-upgradable software == bad bad bad [19:16:39] Darn. [19:16:59] It might be possible to install by hand, or to install on a VM with composer and then transfer, or something... [19:17:12] But, I dunno, whenever I look at it I become confused and alarmed. [19:17:37] also not good [19:17:39] :) [19:18:13] hoo: Are there docs someplace for your wikidata process? [19:18:28] there are occasional emails about composer+prod so I've been hoping that someone is making progress there. [19:18:39] andrewbogott: Sure! https://wikitech.wikimedia.org/wiki/How_to_deploy_Wikidata_code [19:18:52] Maybe you can derive from that [19:19:05] lot of specific stuff there, though [19:19:20] I'll look, thanks [19:20:59] dogeydogey: got a bit of time to look over some cleanup commits? [19:21:01] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [19:21:13] ori sure [19:21:51] andrewbogott: i'll try to poke at etherpad module, but not so quickly [19:21:59] https://gerrit.wikimedia.org/r/#/c/146142/ , https://gerrit.wikimedia.org/r/#/c/146495/ and https://gerrit.wikimedia.org/r/#/c/146497/ would be helpful [19:27:27] ori: is the new job queue live? 
there's a centralauth rename job on lmowiki that is stuck [19:27:34] greg-g, andrewbogott, James_F: https://www.mediawiki.org/wiki/Requests_for_comment/Composer_managed_libraries_for_use_on_WMF_cluster [19:27:35] AaronSchulz: ^ [19:27:37] could you look? [19:27:41] LocalRenameUserJob: 0 queued; 1 claimed (1 active, 0 abandoned); 0 delayed [19:28:24] it's been like that for a few minutes now, job should have finished in like 10 seconds [19:28:43] and I know part of the job has executed, but it didn't finish [19:29:53] AaronSchulz: ack? [19:30:45] gah, in the middle of some commit [19:31:31] * AaronSchulz though that was for puppet reviews for a second [19:34:04] ottomata: can you help me a sec ? how many procs of mailman on sodium ? and what is the cmd line ? [19:34:38] ori: chasemp do we have any custom patches on top of upstream diamond in our package? [19:34:51] yes [19:35:06] chasemp: where are they at? [19:35:12] wondering if I should just roll my patches into those as well [19:35:33] not sure what you mean on the second, like amend those commits with your new stuff? [19:35:43] <_joe_> AaronSchulz: I have some output from the jobrunner [19:36:15] chasemp: no, I mean, 1. I assume that means we've a python-diamond repository that we build the package from, 2. I could just get my patches to Diamond from github into our repo and get those merged and deployed [19:36:24] https://github.com/wikimedia/python-diamond [19:36:27] not much activity there, and there's a *lot* of pull requests pending [19:36:32] chasemp: oh, is it maintained on github? [19:36:40] no but all of our stuff is mirrored there [19:36:49] what's the gerrit name? [19:37:03] operations/deb/python-diamond [19:37:04] I think? [19:37:24] but I guess that one [19:37:31] doesn't mirror correctly [19:37:38] I have no idea of that setup honestly [19:37:45] hmm, right [19:37:45] _joe_: is it a MW exception/fatal or jobrunner specific? [19:37:59] matanya: root@sodium:~# ps aux | grep mailman | grep -v grep | wc -l [19:37:59] 9 [19:38:17] YuviPanda: so backstory, my patches are essentially a custom handler and logic to support [19:38:23] thanks ottomata, i need the cmd line as well :) [19:38:24] flushing by collector groups [19:38:36] which upstream didn't see value in, and I needed at one time for debug and sanity [19:38:50] been considering just killing it if it's not that valuable here / anymore [19:38:53] but atm that's the deal [19:39:04] hmm, right [19:39:07] we pass the collector name to the handler and it flushes a queue specific to that [19:39:10] etc etc [19:39:41] none of my patches are really needed either - exim not having to sudo as root would be nice, but not mandatory. And graphite is on hold while the machine goes through the procurement queues... [19:39:52] it's just that... slightly unresponsive upstreams make me a bit queasy :) [19:40:19] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2033: active_shards: 6098: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:40:19] RECOVERY - ElasticSearch health check on elastic1018 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2033: active_shards: 6098: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:40:28] I don't know the ppl other than kormoc and I think him starting that new twitter job has bogged him down a lot [19:40:48] but if you want to integrate locally in anticipation of upstream following suit [19:40:51] I don't mind [19:40:58] but that stuff all has a rolling cost [19:41:00] matanya, its there no? [19:41:02] ps aux | grep mailman | grep -v grep | wc -l [19:41:29] ottomata: i mean the output of ps, with running procs [19:41:35] oh [19:41:52] isn't that called the cmd line ? [19:42:01] https://gist.github.com/ottomata/7a1a79c5f18db5e82214 [19:42:03] AaronSchulz: if I can help you out with the debugging in any way, let me know [19:42:49] thanks a lot ottomata now i can push a monitoring patch [19:43:38] chasemp: yeah, don't want to accidentally end up with a fork [19:47:44] YuviPanda: it hasn't been that long has it? I would give it a few weeks honestly [19:48:04] ottomata, I have a quick question re deb repos in operations/: did you just ask for a new gerrit repo with ops-only merge rights? [19:48:14] small project like this I don't see local patching for our needs as a big minus, especially if we are pursuing integration upstream [19:48:33] chasemp: no hasn't been, I was just looking at our current setup (debs et all). Considering ours isn't significantly patched, I'd give it a month. [19:49:00] if, however, we had 10 biggish patches on top, then that's different :) so was just checking [19:49:07] gotcha [19:49:32] chasemp: there's still discussion about ssd vs spinning disk in the RT, so do chime in if you have opinions :) [19:49:42] not sure where that is [19:49:46] that ticket specifically [19:51:50] dogeydogey: are you still working on the deb pkg for extjs-2.2.x ? [19:52:00] mary nope [19:52:52] chasemp: https://rt.wikimedia.org/Ticket/Display.html?id=7814 [19:53:03] dogeydogey: any existing packaing config files would be useful :) [19:53:06] packaging [19:53:35] mary i don't have anything :( [19:54:16] (03PS1) 10Matanya: mailman: monitor number of running processes [operations/puppet] - 10https://gerrit.wikimedia.org/r/146526 [19:55:39] dogeydogey: k - trying to build a package using existing packaging info for 3.3.x [19:55:55] !log Applied extensions/UploadWizard/UploadWizard.sql to rowiki (re bug 59242) [19:55:59] Logged the message, Master [19:57:34] greg-g: Crap [19:57:36] Sorry [20:00:42] (03PS1) 10Reedy: Commit db1035-related live hack [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146527 [20:00:44] (03PS1) 10Reedy: Non Wikipedias to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146528 [20:01:38] (03PS2) 10Reedy: Commit db1035-related live hack [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146527 [20:01:43] (03CR) 10Reedy: [C: 032] Commit db1035-related live hack [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146527 (owner: 10Reedy) [20:01:51] (03Merged) 10jenkins-bot: Commit db1035-related live hack [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146527 (owner: 10Reedy) [20:02:02] Reedy: luckily we're good for another hour [20:02:26] Flow not using their window? [20:02:29] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [20:02:37] chasemp: can you review a patch please ? 
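The "flush by collector groups" patch chasemp describes above boils down to keeping one buffer per collector and flushing each buffer on its own, rather than one global queue. Below is a minimal, self-contained sketch of that idea; it deliberately does not use Diamond's real handler API, so the class name, the path-parsing rule, and the batch size are illustrative assumptions, not what the local patches actually do.

    from collections import defaultdict

    class PerCollectorQueueHandler(object):
        """Buffer metrics per collector group and flush each group independently."""

        def __init__(self, batch_size=100):
            self.batch_size = batch_size        # assumed flush threshold
            self.queues = defaultdict(list)     # collector name -> pending metrics

        def process(self, metric_path, value):
            # Assumed path layout, e.g. 'servers.mw1001.cpu.total.idle' -> group 'cpu'.
            parts = metric_path.split('.')
            collector = parts[2] if len(parts) > 2 else metric_path
            queue = self.queues[collector]
            queue.append((metric_path, value))
            if len(queue) >= self.batch_size:
                self.flush(collector)

        def flush(self, collector):
            # Stand-in for the real send (graphite line protocol, statsd, ...).
            for path, value in self.queues[collector]:
                print('%s %s' % (path, value))
            self.queues[collector] = []

Carrying a small local patch like this is the "rolling cost" being weighed in the conversation: cheap while upstream is quiet, but easy to drift into an accidental fork.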
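matanya's monitoring change (Gerrit 146526) is, per the conversation, essentially the "ps aux | grep mailman | grep -v grep | wc -l" one-liner turned into a check. A rough Python equivalent might look like the following; the expected count of 9 is simply what ottomata saw on sodium in this log, and the exit codes are assumptions rather than what the merged puppet check necessarily does.

    import subprocess
    import sys

    EXPECTED = 9  # number of mailman processes ottomata counted on sodium

    ps_output = subprocess.check_output(['ps', 'aux']).decode('utf-8', 'replace')
    count = sum(1 for line in ps_output.splitlines()
                if 'mailman' in line and 'grep' not in line)

    if count == EXPECTED:
        print('OK: %d mailman processes running' % count)
        sys.exit(0)
    print('WARNING: %d mailman processes running (expected %d)' % (count, EXPECTED))
    sys.exit(1)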
[20:02:44] they arne't until 2pm Pacific/9 UTC [20:03:03] matanya: link? probably get to in the morning most likely tho. but sure [20:03:03] (03CR) 10Anomie: "I was hoping whoever was responsible would come along and give it a better commit message. Or just remove it, if they intended to undo it " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146527 (owner: 10Reedy) [20:03:05] (03PS1) 10Jkrauska: Change ns1 ip to new record [operations/dns] - 10https://gerrit.wikimedia.org/r/146529 [20:03:18] chasemp: https://gerrit.wikimedia.org/r/#/c/146526/ [20:04:06] ah two things actually, I'm rush in gerrit / ldap and I don't know much about mailman so probably not the right person :) [20:04:16] (03CR) 10Reedy: "Guess it was fairly major and Sean was more concentrated on fixing stuff up" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146527 (owner: 10Reedy) [20:04:24] cmjohnson1: hello! [20:04:25] https://gerrit.wikimedia.org/r/146529 [20:04:40] hi cajoel [20:04:59] there will only be one more fixup after that goes out. [20:05:06] (03CR) 10Cmjohnson: [C: 032] Change ns1 ip to new record [operations/dns] - 10https://gerrit.wikimedia.org/r/146529 (owner: 10Jkrauska) [20:05:07] thanks anyway chasemp [20:05:36] cajoel...merged [20:05:53] does dns auto-deploy? [20:07:51] I see it live... [20:08:00] so you must have done that too. [20:09:13] anybody got experience tuning drives with hdparm? [20:09:39] manybubbles noticed that 3 of the elastic nodes have a different multiple sector transfer setting [20:09:59] and those 3 have more iowait...they also have less memory so less cache and more io to do in general [20:10:12] we're thinking of experimenting with changing this setting, but I wanted to ask if anyone had experience with it first [20:10:24] !log restarted logstash on logstash1001; log volume looked to be down from "normal" [20:10:29] Logged the message, Master [20:11:25] (03PS2) 10Reedy: Non Wikipedias to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146528 [20:12:00] !log log volume up after logstash restart [20:12:08] Logged the message, Master [20:12:24] * greg-g turns up the jams [20:12:24] cajoel yep i merged and updated dns [20:12:57] !log Reloading Zuul to deploy If2312bcf18bdbe8dee [20:13:02] Logged the message, Master [20:16:05] (03CR) 10Reedy: [C: 032] Non Wikipedias to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146528 (owner: 10Reedy) [20:16:11] (03Merged) 10jenkins-bot: Non Wikipedias to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146528 (owner: 10Reedy) [20:17:21] (03PS1) 10Ori.livneh: jobrunner: inform trebuchet of service name [operations/puppet] - 10https://gerrit.wikimedia.org/r/146587 [20:17:26] ori: so it's just git deploy start, git pull, git deploy sync, on tin in /srv/deployment/jobrunner/jobrunner ? [20:18:05] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.24wmf13 [20:18:09] Logged the message, Master [20:18:13] AaronSchulz: yep [20:18:22] AaronSchulz: go for it [20:18:41] the service won't get restarted automatically (i just submitted a change for that above), but i'll do it via salt once you're done [20:19:26] Missing the following configuration item: user.name [20:19:49] ~/.gitconfig [20:20:04] any convention on names? 
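On the hdparm question above: the "multiple sector transfer" setting manybubbles noticed is what hdparm reports as multcount, readable per drive with `hdparm -m`. A sketch for comparing it across the local drives of one host follows; the device list is a placeholder, the regex assumes output shaped like "multcount = 16 (on)", and it needs root.

    import re
    import subprocess

    DEVICES = ['/dev/sda', '/dev/sdb']   # placeholder: the drives on an elastic10xx host

    def multcount(device):
        out = subprocess.check_output(['hdparm', '-m', device]).decode('utf-8', 'replace')
        match = re.search(r'multcount\s*=\s*(\d+)', out)
        return int(match.group(1)) if match else None

    settings = dict((dev, multcount(dev)) for dev in DEVICES)
    print(settings)
    if len(set(settings.values())) > 1:
        print('WARNING: multiple sector transfer (multcount) differs across drives')

The write form of the same flag (e.g. hdparm -m16 /dev/sdX) is what the proposed experiment would change.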
[20:20:05] $ git config --global user.name "John Doe" [20:20:05] $ git config --global user.email johndoe@example.com [20:20:41] must be a pokemon [20:20:54] but no water pokemon [20:21:01] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0133333333333 [20:21:03] but you're free to choose within those constraints [20:21:47] also just enter 'y' on all the prompts, they're confusing as hell [20:21:48] !log reedy Synchronized docroot and w: (no message) (duration: 00m 21s) [20:21:53] Logged the message, Master [20:22:02] hmm, so my password is now y? [20:22:11] yep [20:22:31] huh, no auto IRC or SAL entries [20:22:45] YuviPanda: i imagine it's old news but https://www.youtube.com/watch?v=EiIxkOah09E&list=UUpU9EZn1Ll9kPpSuBsn4VyA was hilarious [20:22:50] AaronSchulz: yep [20:23:02] !log Deployed /srv/jobrunner to 31e54c564d369e89613db48977eec0a5891b6498 [20:23:06] Logged the message, Master [20:23:14] i'll restart the service [20:23:32] ori: :D not *that* old :) might be realistic soon, tho. [20:24:45] !log restarted jobrunner on all jobrunners [20:24:49] Logged the message, Master [20:24:56] AaronSchulz: salt -G 'cluster:jobrunner' cmd.run 'service jobrunner restart' [20:24:58] incase you were curious [20:25:15] * YuviPanda should setup salt for toollabs [20:25:23] ori: I wish I could tail the upstart log :? [20:26:02] i'll fix [20:26:50] dogeydogey, ottomata - i have a crude deb pkg (derived from current debian pkg files for 3.x) - is there a recommended way to share it with you ? [20:28:12] ori, AaronSchulz: the job got un-stuck, thanks! [20:28:20] hoo: ^ [20:28:29] not sure I did anything [20:29:26] dunno :P [20:29:38] :P [20:29:39] Nice [20:31:03] <_joe_> ori: hey I'm here [20:31:22] <_joe_> need help with the jobrunners? [20:31:51] oo, mary, ummmm, you can zip and email to me? or put it up online somewhere? [20:32:01] github? or whatever [20:32:06] _joe_: yes, actually [20:32:32] things look good, but i'd like aaron to be able to view the log files. i chmodded the file right now as a hack, but we should have some persistent solution [20:32:47] probably redirect output to /var/log/mediawiki/jobrunner.log, and logrotate that [20:32:47] <_joe_> mmmh ok, let me check [20:33:10] <_joe_> yes that would be better IMO [20:33:32] if you can hang around for a minute i'll submit a patch for that [20:33:34] <_joe_> but we may also change the way upstart works [20:33:48] <_joe_> also, please don't chmod files [20:33:50] there's also https://gerrit.wikimedia.org/r/#/c/146587/ [20:34:12] _joe_: ok, i'll chmod them back to their old mode [20:34:29] <_joe_> ori: not necessary, upstart will do that [20:34:34] <_joe_> when rotating [20:34:40] all i did was /var/log/mediawiki/jobrunner.log 0640 -> 0644 [20:34:43] nod [20:34:44] <_joe_> new files will have default permissions :) [20:34:55] hence "we should have some persistent solution" ;) [20:35:11] anyways, writing a patch [20:35:11] sec [20:35:49] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: inform trebuchet of service name [operations/puppet] - 10https://gerrit.wikimedia.org/r/146587 (owner: 10Ori.livneh) [20:38:04] ori: why isn't the password var resolving? [20:38:32] * ori looks [20:38:51] <_joe_> AaronSchulz: where? puppet? 
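The salt one-liner ori pastes above can also be driven from salt's Python API on the master. A sketch only, assuming the salt release in use at the time still takes expr_form='grain' (later releases renamed it tgt_type) and that the caller is permitted to publish jobs:

    import salt.client

    client = salt.client.LocalClient()
    results = client.cmd(
        'cluster:jobrunner',              # same grain target as -G on the CLI
        'cmd.run',
        ['service jobrunner restart'],
        expr_form='grain',
    )
    for minion, output in sorted(results.items()):
        print('%s: %s' % (minion, output))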
[20:38:54] blech [20:39:22] <_joe_> oh ok [20:39:25] <_joe_> lemme see [20:39:41] needs include passwords::redis [20:40:30] (03PS1) 10Ori.livneh: jobrunner: include passwords:redis, to populate template var [operations/puppet] - 10https://gerrit.wikimedia.org/r/146592 [20:40:39] ^ _joe_ [20:40:47] <_joe_> ori: ok [20:41:40] <_joe_> ori: lemme check please [20:41:59] nod [20:42:42] (03PS2) 10Giuseppe Lavagetto: jobrunner: include passwords:redis, to populate template var [operations/puppet] - 10https://gerrit.wikimedia.org/r/146592 (owner: 10Ori.livneh) [20:42:53] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] jobrunner: include passwords:redis, to populate template var [operations/puppet] - 10https://gerrit.wikimedia.org/r/146592 (owner: 10Ori.livneh) [20:44:02] <_joe_> ori: running puppet to reload the jobrunners with the new config [20:44:11] thank you very much [20:44:36] ottomata: a bit more nudging please. what is the output of : ls -la /var/lib/mailman/qfiles/ on sodium [20:44:47] and of /var/lib/mailman/bin/show_qfiles in [20:44:59] ottomata: done ( /msg) [20:46:18] _joe_: is the loop actually running as apache? [20:46:32] <_joe_> AaronSchulz: now it should start working. [20:46:40] <_joe_> AaronSchulz: don't remember, lemme check [20:46:41] upstart log looked sad :( [20:47:04] <_joe_> it is [20:47:41] <_joe_> AaronSchulz: should I stop it from running? [20:48:01] <_joe_> I can kill it, disable puppet, and we can look into it [20:48:16] <_joe_> AaronSchulz: it's run as that user, but how is that check done? [20:48:25] <_joe_> I mean in the code [20:48:31] in PHP code, I think MWScript.php [20:49:21] $info = posix_getgrgid( $gid ); [20:49:22] if ( $info && in_array( $info['name'], array( 'sudo', 'wikidev', 'root' ) ) ) { [20:49:28] checks that for each group the user has [20:49:31] <_joe_> grid [20:49:33] <_joe_> ok [20:49:41] right, group not name [20:49:43] <_joe_> the job runs as apache:root [20:50:02] <_joe_> probably [20:50:04] <_joe_> lemme check [20:51:04] <_joe_> no it runs as apache:wikidev afaict [20:51:12] <_joe_> it should [20:51:22] <_joe_> AaronSchulz: is it ok for it to run? [20:52:02] you mean right now? It just spams the log, nothing else [20:52:20] odd that apache:wikidev wouldn't work though [20:53:54] ori: wouldn't apache:apache make more sense? [20:54:00] oh, wait, duh of course wikidev won't work [20:54:05] * AaronSchulz read that backwards [20:54:18] <_joe_> wikidev won't work? [20:54:19] I don't think anything automatic should be wikidev [20:54:53] that check excludes root, wikidev, and sudo [20:54:56] <_joe_> AaronSchulz: I thought that was a standard [20:55:04] <_joe_> oh lol [20:55:13] <_joe_> ok so change its group to apache? [20:55:18] <_joe_> let me write a patch [20:56:07] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [20:57:00] (03CR) 10JanZerebecki: [C: 031] "Nitpick: For consistency reasons you could also remove kEDH+AESGCM from the cipher list like said in the comments on the commit this was b" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) (owner: 10Chmarkine) [20:58:29] (03PS1) 10Giuseppe Lavagetto: jobrunner: run as the apache user [operations/puppet] - 10https://gerrit.wikimedia.org/r/146598 [20:58:38] chasemp: a bit more nudging please. 
what is the output of : ls -la /var/lib/mailman/qfiles/ on sodium [20:58:38] and of /var/lib/mailman/bin/show_qfiles in [20:59:10] (03CR) 10Aaron Schulz: [C: 031] jobrunner: run as the apache user [operations/puppet] - 10https://gerrit.wikimedia.org/r/146598 (owner: 10Giuseppe Lavagetto) [21:00:04] spagewmf: The time is nigh to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140715T2100) [21:00:07] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: run as the apache user [operations/puppet] - 10https://gerrit.wikimedia.org/r/146598 (owner: 10Giuseppe Lavagetto) [21:03:45] everybody seems to be very busy. maybe Jeff_Green or RobH can help with this request ? [21:03:48] <_joe_> AaronSchulz: mw1001 looks better [21:04:14] matanya: [21:04:27] /var/lib/mailman/bin/show_qfiles needs an argument [21:04:39] Example: show_qfiles qfiles/shunt/*.pck [21:04:46] the argument is "in" [21:05:33] root@sodium:~# /var/lib/mailman/bin/show_qfiles in [21:05:33] ====================> in [21:05:33] Traceback (most recent call last): [21:05:33] File "/var/lib/mailman/bin/show_qfiles", line 95, in [21:05:33] main() [21:05:34] File "/var/lib/mailman/bin/show_qfiles", line 81, in main [21:05:34] fp = open(filename) [21:05:35] IOError: [Errno 2] No such file or directory: 'in' [21:05:35] no it needs files [21:06:12] oh, i understand what you mean [21:06:20] ah yes same result for me actually [21:06:27] was trying to figure out what I was doing wrong [21:07:23] sorry, i meant the argument is the in the output of ls -la /var/lib/mailman/qfiles/ [21:07:42] i.e. one of the qfiles lists [21:08:02] too late to be clear :/ [21:08:31] there's nothing in the in/ dir [21:08:50] so /var/lib/mailman/qfiles/ is empty ? [21:08:53] three is in the archive folder [21:09:01] but the output is a lot of ppl's emails I probably can't release [21:09:16] there is^ [21:09:26] this is weird [21:09:27] /var/lib/mailman/qfiles/in is empty [21:09:39] is very populated /var/lib/mailman/qfiles/bad [21:10:24] chasemp: and if you run /var/lib/mailman/bin/show_qfiles on /var/lib/mailman/qfiles/bad ? [21:11:15] 1219 entries in bad/ [21:11:21] and it dumps all teh raw emails [21:11:47] <_joe_> the in and out dir should be empty most of the times [21:11:57] <_joe_> chasemp: pickled [21:11:57] so i need only the number output [21:12:23] <_joe_> matanya: check if the in or out queue exceed a certain threshold [21:12:29] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:12:55] I see in processing, or at least files come and go [21:12:59] assuming taht's a staging dir [21:13:01] _joe_: that is what i'm trying to do, but very hard without shell to check a *threshold* [21:13:29] (03PS1) 10Ori.livneh: Add logrotation for /var/log/mediawiki/* [operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 [21:14:42] so chasemp how did you get the number? only ran the command ? [21:15:09] matanya: what number do you mean? entries in bad/ [21:15:12] ls | wc -l [21:15:13] yes [21:15:19] !log to test r146607, locally modified upstart conf for jobrunner on mw1001 to log to /var/log/mediawiki, and restarted service [21:15:25] Logged the message, Master [21:15:44] _joe_: https://gerrit.wikimedia.org/r/#/c/146607/ [21:15:55] isn't it super late for you btw? 
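The snippet AaronSchulz quotes above is why the jobrunner had to move from apache:wikidev to apache:apache: MWScript refuses to run when any of the process's groups is a privileged one. A stdlib-only Python illustration of that rule (not MediaWiki's actual code) looks roughly like this:

    import grp
    import os

    FORBIDDEN = {'sudo', 'wikidev', 'root'}

    group_names = set()
    for gid in os.getgroups() + [os.getgid()]:
        try:
            group_names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            pass

    privileged = group_names & FORBIDDEN
    if privileged:
        raise SystemExit('refusing to run with privileged group(s): %s'
                         % ', '.join(sorted(privileged)))
    print('ok to run as groups: %s' % ', '.join(sorted(group_names)))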
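_joe_'s suggestion above — check whether the in or out queue exceeds a threshold — and matanya's ls | wc -l both reduce to counting the pickled .pck files per queue directory. Here is one way that could look as a Nagios-style check; the directory names come from the conversation on sodium, while the thresholds are made-up placeholders and not anything from the merged monitoring patch.

    import glob
    import os
    import sys

    QFILES = '/var/lib/mailman/qfiles'
    # (warn, crit) pairs are placeholders, not production values.
    THRESHOLDS = {'in': (10, 50), 'out': (10, 50), 'bad': (100, 1000)}

    worst = 0
    parts = []
    for queue in sorted(THRESHOLDS):
        warn, crit = THRESHOLDS[queue]
        count = len(glob.glob(os.path.join(QFILES, queue, '*.pck')))
        parts.append('%s=%d' % (queue, count))
        if count >= crit:
            worst = max(worst, 2)
        elif count >= warn:
            worst = max(worst, 1)

    print('MAILMAN QUEUES %s: %s'
          % ({0: 'OK', 1: 'WARNING', 2: 'CRITICAL'}[worst], ' '.join(parts)))
    sys.exit(worst)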
[21:15:55] so you didn't use the /var/lib/mailman/bin/show_qfiles command [21:16:18] ori: you lost the right to ask that a long time ago :P [21:16:31] matanya: you can only use that command on a pck file [21:16:35] <_joe_> ori: it is [21:16:37] if I point it at the dir it bails, I can do bad/* [21:16:43] but it's far too much output to be useful [21:16:51] I don't know what you are looking for here [21:16:54] (03PS2) 10Ori.livneh: Add logrotation for /var/log/mediawiki/* [operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 [21:16:58] <_joe_> ori: the commit is missing the logrotate file [21:16:58] so i'll stick to ls |wc -l thanks! [21:17:10] <_joe_> maybe not anymore [21:17:19] _joe_: i just caught that, sorry. yeah, the updated patch includes it. [21:19:19] <_joe_> ori: logrotate will mess with upstart I guess [21:20:20] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time [21:20:26] greg-g: ping [21:20:26] aude: You sent me a contentless ping. This is a contentless pong. Please provide a bit of information about what you want and I will respond when I am around. [21:20:31] bah [21:20:51] we forgot to update wmf13 submodule for our fix last week for entity search [21:20:56] so it's broke again [21:21:05] broken* [21:21:12] would anyone care if i do that now? [21:21:35] * aude check deployment calendar [21:22:08] aude: bah, you're right, it was all my fault... for wmf13 I forgot the submodule update [21:22:13] it happens [21:22:45] I usually do git status after I'm done to verify but forgot this time [21:22:57] (it shows you uncommited changes if you don't submodule update [21:22:59] ) [21:23:00] * aude waits a few minutes, but can't really stay awake for swat [21:23:22] aude: doit [21:23:26] ok :) [21:26:16] !log aude Synchronized php-1.24wmf13/extensions/Wikidata: Update submodule to fix entity search issue on Wikidata (duration: 00m 21s) [21:26:22] Logged the message, Master [21:26:48] done [21:27:19] thanks, aude [21:27:47] sure [21:27:56] glad it was easy [21:34:25] <_joe_> !log disabling puppet on mw1001, tests [21:34:30] Logged the message, Master [21:45:27] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This can't work." [operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 (owner: 10Ori.livneh) [21:48:07] (03PS1) 10Chad: Disabling Special:Random integration with Cirrus for now [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146615 [21:52:59] (03CR) 10Ori.livneh: "hmm! good to know. what about telling logrotate to copytruncate?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 (owner: 10Ori.livneh) [22:31:59] (03PS1) 10BryanDavis: labs_vagrant: Ensure that lvm volume is mounted first [operations/puppet] - 10https://gerrit.wikimedia.org/r/146632 [22:45:24] (03CR) 10Andrew Bogott: [C: 04-1] labs_vagrant: Ensure that lvm volume is mounted first (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146632 (owner: 10BryanDavis) [22:57:26] (03PS9) 10Withoutaname: Reduce string URLs to defined constant [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131914 (https://bugzilla.wikimedia.org/48618) [22:57:32] (03CR) 10BryanDavis: labs_vagrant: Ensure that lvm volume is mounted first (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146632 (owner: 10BryanDavis) [23:00:04] RoanKattouw, mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140715T2300) [23:00:14] sure [23:00:34] (03PS15) 10Withoutaname: Delete ve.wikimedia.org and leave redirect [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131907 (https://bugzilla.wikimedia.org/55737) [23:01:57] <^d> MaxSem: I can do my own I listed. [23:03:12] (03CR) 10Andrew Bogott: labs_vagrant: Ensure that lvm volume is mounted first (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146632 (owner: 10BryanDavis) [23:06:53] (03PS2) 10BryanDavis: labs_vagrant: Ensure that lvm volume is mounted first [operations/puppet] - 10https://gerrit.wikimedia.org/r/146632 [23:07:36] (03CR) 10BryanDavis: labs_vagrant: Ensure that lvm volume is mounted first (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146632 (owner: 10BryanDavis) [23:13:46] !log maxsem Synchronized php-1.24wmf13/extensions/CirrusSearch/: https://gerrit.wikimedia.org/r/#q,146471,n,z (duration: 00m 04s) [23:13:52] Logged the message, Master [23:14:45] !log maxsem Synchronized php-1.24wmf13/includes/specials/SpecialVersion.php: (no message) (duration: 00m 04s) [23:14:51] Logged the message, Master [23:15:31] Reedy, your fix is live [23:16:29] !log maxsem Synchronized php-1.24wmf12/extensions/CirrusSearch/: https://gerrit.wikimedia.org/r/#q,146471,n,z (duration: 00m 05s) [23:16:35] Logged the message, Master [23:16:59] (03CR) 10MaxSem: [C: 032] Disabling Special:Random integration with Cirrus for now [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146615 (owner: 10Chad) [23:17:07] (03Merged) 10jenkins-bot: Disabling Special:Random integration with Cirrus for now [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146615 (owner: 10Chad) [23:19:26] thanks MaxSem [23:19:36] not yet [23:20:07] (03PS1) 10Jkrauska: Final change for corp dns - change ttl back to 1h [operations/dns] - 10https://gerrit.wikimedia.org/r/146647 [23:20:26] !log maxsem Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/146615/ (duration: 00m 04s) [23:20:30] Logged the message, Master [23:20:42] ^d and manybubbles, deployed [23:20:44] cmjohnson1:final dns change [23:20:44] https://gerrit.wikimedia.org/r/146647 [23:20:48] <^d> MaxSem: ty! [23:22:05] done, apparently [23:22:26] <^d> Oh yeah [23:22:29] <^d> Ganglia shows it [23:26:42] jgage: can you do a dns deploy? 
[23:28:24] <^d> MaxSem: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=elastic1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=Elasticsearch+cluster+eqiad :P [23:28:59] oh come on, it was just at 40% [23:29:15] and you disabled an useful feature! :P [23:33:04] <^d> MaxSem: Well, it just went back to doing what it did before :p [23:36:00] <^d> Oh well, it was a fun experiment for a bit. Maybe we'll play with it again soon. [23:39:10] ^d: so the Japanese click Special:Random much more than others? [23:41:03] noted on https://bugzilla.wikimedia.org/show_bug.cgi?id=65366 [23:41:19] <^d> No. It was on all wikis, even those in secondary. [23:41:22] <^d> So mostly enwiki. [23:41:30] <^d> But jawiki was causing its own load issues. [23:41:33] oki [23:41:57] <^d> Turning off random gives us some breathing room. [23:51:03] <^d> Nemo_bis: RandomLoad.png. [23:51:08] <^d> Filename made me chuckle. [23:55:37] (03PS1) 10Phuedx: Disable the anonymous signup invite experiment [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146651 [23:55:47] (03CR) 10Phuedx: [C: 04-1] Disable the anonymous signup invite experiment [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146651 (owner: 10Phuedx)