[00:09:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[00:09:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[00:09:57] whew, all better ;)
[00:12:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103317)
[00:12:31] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103314)
[00:12:35] (PS1) Ryan Lane: Make appserver common a mediawiki deploy target [operations/puppet] - https://gerrit.wikimedia.org/r/94832
[00:28:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[00:28:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[00:32:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103838)
[00:32:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103831)
[00:44:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[00:44:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[01:05:50] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours
[01:10:49] (PS4) Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - https://gerrit.wikimedia.org/r/94688
[01:12:22] (PS4) Ryan Lane: Add mediawiki module for fetch and checkout hooks [operations/puppet] - https://gerrit.wikimedia.org/r/94682
[01:12:41] (PS5) Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - https://gerrit.wikimedia.org/r/94688
[01:27:29] !log restart db1050 mariadb after outage, let repl catch up. new lvm snaps mount ok. leave out of pool for now
[01:27:49] Logged the message, Master
[01:34:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103110)
[01:34:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103081)
[01:37:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[01:37:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[01:40:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (100854)
[01:40:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (100846)
[01:42:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[01:42:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[01:46:50] PROBLEM - Puppet freshness on sq48 is CRITICAL: No successful Puppet run in the last 10 hours
[01:49:36] (PS1) Springle: track client/user/table/index stats for audit. disable excess warnings for mariadb until we switch to RBR. [operations/puppet] - https://gerrit.wikimedia.org/r/94841
[01:50:49] (CR) Springle: [C: 2] track client/user/table/index stats for audit. disable excess warnings for mariadb until we switch to RBR. [operations/puppet] - https://gerrit.wikimedia.org/r/94841 (owner: Springle)
[01:58:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (109860)
[01:58:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (110003)
[02:05:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[02:05:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[02:15:04] !log LocalisationUpdate completed (1.23wmf3) at Tue Nov 12 02:15:03 UTC 2013
[02:15:22] Logged the message, Master
[02:16:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104802)
[02:16:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104789)
[02:20:54] !log LocalisationUpdate completed (1.23wmf2) at Tue Nov 12 02:20:54 UTC 2013
[02:21:10] Logged the message, Master
[02:25:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[02:25:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[02:33:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (100432)
[02:33:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (100426)
[02:34:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[02:34:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[02:39:30] PROBLEM - MySQL Processlist on db1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:41:20] RECOVERY - MySQL Processlist on db1002 is OK: OK 1 unauthenticated, 0 locked, 4 copy to table, 9 statistics
[02:42:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104180)
[02:42:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104103)
[02:49:20] PROBLEM - MySQL Processlist on db1002 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 7 copy to table, 222 statistics
[02:50:20] RECOVERY - MySQL Processlist on db1002 is OK: OK 1 unauthenticated, 0 locked, 4 copy to table, 7 statistics
[02:52:10] PROBLEM - MySQL Idle Transactions on db1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:00] RECOVERY - MySQL Idle Transactions on db1002 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[03:00:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[03:00:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[03:05:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (108151)
[03:06:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104566)
[03:09:30] PROBLEM - MySQL Processlist on db1002 is CRITICAL: CRIT 1 unauthenticated, 0 locked, 5 copy to table, 312 statistics
[03:11:30] RECOVERY - MySQL Processlist on db1002 is OK: OK 0 unauthenticated, 0 locked, 5 copy to table, 1 statistics
[03:11:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[03:11:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[03:12:05] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Nov 12 03:12:05 UTC 2013
[03:12:23] Logged the message, Master
[03:14:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102136)
[03:14:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102073)
[03:21:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[03:21:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[03:24:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101860)
[03:24:40] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:24:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101814)
[03:26:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time
[03:29:30] PROBLEM - MySQL Processlist on db1002 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 6 copy to table, 152 statistics
[03:32:30] RECOVERY - MySQL Processlist on db1002 is OK: OK 0 unauthenticated, 0 locked, 4 copy to table, 2 statistics
[03:32:36] (PS1) Springle: aim for at least 3 equivalent slaves on shards not using groupLoadsByDB [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94843
[03:33:11] (CR) Springle: [C: 2] aim for at least 3 equivalent slaves on shards not using groupLoadsByDB [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94843 (owner: Springle)
[03:34:27] !log springle synchronized wmf-config/db-eqiad.php 'slave balancing'
[03:34:44] Logged the message, Master
[03:37:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[03:37:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[03:38:40] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:39:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.939 second response time
[03:49:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104886)
[03:49:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104649)
[03:52:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[03:52:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[03:55:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102874)
[03:55:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102817)
[03:57:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[03:57:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[04:05:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (106488)
[04:05:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (106405)
[04:14:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[04:14:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[04:17:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101970)
[04:17:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102936)
[05:04:10] PROBLEM - MySQL Idle Transactions on db1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:04:10] PROBLEM - MySQL InnoDB on db1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:06:10] RECOVERY - MySQL InnoDB on db1002 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[05:06:10] RECOVERY - MySQL Idle Transactions on db1002 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[06:09:43] (PS1) Tim Starling: Disable client idle disconnection [operations/puppet] - https://gerrit.wikimedia.org/r/94848
[06:29:20] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 100,000
[06:32:20] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:58:08] (PS1) Springle: move recache jobs to snapshot host in future [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94851
[06:59:45] (CR) Springle: [C: 2] move recache jobs to snapshot host in future [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94851 (owner: Springle)
[07:02:01] !log springle synchronized wmf-config/db-eqiad.php 'recache jobs on S2 to db1018'
[07:02:21] Logged the message, Master
[07:23:11] (CR) Ori.livneh: "Useful thread from the redis mailing list:" [operations/puppet] - https://gerrit.wikimedia.org/r/94848 (owner: Tim Starling)
[07:51:58] (PS1) Raimond Spekking: Temporary lift of IP cap for WikiCon 2013 in de/en.WP, Commons, de/en.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94860
[07:59:50] PROBLEM - Puppet freshness on amssq58 is CRITICAL: No successful Puppet run in the last 10 hours
[08:38:56] (CR) Tim Starling: "On rdb1003, netstat shows only 312 connections at present, 82 in TIME_WAIT, and tcpdump shows about 125 connections per second, so we are " [operations/puppet] - https://gerrit.wikimedia.org/r/94848 (owner: Tim Starling)
[08:48:20] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 100,000
[08:51:21] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:55:39] (PS1) ArielGlenn: monitoring for text-varnish wikipedia in eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/94868
[08:57:18] (CR) ArielGlenn: [C: 2] monitoring for text-varnish wikipedia in eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/94868 (owner: ArielGlenn)
[09:14:40] RECOVERY - Puppet freshness on neon is OK: puppet ran at Tue Nov 12 09:14:34 UTC 2013
[09:22:37] PROBLEM - Host sq48 is DOWN: PING CRITICAL - Packet loss = 100%
[09:23:09] morning
[09:23:18] springle: still here?
[09:24:25] paravoid: yep
[09:24:32] hey
[09:25:04] so, on sunday, even after your pt-kill jobs, there were still some load spikes on databases
[09:25:23] 90% CPU and such, they're apparent in ganglia
[09:25:31] I only saw them after the fact
[09:25:47] there were tons of "too many connection" errors in dberror, look at around 14:00 UTC iirc
[09:26:04] yes the 30s kill limit was too long
[09:26:18] my very cursory ishmael digging only showed the logpager query as an outlier, with 1.1% of queries / 72% of time (wtf...)
[09:28:23] actually, to be accurate, 30s was too long and the pt-kill interval at 5s was too long to catch the surge of SpecialAllpages::showToplevel
[09:28:43] twofold issue. they're now shorter
[09:29:43] ok
[09:29:49] as long as you're aware of it :)
[09:30:18] paravoid: did you see this http://aerosuidae.net/paste/22/52807966 (was in an email in Problem SQL thread)
[09:31:55] I didn't
[09:32:22] is that ishmael?
[09:32:30] hmmm, that looks like a much nicer way than the web intf
[09:32:35] hmmz
[09:32:39] now amssq58 is in trouble
[09:32:46] [497924.126440] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[09:32:49] again
[09:34:06] hm
[09:34:17] ms-be1003 had very weird symptoms last night
[09:34:29] kernel locks spinning forever
[09:34:34] 50% cpu 50% iowait
[09:34:44] load 190
[09:34:53] paravoid: yes, ishmael plus a cross-slave view
[09:34:56] maybe related?
[09:35:39] springle: so, what am I looking at?
[09:36:22] the last seven days aggregating query review history from one non-snapshot slave per cluster, looking for queries based on total time and rows examined
[09:36:48] showToplevel isn't even close to the top?
[09:37:06] no, this doesn't account for spikes
[09:37:13] right
[09:37:14] only volume over 7 days
[09:39:55] of these, the problem ones are some forms of LogPager (paging all user history forever is madness), Wikibase\TermSqlIndex::getMatchingIDs (fixed with reindexing mid last week), and SpecialAllpages::showToplevel
[09:40:07] problems == cause spikes
[09:40:21] nod
[09:40:41] logpager and showtoplevel are not new, though, right?
[09:41:18] maybe someone's crawling as theorized on list, who knows...
[09:41:42] have been fighting with logpager for months. showtoplevel is new in that i've not had it show up on this sort of list before
[09:41:50] but it's an old query
[09:45:43] !log rebooting amssq58 with sysrq-trigger
[09:46:02] Logged the message, Master
[09:46:21] paravoid: a parallel issue that's affected slave cpu is updatespecialpages jobs on terbium. working on getting those onto the snapshot slaves along with dumps
[09:46:58] right, the one I inadvertently fixed the other day
[09:47:54] mark: "cdn" 5xx are elevated since yesterday
[09:48:12] http://gdash.wikimedia.org/dashboards/reqerror/
[09:48:27] RECOVERY - Puppet freshness on amssq58 is OK: puppet ran at Tue Nov 12 09:48:22 UTC 2013
[09:50:00] i don't... see that?
[09:50:53] it's not very apparent, I only noticed because I looked at those graphs yesterday too
[09:51:02] so if you look at the -1 day one
[09:51:36] the blue line is usually close to zero, there's a small bump around 13:00 UTC which is the esams congestion issue
[09:51:51] i'll believe it, I think especially those gzip errors are suspicious
[09:51:54] then it's fixed shortly after that (by your change) and goes back to zero for a while
[09:52:02] then it starts going up around 16:00 again
[09:52:42] little before
[09:52:52] you switched traffic to varnish at 15:30
[10:16:52] so I think these amssq* boxes are dying with kmem_alloc errors because they're under memory pressure due to too many dirty pages
[10:16:58] I guess we should change the thresholds a little
[10:28:20] (PS3) Faidon Liambotis: Remove references to 'olivneh' account from node defs [operations/puppet] - https://gerrit.wikimedia.org/r/92267 (owner: Ori.livneh)
[10:28:32] (CR) Faidon Liambotis: [C: 2] Remove references to 'olivneh' account from node defs [operations/puppet] - https://gerrit.wikimedia.org/r/92267 (owner: Ori.livneh)
[10:30:38] (Abandoned) Faidon Liambotis: Slight restructure for java module. [operations/puppet] - https://gerrit.wikimedia.org/r/74380 (owner: Ottomata)
[10:32:40] (CR) Faidon Liambotis: [C: -1] "The text in parentheses is actually quite useful. We have a very confusing (to some :) rule that we've even debated in the past that we us" [operations/puppet] - https://gerrit.wikimedia.org/r/81630 (owner: Ori.livneh)
[10:35:13] (CR) Faidon Liambotis: [C: -1] "Why do we neeed (tool)labsbeta.pp? Just kill that and rename labs.pp to toollabs.pp." [operations/puppet] - https://gerrit.wikimedia.org/r/84926 (owner: Yuvipanda)
[10:35:32] (PS1) ArielGlenn: remove rose (long gone); fix range comments [operations/dns] - https://gerrit.wikimedia.org/r/94876
[10:36:42] (CR) Faidon Liambotis: "Is this a -1 (improve by moving to dev_environ) or a -2 (do not submit)? If it's the latter, then abandon the change since it's over a mon" [operations/puppet] - https://gerrit.wikimedia.org/r/84288 (owner: DrTrigon)
[10:37:06] (CR) ArielGlenn: [C: 2] remove rose (long gone); fix range comments [operations/dns] - https://gerrit.wikimedia.org/r/94876 (owner: ArielGlenn)
[10:38:02] (Abandoned) Faidon Liambotis: Hopefully fix the Parsoid Varnishes not showing up as such in Ganglia [operations/puppet] - https://gerrit.wikimedia.org/r/69443 (owner: Catrope)
[10:39:40] (CR) Faidon Liambotis: [C: -1] "-1 because of what Reedy said. Max, you recently said you've been using this successfully, so let's fix it up and merge it?" [operations/puppet] - https://gerrit.wikimedia.org/r/38252 (owner: MaxSem)
[10:40:05] ahhh, remembering fail:P
[10:41:16] (CR) Faidon Liambotis: "Any progress, Andrew?" [operations/puppet] - https://gerrit.wikimedia.org/r/83960 (owner: Ottomata)
[10:43:37] (Abandoned) Faidon Liambotis: Adding IPv4/6 networks in ferms defs [operations/puppet] - https://gerrit.wikimedia.org/r/89791 (owner: Akosiaris)
[10:47:05] apergos: there's quite a lot of beta patchsets submitted by hashar, are you handling these?
[10:47:09] you're the one doing beta now, aren't you? :)
[10:47:19] uhh
[10:47:31] (and ci)
[10:47:40] what is ci?
[10:47:47] contint
[10:48:10] I can do the beta ones, he usually adds me as a reviewer if he wants me to look
[10:48:11] ci is the more industry-known abbreviation
[10:48:24] but I can rmeind him of that
[10:48:28] *remind
[10:48:36] he's here :)
[10:48:39] hashar:
[10:48:42] hashar: heeelllo :)
[10:48:56] hello
[10:49:02] if you want reviews from me on puppet changesets having to do with beta, add me as a reviewer
[10:49:09] (CR) Faidon Liambotis: [C: 1] "I like this and despite plans to deprecate decom.pp, I don't see this happening very soon. Rebase & merge?" [operations/puppet] - https://gerrit.wikimedia.org/r/93930 (owner: ArielGlenn)
[10:49:18] I would apergos :-]
[10:49:19] nagging me doesn't hurt but I do look at the dash once every few days
[10:49:32] been a bit too busy with CI for the last 5-6 weeks or so though
[10:49:47] who's reviewing those, hashar?
[10:50:14] https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+owner:%22Hashar+%253Chashar%2540free.fr%253E%22+status:open,n,z
[10:50:37] ops ? :-D
[10:50:43] I haven't nagged anyone for them
[10:51:00] 3 of them are related to an upgrade of Zuul, I wrote them last week
[10:51:09] haven't written down the migration plan yet
[10:51:21] as MaxSem and others know well, we generally suck at processing open patchsets with no reviewers set :)
[10:51:29] (PS3) Faidon Liambotis: rake validate now let puppet output colors [operations/puppet] - https://gerrit.wikimedia.org/r/77381 (owner: Hashar)
[10:51:43] (CR) Faidon Liambotis: [C: 2] rake validate now let puppet output colors [operations/puppet] - https://gerrit.wikimedia.org/r/77381 (owner: Hashar)
[10:51:48] one is about tweaking jobrunner / videoscaler roles which Ariel reviewed last week, still have to follow up though
[10:52:11] I just went through the list and you're on top of it ;)
[10:52:17] well
[10:52:21] because I send a ton of patches
[10:52:48] nowadays, most of my puppet changes are reviewed / merged quite fast
[10:53:22] hi apergos
[10:53:32] there is https://gerrit.wikimedia.org/r/65254 which is all about setting a symlink on beta :]
[10:54:42] hello aude
[10:54:53] apergos: i want to ask about json dumps for wikidata
[10:54:54] per https://bugzilla.wikimedia.org/show_bug.cgi?id=54369
[10:55:12] what can we do to move it forward :)
[10:55:14] ?
[10:55:23] would an RT ticket be helpful?
[10:56:10] no, it's not something I can just fold into the regular dumps, so it's another maintenance script that needs to be put somewhere, tested, then we argue about the frequency, where the output goes, etc
[10:56:22] where is that somewhere?
[10:56:39] terbium (?) or arsenic(?)
[10:56:46] maintenance scripts run on terbium I think (don't they?)
[10:56:49] ok
[10:57:08] I'm not sure what arsenic is being used for exactly, I know some cirrussearch stuff was happening there
[10:57:09] and then where the output goes?
[10:57:12] cirrus
[10:57:17] it's a new box for cirrus
[10:57:32] but might be used for cron jobs and scripts
[10:57:33] ok well this script doesn't belong there then
[10:57:38] oh
[10:57:40] :-D
[10:57:43] i'd have to ask chad
[10:57:47] right
[10:58:00] we can try terbium and if it's too much load, then find somewhere else
[10:58:27] we can test the setup with test.wikidata
[10:58:32] (Restored) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/89002 (owner: Hashar)
[10:58:32] although test.wikidata is very small
[10:58:36] (PS4) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/89002
[10:58:48] if you run it on that at least you flush out silly errors
[10:58:56] yes
[10:59:05] do you have terbium access?
[10:59:08] (CR) jenkins-bot: [V: -1] Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/89002 (owner: Hashar)
[10:59:15] i don't but can use puppet
[11:00:03] well first steps would be to run by hand over there once in a screen session and see how it behaves, for test.wikidata that is
[11:00:09] ok
[11:00:11] and then the same for wikidatawiki
[11:00:17] i could ask about shell access :)
[11:00:25] don't know if it's possible but think it would help
[11:00:28] how large do we expect these files to be? any notion?
[11:00:45] i don't know
[11:00:47] well.....
[11:00:54] similar to the current pages xml
[11:01:06] probably, though it's not quite as verbose
[11:01:20] let's see how big those are
[11:01:20] wikidatawiki-20131006-pages-meta-current.xml.bz2 1.8 GB
[11:01:27] ok that's not too bad
[11:01:29] and just items / properties
[11:01:47] then there's the matter of where it should get put when done
[11:01:56] and how often it should get produced
[11:01:59] then we'll need to make something like http://dumps.wikimedia.org/wikidatawiki/
[11:02:04] yep
[11:02:17] once a week is probably fine
[11:02:28] it's going to land in other/something
[11:02:36] we try not to clutter up the toplevel
[11:02:40] ok
[11:02:42] note the word "try"
[11:02:55] looks like just directory listing
[11:02:58] anyways, can we discuss this on the bug report? maybe summarize what's been said here so far
[11:03:02] ok
[11:03:09] and next step
[11:04:28] added to the bug report
[11:04:49] to try in terbium, it's either i try to get shell access and do it or need help
[11:05:20] thanks apergos :)
[11:05:58] to try in terbium, someone gives good instructions (see my comment just now on the report)
[11:06:00] yw
[11:06:08] ok
[11:10:28] hashar:
[11:10:56] is /data/project/apache or /data/project/apache/common-local set up anywhere?
[11:11:05] in the puppet manifests that is
[11:16:06] apergos: put the command in
[11:16:19] (PS1) Hashar: rake validate was failling on non tty [operations/puppet] - https://gerrit.wikimedia.org/r/94880
[11:17:04] apergos: not y
[11:17:08] grr
[11:17:22] aude, thanks
[11:18:19] apergos: /data/project… are not used directly
[11:18:20] we have symlinks all over the place
[11:18:20] not going to shard for this round. let's see what the script does normally.
[11:19:16] bah puppet linting is broken :D the rake file used to validate the lints is wrong. https://gerrit.wikimedia.org/r/#/c/94880/ should fix it
[11:19:37] I don't care that it's not used directly, that's not an issue
[11:19:59] sharding not needed for test wikidata
[11:20:06] but puppet should manage those two directories (in case you ever set up a new box right?)
[11:20:11] and can be experimented with for wikidata
[11:20:30] hmm just a second
[11:20:39] we want to pipe to bzip2
[11:20:49] apergos: yes
[11:21:08] --output won't let me do that
[11:21:18] you can omit it probably
[11:21:22] and then just piple
[11:21:23] pipe
[11:21:40] where do progress messages go in that case?
[11:21:51] hmmmm
[11:21:53] ok :)
[11:22:04] looking
[11:22:09] thanks
[11:24:20] (PS2) Dereckson: Temporary lift of IP cap for WikiCon 2013 in de/en.WP, Commons, de/en.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94860 (owner: Raimond Spekking)
[11:24:40] (CR) Dereckson: [C: 1] Throttle rule for WikiCon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94860 (owner: Raimond Spekking)
[11:27:03] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[11:28:47] (CR) Hashar: [C: 2] Throttle rule for WikiCon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94860 (owner: Raimond Spekking)
[11:28:59] (Merged) jenkins-bot: Throttle rule for WikiCon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94860 (owner: Raimond Spekking)
[11:29:57] !log hashar synchronized wmf-config/throttle.php 'thottle rule for WikiCon {{gerrit|94860}}'
[11:30:03] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (110729)
[11:30:15] Logged the message, Master
[11:30:18] !log hashar synchronized wmf-config/InitialiseSettings.php 'thottle rule for WikiCon {{gerrit|94860}}'
[11:30:19] (CR) Hashar: "deployed in production" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94860 (owner: Raimond Spekking)
[11:30:34] Logged the message, Master
[11:33:49] !log Manually set dirty_background_ratio to 5 (from 10) on amssq58
[11:34:06] Logged the message, Master
[11:36:15] I am off for lunch / nap
[11:36:23] ops/puppet validation is broken right now, https://gerrit.wikimedia.org/r/#/c/94880/ should fix it
[11:43:43] (CR) Faidon Liambotis: "This happens because redis first attempts to set the limit to 10032 (default maxclients = 10000 + 32 fds reserved for internal usage), fai" [operations/puppet] - https://gerrit.wikimedia.org/r/94848 (owner: Tim Starling)
[11:44:17] (CR) Faidon Liambotis: [C: 2] rake validate was failling on non tty [operations/puppet] - https://gerrit.wikimedia.org/r/94880 (owner: Hashar)
[11:49:57] (CR) ArielGlenn: "I would like to see declarations for the directories /data/project/apache and /data/project/apache/common-local, just as you have done for" [operations/puppet] - https://gerrit.wikimedia.org/r/65254 (owner: Hashar)
[12:28:30] (PS3) Mark Bergsma: Allow caching of login.wikimedia.org requests [operations/puppet] - https://gerrit.wikimedia.org/r/94765
[12:28:31] (PS1) Mark Bergsma: Filter out some noise requests [operations/puppet] - https://gerrit.wikimedia.org/r/94886
[12:30:04] (CR) Mark Bergsma: [C: 2 V: 2] Filter out some noise requests [operations/puppet] - https://gerrit.wikimedia.org/r/94886 (owner: Mark Bergsma)
[12:34:01] (PS4) Mark Bergsma: Allow caching of login.wikimedia.org requests [operations/puppet] - https://gerrit.wikimedia.org/r/94765
[12:34:02] (PS1) Mark Bergsma: req.request instead of req.method [operations/puppet] - https://gerrit.wikimedia.org/r/94887
[12:34:26] (CR) Mark Bergsma: [C: 2 V: 2] req.request instead of req.method [operations/puppet] - https://gerrit.wikimedia.org/r/94887 (owner: Mark Bergsma)
[13:12:15] PROBLEM - MySQL Processlist on db1002 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 178 statistics
[13:13:15] RECOVERY - MySQL Processlist on db1002 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 3 statistics
[13:25:43] (CR) Mark Bergsma: [C: 2] Allow caching of login.wikimedia.org requests [operations/puppet] - https://gerrit.wikimedia.org/r/94765 (owner: Mark Bergsma)
[13:39:40] (PS7) Hashar: beta: symlink /a/common [operations/puppet] - https://gerrit.wikimedia.org/r/65254
[13:40:00] (CR) Hashar: "defines /data/project/apache and /data/project/apache/common-local as belonging to mwdeploy:mwdeploy." [operations/puppet] - https://gerrit.wikimedia.org/r/65254 (owner: Hashar)
[13:40:05] (PS8) Hashar: beta: symlink /a/common [operations/puppet] - https://gerrit.wikimedia.org/r/65254
[13:47:53] eww
[13:49:21] (PS1) Akosiaris: Moving esams to new puppet infrastructure [operations/dns] - https://gerrit.wikimedia.org/r/94898
[13:50:11] heh interesting
[13:50:15] a wikimedia.org CNAMEd to wmnet
[13:50:23] not sure how I feel about that
[13:50:32] (CR) Akosiaris: [C: 2] Moving esams to new puppet infrastructure [operations/dns] - https://gerrit.wikimedia.org/r/94898 (owner: Akosiaris)
[13:50:52] lol
[13:51:07] well... it is weird to say the least....
[13:51:32] how would you feel about a wikimedia.org A record pointing to a 10.x address ?
[13:51:46] not much different...
[13:53:24] I guess
[13:53:33] maybe we should just set server = explicitly? :)
[13:54:44] wouldn't that make it more difficult to make such changes ?
[13:54:54] hmmm well not really now that i think about it
[13:55:04] anyway, let's think about it when you're done
[13:55:09] not the right time now I guess :)
[13:55:16] we would have to maintain a hash in puppet for at least the DCs
[13:55:25] yeah ok
[13:55:53] (CR) Hashar: [C: 1] "Fine to me, thank you :-] Feel free to merge at anytime." [operations/puppet] - https://gerrit.wikimedia.org/r/94257 (owner: Andrew Bogott)
[13:56:13] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/89002 (owner: Hashar)
[14:03:04] PROBLEM - MySQL Slave Running on db1021 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 19428654 for key PRIMARY on query. Defaul
[14:03:04] PROBLEM - MySQL Slave Running on db1026 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 19428654 for key PRIMARY on query. Defaul
[14:03:14] PROBLEM - MySQL Slave Running on db1045 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 19428654 for key PRIMARY on query. Defaul
[14:03:15] PROBLEM - MySQL Slave Running on db73 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 19428654 for key PRIMARY on query. Defaul
[14:03:56] wtf
[14:03:57] that doesn't sound very good
[14:04:07] oh, heh, hey sean
[14:04:08] OSC gone wrong
[14:04:40] OSC?
[14:04:51] ah, schema change?
[14:05:11] anyway, I'll shut up, let me know if you need anything
[14:06:16] PROBLEM - MySQL Slave Running on db1005 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 19428655 for key PRIMARY on query. Defaul
[14:07:05] PROBLEM - MySQL Replication Heartbeat on db73 is CRITICAL: CRIT replication delay 321 seconds
[14:07:15] PROBLEM - MySQL Replication Heartbeat on db1021 is CRITICAL: CRIT replication delay 329 seconds
[14:07:16] PROBLEM - MySQL Replication Heartbeat on db1045 is CRITICAL: CRIT replication delay 331 seconds
[14:07:55] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[14:08:15] RECOVERY - MySQL Slave Running on db1045 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[14:08:15] RECOVERY - MySQL Replication Heartbeat on db1045 is OK: OK replication delay -0 seconds
[14:08:36] (CR) Akosiaris: [C: 2] More fixes for file permissions/ownerships [operations/puppet] - https://gerrit.wikimedia.org/r/94777 (owner: Akosiaris)
[14:09:05] RECOVERY - MySQL Slave Running on db1026 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[14:09:05] RECOVERY - MySQL Slave Running on db1021 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[14:09:15] RECOVERY - MySQL Slave Running on db1005 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[14:09:15] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay -0 seconds
[14:10:56] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (108724)
[14:11:05] RECOVERY - MySQL Replication Heartbeat on db73 is OK: OK replication delay -0 seconds
[14:11:15] RECOVERY - MySQL Slave Running on db73 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[14:17:10] !log paused externallinks OSC jobs after replication glitch on dewiki.
original table and data remain untouched [14:17:28] Logged the message, Master [14:17:32] *sigh* [14:18:09] (03PS2) 10Faidon Liambotis: Remove misc::maintenance::foundationwiki cronjobs [operations/puppet] - 10https://gerrit.wikimedia.org/r/91676 (owner: 10Reedy) [14:19:15] (03CR) 10Faidon Liambotis: [C: 032] Remove misc::maintenance::foundationwiki cronjobs [operations/puppet] - 10https://gerrit.wikimedia.org/r/91676 (owner: 10Reedy) [14:19:16] (03PS1) 10Mark Bergsma: Ignore Range requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94902 [14:20:34] (03CR) 10Mark Bergsma: [C: 032] Ignore Range requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94902 (owner: 10Mark Bergsma) [14:20:41] I have them open [14:20:50]