[00:09:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[00:09:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[00:09:57] whew, all better ;)
[00:12:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103317)
[00:12:31] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103314)
[00:12:35] (PS1) Ryan Lane: Make appserver common a mediawiki deploy target [operations/puppet] - https://gerrit.wikimedia.org/r/94832
[00:28:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[00:28:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[00:32:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103838)
[00:32:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103831)
[00:44:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[00:44:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[01:05:50] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours
[01:10:49] (PS4) Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - https://gerrit.wikimedia.org/r/94688
[01:12:22] (PS4) Ryan Lane: Add mediawiki module for fetch and checkout hooks [operations/puppet] - https://gerrit.wikimedia.org/r/94682
[01:12:41] (PS5) Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - https://gerrit.wikimedia.org/r/94688
[01:27:29] !log restart db1050 mariadb after outage, let repl catch up. new lvm snaps mount ok. leave out of pool for now
[01:27:49] Logged the message, Master
[01:34:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103110)
[01:34:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103081)
[01:37:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[01:37:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[01:40:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (100854)
[01:40:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (100846)
[01:42:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[01:42:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[01:46:50] PROBLEM - Puppet freshness on sq48 is CRITICAL: No successful Puppet run in the last 10 hours
[01:49:36] (PS1) Springle: track client/user/table/index stats for audit. disable excess warnings for mariadb until we switch to RBR. [operations/puppet] - https://gerrit.wikimedia.org/r/94841
[01:50:49] (CR) Springle: [C: 2] track client/user/table/index stats for audit. disable excess warnings for mariadb until we switch to RBR. [operations/puppet] - https://gerrit.wikimedia.org/r/94841 (owner: Springle)
[01:58:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (109860)
[01:58:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (110003)
[02:05:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[02:05:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[02:15:04] !log LocalisationUpdate completed (1.23wmf3) at Tue Nov 12 02:15:03 UTC 2013
[02:15:22] Logged the message, Master
[02:16:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104802)
[02:16:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104789)
[02:20:54] !log LocalisationUpdate completed (1.23wmf2) at Tue Nov 12 02:20:54 UTC 2013
[02:21:10] Logged the message, Master
[02:25:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[02:25:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[02:33:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (100432)
[02:33:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (100426)
[02:34:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[02:34:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[02:39:30] PROBLEM - MySQL Processlist on db1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
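The check_job_queue alerts above flip between OK and CRITICAL while the total hovers around 100,000 jobs, and the list of offending wikis in the message is empty. A minimal Python sketch of a threshold check that would behave that way is given here; `check_job_queue` and the `job_counts` mapping (wiki name to queued jobs) are hypothetical, and the message strings only imitate the alert text in this log rather than reproduce the real plugin.

```python
# Hypothetical sketch of a Nagios-style job queue check; not the real plugin.
THRESHOLD = 99_999  # the alerts in the log fire above 99,999 jobs


def check_job_queue(job_counts, threshold=THRESHOLD):
    """Return (exit_code, message) using the Nagios convention 0 = OK, 2 = CRITICAL."""
    total = sum(job_counts.values())
    # Wikis that individually exceed the threshold (may be empty).
    over = sorted(wiki for wiki, n in job_counts.items() if n > threshold)
    if over or total > threshold:
        return 2, (
            "JOBQUEUE CRITICAL - the following wikis have more than "
            f"{threshold:,} jobs: {', '.join(over)}, Total ({total})"
        )
    return 0, f"JOBQUEUE OK - all job queues below {threshold + 1:,}"
```

Under this assumption, an empty wiki list followed by a large total would mean that no single wiki crossed the limit and only the sum did, which is one possible reading of the ": , Total (...)" output seen in the alerts.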
[02:41:20] RECOVERY - MySQL Processlist on db1002 is OK: OK 1 unauthenticated, 0 locked, 4 copy to table, 9 statistics
[02:42:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104180)
[02:42:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104103)
[02:49:20] PROBLEM - MySQL Processlist on db1002 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 7 copy to table, 222 statistics
[02:50:20] RECOVERY - MySQL Processlist on db1002 is OK: OK 1 unauthenticated, 0 locked, 4 copy to table, 7 statistics
[02:52:10] PROBLEM - MySQL Idle Transactions on db1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:00] RECOVERY - MySQL Idle Transactions on db1002 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[03:00:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[03:00:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[03:05:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (108151)
[03:06:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104566)
[03:09:30] PROBLEM - MySQL Processlist on db1002 is CRITICAL: CRIT 1 unauthenticated, 0 locked, 5 copy to table, 312 statistics
[03:11:30] RECOVERY - MySQL Processlist on db1002 is OK: OK 0 unauthenticated, 0 locked, 5 copy to table, 1 statistics
[03:11:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[03:11:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[03:12:05] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Nov 12 03:12:05 UTC 2013
[03:12:23] Logged the message, Master
[03:14:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102136)
[03:14:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102073)
[03:21:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[03:21:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[03:24:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101860)
[03:24:40] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:24:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101814)
[03:26:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time
[03:29:30] PROBLEM - MySQL Processlist on db1002 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 6 copy to table, 152 statistics
[03:32:30] RECOVERY - MySQL Processlist on db1002 is OK: OK 0 unauthenticated, 0 locked, 4 copy to table, 2 statistics
[03:32:36] (PS1) Springle: aim for at least 3 equivalent slaves on shards not using groupLoadsByDB [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94843
[03:33:11] (CR) Springle: [C: 2] aim for at least 3 equivalent slaves on shards not using groupLoadsByDB [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94843 (owner: Springle)
[03:34:27] !log springle synchronized wmf-config/db-eqiad.php 'slave balancing'
[03:34:44] Logged the message, Master
[03:37:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[03:37:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[03:38:40] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:39:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.939 second response time
[03:49:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104886)
[03:49:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104649)
[03:52:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[03:52:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[03:55:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102874)
[03:55:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102817)
[03:57:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[03:57:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[04:05:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (106488)
[04:05:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (106405)
[04:14:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[04:14:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[04:17:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101970)
[04:17:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102936)
[05:04:10] PROBLEM - MySQL Idle Transactions on db1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:04:10] PROBLEM - MySQL InnoDB on db1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:06:10] RECOVERY - MySQL InnoDB on db1002 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[05:06:10] RECOVERY - MySQL Idle Transactions on db1002 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[06:09:43] (PS1) Tim Starling: Disable client idle disconnection [operations/puppet] - https://gerrit.wikimedia.org/r/94848
[06:29:20] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 100,000
[06:32:20] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:58:08] (PS1) Springle: move recache jobs to snapshot host in future [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94851
[06:59:45] (CR) Springle: [C: 2] move recache jobs to snapshot host in future [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94851 (owner: Springle)
[07:02:01] !log springle synchronized wmf-config/db-eqiad.php 'recache jobs on S2 to db1018'
[07:02:21] Logged the message, Master
[07:23:11] (CR) Ori.livneh: "Useful thread from the redis mailing list:" [operations/puppet] - https://gerrit.wikimedia.org/r/94848 (owner: Tim Starling)
[07:51:58] (PS1) Raimond Spekking: Temporary lift of IP cap for WikiCon 2013 in de/en.WP, Commons, de/en.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94860
[07:59:50] PROBLEM - Puppet freshness on amssq58 is CRITICAL: No successful Puppet run in the last 10 hours
[08:38:56] (CR) Tim Starling: "On rdb1003, netstat shows only 312 connections at present, 82 in TIME_WAIT, and tcpdump shows about 125 connections per second, so we are " [operations/puppet] - https://gerrit.wikimedia.org/r/94848 (owner: Tim Starling)
[08:48:20] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 100,000
[08:51:21] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:55:39] (PS1) ArielGlenn: monitoring for text-varnish wikipedia in eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/94868
[08:57:18] (CR) ArielGlenn: [C: 2] monitoring for text-varnish wikipedia in eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/94868 (owner: ArielGlenn)
[09:14:40] RECOVERY - Puppet freshness on neon is OK: puppet ran at Tue Nov 12 09:14:34 UTC 2013
[09:22:37] PROBLEM - Host sq48 is DOWN: PING CRITICAL - Packet loss = 100%
[09:23:09] morning
[09:23:18] springle: still here?
[09:24:25] paravoid: yep
[09:24:32] hey
[09:25:04] so, on sunday, even after your pt-kill jobs, there was still some load spikes on databases
[09:25:23] 90% CPU and such, they're apparent in ganglia
[09:25:31] I only saw them after the fact
[09:25:47] there were tons of "too many connection" errors in dberror, look at around 14:00 UTC iirc
[09:26:04] yes the 30s kill limit was too long
[09:26:18] my very cursory ishmael digging only showed the logpager query as an outlier, with 1.1% of queries / 72% of time (wtf...)
[09:28:23] actually, to be accurate, 30s was too long and the pt-kill interval at 5s was too long to catch the surge of SpecialAllpages::showToplevel
[09:28:43] twofold issue. they're now shorter
[09:29:43] ok
[09:29:49] as long as you're aware of it :)
[09:30:18] paravoid: did you see this http://aerosuidae.net/paste/22/52807966 (was in an email in Problem SQl thread)
[09:31:55] I didn't
[09:32:22] is that ishmael?
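The remark about the 30s kill limit and the 5s pt-kill interval can be illustrated with a small Python sketch. This is not pt-kill itself: `select_victims` and `worst_case_lifetime` are hypothetical helpers, and the rows are made-up sample data shaped like `SHOW PROCESSLIST` output. It only shows how a busy-time limit and a polling interval combine.

```python
def select_victims(processlist, busy_time):
    """Pick the ids of running queries that have been busy longer than busy_time seconds."""
    return [
        row["Id"]
        for row in processlist
        if row["Command"] == "Query" and row["Time"] > busy_time
    ]


def worst_case_lifetime(busy_time, interval):
    """A query that crosses the limit just after a poll survives until the next poll."""
    return busy_time + interval


# Made-up sample rows; only the fields used above are included.
sample = [
    {"Id": 1, "Command": "Sleep", "Time": 120},  # idle connection, ignored
    {"Id": 2, "Command": "Query", "Time": 31},   # over the 30s limit
    {"Id": 3, "Command": "Query", "Time": 4},    # still under the limit
]

print(select_victims(sample, busy_time=30))           # [2]
print(worst_case_lifetime(busy_time=30, interval=5))  # 35
```

With the values quoted in the conversation, a slow query could hold its connection for up to roughly 35 seconds in this model, so a surge of such queries can accumulate before the first one is killed. Shortening either value narrows that window.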
[09:32:30] hmmm, that looks like a much nicer way than the web intf
[09:32:35] hmmz
[09:32:39] now amssq58 is in trouble
[09:32:46] [497924.126440] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[09:32:49] again
[09:34:06] hm
[09:34:17] ms-be1003 had very weird symptoms last night
[09:34:29] kernel locks spinning forever
[09:34:34] 50% cpu 50% iowait
[09:34:44] load 190
[09:34:53] paravoid: yes, ishmael plus a cross-slave view
[09:34:56] maybe related?
[09:35:39] springle: so, what am I looking at?
[09:36:22] the last seven days aggregating query review history from one non-snapshot slave per cluster, looking for queries based on total time and rows examined
[09:36:48] showToplevel isn't even close to the top?
[09:37:06] no, this doesn't account for spikes
[09:37:13] right
[09:37:14] only volume over 7 days
[09:39:55] of these, the problem ones are some forms of LogPager (paging all user history forever is madness), Wikibase\TermSqlIndex::getMatchingIDs (fixed with reindexing mid last week), and SpecialAllpages::showToplevel
[09:40:07] problems == cause spikes
[09:40:21] nod
[09:40:41] logpager and showtoplevel are not new, though, right?
[09:41:18] maybe someone's crawling as theorized on list, who knows...
[09:41:42] have been fighting with logpager for months. showtoplevel is new in that i've not had it show up on this sort of list before
[09:41:50] but it's an old query
[09:45:43] !log rebooting amssq58 with sysrq-trigger
[09:46:02] Logged the message, Master
[09:46:21] paravoid: a parallel issue that's affected slave cpu is updatespecialpages jobs on terbim. working on getting those onto the snapshot slaves along with dumps
[09:46:58] right, the one I inadvertently fixed the other day
[09:47:54] mark: "cdn" 5xx are elevated since yesterday
[09:48:12] http://gdash.wikimedia.org/dashboards/reqerror/
[09:48:27] RECOVERY - Puppet freshness on amssq58 is OK: puppet ran at Tue Nov 12 09:48:22 UTC 2013
[09:50:00] i don't... see that?
[09:50:53] it's not very apparent, I only noticed because I looked at those graphs yesterday too
[09:51:02] so if you look at the -1 day one
[09:51:36] the blue line is usually close to zero, there's a small bump around 13:00 UTC which is the esams congestion issue
[09:51:51] i'll believe it, I think especially those gzip errors are suspicious
[09:51:54] then it's fixed shortly after that (by your change) and goes back to zero for a while
[09:52:02] then it starts going up around 16:00 again
[09:52:42] little before
[09:52:52] you switched traffic to varnish at 15:30
[10:16:52] so I think these amssq* boxes are dying with kmem_alloc errors because they're under memory pressure due to too many dirty pages
[10:16:58] I guess we should change the thresholds a little
[10:28:20] (PS3) Faidon Liambotis: Remove references to 'olivneh' account from node defs [operations/puppet] - https://gerrit.wikimedia.org/r/92267 (owner: Ori.livneh)
[10:28:32] (CR) Faidon Liambotis: [C: 2] Remove references to 'olivneh' account from node defs [operations/puppet] - https://gerrit.wikimedia.org/r/92267 (owner: Ori.livneh)
[10:30:38] (Abandoned) Faidon Liambotis: Slight restructure for java module. [operations/puppet] - https://gerrit.wikimedia.org/r/74380 (owner: Ottomata)
[10:32:40] (CR) Faidon Liambotis: [C: -1] "The text in parentheses is actually quite useful. We have a very confusing (to some :) rule that we've even debated in the past that we us" [operations/puppet] - https://gerrit.wikimedia.org/r/81630 (owner: Ori.livneh)
[10:35:13] (CR) Faidon Liambotis: [C: -1] "Why do we neeed (tool)labsbeta.pp? Just kill that and rename labs.pp to toollabs.pp." [operations/puppet] - https://gerrit.wikimedia.org/r/84926 (owner: Yuvipanda)
[10:35:32] (PS1) ArielGlenn: remove rose (long gone); fix range comments [operations/dns] - https://gerrit.wikimedia.org/r/94876
[10:36:42] (CR) Faidon Liambotis: "Is this a -1 (improve by moving to dev_environ) or a -2 (do not submit)? If it's the latter, then abandon the change since it's over a mon" [operations/puppet] - https://gerrit.wikimedia.org/r/84288 (owner: DrTrigon)
[10:37:06] (CR) ArielGlenn: [C: 2] remove rose (long gone); fix range comments [operations/dns] - https://gerrit.wikimedia.org/r/94876 (owner: ArielGlenn)
[10:38:02] (Abandoned) Faidon Liambotis: Hopefully fix the Parsoid Varnishes not showing up as such in Ganglia [operations/puppet] - https://gerrit.wikimedia.org/r/69443 (owner: Catrope)
[10:39:40] (CR) Faidon Liambotis: [C: -1] "-1 because of what Reedy said. Max, you recently said you've been using this successfully, so let's fix it up and merge it?" [operations/puppet] - https://gerrit.wikimedia.org/r/38252 (owner: MaxSem)
[10:40:05] ahhh, remembering fail:P
[10:41:16] (CR) Faidon Liambotis: "Any progress, Andrew?" [operations/puppet] - https://gerrit.wikimedia.org/r/83960 (owner: Ottomata)
[10:43:37] (Abandoned) Faidon Liambotis: Adding IPv4/6 networks in ferms defs [operations/puppet] - https://gerrit.wikimedia.org/r/89791 (owner: Akosiaris)
[10:47:05] apergos: there's quite a lot of beta patchsets submitted by hashar, are you handling these?
[10:47:09] you're the one doing beta now, aren't you? :)
[10:47:19] uhh
[10:47:31] (and ci)
[10:47:40] what is ci?
[10:47:47] contint
[10:48:10] I can do thebeta ones, he usually adds me as a reviewer if he wants me to look
[10:48:11] ci is the more industry-known abbreviation
[10:48:24] but I can rmeind him of that
[10:48:28] *remind
[10:48:36] he's here :)
[10:48:39] hashar:
[10:48:42] hashar: heeelllo :)
[10:48:56] hello
[10:49:02] if you want reviews from me on puppet changesets having to do with beta, add me as a reviewer
[10:49:09] (CR) Faidon Liambotis: [C: 1] "I like this and despite plans to deprecate decom.pp, I don't see this happening very soon. Rebase & merge?" [operations/puppet] - https://gerrit.wikimedia.org/r/93930 (owner: ArielGlenn)
[10:49:18] I would apergos :-]
[10:49:19] nagging me doesn't hurt but I do look at the dash once every few days
[10:49:32] been a bit too busy with CI for the last 5-6 weeks or so though
[10:49:47] who's reviewing those, hashar?
[10:50:14] https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+owner:%22Hashar+%253Chashar%2540free.fr%253E%22+status:open,n,z
[10:50:37] ops ? :-D
[10:50:43] I haven't nagged anyone for them
[10:51:00] 3 of them are related to an upgrade of Zuul, I have wrote them lsat week
[10:51:09] haven't written down the migration plan yet
[10:51:21] as MaxSem and others know well, we generally suck at processing open patchsets with no reviewers set :)
[10:51:29] (PS3) Faidon Liambotis: rake validate now let puppet output colors [operations/puppet] - https://gerrit.wikimedia.org/r/77381 (owner: Hashar)
[10:51:43] (CR) Faidon Liambotis: [C: 2] rake validate now let puppet output colors [operations/puppet] - https://gerrit.wikimedia.org/r/77381 (owner: Hashar)
[10:51:48] one is about tweaking jobrunner / videoscaler roles which Ariel reviewed last week, still have to follow up though
[10:52:11] I just went through the list and you're on top of it ;)
[10:52:17] well
[10:52:21] because I send a ton of patches
[10:52:48] nowadays, most of my puppet changes are reviewed /merged quite fast
[10:53:22] hi apergos
[10:53:32] there is https://gerrit.wikimedia.org/r/65254 which is all about setting a symlink on beta :]
[10:54:42] hello aude
[10:54:53] apergos: i want to ask about json dumps for wikidata
[10:54:54] per https://bugzilla.wikimedia.org/show_bug.cgi?id=54369
[10:55:12] what can we do to move it forward :)
[10:55:14] ?
[10:55:23] would an RT ticket be helpful?
[10:56:10] no, its not something I can just fold into the regular dumps, so it's another maintenance script that needs to be put somewhere, tested, then we argue about the frequency, where the output goes, etc
[10:56:22] where is that somewhere?
[10:56:39] terbium (?) or arsenic(?)
[10:56:46] maintenance scripts run on terbium I think (don't they?)
[10:56:49] ok
[10:57:08] I'm not sure what arsenic is being used for exactly, I know some cirrussearch stuff was happening there
[10:57:09] and then where the output goes?
[10:57:12] cirrus
[10:57:17] it's a new box for cirrus
[10:57:32] but might be used for cron jobs and scripts
[10:57:33] ok well this script doesn't belong there then
[10:57:38] oh
[10:57:40] :-D
[10:57:43] i'd have to ask chad
[10:57:47] right
[10:58:00] we can try terbium and if it's too much load, then find somewhere else
[10:58:27] we can test the setup with test.wikidata
[10:58:32] (Restored) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/89002 (owner: Hashar)
[10:58:32] although test.wikidata is very small
[10:58:36] (PS4) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/89002
[10:58:48] if you run it on that at least you flush out silly errors
[10:58:56] yes
[10:59:05] do you have terbium access?
[10:59:08] (CR) jenkins-bot: [V: -1] Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/89002 (owner: Hashar)
[10:59:15] i don't but can use puppet
[11:00:03] well first steps would be to run by hand over there once in a screen session and see how it behaves, for test.wikidata that is
[11:00:09] ok
[11:00:11] and then the same for wikidatawiki
[11:00:17] i could ask about shell access :)
[11:00:25] don't know if it's possible but think it would help
[11:00:28] how large do we expect these files to be? any notion?
[11:00:45] i don't know
[11:00:47] well.....
[11:00:54] similar to the current pages xml
[11:01:06] probably, though it's not quite as verbose
[11:01:20] let's see how big those are
[11:01:20] wikidatawiki-20131006-pages-meta-current.xml.bz2 1.8 GB
[11:01:27] ok that's not too bad
[11:01:29] and just items / properties
[11:01:47] then there's the matter of where it should get put when done
[11:01:56] and how often it should get produced
[11:01:59] then we'll need to make something like http://dumps.wikimedia.org/wikidatawiki/
[11:02:04] yep
[11:02:17] once a week is probably fine
[11:02:28] it's going to land in other/something
[11:02:36] we try not to clutter up the toplevel
[11:02:40] ok
[11:02:42] note the word "try"
[11:02:55] looks like just directory listing
[11:02:58] anyways, can we discuss this on the bug report? maybe summarize what's been said here so far
[11:03:02] ok
[11:03:09] and next step
[11:04:28] added to the bug report
[11:04:49] to try in terbium, it's either i try to get shell access and do it or need help
[11:05:20] thanks apergos :)
[11:05:58] to try in terbium, someone gives good instructions (see my comment just now on the report)
[11:06:00] yw
[11:06:08] ok
[11:10:28] hashar:
[11:10:56] is /data/project/apache or /data/project/apache/common-local set up anywhere?
[11:11:05] in the puppet manifests that is
[11:16:06] apergos: put the command in
[11:16:19] (PS1) Hashar: rake validate was failling on non tty [operations/puppet] - https://gerrit.wikimedia.org/r/94880
[11:17:04] apergos: not y
[11:17:08] grr
[11:17:22] aude, thanks
[11:18:19] apergos: /data/project… are not used directly
[11:18:20] we have symlinks all over the place
[11:18:20] not going to shard for this round. let's see what the script does normally.
[11:19:16] bah puppet linting is broken :D the rake file used to validate the lints is wrong. https://gerrit.wikimedia.org/r/#/c/94880/ should fix it
[11:19:37] I don't care that it's not used drectly, that's not an issue
[11:19:59] sharding not needed for test wikidata
[11:20:06] but puppet should manage those two directories (in case you ever set up a new box right?)
[11:20:11] and can be experimented with for wikidata
[11:20:30] hmm just a second
[11:20:39] we ant to pipe to bzip2
[11:20:49] apergos: yes
[11:21:08] --output won't let medo that
[11:21:18] you can omit it probably
[11:21:22] and then just piple
[11:21:23] pipe
[11:21:40] where do progress messages go in that case?
[11:21:51] hmmmm
[11:21:53] ok :)
[11:22:04] looking
[11:22:09] thanks
[11:24:20] (PS2) Dereckson: Temporary lift of IP cap for WikiCon 2013 in de/en.WP, Commons, de/en.wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94860 (owner: Raimond Spekking)
[11:24:40] (CR) Dereckson: [C: 1] Throttle rule for WikiCon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94860 (owner: Raimond Spekking)
[11:27:03] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[11:28:47] (CR) Hashar: [C: 2] Throttle rule for WikiCon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94860 (owner: Raimond Spekking)
[11:28:59] (Merged) jenkins-bot: Throttle rule for WikiCon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94860 (owner: Raimond Spekking)
[11:29:57] !log hashar synchronized wmf-config/throttle.php 'thottle rule for WikiCon {{gerrit|94860}}'
[11:30:03] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (110729)
[11:30:15] Logged the message, Master
[11:30:18] !log hashar synchronized wmf-config/InitialiseSettings.php 'thottle rule for WikiCon {{gerrit|94860}}'
[11:30:19] (CR) Hashar: "deployed in production" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94860 (owner: Raimond Spekking)
[11:30:34] Logged the message, Master
[11:33:49] !log Manually set dirty_background_ratio to 5 (from 10) on amssq58
[11:34:06] Logged the message, Master
[11:36:15] I am off for lunch / nap
[11:36:23] ops/puppet validation is broken right now, https://gerrit.wikimedia.org/r/#/c/94880/ should fix it
[11:43:43] (CR) Faidon Liambotis: "This happens because redis first attempts to set the limit to 10032 (default maxclients = 10000 + 32 fds reserved for internal usage), fai" [operations/puppet] - https://gerrit.wikimedia.org/r/94848 (owner: Tim Starling)
[11:44:17] (CR) Faidon Liambotis: [C: 2] rake validate was failling on non tty [operations/puppet] - https://gerrit.wikimedia.org/r/94880 (owner: Hashar)
[11:49:57] (CR) ArielGlenn: "I would like to see declarations for the directories /data/project/apache and /data/project/apache/common-local, just as you have done for" [operations/puppet] - https://gerrit.wikimedia.org/r/65254 (owner: Hashar)
[12:28:30] (PS3) Mark Bergsma: Allow caching of login.wikimedia.org requests [operations/puppet] - https://gerrit.wikimedia.org/r/94765
[12:28:31] (PS1) Mark Bergsma: Filter out some noise requests [operations/puppet] - https://gerrit.wikimedia.org/r/94886
[12:30:04] (CR) Mark Bergsma: [C: 2 V: 2] Filter out some noise requests [operations/puppet] - https://gerrit.wikimedia.org/r/94886 (owner: Mark Bergsma)
[12:34:01] (PS4) Mark Bergsma: Allow caching of login.wikimedia.org requests [operations/puppet] - https://gerrit.wikimedia.org/r/94765
[12:34:02] (PS1) Mark Bergsma: req.request instead of req.method [operations/puppet] - https://gerrit.wikimedia.org/r/94887
[12:34:26] (CR) Mark Bergsma: [C: 2 V: 2] req.request instead of req.method [operations/puppet] - https://gerrit.wikimedia.org/r/94887 (owner: Mark Bergsma)
[13:12:15] PROBLEM - MySQL Processlist on db1002 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 178 statistics
[13:13:15] RECOVERY - MySQL Processlist on db1002 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 3 statistics
[13:25:43] (CR) Mark Bergsma: [C: 2] Allow caching of login.wikimedia.org requests [operations/puppet] - https://gerrit.wikimedia.org/r/94765 (owner: Mark Bergsma)
[13:39:40] (PS7) Hashar: beta: symlink /a/common [operations/puppet] - https://gerrit.wikimedia.org/r/65254
[13:40:00] (CR) Hashar: "defines /data/project/apache and /data/project/apache/common-local as belonging to mwdeploy:mwdeploy." [operations/puppet] - https://gerrit.wikimedia.org/r/65254 (owner: Hashar)
[13:40:05] (PS8) Hashar: beta: symlink /a/common [operations/puppet] - https://gerrit.wikimedia.org/r/65254
[13:47:53] eww
[13:49:21] (PS1) Akosiaris: Moving esams to new puppet infrastructure [operations/dns] - https://gerrit.wikimedia.org/r/94898
[13:50:11] heh interesting
[13:50:15] a wikimedia.org CNAMEd to wmnet
[13:50:23] not sure how I feel about that
[13:50:32] (CR) Akosiaris: [C: 2] Moving esams to new puppet infrastructure [operations/dns] - https://gerrit.wikimedia.org/r/94898 (owner: Akosiaris)
[13:50:52] lol
[13:51:07] well... it is weird to say the least....
[13:51:32] how would you feel about a wikimedia.org A record pointing to an 10.x address ?
[13:51:46] not much different...
[13:53:24] I guess
[13:53:33] maybe we should just set server = explicitly? :)
[13:54:44] wouldn't that make it more difficult to make such changes ?
[13:54:54] hmmm well not really now that i think about it
[13:55:04] anyway, let's think about it when you're done
[13:55:09] not the right time now I guess :)
[13:55:16] we would have to maintain a hash in puppet for at least the DCs
[13:55:25] yeah ok
[13:55:53] (CR) Hashar: [C: 1] "Fine to me, thank you :-] Feel free to merge at anytime." [operations/puppet] - https://gerrit.wikimedia.org/r/94257 (owner: Andrew Bogott)
[13:56:13] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/89002 (owner: Hashar)
[14:03:04] PROBLEM - MySQL Slave Running on db1021 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 19428654 for key PRIMARY on query. Defaul
[14:03:04] PROBLEM - MySQL Slave Running on db1026 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 19428654 for key PRIMARY on query. Defaul
[14:03:14] PROBLEM - MySQL Slave Running on db1045 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 19428654 for key PRIMARY on query. Defaul
[14:03:15] PROBLEM - MySQL Slave Running on db73 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 19428654 for key PRIMARY on query. Defaul
[14:03:56] wft
[14:03:57] that doesn't sound very good
[14:04:07] oh, heh, hey sean
[14:04:08] OSC gone wrong
[14:04:40] OSC?
[14:04:51] ah, schema change?
[14:05:11] anyway, I'll shut up, let me know if you need anything
[14:06:16] PROBLEM - MySQL Slave Running on db1005 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 19428655 for key PRIMARY on query. Defaul
[14:07:05] PROBLEM - MySQL Replication Heartbeat on db73 is CRITICAL: CRIT replication delay 321 seconds
[14:07:15] PROBLEM - MySQL Replication Heartbeat on db1021 is CRITICAL: CRIT replication delay 329 seconds
[14:07:16] PROBLEM - MySQL Replication Heartbeat on db1045 is CRITICAL: CRIT replication delay 331 seconds
[14:07:55] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[14:08:15] RECOVERY - MySQL Slave Running on db1045 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[14:08:15] RECOVERY - MySQL Replication Heartbeat on db1045 is OK: OK replication delay -0 seconds
[14:08:36] (CR) Akosiaris: [C: 2] More fixes for file permissions/ownerships [operations/puppet] - https://gerrit.wikimedia.org/r/94777 (owner: Akosiaris)
[14:09:05] RECOVERY - MySQL Slave Running on db1026 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[14:09:05] RECOVERY - MySQL Slave Running on db1021 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[14:09:15] RECOVERY - MySQL Slave Running on db1005 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[14:09:15] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay -0 seconds
[14:10:56] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (108724)
[14:11:05] RECOVERY - MySQL Replication Heartbeat on db73 is OK: OK replication delay -0 seconds
[14:11:15] RECOVERY - MySQL Slave Running on db73 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[14:17:10] !log paused externallinks OSC jobs after replication glitch on dewiki.
original table and data remain untouched [14:17:28] Logged the message, Master [14:17:32] *sigh* [14:18:09] (03PS2) 10Faidon Liambotis: Remove misc::maintenance::foundationwiki cronjobs [operations/puppet] - 10https://gerrit.wikimedia.org/r/91676 (owner: 10Reedy) [14:19:15] (03CR) 10Faidon Liambotis: [C: 032] Remove misc::maintenance::foundationwiki cronjobs [operations/puppet] - 10https://gerrit.wikimedia.org/r/91676 (owner: 10Reedy) [14:19:16] (03PS1) 10Mark Bergsma: Ignore Range requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94902 [14:20:34] (03CR) 10Mark Bergsma: [C: 032] Ignore Range requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94902 (owner: 10Mark Bergsma) [14:20:41] I have them open [14:20:50] merged [14:21:00] (them = sockpuppet + palladium) [14:21:36] we need to merge in both places now? [14:21:43] temporarily [14:21:44] for a couple of days [14:21:46] alex mailed about that :) [14:21:51] sorry :-( [14:22:14] I could switch everything at once... 
but I get a feeling it wouldn't be nice [14:22:37] not a problem [14:35:48] paravoid: the client IPs on these recent spikes are msnbot :) [14:36:19] 22:07 < paravoid> second time I'm seeing this msnbot IP [14:36:25] it just hit wikidata with a load of LogPager queries with absurd LIMIT offsets [14:36:40] 22:28 < paravoid> 536184595 wikiuser 10.64.32.55:52792 dewiki Query 208 Sending data SELECT /* IndexPager::buildQueryInfo (LogPager) 157.55.32.209 */ log_id,log_type,log_action,log_timestamp,log_user,log_user_text,lo [14:36:44] g_namespace,log_title,log_comment,log_params,log_deleted,user_id,user_name,user_editcount,ts_tags FROM `logging` LEFT JOIN `user` ON ((log_user=user_id)) LEFT JOIN `tag_summary` ON ((ts_log_id=log_id)) WHERE (log_type NOT IN ('suppress','spambl [14:36:48] acklist')) AND log_user = '676408' AND ((log_deleted & 4) = 0) AND (log_type != 'review') ORDER BY log_timestamp DESC LIMI [14:36:51] T 51 0.000 [14:36:53] this is from the day of the outage [14:36:53] ban it! [14:36:55] 22:28 < paravoid> is one of the ones I captured before [14:36:58] 22:29 < paravoid> the IP is msnbot's, don't worry about leaking it [14:37:00] :) [14:37:02] paravoid: ah :) [14:38:25] i wish i was around at the time. so far behind [14:38:33] heh, sorry [14:38:43] that was from ishmael, btw [14:38:58] the outage lasted for about 10', I didn't get enough time to go digging in databases in realtime [14:39:08] time to notice, time to pinpoint it to a db issue etc. [14:39:45] so I did see your email about running 'show full processlist' etc. and I usually do that (although usually not show engine innodb), but it was too late at the time anyway [14:40:30] hm, or not, the above was from a show full processlist [14:40:38] but it was after the site was back up [14:43:41] i checked a couple from ishmael but got no reverse dns at the time. 
seeing these now in the pt-kill stderr [14:51:50] (03CR) 10Ottomata: [C: 031] Remove Kraken-specific varnishncsa instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/91492 (owner: 10Ori.livneh) [15:01:31] (03CR) 10Akosiaris: [C: 032] Remove references to /etc/puppet/software [operations/puppet] - 10https://gerrit.wikimedia.org/r/94779 (owner: 10Akosiaris) [15:07:33] (03CR) 10Hashar: "tested in labs :-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 (owner: 10Hashar) [15:21:37] does anyone have a chance to look at Bug 56936? en.wp search failures that I can't reproduce or find logs for. [15:25:21] akosiaris: hiii [15:25:27] got some time to figure out the varnishkafka.log thing? [15:25:50] ottomata: yes [15:26:28] gimme a sec to read so I remember what we are talking about again [15:30:07] k [15:30:19] akosiaris: https://gerrit.wikimedia.org/r/#/c/78782/16/debian/rules [15:30:40] ottomata: are you going to deploy https://gerrit.wikimedia.org/r/#/c/91492/ ? [15:30:50] paravoid: sure shall I? [15:30:59] why not? :) [15:31:02] k [15:31:04] on it [15:31:14] mark only gave it a -1 because of making sure it gets removed [15:31:19] paravoid, the other thing I want to work on today is IPv6 [15:31:21] as long as you do that, it's fine, imho [15:31:21] k yeah, i'll do that [15:31:34] https://gerrit.wikimedia.org/r/#/c/93983/ [15:31:41] i got your email about the AAAA taking priority [15:31:42] so ok [15:31:49] i don't have a full grasp of everything that needs to happen here though [15:32:41] what are you missing? [15:32:43] like, i'm not sure how the dhcp bit is supposed to work i think [15:32:46] dhcp? [15:32:48] there's no dhcp [15:32:49] ummm [15:32:51] ok [15:32:55] how are the IPs actaully assigned? [15:33:09] these are just dns records, right? [15:33:10] the boxes have IP(v6)s already but that's due to autoconfiguration [15:33:21] that is, they have the network + the mac address suffix [15:33:23] are they inferred from the dns lookup? 
[15:33:26] no [15:33:33] you need to use interface::add_ip6_mapped [15:33:40] that infers the IPv6 from the IPv4 [15:33:43] ah! [15:33:54] ahhhh [15:33:56] that is what I'm missing [15:33:56] ok [15:33:58] great [15:34:02] so that will get IPs and names all set up [15:34:08] not names [15:34:10] well [15:34:11] my change [15:34:12] names is your commit [15:34:13] is names [15:34:13] yes [15:34:16] exactly [15:34:22] so those two together [15:34:22] $ipv6_address = inline_template("<%= require 'ipaddr'; (IPAddr.new(scope.lookupvar(\"::ipaddress6_${intf}\")).mask(64) | IPAddr.new(ip4_address.gsub('.', ':'))).to_s() %>") [15:34:23] ok [15:34:28] great [15:34:35] ok, and then does anything need to happen for public routing to work? [15:34:53] so the boxes will have ipv6 connectivity (they already do) [15:35:00] but we have ACLs in place, since these are private zones [15:35:18] so you need to make a hole in the ACL, Leslie can probably help you with that [15:35:32] I can too, but I had my fair share of net ops for a while :P [15:35:52] haha [15:35:53] ok [15:36:07] ok great [15:36:42] ottomata: varnishkafka.log is what ? [15:36:56] just the process logs [15:37:01] error messages, startmessages, stats [15:37:01] etc. 
[15:37:02] so errors etc [15:37:03] ok [15:37:04] ja [15:37:34] also, paravoid: https://gerrit.wikimedia.org/r/#/c/94148/ [15:37:38] ottomata: btw, as we discussed on that other ticket [15:37:42] /var/log/varnishkafka/varnishkafka.log for starters and your 1st problem is solved (with rsyslog not being able to write to /var/log/varnish [15:37:43] we need to move one of the brokers to a different row [15:37:47] right yeah [15:37:49] this means different IP [15:37:50] and IPv6 [15:37:51] oof [15:37:54] ok [15:37:54] and postinst indeed for creating files etc [15:37:59] (each row is a different network) [15:38:03] so you might want to do that first [15:38:06] oof [15:38:10] posting that to the change as well [15:38:10] yeah, rats [15:38:10] ok [15:38:15] that is going to take a while I betcha though [15:38:16] sigh [15:38:39] it's just chris unracking and racking again [15:38:50] it's work, but nothing insurmountable [15:38:58] is there enough room? do we have 3 rows? does he have to move around other boxes too? [15:38:59] but I'll leave scheduling to you ;) [15:39:13] we have 3 rows and we're in the process of setting up a fourth [15:39:16] (row D) [15:39:27] no idea about room, have a look at racktables [15:39:33] k [15:39:44] it's a little more complicated than that, some racks have only 10gbe ports [15:40:42] you can see the chassis topology for each row from the asw, and from the model you can see if it's a 10gbe switch or not [15:41:08] akosiaris: how does /var/log/varnishkafka help me? [15:41:40] akosiaris: rules is the wrong place for this, but creating a logfile in postinst is also wrong [15:42:00] you don't have to give rsyslog special access to /var/log/varnish ? [15:42:23] paravoid: why if I may ask ? 
[15:42:53] (03CR) 10Faidon Liambotis: [C: 032] "Yay :-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94148 (owner: 10Ottomata) [15:43:00] akosiaris: not to /var/log/varnish, all I did was create a first varnishkafka.log file that was syslog:adm writeable [15:43:01] altough in this case having rsyslog create the file might be better [15:43:15] if /var/log/varnish{kafka} was syslog writeable, it'd be fine [15:43:18] wouldn't have to create file [15:43:38] akosiaris: it's very counterintuitive to rm varnishkafka.log and have it never being able to be created again by itself, for starters [15:44:10] ottomata: /var/log/varnish/varnishkafka.log i don't like anyway... it is counter-intuitive [15:44:25] I don't remember the details much, but I think the problem was reusing /var/log/varnish? [15:44:26] i'm fine with /var/log/varnishkafka [15:44:28] just not sure how it would help [15:44:51] paravoid, I don't think so, unless we were to make /var/log/varnishkafka somehow writeable by rsyslog [15:45:08] the problem is that rsyslog doesn't run as root so it can't just write files wherever [15:45:18] right [15:45:34] akosiaris: i think we ahve varnishkafka running as the varnishlog user right now [15:45:38] should we create a new user too? [15:45:39] and if so [15:45:48] we could perhaps add syslog user to the varnishkafka group [15:45:48] if so, please decide on the name first :) [15:45:52] lol [15:45:54] and make the dir group writeable [15:45:54] oh [15:45:56] ha [15:46:06] paravoid: due to upcoming changes in varnishncsa [15:46:12] and possible deprecation of varnishkafka one day [15:46:15] we've decided not to change the name [15:46:31] depreWHAT?! [15:46:33] how is that related? [15:46:39] hahha, Snaps, right? [15:46:48] wait I'm not making this up I thought we talked about this? [15:46:56] yeah, otto is right [15:47:57] I don't care either way actually [15:48:02] should we rename it to varnishmill after all? 
[15:49:34] I don't really mind what name you decide [15:49:43] I do care about not renaming while in production though [15:50:33] indeed [15:50:42] we had decided not to change it, Snaps, do we stick to that decision? [15:51:01] ottomata: nah, lets stick with varnishkafka. If it catches on and becomes the de facto varnish generalizator of world domination then we'll rename it. [15:51:10] okay [15:51:52] ok phew [15:52:01] :-) [15:52:13] and not that we are at it [15:52:16] ok so akosiaris, what about making varnishkafka user & group and then addign syslog to the varnishkafka group and making /var/log/varnishkafka group writeable? [15:52:19] now* [15:52:32] let the file be created by rsyslog [15:52:44] rsyslog can create files in /var/log/ [15:52:47] that would do it, but the dir need sto be writable by syslgo then [15:52:49] hmmmMMMmmmm [15:53:01] /var/log/varnishkafkfa.log then? [15:53:05] yes [15:53:08] done. [15:53:08] easy [15:53:11] and no groups/users nothing :-) [15:53:27] akosiaris: nope [15:53:28] just a /etc/rsyslog.d/varnishkafka file ? [15:53:39] rsyslog cannot write to /var/log [15:53:45] ...Ubuntu. [15:53:48] yeah taht's this [15:53:48] https://gerrit.wikimedia.org/r/#/c/78782/16/debian/70-varnishkafka.conf [15:53:49] yes it can [15:53:53] i just tested it [15:54:11] # cat lala.conf [15:54:11] *.* /var/log/all.log [15:54:16] and the file just got created [15:54:44] I can't see how [15:54:45] # sudo -u syslog touch /var/log/test.log [15:54:45] touch: cannot touch `/var/log/test.log': Permission denied [15:54:56] did you test it in a Debian box perhaps? 
:) [15:55:02] Debian isn't as silly [15:55:14] Distributor ID: Ubuntu [15:55:14] Description: Ubuntu 13.04 [15:55:15] Release: 13.04 [15:55:15] Codename: raring [15:55:16] drwxr-xr-x 17 root root 4096 Nov 12 06:25 /var/log/ [15:55:27] it probably has privileges on restart [15:55:30] Ubuntu 12.04.2 LTS \n \l [15:55:33] that it drops afterwards [15:55:36] hm [15:55:50] we care about logrotate/sighup too, though [15:56:00] (even if that's the case) [15:56:40] logrotate's job then [15:56:50] tell rsyslog to restart... [15:56:55] won't be the first [15:57:00] eww :) [15:57:23] /etc/logrotate.d/rsyslog [15:57:31] you are already doing it here anyway [15:57:36] postrotate [15:57:36] reload rsyslog >/dev/null 2>&1 || true [15:57:40] reload, not restart [15:57:43] sighup [15:58:12] anyway [15:58:24] too many people on a single problem [15:58:34] I'll stay out of your away :) [15:59:10] :-) [15:59:30] lemme check what happens on rsyslog reload... damned upstart jobs [15:59:50] resending spam: [15:59:52] does anyone have a chance to look at Bug 56936? en.wp search failures that I can't reproduce or find logs for. [15:59:53] that's what we are doing now anyway? [15:59:54] https://gerrit.wikimedia.org/r/#/c/78782/16/debian/varnishkafka.logrotate [16:00:25] ottomata: ew.... pgrep ? [16:00:26] manybubbles: I don't think anyone has looked at it, no [16:00:36] reload ? restart ? [16:01:17] paravoid: yeah, I've looked at it but no one else has had a chance far as I can tell. [16:01:37] manybubbles: ops typically don't look at bugzilla much, even less so if there's no "ops" keyword.... 
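The permission question in the exchange above (why `sudo -u syslog touch` is denied in a directory listed as `drwxr-xr-x root root`, and why a group-writable directory would help) comes down to the classic owner/group/other check. A minimal sketch, assuming plain POSIX mode bits with no ACLs or capabilities and a non-root caller; the function name and the numeric uid/gid values are hypothetical:

```python
def can_create_in_dir(dir_mode, dir_uid, dir_gid, uid, gids):
    """Return True if a non-root user may create a file in the directory.

    Only one permission class applies: owner, else group, else other.
    """
    if uid == dir_uid:
        bits = (dir_mode >> 6) & 0o7  # owner class
    elif dir_gid in gids:
        bits = (dir_mode >> 3) & 0o7  # group class
    else:
        bits = dir_mode & 0o7         # other class
    # Creating an entry needs write (2) and search/execute (1) on the directory.
    return (bits & 0o3) == 0o3


if __name__ == "__main__":
    SYSLOG_UID, SYSLOG_GID = 101, 103  # hypothetical ids for the syslog user
    # drwxr-xr-x root root: the unprivileged user falls into "other" (r-x).
    print(can_create_in_dir(0o755, 0, 0, SYSLOG_UID, {SYSLOG_GID}))
    # drwxrwxr-x with the directory's group shared by the user: allowed.
    print(can_create_in_dir(0o775, 0, SYSLOG_GID, SYSLOG_UID, {SYSLOG_GID}))
```

This does not explain the case where the file did get created; as suggested in the channel, a daemon that opens its files before dropping privileges would bypass this check at startup.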
[16:01:50] akosiaris: i think there wa sa reason for that…not remembering though [16:02:07] manybubbles: I can tell you, though, that ulsfo (or esams) has nothing to do with all that [16:02:29] as far as I know, mediawiki appservers interface with the search cluster and we have appservers only in one place (at at time) [16:02:32] at a* [16:02:41] the other DCs are just frontend HTTP caching [16:02:51] akosiaris: rsyslogd does a full daemon restart when it is HUPed. [16:02:55] that's from here [16:02:55] http://blog.gerhards.net/2008/10/new-rsyslog-hup-processing.html [16:02:59] but maybe that is old, dunno [16:02:59] so your best luck would be to investigate on the mediawiki and or lsearchd layer [16:03:37] paravoid: that is what I thought. I did investigate the mediawiki/lsearchd layer and I can't reproduce or find logs of it. I'll go whine to -dev:) [16:03:43] ottomata: sure ... but if you wanted to restart, why not call... restart rsyslog ? [16:03:59] the init/upstart scripts are going to do it better anyway [16:03:59] i can [16:04:07] yeah i don't know why we did this right now…. [16:04:08] fine with that [16:04:11] manybubbles: you could add wfDebug calls wherever "An error has occurred while searching: The search backend returned an error: " is printed out, perhaps? [16:04:12] cept paravoid says 'eww' [16:04:13] :p [16:04:26] and i agree, but not sure what else to do [16:04:32] "reload syslog", not "restart syslog" please [16:04:33] seems silly to restart rsyslog on all rotates [16:04:35] (you summoned me!) [16:04:46] sure [16:04:46] but [16:04:49] just fyi: rsyslogd does a full daemon restart when it is HUPed. [16:05:09] (unless that is not true anymore, that quote is from 2008) [16:05:23] i expect it to be true [16:05:23] googling more [16:06:27] /lib/systemd/system/rsyslog.service [16:06:31] heh... I wanna cry [16:06:44] hahaha [16:06:45] thankfully not used ... [16:07:04] oo [16:07:04] In v4, we provide some support for the old-style semantics. 
We introduced a setting $HUPisRestart which may be set to "on" (tradional, heavy operation) or "off" (new, lightweight "file close only" operation). The initial versions had the default set to traditional behavior, but starting with 4.5.1 we are now using the new behavior as the default. [16:07:46] huh [16:07:49] so file close only ? [16:07:51] perfect [16:08:02] it will open the new file then [16:08:03] ok phew [16:08:03] Please note that restart-type HUP is depricated and will go away in rsyslog v5. [16:08:12] we are Version: 5.8.6-1ubuntu8 [16:08:24] yeah ok [16:08:29] so will use reload [16:09:02] ok. however the first time it will not work [16:09:26] after that and with logrotate creating the new files that rsyslog will open after reload [16:09:38] it will be fine [16:09:48] so rsyslog will need a restart the very first time [16:10:00] that might be a good case for postinst of varnishkafka [16:10:09] welllllllllllhm. [16:10:18] will rsyslog create the file even if there is no output yet? [16:10:23] varnishkafka isn't going to start on install [16:10:27] yes [16:10:28] ok [16:10:51] so a single restart rsyslog call in configure state of postinst is ok [16:12:04] ok [16:12:22] akosiaris: [16:12:27] is the pgrep for rsyslog bad? [16:12:28] /usr/bin/pgrep -P 1 rsyslogd >/dev/null [16:12:29] ? [16:12:36] yap [16:12:44] go for a single reload rsyslog there [16:12:46] how to test? status doesn't seem to change error goes [16:12:48] codes* [16:12:50] exit val [16:12:53] always 0 [16:12:54] it will do the right thing anyways [16:13:01] but what if rsyslogd isn't running [16:13:02] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [16:13:02] that's why the check [16:13:09] like if someone shut it down on purpose [16:13:27] then logrotate should not restart it [16:13:35] it would be wrong [16:13:50] riiiight, so it needs to check if rsyslog is running before calling reload [16:13:52] right? 
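The lightweight "file close only" HUP behaviour discussed above can be illustrated with a small Python sketch. It is not rsyslog's implementation; the class and function names are hypothetical, and reopening directly inside the signal handler is a simplification (a real daemon would more likely set a flag and reopen from its main loop).

```python
import signal


class ReopenableLog:
    """Append-only log whose file handle can be closed and reopened."""

    def __init__(self, path):
        self.path = path
        self._fh = open(path, "a")

    def write(self, line):
        self._fh.write(line + "\n")
        self._fh.flush()

    def reopen(self):
        # After logrotate renames the file, the old handle still points at
        # the rotated file. Closing and reopening by path switches to the
        # fresh file (mode "a" creates it if logrotate has not done so).
        self._fh.close()
        self._fh = open(self.path, "a")

    def close(self):
        self._fh.close()


def install_hup_handler(log):
    """Reopen the log on SIGHUP, which is what a postrotate reload relies on."""
    signal.signal(signal.SIGHUP, lambda signum, frame: log.reopen())
```

With this behaviour a rotate only costs a close and an open, which is why a reload in `postrotate` is enough once the daemon is already running.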
[16:14:07] oh reload won't start if stopped [16:14:14] will it? [16:14:15] hm checking [16:14:35] looks like it won't ok [16:14:35] it won't [16:14:49] that is why i said use restart, reload commands [16:14:54] those problems are already solved [16:14:58] ok [16:16:02] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:16:09] is anyone looking at 1013 or should I... huh. that's interesting [16:16:36] telepathy ? [16:16:51] it knew I was going to give it a talking to [16:17:23] rebooted [16:17:27] hah [16:17:34] uh, 1013 wassup, still power problem? [16:17:43] if that's not any of you then *cough* power issues*cough*? [16:17:51] yeah not me [16:17:55] yeah that's my first thought but let's look at things like ganglia to make sure right? [16:19:06] not swap, lots of disk busy at the last (atop) [16:19:52] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [16:19:53] also analytics1011? [16:19:57] ok ... [16:19:59] so. [16:20:01] now this is problem [16:20:13] i suppose same rack ? [16:20:19] same 4 as last time [16:20:21] same rack yeah, these are the same machines [16:20:24] same machines as 4 days ago [16:20:34] nice... [16:20:53] PDU problems again... [16:21:01] how did we make sure last time ? [16:21:06] and we just lost 1013 again [16:21:09] yes [16:21:10] chris was able to tell last time [16:21:18] physical presence I guess [16:21:23] heh [16:21:53] occam says it's the same thing as last time, that's for sure [16:22:06] lemme find that ticket [16:22:42] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:12] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 2.87 ms [16:23:38] cmjohnson1: around? 
we have it seems more of the pdu issue causing reboots [16:24:16] (I have commented on the ticket, 6238, just for the record) [16:25:54] apergos: yes..i am swapping the pdu with one of the new ones destined for row d [16:26:02] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:26:14] ok, crossing fingers [16:26:42] what's your eta for that? [16:26:46] it will be easier for me to t-shoot it or get it replaced...cuz it's going to be a vendor thing and we really can't afford for that [16:26:54] my eta is about an hour or so [16:27:08] awesome [16:27:16] cmjohnson1: lemme know when you want to talk about row reshuffling [16:27:22] PROBLEM - Varnish HTTP text-backend on amssq60 is CRITICAL: HTTP CRITICAL - No data received from host [16:28:15] ok, thanks a lot [16:28:28] ottomata: not going to row reshuffle...going to try and move w/out powering down....moving all side B to new pdu and then will move side A [16:29:02] unless something is backwards which i suspect an1013 is ...all should move without interruption [16:29:07] the reshuffle is for later; there's a ticket about splitting the hosts up so we have some redundancy, later [16:30:41] yeah cmjohnson1, that is to avoid potential dataloss problems if we get a row failure that affects all nodes in it [16:30:47] basically all analytics production nodes are in the same row right now [16:30:51] quote 'production' [16:30:51] !log swapping ps1-c7-eqiad...one side at a time...notifications pending [16:31:05] yippi [16:31:08] Logged the message, Master [16:31:32] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [16:31:33] heh... 
the ganglia_new module is so new that it uses tabs [16:31:42] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:22] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:32:37] (03PS17) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [16:33:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:34:02] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:52] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.003 second response time [16:40:07] (03CR) 10Edenhill: "(1 comment)" [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [16:44:30] Snaps: i just set it to the librdkafka defaults [16:45:30] ottomata: ah, okay. I think its better to comment them out then so that librdkafka defaults will propogate properply. But leave them commented out to indicate what might be worth fiddling with. ? [16:46:19] yeah that is good [16:47:11] will do that for the other vals too [16:49:23] (03PS18) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [16:52:48] (03PS1) 10Manybubbles: CirrusSearch as secondary for nlwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94925 [16:54:49] (03PS1) 10Ottomata: Updating varnishkafka.conf with a few changes and using some actual librdkakfa defaults. [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/94927 [16:55:11] (03CR) 10Ottomata: [C: 032 V: 032] Updating varnishkafka.conf with a few changes and using some actual librdkakfa defaults. 
[operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/94927 (owner: 10Ottomata) [16:58:07] (03PS4) 10Ottomata: Setting up varnishkafka on mobile varnish caches. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 [16:58:39] (03CR) 10jenkins-bot: [V: 04-1] Setting up varnishkafka on mobile varnish caches. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 (owner: 10Ottomata) [16:58:45] (03CR) 10Edenhill: [C: 031] (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [17:01:22] (03CR) 10Edenhill: [C: 031] Setting up varnishkafka on mobile varnish caches. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 (owner: 10Ottomata) [17:01:52] (03CR) 10Chad: [C: 032] Use descriptive heredoc [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94303 (owner: 10Chad) [17:02:02] (03Merged) 10jenkins-bot: Use descriptive heredoc [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94303 (owner: 10Chad) [17:02:53] <^d> manybubbles: You sync'ing and running the scripts or am I? :) [17:03:10] ^d: I'll sync and run if you'll +2 [17:03:17] (03CR) 10Chad: [C: 032] CirrusSearch as secondary for nlwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94925 (owner: 10Manybubbles) [17:03:26] uhhhh [17:03:26] ^d: starting [17:03:27] (03Merged) 10jenkins-bot: CirrusSearch as secondary for nlwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94925 (owner: 10Manybubbles) [17:03:29] :) [17:03:37] hi there :) [17:03:40] <^d> howdy [17:03:42] greg-g: ? [17:04:05] "sync and run" where I thought "run" meant "run away from the computer" ;) [17:04:23] <^d> Deploy then flee! 
[17:04:24] * apergos snickers [17:06:44] (03PS19) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [17:08:47] (03PS5) 10Ottomata: Setting up varnishkafka on mobile varnish caches. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 [17:09:35] !log manybubbles synchronized wmf-config/InitialiseSettings.php 'Enable Cirrus as secondary on nlwiki' [17:09:49] Logged the message, Master [17:09:59] (03CR) 10Edenhill: [C: 031] (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [17:11:30] (03CR) 10Edenhill: [C: 031] Setting up varnishkafka on mobile varnish caches. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 (owner: 10Ottomata) [17:13:29] (03PS1) 10Ottomata: Using interface::add_ip6_mapped on analytics Kafka brokers. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94933 [17:14:13] paravoid: ^ [17:15:04] PROBLEM - Host ps1-c7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:18:30] ^d: search index is under constrution [17:18:36] *construction [17:18:59] <^d> mmk [17:19:00] [[File:Construction_man_digging.gif]] [17:19:45] greg-g: my favorite. also, I'm done syncing files. just needed the one. [17:20:21] manybubbles: thanks for the heads up [17:25:54] PROBLEM - Host wtp1004 is DOWN: PING CRITICAL - Packet loss = 100% [17:30:34] RECOVERY - Host wtp1004 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [17:31:21] <^d> greg-g: I think you wanted https://commons.wikimedia.org/wiki/File:Under_Construction.jpeg :) [17:32:05] <^d> https://commons.wikimedia.org/wiki/File:Lavori.gif is pretty fun [17:33:04] PROBLEM - Host wtp1016 is DOWN: PING CRITICAL - Packet loss = 100% [17:33:08] ^d: actually, no, that's not animated :) [17:33:18] <^d> Lavori.gif is :) [17:33:24] ooo, that is good! 
[17:33:35] <^d> Hah [17:33:36] <^d> https://commons.wikimedia.org/wiki/File:Enobras.gif [17:38:37] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable mwsearch logs' [17:38:43] !log reedy updated /a/common to {{Gerrit|I4baa719b9}}: CirrusSearch as secondary for nlwiki [17:38:51] Logged the message, Master [17:38:55] Lies logmsgbot [17:38:56] All lies [17:39:04] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:04] (03PS1) 10Reedy: Enable mwsearch logs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94934 [17:39:05] Logged the message, Master [17:39:13] <^d> Livehacks are bad yo :p [17:39:14] (03CR) 10Reedy: [C: 032] Enable mwsearch logs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94934 (owner: 10Reedy) [17:39:24] <^d> lobmsgbot just exposes livehackers :p [17:39:26] (03Merged) 10jenkins-bot: Enable mwsearch logs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94934 (owner: 10Reedy) [17:40:04] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 5.036 second response time [17:42:19] cmjohnson1: how's the pdu? :) [17:43:34] (03PS2) 10Yurik: Revert "Revoke Yuri's shell access" and change pub key [operations/puppet] - 10https://gerrit.wikimedia.org/r/94780 [17:44:42] anyone around to +2 pls ^ [17:44:52] i have no shell access in the mean time [17:45:02] can validate that its me on hangout [17:46:40] * greg-g can't [17:47:21] greg-g, ? [17:48:28] mark disabled it a week ago due to some possible security issues, but i'm not sure if he's around now to +2 - isn't he on vacation? [17:48:44] PROBLEM - Host terbium is DOWN: PING CRITICAL - Packet loss = 100% [17:49:05] paravoid ^ [17:49:19] I can't +2 in operations/ [17:49:50] oh :) [17:50:14] RECOVERY - Host terbium is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [17:53:12] ottomata: the new one is in...trying to get all the power correct. 
probably about 30 mins more to get power right...then some configuration is needed [17:54:05] cmjohnson1: I'm not sure if you saw/had time to reply me on list, how come this comes with downtime? don't those machines have redundant PSUs/supplies? [17:54:16] and cables coming from different PDUs? [17:54:30] I was wondering that too [17:54:32] !log reedy synchronized php-1.23wmf2/extensions/MWSearch/ [17:54:45] Logged the message, Master [17:55:30] !log reedy synchronized php-1.23wmf3/extensions/MWSearch/ [17:55:40] (03PS3) 10Ottomata: Revert "Revoke Yuri's shell access" and change pub key [operations/puppet] - 10https://gerrit.wikimedia.org/r/94780 (owner: 10Yurik) [17:55:47] Logged the message, Master [17:55:50] (03CR) 10Ottomata: [C: 032 V: 032] Revert "Revoke Yuri's shell access" and change pub key [operations/puppet] - 10https://gerrit.wikimedia.org/r/94780 (owner: 10Yurik) [17:56:18] ottomata, thx [17:56:22] yurik^ [17:56:22] yup! [17:56:39] did terbium restart? [17:57:13] paravoid: ah, well I can manage :) [17:57:19] ? [17:57:31] what? [17:57:39] 17:57:23 up 7 min, 2 users, load average: 0.65, 0.86, 0.52 [17:57:48] fallout from the PDU replacement? [17:57:49] cmjohnson1: ? [17:57:58] terbium is on C7, yes [17:58:17] but this was unexpected [17:58:30] anyway, that stash cleaning script finished a run [17:58:32] we also have a couple of more important systems on that rack, e.g ytterbium, rdb1002 [17:58:46] deleted like 1.4 million objects [17:58:48] Aaron|home: yup, I noticed, we're fully synced [17:58:56] Aaron|home: I ran another thumb sync [17:59:06] what about timeline-render and score-render? 
[17:59:10] the diff now is 20G, which I think may just be "lost" files [17:59:12] yeah, those too [17:59:18] * Aaron|home was going to make sure those were done today...ah good [17:59:18] and transcoded [17:59:25] feel free to double check :) [18:00:42] ^wikipedia-commons-local-thumb.[0-9a-f]{2}$ [18:00:42] ^wikipedia-..-local-thumb.[0-9a-f]{2}$ [18:00:42] ^wik[a-z]+-.*-local-thumb$ [18:00:42] ^global-.*$ [18:00:42] ^.*-timeline-render$ [18:00:45] ^.*-transcoded(.[0-9a-f]{2})?$ [18:00:47] is what I have [18:00:59] * Aaron|home already did captcha [18:01:12] well that's global-* [18:01:40] !log Rebooting amssq60, stuck in xfs kmem_alloc deadlock [18:01:48] should be fine then [18:01:58] Logged the message, Master [18:02:32] paravoid: wait why is it 20G? [18:02:49] my guess? [18:03:10] files that are in the listings (and added to the filesize counts) but don't actually exist to be copied [18:03:23] as sad as that sounds [18:05:54] Skipped non-stash 8/81/11mpb7ab2ktg.x71daf.2463293..16 [18:06:06] still some files not deleted due to the regex not catching stuff [18:06:18] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 100,000 [18:06:27] why are there so many different formats?? [18:09:27] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:12:57] PROBLEM - Host analytics1017 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:15] (03PS2) 10Ottomata: Remove Kraken-specific varnishncsa instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/91492 (owner: 10Ori.livneh) [18:18:21] (03CR) 10Ottomata: [C: 032 V: 032] Remove Kraken-specific varnishncsa instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/91492 (owner: 10Ori.livneh) [18:20:32] (03CR) 10Ottomata: "Ran this after merging:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91492 (owner: 10Ori.livneh) [18:21:08] (03PS1) 10Mark Bergsma: Reduce threshold for flushing of dirty pages [operations/puppet] - 10https://gerrit.wikimedia.org/r/94943 [18:22:21] mark: interesting. [18:22:51] dunno if it will help, but worth a shot [18:23:03] after 3 boxes with identical symptoms, i'm pretty sure it will happen again ;-) [18:23:05] (03PS2) 10Ottomata: Cleaning up analytics role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/94148 [18:23:10] (03CR) 10Ottomata: [C: 032 V: 032] Cleaning up analytics role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/94148 (owner: 10Ottomata) [18:24:19] (03PS2) 10Mark Bergsma: Reduce threshold for flushing of dirty pages [operations/puppet] - 10https://gerrit.wikimedia.org/r/94943 [18:24:42] (03CR) 10Mark Bergsma: [C: 032 V: 032] Reduce threshold for flushing of dirty pages [operations/puppet] - 10https://gerrit.wikimedia.org/r/94943 (owner: 10Mark Bergsma) [18:27:24] ottomata: I don't think that e.g. the puppetmaster work is relevant to the SoS [18:27:38] (that's you, right?) [18:27:44] yeah probably not [18:27:53] hmm [18:34:07] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [18:34:12] ottomata: is it too late for me to send you a SoS update? [18:34:21] naw we're in it now [18:34:22] send away! [18:34:38] Not that I've done anything that should matter to anyone anyway… come to think of it. [18:35:19] so, nevermind! 
[18:35:26] (03CR) 10GWicke: "Ping ;)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 (owner: 10Lcarr) [18:35:57] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [18:38:27] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [18:39:03] (03PS1) 10Mark Bergsma: Reexec procps on changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/94949 [18:41:16] (03CR) 10Mark Bergsma: "Yep, I'll have a version of this merged before the end of the week. ;)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 (owner: 10Lcarr) [18:52:26] ottomata: the power is stable in ps1-c7-eqiad now..sill have to cfg the pdu but we have redundancy again [18:52:48] cooool [18:52:49] danke [18:53:27] cmjohnson1: hey [18:53:32] cmjohnson1: did you see my question above? [18:53:48] 19:54 < paravoid> cmjohnson1: I'm not sure if you saw/had time to reply me on list, how come this comes with downtime? don't those machines have redundant PSUs/supplies? [18:53:52] 19:54 < paravoid> and cables coming from different PDUs? [18:53:57] (also, we lost terbium in the process) [18:54:04] (but it's back up now) [18:54:31] the pdu lost one side so we didn't have redundancy [18:54:51] don't we have *two* PDUs? [18:55:19] we have a single pdu with 2 sides that are on separate receptacles [18:55:35] okay [18:55:38] and we lost redundancy [18:55:43] yes [18:55:49] why did we lose the analytics boxes, though? [18:56:28] i don't know why we lost analytics boxes...if power was lost in the last couple of hours it happened during the pdu swap [18:56:40] example terbium [18:56:50] no, I mean, don't these boxes have two power supplies? 
[18:57:04] yes..they do have 2 power supplies [18:57:29] each power supply is connected to its own power source ...side A or side B of the pdu [18:57:32] right [18:57:41] that way if we lose 1 side the other remains [18:57:48] so why did they power off, both today and the other day when the problem originally presented? [18:57:52] i did not see any of them off [18:58:21] I don't know why any of them would have powered off [18:58:43] we had reboots as I noted on the ticket [18:59:21] both then (multiple times) and today (multiple times) [18:59:21] apergos: i rebooted it on Friday 1x [19:00:34] something is up with that pdu...not sure what...I swapped it with one destined for row D so we'll either need to order another set of the new ones or see if we can get the broken one replaced. [19:01:03] cmjohnson1: there's a mail thread on the list about this btw :) [19:01:26] due to the fact we bought through dell and servertech couldn't find a match on the S/N swapping was the fastest and best thing to do [19:01:48] paravoid: i read a bit of it this morning. once i finish I will go through and comment [19:02:11] k, thanks :) [19:04:12] cmjohnson1: while you are here, this is kinda unrelated to the PDU problem, but I'd really like to do an analytics box reshuffle as soon as possible [19:05:10] https://rt.wikimedia.org/Ticket/Display.html?id=6279 [19:05:44] ottomata: ok...only issue is space at the moment. Probably will need to wait for row D [19:06:05] in the meantime create a ticket and put it in queue [19:08:18] (03PS1) 10Dereckson: Undeploy SimpleAntiSpam extension. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94956 [19:11:50] RECOVERY - Host wtp1016 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [19:17:08] which queue? 
[19:17:12] cmjohnson1: [19:17:16] this is the ticket [19:17:17] https://rt.wikimedia.org/Ticket/Display.html?id=6279 [19:18:03] !log Created EducationProgram tables on elwiki bug 56771 [19:18:21] Logged the message, Master [19:26:22] oh..okay...sorry ottomata been busy didn't look for it [19:30:01] (03CR) 10Legoktm: [C: 031] Undeploy SimpleAntiSpam extension. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94956 (owner: 10Dereckson) [19:37:34] (03PS3) 10Andrew Bogott: Move android::sdk and packages::ant18 into contint module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94257 [19:40:04] !log reedy synchronized php-1.23wmf3 'Support CIDR ranges in $wgSquidServersNoPurge' [19:40:22] Logged the message, Master [19:44:12] !log reedy synchronized php-1.23wmf2 'Support CIDR ranges in $wgSquidServersNoPurge' [19:44:26] Logged the message, Master [19:48:11] !log reedy updated /a/common to {{Gerrit|I1bb030fce}}: Enable mwsearch logs [19:48:18] (03PS1) 10Reedy: Enable EducationProgram on elwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94964 [19:48:26] Logged the message, Master [19:48:27] ori-l: That's starting to get slightly annoying as it's wrong... [19:48:37] (03CR) 10Reedy: [C: 032] Enable EducationProgram on elwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94964 (owner: 10Reedy) [19:49:02] Reedy: it's wrong? did you file a bug? [19:49:36] Bug for what? It being wrong? 
[19:49:40] RECOVERY - Host ps1-c7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.12 ms [19:51:49] (03CR) 10Dzahn: [C: 032] fix pdf servers in dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/94307 (owner: 10Dzahn) [19:53:40] (03CR) 10jenkins-bot: [V: 04-1] Enable EducationProgram on elwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94964 (owner: 10Reedy) [19:53:47] Oh sod off [19:54:16] (03CR) 10Reedy: [V: 032] Enable EducationProgram on elwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94964 (owner: 10Reedy) [19:55:30] !log reedy synchronized wmf-config/InitialiseSettings.php 'Iab47779a2c0f9fe239676d75a279336156353c4b' [19:55:48] Logged the message, Master [19:56:17] (03CR) 10Hashar: [C: 031] Move android::sdk and packages::ant18 into contint module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94257 (owner: 10Andrew Bogott) [19:57:56] lol Reedy. [20:01:08] so, why is application server cpu up 30% since 19:50? :) [20:01:51] Uhh [20:02:17] EducationProgram isn't that bad is it? [20:02:26] mark: that isn't due to your varnish stuff, right? 
[20:02:45] shouldn't be, I didn't do anything ;) [20:02:49] [all] [test2] [1.22wmf19] [1.22wmf18] [stats/all] [1.23wmf2] [1.23wmf3] [stats/1.22wmf13] [1.23wmf1] [stats/1.22wmf19] [stats/1.22wmf18] [thumb-1.23wmf1] [thumb-1.23wmf3] [thumb-1.23wmf2] [thumb-1.22wmf22] [thumb-1.22wmf21] [thumb-1.22wmf20] [1.22wmf22] [1.22wmf20] [1.22wmf21] [thumb-1.22wmf18] [thumb-1.22wmf19] [stats/1.22wmf20] [stats/1.22wmf21] [stats/1.22wmf22] [stats/1.23wmf1] [stats/1.23wmf2] [stats/1.23wmf3] [ showing 50 [20:02:49] events, show more ] [20:02:56] helpful profiling is helpful [20:03:25] mark: oh, thought you said there'd be an increase in api server usage due to reduced cache hit for loginwiki [20:03:26] !log reedy cleared profiling data [20:03:41] Logged the message, Master [20:03:43] greg-g: that was yesterday, and as of this afternoon that does get cached [20:03:48] didn't make much difference [20:03:56] ah, cool, sorry to interrupt then :) [20:04:19] it had actually been uncached by varnish for a long time, but squid did cache [20:04:28] so should be the same on varnish now [20:04:38] hopefully. [20:04:39] (03PS5) 10Dzahn: (bug 56412) Make all sidebar phrases on Planet translatable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94433 (owner: 10Odder) [20:04:48] varnish is a bit liberal with caching headers sometimes [20:05:30] When you say liberal, what does that mean exactly? [20:06:14] that it ignores some RFC 2616 stuff because it considers itself not a http cache "in the traditional sense" [20:06:23] and as such it used to ignore no-cache, private [20:06:28] until we configured it to not ignore in vcl [20:06:33] i think it's fine now [20:07:45] Oh, right, that [20:07:54] Cache-Control: no-cache in the *request* header, right? 
[20:07:58] (03CR) 10Dzahn: [C: 032] (bug 56412) Make all sidebar phrases on Planet translatable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94433 (owner: 10Odder) [20:08:17] I think I once cited that as an example of an HTTP spec I anecdotally knew was being ignored by caching software for DoS protection [20:08:22] RoanKattouw: nope, response [20:08:29] also in the request yes [20:08:29] wtf [20:08:41] Well in the request it should ignore it [20:08:48] But in the response it absolutely should not [20:09:01] clearing profiling data is taking an age [20:09:38] ori-l: do you know this one on puppet-merge? error: cannot open /var/lib/git/operations/puppet/.git/modules/modules/kafka/FETCH_HEAD: Permission denied [20:09:40] RoanKattouw: yeah, so we're now taking care of that in the wikimedia.vcl [20:09:51] ori-l: it just happened once but next time it worked normal again [20:10:52] has puppet ran in the past few hours on the bast1001? I can't login with the new key [20:13:18] yurik: i thought only some fraction of puppet runs do keys? [20:13:23] Excellent [20:14:11] (03CR) 10Andrew Bogott: [C: 032] Move android::sdk and packages::ant18 into contint module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94257 (owner: 10Andrew Bogott) [20:14:43] !log reedy cleared profiling data [20:14:45] yes it has run recently [20:14:48] That's just a lie [20:15:00] your key has been created over there [20:15:15] ori-l: Done much with professor? [20:15:26] and only some fraction of runs do host keys [20:15:51] apergos, i am still getting pubkey error, and it merged 2+ hours ago [20:16:18] should i wait a bit longer? [20:16:36] the answer to that is always: yes! 
:) [20:16:53] :) [20:17:30] well like I say it your key was added to authorized_keys over there [20:17:58] apergos, yes, but i still can't login :( [20:18:27] tail -f /var/log/auth.log :) [20:18:28] waiting isn't going to fix that [20:18:45] yurik: btw, your commit msg really sucks [20:18:46] yes, but if it will make mark happy... [20:19:36] jeremyb, :( [20:20:04] 3FYSE2s= is the last part of your correct public key right? [20:20:07] it was an automatic msg at first, and later i added that i regened the key [20:20:52] apergos, yes [20:21:01] double check your private key then [20:21:20] apergos, i am able to login into the bastion with the same key [20:21:36] into which bastion [20:21:46] labs [20:21:46] sorry but weren't we just talking about bast1001? [20:21:52] correct [20:21:53] uh wait [20:22:00] you are using the same key in labs? [20:22:15] i added it to test, will remove once it works [20:22:35] i have two keys - one for labs, one for prod [20:22:49] labs is easier to manage - adding it via web interface [20:24:33] (03PS2) 10Reedy: (bug 56760) Update logo for Korean Wikibooks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94413 (owner: 10Odder) [20:24:40] (03CR) 10Reedy: [C: 032] (bug 56760) Update logo for Korean Wikibooks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94413 (owner: 10Odder) [20:26:13] (03Merged) 10jenkins-bot: (bug 56760) Update logo for Korean Wikibooks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94413 (owner: 10Odder) [20:26:27] well I have nothing more useful I can tell you from this end; the key is there, in the right file, with the right permissions, and the log tells me that you have a public key issue [20:28:23] (03CR) 10Hashar: "Filled https://bugzilla.wikimedia.org/show_bug.cgi?id=56955 about it with a patch for wmerrors in https://gerrit.wikimedia.org/r/94978" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93009 (owner: 10Tim Starling) [20:28:31] !log deployed new 
AbortNewAccount hook on wikitech [20:28:50] Logged the message, Master [20:29:26] yurik: did you try with ssh -vvv -i ? [20:30:07] yurik: also, ssh-keygen -lf path/to/file/passed/to/-i [20:30:22] jeremyb, thx, checking [20:31:13] it might be worth doing that for your labs login too, to compare [20:31:14] (03PS2) 10Reedy: (bug 56761) Add shortcut for NS_PROJECT for kowiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94417 (owner: 10Odder) [20:31:21] (03CR) 10Reedy: [C: 032] (bug 56761) Add shortcut for NS_PROJECT for kowiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94417 (owner: 10Odder) [20:31:46] apergos, jeremyb this is weird - i was able to login via SSH command, but not via putty, even though putty works to labs. [20:32:21] putty has a window where you can get more details [20:32:21] so first things first: note to yourself to test in future with ssh :-D [20:32:44] apergos: no, note to self to not use windows! :-D [20:32:53] yurik: in the putty windows check the setting under ssh section [20:32:54] heh, true - but on Windows one doesn't usually test that way as putty generally has a good level of trust [20:33:09] if I thought I could win that one, I'd give him the pen and paper for that note :-D [20:33:20] :) [20:33:23] putty has a key format issue or has had in the past [20:33:43] Yeah [20:33:50] i would switch to linux in a heartbeat the moment it has 1) total commander (mc is nowhere near as powerful), and 2) winmerge [20:34:05] It can do an openssh key or a ssh.com key [20:34:11] http://unix.stackexchange.com/questions/74545/what-difference-between-openssh-key-and-putty-key [20:34:29] Reedy: also it's own putty format? 
[20:34:37] yurik: to solve the key stuff mobaextrem might help you [20:35:05] * mobaxterm [20:35:14] Yeah [20:35:17] puttygen will fix it [20:35:38] as for the rest of the switching stuff, all i can say is blah [20:36:17] (03CR) 10Andrew Bogott: "If you have tested and validated this on labs, let me know and I will merge." [operations/puppet] - 10https://gerrit.wikimedia.org/r/90117 (owner: 10Nemo bis) [20:36:28] I'm gonna check out (basically I'm pretty checked out already, but now I'm officially non-productive) ... might peek in here and there but not be useful [20:36:43] you should be able to get it sorted now [20:38:04] (03CR) 10Nemo bis: "andrewbogott, how does one do that?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90117 (owner: 10Nemo bis) [20:38:59] Nemo_bis, do you want me to talk you through it right now? [20:39:20] * jeremyb still has to point to putty's window with detailed logs. that should be able to tell you e.g. which key it's try to use. [20:39:42] andrewbogott: might i try and see if your training worked for me and i'll guide him? :) [20:40:01] matanya: yeah, sounds good! If he's around. [20:40:12] * andrewbogott adds Nemo_bis to puppet project [20:40:47] (03Merged) 10jenkins-bot: (bug 56761) Add shortcut for NS_PROJECT for kowiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94417 (owner: 10Odder) [20:40:49] …maybe [20:41:07] (03CR) 10Matanya: "https://wikitech.wikimedia.org/wiki/Puppet_coding#labs_testing" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90117 (owner: 10Nemo bis) [20:41:20] at least for now. [20:41:25] ok, works now, i think its a weird bug in putty. jeremyb & apergos, & matanya, thank you for your help!!! [20:42:03] yurik: great. 
[20:42:06] yurik: now you can review my patch on the pywikibot :) [20:43:21] (03CR) 10Nemo bis: "I'm not sure how that applies here: is there a cluster available for me to test on in labs, with a maintenance host for crontabs where to " [operations/puppet] - 10https://gerrit.wikimedia.org/r/90117 (owner: 10Nemo bis) [20:43:31] (03PS2) 10Reedy: (bug 56807) Localize logo for Welsh Wikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94438 (owner: 10Odder) [20:43:37] (03CR) 10Reedy: [C: 032] (bug 56807) Localize logo for Welsh Wikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94438 (owner: 10Odder) [20:43:38] yurik: re msg: right. i'm thinking maybe you shouldn't use automatic msg if you're mixing in manual tweaks? or maybe the double negative confused me? ("revert revoke access" instead of just saying "restore access") also i think maybe i misread the first iteration of it yesterday [20:45:12] strictly speaking it's not a revert since it's a new key [20:45:24] * apergos whacks self for continuing to chime in in the work channel [20:45:25] gone! [20:45:43] * yurik shoots himself [20:46:08] yurik: please revert the shooting [20:47:24] jeremyb, only after https://gerrit.wikimedia.org/r/#/c/88261/ :) [20:47:39] otherwise dfoy & dr0ptp4kt won't let me stay alive [20:53:03] andrewbogott: regarding https://gerrit.wikimedia.org/r/#/c/90760/ I havn't found that .pep8 file, mind showing me where it might be? 
[20:53:10] (03CR) 10Reedy: [V: 032] (bug 56807) Localize logo for Welsh Wikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94438 (owner: 10Odder) [20:53:55] sigh https://gdash.wikimedia.org/dashboards/reqerror/ [20:54:01] (03PS2) 10Reedy: (bug 56899) Extra NS for Collection on enwikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94723 (owner: 10Odder) [20:54:06] (03CR) 10Reedy: [C: 032] (bug 56899) Extra NS for Collection on enwikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94723 (owner: 10Odder) [20:56:41] ksnider: hi, any news for me (rt access)? [20:57:29] matanya: Yes, actually - Legal literally got me the NDA 30 minutes ago. :) [20:57:58] ksnider: cool, thanks. great timing :) [20:57:58] matanya, do you have the pep8 tool installed so you can tell what the failure is? [20:58:14] yes andrewbogott, longer than 80 car's [20:58:26] Yeah, ok, probably fine to suppress that then. Just a second... [20:58:27] (03CR) 10Reedy: [V: 032] (bug 56899) Extra NS for Collection on enwikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94723 (owner: 10Odder) [20:58:44] What's up with Jenkins tonight? [20:58:53] he is tired [20:58:53] andrewbogott: you can see the failure in the web UI [20:58:55] (03PS2) 10Reedy: Undeploy SimpleAntiSpam extension. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94956 (owner: 10Dereckson) [20:59:07] matanya: cat puppet/files/mirror/.pep8 <- ? [20:59:12] (03CR) 10Reedy: [C: 032] Undeploy SimpleAntiSpam extension. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94956 (owner: 10Dereckson) [20:59:55] andrewbogott: damn, i'm stupid [21:00:07] Well, it is hidden after all :) [21:00:47] (03CR) 10Reedy: [V: 032] Undeploy SimpleAntiSpam extension. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94956 (owner: 10Dereckson) [21:02:03] !log reedy synchronized wmf-config/ [21:02:18] Logged the message, Master [21:04:05] !log Profiling data clearing seems to work, but echo "-truncate" | nc -q0 -u professor.pmtpa.wmnet 3811 doesn't return... [21:04:22] Logged the message, Master [21:04:47] Reedy: ori-l was doing /something/ to professor recently [21:05:02] ori-l: How dare you. [21:06:08] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:06:58] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 2.977 second response time [21:14:02] what's up with professor? [21:14:07] Ryan_Lane: where is the code for the stuff you were working on? [21:15:04] Aaron|home: he says it's checked in but that he can add you as a reviewer [21:15:12] he says he also has a few things he needs to commit [21:17:08] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms [21:21:15] !log aaron synchronized php-1.23wmf3/maintenance/cleanupUploadStash.php '6461a943ece41ac6914fc60ec16854f0591f0f2b' [21:21:36] Logged the message, Master [21:30:57] ori-l: The clear-profile script seems to just hang. Stealing the netcat command and running it manually also hangs [21:31:09] It used to only take a few seconds [21:31:25] I thought originally it might've been because it had a lot of data (many mw versions etc) [21:31:32] But running it again gives the same behaviour [21:34:04] !log aaron cleared profiling data [21:34:25] Logged the message, Master [21:40:54] (03PS1) 10Matanya: Merge "Move android::sdk and packages::ant18 into contint module." into production [operations/puppet] - 10https://gerrit.wikimedia.org/r/95055 [21:42:04] (03Abandoned) 10Matanya: Merge "Move android::sdk and packages::ant18 into contint module." 
into production [operations/puppet] - 10https://gerrit.wikimedia.org/r/95055 (owner: 10Matanya) [21:44:57] andrewbogott: thank you for the ant18 modularization :-] [21:45:51] (03PS9) 10Matanya: download: convert into a module and clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/90760 [21:48:19] andrewbogott: ^ [21:48:48] (03CR) 10Dzahn: "Odder, thanks for this. It works and is live now. -- Daniel" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94433 (owner: 10Odder) [21:50:07] argh. Need apache config handholding :( ... This directive is declared in one VirtualHost, but somehow affecting the other VirtualHost. RewriteRule ^/$ /w/index.php [21:50:58] PROBLEM - Varnish HTTP text-backend on amssq48 is CRITICAL: HTTP CRITICAL - No data received from host [21:51:08] PROBLEM - Varnish HTCP daemon on amssq48 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:51:28] PROBLEM - Varnish traffic logger on amssq48 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:04] awight: where are you looking at that? [21:52:31] that's like the "Primary wiki redirector" in every wiki vhost [21:53:09] or what URL are you using as a test that's not behaving correctly? [21:53:18] mutante: I'm hacking on mediawiki/vagrant, for the fundraising-tech team. I've added a second server... 
[21:53:48] what are the ServerName's [21:53:55] fwiw, https://gerrit.wikimedia.org/r/#/c/94950/ [21:53:58] since you say one affects the other [21:54:01] the servernames are "devwiki" and "crm" [21:54:15] (03PS10) 10Matanya: download: convert into a module and clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/90760 [21:54:28] apachectl -S gives me: [21:54:29] wildcard NameVirtualHosts and _default_ servers: [21:54:29] *:80 is a NameVirtualHost default server mediawiki-vagrant.dev (/etc/apache2/sites-enabled/000-default:1) port 80 namevhost mediawiki-vagrant.dev (/etc/apache2/sites-enabled/000-default:1) port 80 namevhost crm (/etc/apache2/sites-enabled/crm:8) port 80 namevhost devwiki (/etc/apache2/sites-enabled/devwiki:8) [21:55:37] awight: errrrr, you could mention from the beginning that you're talking about vagrant :) [21:55:43] (03PS1) 10Cmjohnson: Removing cp1021-1036 mgmt ip [operations/dns] - 10https://gerrit.wikimedia.org/r/95059 [21:55:52] hehe that would have closed the door on further help [21:55:53] eh, i don't really know about the vagrant part [21:56:02] but it sounds like you're talking to the default vhost [21:56:03] well, these are just puppet rules [21:56:06] instead of the one you want [21:56:16] nah, the default vhost is the standard apache uselessness [21:56:32] yea, we disable it in some puppet manifests [21:56:32] I get that if the hostname is not in (crm,devwiki) [21:56:48] (03CR) 10Cmjohnson: [C: 032] Removing cp1021-1036 mgmt ip [operations/dns] - 10https://gerrit.wikimedia.org/r/95059 (owner: 10Cmjohnson) [21:57:18] but if you go to devwiki you get ? 
[21:57:26] !log dns update [21:57:43] Logged the message, Master [21:57:49] mutante: http://devwiki/ is correct, but http://crm/ forwards to crm/w/index.php [21:58:19] awight: paste the whole sites-enabled/crm somewhere" [22:00:53] mutante: http://paste2.org/aOJh51F0 [22:09:05] * awight is looking for an apache flag to dump configuration lines in the order they are loaded [22:12:01] mutante: pm me please when you can [22:14:26] self: mod_info reports that only devwiki has the rewrite rule. [22:16:47] awight: how about curl -H 'Host: crm' localhost ?. also.. see any of these maybe in logs " Could not reliably determine the server's fully qualified domain name" ? [22:16:53] matanya: ok, doi [22:16:58] ng so [22:17:42] mutante: thanks, I think I'm narrowing it down to a Firefox problem ! [22:18:01] It doesn't like the hostname with a single component [22:19:37] Yep. [22:31:42] (03PS1) 10Nemo bis: Use log scale for 5xx errors in "(cdn) HTTP Error Rate" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95064 [22:36:28] (03PS3) 10Ryan Lane: Add shadow_reference support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94680 [22:45:08] (03PS1) 10Nemo bis: Also add 2 months and 1 year graphs in "(cdn) HTTP Error Rate" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95068 [22:47:28] !log maxsem synchronized php-1.23wmf2/extensions/MobileFrontend/ 'Fix fatal' [22:47:43] Logged the message, Master [22:49:07] (03PS5) 10Ryan Lane: Add mediawiki module for fetch and checkout hooks [operations/puppet] - 10https://gerrit.wikimedia.org/r/94682 [22:50:20] !log maxsem synchronized php-1.23wmf3/extensions/MobileFrontend/ 'Fix fatal' [22:50:39] Logged the message, Master [22:51:17] (03CR) 10MaxSem: [C: 032] Task 1355: Enable the infobox experiment (story 1301) on enwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93988 (owner: 10Jdlrobson) [22:51:29] (03Merged) 10jenkins-bot: Task 1355: Enable the infobox experiment (story 1301) on 
enwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93988 (owner: 10Jdlrobson) [22:53:54] ksnider, LeslieCarr - FYI just added a notice to the RFP that the deadline has passed and we're now in the process of selecting a vendor. https://wikimediafoundation.org/w/index.php?title=RFP/2013_Datacenter&diff=94296&oldid=93993 [22:54:10] Eloquence: Thanks [22:54:22] (03CR) 10GWicke: [C: 031] Enable VisualEditor on board, collab, office and wikimaniateam wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93780 (owner: 10Jforrester) [22:54:59] (03CR) 10GWicke: "As discussed on IRC and logged on https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Parsoid was deployed last Thursday." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93780 (owner: 10Jforrester) [22:55:00] (03CR) 10Catrope: [C: 031] Create visualeditor-default.dblist to simplify config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94471 (owner: 10Jforrester) [22:55:16] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/#/c/93988/' [22:55:27] (03CR) 10Catrope: [C: 031] Enable VisualEditor on board, collab, office and wikimaniateam wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93780 (owner: 10Jforrester) [22:55:29] Logged the message, Master [22:56:14] (03CR) 10Catrope: [C: 04-2] "Do not deploy until December 2nd, when I'll have the time to actually monitor the fallout of this change (in theory there should be none, " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94472 (owner: 10Jforrester) [22:56:21] (03CR) 10MaxSem: [C: 032] Disable slow UW queries [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94404 (owner: 10MaxSem) [22:56:30] (03Merged) 10jenkins-bot: Disable slow UW queries [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94404 (owner: 10MaxSem) [22:58:48] (03CR) 10Catrope: [C: 031] Make VisualEditor namespaces extend, not replace, default 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94420 (owner: 10Jforrester) [22:58:56] (03PS6) 10Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 [22:59:01] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/94404' [22:59:17] Logged the message, Master [23:03:39] (03PS2) 10Ryan Lane: Make appserver common a mediawiki deploy target [operations/puppet] - 10https://gerrit.wikimedia.org/r/94832 [23:06:24] "Your cache administrator is nobody." [23:06:34] https://git.wikimedia.org/ is not working :/ [23:06:38] thanks Eloquence [23:08:20] ^d: ^^ [23:08:38] Error: 500, Internal Server Error at Tue, 12 Nov 2013 23:08:29 GMT [23:08:55] aude: Looked at the link of nobody? :P [23:09:57] heh [23:10:07] !log maxsem Started syncing Wikimedia installation... : Demo deployment, no actual changes [23:10:23] Logged the message, Master [23:10:42] aude: root@wikimedia.dev.null [23:10:48] :) [23:17:40] !log upgrading wikitech to 1.23wmf3 [23:17:55] Logged the message, Master [23:18:00] <^d> Reedy: Restarting. [23:18:54] RoanKattouw_away: done [23:20:50] <^d> !log gitblit process had hung on antimony, restarted [23:21:06] Logged the message, Master [23:21:23] gitblit wfm [23:23:26] <^d> jeremyb: Yeah cuz I restarted it :) [23:23:45] ^d: yeah, just confirming it's working :) [23:24:17] aude: fixededed [23:25:10] (03PS7) 10Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 [23:25:47] Reedy: great [23:25:52] thanks ^d [23:25:54] <^d> yw [23:26:10] errr, thanks nobody ;) [23:28:51] (03PS3) 10Ryan Lane: Make appserver common a mediawiki deploy target [operations/puppet] - 10https://gerrit.wikimedia.org/r/94832 [23:31:37] !log upgrading wikitech-static to 1.23wmf3 [23:31:51] Logged the message, Master [23:33:06] !log maxsem Finished syncing Wikimedia installation... 
: Demo deployment, no actual changes [23:33:25] Logged the message, Master [23:33:46] (03CR) 10MaxSem: [C: 032] Load ZeroRatedMobileAccess only where currently supported. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94250 (owner: 10Dr0ptp4kt) [23:34:31] (03Merged) 10jenkins-bot: Load ZeroRatedMobileAccess only where currently supported. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94250 (owner: 10Dr0ptp4kt) [23:36:37] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/#/c/94250/' [23:36:51] Logged the message, Master [23:57:14] This url is responding with a Wikimedia Tech error [23:57:14] https://git.wikimedia.org/blob/oojs%2Fcore.git [23:57:21] Forwarded for: 2001:980:a565:1:79ef:9a58:f00b:db94, 127.0.0.1 [23:57:21] Error: 500, Internal Server Error at Tue, 12 Nov 2013 23:56:58 GMT [23:57:36] Request: GET http://git.wikimedia.org/blob/oojs%2Fcore.git, from 127.0.0.1 via cp1044 cp1044 ([127.0.0.1]:80), Varnish XID 1195859957 [23:57:43] That seems to be because gitblit doesn't know about that repo [23:57:48] It doesn't show up in the search for instance [23:57:53] https://git.wikimedia.org/summary/oojs%2Fcore.git [23:57:57] that url works fine [23:58:03] wtf [23:58:14] the url is bad, I provided blob without a file or branch [23:58:22] aha [23:58:22] but it shouldn't fail like that [23:58:27] That still shouldn't be a 500 [23:58:27] gitblit has a 404 handler [23:58:29] Maybe a 400 [23:58:31] indeed