[00:00:05] RoanKattouw, ^d, marktraceur, MaxSem: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141217T0000). Please do the needful. [00:00:16] we will need a few more minutes for our stuff [00:00:44] The config. change can go out now, though [00:01:18] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:03:52] * bd808 is here [00:04:17] No one yet jumped in for the SWAT :P [00:09:32] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:11:17] (PS1) Yuvipanda: Add extra columns to report [software/labsdb-auditor] - https://gerrit.wikimedia.org/r/180358 [00:13:44] (PS2) Yuvipanda: Add extra columns to report [software/labsdb-auditor] - https://gerrit.wikimedia.org/r/180358 [00:14:56] (CR) Yuvipanda: [C: 2 V: 2] Add extra columns to report [software/labsdb-auditor] - https://gerrit.wikimedia.org/r/180358 (owner: Yuvipanda) [00:15:11] !log killed runJobs procs on mw1015 with init as parent [00:15:14] Logged the message, Master [00:17:17] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:17:37] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 2 failures [00:19:58] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 3 failures [00:21:20] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:23:28] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [00:24:51] !log tstarling Synchronized php-1.25wmf12/extensions/SecurePoll/includes/crypt/Crypt.php: tallying fix (duration: 01m 04s) [00:24:56] Logged the message, Master [00:26:42] RoanKattouw, ^d, marktraceur, MaxSem: Anybody doing SWAT today?
[00:26:49] eh [00:27:07] last time I looked, it was empty [00:27:12] okay [00:27:24] <^d> Somebody did? [00:27:24] ebernahrdson, aude - yt? [00:27:36] MaxSem: We're here [00:27:37] here [00:27:55] probably 2-3 more minutes [00:28:02] <^d> Oh that was last week [00:28:03] can do hoo's config patch first [00:28:45] MaxSem: yup [00:30:32] (CR) MaxSem: [C: 2] Update entity suggester blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/179469 (owner: Hoo man) [00:30:47] (Merged) jenkins-bot: Update entity suggester blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/179469 (owner: Hoo man) [00:31:09] * aude is waiting for jenkins [00:33:16] PROBLEM - puppet last run on mw1010 is CRITICAL: CRITICAL: Puppet has 11 failures [00:33:21] uh-oh [00:33:42] sync is so slow, do we have overloaded appservers? [00:33:56] !log maxsem Synchronized wmf-config/Wikibase.php: https://gerrit.wikimedia.org/r/179469 (duration: 01m 22s) [00:34:02] Logged the message, Master [00:34:02] ^^^^ [00:35:03] (CR) MaxSem: [C: 2] Convert from wfErrorLog to MWLogger logging [mediawiki-config] - https://gerrit.wikimedia.org/r/180082 (owner: BryanDavis) [00:35:21] (Merged) jenkins-bot: Convert from wfErrorLog to MWLogger logging [mediawiki-config] - https://gerrit.wikimedia.org/r/180082 (owner: BryanDavis) [00:36:00] MaxSem: https://gerrit.wikimedia.org/r/#/c/180368/ [00:36:18] TimStarling, there's an uncommitted live hack by you...
[00:36:38] also, I didn't actually deploy ^^^ :P [00:36:40] yeah, I'm doing it properly to [00:36:43] too [00:36:47] just takes half an hour [00:37:03] I'm sitting around waiting for jenkins to approve the cherry pick [00:39:07] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [00:42:00] TimStarling, I see no patches by you in zuul [00:42:10] no, it merged it eventually [00:42:44] now I am up to step 53 [00:42:56] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: Puppet has 2 failures [00:43:36] now waiting for this one to merge: https://gerrit.wikimedia.org/r/#/c/180371/ [00:43:45] MaxSem: It look 01m 22s to sync a single file? [00:43:57] * bd808 goes to look at ganglie [00:44:00] yep, some hosts are overloaded, apparently [00:44:19] TimStarling, I personally just force-merge submodule changes:) [00:44:25] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:44:41] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:44:57] MaxSem: :( jobrunners -- http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Jobrunners%2520eqiad&tab=m&vn=&hide-hf=false [00:45:03] ok [00:45:28] because it's never going to produce any meaningful result anyway:P [00:45:58] step 55 now [00:47:05] oh never mind, can't do it [00:47:10] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:47:33] there's undeployed changes from bd808 [00:48:03] MaxSem was pulling them in for the swat deploy [00:48:17] and then noticed the live hack... 
[00:48:48] well, the live hack has been reset now, if you want to continue [00:49:14] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:49:25] I still observe + $wgSecurePollShowErrorDetail = true; [00:49:58] gone [00:50:38] thanks:) [00:51:22] !log maxsem Synchronized wmf-config/: (no message) (duration: 00m 52s) [00:51:28] Logged the message, Master [00:51:35] hoo, aude, bd808 ^^^ [00:52:04] MaxSem: I see xff data that I expected [00:52:14] so lgtm [00:53:10] thanks [00:53:17] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: Puppet has 27 failures [00:53:43] (and then we need https://gerrit.wikimedia.org/r/#/c/180368/) [00:53:58] !log maxsem Synchronized php-1.25wmf12/autoload.php: https://gerrit.wikimedia.org/r/#/c/180214/ part 1 (duration: 00m 26s) [00:54:04] Logged the message, Master [00:55:59] !log maxsem Synchronized php-1.25wmf12/includes/: https://gerrit.wikimedia.org/r/#/c/180214/ part 2 (duration: 01m 38s) [00:56:02] bd808, ^^^ [00:56:05] Logged the message, Master [00:57:25] MaxSem: Yup. irc notifications are working again. 
\o/ [00:57:48] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 4 failures [00:59:17] !log maxsem Synchronized php-1.25wmf12/extensions/Flow/: https://gerrit.wikimedia.org/r/#/c/180303/ (duration: 00m 41s) [00:59:22] Logged the message, Master [00:59:23] ebernahrdson, ^^^ [00:59:52] MaxSem: thanks testing [01:00:03] fatal monitor is full of failures connecting to mysql on 10.64.32.25 [01:00:23] MaxSem: works like a charm, thanks [01:01:22] !log maxsem Synchronized php-1.25wmf12/extensions/Wikidata/: https://gerrit.wikimedia.org/r/180368 (duration: 00m 59s) [01:01:26] Logged the message, Master [01:01:26] aude, hoo ^^^ [01:01:30] checking [01:01:36] checking [01:01:53] looks good [01:02:29] thanks [01:03:03] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 7 failures [01:05:38] (PS1) Kaldari: Turn off Main Page special casing on en.wiki Beta Labs for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/180376 [01:06:06] db1055 was very sad but seems to be getting better -- http://ganglia.wikimedia.org/latest/?c=MySQL%20eqiad&h=db1055.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [01:06:42] uh huh [01:09:12] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 9 failures [01:09:51] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:11:35] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:15:13] so speaking of overloaded hosts and jobrunners [01:15:33] is it normal to have runJobs.php fork off children that reparent to init (aren't being waited on by another parent)? [01:15:37] e.g.: [01:15:37] apache 29512 1 0 00:10 ? 00:00:28 /usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobRunnerService --config-file=/etc/jobrunner/jobrunner.conf [01:15:40] apache 30659 29512 0 00:18 ?
00:00:00 sh -c nice -19 php /srv/mediawiki/multiversion/MWScript.php runJobs.php --wiki='commonswiki' --type='cirrusSearchLinksUpdatePrioritized' --maxtime='60' --memory-limit='300M' --result=json [01:15:44] apache 30663 30659 0 00:18 ? 00:00:18 php /srv/mediawiki/multiversion/MWScript.php runJobs.php --wiki=commonswiki --type=cirrusSearchLinksUpdatePrioritized --maxtime=60 --memory-limit=300M --result=json [01:15:48] apache 31221 1 0 00:13 ? 00:00:00 php /srv/mediawiki/multiversion/MWScript.php runJobs.php --wiki=commonswiki --type=gwtoolsetUploadMediafileJob --maxtime=60 --memory-limit=300M --result=json [01:16:20] the second-to-last line there is a job that has a relationship back to jobrunner, but the last is detached and parented directly to init [01:17:39] (I killed several such jobs on mw1015 earlier to bring runaway oom conditions back under control. it was that or let the kernel kill them randomly, and these were all PID1-as-parent and had been running for hours longer than the others) [01:20:10] bblack: That doesn't sound like expected behavior to me but Aaron would know for sure [01:21:35] The code for the whole jobrunner is at https://github.com/wikimedia/mediawiki-services-jobrunner/blob/master/redisJobRunnerService [01:22:14] * bd808 is creeped out by long running php processes [01:23:58] * gwicke shares that sentiment [01:24:42] I'm not sure I feel much better about long running javascript processes. ;) [01:25:11] But I've never read the v8 source code [01:25:42] I don't see anything in there that explicitly wants to detach jobs like that. My guess would be that redisJobRunnerService exited/died with several outstanding jobs, which became long-running unmanaged orphans, and then a new redisJobRunnerService replaced the old and started its own jobs. 
[01:26:19] but in my example past above, the detached job started at 00:13 and the runnerService has been alive since 00:10 [01:26:26] s/past/paste/ [01:26:59] (so that kills that theory, unless multiple redisJobRunnerService ended up running in parallel) [01:30:06] maybe proc_close() went nuts at some point and made an orphan proc? [01:30:26] then again maybe runJobs has code inside it to daemonize away from the runner [01:31:15] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [01:31:15] I think the only ways for those child procs to end up with PID1 as the parent is either (a) the parent actually ceases to exist while they're still running or (b) the child goes through some hoops to explicitly detach via double-fork mechanisms [01:31:43] (like a normal daemon does) [01:32:43] runJobs doesn't do anything fancy/crazy like that. It is a normal MW maintenance script [01:34:16] (CR) Ori.livneh: [C: 1] "I'd prefer '--headers' to '--request', but up to you." [puppet] - https://gerrit.wikimedia.org/r/180155 (owner: Giuseppe Lavagetto) [01:34:54] well, maybe :) [01:35:05] PROBLEM - RAID on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:35:08] PROBLEM - configured eth on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:35:09] PROBLEM - nutcracker port on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:35:09] PROBLEM - dhclient process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:35:19] wtf? [01:35:34] the processes I pasted above are still running on mw1015. I checked the process group of detached child 31221, and it's 29512.
[01:35:54] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=large&c=Jobrunners+eqiad&m=cpu_report&s=by+name&mc=2&g=mem_report [01:36:07] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:36:07] which means the running redisJobRunnerService did launch that command, and it had to have detached itself with a pair of forks (but didn't bother to setsid() to hide the evidence in the process group id) [01:36:11] memory usage on the job runners is going crazy. [01:36:22] they're all on zend. [01:36:35] what got deployed today? [01:37:05] between 10 and 12 utc [01:38:02] RECOVERY - nutcracker port on mw1008 is OK: TCP OK - 0.000 second response time on port 11212 [01:38:13] gwicke: did you change the throttling setting for parsoid or restbase? [01:38:13] RECOVERY - configured eth on mw1008 is OK: NRPE: Unable to read output [01:38:18] RECOVERY - dhclient process on mw1008 is OK: PROCS OK: 0 processes with command name dhclient [01:38:19] RECOVERY - RAID on mw1008 is OK: OK: no RAID installed [01:41:03] https://github.com/wikimedia/mediawiki/blob/555e0b4b3c517cfb565ad275d9600806cd3cd50a/maintenance/runJobs.php#L52 [01:41:13] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [01:41:42] ^ it does have code to at least single-fork() off some children. maybe it's just that runJobs forked a child copy of itself and then exited, breaking the link between the child runJobs and the grandparent redisJobRunnerService [01:41:56] ori: no [01:42:22] ori: you can easily check parsoid load at https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Parsoid%2520eqiad&tab=m&vn= [01:42:32] it looks like it's blocked on the API [01:42:32] there are a lot of procs blocked on tidy [01:44:20] is hhvm still shelling out? 
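The single-fork explanation discussed above (a process forks a copy of itself and exits, so the copy is reparented while keeping the original process group) can be reproduced with a small sketch. This is only an illustration in Python, not the jobrunner or runJobs.php code; the function name `spawn_orphan` is hypothetical, and on hosts with a subreaper the new parent may be something other than PID 1.

```python
import os
import time


def spawn_orphan():
    """Fork a child that forks a grandchild and then exits immediately.

    Returns (middle_pid, grandchild_ppid, grandchild_pgid, our_pgid).
    """
    read_fd, write_fd = os.pipe()
    our_pgid = os.getpgrp()

    middle = os.fork()
    if middle == 0:
        # Middle process: stands in for a script that forks and exits.
        os.close(read_fd)
        middle_pid = os.getpid()
        if os.fork() == 0:
            # Grandchild: wait (up to 5 s) until the middle process is gone.
            deadline = time.time() + 5
            while os.getppid() == middle_pid and time.time() < deadline:
                time.sleep(0.01)
            # Report the new parent and the unchanged process group.
            report = "%d %d" % (os.getppid(), os.getpgrp())
            os.write(write_fd, report.encode())
            os._exit(0)
        os._exit(0)  # exit without waiting for the grandchild

    # Original process: reap the middle process, then read the report.
    os.close(write_fd)
    os.waitpid(middle, 0)
    with os.fdopen(read_fd) as pipe:
        ppid, pgid = map(int, pipe.read().split())
    return middle, ppid, pgid, our_pgid


if __name__ == "__main__":
    middle, ppid, pgid, our_pgid = spawn_orphan()
    print("middle pid:", middle)
    print("grandchild parent after middle exited:", ppid)
    print("grandchild process group:", pgid, "(ours:", our_pgid, ")")
```

Because nothing calls setsid() or setpgid(), the process group survives the reparenting, which matches the evidence quoted in the channel: the detached job had PID 1 as parent but still carried the runner's process group id.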
[01:45:24] jobrunners aren't on hhvm [01:45:27] zend is still shelling out [01:45:40] oh, job runners [01:45:46] there are >200 tidy procs on the job runners [01:45:52] parsoid is basically only doing some http requests there [01:46:05] no tidy, not parsing, no cpu usage [01:46:37] core refreshLinks does full rendering though [01:47:32] it started almost exactly at 12:00 UTC, and there are no SAL entries between 9:31 and 13:20 [01:47:44] high load on job runners would explain the low parsoid load though, as the http requests to the parsoid service are probably not happening any more at the normal rate [01:48:55] 1/3 of last 10k jobs logged on fluorine are ChangeNotification [01:49:15] tail -10000 /a/mw-log/runJobs.log | awk '{print $5}'|sort|uniq -c|sort -n [01:50:06] oh no [01:50:23] stupid ganglia shows browser local time in its javascript-driven 'inspect' view [01:51:02] so that's 20:00 [01:51:09] PROBLEM - puppet last run on mw1016 is CRITICAL: CRITICAL: Puppet has 1 failures [01:51:14] wmf12 to pedias [01:51:48] and a wikibase cache purge change I think [01:52:24] yeah https://gerrit.wikimedia.org/r/180234 [01:55:05] the delta between wmf11 and wmf12 includes a change i made to MWTidy (98c2703f81083125a76c84a2721305dab6225907), re-reading it now to see if it is potentially related [01:57:08] * Make MWTidy::externalTidy() always read both stdout and stderr. We can read [01:57:08] stderr after stdout because tidy.c produces output in the same order. [01:57:11] that is suspect [01:57:43] yeah [01:58:16] I was gonna say, it looks like (in addition to the comments that were already there about possible deadlocks avoided...) 
there's a chance of lockup on reading the outputs there serially [01:58:28] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:58:48] it's reading all of stdout then all of stderr, but if stderr fills its buffers before stdout is done and then blocks being able to write more stdout.... [01:59:18] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 3 failures [02:01:13] it would be better to toss all 3 stdio descriptors into a non-blocking thingy (select/poll/epoll/whatever), which I assume there's some standardish way to do in our php-land. [02:01:23] yes, stream_select [02:03:01] i'll write a patch [02:03:52] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [02:04:43] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 1 failures [02:23:19] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 4 failures [02:26:46] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:30:30] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:35:35] PROBLEM - DPKG on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:58] PROBLEM - salt-minion processes on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:58] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:36:15] PROBLEM - configured eth on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:15] PROBLEM - nutcracker port on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:17] PROBLEM - RAID on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
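The lockup described above (reading all of stdout, then all of stderr, while the child blocks on a full stderr pipe) and the proposed fix of multiplexing the descriptors can be sketched as follows. This is a Python illustration using the standard `selectors` module as an analogue of PHP's stream_select(); it is not the MWTidy patch, and `run_and_collect` is a hypothetical helper name.

```python
import os
import select
import selectors
import subprocess
import sys

# Writes of at most PIPE_BUF bytes will not block once select reports
# the pipe writable; 512 is the POSIX minimum used as a fallback.
PIPE_BUF = getattr(select, "PIPE_BUF", 512)


def run_and_collect(cmd, stdin_data=b""):
    """Run cmd, feed it stdin_data, return (stdout, stderr, returncode).

    All three pipes are serviced from one loop, so the child cannot
    stall forever on a full pipe that nobody is reading.
    """
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out_chunks, err_chunks = [], []
    sel = selectors.DefaultSelector()
    sel.register(proc.stdout, selectors.EVENT_READ, out_chunks)
    sel.register(proc.stderr, selectors.EVENT_READ, err_chunks)
    pending = memoryview(stdin_data)
    if pending:
        sel.register(proc.stdin, selectors.EVENT_WRITE)
    else:
        proc.stdin.close()

    while sel.get_map():
        for key, _events in sel.select():
            stream = key.fileobj
            if stream is proc.stdin:
                try:
                    written = os.write(stream.fileno(), pending[:PIPE_BUF])
                except BrokenPipeError:
                    written = len(pending)  # child stopped reading
                pending = pending[written:]
                if not pending:
                    sel.unregister(stream)
                    stream.close()
            else:
                data = os.read(stream.fileno(), 65536)
                if data:
                    key.data.append(data)
                else:
                    # EOF on this pipe: stop watching it.
                    sel.unregister(stream)
                    stream.close()

    returncode = proc.wait()
    return b"".join(out_chunks), b"".join(err_chunks), returncode


if __name__ == "__main__":
    # The child fills stderr before it writes any stdout, the ordering
    # that can hang a reader that drains stdout first.
    child = "import sys; sys.stderr.write('e' * 200000); sys.stdout.write('o' * 200000)"
    out, err, rc = run_and_collect([sys.executable, "-c", child])
    print(len(out), len(err), rc)
```

A reader that drained stdout to EOF before touching stderr could hang on the child used in the usage example, since the child would block writing stderr once the pipe buffer filled and never reach its stdout writes.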
[02:36:48] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:36:50] PROBLEM - nutcracker process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:43] RECOVERY - DPKG on mw1007 is OK: All packages OK [02:38:49] TimStarling: could you possibly review ? the job runners are a bit unhappy. [02:39:00] RECOVERY - salt-minion processes on mw1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:39:00] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [02:39:28] RECOVERY - nutcracker port on mw1007 is OK: TCP OK - 0.000 second response time on port 11212 [02:39:28] RECOVERY - configured eth on mw1007 is OK: NRPE: Unable to read output [02:39:28] RECOVERY - RAID on mw1007 is OK: OK: no RAID installed [02:40:18] RECOVERY - nutcracker process on mw1007 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker [02:42:45] (CR) Chad: "A host who's IP is in _ip or hostname is in _host will always be banned, the behavior is not undefined."
[puppet] - https://gerrit.wikimedia.org/r/180210 (owner: Chad) [02:42:54] !log ori Synchronized php-1.25wmf12/includes/parser/MWTidy.php: I4909e5e20: use stream_select() to get external tidy stdout/stderr (uncommitted; pending review) (duration: 00m 33s) [02:43:04] Logged the message, Master [02:44:06] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: puppet fail [02:44:06] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [02:45:49] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [02:48:05] !log restarted jobrunner on jobrunners [02:48:12] Logged the message, Master [02:48:17] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:48:22] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:49:21] bblack: fixed in (i think). still waiting for review but synced a cherry-pick to appease the job runners. they look happier now.
[02:50:20] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:50:43] PROBLEM - puppet last run on mw1016 is CRITICAL: CRITICAL: Puppet has 6 failures [02:53:57] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [02:56:34] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [02:56:35] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:57:06] ori: awesome :) [03:09:11] (PS1) Springle: repool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/180387 [03:10:10] (CR) Springle: [C: 2] repool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/180387 (owner: Springle) [03:10:14] (Merged) jenkins-bot: repool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/180387 (owner: Springle) [03:10:55] (CR) Andrew Bogott: [C: 2] "I will merge and test this." [puppet] - https://gerrit.wikimedia.org/r/180243 (owner: coren) [03:11:21] !log springle Synchronized wmf-config/db-eqiad.php: repool db1073, warm up (duration: 00m 06s) [03:11:29] Logged the message, Master [03:20:22] (PS1) KartikMistry: Fix lint warnings in authdns.pp [puppet] - https://gerrit.wikimedia.org/r/180388 [03:28:01] (CR) Jforrester: [C: 1] "Fine by me."
[mediawiki-config] - https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) (owner: Spage) [04:19:07] (PS1) Springle: depool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/180389 [04:19:59] (CR) Springle: [C: 2 V: 2] depool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/180389 (owner: Springle) [04:21:02] !log springle Synchronized wmf-config/db-eqiad.php: depool db1066 (duration: 00m 05s) [04:21:07] Logged the message, Master [04:25:44] (PS1) Andrew Bogott: Optimistic rearranging of rules in hopes of getting a swap. [puppet] - https://gerrit.wikimedia.org/r/180390 [04:29:09] (CR) Andrew Bogott: [C: 2] Optimistic rearranging of rules in hopes of getting a swap. [puppet] - https://gerrit.wikimedia.org/r/180390 (owner: Andrew Bogott) [04:33:52] ori, bblack: should mw1015 be handling jobs? it's still connecting to the DBs, but not touched by a sync-file [04:39:51] !log mw1015 sync-common [04:39:56] Logged the message, Master [05:00:59] (PS1) Springle: upgrade db1066 to trusty [puppet] - https://gerrit.wikimedia.org/r/180392 [05:01:31] (PS1) Andrew Bogott: Give partman an explicit size for /var rather than -1 [puppet] - https://gerrit.wikimedia.org/r/180393 [05:01:56] (CR) Springle: [C: 2] upgrade db1066 to trusty [puppet] - https://gerrit.wikimedia.org/r/180392 (owner: Springle) [05:03:42] (CR) Andrew Bogott: [C: 2] Give partman an explicit size for /var rather than -1 [puppet] - https://gerrit.wikimedia.org/r/180393 (owner: Andrew Bogott) [05:41:23] (PS1) Andrew Bogott: Create a big filesystem for virt storage on the hp virt servers. [puppet] - https://gerrit.wikimedia.org/r/180394 [05:53:18] (PS2) Andrew Bogott: Create a big filesystem for virt storage on the hp virt servers.
[puppet] - https://gerrit.wikimedia.org/r/180394 [05:53:20] (PS1) Andrew Bogott: More partman tinkering woo [puppet] - https://gerrit.wikimedia.org/r/180396 [05:54:32] (CR) Andrew Bogott: [C: 2] More partman tinkering woo [puppet] - https://gerrit.wikimedia.org/r/180396 (owner: Andrew Bogott) [05:55:05] (CR) Andrew Bogott: [C: 2] Create a big filesystem for virt storage on the hp virt servers. [puppet] - https://gerrit.wikimedia.org/r/180394 (owner: Andrew Bogott) [06:02:17] (PS1) Andrew Bogott: We're going to need these packages [puppet] - https://gerrit.wikimedia.org/r/180397 [06:03:11] (CR) Andrew Bogott: [C: 2] We're going to need these packages [puppet] - https://gerrit.wikimedia.org/r/180397 (owner: Andrew Bogott) [06:10:46] (PS1) Andrew Bogott: Fix alignment in parted command [puppet] - https://gerrit.wikimedia.org/r/180398 [06:11:35] (CR) Andrew Bogott: [C: 2] Fix alignment in parted command [puppet] - https://gerrit.wikimedia.org/r/180398 (owner: Andrew Bogott) [06:13:02] (PS1) Springle: depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/180399 [06:13:48] (CR) Springle: [C: 2 V: 2] depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/180399 (owner: Springle) [06:14:30] !log springle Synchronized wmf-config/db-eqiad.php: depool db1064 (duration: 00m 06s) [06:14:37] Logged the message, Master [06:15:37] (PS1) Andrew Bogott: All partmans come to this eventually -- one big partition! [puppet] - https://gerrit.wikimedia.org/r/180403 [06:16:22] (CR) Andrew Bogott: [C: 2] All partmans come to this eventually -- one big partition! [puppet] - https://gerrit.wikimedia.org/r/180403 (owner: Andrew Bogott) [06:20:30] (PS1) Andrew Bogott: Include nova-compute on new virt servers. [puppet] - https://gerrit.wikimedia.org/r/180404 [06:21:30] (CR) Andrew Bogott: [C: 2] Include nova-compute on new virt servers.
[puppet] - https://gerrit.wikimedia.org/r/180404 (owner: Andrew Bogott) [06:29:53] <_joe_> morning [06:34:12] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:46] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:05] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:24] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 2 failures [06:35:46] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:03] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:24] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Puppet has 2 failures [06:37:53] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:41:07] $greeting _joe_ [06:42:55] (PS1) KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-nno] - https://gerrit.wikimedia.org/r/180405 [06:42:59] (PS1) Andrew Bogott: Oops, these are in base. [puppet] - https://gerrit.wikimedia.org/r/180406 [06:43:05] (PS1) Springle: upgrade db1064 to trusty and mariadb 10 [puppet] - https://gerrit.wikimedia.org/r/180407 [06:43:39] (CR) Andrew Bogott: [C: 2] Oops, these are in base. [puppet] - https://gerrit.wikimedia.org/r/180406 (owner: Andrew Bogott) [06:44:03] (CR) Springle: [C: 2] upgrade db1064 to trusty and mariadb 10 [puppet] - https://gerrit.wikimedia.org/r/180407 (owner: Springle) [06:44:56] oh.. andrewbogott_afk, shall i merge both?
[06:45:30] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:45:47] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:45:53] springle: yes, thanks [06:46:00] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:46:00] done [06:46:49] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:48:28] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:48:28] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:25] (PS1) Andrew Bogott: Hm, not this one. [puppet] - https://gerrit.wikimedia.org/r/180408 [06:49:48] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:50:25] (CR) Andrew Bogott: [C: 2] Hm, not this one. [puppet] - https://gerrit.wikimedia.org/r/180408 (owner: Andrew Bogott) [06:50:44] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:52:23] !log upgrade db1064 trusty [06:52:27] Logged the message, Master [07:07:43] (PS1) Andrew Bogott: Move new virt servers to precise. [puppet] - https://gerrit.wikimedia.org/r/180410 [07:08:59] (CR) Andrew Bogott: [C: 2] Move new virt servers to precise. [puppet] - https://gerrit.wikimedia.org/r/180410 (owner: Andrew Bogott) [07:10:08] PROBLEM - configured eth on virt1010 is CRITICAL: eth1 reporting no carrier.
[07:10:31] PROBLEM - puppet last run on virt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [07:14:13] PROBLEM - Host virt1010 is DOWN: PING CRITICAL - Packet loss = 100% [07:17:25] (CR) Giuseppe Lavagetto: "heh, me too, but --headers was already taken :)" [puppet] - https://gerrit.wikimedia.org/r/180155 (owner: Giuseppe Lavagetto) [07:18:03] (PS2) Giuseppe Lavagetto: mediawiki: furl support for passing headers [puppet] - https://gerrit.wikimedia.org/r/180155 [07:18:36] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: furl support for passing headers [puppet] - https://gerrit.wikimedia.org/r/180155 (owner: Giuseppe Lavagetto) [07:19:02] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [07:20:05] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [07:25:47] RECOVERY - Host virt1010 is UP: PING OK - Packet loss = 0%, RTA = 2.20 ms [07:32:33] PROBLEM - Host virt1010 is DOWN: PING CRITICAL - Packet loss = 100% [07:35:19] RECOVERY - Host virt1010 is UP: PING OK - Packet loss = 0%, RTA = 2.21 ms [07:37:30] (PS1) Giuseppe Lavagetto: hhvm_cleanup_cache: fixups [puppet] - https://gerrit.wikimedia.org/r/180411 [07:37:45] (CR) Giuseppe Lavagetto: [C: 2] hhvm_cleanup_cache: fixups [puppet] - https://gerrit.wikimedia.org/r/180411 (owner: Giuseppe Lavagetto) [07:42:28] PROBLEM - salt-minion processes on virt1010 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [07:46:58] (PS1) Andrew Bogott: What the hell, partman? I mean, what the hell? [puppet] - https://gerrit.wikimedia.org/r/180412 [07:48:55] greetings [07:49:28] <_joe_> hi godog [07:49:44] ciao _joe_ [07:52:36] (CR) Giuseppe Lavagetto: [C: -1] "IMO a role to include one class is utterly redundant - I'd add the class to the deployment role."
[puppet] - https://gerrit.wikimedia.org/r/177080 (owner: Dzahn) [07:52:37] !log increase minimum raid reconstruction speed on virt1005 and virt1009 [07:52:42] Logged the message, Master [07:54:04] godog, _joe_ getting 503 on phab [07:54:19] <_joe_> matanya: I'm not [07:54:40] Request: POST http://phabricator.wikimedia.org/auth/login/mediawiki:mediawiki/, from 10.64.0.172 via cp1044 cp1044 ([10.64.0.172]:80), Varnish XID 1598083421 [07:54:40] Forwarded for: MY IP, 10.64.0.172 [07:54:40] Error: 503, Service Unavailable at Wed, 17 Dec 2014 07:52:49 GMT [07:54:53] _joe_: while siging in with 0auth [07:55:04] *signing [07:55:18] it never gets back from mediawiki [07:55:20] <_joe_> ok so "I have an oauth issue with phabricator" [07:55:37] i think [07:56:03] which i can't report on phabricator :D [07:57:26] now i got in, but it comes and goes [07:57:45] <_joe_> matanya: I tried OAuth login 5 times in a row [07:57:51] <_joe_> and it worked all the times :) [07:57:58] <_joe_> so either I'm very lucky [07:58:05] <_joe_> or you're very unlucky [07:58:29] you might want to see in the logs, i might hit a diff path than you, idk [07:58:57] <_joe_> matanya: is the 503 from phabricator?
[07:59:05] yes, see above [07:59:35] it might be a timeout in the way, e.g in varnish, or the cp server [08:01:36] <_joe_> matanya: will do, if it is blocking you I'll do it now, if not, I'm kinda in the middle of something right now :) [08:02:00] let it go _joe_ HAT is more important [08:02:06] thanks for your efforts [08:12:36] (PS1) Giuseppe Lavagetto: hhvm: do not have logrotate fail on missing stacktraces [puppet] - https://gerrit.wikimedia.org/r/180414 [08:12:47] <_joe_> godog: ^^ [08:13:19] (CR) Giuseppe Lavagetto: [C: 2] hhvm: do not have logrotate fail on missing stacktraces [puppet] - https://gerrit.wikimedia.org/r/180414 (owner: Giuseppe Lavagetto) [08:13:40] (CR) Filippo Giunchedi: [C: 1] hhvm: do not have logrotate fail on missing stacktraces [puppet] - https://gerrit.wikimedia.org/r/180414 (owner: Giuseppe Lavagetto) [08:13:45] _joe_: LGTM! [08:16:12] (CR) Filippo Giunchedi: "understood, that makes sense!" [puppet] - https://gerrit.wikimedia.org/r/180210 (owner: Chad) [08:17:44] (CR) Andrew Bogott: [C: 2] What the hell, partman? I mean, what the hell? [puppet] - https://gerrit.wikimedia.org/r/180412 (owner: Andrew Bogott) [08:21:52] (PS1) Andrew Bogott: Customize the nova instance partition to support different server types [puppet] - https://gerrit.wikimedia.org/r/180415 [08:23:00] (CR) Andrew Bogott: [C: 2] Customize the nova instance partition to support different server types [puppet] - https://gerrit.wikimedia.org/r/180415 (owner: Andrew Bogott) [08:27:15] (CR) KartikMistry: "dh compat is set to 8 as no multiarch support is needed as of now.
I plan to fix it along with few other things at upstream with next rele" [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/180405 (owner: 10KartikMistry) [08:31:18] PROBLEM - Host virt1010 is DOWN: PING CRITICAL - Packet loss = 100% [08:33:58] RECOVERY - Host virt1010 is UP: PING OK - Packet loss = 0%, RTA = 1.94 ms [08:39:01] PROBLEM - Host virt1010 is DOWN: PING CRITICAL - Packet loss = 100% [08:40:57] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:41:56] linux-image-server : Depends: linux-image-3.2.0-74-generic but it is not going to be installed [08:42:02] what the heck? This worked an hour ago [08:42:55] (03PS1) 10Giuseppe Lavagetto: jobrunner: provision an additional jobrunner on trusty [puppet] - 10https://gerrit.wikimedia.org/r/180416 [08:43:00] <_joe_> !log depooling mw1152, reimaging as an HAT jobrunner [08:43:08] Logged the message, Master [08:43:29] paravoid and/or _joe_, my attempt to image a new precise server is failing, has anything changed today? [08:44:13] https://dpaste.de/OvTK [08:44:15] <_joe_> andrewbogott: not now sorry [08:44:22] np [08:44:28] <_joe_> I do have some quite pressing work to do [08:45:16] RECOVERY - Host virt1010 is UP: PING OK - Packet loss = 0%, RTA = 7.11 ms [08:45:27] PROBLEM - DPKG on virt1010 is CRITICAL: Connection refused by host [08:45:46] PROBLEM - Disk space on virt1010 is CRITICAL: Connection refused by host [08:45:57] PROBLEM - NTP on virt1010 is CRITICAL: NTP CRITICAL: No response from NTP server [08:47:22] (03PS1) 10Andrew Bogott: another random partman change [puppet] - 10https://gerrit.wikimedia.org/r/180417 [08:47:33] andrewbogott: looks like it ran out of disk space? 
[08:47:34] Dec 17 08:39:16 in-target: failed in write on buffer copy for backend dpkg-deb during `./lib/modules/3.2.0-74-generic/kernel/net/bluetooth/bnep/bnep.ko': No space left on device [08:48:24] (03PS2) 10Giuseppe Lavagetto: jobrunner: provision an additional jobrunner on trusty [puppet] - 10https://gerrit.wikimedia.org/r/180416 [08:48:30] godog: yeah, looks like it. So, I guess I can blame partman yet again [08:48:51] (03CR) 10Andrew Bogott: [C: 032] another random partman change [puppet] - 10https://gerrit.wikimedia.org/r/180417 (owner: 10Andrew Bogott) [08:49:18] andrewbogott: yay partman sadface.gif [08:49:18] PROBLEM - DPKG on virt1011 is CRITICAL: Connection refused by host [08:49:28] PROBLEM - Disk space on virt1011 is CRITICAL: Connection refused by host [08:49:58] PROBLEM - RAID on virt1011 is CRITICAL: Connection refused by host [08:50:13] <_joe_> andrewbogott: "another random partman change" sounds like you don't know what you're doing and you're in trial-and-error mode [08:50:22] _joe_: yep! [08:50:23] goodmorning [08:50:41] hey paravoid [08:50:45] <_joe_> may I suggest you either change the commit messages or you take a break, read the docs, and try again after you're pretty sure it works? [08:51:15] I was wondering too what's the best way to quickly test partman changes/recipes [08:51:18] (03PS3) 10Giuseppe Lavagetto: jobrunner: provision an additional jobrunner on trusty [puppet] - 10https://gerrit.wikimedia.org/r/180416 [08:51:41] _joe_: I have a pretty good understanding of how partman is documented as acting, and how it has acted on other servers. On this one, not so much. [08:51:47] But, yeah, I'll fix the commit message. [08:52:54] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: provision an additional jobrunner on trusty [puppet] - 10https://gerrit.wikimedia.org/r/180416 (owner: 10Giuseppe Lavagetto) [08:53:25] <_joe_> andrewbogott: I had no doubts, I also understand the frustration [08:53:38] what is wrong with precise, andrewbogott? 
[08:53:47] PROBLEM - Host virt1011 is DOWN: PING CRITICAL - Packet loss = 100% [08:53:47] PROBLEM - Host virt1010 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:06] paravoid: Current theory is that partman was making me a ridiculously tiny partition and the precise installer failed because of lack of disk space. [08:54:44] paravoid: I've been trying to use either '-1' or 'super big number' to entice partman to create a partition that uses all available space. Now I've given up on that and am just trying to manually set a partition that's a bit smaller than available space. [08:54:49] We'll see if that works any better [08:55:06] I've used '-1' to good effect on other servers, but not today apparently :( [08:59:04] hm, nope, same problem again [08:59:57] RECOVERY - Host virt1011 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [09:00:38] RECOVERY - Host virt1010 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [09:00:59] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: puppet fail [09:01:40] paravoid: can you hazard a guess why this gets me a root partition that's only 461M? https://gerrit.wikimedia.org/r/#/c/180417/1/modules/install-server/files/autoinstall/partman/virt-hp.cfg [09:02:04] (Notably, the previous version of the file did the same) [09:02:41] godog, same question [09:03:53] PROBLEM - puppet last run on virt1011 is CRITICAL: Connection refused by host [09:04:12] hm, maybe 200000 is too big for the minimum. 
I'll try lowering that (although previously it was just picking the minimum regardless of the requested max…) [09:04:18] PROBLEM - RAID on virt1010 is CRITICAL: Connection refused by host [09:04:28] PROBLEM - salt-minion processes on virt1011 is CRITICAL: Connection refused by host [09:05:00] PROBLEM - configured eth on virt1011 is CRITICAL: Connection refused by host [09:05:48] ACKNOWLEDGEMENT - DPKG on virt1010 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:05:48] ACKNOWLEDGEMENT - Disk space on virt1010 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:05:48] ACKNOWLEDGEMENT - NTP on virt1010 is CRITICAL: NTP CRITICAL: No response from NTP server andrew bogott Im imaging this box over and over [09:05:49] ACKNOWLEDGEMENT - RAID on virt1010 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:05:49] ACKNOWLEDGEMENT - configured eth on virt1010 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:05:49] ACKNOWLEDGEMENT - dhclient process on virt1010 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:05:49] ACKNOWLEDGEMENT - puppet last run on virt1010 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:05:49] ACKNOWLEDGEMENT - salt-minion processes on virt1010 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:06:01] PROBLEM - dhclient process on virt1011 is CRITICAL: Connection refused by host [09:06:03] PROBLEM - NTP on virt1011 is CRITICAL: NTP CRITICAL: No response from NTP server [09:07:26] (03PS1) 10Andrew Bogott: Request a range of 100gb to 200gb for virt-hp root partition [puppet] - 10https://gerrit.wikimedia.org/r/180422 [09:07:36] ACKNOWLEDGEMENT - DPKG on virt1011 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:07:36] 
ACKNOWLEDGEMENT - Disk space on virt1011 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:07:36] ACKNOWLEDGEMENT - NTP on virt1011 is CRITICAL: NTP CRITICAL: No response from NTP server andrew bogott Im imaging this box over and over [09:07:36] ACKNOWLEDGEMENT - RAID on virt1011 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:07:36] ACKNOWLEDGEMENT - configured eth on virt1011 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:07:37] ACKNOWLEDGEMENT - dhclient process on virt1011 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:07:37] ACKNOWLEDGEMENT - puppet last run on virt1011 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:07:38] ACKNOWLEDGEMENT - salt-minion processes on virt1011 is CRITICAL: Connection refused by host andrew bogott Im imaging this box over and over [09:08:27] andrewbogott: mhh the other thing might be the relative priority? i.e. swap has more priority but I'd expect with only those two not to make a difference [09:08:45] godog: true, I'll try setting identical priority [09:09:07] or swap even lower, but still doesn't explain 500mb for / [09:09:21] godog: except, I want swap to have higher priority, right? 
Because otherwise / will take up the whole drive and not leave room for swap [09:09:23] (03CR) 10Alexandros Kosiaris: [C: 032] Add cxserver role for production [puppet] - 10https://gerrit.wikimedia.org/r/180125 (owner: 10KartikMistry) [09:10:08] andrewbogott: yeah that's what I meant by not making a difference, they have fixed sizes anyways [09:10:31] <_joe_> mh, 503 peak [09:10:41] <_joe_> or better, plateau [09:10:48] (03CR) 10Andrew Bogott: [C: 032] Request a range of 100gb to 200gb for virt-hp root partition [puppet] - 10https://gerrit.wikimedia.org/r/180422 (owner: 10Andrew Bogott) [09:11:36] (03Draft1) 10Filippo Giunchedi: swift-add-drive: discard preserved cache [software/swift-utils] - 10https://gerrit.wikimedia.org/r/180419 [09:11:40] <_joe_> the usual book issue, meh [09:11:43] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift-add-drive: discard preserved cache [software/swift-utils] - 10https://gerrit.wikimedia.org/r/180419 (owner: 10Filippo Giunchedi) [09:11:46] (03Draft1) 10Filippo Giunchedi: swift-add-drive: clarify when waiting for a disk [software/swift-utils] - 10https://gerrit.wikimedia.org/r/180420 [09:11:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift-add-drive: clarify when waiting for a disk [software/swift-utils] - 10https://gerrit.wikimedia.org/r/180420 (owner: 10Filippo Giunchedi) [09:16:14] PROBLEM - Host virt1011 is DOWN: CRITICAL - Plugin timed out after 15 seconds [09:16:26] PROBLEM - Host virt1010 is DOWN: CRITICAL - Plugin timed out after 15 seconds [09:17:16] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [09:17:25] (03PS1) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-nob] - 10https://gerrit.wikimedia.org/r/180424 [09:18:26] andrewbogott: I take it virt1011 virt1010 is you? 
[09:18:38] andrewbogott: if I were you, I'd puppetd --disable on carbon and tweak the file manually [09:18:50] until I found the values that work [09:18:54] godog: yes. I just acknowledged, I don't know why icinga is still complaining. [09:19:16] paravoid: Yep, will do if this latest test still fails. Sorry for all the noise. [09:19:31] nah I'm just saying, easier for you [09:19:37] that's what I did for the Debian stuff too [09:19:45] and I used a VM as well so that I can reboot it quicker :P [09:20:20] RECOVERY - Host virt1010 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [09:20:21] andrewbogott: I think it might have to do with the wonky state a machine is when reinstalled with the puppet cache cleared and so on [09:20:24] ACKNOWLEDGEMENT - DPKG on virt1011 is CRITICAL: Timeout while attempting connection andrew bogott reimaging [09:20:24] ACKNOWLEDGEMENT - Disk space on virt1011 is CRITICAL: Timeout while attempting connection andrew bogott reimaging [09:20:24] ACKNOWLEDGEMENT - NTP on virt1011 is CRITICAL: NTP CRITICAL: No response from NTP server andrew bogott reimaging [09:20:24] ACKNOWLEDGEMENT - RAID on virt1011 is CRITICAL: Timeout while attempting connection andrew bogott reimaging [09:20:24] ACKNOWLEDGEMENT - SSH on virt1011 is CRITICAL: Connection timed out andrew bogott reimaging [09:20:24] ACKNOWLEDGEMENT - configured eth on virt1011 is CRITICAL: Timeout while attempting connection andrew bogott reimaging [09:20:24] ACKNOWLEDGEMENT - dhclient process on virt1011 is CRITICAL: Timeout while attempting connection andrew bogott reimaging [09:20:25] ACKNOWLEDGEMENT - puppet last run on virt1011 is CRITICAL: Timeout while attempting connection andrew bogott reimaging [09:20:25] ACKNOWLEDGEMENT - salt-minion processes on virt1011 is CRITICAL: Timeout while attempting connection andrew bogott reimaging [09:22:58] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor comments about debhelper version, the rest looks fine" (032 comments) 
[debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/180405 (owner: 10KartikMistry) [09:23:52] (03CR) 10Alexandros Kosiaris: [C: 032] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/180388 (owner: 10KartikMistry) [09:25:52] _joe_: buongiorno ! Any clue when the new HHVM package is going to land? :-] [09:26:20] RECOVERY - Host virt1011 is UP: PING OK - Packet loss = 0%, RTA = 4.89 ms [09:26:26] PROBLEM - Host virt1010 is DOWN: PING CRITICAL - Packet loss = 100% [09:26:29] <_joe_> hashar: I am working on that now [09:26:35] <_joe_> but it will take some time still [09:26:38] _joe_: awesome [09:27:31] _joe_: make sure to cherry pick https://github.com/facebook/hhvm/commit/324701c9fd31beb4f070f1b7ef78b115fbdfec34 which fix some Wddx formatter issue we had ( https://phabricator.wikimedia.org/T75531 ) [09:27:44] I will be more than happy to push the hhvm package on the CI slaves whenever it is ready [09:28:29] RECOVERY - Host virt1010 is UP: PING OK - Packet loss = 0%, RTA = 2.76 ms [09:28:37] <_joe_> hashar: yeah [09:29:07] <_joe_> I have some work to do, I'd like to build on 3.3.1 and rebase on what's in debian [09:32:15] _joe_: you should probably backport https://phabricator.wikimedia.org/T74556 too btw [09:32:18] to Debian too :) [09:32:47] <_joe_> paravoid: nod, will do [09:32:55] I just noticed it the other day [09:34:24] <_joe_> well, I didn't [09:34:31] <_joe_> I would've expected to be tbh :) [09:34:40] RECOVERY - DPKG on virt1010 is OK: All packages OK [09:37:15] (03PS1) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-nno-nob] - 10https://gerrit.wikimedia.org/r/180426 [09:40:15] RECOVERY - puppet last run on virt1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:41:02] RECOVERY - Disk space on virt1011 is OK: DISK OK [09:41:34] PROBLEM - Host virt1010 is DOWN: PING CRITICAL - Packet loss = 100% [09:43:27] RECOVERY - DPKG on virt1011 is OK: All packages OK [09:44:15] 
RECOVERY - Host virt1010 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [09:46:20] (03PS4) 10Dzahn: rm module apachesync [puppet] - 10https://gerrit.wikimedia.org/r/177080 [09:47:51] (03PS1) 10Andrew Bogott: Remove the /srv mount point for /dev/sdb [puppet] - 10https://gerrit.wikimedia.org/r/180428 [09:49:03] (03CR) 10Andrew Bogott: [C: 032] Remove the /srv mount point for /dev/sdb [puppet] - 10https://gerrit.wikimedia.org/r/180428 (owner: 10Andrew Bogott) [09:50:56] PROBLEM - configured eth on virt1010 is CRITICAL: eth1 reporting no carrier. [09:51:47] mutante: how come your commit messages are linewrapped to 50 characters? [09:54:59] paravoid: i already amended the ones where you commented that, now this one as well [09:55:06] (03PS5) 10Dzahn: rm module apachesync [puppet] - 10https://gerrit.wikimedia.org/r/177080 [09:55:36] the diff is one line because i'm listing the scripts i'm touching [09:56:10] PROBLEM - Host virt1011 is DOWN: PING CRITICAL - Packet loss = 100% [09:57:22] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/559/change/177080/html/tin.eqiad.wmnet.html" [puppet] - 10https://gerrit.wikimedia.org/r/177080 (owner: 10Dzahn) [09:57:24] <_joe_> !log jobrunner started on mw1152 [09:57:29] Logged the message, Master [09:57:40] RECOVERY - Host virt1011 is UP: PING OK - Packet loss = 0%, RTA = 8.22 ms [09:58:21] andrewbogott: FWIW I think trying to downtime both hosts might work, at least until the puppet cache is cleared and the hosts get recreated in icinga [10:00:19] (03PS3) 10Dzahn: move mediawiki maintenance scripts to module [puppet] - 10https://gerrit.wikimedia.org/r/178873 [10:01:31] <_joe_> and with all the fixes and improvements we've done, the hhvm jobrunner is smoking hot and taking much more jobs than any non-HHVM one [10:01:34] <_joe_> \o/ [10:01:48] sweet [10:01:59] (03Abandoned) 10Dzahn: (WIP) certificates: move to module [puppet] - 10https://gerrit.wikimedia.org/r/171496 (owner: 10Dzahn) [10:02:12] <_joe_> it was noticeably slower 
before [10:07:22] (03PS1) 10Dereckson: Throttle rule for University of Haifa event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180429 [10:14:34] * _joe_ brb [10:16:35] (03PS1) 10Andrew Bogott: Isolate virt1010-1012 because they're not ready for prime time. [puppet] - 10https://gerrit.wikimedia.org/r/180432 [10:17:17] !change 177080 | canihaveareviewthen [10:20:11] (03CR) 10Andrew Bogott: [C: 032] Isolate virt1010-1012 because they're not ready for prime time. [puppet] - 10https://gerrit.wikimedia.org/r/180432 (owner: 10Andrew Bogott) [10:23:47] (03CR) 10Hashar: [C: 032] "Thanks for taking care of those additions!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180429 (owner: 10Dereckson) [10:24:12] (03Merged) 10jenkins-bot: Throttle rule for University of Haifa event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180429 (owner: 10Dereckson) [10:25:33] !log hashar Synchronized wmf-config/throttle.php: {{gerrit|180429}} - Throttle rule for University of Haifa event (duration: 00m 06s) [10:25:39] Logged the message, Master [10:26:08] !log hashar Synchronized wmf-config/throttle.php: {{gerrit|180429}} - Throttle rule for University of Haifa event (duration: 00m 06s) [10:26:18] pfff [10:26:57] !log mw1152 has a wrong host key in /etc/ssh/ssh_known_hosts:2480 causing scap to spurts a remote identification error. [10:27:02] Logged the message, Master [10:27:08] <_joe_> hashar: he, just reimaged [10:27:20] (03CR) 10Hashar: "Deployed!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/180429 (owner: 10Dereckson) [10:27:55] <_joe_> hashar: ran sync-common - it's a jobrunner anyways [10:28:52] Dereckson: you are too fast to close bugs :D [10:28:57] _joe_: great, thanks [10:29:07] !log mw1152 is a jobrunner being rebuild [10:29:13] Logged the message, Master [10:30:05] !log virt1010 and 1011 are up but with puppet and nova-compute disabled pending firewall issues [10:30:11] Logged the message, Master [10:37:15] (03PS1) 10Springle: repool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180436 [10:39:10] (03CR) 10Springle: [C: 032] repool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180436 (owner: 10Springle) [10:40:12] !log springle Synchronized wmf-config/db-eqiad.php: repool db1064, warm up (duration: 00m 05s) [10:40:18] Logged the message, Master [10:50:49] http://korma.wmflabs.org/browser/repository.html?repository=gerrit.wikimedia.org_operations_puppet&ds=scr [10:53:24] +84% merged and +70% submitted (YoY, looking at october) [10:54:03] !log Jenkins deleting legacy 'mwext*testextension' jobs (now suffixed with '-zend') and restarting Jenkins. [10:54:09] Logged the message, Master [10:54:54] (03PS1) 10Faidon Liambotis: install-server::tftp-server cleanups [puppet] - 10https://gerrit.wikimedia.org/r/180441 [10:54:56] (03PS1) 10Faidon Liambotis: install-server: replace lighttpd with nginx [puppet] - 10https://gerrit.wikimedia.org/r/180442 [10:55:09] anyone up for reviewing those two? 
:) [10:55:28] akosiaris is the original author but I think he's not around atm [10:57:33] magic trick: git commit --amend --author "Alexandros Kosiaris " [11:01:32] paravoid: taking a look [11:04:36] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, simple enough to just remove the files manually" [puppet] - 10https://gerrit.wikimedia.org/r/180441 (owner: 10Faidon Liambotis) [11:09:18] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/180442 (owner: 10Faidon Liambotis) [11:10:54] (03PS1) 10Dzahn: phab metrics: switch cron from daily to monthly [puppet] - 10https://gerrit.wikimedia.org/r/180444 [11:13:32] <_joe_> I can take a look at the second one paravoid [11:13:43] <_joe_> but right now I'm pretty worried about the hhvm segfaults [11:19:42] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM. On a side note, atftpd is having problems with the tftp BLKSIZE protocol extension. I did some experiments while upgrading the R420'" [puppet] - 10https://gerrit.wikimedia.org/r/180441 (owner: 10Faidon Liambotis) [11:29:32] (03CR) 10Alexandros Kosiaris: "I cleaned up carbon's /srv/tftpboot/restricted and /tftpboot" [puppet] - 10https://gerrit.wikimedia.org/r/180441 (owner: 10Faidon Liambotis) [11:31:34] (03CR) 10Dzahn: [C: 032] phab metrics: switch cron from daily to monthly [puppet] - 10https://gerrit.wikimedia.org/r/180444 (owner: 10Dzahn) [11:46:22] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "I talked with Kartik on IRC. Debhelper 9 puts on some requirements on multiarch plus other minor stuff and we agreed to skip it for now. 
I" [debs/contenttranslation/apertium-nno] - 10https://gerrit.wikimedia.org/r/180405 (owner: 10KartikMistry) [11:47:16] multiarch is good [11:51:38] RECOVERY - Host d-i-test is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [11:52:11] !log Don't sync extensions, undeployed unintentional reverts https://wikitech.wikimedia.org/?diff=138472&oldid=138399 [11:52:16] Logged the message, Master [11:52:18] paravoid: yeah, kart_ will do it on a later point in time though [11:54:36] Coren or YuviPanda|zzz: we should probably replace lighttpd with nginx in toollabs [11:55:22] so, this looks ok to you now? https://gerrit.wikimedia.org/r/#/c/177080/ i just moved apache-fast-test to be installed from role/deployment.pp , per comments from _joe_ [11:55:44] paravoid: might not be easy - lots of tools have custom lighttpd config now [11:55:53] Also why? [11:56:16] ugh [11:56:25] there is another place we use lighttpd, mailing list server [11:56:36] i'm aware [11:56:40] <_joe_> mutante: sorry no time to look now [11:56:41] and download as well [11:56:54] and dns recursors [11:56:56] (via webserver::static) [11:58:05] YuviPanda: it'd be nice to settle in 1-2 servers rather than have to maintain configs/manifests etc. for both [11:58:08] PROBLEM - check if dhclient is running on d-i-test is CRITICAL: Connection refused by host [11:58:08] PROBLEM - check configured eth on d-i-test is CRITICAL: Connection refused by host [11:58:08] PROBLEM - check if salt-minion is running on d-i-test is CRITICAL: Connection refused by host [11:58:15] and clearly nginx over lighttpd [11:58:25] Hmm static might be easy to.migrate if we want [11:58:35] Hmm true. 
[11:58:46] won* [11:58:53] yea, less webservers would be good [11:58:59] I was going to make uwsgi available on tools instead [11:59:19] For the python tools which now are fastcgi [11:59:27] download is easy; dns::recursor is also easy but needs a revamp anyway (it's used for a custom stats thing that we should migrate to graphite), mailman needs a revamp anyway [11:59:31] install-server I just did [11:59:55] <_joe_> !log removing some core dumps from appservers, so that we don't run out of space by tomorrow [12:00:00] Logged the message, Master [12:00:19] PROBLEM - DPKG on d-i-test is CRITICAL: Connection refused by host [12:00:20] PROBLEM - Disk space on d-i-test is CRITICAL: Connection refused by host [12:00:49] PROBLEM - RAID on d-i-test is CRITICAL: Connection refused by host [12:01:04] paravoid: I guess if we really want to migrate off lighttpd for tools we will have to first provide nginx in a similar fashion and then wait for people to migrate [12:01:47] Can auto migrate tools without custom config [12:02:05] Hmm will need php-fpm as well ugh [12:02:34] It is going to take a fair amount of time and effort [12:03:08] php-fpm? [12:03:34] why do you need php-fpm? [12:03:44] I mean why do you need it with nginx but not with lighttpd? [12:06:05] paravoid: I'm not sure how lighttpd supports php in our config - will take a look. I know that it 'just works' similar to apache modphp [12:06:20] there is no lighttpd modphp [12:06:22] * YuviPanda is IRCing from phone walking back home [12:06:34] There isn't I know. I'll take a look [12:06:46] manifests/node/web/lighttpd.pp: package { 'php5-cgi': [12:06:59] ewwww [12:07:03] Egad [12:07:05] Sigh [12:07:30] Ok maybe replacing that isn't a bad idea at all [12:07:40] :P [12:08:27] files/tool-lighttpd [12:08:42] 16 lines of a copyright banner for essentially a shell oneliner [12:10:24] paravoid: if only operations/puppet itself had a clear license... 
;) [12:13:08] everything we imported from Puppet Labs appears to be Apache License [12:13:18] find . -name LICENSE [12:24:46] Reedy: do we have a page with instructions "how to close wiki" ? [12:25:16] like the proper steps to take to close wiki after a decision was made to close it [12:28:32] RECOVERY - DPKG on d-i-test is OK: All packages OK [12:28:35] RECOVERY - Disk space on d-i-test is OK: DISK OK [12:28:58] RECOVERY - RAID on d-i-test is OK: OK: no RAID installed [12:29:30] RECOVERY - check configured eth on d-i-test is OK: NRPE: Unable to read output [12:29:30] RECOVERY - check if salt-minion is running on d-i-test is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:29:31] RECOVERY - check if dhclient is running on d-i-test is OK: PROCS OK: 0 processes with command name dhclient [12:31:52] paravoid: hmm, are we moving to nginx+hhvm in prod anytimes soon? I guess a nice and smooth move would be to offer HHVM+nginx first, move people to it, then slowly move everything else with appropriate ways (uwsgi for python, fastcgi for everything else, etc). [12:31:57] plus it would make several tools faster. 
[12:32:01] and eugh, CGI [12:34:31] (03PS2) 10Alexandros Kosiaris: install-server: replace lighttpd with nginx [puppet] - 10https://gerrit.wikimedia.org/r/180442 (owner: 10Faidon Liambotis) [12:35:46] (03PS1) 10Dzahn: add angwikibooks and iewikibooks to closed.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180451 [12:38:12] PROBLEM - NTP on d-i-test is CRITICAL: NTP CRITICAL: Offset unknown [12:38:40] <_joe_> the only thing I hate more than crashes are crashes that stop as soon as you start looking at them [12:46:24] (03CR) 10Dzahn: "made T78782 to disable the Jenkins check for apache-config running on the mediawiki-config repo where it is doomed to always fail" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180451 (owner: 10Dzahn) [12:47:14] RECOVERY - NTP on d-i-test is OK: NTP OK: Offset 0.0002974271774 secs [12:47:29] (03CR) 10Alexandros Kosiaris: [C: 031] "Couldn't resist saving my RSpec test from oblivion. Anyway, a couple of other minor changes for RSpec to play nicely along and owner,group" [puppet] - 10https://gerrit.wikimedia.org/r/180442 (owner: 10Faidon Liambotis) [12:49:48] (03CR) 10Faidon Liambotis: [C: 032] install-server: replace lighttpd with nginx [puppet] - 10https://gerrit.wikimedia.org/r/180442 (owner: 10Faidon Liambotis) [12:56:52] PROBLEM - HTTP on carbon is CRITICAL: Connection refused [12:58:58] yea, i have been wondering about this myself in the distant past. why no content on http://apt.wikimedia.org/ but only http://apt.wikimedia.org/wikimedia [12:59:07] the empty index.html that is [13:00:03] RECOVERY - HTTP on carbon is OK: HTTP OK: HTTP/1.1 200 OK - 224 bytes in 0.001 second response time [13:00:04] <_joe_> is someone touching carbon? 
[13:00:10] yes [13:00:11] me [13:00:12] <_joe_> I'm kinda in the middle of something [13:00:17] <_joe_> oh ok :) [13:00:37] give me a second [13:01:00] <_joe_> if you're handling it, it's fine [13:04:41] (03PS1) 10Faidon Liambotis: install-server: fix nginx's listen directives [puppet] - 10https://gerrit.wikimedia.org/r/180455 [13:04:43] (03PS1) 10Faidon Liambotis: install-server: add hasstatus => false to atftpd [puppet] - 10https://gerrit.wikimedia.org/r/180456 [13:06:17] (03CR) 10Faidon Liambotis: [C: 032] install-server: fix nginx's listen directives [puppet] - 10https://gerrit.wikimedia.org/r/180455 (owner: 10Faidon Liambotis) [13:06:30] (03CR) 10Faidon Liambotis: [C: 032] install-server: add hasstatus => false to atftpd [puppet] - 10https://gerrit.wikimedia.org/r/180456 (owner: 10Faidon Liambotis) [13:12:46] (03PS1) 10Faidon Liambotis: base: puppet cleanups [puppet] - 10https://gerrit.wikimedia.org/r/180457 [13:22:05] (03PS1) 10Yuvipanda: tools: Add hhvm to exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/180458 [13:24:38] (03CR) 10Yuvipanda: [C: 032] tools: Add hhvm to exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/180458 (owner: 10Yuvipanda) [13:25:55] (03PS2) 10Faidon Liambotis: base: puppet cleanups [puppet] - 10https://gerrit.wikimedia.org/r/180457 [13:25:57] (03PS1) 10Faidon Liambotis: apt: remove class apt::puppet [puppet] - 10https://gerrit.wikimedia.org/r/180460 [13:33:12] * _joe_ bbl [13:33:45] <_joe_> YuviPanda: you may want to limit resource usage tightly for hhvm on toollabs [13:33:55] <_joe_> we can discuss that later maybe? [13:34:30] <_joe_> I need a pause, but I'll be back in ~ 1 hour [13:34:33] _joe_: yeah, would like to. Right now nobody is using it. Also wondering about the central repo and security considerations. [13:34:38] _joe_: cool. 
[13:40:28] (03CR) 10Alexandros Kosiaris: [C: 032] base: puppet cleanups [puppet] - 10https://gerrit.wikimedia.org/r/180457 (owner: 10Faidon Liambotis) [13:41:31] (03CR) 10Alexandros Kosiaris: [C: 032] apt: remove class apt::puppet [puppet] - 10https://gerrit.wikimedia.org/r/180460 (owner: 10Faidon Liambotis) [13:47:39] !log uploaded apertium-nob_0.1.0+svn~58076-1 and apertium-nno_0.1.0+svn~58076-1 to apt.wikimedia.org [13:47:47] Logged the message, Master [13:48:41] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: Puppet has 2 failures [13:48:46] PROBLEM - puppet last run on cp1070 is CRITICAL: CRITICAL: Puppet has 2 failures [13:48:47] PROBLEM - puppet last run on wtp1009 is CRITICAL: CRITICAL: Puppet has 2 failures [13:49:09] PROBLEM - puppet last run on mw1063 is CRITICAL: CRITICAL: Puppet has 2 failures [13:49:09] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: Puppet has 2 failures [13:49:10] PROBLEM - puppet last run on amslvs2 is CRITICAL: CRITICAL: Puppet has 2 failures [13:49:44] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 2 failures [13:49:44] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian packaging [debs/contenttranslation/apertium-nob] - 10https://gerrit.wikimedia.org/r/180424 (owner: 10KartikMistry) [13:49:49] PROBLEM - puppet last run on db1053 is CRITICAL: CRITICAL: Puppet has 2 failures [13:49:49] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Puppet has 2 failures [13:50:22] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Puppet has 2 failures [13:50:34] PROBLEM - puppet last run on caesium is CRITICAL: CRITICAL: Puppet has 2 failures [13:50:44] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: Puppet has 2 failures [13:50:52] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 2 failures [13:52:01] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: Puppet has 2 failures [13:52:13] PROBLEM - puppet last run on 
mw1140 is CRITICAL: CRITICAL: Puppet has 2 failures [13:52:22] RECOVERY - puppet last run on amslvs2 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [13:53:10] (03PS1) 10Dzahn: add hue.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/180471 [13:53:49] (03CR) 10Dzahn: [C: 031] "looks ok, needs DNS change though. https://gerrit.wikimedia.org/r/#/c/180471/" [puppet] - 10https://gerrit.wikimedia.org/r/180248 (owner: 10Ottomata) [13:53:52] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [13:56:15] (03CR) 10Dzahn: "allowing " /usr/sbin/service parsoid-rt-client restart" sounds perfectly reasonable, but what is "/home/parsoid-rt/update-code.sh" ?" [puppet] - 10https://gerrit.wikimedia.org/r/180221 (owner: 10Cscott) [13:57:04] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:00:39] (03CR) 10Dzahn: [C: 031] phabricator: strip Ubuntu 12.04 (precise) support [puppet] - 10https://gerrit.wikimedia.org/r/179882 (owner: 10Faidon Liambotis) [14:00:43] ignore the puppet problems, all are the result of merging https://gerrit.wikimedia.org/r/180457 and are all transient failures [14:01:09] (03PS1) 10Hashar: contint: +libmysqlclient-dev package on labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/180473 [14:02:37] RECOVERY - puppet last run on caesium is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:03:54] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:04:05] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:04:05] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:04:17] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 1 minute 
ago with 0 failures [14:04:27] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:04:27] RECOVERY - puppet last run on mw1063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:04:39] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:04:48] (CR) Dzahn: "i used salt to check the id of Apache on "-C 'G@cluster:appserver or G@cluster:api_appserver'" and they are ALL _gid_ 48 but uid 996" [puppet] - https://gerrit.wikimedia.org/r/178690 (owner: BryanDavis) [14:04:56] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:04:56] RECOVERY - puppet last run on db1053 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:04:56] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:05:50] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:07:05] (CR) Dzahn: "this would change the UID of Apache on ALL appservers from 996 to 48. it's correct that uid/gid should be 48/48 per https://wikitech.wikim" [puppet] - https://gerrit.wikimedia.org/r/178690 (owner: BryanDavis) [14:07:07] (CR) Hashar: [C: 1 V: 1] "Cherry picked on integration puppet master and confirmed to fix the issue."
[puppet] - https://gerrit.wikimedia.org/r/180473 (owner: Hashar) [14:08:49] (PS1) Faidon Liambotis: Add zlib1g-dev build-dep [debs/quickstack] - https://gerrit.wikimedia.org/r/180476 [14:09:25] (CR) Faidon Liambotis: [C: 2 V: 2] Add zlib1g-dev build-dep [debs/quickstack] - https://gerrit.wikimedia.org/r/180476 (owner: Faidon Liambotis) [14:11:02] hah @ "The site says: "Logstash (ssh deployment-bastion.eqiad.wmflabs sudo cat /root/secrets.txt)"" in a login banner [14:11:33] can't use LDAP for login? [14:11:46] Nemo_bis: You forgot the "requesting developer" indication in https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=138472&oldid=138399 [14:12:47] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: puppet fail [14:12:47] (CR) Dzahn: [C: 1] udp2log: replace iptables with ferm [puppet] - https://gerrit.wikimedia.org/r/169691 (owner: Matanya) [14:13:30] RECOVERY - puppet last run on d-i-test is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [14:13:44] anomie: no, I intentionally neglected it [14:14:13] Didn't know where else to place an anti-deployment warning/request [14:17:11] (PS2) Dzahn: beta: Log !log messages from #wikimedia-qa [puppet] - https://gerrit.wikimedia.org/r/179507 (owner: BryanDavis) [14:18:01] nice work mutante [14:18:05] with the udp2log stuff [14:22:52] (CR) Dzahn: [C: 2] beta: Log !log messages from #wikimedia-qa [puppet] - https://gerrit.wikimedia.org/r/179507 (owner: BryanDavis) [14:28:16] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [14:30:42] manybubbles, Reedy, or anyone else around: So gerrit change 180229 accidentally reverted a bunch of extensions to a really old version, but fortunately it doesn't seem to have been deployed.
https://gerrit.wikimedia.org/r/#/c/180477/ reverts those accidental changes (except the Flow one, which was already overridden by a later patch). Care to at least +1 for me? [14:36:37] (PS3) Dzahn: Redirect ve.wikimedia.org to wikimedia.org.ve [puppet] - https://gerrit.wikimedia.org/r/170925 (owner: Glaisher) [14:38:06] (CR) Dzahn: "maybe first uranium can be switched from using "ganglia" to "ganglia_new"? i'm not sure" [puppet] - https://gerrit.wikimedia.org/r/172434 (owner: John F. Lewis) [14:38:37] ori: maybe for later if you got a minute? -> https://gerrit.wikimedia.org/r/#/c/177080/ (can also replace https://gerrit.wikimedia.org/r/#/c/164508/ ) [14:38:56] anomie: done - you want me to verify the actual revisions? [14:40:27] manybubbles: I just wanted to avoid a self-merge. It's an actual revert, except for the changes to extensions/Flow and skins/Vector from the original patch. [14:40:51] anomie: cool I can even merge it during SWAT if you like but i imagine you'll just merge it now? [14:40:58] Yeah, doing it now. [14:42:17] (PS2) Alexandros Kosiaris: Added initial Debian packaging [debs/contenttranslation/apertium-nno-nob] - https://gerrit.wikimedia.org/r/180426 (owner: KartikMistry) [14:43:02] !log Merged and fetched [[gerrit:180477]], so undeployed bad extension changes from [[gerrit:180229]] are no longer a danger [14:43:11] Logged the message, Master [14:45:32] (CR) Alexandros Kosiaris: [C: 2 V: 2] "Added the override_dh_auto_test part to set the locale there as well. Otherwise the test suite fails" [debs/contenttranslation/apertium-nno-nob] - https://gerrit.wikimedia.org/r/180426 (owner: KartikMistry) [14:52:43] !log uploaded apertium-nno-nob_1.0.0+svn~57977-1 to apt.wikimedia.org [14:52:49] Logged the message, Master [14:58:11] Reedy: greg-g twentyafterfour do you know if we are indeed cutting a new 1.25wmf13 branch today for test2 / test wikidata?
[14:58:18] per https://wikitech.wikimedia.org/wiki/Deployments#Wednesday.2C.C2.A0December.C2.A017 [15:00:04] chasemp: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141217T1500). Please do the needful. [15:01:15] (CR) Ottomata: "Thanks, I was going to get the other change merged first and then test some things. I think there will be SSL thingees." [dns] - https://gerrit.wikimedia.org/r/180471 (owner: Dzahn) [15:02:38] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [15:02:46] (CR) Alexandros Kosiaris: [C: 2] "For I moment I was reminded of modules/toollabs/manifests/dev_environ.pp and the fight over there between libmysqlclient-dev/libmariadbcli" [puppet] - https://gerrit.wikimedia.org/r/180473 (owner: Hashar) [15:17:31] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:30:58] (PS1) Giuseppe Lavagetto: admins: allow Sam Smith to access stat100[23] [puppet] - https://gerrit.wikimedia.org/r/180491 [15:31:25] (CR) Giuseppe Lavagetto: [C: 2] admins: allow Sam Smith to access stat100[23] [puppet] - https://gerrit.wikimedia.org/r/180491 (owner: Giuseppe Lavagetto) [15:31:47] <_joe_> c'mon jenkins [15:32:00] <_joe_> you big pyle of java [15:32:30] <_joe_> insulting technology works, apparently [15:49:05] manybubbles: Are you going to SWAT today, since you have a change in? [15:49:21] anomie: I _should_ be I also have a meeting that starts at 11 [15:49:58] manybubbles: can be whenever [15:50:02] the swat [15:50:17] or the meeting can start late [15:55:26] (PS1) Manybubbles: Have cirrus use the safer query in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/180493 [15:56:08] I'd rather not do it today for the record, but I could if everyone else was, like, preventing the apocalypse [15:57:22] anomie: I'll do it but around 11:15.
[15:58:56] fine with me [15:59:27] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [16:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141217T1600). Please do the needful. [16:02:14] <^d> I shall do it [16:02:29] ^d: ok then [16:02:35] aude: meeting? [16:02:46] manybubbles: don't think i am needed for it [16:02:56] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [16:02:57] <^d> I'll start both, provided you're both about to be pingable :) [16:03:03] ok [16:03:15] <^d> +2 on both patches. [16:05:23] pingable [16:08:07] * ^d sings the jenkins song [16:09:06] (CR) BBlack: [C: -2] "This is basically on-hold for the present. HSTS will probably be the very last step we take in our general HTTPS plans, after most traffi" [puppet] - https://gerrit.wikimedia.org/r/178676 (owner: JanZerebecki) [16:10:50] !log demon Synchronized php-1.25wmf12/extensions/Wikidata/: (no message) (duration: 00m 12s) [16:10:53] <^d> aude: Ok, you're live ^ [16:10:58] Logged the message, Master [16:11:02] checking [16:11:20] looks ok afaik [16:11:25] <^d> Somebody reimaging apaches? [16:12:15] <^d> I got pubkey denied on mw1190 and host identification changed on mw1152 [16:14:25] !log demon Synchronized php-1.25wmf12/includes/specials/SpecialSearch.php: (no message) (duration: 00m 06s) [16:14:29] Logged the message, Master [16:14:37] <^d> manybubbles: ^ [16:14:46] <_joe_> ^d: no apaches are not reimaging, mw1152 has been reimaged as a jobrunner though [16:14:55] <^d> Ah, ok [16:15:03] ^d: works [16:15:07] <^d> Yay :) [16:20:23] <^d> !log mw1190: manually ran sync-common since it was yelling about my key earlier [16:20:30] Logged the message, Master [16:21:37] <^d> Ok, swat over.
Thanks for easy stuff today guys :) [16:23:48] ^d: I think the deployments week thing is broken lua errors [16:23:54] thanks ^d [16:24:18] and maybe I broke it [16:24:23] the challenges of wikitext and templates [16:24:48] <^d> manybubbles: {ircnick|manybubbles|Nik}} [16:24:52] <^d> Missing an opening { [16:24:57] damn it [16:24:58] fixing [16:25:57] ^d: all better now. I suppose that means you didn't see my attempt to sneak in https://gerrit.wikimedia.org/r/#/c/180493/ [16:26:17] <^d> Oh we can do that [16:26:19] anomie: also, I've cherry picked the prefix search fix: https://gerrit.wikimedia.org/r/#/c/180495/ [16:26:25] (CR) Chad: [C: 2] Have cirrus use the safer query in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/180493 (owner: Manybubbles) [16:26:30] ^d: maybe we should do ^ too? [16:26:33] (Merged) jenkins-bot: Have cirrus use the safer query in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/180493 (owner: Manybubbles) [16:26:44] manybubbles: Good! [16:26:57] anomie: I'll merge it to the deployment branch and make the submodule update [16:27:45] !log demon Synchronized wmf-config/CirrusSearch-labs.php: for completeness (duration: 00m 05s) [16:27:50] Logged the message, Master [16:28:01] <^d> manybubbles: you're live when jenkins does the update to beta. [16:28:13] ^d: looks like I made mistake on it though! [16:28:31] <^d> whoops [16:28:43] ^d: oh wait, no its right [16:28:57] ^d: and its live [16:29:00] http://simple.wikipedia.beta.wmflabs.org/wiki/Special:Search?search=a+test&go=Search&cirrusDumpQuery=yes [16:30:03] go git pull go! [16:30:47] <^d> \o/ [16:34:41] ^d: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=138505&oldid=138501 [16:37:00] (CR) BryanDavis: "Merging this without fixing the existing hosts first would be bad. It would change the uid of the running hhvm fcgi process and likely lea" [puppet] - https://gerrit.wikimedia.org/r/178690 (owner: BryanDavis) [16:39:22] ^d: thanks!
[16:42:00] !log demon Synchronized php-1.25wmf12/extensions/TextExtracts/: (no message) (duration: 00m 05s) [16:42:07] Logged the message, Master [16:42:52] aude: we should be, Reedy should be around today for that (cc twentyafterfour ) [16:43:30] <^d> manybubbles, aude: Anything else? [16:44:24] ^d:wikipedias will get wmf12 this afternoon, right? [16:44:45] <^d> I think so [16:44:55] cool - then they'll just get the prefix search then [16:44:58] no more backporting required [16:45:08] (PS1) Faidon Liambotis: sudo: move sudo-ldap Package from "ldap" to "sudo" [puppet] - https://gerrit.wikimedia.org/r/180502 [16:45:11] (PS1) Faidon Liambotis: Move /etc/sudoers from module "admin" to "sudo" [puppet] - https://gerrit.wikimedia.org/r/180503 [16:45:12] (PS1) Faidon Liambotis: sudo: fold sudo::labs_project into the role class [puppet] - https://gerrit.wikimedia.org/r/180504 [16:45:15] (PS1) Faidon Liambotis: sudo: reduce delta between ::group & ::user [puppet] - https://gerrit.wikimedia.org/r/180505 [16:45:17] (PS1) Faidon Liambotis: sudo: adjust sudoers for compat with newer sudo [puppet] - https://gerrit.wikimedia.org/r/180506 [16:45:21] (PS1) Faidon Liambotis: admin::sudo: remove privs => [absent] support [puppet] - https://gerrit.wikimedia.org/r/180507 [16:45:22] (PS1) Faidon Liambotis: admin::sudo: remove comment support [puppet] - https://gerrit.wikimedia.org/r/180508 [16:45:25] (PS1) Faidon Liambotis: Replace four admin::sudo calls with sudo::user/group [puppet] - https://gerrit.wikimedia.org/r/180509 [16:45:27] (PS1) Faidon Liambotis: admin: rename "privs" to "privileges" [puppet] - https://gerrit.wikimedia.org/r/180510 [16:45:28] (PS1) Faidon Liambotis: sudo: port over linting & sudoers from admin::sudo [puppet] - https://gerrit.wikimedia.org/r/180511 [16:45:30] (PS1) Faidon Liambotis: admin: remove ::sudo in favor of sudo::user/group [puppet] - https://gerrit.wikimedia.org/r/180512 [16:45:32]
(PS1) Faidon Liambotis: sudo: recursively manage /etc/sudoers.d [puppet] - https://gerrit.wikimedia.org/r/180513 [16:45:52] <^d> holy patchsets batman. [16:46:04] <_joe_> paravoid: oh man thanks [16:46:28] I just wanted to add two lines to sudoers [16:46:40] <_joe_> I had planned to clean that up since forever [16:47:14] <_joe_> I was tempted to do that over and over, and restrained myself while I was doing other things [16:47:19] I have two outstanding items I'd like to clean up and I'm not sure how [16:47:26] so, after this gets reviewed and merged, let's talk about it :) [16:48:49] (CR) Giuseppe Lavagetto: [C: 2] sudo: move sudo-ldap Package from "ldap" to "sudo" [puppet] - https://gerrit.wikimedia.org/r/180502 (owner: Faidon Liambotis) [16:50:13] <_joe_> I'm reviewing the whole series :) [16:50:20] (CR) Giuseppe Lavagetto: [C: 1] sudo: fold sudo::labs_project into the role class [puppet] - https://gerrit.wikimedia.org/r/180504 (owner: Faidon Liambotis) [16:52:14] (CR) Giuseppe Lavagetto: [C: 1] sudo: reduce delta between ::group & ::user [puppet] - https://gerrit.wikimedia.org/r/180505 (owner: Faidon Liambotis) [16:52:52] jouncebot: next [16:52:53] In 2 hour(s) and 7 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141217T1900) [16:52:59] jouncebot: refresh [16:53:02] I refreshed my knowledge about deployments. [16:54:21] (CR) Giuseppe Lavagetto: [C: 1] sudo: adjust sudoers for compat with newer sudo [puppet] - https://gerrit.wikimedia.org/r/180506 (owner: Faidon Liambotis) [16:57:20] (CR) Giuseppe Lavagetto: "I don't see how having that makes consolidation harder, but I probably miss the context." [puppet] - https://gerrit.wikimedia.org/r/180507 (owner: Faidon Liambotis) [16:58:54] (CR) Giuseppe Lavagetto: [C: 1] "I agree. Comments should be in the puppet code."
[puppet] - https://gerrit.wikimedia.org/r/180508 (owner: Faidon Liambotis) [17:00:41] (CR) Giuseppe Lavagetto: [C: 1] Replace four admin::sudo calls with sudo::user/group [puppet] - https://gerrit.wikimedia.org/r/180509 (owner: Faidon Liambotis) [17:01:21] from #mediawiki-core: [17:01:22] 11:58 < greg-g> 11:56 < andre__af> If Special:Recentchanges is 0 bytes but works when passing ?debug=true, what could be the reason? HHVM? (T78776) [17:01:25] 11:59 < bd808> debug=true would bypass varnish [17:01:28] 11:59 < bd808> X-CacheIcp1066 hit (2), cp4008 hit (2), cp4016 frontend miss (0) [17:01:31] 11:59 < bd808> for the empty page [17:01:36] bblack: ^ [17:01:38] oh [17:01:49] aha, aha! learned something. [17:02:00] * andre__ writes down on his debug page [17:02:08] varnish wouldn't generate an empty page, it would just cache one passed from the backend [17:02:18] (PS1) Rush: phab testing out ops access request macro [puppet] - https://gerrit.wikimedia.org/r/180516 [17:02:40] (CR) Giuseppe Lavagetto: [C: 1] admin: rename "privs" to "privileges" [puppet] - https://gerrit.wikimedia.org/r/180510 (owner: Faidon Liambotis) [17:03:03] I can't parse that commit message :) [17:03:47] paravoid: is that to me ? :) I can do better if so [17:03:59] yeah :) [17:04:28] imagine a colon after the first word and it's all crystal-clear! :P [17:05:32] should I make all spaces into dashes? [17:05:37] that would do it right?
[17:05:42] maybe some 1337 [17:06:18] if gerrit wasn't slow that clever witticism would have been shortly followed by a better message but nope hang and dead joke [17:07:08] Well_I_can_perfectly_read_your_commit_messages [17:07:48] (PS2) Rush: phab enable Security: Operations Access Request [puppet] - https://gerrit.wikimedia.org/r/180516 [17:08:28] (CR) Giuseppe Lavagetto: [C: -1] "Small comment, but LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/180511 (owner: Faidon Liambotis) [17:09:07] (CR) Rush: [C: 2] phab enable Security: Operations Access Request [puppet] - https://gerrit.wikimedia.org/r/180516 (owner: Rush) [17:09:58] (CR) Giuseppe Lavagetto: [C: 1] admin: remove ::sudo in favor of sudo::user/group [puppet] - https://gerrit.wikimedia.org/r/180512 (owner: Faidon Liambotis) [17:11:12] <_joe_> paravoid: I'm done for now :) I just have a doubt about managing sudoers.d recursively, but I'll think over it [17:11:16] * _joe_ off [17:14:48] (PS1) Dereckson: Namespace configuration on el.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/180518 [17:16:36] (PS2) Dereckson: Namespace configuration on el.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/180518 [17:17:32] (CR) Dereckson: "PS2: author → Author" [mediawiki-config] - https://gerrit.wikimedia.org/r/180518 (owner: Dereckson) [17:18:46] (PS10) Anomie: Configure Logstash and Elasticsearch for ApiFeatureUsage [puppet] - https://gerrit.wikimedia.org/r/173336 [17:22:51] (PS1) Ottomata: Use stat box internel addresses (.eqiad.wmnet) for everything [puppet] - https://gerrit.wikimedia.org/r/180520 [17:53:09] (PS1) Rush: Revert "phab enable Security: Operations Access Request" [puppet] - https://gerrit.wikimedia.org/r/180528 [17:53:23] (CR) Rush: [C: 2] Revert "phab enable Security: Operations Access Request" [puppet] - https://gerrit.wikimedia.org/r/180528 (owner: Rush) [17:53:31] (CR)
Rush: [V: 2] Revert "phab enable Security: Operations Access Request" [puppet] - https://gerrit.wikimedia.org/r/180528 (owner: Rush) [18:10:41] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 467.549988 [18:11:57] (PS1) Anomie: Enable ApiFeatureUsage on Beta Labs [mediawiki-config] - https://gerrit.wikimedia.org/r/180529 [18:17:35] hah [18:20:34] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 728.0 [18:30:26] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:41:24] (Draft1) Dereckson: Add task references to wgImportSources block [mediawiki-config] - https://gerrit.wikimedia.org/r/180543 [18:44:57] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 599.733337 [18:53:57] (Draft1) Dereckson: Import sources configuration on el.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/180546 [18:55:12] (PS2) Dereckson: Import sources configuration on el.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/180546 [18:55:39] (CR) Dereckson: "PS2: added reference to T78795" [mediawiki-config] - https://gerrit.wikimedia.org/r/180546 (owner: Dereckson) [19:00:05] Reedy, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141217T1900). Please do the needful. [19:00:11] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:00:56] (CR) Ottomata: "This was brought up in Scrum of Scrums today as something that needs a little poking.
I asked who in ops Citoid has previously been worki" [puppet] - https://gerrit.wikimedia.org/r/178419 (owner: Catrope) [19:03:35] (Abandoned) BryanDavis: Rewrite robots.php [mediawiki-config] - https://gerrit.wikimedia.org/r/177948 (owner: BryanDavis) [19:16:48] (CR) Cscott: "@Dzahn: https://www.mediawiki.org/wiki/Parsoid/Round-trip_testing#Updating_the_code_to_test_.28and_being_run_by_the_clients.29" [puppet] - https://gerrit.wikimedia.org/r/180221 (owner: Cscott) [19:18:59] (CR) Cscott: "(For the record, subbu is in 'cassandra-roots' as well as 'parsoid-admin', and cassandra-roots has `(ALL) NOPASSWD: ALL` sudo permissions." [puppet] - https://gerrit.wikimedia.org/r/180221 (owner: Cscott) [19:23:25] (CR) Subramanya Sastry: "Scott, Arlo, and I should be able to (I currently can, because of being a member of cassandra-roots) run various other commands as well on" [puppet] - https://gerrit.wikimedia.org/r/180221 (owner: Cscott) [19:27:40] (CR) Cscott: "gwicke, dzahn: any suggestions on how to structure these better?" [puppet] - https://gerrit.wikimedia.org/r/180221 (owner: Cscott) [19:35:04] (Draft1) Dereckson: Removed 'OTRS-member' user group on commons. [mediawiki-config] - https://gerrit.wikimedia.org/r/180560 [19:42:10] PROBLEM - Host d-i-test is DOWN: PING CRITICAL - Packet loss = 100% [19:52:59] Who's on ops duty? [19:53:18] The topic says _joe_ but I only see a _joe|off [19:53:45] Ops guys is off, the world shall end :( [19:55:28] did you expect the ops duty person to be here 24/7? :) [19:56:52] paravoid: did you forget to put that in the contract again? :p [20:06:16] <_joe|off> Krenair: I'm actually here by chance [20:06:31] <_joe|off> do you need something right away? [20:06:40] Nothing urgent. [20:06:55] I tried to send something to access-requests, got denied, sent to ops-requests instead.
[20:07:33] <_joe|off> oh, ok [20:08:48] Yeah, that's always the case [20:08:53] You just send it there and ops triage it [20:09:13] Who is actually allowed to create tickets in access-requests? ops? [20:09:29] Ops and managers iirc [20:09:48] How far 'managers' goes I don't know though [20:10:12] I'm assuming only WMF managers, no chapters? :) [20:10:38] But what counts as a WMF manager :p [20:11:39] goat and catherds [20:12:53] !log Jenkins some slaves are no more properly registered. Unpooling / Repooling them [20:12:58] Logged the message, Master [20:17:30] (CR) GWicke: "@cscott: A great way would be to have root in a disposable container running the service ;) I know, I know.." [puppet] - https://gerrit.wikimedia.org/r/180221 (owner: Cscott) [20:25:35] (PS1) Rush: rt redirect relevant queues to ops-private [puppet] - https://gerrit.wikimedia.org/r/180579 [20:30:07] (PS1) Reedy: Add/update symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/180582 [20:30:09] (PS1) Reedy: testwiki to 1.25wmf13 [mediawiki-config] - https://gerrit.wikimedia.org/r/180583 [20:30:11] (PS1) Reedy: wikipedias to 1.25wmf12 [mediawiki-config] - https://gerrit.wikimedia.org/r/180584 [20:30:13] (PS1) Reedy: group0 to 1.25wmf13 [mediawiki-config] - https://gerrit.wikimedia.org/r/180585 [20:31:02] Gerrit giving server unavailable [20:31:16] (CR) Reedy: [C: 2] Add/update symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/180582 (owner: Reedy) [20:32:48] (CR) Reedy: [C: 2] testwiki to 1.25wmf13 [mediawiki-config] - https://gerrit.wikimedia.org/r/180583 (owner: Reedy) [20:35:46] (PS1) Legoktm: Move composer.json into repository root [mediawiki-config] - https://gerrit.wikimedia.org/r/180589 [20:36:22] !log reedy Started scap: testwiki to 1.25wmf13 and build l10n cache [20:36:23] (Merged) jenkins-bot: Add/update symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/180582 (owner: Reedy) [20:36:26]
Logged the message, Master [20:36:31] (Merged) jenkins-bot: testwiki to 1.25wmf13 [mediawiki-config] - https://gerrit.wikimedia.org/r/180583 (owner: Reedy) [20:36:32] Nice of you to be with us jenkins [20:36:33] !log Jenkins/Zuul had some deadlock. Disconnected/reconnected slaves but that did not fix it. Finally had to disconnect/reconnect thegearman client in Jenkins and it is processing again. [20:36:37] Logged the message, Master [20:38:55] (CR) dschwen: [C: 1] Removed 'OTRS-member' user group on commons. [mediawiki-config] - https://gerrit.wikimedia.org/r/180560 (owner: Dereckson) [20:44:07] (CR) Rush: [C: 2 V: 2] "jenkins where are you?" [puppet] - https://gerrit.wikimedia.org/r/180579 (owner: Rush) [20:50:02] (CR) BryanDavis: [C: 1] "Should effectively be a no-op in all environments. It's just composer book keeping changes." [mediawiki-config] - https://gerrit.wikimedia.org/r/180589 (owner: Legoktm) [20:53:01] gerrit doesn't want to load for me now [21:00:05] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141217T2100). Please do the needful. [21:00:16] * Reedy is still scapping [21:19:01] (PS1) Aaron Schulz: Use ProfilerXhprof for HHVM hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/180637 [21:21:09] (PS1) Yuvipanda: quarry: Remove sudo hack for quarry [puppet] - https://gerrit.wikimedia.org/r/180638 [21:22:55] (CR) Yuvipanda: [C: 2] quarry: Remove sudo hack for quarry [puppet] - https://gerrit.wikimedia.org/r/180638 (owner: Yuvipanda) [21:23:34] (CR) Mark Bergsma: "It adds them, but won't use them for @rt.wikimedia.org. That domain is not a system_domain, and shouldn't be. Currently there's no alias r" [puppet] - https://gerrit.wikimedia.org/r/168733 (owner: Dzahn) [21:24:06] !log @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!
@ for mw1152 [21:24:11] Logged the message, Master [21:25:42] !log reedy Started scap: testwiki to 1.25wmf13 and build l10n cache [21:27:15] (CR) Yuvipanda: [V: 2] "Jenkins is dead? DEAD?" [puppet] - https://gerrit.wikimedia.org/r/180638 (owner: Yuvipanda) [21:37:38] (PS1) Mark Bergsma: Create a special alias router for RT [puppet] - https://gerrit.wikimedia.org/r/180641 [21:38:09] !log reedy Finished scap: testwiki to 1.25wmf13 and build l10n cache (duration: 12m 26s) [21:38:12] Logged the message, Master [21:39:02] (CR) Hoo man: [C: 2] " Can someone merge those 2 for me in a few minutes?" [mediawiki-config] - https://gerrit.wikimedia.org/r/180584 (owner: Reedy) [21:39:19] (PS2) Mark Bergsma: Create a special alias router for RT [puppet] - https://gerrit.wikimedia.org/r/180641 [21:39:32] (Merged) jenkins-bot: wikipedias to 1.25wmf12 [mediawiki-config] - https://gerrit.wikimedia.org/r/180584 (owner: Reedy) [21:39:34] (CR) Hoo man: [C: 2] " Can someone merge those 2 for me in a few minutes?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/180585 (owner: Reedy) [21:40:24] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf12 [21:40:28] Logged the message, Master [21:41:09] (Merged) jenkins-bot: group0 to 1.25wmf13 [mediawiki-config] - https://gerrit.wikimedia.org/r/180585 (owner: Reedy) [21:41:13] (PS3) Mark Bergsma: Create a special alias router for RT [puppet] - https://gerrit.wikimedia.org/r/180641 [21:41:34] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf13 [21:41:38] Logged the message, Master [21:43:04] (PS4) Mark Bergsma: Create a special alias router for RT [puppet] - https://gerrit.wikimedia.org/r/180641 [21:44:58] enough [21:45:00] killing jenkins [21:45:14] !log killing Jenkins [21:45:20] Logged the message, Master [21:45:21] (PS5) Mark Bergsma: Create a special alias router for RT [puppet] - https://gerrit.wikimedia.org/r/180641 [21:48:14] (CR) Mark Bergsma: [C: 2 V: 1] Create a special alias router for RT [puppet] - https://gerrit.wikimedia.org/r/180641 (owner: Mark Bergsma) [21:48:22] (CR) Mark Bergsma: [V: 2] Create a special alias router for RT [puppet] - https://gerrit.wikimedia.org/r/180641 (owner: Mark Bergsma) [21:50:41] (PS1) Mark Bergsma: Fix aliases file URL [puppet] - https://gerrit.wikimedia.org/r/180643 [21:50:47] grrrit-wm: /clear [21:51:17] (CR) Mark Bergsma: [C: 2 V: 2] Fix aliases file URL [puppet] - https://gerrit.wikimedia.org/r/180643 (owner: Mark Bergsma) [21:53:29] (PS2) Chad: Make `es-tool ban-node` handle both IP addressses and hostnames [puppet] - https://gerrit.wikimedia.org/r/180210 [21:53:37] (CR) jenkins-bot: [V: -1] Make `es-tool ban-node` handle both IP addressses and hostnames [puppet] - https://gerrit.wikimedia.org/r/180210 (owner: Chad) [22:00:50] (PS3) Chad: Make `es-tool ban-node` handle both IP addressses and hostnames [puppet]
- https://gerrit.wikimedia.org/r/180210 [22:08:03] Coren: the admin tool definitely makes a good case for moving to hhvm. [22:08:26] Coren: I’m thinking of exprimenting nginx -> hhvm instead of lighttpd to hhvm, since paravoid has been trying to eliminate lighty from our config :) [22:08:35] or maybe best not to change two things at once... [22:10:15] Coren: do we have an ‘admin-test’ thing? [22:10:20] * YuviPanda checks [22:10:32] we doooo [22:12:53] meh, it doesn’t work [22:13:02] * YuviPanda gives up for now, goes to sleep instead. [22:16:28] mark: Got your test okay. [22:16:36] yeah good [22:46:38] !log restarting Jenkins [22:46:43] Logged the message, Master [22:53:55] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail [22:56:35] hashar: you need any help with anything? [23:00:12] greg-g: I think I found the issue finally [23:00:26] good [23:00:28] greg-g: we have some plugin throttling browser test which apparently badly interact with Jenkins [23:00:36] greg-g: I have upgraded it, will see what happens. [23:00:38] oh, huh [23:00:54] greg-g: in short whenever a job using that plugin is running, it prevents anything else from running. Doh [23:01:14] greg-g: will test again tomorrow morning and report on whatever Task I filled about that issue [23:09:14] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:19:39] (CR) Chad: [C: 2] Use ProfilerXhprof for HHVM hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/180637 (owner: Aaron Schulz) [23:21:11] (Merged) jenkins-bot: Use ProfilerXhprof for HHVM hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/180637 (owner: Aaron Schulz) [23:22:06] !log demon Synchronized wmf-config/StartProfiler.php: xhprof on all hhvm hosts in eqiad (duration: 00m 05s) [23:22:10] Logged the message, Master [23:34:43] !log Restarted Jenkins and Zuul again to have a clean start while I am crashing to bed.
[23:34:46] Logged the message, Master [23:35:29] :/ [23:35:31] g'night hashar