[00:00:30] mutante: That page is the perfect example of documentation rot
[00:00:42] (PS2) Ori.livneh: Update remaining references to /u/l/a/common-local [mediawiki-config] - https://gerrit.wikimedia.org/r/159635
[00:01:50] bd808: yea, "type apt-get install && reboot"
[00:07:34] (CR) Ori.livneh: [C: 2] Update remaining references to /u/l/a/common-local [mediawiki-config] - https://gerrit.wikimedia.org/r/159635 (owner: Ori.livneh)
[00:07:56] !log ori updated /a/common to {{Gerrit|Id607bf36d}}: Update remaining references to /u/l/a/common-local
[00:08:01] (CR) Dzahn: [C: 2] delete "check_bad_apaches" monitoring [puppet] - https://gerrit.wikimedia.org/r/159619 (owner: Dzahn)
[00:08:02] Logged the message, Master
[00:08:16] * ori tries that config change on mw1017
[00:09:18] grrr, terbium wtf
[00:10:23] Jeff_Green: what's up?
[00:10:39] insane cronspam
[00:11:06] apache : user NOT in sudoers
[00:11:09] ooh?
[00:11:15] that's the one yeah
[00:11:19] fail
[00:11:41] i bet that is
[00:11:43] Reedy: i think that's your change
[00:11:45] https://gerrit.wikimedia.org/r/#/c/157013/
[00:12:16] apache : user NOT in sudoers ; TTY=unknown ; PWD=/var/www ; USER=apache ; COMMAND=/usr/bin/php ...
[00:12:22] yea, must be from deployment
[00:12:47] Reedy: can you amend that to check that the user is not already apache?
[00:12:50] lol
[00:12:52] Poor apache can't even sudo to himself :(
[00:12:52] one mail per wiki :)
[00:12:53] ...
[00:12:57] :D
[00:13:21] Jeff_Green: sorry 'bout that
[00:13:41] no worries
[00:14:05] bye!
[00:14:16] Is there a way of doing this without having to duplicate the whole line in an if statement?
[00:14:38] put the whole line in an if statement, don't get fancy now
[00:14:45] oh, just $SUDO at the start or something?
[00:15:27] $sudo = ''; if ( user != apache ) { $sudo = 'sudo -u apache'; } $SUDO php -ddisplay_errors=On $MW_COMMON/multiversion/MWScript.php $CMD --wiki=$x "
[00:15:27] ${@}" | sed -u "s/^/$x: /"
[00:15:29] that should work
[00:15:38] noting that isn't bash syntax, yada yada
[00:16:07] (CR) RobH: [C: 1] blog/techblog - TTL back to 1H [dns] - https://gerrit.wikimedia.org/r/158277 (owner: Dzahn)
[00:16:34] sudo=
[00:16:36] if groups | grep -Ewq 'sudo|wikidev|root'; then
[00:16:42] sudo='sudo -u apache'
[00:16:44] fi
[00:16:46] :P
[00:17:02] or just $USER
[00:17:23] mwscript is more fancy
[00:17:27] and probably for a reason
[00:17:53] eg. snapshots use dataset instead of apache for stuff
[00:18:31] so a sole check on $USER might lead to subtle breakage
[00:18:59] Reedy: [[ "$(id -u)" != "$(id -u apache)" ]] || [[ "$(groups)" == *wikidev* ]]
[00:19:15] $waysToSkinACat++;
[00:19:17] yes, it assumes there are no groups with 'wikidev' in their name other than 'wikidev'
[00:20:54] (PS2) Dzahn: blog/techblog - TTL back to 1H [dns] - https://gerrit.wikimedia.org/r/158277
[00:21:05] (CR) jenkins-bot: [V: -1] blog/techblog - TTL back to 1H [dns] - https://gerrit.wikimedia.org/r/158277 (owner: Dzahn)
[00:21:23] lol
[00:22:38] !log ori Synchronized docroot and w: Id607bf36d: Update remaining references to /u/l/a/common-local (duration: 00m 04s)
[00:22:42] Logged the message, Master
[00:23:57] (PS3) Dzahn: blog/techblog - TTL back to 1H [dns] - https://gerrit.wikimedia.org/r/158277
[00:24:04] Reedy: what is this [2014-09-11 00:23:53] Fatal error: Base lambda function for closure not found at /usr/local/apache/common-local/php-1.24wmf20/extensions/Wikidata/extensions/Wikibase/lib/config/WikibaseLib.default.php on line 18 ? already reported/debugged?
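Editorial note on the sudo discussion above: the "user NOT in sudoers" cron mail comes from the wrapper calling sudo even when it already runs as apache. A minimal POSIX shell sketch of the "$SUDO at the start" idea follows. The helper name sudo_prefix is hypothetical, and the UIDs are passed in as arguments only so the logic can be checked without a real apache account. It covers only the "already that user" case; it does not reproduce the group check from the actual change.

```shell
# Sketch only: print a "sudo -u <user>" prefix unless we already are that user.
# sudo_prefix is a hypothetical helper, not part of foreachwiki or mwscript.
sudo_prefix() {
    # $1 = target user name, $2 = current uid, $3 = target user's uid
    if [ "$2" = "$3" ]; then
        # Already running as the target user: no prefix needed.
        printf ''
    else
        printf 'sudo -u %s' "$1"
    fi
}

# Possible use in a wrapper script (assumes the apache user exists):
#   prefix=$(sudo_prefix apache "$(id -u)" "$(id -u apache)")
#   $prefix php -ddisplay_errors=On "$MW_COMMON/multiversion/MWScript.php" "$@"
```

As noted in the conversation, a check on the user alone may be too narrow (snapshots run as dataset, not apache), so a real wrapper would likely need the group test as well.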
[00:24:10] https://github.com/wikimedia/mediawiki-vagrant/blob/master/puppet/modules/mediawiki/templates/multiwiki/foreachwiki.erb#L4
[00:24:16] ori: It's an intermittent APC issue
[00:24:20] ori: That's the APC f.ck up
[00:24:25] Gracefulling the apache "fixes" it
[00:24:35] (CR) Dzahn: [C: 2] blog/techblog - TTL back to 1H [dns] - https://gerrit.wikimedia.org/r/158277 (owner: Dzahn)
[00:24:40] didn't joe do a graceful-all a little bit ago?
[00:24:49] Yeah, it started just after
[00:24:58] mutante gracefulled another apache
[00:25:03] just a single one
[00:25:23] 10.64.32.41
[00:25:31] 10.64.32.57
[00:25:32] atm it seems
[00:25:38] !log ori Synchronized multiversion: Id607bf36d: Update remaining references to /u/l/a/common-local (duration: 00m 04s)
[00:25:42] !log ori Synchronized wmf-config: Id607bf36d: Update remaining references to /u/l/a/common-local (duration: 00m 03s)
[00:25:43] Logged the message, Master
[00:25:48] Logged the message, Master
[00:26:34] yup, looks to be just those 2
[00:27:11] Inbox(416)
[00:27:29] Reedy: they need graceful too?
[00:27:42] please
[00:28:12] done
[00:28:24] !log graceful'ed Apaches on mw1171, mw1187
[00:28:29] Logged the message, Master
[00:31:14] Reedy: i got the user check thing
[00:35:57] (PS1) Ori.livneh: foreachwikiindblist: check groups before attempting to sudo [puppet] - https://gerrit.wikimedia.org/r/159643
[00:36:11] w/in 37
[00:36:13] mutante: ^
[00:36:30] mutante: 37 what?
[00:37:23] (CR) Hoo man: [C: 1] "That's one way to do it :P" [puppet] - https://gerrit.wikimedia.org/r/159643 (owner: Ori.livneh)
[00:37:24] 37 windows in my IRC client :o
[00:37:33] oh, heh.
[00:39:21] (CR) Dzahn: [C: 1] "yes please, it's like mwscript. confirmed.
and this should stop current mail spam to roots" [puppet] - https://gerrit.wikimedia.org/r/159643 (owner: Ori.livneh)
[00:41:14] (CR) Ori.livneh: [C: 2] foreachwikiindblist: check groups before attempting to sudo [puppet] - https://gerrit.wikimedia.org/r/159643 (owner: Ori.livneh)
[00:43:16] (PS1) Ori.livneh: Remove MW_DBLISTS config vars [puppet] - https://gerrit.wikimedia.org/r/159645
[00:43:17] mutante: merged and ran puppet on terbium
[00:43:22] mutante: are the alerts better now?
[00:45:13] ori: i think so, yea, the last one i got was now, curiously
[00:45:20] Mail Delivery Subsystem
[00:45:27] Delivery to the following recipient failed permanently:...
[00:45:34] " Internal Message-ID collision"
[00:51:08] yea, has stopped. thanks
[01:03:39] (PS1) Ori.livneh: Replace remaining references to /u/l/a/common [mediawiki-config] - https://gerrit.wikimedia.org/r/159650
[01:06:25] (PS1) Jforrester: Move Parsoid extension-list pointer now the code has moved [mediawiki-config] - https://gerrit.wikimedia.org/r/159651
[01:06:27] (PS1) Jforrester: Move Parsoid pointer in extension-list now the repo has been restructured [mediawiki-config] - https://gerrit.wikimedia.org/r/159652
[01:07:49] (PS2) Jforrester: Move Parsoid pointer in Labs extension-list as the repo has been restructured [mediawiki-config] - https://gerrit.wikimedia.org/r/159651
[01:08:07] (CR) MZMcBride: "Ottomata: People are sharing passwords? I don't follow."
[puppet] - https://gerrit.wikimedia.org/r/155452 (owner: Dzahn)
[01:39:30] (PS1) Ori.livneh: Add grafana.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/159655
[01:39:39] brgh
[01:39:47] (Abandoned) Ori.livneh: Add grafana.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/159655 (owner: Ori.livneh)
[01:40:28] (PS2) Ori.livneh: Add grafana.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/133275
[01:40:30] (PS4) Dzahn: webserver - use ssl_ciphersuite in generic_vhost [puppet] - https://gerrit.wikimedia.org/r/153971
[01:41:15] (CR) Ori.livneh: "@mutante: definitely still desired! the dependent change () needs a friend, too." [dns] - https://gerrit.wikimedia.org/r/133275 (owner: Ori.livneh)
[01:41:16] (CR) Dzahn: "_joe_:" [puppet] - https://gerrit.wikimedia.org/r/153971 (owner: Dzahn)
[01:49:46] (CR) Dzahn: [C: 2] StrictTransportSecurity for lists.wm.org [puppet] - https://gerrit.wikimedia.org/r/145500 (https://bugzilla.wikimedia.org/38516) (owner: Dzahn)
[01:50:50] (CR) Dzahn: [V: 2] StrictTransportSecurity for lists.wm.org [puppet] - https://gerrit.wikimedia.org/r/145500 (https://bugzilla.wikimedia.org/38516) (owner: Dzahn)
[01:54:18] (CR) Dzahn: "curl -s -D- https://lists.wikimedia.org/mailman/listinfo | grep Strict" [puppet] - https://gerrit.wikimedia.org/r/145500 (https://bugzilla.wikimedia.org/38516) (owner: Dzahn)
[02:05:34] (CR) Catrope: [C: 2] Move Parsoid pointer in Labs extension-list as the repo has been restructured [mediawiki-config] - https://gerrit.wikimedia.org/r/159651 (owner: Jforrester)
[02:05:39] (Merged) jenkins-bot: Move Parsoid pointer in Labs extension-list as the repo has been restructured [mediawiki-config] - https://gerrit.wikimedia.org/r/159651 (owner: Jforrester)
[02:05:41] (PS2) Catrope: Move Parsoid pointer in extension-list now the repo has been restructured [mediawiki-config] - https://gerrit.wikimedia.org/r/159652 (owner:
Jforrester)
[02:06:02] (PS1) Yurik: zerowiki: Removed zeroadmin group, rights adjustments [mediawiki-config] - https://gerrit.wikimedia.org/r/159656
[02:09:11] (CR) Jforrester: "Must be merged and deployed alongside 83ff3f54c." [mediawiki-config] - https://gerrit.wikimedia.org/r/159652 (owner: Jforrester)
[02:10:30] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3615 MB (3% inode=99%):
[02:14:49] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /a/common/).
[02:15:13] (PS1) Dzahn: add stats table for sourceforge wikis [debs/wikistats] - https://gerrit.wikimedia.org/r/159661 (https://bugzilla.wikimedia.org/58396)
[02:16:47] greg-g: OK for me to schedule a follow-on deploy after the train tomorrow for Roan to fix up Parsoid? The repo structure changed and we need to move the entry in extension-list and push the new Parsoid cherry-pick at the same time.
[02:23:30] !log LocalisationUpdate completed (1.24wmf15) at 2014-09-11 02:23:29+00:00
[02:23:40] Logged the message, Master
[02:28:55] greg-g: Never mind; Roan and I found that Reedy is wonderful, and all he needs to do is push a config change as part of the train instead.
[02:29:20] (CR) Jforrester: [C: 1] "Actually, this is good to go now; must be done before wmf21 train."
[mediawiki-config] - https://gerrit.wikimedia.org/r/159652 (owner: Jforrester)
[02:32:43] (PS1) Dzahn: add stats table for orain.org wikis [debs/wikistats] - https://gerrit.wikimedia.org/r/159663 (https://bugzilla.wikimedia.org/70309)
[02:35:51] (PS1) Yurik: Zero: fix X-CS tagging for the zero.wp and m.wp redirect pages [puppet] - https://gerrit.wikimedia.org/r/159664
[02:35:59] bblack, when you have a sec ^
[02:36:01] minor fix
[02:36:02] (PS1) Dzahn: wikistats-crons for updates of sourceforge, orain [puppet] - https://gerrit.wikimedia.org/r/159665 (https://bugzilla.wikimedia.org/70309)
[02:36:37] !log LocalisationUpdate completed (1.24wmf19) at 2014-09-11 02:36:37+00:00
[02:36:42] Logged the message, Master
[02:36:50] (CR) Dzahn: [C: 2] add stats table for sourceforge wikis [debs/wikistats] - https://gerrit.wikimedia.org/r/159661 (https://bugzilla.wikimedia.org/58396) (owner: Dzahn)
[02:37:06] (CR) Dzahn: [C: 2] add stats table for orain.org wikis [debs/wikistats] - https://gerrit.wikimedia.org/r/159663 (https://bugzilla.wikimedia.org/70309) (owner: Dzahn)
[02:38:46] (CR) BBlack: [C: 2] Zero: fix X-CS tagging for the zero.wp and m.wp redirect pages [puppet] - https://gerrit.wikimedia.org/r/159664 (owner: Yurik)
[02:38:53] thx )
[02:39:09] np
[02:45:33] (PS1) Dzahn: bump version up to 2.8 [debs/wikistats] - https://gerrit.wikimedia.org/r/159666
[02:46:11] (CR) Dzahn: [C: 2] bump version up to 2.8 [debs/wikistats] - https://gerrit.wikimedia.org/r/159666 (owner: Dzahn)
[02:46:19] (CR) Dzahn: [V: 2] bump version up to 2.8 [debs/wikistats] - https://gerrit.wikimedia.org/r/159666 (owner: Dzahn)
[02:46:51] (CR) Dzahn: [C: 2] wikistats-crons for updates of sourceforge, orain [puppet] - https://gerrit.wikimedia.org/r/159665 (https://bugzilla.wikimedia.org/70309) (owner: Dzahn)
[02:47:08] (PS1) Jforrester: Follow-up I51abd7c: Enable Commons use for wikitech (labswiki)
[mediawiki-config] - https://gerrit.wikimedia.org/r/159667
[02:49:17] (CR) Jforrester: Mark out a bunch of code for wikitech. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/158138 (owner: Andrew Bogott)
[02:49:26] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-11 02:49:26+00:00
[02:49:31] Logged the message, Master
[02:50:06] James_F: https://gerrit.wikimedia.org/r/#/c/158313/ :p
[02:50:16] Bleurgh.
[02:50:27] hehe
[02:50:42] (CR) Jforrester: [C: -1] "The convention is intentionally tabs, not spaces. This is definitely the wrong way to go…" [mediawiki-config] - https://gerrit.wikimedia.org/r/158313 (owner: Dzahn)
[02:54:32] (CR) Catrope: [C: -1] "We use tabs for all PHP code in the MediaWiki ecosystem, this repo shouldn't be different." [mediawiki-config] - https://gerrit.wikimedia.org/r/158313 (owner: Dzahn)
[03:00:19] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:25:22] (PS1) Dzahn: orain actually has a /w/ style URL, the default [debs/wikistats] - https://gerrit.wikimedia.org/r/159668
[03:25:57] (CR) Dzahn: [C: 2] orain actually has a /w/ style URL, the default [debs/wikistats] - https://gerrit.wikimedia.org/r/159668 (owner: Dzahn)
[03:26:03] (CR) Dzahn: [V: 2] orain actually has a /w/ style URL, the default [debs/wikistats] - https://gerrit.wikimedia.org/r/159668 (owner: Dzahn)
[03:37:25] (CR) Ori.livneh: "@mutante: yes, that should work."
[puppet] - https://gerrit.wikimedia.org/r/153971 (owner: Dzahn)
[03:41:06] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Sep 11 03:41:03 UTC 2014 (duration 41m 2s)
[03:41:13] Logged the message, Master
[04:24:18] (CR) BryanDavis: "The problem that led to introducing this flag was that virt1001 where labswiki/wikitech runs cannot communicate with the production databa" [mediawiki-config] - https://gerrit.wikimedia.org/r/159667 (owner: Jforrester)
[05:08:57] (CR) Chmarkine: "Note that now the connection to http://ishmael.wikimedia.org no longer redirects to HTTPS." [puppet] - https://gerrit.wikimedia.org/r/154969 (owner: Dzahn)
[06:15:53] (PS1) Ori.livneh: mediawiki::sync: re-declare deployment paths [puppet] - https://gerrit.wikimedia.org/r/159674
[06:28:11] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:42] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:01] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:02] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:11] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:05] it's mod_passenger o'clock
[06:39:35] <_joe_> yes
[06:39:38] <_joe_> 8:30 here
[06:39:55] <_joe_> so 23:30 PT I guess
[06:40:24] <_joe_> or PDT, more probably
[06:40:51] RECOVERY - Disk space on ms1004 is OK: DISK OK
[06:43:51] PROBLEM - Disk space on ms1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=94%): /var/lib/ureadahead/debugfs 0 MB (0% inode=94%):
[06:45:22] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:45:52] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:21] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last
run 26 seconds ago with 0 failures
[06:46:21] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:46:22] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[07:01:59] _joe_: i merged tim's patch, so luasandbox should be good to go
[07:02:09] <_joe_> ok
[07:02:23] i have a good feeling about this
[07:02:29] <_joe_> I'll package it and I'll deploy it on hhvm everywhere
[07:02:48] sweet. yeah, it's safe
[07:02:51] <_joe_> Do I have a way to confirm it works?
[07:02:57] the behavior it's replacing is totally broken
[07:02:59] sure, yeah, just a sec
[07:03:01] <_joe_> I'd like to restart the JR in case
[07:09:04] _joe_: ok so, osmium:/root/bug70177.py . evidence of bug:
[07:09:13] (Usage: bug70177.py HOST NUM_REQS, btw)
[07:09:23] <_joe_> ok
[07:10:11] # python bug70177.py mw1019 1
[07:10:11] limitreport-cputime: 7.927
[07:10:11] limitreport-walltime: 7.972
[07:10:13] scribunto-limitreport-timeusage: 1.962
[07:10:15] ---
[07:10:24] but with three requests:
[07:10:58] https://dpaste.de/dJ8W/raw
[07:12:10] <_joe_> on the JR, do you remember which errors would appear in the logs?
[07:12:17] <_joe_> it may be in the bug
[07:12:20] note how scribunto-limitreport-timeusage balloons to >5 secs
[07:12:40] but three reqs against a zend host: https://dpaste.de/qLBt/raw
[07:12:58] <_joe_> heh
[07:13:38] don't pay attention to cputime and walltime; they'll be busted on prod until https://gerrit.wikimedia.org/r/#/c/158550/ rolls out
[07:14:51] _joe_: the errors weren't in the logs, because HHVM wouldn't fatal or throw an exception; as far as it was concerned it was doing its job enforcing the lua time limit on misbehaving scripts.
[07:15:04] <_joe_> oh my
[07:15:27] we saw it on the job runner because there's actually some parallelism there
[07:15:30] <_joe_> it wouldn't even print a notice somewhere?
[07:15:42] the script error is part of the page output
[07:16:11] e.g.: https://test.wikipedia.org/wiki/BusyLoop
[07:16:22] (if you click on 'Script Error' you'll get the details)
[07:16:27] <_joe_> yes I've seen it
[07:16:43] <_joe_> I hoped something was in the logs as well
[07:20:40] well, there's https://test.wikipedia.org/w/index.php?title=Category:Pages_with_script_errors
[07:20:48] which is automatically populated/updated
[07:21:00] does that count? :)
[07:22:10] <_joe_> eheh kind of
[07:22:12] <_joe_> :)
[07:26:13] <_joe_> ok thanks - I think we can 'announce' the worst-kept secret of the world to engineering@ and in other places once I'm done with this
[07:30:41] PROBLEM - DPKG on rubidium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:32:22] PROBLEM - DPKG on mexia is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:32:41] RECOVERY - DPKG on rubidium is OK: All packages OK
[07:33:18] ^ that's me
[07:33:22] RECOVERY - DPKG on mexia is OK: All packages OK
[07:33:34] apparently long-running apt-get upgrade -> spam irc
[07:33:43] new gdnsd?
[07:33:51] that too, yeah
[07:34:10] I built some one off 1.11.5~precise1 and ~trusty1 for our nsX and installed them
[07:34:30] .5 being the pure fix?
[07:34:41] yeah, and misc other little things
[07:35:05] ok, I'll upload into Debian too then
[07:35:32] PROBLEM - DPKG on eeden is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:35:42] thanks :)
[07:36:26] today I finally figured out how to use git-pbuilder to do chroot builds for multiple distros
[07:36:35] RECOVERY - DPKG on eeden is OK: All packages OK
[07:36:36] it wasn't nearly as difficult as I thought it would be :)
[07:36:59] <_joe_> cool
[07:38:08] and it only took about 800M disk space on my labs instance to do both precise and trusty
[07:38:38] the basic idea is to do "DIST=trusty ARCH=amd64 git-pbuilder create" to set up the environment
[07:39:01] and then in your git checkout of debian branch, e.g. "git-buildpackage --git-pbuilder --git-dist=trusty --git-arch=amd64 -us -uc"
[08:04:00] (CR) Filippo Giunchedi: [C: 1] Remove MW_DBLISTS config vars [puppet] - https://gerrit.wikimedia.org/r/159645 (owner: Ori.livneh)
[08:18:48] (CR) Alexandros Kosiaris: [C: 2] "Ran it through compiler, noop" [puppet] - https://gerrit.wikimedia.org/r/159460 (owner: Matanya)
[08:34:24] <_joe_> !log updating php-pear php5 php5-cli php5-common php5-curl php5-dev php5-intl php5-mysql php5-xmlrpc libapache2-mod-php5 on mw1018, see USN 2344-1
[08:34:29] Logged the message, Master
[08:35:25] <_joe_> akosiaris: mmm we need to rebuild our packages
[08:35:39] <_joe_> (I didn't remember we make our own php packages)
[08:36:00] yeah we do.. due to 2 patches we incorporated for zend memory cleanup
[08:36:09] somehow our test suite crashes php
[08:36:15] <_joe_> akosiaris: ok, what's the project?
[08:36:31] ?
[08:36:37] I am almost done building them
[08:36:40] <_joe_> in gerrit, I mean
[08:36:43] <_joe_> oh ok
[08:36:46] <_joe_> d'oh
[08:36:49] will put them in carbon in a few
[08:36:54] <_joe_> ok
[08:45:50] apergos: is there anyone I should poke for https://rt.wikimedia.org/Ticket/Display.html?id=8253 ?
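Editorial note on the git-pbuilder recipe above: the two quoted commands have to be repeated once per distro. A minimal sketch that only prints them for a list of distros follows, so the sequence can be reviewed before anything is built. The helper name pbuilder_cmds and the printing approach are assumptions for illustration; the command strings themselves are the ones given in the conversation.

```shell
# Sketch only: print the chroot-build commands for several distros.
# pbuilder_cmds is a hypothetical helper; it echoes commands, it does not run them.
pbuilder_cmds() {
    # $1 = architecture, remaining arguments = distro names
    arch="$1"; shift
    for dist in "$@"; do
        # One-time setup of the chroot for this distro/arch.
        echo "DIST=$dist ARCH=$arch git-pbuilder create"
        # Build from the debian branch checkout inside that chroot.
        echo "git-buildpackage --git-pbuilder --git-dist=$dist --git-arch=$arch -us -uc"
    done
}

pbuilder_cmds amd64 precise trusty
```

Printing rather than executing keeps the sketch safe to run anywhere; piping the output to a shell inside a real packaging checkout would be the next step.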
[09:00:34] !log upgrading php5 to 5.3.10-1ubuntu3.14+wmf1 on mw1212
[09:00:39] Logged the message, Master
[09:01:01] _joe_: ^ let's see what happens
[09:02:06] <_joe_> akosiaris: ok, I'm actually working on testwiki to see if the luasandbox update works
[09:02:12] PROBLEM - DPKG on mw1212 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:04:03] <_joe_> \o/
[09:04:10] <_joe_> it _works_
[09:04:12] RECOVERY - DPKG on mw1212 is OK: All packages OK
[09:21:16] <_joe_> !log upgrading hhvm and hhvm-luasandbox across the production cluster
[09:21:21] Logged the message, Master
[09:32:44] akosiaris: i looked at the admin module, it calls validate_ensure which comes from modules/wmflib/lib/puppet/parser/functions/validate_ensure.rb but i can't see how that function is called. i took it into a VM i have, and puppet fails for not finding the .rb although i added it. it is in line 56 of user.pp. i get: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Unknown function validate_ensure at
[09:32:44] /etc/puppet/modules/admin/manifests/user.pp:56
[09:33:09] can you please educate me ? :)
[09:36:07] I need an op to graceful apache on mw1200 mw1196 and mw1186
[09:36:30] _joe_: ^
[09:36:51] <_joe_> hoo: why is it so?
[09:37:00] _joe_: APC mess up again
[09:37:15] no idea why we have that so frequently now, but we do
[09:37:41] <_joe_> can I take a look at APC stats before gracefulling them
[09:37:42] <_joe_> ?
[09:37:50] sure
[09:37:52] <_joe_> we may be able to set up some alert maybe
[09:38:01] but they're currently flooding the fatal logs
[09:38:07] <_joe_> ok
[09:38:14] <_joe_> so better make it quick
[09:38:27] <_joe_> (or, we set a metric on the number of fatals)
[09:38:51] matanya: not sure if I understand correctly but validate_ensure is defined in the wmflib module.
That means that the puppetmaster needs to have the wmflib module in the puppet modules directory to be able to access it
[09:38:55] <_joe_> !log gracefulling mw1200 mw1196 and mw1186 as they have APC issues
[09:39:00] Logged the message, Master
[09:39:17] matanya: after that is done, it is a "global" function. it can be used by any module
[09:39:27] does this answer your question ?
[09:39:34] akosiaris: my question is basically how puppet knows to find it in that module ?
[09:39:49] "that" being ?
[09:39:56] wmflib
[09:40:35] <_joe_> matanya: include wmflib
[09:40:47] <_joe_> like you do with stdlib
[09:41:11] ah, it parses modules//lib/puppet/parser/functions
[09:41:27] _joe_: no, he speaks about the functions
[09:41:33] <_joe_> hoo: done, fatal logs don't show anything more
[09:41:43] <_joe_> akosiaris: include ;)
[09:41:49] so if i put the function in modules/admin/lib it should work as well, and it doesn't
[09:42:08] <_joe_> akosiaris: once you include the module, all its functions are available to the puppet master when compiling the manifest
[09:42:15] _joe_: Any ideas about the root cause?
[09:42:27] matanya: you should read carefully https://docs.puppetlabs.com/guides/custom_functions.html
[09:42:28] <_joe_> hoo: no, let me look at the metrics
[09:42:38] _joe_: include where ?
[09:42:39] thank you akosiaris
[09:42:57] <_joe_> akosiaris: if you want to use wmflib functions in another module
[09:43:03] <_joe_> you have to include it
[09:43:05] <_joe_> AFAIR
[09:43:16] <_joe_> but this may have changed across puppet versions
[09:43:55] I don't think so.. I've never included a module for its functions
[09:44:02] and they are always available to me
[09:44:21] matanya: Put custom functions in the lib/puppet/parser/functions subdirectory of your module as it says in that doc
[09:44:38] it is there :/
[09:44:56] <_joe_> matanya: what's the name of the function file, and what's the content?
[09:44:57] matanya: be careful with custom functions...
they have a tendency to not really help
[09:46:00] <_joe_> hoo: I still have no idea about what caused the apc failure
[09:46:16] <_joe_> but I suspect it happened more often in the last couple of days, right?
[09:46:18] :S
[09:46:23] Yep
[09:46:36] <_joe_> that's because we gracefully restarted apache
[09:46:46] <_joe_> thus cleaning the apc cache after a long time
[09:46:53] we had that before, but never on such a wide scale
[09:46:55] <_joe_> and well, we _fixed_ something
[09:47:21] i'm checking something a sec
[09:47:29] <_joe_> http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1410428733.953&target=servers.mw1200.mw_apcCollector.cache_frag_pcnt.value&from=-240hours
[09:47:52] <_joe_> hoo: the apc cache was completely fragmented, thus very inefficient, for quite some time
[09:48:24] ok, _joe_ akosiaris Thanks! the issue was file permission on that function
[09:48:37] <_joe_> and apc was basically not doing GC in that condition
[09:48:57] i copied it into the admin module from wmflib module without chowning
[09:50:09] _joe_: mh... how can that be prevented? Auto graceful on high fragmentation (that seems dirty)
[09:50:18] <_joe_> hoo: eh
[09:50:39] <_joe_> hoo: btw, GC seems to have had not-so-positive effects on cache miss ratios
[09:52:58] <_joe_> I guess our solution is... hhvm
[09:53:12] someone rang?
[09:53:52] (PS3) Alexandros Kosiaris: Purge the amanda-server packages/configurations [puppet] - https://gerrit.wikimedia.org/r/159283
[09:53:55] <_joe_> uh, no, not really
[09:54:07] <_joe_> kill -SLEEP ori
[09:54:11] heh
[09:54:33] <_joe_> ori: if you're unable to sleep, I'm about to re-activate the jobrunner on mw1053
[09:54:36] _joe_: https://gerrit.wikimedia.org/r/159636
[09:54:51] _joe_: \o/
[09:57:22] the profiler timing data on labs is correct too, which means the php side of things works as well
[09:58:08] <_joe_> ori: did you update beta?
[09:58:25] not luasandbox
[09:58:36] but it's not dependent on that; it's the getrusage thing
[09:58:43] which we already upgraded beta for, iirc
[09:58:47] i can upgrade it now
[09:58:53] (PS4) Alexandros Kosiaris: Purge the amanda-server packages/configurations [puppet] - https://gerrit.wikimedia.org/r/159283
[09:59:03] since prod is looking stable, we can move forward to beta
[09:59:28] <_joe_> :P
[09:59:34] (CR) jenkins-bot: [V: -1] Purge the amanda-server packages/configurations [puppet] - https://gerrit.wikimedia.org/r/159283 (owner: Alexandros Kosiaris)
[09:59:52] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet last ran 639454 seconds ago, expected 14400
[10:00:04] <_joe_> !log enabled puppet on mw1053
[10:00:09] Logged the message, Master
[10:00:31] RECOVERY - Puppet freshness on mw1053 is OK: puppet ran at Thu Sep 11 10:00:27 UTC 2014
[10:01:54] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[10:02:15] <_joe_> ori: 2014-09-11T10:01:13+0000: Initialized loop 0 with 20 runner(s)
[10:02:37] <_joe_> oh it's 9/11 I didn't even notice
[10:02:57] <_joe_> which middle-east country will the US bomb this year for commemoration?
[10:03:54] <_joe_> (joking aside, it's really sad that NATO decides to go to war exactly on the anniversary of a tragic terrorist attack)
[10:09:05] i think it's the first time that there are no known issues with HHVM
[10:09:13] i'm sure it'll be over in 90 seconds when new bugs come in
[10:09:31] but i feel pretty happy about it right now
[10:09:54] quick, shut down bugzilla! :P
[10:10:41] <_joe_> done
[10:11:03] * ori high-fives _joe_
[10:11:20] <_joe_> :)
[10:12:52] <_joe_> hhvm _is_ impressive btw
[10:13:48] well, you know, i wrote it myself
[10:13:55] it wasn't easy!
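Editorial note on the validate_ensure problem earlier in the log: the "Unknown function" error turned out to be a copied function file that the puppetmaster could not read. A minimal sketch for spotting such files follows. The helper name unreadable_functions is hypothetical, and treating "not world-readable" as the symptom is an assumption; a master that reads the file through owner or group permissions would not be affected.

```shell
# Sketch only: list custom-function files that are not world-readable.
# unreadable_functions is a hypothetical helper, not part of puppet or wmflib.
unreadable_functions() {
    # $1 = modules directory, e.g. /etc/puppet/modules
    # Matches <module>/lib/puppet/parser/functions/*.rb lacking the other-read bit.
    find "$1" -path '*/lib/puppet/parser/functions/*.rb' -type f ! -perm -004
}

# Possible use on a puppetmaster (path is an example):
#   unreadable_functions /etc/puppet/modules
```

An empty result does not prove the master can load the function; it only rules out this one permission pattern.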
[10:16:20] * ori should really sleep
[10:16:33] but not before making alexandros "i hate custom functions" kosiaris sad
[10:16:47] (PS1) Ori.livneh: wmflib: add to_milliseconds() / to_seconds() [puppet] - https://gerrit.wikimedia.org/r/159692
[10:17:00] nice and frivolous
[10:17:43] <_joe_> lol
[10:18:02] <_joe_> I hate custom functions as well if they're not needed
[10:19:45] dismiss_objections()
[10:19:58] bye! _joe_ thanks for the packages
[10:23:06] (PS5) Alexandros Kosiaris: Purge the amanda-server packages/configurations [puppet] - https://gerrit.wikimedia.org/r/159283
[10:25:11] (PS1) Yuvipanda: labmon: Use a common prefix for betalabs icinga checks [puppet] - https://gerrit.wikimedia.org/r/159693
[10:25:13] (PS1) Yuvipanda: labmon: Add low space check for / on betalabs [puppet] - https://gerrit.wikimedia.org/r/159694 (https://bugzilla.wikimedia.org/70141)
[10:26:24] ori: btw, I added some python code to check_graphite in https://gerrit.wikimedia.org/r/#/c/159473/ (since merged). Critique / CR welcome
[10:26:29] (whenever you have the time)
[10:27:01] you should really sleep as well, yea
[10:32:24] (CR) Alexandros Kosiaris: [C: 2] Purge the amanda-server packages/configurations [puppet] - https://gerrit.wikimedia.org/r/159283 (owner: Alexandros Kosiaris)
[10:34:46] godog: I guess that's a mark approval thing (and he can give access too)
[10:35:09] sorry, I was off in a 'clean up tridge' window... it's been real exciting. for bad values of exciting
[10:37:24] apergos: amanda directory is about to go away. I saw you did some great job btw on /data on tridge
[10:38:41] I'm still working on it :-D
[10:39:12] just found an old 'here's the new X password, please delete when you're done' notice to someone... from 6 years ago. I bet they deleted it after a few hours and
[10:39:19] unfortunately, after the rsync :-D
[10:39:36] apergos: ack, thanks! anyone else I could poke?
I'm sure mark is busy already :|
[10:40:20] godog, for approval, I'm not sure, I think akosiaris also may be able to give net device access
[10:40:20] and surely para void
[10:40:20] (who is formally not here but still)
[10:40:35] context ?
[10:40:58] I only know what's on the ticket https://rt.wikimedia.org/Ticket/Display.html?id=8253 which ain't much
[10:41:12] akosiaris: network device access for me, https://rt.wikimedia.org/Ticket/Display.html?id=8253
[10:41:20] yeah I can
[10:41:22] I'll do it
[10:41:33] it's definitely well past 3 days and yer ops so...
[10:41:45] thanks, a kosiaris
[10:41:51] thanks! showing seniority already akosiaris
[10:42:07] lol
[10:42:17] <_joe_> godog: I guess how many white hair he got with that
[10:42:32] if that is a degree of seniority I am doomed...
[10:42:55] I am not getting white hair.. I am actually getting no hair at all...
[10:43:00] _joe_: worst case I can send some, my beard is turning white/grey
[10:43:06] <_joe_> I just discovered my first completely white hair in my beard
[10:43:48] akosiaris: ah the transparent hair! same boat
[10:44:01] * YuviPanda probably will never get white hair
[10:44:09] at the rate I'm losing hair, that is
[10:44:38] i'm making a command decision and tossing these stafford backups from 2011. no one has ever asked for them and why would they
[10:44:41] (on tridge)
[10:45:01] YuviPanda: same boat man...
[10:45:06] apergos: please do
[10:45:20] heh
[10:45:35] * YuviPanda should cut it all off at some point, when he still has the choice
[10:46:09] isidore from 2010 also going, we have new incarnations of everything that's in there
[10:48:12] killing project2... 2011
[10:49:45] anybody been missing formey recently?
because if not, that dir's going as well
[10:50:04] ldap was the last thing to move off of there I think and I have never heard a complaint about it
[10:50:53] not ldap, more like ldap-client
[10:51:08] used to modify the ldap due to its /etc/ldap/ldap.conf IIRC
[10:51:31] but I have modified the modify-ldap* scripts and they no longer rely on it
[10:51:32] yes, the client
[10:51:40] oh that and having pam_ldap installed
[10:51:47] yikes that was ugly
[10:51:58] eww
[10:51:59] I was happy to drop that dependency
[10:52:01] (PS1) Filippo Giunchedi: swift: fix swift-labs-ring unbound variable [puppet] - https://gerrit.wikimedia.org/r/159699
[10:52:18] well perhaps you want to do the honors of deleting the vestiges on tridge (if not, I'll just hit 'enter' :-D)
[10:53:19] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: fix swift-labs-ring unbound variable [puppet] - https://gerrit.wikimedia.org/r/159699 (owner: Filippo Giunchedi)
[10:54:16] 5...
[10:54:17] 4...
[10:54:44] 3..2..1 (impatient) done :-P
[10:57:54] sud-thingie is going, 2008 copies of wp deployment dir plus cofs plus old mysql root passwords etc, not loving it
[10:59:03] who's touched otrs in the last year? I need to ask them about old data I guess
[10:59:07] (PS2) Giuseppe Lavagetto: pybal: serve the virtualhost with pybal lb files with a dedicated vhost [puppet] - https://gerrit.wikimedia.org/r/159495
[11:04:41] apergos: JeffGreen I think
[11:04:47] (I dunno if I got his name right)
[11:05:04] ah sweet, I'll check with him later then, thanks!
[11:06:02] :) [11:07:42] ragweed's been out of service for a couple years now, we don't need those bckups either [11:09:57] (03PS1) 10Yuvipanda: labmon: Add puppet freshness check for betalabs [puppet] - 10https://gerrit.wikimedia.org/r/159701 (https://bugzilla.wikimedia.org/70141) [11:17:11] RECOVERY - Disk space on dataset1001 is OK: DISK OK [11:19:51] (03PS1) 10Giuseppe Lavagetto: dns: add pybal-config CNAME [dns] - 10https://gerrit.wikimedia.org/r/159702 [11:22:17] folks were talking about wikitech earlier in this channel iirc [11:22:29] some sort of update? [11:23:24] apergos wikitech is now deployed with scap, etc, rather than by hand [11:24:05] great, which means I can toss the old copy from feb 2013 [11:31:11] (03PS1) 10Filippo Giunchedi: swift: remove ganglia stats via ganglia-logtailer [puppet] - 10https://gerrit.wikimedia.org/r/159705 [12:13:02] (03PS1) 10Yuvipanda: icinga: Simplify check_graphite series threshold messages [puppet] - 10https://gerrit.wikimedia.org/r/159708 [12:15:38] anyone around to merge a bunch of mostly trivial puppet patches? [12:22:09] * YuviPanda pokes Coren [12:22:14] wanna merge some monitoring patches? 
:) [12:22:36] nothing complex, just adding more monitors [12:22:39] (for betalabs) [12:24:09] (03PS36) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - 10https://gerrit.wikimedia.org/r/155753 [12:29:31] (03Abandoned) 10Giuseppe Lavagetto: dns: add pybal-config CNAME [dns] - 10https://gerrit.wikimedia.org/r/159702 (owner: 10Giuseppe Lavagetto) [12:32:35] (03PS1) 10Yuvipanda: labmon: Add basic monitoring for toollabs [puppet] - 10https://gerrit.wikimedia.org/r/159709 [12:38:49] (03PS1) 10Yuvipanda: icinga: Minor formatting fix for check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159711 [12:41:02] (03PS1) 10Giuseppe Lavagetto: dns: add pybal-config CNAME [dns] - 10https://gerrit.wikimedia.org/r/159712 [12:41:54] (03CR) 10Giuseppe Lavagetto: [C: 032] icinga: Minor formatting fix for check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159711 (owner: 10Yuvipanda) [12:42:31] _joe_: \o/ it has a ton of dependencies, tho. I should probably re-arrange that series [12:42:41] <_joe_> lol ok [12:42:51] _joe_: let me re-arrange the truly trivial ones [12:43:55] (03PS2) 10Yuvipanda: icinga: Simplify check_graphite series threshold messages [puppet] - 10https://gerrit.wikimedia.org/r/159708 [12:44:11] (03PS2) 10Yuvipanda: labmon: Add basic monitoring for toollabs [puppet] - 10https://gerrit.wikimedia.org/r/159709 [12:44:27] (03PS2) 10Yuvipanda: icinga: Minor formatting fix for check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159711 [12:44:57] _joe_: triival: https://gerrit.wikimedia.org/r/#/c/159693/ https://gerrit.wikimedia.org/r/#/c/159708/ and https://gerrit.wikimedia.org/r/#/c/159711/ [12:45:10] <_joe_> YuviPanda: gimme one min [12:45:14] _joe_: cool [12:48:13] (03CR) 10Giuseppe Lavagetto: [C: 032] icinga: Minor formatting fix for check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159711 (owner: 10Yuvipanda) [12:49:05] (03PS3) 10Giuseppe Lavagetto: icinga: Simplify check_graphite series threshold 
messages [puppet] - 10https://gerrit.wikimedia.org/r/159708 (owner: 10Yuvipanda) [12:49:15] (03CR) 10Giuseppe Lavagetto: [C: 032] icinga: Simplify check_graphite series threshold messages [puppet] - 10https://gerrit.wikimedia.org/r/159708 (owner: 10Yuvipanda) [12:49:43] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Epic puppet fail [12:55:49] (03PS1) 10Yuvipanda: icinga: Clarify boolean check for _total key [puppet] - 10https://gerrit.wikimedia.org/r/159714 [12:57:37] (03CR) 10Giuseppe Lavagetto: [C: 032] dns: add pybal-config CNAME [dns] - 10https://gerrit.wikimedia.org/r/159712 (owner: 10Giuseppe Lavagetto) [12:59:20] (03PS1) 10Yuvipanda: icinga: Consistently use single quotes in check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159715 [13:02:40] (03PS1) 10Alexandros Kosiaris: Ensure cron absent for misc::nfs-server::home::backup [puppet] - 10https://gerrit.wikimedia.org/r/159716 [13:04:03] !log uploaded php5_5.3.10-1ubuntu3.14+wmf1 on apt.wikimedia.org [13:04:09] Logged the message, Master [13:10:04] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:14:08] (03PS3) 10Giuseppe Lavagetto: pybal: serve the virtualhost with pybal lb files with a dedicated vhost [puppet] - 10https://gerrit.wikimedia.org/r/159495 [13:14:46] (03CR) 10jenkins-bot: [V: 04-1] pybal: serve the virtualhost with pybal lb files with a dedicated vhost [puppet] - 10https://gerrit.wikimedia.org/r/159495 (owner: 10Giuseppe Lavagetto) [13:16:24] _joe_: just wondering, would it make sense to name that vhost a bit more generically, [13:16:31] so we could in theory put similar stuff on it as well? [13:17:46] <_joe_> mark: I wanted to avoid that [13:17:59] <_joe_> but also, I'm _terrible_ with naming things [13:18:46] why avoid it? [13:19:08] i don't really have a use case in mind right now, but there could be one [13:19:20] or for example, if we consolidate dsh/app server lists with pybal e.g. 
with json [13:19:26] then it may make sense to put that there as well [13:20:06] <_joe_> using the same cname, ok [13:20:31] importantstuffs.wikimedia.org [13:22:04] <_joe_> Reedy: bagofstuff [13:22:17] we already use bagofstuff in mediawiki ;) [13:22:25] <_joe_> mark: so, I agree, now you must find a name :P [13:22:45] <_joe_> Reedy: I know [13:22:45] https://en.wikipedia.org/wiki/Special:Random [13:22:46] mastah-copeh.wikimedia.org [13:23:05] <_joe_> (it's eqiad.wmnet) [13:23:15] why eqiad? [13:23:19] why wouldn't it move around? :P [13:24:19] hm, we could also do the per site aliases [13:24:23] like we do (did?) with puppet [13:24:39] <_joe_> yes [13:24:42] in theory it would be good to have a replica in each main dc [13:24:44] eqiad/codfw [13:24:47] sorta like we distribute dns [13:24:51] <_joe_> I was about to suggest the same [13:24:53] so if one goes down, you can also edit from the other [13:24:55] <_joe_> mark: etcd :) [13:25:03] yeah [13:25:13] <_joe_> ok so, you still owe me a name :P [13:25:26] i just gave you one ;-p [13:25:47] trying to establish i'm even worse with names I guess [13:26:41] <_joe_> alex already declared he's naming impaired as well [13:27:19] We should start an RfC on wikitech [13:27:20] That'll help [13:28:07] <_joe_> Reedy: I'm sure it would [13:28:41] Paint it green. [13:30:05] <_joe_> I'll go with config-master [13:30:15] ok [13:36:05] * YuviPanda starts an RfC to rename Reedy [13:36:32] (03CR) 10Andrew Bogott: [C: 032] labmon: Add basic monitoring for toollabs [puppet] - 10https://gerrit.wikimedia.org/r/159709 (owner: 10Yuvipanda) [13:36:59] andrewbogott_afk: w00t [13:37:58] andrewbogott: I added you to a bunch of other trivial patches [13:38:02] (03CR) 10Ottomata: "This is a username and password used by researchers to connect to MySQL research slaves. 
Historically, this username and password was not" [puppet] - 10https://gerrit.wikimedia.org/r/155452 (owner: 10Dzahn) [13:38:18] (03CR) 10Andrew Bogott: [C: 032] labmon: Use a common prefix for betalabs icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/159693 (owner: 10Yuvipanda) [13:39:45] (03CR) 10Andrew Bogott: [C: 032] icinga: Consistently use single quotes in check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159715 (owner: 10Yuvipanda) [13:40:11] (03CR) 10Andrew Bogott: [C: 032] labmon: Add low space check for / on betalabs [puppet] - 10https://gerrit.wikimedia.org/r/159694 (https://bugzilla.wikimedia.org/70141) (owner: 10Yuvipanda) [13:40:28] heya, this hasn't moved in a while [13:40:28] https://rt.wikimedia.org/Ticket/Display.html?id=8283 [13:40:39] i was about to just do it, but I realized that ellery doesn't yet have an ldap account [13:40:50] how should I tell him to get one? sign up on wikitech? [13:41:55] uh, yup, just tested that myself :p [13:43:36] apergos: dont' forget about this access RT: https://rt.wikimedia.org/Ticket/Display.html?id=8283 :) [13:43:41] (03CR) 10Andrew Bogott: [C: 032] labmon: Add puppet freshness check for betalabs [puppet] - 10https://gerrit.wikimedia.org/r/159701 (https://bugzilla.wikimedia.org/70141) (owner: 10Yuvipanda) [13:44:42] (03PS2) 10Andrew Bogott: icinga: Clarify boolean check for _total key [puppet] - 10https://gerrit.wikimedia.org/r/159714 (owner: 10Yuvipanda) [13:45:26] right! 
[13:46:06] (03PS1) 10Giuseppe Lavagetto: dns: add config-master as an internal service [dns] - 10https://gerrit.wikimedia.org/r/159719 [13:48:01] (03CR) 10Andrew Bogott: [C: 032] icinga: Clarify boolean check for _total key [puppet] - 10https://gerrit.wikimedia.org/r/159714 (owner: 10Yuvipanda) [13:48:11] (03CR) 10Giuseppe Lavagetto: [C: 032] dns: add config-master as an internal service [dns] - 10https://gerrit.wikimedia.org/r/159719 (owner: 10Giuseppe Lavagetto) [13:48:17] (03PS1) 10Filippo Giunchedi: puppet: hook for easy pushing of branches from anywhere [puppet] - 10https://gerrit.wikimedia.org/r/159720 [13:48:22] (03PS1) 10BBlack: add ns1-mexia name to DNS for puppet transition [dns] - 10https://gerrit.wikimedia.org/r/159721 [13:48:27] (03PS2) 10Andrew Bogott: icinga: Consistently use single quotes in check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159715 (owner: 10Yuvipanda) [13:49:40] (03CR) 10Andrew Bogott: [C: 032] icinga: Consistently use single quotes in check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159715 (owner: 10Yuvipanda) [13:50:12] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/159721 (owner: 10BBlack) [13:55:15] bd808|BUFFER andrewbogott Coren context on https://gerrit.wikimedia.org/r/#/c/159720/ is "I wanted to push/test puppet master changes to labs from my laptop" :) [13:55:37] YuviPanda: is that everything? 
[13:55:52] (03PS1) 10BBlack: authdns-lint should use strict modes [puppet] - 10https://gerrit.wikimedia.org/r/159722 [13:55:54] (03PS1) 10BBlack: Set ns1 hostnames to ns1-$real_hostname [puppet] - 10https://gerrit.wikimedia.org/r/159723 [13:56:14] (03CR) 10BBlack: [C: 032] authdns-lint should use strict modes [puppet] - 10https://gerrit.wikimedia.org/r/159722 (owner: 10BBlack) [13:56:23] +yuvipanda too [13:56:24] (03CR) 10BBlack: [V: 032] authdns-lint should use strict modes [puppet] - 10https://gerrit.wikimedia.org/r/159722 (owner: 10BBlack) [13:59:05] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/159721 (owner: 10BBlack) [14:00:45] (03CR) 10BBlack: [C: 032] add ns1-mexia name to DNS for puppet transition [dns] - 10https://gerrit.wikimedia.org/r/159721 (owner: 10BBlack) [14:01:08] (03CR) 10coren: [C: 032] "It's a little bit hacky; but since it's a noop unless actively installed as a hook I see no issue." [puppet] - 10https://gerrit.wikimedia.org/r/159720 (owner: 10Filippo Giunchedi) [14:02:48] <_joe_> brb [14:05:51] (03PS2) 10BBlack: Set ns1 hostnames to ns1-$real_hostname [puppet] - 10https://gerrit.wikimedia.org/r/159723 [14:05:53] (03PS1) 10BBlack: set zones_strict_data for gdnsd runtime config [puppet] - 10https://gerrit.wikimedia.org/r/159724 [14:07:45] (03CR) 10BBlack: [C: 032] Set ns1 hostnames to ns1-$real_hostname [puppet] - 10https://gerrit.wikimedia.org/r/159723 (owner: 10BBlack) [14:08:19] (03CR) 10BBlack: [C: 032] set zones_strict_data for gdnsd runtime config [puppet] - 10https://gerrit.wikimedia.org/r/159724 (owner: 10BBlack) [14:10:46] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[14:11:06] andrewbogott: yeah, for now [14:11:11] andrewbogott: need to add more things [14:12:15] andrewbogott: would be nice to force a puppet run on neon tho [14:12:25] ok, one sec [14:12:28] andrewbogott: I've to add monitoring for the grid and then monitors for CPU usage, perhaps [14:15:19] andrewbogott: bah, I made a boo boo. [14:15:44] YuviPanda: ? [14:15:49] puppet is still chugging on neon [14:16:08] (03PS1) 10Yuvipanda: icinga: Fix syntax error in check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159725 [14:16:09] andrewbogott: ^ [14:16:46] (03PS2) 10Andrew Bogott: icinga: Fix syntax error in check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159725 (owner: 10Yuvipanda) [14:17:49] (03CR) 10Andrew Bogott: [C: 032] icinga: Fix syntax error in check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159725 (owner: 10Yuvipanda) [14:18:23] (03PS1) 10BBlack: move physicalcorecount to base, use in gdnsd config for threads [puppet] - 10https://gerrit.wikimedia.org/r/159726 [14:18:34] (03PS1) 10Yuvipanda: icinga: Move production 5xx checks to the prod role [puppet] - 10https://gerrit.wikimedia.org/r/159727 [14:18:35] andrewbogott: also ^ [14:18:44] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[14:19:35] (03CR) 10BBlack: [C: 032] move physicalcorecount to base, use in gdnsd config for threads [puppet] - 10https://gerrit.wikimedia.org/r/159726 (owner: 10BBlack) [14:20:16] (03PS4) 10Giuseppe Lavagetto: pybal: serve the virtualhost with pybal lb files with a dedicated vhost [puppet] - 10https://gerrit.wikimedia.org/r/159495 [14:20:34] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 44 data above and 0 below the confidence bounds [14:20:46] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 44 data above and 0 below the confidence bounds [14:20:55] (03PS2) 10Andrew Bogott: icinga: Move production 5xx checks to the prod role [puppet] - 10https://gerrit.wikimedia.org/r/159727 (owner: 10Yuvipanda) [14:21:00] YuviPanda: you need a local rebase :) [14:21:06] andrewbogott: yeah, just did [14:22:37] (03PS1) 10Chmarkine: gerrit - raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/159729 (https://bugzilla.wikimedia.org/38516) [14:22:49] (03CR) 10Andrew Bogott: [C: 032] icinga: Move production 5xx checks to the prod role [puppet] - 10https://gerrit.wikimedia.org/r/159727 (owner: 10Yuvipanda) [14:23:07] <_joe_> !log upgrading php across the cluster: libapache2-mod-php5 php5-cli php-pear php5 php5-common php5-curl php5-dev php5-intl php5-mysql php5-xmlrpc [14:23:13] Logged the message, Master [14:24:25] (03PS1) 10Calak: Change autoconfirmed settings on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159730 (https://bugzilla.wikimedia.org/70128) [14:25:38] YuviPanda: neon looks happier now [14:25:59] andrewbogott: yea. 
Still hasn't picked up the new checks tho [14:36:36] (03PS1) 10Yuvipanda: icinga: Add scfe_de for toollabs notifications [puppet] - 10https://gerrit.wikimedia.org/r/159731 [14:39:03] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Puppet has 2 failures [14:39:28] (03PS1) 10Calak: Add 'autopatrol' and 'patrol' rights to "editor" group on plwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159732 (https://bugzilla.wikimedia.org/70459) [14:39:41] (03CR) 10Andrew Bogott: [C: 032] icinga: Add scfe_de for toollabs notifications [puppet] - 10https://gerrit.wikimedia.org/r/159731 (owner: 10Yuvipanda) [14:40:35] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: Puppet has 5 failures [14:40:37] (03PS2) 10Calak: Add 'autopatrol' and 'patrol' rights to "editor" group on plwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159732 (https://bugzilla.wikimedia.org/70459) [14:43:19] <_joe_> the puppet failure here ^^ is due to an apt conflict [14:44:27] <_joe_> !log php upgrade finished [14:44:32] Logged the message, Master [14:44:45] (03PS1) 10Yuvipanda: icinga: Spell scfc_de correctly [puppet] - 10https://gerrit.wikimedia.org/r/159733 [14:44:52] andrewbogott: ^ [14:45:22] (03CR) 10Andrew Bogott: [C: 032] icinga: Spell scfc_de correctly [puppet] - 10https://gerrit.wikimedia.org/r/159733 (owner: 10Yuvipanda) [14:47:03] bd808: whom to poke about puppet failures on betalabs? the alert has been critical for a while because of videoscaler-01 [14:47:45] (03PS5) 10Giuseppe Lavagetto: pybal: serve the virtualhost with pybal lb files with a dedicated vhost [puppet] - 10https://gerrit.wikimedia.org/r/159495 [14:48:47] YuviPanda: anything exciting? [14:48:59] Reedy: in terms of renaming you? [14:49:04] Roddy? 
[14:49:09] Well, that [14:49:13] But puppet failures too [14:49:15] ah [14:49:23] no, something about 'getaddrinfo - no host found' [14:49:24] for something [14:49:26] let me see again [14:51:29] Error: Could not get latest version: getaddrinfo: Name or service not known [14:51:29] Error: /Stage[main]/Mediawiki::Jobrunner/Package[jobrunner]/ensure: change from 5c927f9091f446452b9fd7bcb69614c7a7fe6eff to latest failed: Could not get latest version: getaddrinfo: Name or service not known [14:51:32] Reedy: ^ [14:52:52] YuviPanda: I can look. It may still be messed up from things jeremyb was working on yesterday. [14:52:58] bd808: cool [14:53:11] https://www.irccloud.com/pastebin/u4ZlQSCv [14:53:12] YuviPanda: Or you know that you can look too :P [14:53:16] Reedy: bd808 ^ [14:53:22] bd808: too much other stuff going on :) [14:53:32] (03PS6) 10Giuseppe Lavagetto: pybal: serve the virtualhost with pybal lb files with a dedicated vhost [puppet] - 10https://gerrit.wikimedia.org/r/159495 [14:53:34] bd808: I've gotten the alerts greg-g asked for, so doing alerts for toollabs now [14:54:03] bd808: plus it has the word 'scap' in the error :) [14:54:19] Oh I think I have seen that stupid error before and it's very misleading [14:54:23] ah [14:54:42] I think that's the error that trebuchet gives when the redis bindings are missing [14:54:55] oh?! [14:55:01] from... puppet?! [14:55:15] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: serve the virtualhost with pybal lb files with a dedicated vhost [puppet] - 10https://gerrit.wikimedia.org/r/159495 (owner: 10Giuseppe Lavagetto) [14:55:37] Yeah. The package{ provider=>trebuchet } that installs/updates scap [14:55:42] ah [14:55:44] I see [14:56:07] python-redis seems to be installed [14:56:22] hrm. 
maybe something different then [14:57:24] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [14:57:51] Completely misremembered, the python-redis missing error is "Function deploy.fetch is not [14:57:51] available" [14:57:55] RECOVERY - puppet last run on mw1101 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [14:58:24] ah [14:58:28] so this is something new and weirddd [14:58:37] <_joe_> fuck me. [14:58:49] I've seen it in the last couple of weeks though. Just can't remember where [14:59:08] * bd808 pokes more [14:59:21] * YuviPanda digs into OGE monitoring [14:59:45] wikitech interface says that host is rebooting ?! [15:00:05] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140911T1500). Please do the needful. [15:02:16] YuviPanda: All better now. MIsconfigured in wikitech interface [15:02:25] aaaah [15:02:25] (03PS1) 10Giuseppe Lavagetto: pybal-config: remove wrong comment [puppet] - 10https://gerrit.wikimedia.org/r/159734 [15:02:26] cool [15:02:39] ok, ^d [15:02:42] hiya, yt? [15:02:46] oh, nope you are not! [15:02:58] <^d> I am.. [15:03:01] oh you are! [15:03:01] bd808: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon shall be all green sooon [15:03:15] getting ready to do a perf test. [15:03:16] q [15:03:18] _joe_: Do I need to do anything to update the luasandbox stuff in beta? [15:03:27] can we turn off real searches to elastic1016 while we do this? [15:03:29] for a bit more control? [15:03:32] <_joe_> bd808: it would be a good idea [15:03:35] i don't want to disable 1016 from working [15:03:44] maybe just depool it? woudl that help? hm, i guess not? [15:03:46] <_joe_> it's tested in prod, so it can now move to beta :) [15:03:51] because other nodes would still route searches there? [15:03:55] <^d> Yeah. 
[15:04:14] how do we usually disable a node via es? [15:04:27] what will happen if we do that, and I specify _only_node? [15:04:47] _joe_: apt-get update && apt-get install hhvm-luasandbox ? [15:04:53] <^d> Well when we want a node depooled we'd blacklist it from shard allocation. [15:04:59] <^d> But that kind of defeats our test. [15:05:13] <_joe_> bd808: yep [15:05:29] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal-config: remove wrong comment [puppet] - 10https://gerrit.wikimedia.org/r/159734 (owner: 10Giuseppe Lavagetto) [15:06:20] _joe_: "hhvm-luasandbox is already the newest version." from deployment-mediawiki01. Do we have ensure=>latest in puppet still? [15:06:35] _joe_: Installed version is 2.0-6 [15:07:00] aye hm [15:07:03] <_joe_> bd808: in prod, not for sure [15:07:13] ^d, is there a way we can just keep it from routing searches there? [15:07:14] <_joe_> maybe or.i did that at 4 in the morning [15:07:33] <^d> ottomata: Offhand, I don't know of any way to. [15:07:43] hm ok, maybe we'll have to live with the noise then [15:07:43] _joe_: *shrug* I'll make sure they are all updated [15:07:54] could be weird for production though, if we start hammering one node [15:08:00] what's going to happen to live searches on that node? [15:08:23] <^d> If they fail, Cirrus will retry and hopefully end up at a different node :) [15:12:31] bd808: icinga all green now. \o/ ty [15:14:50] (03PS1) 10RobH: adding star.wmfusercontent.org.pem and GlobalSign.pem [puppet] - 10https://gerrit.wikimedia.org/r/159737 [15:14:53] (03PS1) 10Alexandros Kosiaris: WIP: module/role for url-downloader [puppet] - 10https://gerrit.wikimedia.org/r/159738 [15:15:08] ^d, marktraceur: Is anyone SWATting already? [15:15:40] (03PS1) 10Giuseppe Lavagetto: pybal: change configuration host address [puppet] - 10https://gerrit.wikimedia.org/r/159739 [15:15:44] ori: Ready for SWAT? 
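Editor's sketch of the two Elasticsearch mechanisms mentioned above: blacklisting a node from shard allocation (what ^d calls depooling) and pinning a search to one node with the `_only_node` preference. It only builds the request payloads and sends nothing. The function names and the node ID are hypothetical, and it assumes the Elasticsearch 1.x-era setting `cluster.routing.allocation.exclude._name` and the `_only_node:<node id>` preference syntax; the scripts actually used in the channel are not shown in this log.

```python
import json


def exclude_node_settings(node_name):
    """Body for PUT /_cluster/settings that moves shards off one node.

    Assumes the cluster-level allocation filter keyed on node name.
    """
    return {
        "transient": {
            "cluster.routing.allocation.exclude._name": node_name,
        }
    }


def only_node_search_params(node_id):
    """Query-string parameters that restrict a search to one node.

    node_id is the Elasticsearch node ID, not the hostname.
    """
    return {"preference": "_only_node:" + node_id}


if __name__ == "__main__":
    # elastic1016 is the host named in the log; the node ID is made up.
    print(json.dumps(exclude_node_settings("elastic1016")))
    print(only_node_search_params("HYPOTHETICAL_NODE_ID"))
```

As ^d points out, excluding the node from allocation would empty it of shards and so defeat a load test against it; the sketch shows both options only to make the trade-off concrete.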
[15:15:50] Nope [15:16:08] I thought ottomata was pinging ^d to do it [15:17:19] (03PS1) 10RobH: adding in star.wmfusercontent.org.pem and its CA intermediate cert [puppet] - 10https://gerrit.wikimedia.org/r/159740 [15:18:43] (03CR) 10RobH: [C: 032] adding in star.wmfusercontent.org.pem and its CA intermediate cert [puppet] - 10https://gerrit.wikimedia.org/r/159740 (owner: 10RobH) [15:20:51] (03PS1) 10Rush: update labs phab to phT172 (footer needed) [puppet] - 10https://gerrit.wikimedia.org/r/159744 [15:22:04] (03PS2) 10Giuseppe Lavagetto: varnish: add comment to avoid future pitfalls [puppet] - 10https://gerrit.wikimedia.org/r/159294 [15:22:59] (03CR) 10Giuseppe Lavagetto: "I honestly don't see a value, and some risks, in doing this." [puppet] - 10https://gerrit.wikimedia.org/r/159729 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [15:23:10] (03CR) 10Giuseppe Lavagetto: [C: 04-1] gerrit - raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/159729 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [15:24:18] !restarted icinga, manually removed some labsy things that were broken in config and temporarily disabled puppet :p [15:24:29] !log restarted icinga, manually removed some labsy things that were broken in config and temporarily disabled puppet :p [15:24:33] Logged the message, Master [15:25:41] Reedy: I think I saw evidence in backscroll that you'd been pinged about the layout change in the Parsoid extension, but just to be sure -- https://bugzilla.wikimedia.org/show_bug.cgi?id=70696 [15:26:17] I wonder if that'll break production scap for the new branch... [15:26:53] IIRC in the older branches the file is there, but it's a symlink [15:26:57] in newer, it changes to a physical file [15:27:36] bd808: I guess the point is production extension-list wants updating, not it adding to labs one [15:28:01] Reedy: Yeah. I just didn't want to do that in the middle of my night [15:28:06] Which... 
Looks like James_F|Away has done in https://gerrit.wikimedia.org/r/#/c/159652/ [15:28:34] (03CR) 10Reedy: [C: 032] Move Parsoid pointer in extension-list now the repo has been restructured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159652 (owner: 10Jforrester) [15:28:36] (03PS1) 10Yuvipanda: icinga: Fix naming of toollabs contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/159745 [15:28:43] (03Merged) 10jenkins-bot: Move Parsoid pointer in extension-list now the repo has been restructured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159652 (owner: 10Jforrester) [15:28:46] bblack: andrewbogott ^ [15:28:47] fix typo [15:28:50] well [15:28:52] naming mismatch [15:28:53] than typo [15:29:11] Reedy: There is a symlink at /a/common/php-1.24wmf20/extensions/Parsoid/Parsoid.php so I think things should "just work" once the right file is updated [15:29:14] (03CR) 10Andrew Bogott: [C: 032] icinga: Fix naming of toollabs contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/159745 (owner: 10Yuvipanda) [15:29:21] andrewbogott: ty [15:29:23] Right [15:29:43] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [15:30:03] bblack: do those labsy things need to be put back in manually? [15:30:09] hmm, or just re-running puppet should be fine [15:30:12] Reedy: \o/ l10n cache is building now in beta [15:30:13] yeah [15:30:26] I'm not sure about the other one though, the dupe definition for labstore1003 [15:31:11] bblack: me neither, and I don't see a similarly named file in our puppet repo [15:31:25] (03CR) 10Reedy: [C: 04-1] Follow-up I51abd7c: Enable Commons use for wikitech (labswiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159667 (owner: 10Jforrester) [15:32:30] will try puppet [15:32:35] bd808 or Reedy, can you explain to me about that? Is there a genuine problem that that patch is meant to fix?
[15:32:47] <^d> ottomata: Ok back [15:33:04] andrewbogott: James_F|Away is trying to make Wikitech be able to use stuff from commons [15:33:28] ok… but local media upload is working fine, right? (seems to be, to me) [15:33:30] bd808: You know, with Sean having changed the passwords to the same as production... Can wikitech actually access the production db slaves now? [15:33:34] andrewbogott: Missing files that come from commons. One example in template on https://wikitech.wikimedia.org/wiki/Release_Engineering [15:33:37] I've not tried it [15:33:51] But just enabling that whole file will try and make wikitech use swift for local files, which we don't want [15:34:04] I dunno. I thought there were vlan issues too [15:34:09] Hm, ok -- how did it work before? [15:34:13] ^d, cool, I'm making this script a little more relevant [15:34:34] we really just want to load up the node a lot, so I'm removing sleeps, percent options, making sure the script waits for everything to finish before exiting [15:34:36] andrewbogott: Probably $wgUseInstantCommons (I think that's the flag) [15:34:40] will have a version for you to check out shortly [15:34:43] andrewbogott: Was there $wgUseInstantCommons = true; in the old config? [15:34:48] * andrewbogott looks [15:34:57] If so, maybe we just add it back to wikitech.php [15:35:19] * bd808 is amazed how much he has learned about all this in ~13 months [15:35:26] yep!
$wgUseInstantCommons = true; [15:35:30] I'll make a patch [15:35:48] <^d> ottomata: Sounds good [15:37:08] (03PS1) 10RobH: removing a typo from netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/159747 [15:37:11] (03PS1) 10Andrew Bogott: $wgUseInstantCommons = true on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159748 [15:37:31] (03CR) 10Andrew Bogott: "Proposed alternative: https://gerrit.wikimedia.org/r/#/c/159748/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159667 (owner: 10Jforrester) [15:38:50] (03CR) 10RobH: [C: 032] removing a typo from netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/159747 (owner: 10RobH) [15:40:19] (03PS1) 10Andrew Bogott: Move labswiki to wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159749 [15:40:22] bd808, Reedy, are you still hopeful that we can move wikitech to wmf20 today? And, if so, are there preliminary steps besides ^ ? [15:40:33] mmmm [15:40:54] e.g. do we need to do a full scap to regenerate l10n now that that smw change is in? [15:40:55] I guess we need to backport/update submodules for SMW after bd808s patch [15:40:58] Then scap, yeah [15:41:09] backport/update? [15:41:50] Cherry-pick the change to the wmf20 branch, commit and update the core wmf20 branch submodule reference [15:42:37] andrewbogott: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1b:_extension_changes [15:42:45] There's no wmf20 branch for the extension ;) [15:42:52] it's just updating the submodule reference [15:43:02] oh because it's pinned? [15:43:15] yeah [15:43:29] even easier for andrewbogott then :) [15:44:25] Ja [15:44:51] (03PS1) 10Yuvipanda: labmon: Add monitoring for CPU usage on toollabs [puppet] - 10https://gerrit.wikimedia.org/r/159751 [15:44:56] Coren: scfc_de andrewbogott ^ [15:45:03] Reedy, bd808: thanks for tackling the parsoid extension restructure stuff! 
[15:45:04] !log icinga config is correct now, back to normal puppet updates [15:45:10] Logged the message, Master [15:45:44] bd808: Did you read the comment on your commit? lol [15:46:14] Reedy: Yeah. I decided to archive the email and get back to work :) [15:50:43] (03PS1) 10Rush: phab legal compliant footer [puppet] - 10https://gerrit.wikimedia.org/r/159754 [15:51:13] Reedy: if I remember my submodules properly, this should do it? https://gerrit.wikimedia.org/r/#/c/159753/ [15:51:16] (03PS2) 10Rush: update labs phab to phT172 (footer needed) [puppet] - 10https://gerrit.wikimedia.org/r/159744 [15:51:27] (03CR) 10Rush: [C: 032 V: 032] update labs phab to phT172 (footer needed) [puppet] - 10https://gerrit.wikimedia.org/r/159744 (owner: 10Rush) [15:51:42] (03PS2) 10Rush: phab legal compliant footer [puppet] - 10https://gerrit.wikimedia.org/r/159754 [15:52:17] andrewbogott: LGTM [15:52:39] Reedy: want to merge my two config changes before scapping? [15:52:46] I can [15:52:55] I'm about to go AFK for 50-60 minutes [15:52:56] (03CR) 10Andrew Bogott: [C: 032] labmon: Add monitoring for CPU usage on toollabs [puppet] - 10https://gerrit.wikimedia.org/r/159751 (owner: 10Yuvipanda) [15:52:58] (03CR) 10Rush: [C: 032 V: 032] phab legal compliant footer [puppet] - 10https://gerrit.wikimedia.org/r/159754 (owner: 10Rush) [15:53:04] So will do them after if that's ok [15:53:23] YuviPanda: I got your toollabs change in merge [15:53:25] sokay? [15:53:31] Reedy: my deploy window is in 10 [15:53:38] ya \o/ [15:53:58] andrewbogott: gah [15:54:37] (03CR) 10Reedy: [C: 031] $wgUseInstantCommons = true on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159748 (owner: 10Andrew Bogott) [15:55:03] (03CR) 10Reedy: "Needs https://gerrit.wikimedia.org/r/159753 deploying and scapping first..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/159749 (owner: 10Andrew Bogott) [15:55:14] so hopefully bd808 is not also taking a coffee break [15:55:31] Mines not a coffee break [15:55:33] :P [15:55:44] * bd808 goes for high tea [15:55:44] (03CR) 10JanZerebecki: [C: 031] "The value is that it may happen often that someone does not use gerrit for 7 days while still using it regularly, so after this the header" [puppet] - 10https://gerrit.wikimedia.org/r/159729 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [15:55:44] Got to go get a dressing changed on an ingrowing toenail I had cut out on monday [15:55:47] Ah, right, tea [15:55:57] Reedy: ouch! [15:56:08] They put poison into the nail bed and everything! [15:56:17] um… yay? [15:56:23] ok, ^d [15:56:24] https://gist.github.com/ottomata/fb03fd03267aa0eb1767 [15:56:28] mmmmm... great visuals [15:56:48] basically: read from lucene logs, send search with _only_node preference directly to the enwiki_content index [15:57:05] so we can do something like: [15:57:24] Regarding the move to wmf20… shouldn't scapping happen right /after/ that patch is merged? Rather than before? [15:57:27] time (head -n 1000 enwiki.lucene.log-20140910 | python elasticsearch_replay.py --jobs 24 ) [15:57:31] or whatever we think it best [15:58:32] andrewbogott: Which patch? The submodule update? [15:58:51] <^d> ottomata: lgtm. [15:59:00] bd808: the one that reedy said 'needs scapping first' https://gerrit.wikimedia.org/r/#/c/159749/ [15:59:36] andrewbogott: Ah. We usually keep wikiversions changes separate from scap [15:59:42] ottomata: that's nice! [16:00:05] andrewbogott: Respected human, time to deploy Wikitech (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140911T1600). Please do the needful. [16:00:06] bd808: ok…. is that true for https://gerrit.wikimedia.org/r/#/c/159748/ as well? [16:00:29] The scap will fix the l10n cache after the submodule update. 
Then switching the wikiversions is a simple change. [16:00:32] godog: (it is adapted from manybubble's own script that did a similar thing, but went through Cirrus. I cannot take credit :) ) [16:00:39] andrewbogott: That config update is safe to include in the scap [16:01:09] So I would pull the submodule and that config change, scap, pull the wikiversions change and sync-wikiversions [16:01:16] bd808: ok -- can you merge that and start the scap? I have a few other things I want to do in this window, I'll work on that while the scap does its work. [16:01:21] Ah, as you say :) sounds good! [16:01:30] Nik's was fancier too, as it attempted to more realistically simulate traffic by sleeping in between queries [16:01:37] this one just loads all the queries up and fires as fast as it can [16:01:54] ottomata: haha fair [16:02:11] andrewbogott: Do you want me to drive? I'm multitasking heavily at the moment. [16:02:11] bd808: oh, to clarify -- I can do all that. scapping is just a matter of running 'scap' on tin, yes? [16:02:34] (03CR) 10Andrew Bogott: [C: 032] $wgUseInstantCommons = true on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159748 (owner: 10Andrew Bogott) [16:02:42] andrewbogott: Awesome, Yeah just 'scap "some good description for SAL"' on tin [16:02:42] (03Merged) 10jenkins-bot: $wgUseInstantCommons = true on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159748 (owner: 10Andrew Bogott) [16:02:47] 'k [16:03:03] <^d> ottomata: Well Nik's was designed to simulate real load. We're looking for worst case here to measure io. [16:03:05] <^d> :) [16:03:08] yup [16:03:14] we want more than real load :) [16:05:48] ok, ^d, I'm going to try to replay 1 hour of enwiki traffic on 1016 node using 32 workers and time it, as our first test :) [16:06:17] <^d> Fire at will commander! [16:08:16] bd808: sorry, I can't figure out how to pull the updated SMW. 
I did a rebase origin/master in /a/common, then a 'submodule update' in /a/common/php-1.24wmf20 [16:08:21] which, unsettlingly, updated a different extension [16:08:36] ZeroBanner and ZeroPortal [16:11:06] great, its going [16:11:12] <^d> Heh, poolcounter.log has ~24k entries in the last 24h. Almost 23k of them are lsearchd. [16:13:50] andrewbogott: You need to fetch in /a/common/php-1.24wmf20 [16:14:00] bd808: yeah, just figured that out [16:14:09] seems right now [16:14:38] * andrewbogott crosses fingers and scaps [16:15:22] !log andrew Started scap: Preparing to move wikitech to 1.24wmf20 [16:15:27] Logged the message, Master [16:16:04] mw1120 is throwing the "Base lambda function for closure not found" error again for wikidata. apache-graceful fixed that on multiple hosts yesterday. [16:17:30] <^d> ottomata: Normal search traffic appears fine still. [16:19:05] !log Restarted logstash on logstash1001. Log empty and events not being stored in elasticsearch [16:19:11] Logged the message, Master [16:21:00] <^d> bd808: We should spend some time getting logstash ES more stable :\ [16:21:08] Was it _joe_ or godog that helped setup monitoring for the APC pool free size? We need to check on that and see if we can see what's up with all the wikidata fatals that seem to be fixed by restarting apache [16:21:30] bd808: I believe it was _joe_ [16:21:32] ^d: It's the ancient version of logstash and our crappy input pipeline more than ES [16:22:07] !log andrew Finished scap: Preparing to move wikitech to 1.24wmf20 (duration: 06m 45s) [16:22:13] Logged the message, Master [16:22:21] That was fast [16:22:29] * bd808 is suspicious [16:22:54] lots of this: 16:22:06 sudo -u mwdeploy /usr/bin/rsync -l tin.eqiad.wmnet::common/wikiversions*.{json,cdb} /srv/mediawiki on mw1022 returned [255]: Permission denied (publickey). [16:23:46] andrewbogott: same host or multiple hosts? [16:23:50] I forwarded my key and ran as myself... 
[16:23:53] ok ^d, what was one of the other nodes that had this same shard on it? [16:23:58] with node it...? [16:24:02] <^d> Lemme look again. [16:24:02] If I had to guess, I would say every host [16:24:15] oh, nm, i got it ^d [16:24:18] from convo yesterday [16:24:21] 12 or 05 [16:24:28] <^d> Yeah that sounds right [16:24:33] <^d> Just need ID for one of them. [16:24:35] File is owned correctly on mw1022 [16:25:05] andrewbogott: What is your shell username? [16:25:13] 'andrew' [16:25:23] I feel like I've scapped before and didn't have this issue [16:25:29] hmm.. you are in the wikidev group [16:25:30] got it [16:25:35] running against 1012 now [16:26:03] running out for lunch, back in a bit [16:26:18] "29 hosts had sync_wikiversions errors" [16:26:29] "229 hosts had sync_wikiversions errors" [16:26:48] all of the remote actions failed :( [16:26:52] I definitely cannot ssh mw1121 [16:26:53] for example [16:27:22] maybe I'm using the wrong key somehow [16:27:32] I can ssh in. Do you want me to run the scap or do you want to fix your key issue? [16:27:42] I'd vote for you fixing your key [16:28:13] I would too, except I don't want to run into the next window [16:28:20] Do you mind starting the scap? I'll sort out my keys in the meantime [16:28:49] andrewbogott: Yup. on it [16:29:07] thx [16:29:24] !log bd808 Started scap: Preparing to move wikitech to 1.24wmf20 (second try) [16:29:30] Logged the message, Master [16:30:17] (03PS1) 10Yuvipanda: graphite: Don't realm branch in graphite role [puppet] - 10https://gerrit.wikimedia.org/r/159759 [16:31:44] (from #wikimedia-tech) Hi folks! With whom I should talk about migrating a email (educacao@wikimedia.org) to the otrs? [16:31:58] :) [16:32:09] Can an root do a graceful restart of apache on mw1120 please? Lots and lots of wikidata log spam from there. [16:32:25] 542 fatals in the last 5 minutes [16:32:41] bd808|deploy: done [16:32:51] thx andrewbogott [16:33:30] <_joe_> bd808|deploy: fatals about what? 
[16:33:40] !log andrewbogott did apache graceful on mw1120 to stop wikidata APC logspam [16:33:45] Logged the message, Master [16:33:59] _joe_: "PHP Fatal error: Base lambda function for closure not found in /usr/local/apache/common-local/php-1.24wmf20/extensions/Wikidata/extensions/Wikibase/lib/config/WikibaseLib.default.php " [16:34:19] It started yesterday and restarting apache seems to make it go away. [16:34:19] <_joe_> ok so apc I guess [16:34:38] <_joe_> it started yesterday after we restarted apache [16:34:44] It seems to be apc getting full, but I'm not seeing the normal error messages from that [16:34:49] <_joe_> we didn't restart it in about one month [16:35:07] <_joe_> on almost any appserver I guess [16:35:40] We aren't getting any "Unable to allocate memory for pool" messages to logstash which is the usual sign of APC sadness [16:36:00] <_joe_> yeah I know too well [16:36:01] <_joe_> :/ [16:36:21] Krenair: if you haven't gotten an answer yet: send an email to OTRS I guess? (how meta) [16:36:42] haha [16:36:59] (03Abandoned) 10Jforrester: Follow-up I51abd7c: Enable Commons use for wikitech (labswiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159667 (owner: 10Jforrester) [16:36:59] <_joe_> ok, I'm out for today [16:37:02] greg-g, It's not the OTRS part I'm wondering about [16:37:40] bd808|deploy: I sorted out my key issues, just me being absent-minded [16:37:40] Krenair: there's a conversation that needs to happen between OTRS people and Ops (RT) [16:38:08] Krenair: how/when that conversation happens is the part I don't know, I assume OTRS people know how to have it /me shrugs [16:38:40] So, I need to request the queue and the otrs admins can handle the release of educacao@wiki...? [16:38:42] andrewbogott: sweet. scap is moving along. 
You can do the rest when it finishes [16:39:23] lestaty: not exactly, the OTRS admins will ask Ops to do the needed email config parts [16:39:40] hum [16:39:50] * bd808|deploy really wishes someone would fix "PHP Warning: Recursion detected in RequestContext::getLanguage" (bug 54193) [16:40:07] fine, thanks a lot greg-g and Krenair :) [16:40:32] lestaty: this is me guessing, see what the OTRS admins say :) (I've never seen/done this change before) [16:41:01] * bd808|deploy shakes other fist at fenari [16:41:19] Obviously ops need to be involved to get rid of the old setup for that email, but do normal queue creations need it? I thought that was handled entirely by the otrs admins [16:41:29] haha no problem,in any way, if otrs admins dont do that, I will back here :P [16:42:28] andrewbogott, _joe_: mw1039 has ~30 of the wikidata APC errors in the last 5 mins. The rest of the cluster seems to be happy. [16:43:35] andrewbogott: scap got to the build-l10n-cdbs stage and it's taking a bit which is a good sign for the wikitech migration to actually work this time. :) [16:43:42] <^d> bd808|deploy: It's hard to fix. [16:43:50] <^d> I think several of us have gone down that rabbit hole and given up. [16:44:04] ^d: Oh I know. I looked and it made my head hurt too [16:45:31] Recursive lazy objects being promoted to concrete and then ouch! [16:46:12] bd808|deploy: so, graceful on mw1039 also? [16:46:19] andrewbogott: Yes pelase [16:46:23] Hi [16:46:23] *please [16:46:32] * bd808|deploy waves to Reedy [16:46:41] scap is almost done for wikitech stuff [16:46:45] yay [16:46:53] !log apache graceful on mw1039 [16:46:58] Logged the message, Master [16:47:22] Reedy: We had 2 more apaches with the APC/wikidata problem but andrewbogott has restarted them [16:47:32] This is getting silly now :( [16:47:44] Yeah. 
Just whack-a-mole [16:48:16] I wonder if the answer is just more space assigned to apc and/or less strict cache eviction [16:48:20] I wonder if we missed something in the patch where we upped the APC limit [16:48:29] Though, for something used so much, why the hell would it be evicted [16:49:00] cap-rebuild-cdbs: 99% (ok: 228; fail: 0; left: 1) [16:49:17] That makes me happy every time I see it now (not 100% anymore) [16:49:28] :) [16:49:37] less ocd annoyance [16:50:08] That should be a primary concern for any dev facing UI [16:51:11] andrewbogott: You can sync-common on wikitech whenever you'd like [16:53:08] (03CR) 10BryanDavis: [C: 031] "Not tested but LGTM. I made that realm branching mess and am happy to see it die." [puppet] - 10https://gerrit.wikimedia.org/r/159759 (owner: 10Yuvipanda) [16:53:49] !log bd808 Finished scap: Preparing to move wikitech to 1.24wmf20 (second try) (duration: 24m 25s) [16:53:56] Logged the message, Master [16:54:24] "scap-rebuild-cdbs (duration: 12m 24s)" -- I really wish that bit was faster [16:54:44] ok andrewbogott it's all yours again [16:55:28] (03CR) 10Andrew Bogott: [C: 032] Move labswiki to wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159749 (owner: 10Andrew Bogott) [16:55:44] (03Merged) 10jenkins-bot: Move labswiki to wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159749 (owner: 10Andrew Bogott) [16:56:33] !log andrew rebuilt wikiversions.cdb and synchronized wikiversions files: (no message) [16:56:40] Logged the message, Master [16:57:00] !log sync-common on virt1000 -- with any luck this will upgrade us to wmf20 [16:57:05] Logged the message, Master [16:58:18] grrrrrrr [16:58:35] damn it [16:58:39] same problem? [16:58:51] exactly the same. 
[Thu Sep 11 16:58:40 2014] [error] [client 50.93.251.174] "" is not a valid magic word for "smwdoc" [16:59:29] (03PS1) 10Andrew Bogott: Revert "Move labswiki to wmf20" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159763 [16:59:43] (03CR) 10Andrew Bogott: [C: 032] Revert "Move labswiki to wmf20" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159763 (owner: 10Andrew Bogott) [16:59:47] (03Merged) 10jenkins-bot: Revert "Move labswiki to wmf20" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159763 (owner: 10Andrew Bogott) [16:59:48] My patch isn't in 1.24wmf20 [17:00:02] ? [17:00:20] !log andrew rebuilt wikiversions.cdb and synchronized wikiversions files: (no message) [17:00:51] andrewbogott: The head there for SMW is a625bcf21fb1200c8d849c334ad48df356d2c3ef [17:00:51] bd808: didn't https://gerrit.wikimedia.org/r/#/c/159753/ do that? [17:01:09] My fault... I guess [17:01:28] I set jenkins to merge it and such, but wasn't around when it finished etc to pull onto tin [17:01:37] The current head there is a625bcf21fb1200c8d849c334ad48df356d2c3ef and not 8bc21b2a590c6b8742b9291619a8c7d9e80fbf43 [17:02:17] But andrewbogott did pull I thought [17:02:24] I thought I did too. [17:02:39] pull and submodule update? [17:02:48] I did, but... [17:02:52] I just did again and got the patch this time :( [17:03:04] dammit [17:03:09] Well, you have a train deploy now, right? [17:03:14] in an hour ish [17:03:18] I've not prepped for that, but yeah [17:03:21] Ah, so time for me to scap and try again? [17:03:22] I need to run scap for that [17:03:31] It's the same scap, right? [17:03:50] yup [17:04:30] The train deploy will fix everything I think, [17:04:45] Then andrewbogott will just need to sync-common to apply on wikitech [17:05:07] Because Reedy will be bumping everything to wmf20 [17:05:14] and then group0 to wmf21 [17:05:20] Oh, that will include labswiki? 
[17:05:34] yup, unless he remembers not to [17:05:38] * Reedy grins [17:06:26] ok, so… I'll stand back and just plan a new sync after a few hours. [17:06:37] Thanks, and also, dang. [17:06:50] is the submodule right on tin now? [17:06:57] looks right to me [17:07:00] * bd808 checks [17:07:17] Yeah head is 8bc21b2a590c6b8742b9291619a8c7d9e80fbf43 [17:07:31] "Register message files outside of enableSemantics()" [17:07:52] assuming of course that the patch does what we hope :P [17:08:44] (03PS2) 10Ori.livneh: Remove MW_DBLISTS config vars [puppet] - 10https://gerrit.wikimedia.org/r/159645 [17:08:50] (03CR) 10Ori.livneh: [C: 032 V: 032] Remove MW_DBLISTS config vars [puppet] - 10https://gerrit.wikimedia.org/r/159645 (owner: 10Ori.livneh) [17:11:01] andrewbogott: You can pat yourself on the back that the wgUseInstantCommons flag worked. That's something. [17:11:18] * andrewbogott sighs, lunches [17:11:31] We can now wikilove using commons images [17:11:32] xD [17:12:07] Oh yeah. I have beer on my talk page again [17:16:35] (03PS2) 10Ori.livneh: Get rid of MULTIVER_CDB_DIR_{APACHE,HOME} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159637 [17:16:47] (03CR) 10Ori.livneh: [C: 032] Get rid of MULTIVER_CDB_DIR_{APACHE,HOME} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159637 (owner: 10Ori.livneh) [17:17:32] (03Merged) 10jenkins-bot: Get rid of MULTIVER_CDB_DIR_{APACHE,HOME} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159637 (owner: 10Ori.livneh) [17:17:34] (03PS2) 10Ori.livneh: Replace remaining references to /u/l/a/common [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159650 [17:17:45] Reedy: +1 for https://gerrit.wikimedia.org/r/#/c/159650/ ? 
[17:17:59] !log ori updated /a/common to {{Gerrit|I37b0a8338}}: Get rid of MULTIVER_CDB_DIR_{APACHE,HOME} [17:18:03] Logged the message, Master [17:18:39] (03PS1) 10Yuvipanda: androidsdk: Specify required libraries for trusty as well [puppet] - 10https://gerrit.wikimedia.org/r/159767 [17:18:47] scfc_de: ^ should fix most of the errors on tools-exec-12 [17:19:20] YuviPanda: every time you eliminate realm-branching from operations/puppet an angel gets its wings [17:19:25] (03CR) 10Reedy: [C: 031] Replace remaining references to /u/l/a/common [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159650 (owner: 10Ori.livneh) [17:19:26] ori: :D [17:19:41] ori: there's no realm branching now, but there are prod and labs roles [17:19:44] (03CR) 10Ori.livneh: [C: 032] Replace remaining references to /u/l/a/common [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159650 (owner: 10Ori.livneh) [17:19:49] (03Merged) 10jenkins-bot: Replace remaining references to /u/l/a/common [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159650 (owner: 10Ori.livneh) [17:19:51] YuviPanda: yeah, i think that's fine [17:19:56] yeah, me too [17:20:03] !log ori updated /a/common to {{Gerrit|I0bda3deab}}: Replace remaining references to /u/l/a/common [17:20:07] Logged the message, Master [17:20:15] ori: I also updated check_graphite to deal with multiple metrics properly :) [17:21:47] nice! [17:23:51] (03PS1) 10Gage: don't filter out stack traces, don't break hdfs [puppet/cdh] - 10https://gerrit.wikimedia.org/r/159768 [17:26:04] (03CR) 10Tim Landscheidt: androidsdk: Specify required libraries for trusty as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/159767 (owner: 10Yuvipanda) [17:28:03] YuviPanda: do you still need a contact added in icinga? 
[17:32:43] (03CR) 10Yuvipanda: androidsdk: Specify required libraries for trusty as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/159767 (owner: 10Yuvipanda) [17:33:09] (03PS2) 10Yuvipanda: androidsdk: Specify required libraries for trusty as well [puppet] - 10https://gerrit.wikimedia.org/r/159767 [17:33:29] mutante: no, andrewbogott took care of it [17:33:37] mutante: but you can help by merging https://gerrit.wikimedia.org/r/#/c/159759/1 and https://gerrit.wikimedia.org/r/#/c/159767/ [17:35:50] (03PS1) 10Calak: Add "autopatrolled" and "patrol" user groups and enable RC patrol for lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159772 (https://bugzilla.wikimedia.org/70441) [17:37:08] (03PS1) 10Ori.livneh: Update Wikitech apache config for /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159774 [17:37:14] ^ andrewbogott [17:38:06] (03CR) 10Andrew Bogott: [C: 032] Update Wikitech apache config for /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159774 (owner: 10Ori.livneh) [17:38:14] thanks! [17:41:57] (03PS1) 10RobH: neglected to add dhcp entries for db2014 & db2015 [puppet] - 10https://gerrit.wikimedia.org/r/159776 [17:42:42] (03CR) 10Dzahn: "fwiw: the popular Qualys SSL Labs check at https://www.ssllabs.com/ssltest/ will down vote a little for having STS enabled but "too short"" [puppet] - 10https://gerrit.wikimedia.org/r/159729 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [17:43:06] (03CR) 10RobH: [C: 032] neglected to add dhcp entries for db2014 & db2015 [puppet] - 10https://gerrit.wikimedia.org/r/159776 (owner: 10RobH) [17:43:47] andrewbogott: wanna merge https://gerrit.wikimedia.org/r/#/c/159759/1 and https://gerrit.wikimedia.org/r/#/c/159767/? 
[17:44:29] (03CR) 10Andrew Bogott: [C: 032] graphite: Don't realm branch in graphite role [puppet] - 10https://gerrit.wikimedia.org/r/159759 (owner: 10Yuvipanda) [17:44:39] (03CR) 10Andrew Bogott: [C: 032] androidsdk: Specify required libraries for trusty as well [puppet] - 10https://gerrit.wikimedia.org/r/159767 (owner: 10Yuvipanda) [17:46:48] (03CR) 10Dzahn: "the 31536000 value (1 year) originally comes from the example on owasp (https://www.owasp.org/index.php/HTTP_Strict_Transport_Security) an" [puppet] - 10https://gerrit.wikimedia.org/r/159729 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [17:47:25] (03PS1) 10Aude: Bump cache epoch for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159778 [17:48:04] (03CR) 10Ottomata: [C: 031] don't filter out stack traces, don't break hdfs [puppet/cdh] - 10https://gerrit.wikimedia.org/r/159768 (owner: 10Gage) [17:49:37] ^d [17:49:37] https://gist.githubusercontent.com/ottomata/fb03fd03267aa0eb1767/raw/f964923e992f04b05638ab15110092eb42fec535/results [17:49:38] (03PS1) 10RobH: more dhcp updates for db2014-2015 [puppet] - 10https://gerrit.wikimedia.org/r/159780 [17:49:50] not much difference...think I should do more? maybe 6 hours worth? [17:50:02] (03CR) 10RobH: [C: 032] more dhcp updates for db2014-2015 [puppet] - 10https://gerrit.wikimedia.org/r/159780 (owner: 10RobH) [17:52:13] (03PS1) 10Ori.livneh: mediawiki: update remaining path references for /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159782 [17:53:36] <^d> ottomata: Hmm. Could do that. [17:53:39] (03CR) 10GWicke: "Thanks for fixing this up!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159652 (owner: 10Jforrester) [17:53:49] <^d> Again, we might not actually be seeing much difference because there's little difference for us. 
[17:54:00] <^d> (Although I think we've proven the new disks are no *worse*) [17:54:28] yeah, i expect so [17:54:33] but, still, 14minutes isn't really that long [17:54:36] so, um, am doing it [17:54:38] hitting 1012 right now [17:54:41] <^d> kk. [17:54:56] (03PS2) 10Gage: don't filter out stack traces, don't break hdfs [puppet/cdh] - 10https://gerrit.wikimedia.org/r/159768 [18:00:05] Reedy, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140911T1800). Please do the needful. [18:02:50] !log raised logging on Elasticsearch cluster temporarily to get more information about merging - a process super important to keeping the index up to date in "real time" [18:02:55] Logged the message, Master [18:03:43] (03CR) 10Ori.livneh: [C: 032] mediawiki: update remaining path references for /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159782 (owner: 10Ori.livneh) [18:06:15] (03PS1) 10Reedy: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159786 [18:06:17] (03PS1) 10Reedy: testwiki to 1.24wmf21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159787 [18:06:19] (03PS1) 10Reedy: wikipedias to 1.24wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159788 [18:06:21] (03PS1) 10Reedy: group0 to 1.24wmf21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159789 [18:06:48] (03CR) 10Reedy: [C: 032] Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159786 (owner: 10Reedy) [18:06:53] (03Merged) 10jenkins-bot: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159786 (owner: 10Reedy) [18:06:59] (03CR) 10Reedy: [C: 032] testwiki to 1.24wmf21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159787 (owner: 10Reedy) [18:07:04] (03Merged) 10jenkins-bot: testwiki to 1.24wmf21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159787 (owner: 10Reedy) [18:08:35] !log reedy Started scap: testwiki to 1.24wmf21 and build l10n cache 
[18:08:40] Logged the message, Master [18:11:58] (03PS1) 10Calak: Add import source for eswikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159790 (https://bugzilla.wikimedia.org/70414) [18:12:42] alright, test.wikidata will break ... [18:12:49] patch coming [18:13:09] lol [18:13:22] editing test.wikidata, that is [18:13:34] !log reedy scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="testwiki" --list-file="/a/common/wmf-config/extension-list" --output="/tmp/tmp.IH8przTNHs" ' returned non-zero exit status 1 (duration: 04m 59s) [18:13:40] Logged the message, Master [18:13:42] whee [18:13:51] soundsl ike what was happening on beta [18:14:09] need moar verbosity [18:14:14] Yeah that's exactly what beta said in the scap log. [18:14:33] I ran the mwscript command by had to see what really caused the problem [18:14:36] *by hand [18:14:50] (03PS1) 10Ori.livneh: Remove obsolete Icinga by_ssh_* checks [puppet] - 10https://gerrit.wikimedia.org/r/159793 [18:14:56] !log reedy Started scap: testwiki to 1.24wmf21 and build l10n cache [18:15:01] --verbose [18:15:17] won't help :( [18:15:37] really? :/ [18:15:48] I'm sure it did before with similar errors from extensions moving stuff [18:16:04] If it does, awesome [18:16:12] !log reedy scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="testwiki" --list-file="/a/common/wmf-config/extension-list" --output="/tmp/tmp.Nd45X2RONi" --verbose' returned non-zero exit status 1 (duration: 01m 18s) [18:16:20] lol, nope [18:16:28] Still looks like -- https://integration.wikimedia.org/ci/job/beta-scap-eqiad/21090/console [18:16:30] Ah, yeah [18:16:31] 18:16:12 Extension /a/common/php-1.24wmf21/extensions/WikimediaShopLink/WikimediaShopLink.php doesn't exist [18:16:31] 18:16:12 Some files are missing (see above). Giving up. [18:16:49] oh it did help [18:16:51] I stopped branching it [18:16:58] wasn't that removed? 
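A side note on the mergeMessageFileList failure above: the --verbose run shows the script giving up because one entry in the extension list pointed at a file that no longer exists in the new branch. A minimal sketch of a pre-flight check that could catch this before a scap, assuming the list file holds one path per line and may use a literal `$IP` prefix for the branch directory (both are assumptions, not confirmed by this log). `find_missing_entries` is a hypothetical helper, not part of scap or MediaWiki.

```python
import os


def find_missing_entries(list_file, branch_dir):
    """Return the entries of list_file that do not exist on disk.

    Assumptions (hypothetical, for illustration only): one path per
    line, blank lines and '#' comments are ignored, and a literal
    '$IP' prefix stands for branch_dir.
    """
    missing = []
    with open(list_file) as handle:
        for raw in handle:
            entry = raw.strip()
            if not entry or entry.startswith("#"):
                continue
            # Expand the placeholder once, at most, per entry.
            path = entry.replace("$IP", branch_dir, 1)
            if not os.path.exists(path):
                missing.append(path)
    return missing
```

Run against each active branch directory, an empty result would mean every listed entry point is present; anything else names the stale entries directly instead of leaving them to surface as a non-zero exit status.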
[18:17:02] (03PS1) 10Reedy: Remove WikimediaShopLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159794 [18:17:05] the extension [18:17:14] aude: Yeah, we were still branching it till earlier this week [18:17:26] ok [18:17:29] (03CR) 10Reedy: [C: 032] Remove WikimediaShopLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159794 (owner: 10Reedy) [18:17:33] (03Merged) 10jenkins-bot: Remove WikimediaShopLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159794 (owner: 10Reedy) [18:17:48] !log reedy Started scap: testwiki to 1.24wmf21 and build l10n cache take 3 [18:17:53] Logged the message, Master [18:18:18] now it's gonna rebuild all the l10n caches for all versions [18:18:20] lol [18:19:19] ouch [18:19:37] and for wmf15 which you are about to drop :( [18:19:59] 4x l10n will take.... a while [18:20:01] and wmf19 [18:20:10] (03PS1) 10Ori.livneh: Update docroot for noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/159795 [18:20:37] * bd808 points to email about rethinking l10n cache [18:20:46] it'll still be quicker than l10nupdate ;) [18:20:53] heh [18:21:31] ottomata: I don't think you are using the disk really [18:22:46] Reedy: can you please https://gerrit.wikimedia.org/r/159364 ? [18:22:53] bd808: OH [18:22:56] It won't be so bad [18:23:05] The extension repo was empty/dummy in most versions [18:23:13] so only wmf21 to build [18:23:14] wmf15 is done [18:23:19] \o/ [18:23:34] (03PS1) 10Ori.livneh: lucene: update path reference for /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159798 [18:25:01] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [18:27:01] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. 
[18:27:12] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [18:27:22] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. [18:27:23] bd808: ^ [18:27:37] bah [18:28:02] all 3 at once? [18:28:25] distributed systems are hard [18:28:31] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [18:28:36] fault-tolerance doubly so [18:29:13] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [18:29:29] <^d> ori: Best part is how ES fixes itself like that :p [18:29:29] network partition? [18:29:57] <^d> Also, we should use the new health check for logstash if we're not. [18:30:00] <^d> It's less bad. [18:31:28] `curl http://localhost:9200/_cluster/health` looks fine now. green and all nodes seen [18:31:32] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. [18:31:44] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [18:31:46] blerg [18:32:01] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [18:32:28] icinga-wm: stop telling lies [18:33:48] logstash1003 is the current master. It shows green. 
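Since each node can be queried on its own, one cross-check for the "current master" observation is to ask every node which master it sees and compare the answers. A minimal sketch, assuming the per-node answers have already been collected (for example from each node's `_cat/master` endpoint, which is an assumption about how one would gather them and is not shown in this log). `masters_agree` is a hypothetical helper.

```python
def masters_agree(reported):
    """Compare the master each node reports.

    reported maps node name -> master name as seen by that node,
    or None when the node did not answer. Returns (ok, message).
    """
    if not reported:
        return False, "no nodes queried"
    silent = sorted(node for node, master in reported.items() if master is None)
    if silent:
        return False, "no answer from: " + ", ".join(silent)
    answers = set(reported.values())
    if len(answers) == 1:
        return True, "all nodes report " + answers.pop()
    views = ", ".join("%s->%s" % (n, m) for n, m in sorted(reported.items()))
    return False, "split view: " + views
```

A split view would point toward a partition or a stale node; a unanimous answer alongside timed-out health checks would point back at the check itself or at an overloaded node.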
[18:36:07] <^d> Do all the nodes agree on master? [18:37:33] PROBLEM - Apache HTTP on mw1137 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50349 bytes in 0.099 second response time [18:37:42] PROBLEM - Apache HTTP on mw1103 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50349 bytes in 0.011 second response time [18:38:14] ^d: good question... [18:39:41] ^d: Yup. logstash1003 response from all 3 [18:39:49] <^d> Hmm [18:40:02] PROBLEM - Apache HTTP on mw1047 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50428 bytes in 0.018 second response time [18:40:32] Some index is flapping though. Keeps going from green to yellow [18:40:35] PROBLEM - Apache HTTP on mw1151 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50349 bytes in 0.015 second response time [18:40:37] um [18:40:42] Reedy, revert [18:41:37] gah [18:41:39] well, let's see the error [18:41:51] PROBLEM - Apache HTTP on mw1076 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50349 bytes in 0.021 second response time [18:41:52] Fatal error: Unable to open MULTIVER_CDB_DIR_APACHE/wikiversions.cdb [18:42:08] removing that shop link caused that? [18:42:24] There was another patch that removed that var [18:42:31] PROBLEM - Apache HTTP on mw1064 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50428 bytes in 0.006 second response time [18:42:34] couldn't [18:42:42] PROBLEM - Apache HTTP on mw1143 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50428 bytes in 0.032 second response time [18:42:46] /usr/local/apache/common-local/multiversion/CdbDBA.php uses it? [18:42:47] it's /usr/local/apache/common-local/php-1.24wmf19/includes/WebStart.php [18:43:01] Call to undefined function wfGetRusage() [18:43:04] that's my patch [18:43:41] PROBLEM - Apache HTTP on mw1146 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50428 bytes in 0.020 second response time [18:43:56] anyone handling this? 
[18:44:03] (03PS1) 10Aude: Revert "Get rid of MULTIVER_CDB_DIR_{APACHE,HOME}" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159803 [18:44:14] scap is still mid sync [18:44:19] gah [18:44:22] i'm looking [18:44:24] sync-common: 93% (ok: 215; fail: 0; left: 14) [18:44:40] Reedy: Wow. [18:45:10] sync-common: 98% (ok: 226; fail: 0; left: 3) [18:45:18] win 10 [18:45:21] !log ori Synchronized php-1.24wmf19/includes/profiler/Profiler.php: (no message) (duration: 00m 07s) [18:45:21] yeah hi [18:45:27] Logged the message, Master [18:45:40] apergos: see the note about vulnerability scanning? [18:45:53] that... should not have worked while scap was active [18:46:23] why not? [18:46:43] global sync lock in python code [18:46:46] !log ori Synchronized php-1.24wmf19/includes/WebStart.php: (no message) (duration: 00m 06s) [18:46:50] still getting Fatal error: Unable to open MULTIVER_CDB_DIR_APACHE/wikiversions.cdb [18:46:50] Logged the message, Master [18:47:22] Reedy: abort the scap [18:47:59] What's that gonna do when it's already synced everywhere? [18:48:02] 1 left [18:48:10] fenari :( [18:48:24] is it better now? 
[18:48:29] it's better on mw1103
[18:48:53] 1 failed and spammed a load of lines
[18:49:01] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.080 second response time
[18:49:24] ori: not yet
[18:49:48] also see Call to undefined function wfGetRusage still
[18:49:53] not as often as the other one
[18:50:04] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.101 second response time
[18:50:31] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.080 second response time
[18:50:32] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.212 second response time
[18:50:41] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.150 second response time
[18:50:53] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.179 second response time
[18:51:15] 18:51:06 2 apaches had sync errors
[18:51:33] !log graceful'd apache on mw1047, mw1151, mw1137, mw1146 and mw1076
[18:51:38] Logged the message, Master
[18:52:06] cajoel: now I did, good you poked me
[18:52:33] (it's late in the day here, though I'm working I'm not checking my emails much at this point)
[18:53:22] looks like the errors are going away now
[18:53:22] because I won't be here to poke you if things go wrong in a while, we should make sure someone will be here in case
[18:53:28] Reedy: They were mw1024 and mw1138 I think from the logs in fluorine
[18:53:35] Recursion detected in RequestContext::getLanguage is at the top
[18:53:41] actually, how long do you think the scan will take, cajoel? if it's short enough I might indeed be here the whole time
[18:53:56] i see mw1064 and mw1143
[18:54:14] * bd808 agrees with aude
[18:54:24] !log graceful'd all apaches
[18:54:28] Logged the message, Master
[18:54:42] bd808: there's a few of these: 1 Warning: Division by zero in /var/www/monitoring/apc_stats.php on line 61
[18:54:45] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.206 second response time
[18:54:48] bah, apc
[18:54:48] for a handful of lines
[18:54:57] yes, no actual bug
[18:54:59] all apc issue
[18:55:18] should be okay now, at least.
[18:56:36] Reedy: caused by restarts -- https://github.com/wikimedia/operations-puppet/blob/7debc10594b09bdf41c537de5506ee81d7607cc1/modules/mediawiki/files/monitoring/apc_stats.php#L61
[18:56:41] still see Fatal error: Unable to open MULTIVER_CDB_DIR_APACHE/wikiversions.cdb
[18:56:52] aude: where?
[18:57:08] mw1143
[18:57:18] 12 in last 5 min
[18:57:25] per bd808
[18:57:33] Reedy, I just did a fetch/rebase in the config, and I still see "labswiki": "php-1.24wmf15" -- is that because you haven't gotten to that part yet?
[18:57:39] none in last minute though
[18:58:06] let me know if you see another one
[18:59:06] andrewbogott: scap is still running...
[18:59:12] 'k
[18:59:15] andrewbogott: Yeah. versions not switched yet
[18:59:16] so that's the initial staging
[18:59:23] versions (bar testwiki) is next
[19:00:27] scap-rebuild-cdbs: 99% (ok: 228; fail: 0; left: 1)
[19:02:21] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health
[19:02:22] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[19:02:47] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 31 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 28, utimed_out: False, uactive_primary_shards: 36, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 72, uinitializing_shards: 3, unumber_of_data_nodes: 3}
[19:02:51] !log Restarted elasticsearch on logstash1003 -- Java OOM error in logs and not recovering shards
[19:02:57] Logged the message, Master
[19:03:01] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 31 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 28, utimed_out: False, uactive_primary_shards: 36, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 72, uinitializing_shards: 3, unumber_of_data_nodes: 3}
[19:03:26] (PS1) Dzahn: install wmfusercontent SSL cert on phab [puppet] - https://gerrit.wikimedia.org/r/159809
[19:03:47] nearly done
[19:04:03] (Abandoned) Aude: Revert "Get rid of MULTIVER_CDB_DIR_{APACHE,HOME}" [mediawiki-config] - https://gerrit.wikimedia.org/r/159803 (owner: Aude)
[19:04:25] thanks
[19:04:39] something up with bast1001?
[19:04:48] one of my sessions just died and I can't reconnect
[19:05:57] ori: Still some MULTIVER_CDB_APACHE errors trickling out of mw1143. 14 in last 5 mins
[19:06:09] sync-common on it maybe?
[19:06:32] it's probably some buffering in the logging pipeline. that's my guess. but i can sync-common, sure.
[19:07:02] my sell has hung
[19:07:04] *shell
[19:07:16] there's no reference to MULTIVER_CDB_APACHE on mw1143 that i can see
[19:07:24] so let's wait another minute
[19:07:44] Fu--------
[19:07:51] The shell session I was scapping on just died
[19:08:17] you don't use screen?
[19:08:43] doesn't help when you need to forward an agent or similar
[19:08:44] it should pick up from where it left off if you restart it, no?
[19:10:44] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: (no message)
[19:10:49] Logged the message, Master
[19:11:24] (CR) Dzahn: [C: -2] "nope, instead we want it on nginx on the misc. varnish hosts, but so far we don't add more than one cert there" [puppet] - https://gerrit.wikimedia.org/r/159809 (owner: Dzahn)
[19:13:58] 10.64.0.82 is still giving the lambda errors
[19:14:05] 10.64.0.53 too
[19:15:37] Sep 11 19:15:02 10.64.16.123 apache2[2799]: PHP Fatal error: Unable to open MULTIVER_CDB_DIR_APACHE/wikiversions.cdb.#012 in /usr/local/apache/common-local/multiversion/MWMultiVersion.php on line 358
[19:15:54] why lambda still?
[19:16:07] No idea, but they're still there
[19:16:13] !log running sync-common on mw1143
[19:16:19] Logged the message, Master
[19:18:02] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time
[19:18:18] Sep 11 19:17:59 10.64.16.123 apache2[14031]: PHP Fatal error: Base lambda function for closure not found in /usr/local/apache/common-local/php-1.24wmf20/extensions/Wikidata/extensions/Wikibase/lib/config/WikibaseLib.default.php on line 18
[19:18:28] I presume 1143 is ok now
[19:19:14] Can someone please graceful mw1143?
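Much of the exchange above comes down to counting fatal errors per host in the aggregated apache syslog ("12 in last 5 min", "10.64.0.82 is still giving the lambda errors"). A rough sketch of that kind of tally in shell follows. It assumes the line layout seen in the two syslog lines quoted above (month, day, time, source host, then apache2[pid] and the message); tally_fatals is a hypothetical helper name, not a tool mentioned in the channel.

```shell
# Count "PHP Fatal error" lines per source host in syslog-style input.
# Assumed layout: Mon DD HH:MM:SS <host> apache2[pid]: PHP Fatal error: ...
# so the host is the fourth whitespace-separated field.
# Reads the files given as arguments, or stdin when there are none.
tally_fatals() {
    awk '/PHP Fatal error/ { count[$4]++ }
         END { for (host in count) print count[host], host }' "$@" |
        sort -k1,1nr -k2,2    # busiest hosts first
}
```

Piping the last few minutes of the log through `tally_fatals` would show at a glance which hosts still need a sync-common or a graceful restart; the mapping from address to hostname (10.64.16.123 being mw1143, for instance) still has to be looked up separately.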
[19:19:18] uh, no
[19:19:32] or, yes, mw1143
[19:19:48] 10.64.16.123 == mw1143
[19:20:12] rage
[19:20:25] !log graceful'ed apache on mw1143
[19:20:27] Reedy: done
[19:20:29] thanks
[19:20:30] Logged the message, Master
[19:21:07] Look to have stopped at Sep 11 19:20:16 now
[19:21:49] They're disappearing from the apache syslogs now
[19:21:55] gah, wikibase is breaking beta
[19:21:55] No sign of the cdb issue either
[19:22:20] * aude prepares in case it's a problem on test.wikidata / test2
[19:22:47] * bd808__ is having bouncer issues
[19:23:49] When these lambda errors filter out I think we're roughly back to normal
[19:24:50] gerrit sucking for anyone else too?
[19:25:00] always :P
[19:25:43] Getting Server Unavailable errors
[19:26:14] not that bad for me
[19:26:29] <^d> Oddly enough I'm on office wifi and gerrit's not sucking.
[19:26:34] <^d> Which is unusual.
[19:26:36] Though, I wonder if I'm having ipv6 transit issues
[19:26:43] Reeeeeeedy
[19:26:44] hence no bast1001
[19:27:55] we have a wmf21 branch? am i confused
[19:28:07] wmf21 is today's new branch
[19:28:11] oh
[19:28:16] Currently active MediaWiki versions: 1.24wmf15 1.24wmf19 1.24wmf20 1.24wmf21
[19:28:17] lol
[19:28:36] our branch is misnamed then
[19:28:58] suppose ok for now though
[19:29:13] (PS1) MaxSem: Revert "Custom rights on zerowiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/159812
[19:30:11] greg-g, ^^^ is a fix for https://bugzilla.wikimedia.org/show_bug.cgi?id=70712 - when it can be deployed
[19:31:54] * Reedy mumbles something about not having eaten
[19:32:29] * marktraceur gives Reedy a cookie
[19:35:47] (PS1) Yuvipanda: tools: Install libvips* only on precise [puppet] - https://gerrit.wikimedia.org/r/159814
[19:35:49] andrewbogott: ^
[19:35:50] merge?
[19:35:55] cleans out puppet failures
[19:36:37] (CR) Andrew Bogott: [C: 2] tools: Install libvips* only on precise [puppet] - https://gerrit.wikimedia.org/r/159814 (owner: Yuvipanda)
[19:36:41] Reedy: how's the deploy?
[19:36:43] andrewbogott: ty
[19:36:53] andrewbogott: should make icinga all green for toollabs :)
[19:36:55] (PS2) Andrew Bogott: tools: Install libvips* only on precise [puppet] - https://gerrit.wikimedia.org/r/159814 (owner: Yuvipanda)
[19:37:57] "have you tried turning it off and on again"?
[19:39:02] andrewbogott: we've to figure out how to get labsdebrepo to work in mixed environments with both precise and trusty
[19:41:13] (PS2) Reedy: labswiki to 1.24wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/158291
[19:42:30] (PS2) Yurik: ZERO: fixed meta security, adjusted zerowiki rights adjustments [mediawiki-config] - https://gerrit.wikimedia.org/r/159656
[19:42:37] (PS2) Reedy: wikipedias to 1.24wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/159788
[19:42:40] MaxSem, ^^
[19:42:42] (CR) Reedy: [C: 2] wikipedias to 1.24wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/159788 (owner: Reedy)
[19:42:47] (Merged) jenkins-bot: wikipedias to 1.24wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/159788 (owner: Reedy)
[19:43:01] Reedy, deploying?
[19:44:26] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.24wmf20
[19:44:32] Logged the message, Master
[19:44:47] (CR) Chad: [C: -1] ZERO: fixed meta security, adjusted zerowiki rights adjustments (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/159656 (owner: Yurik)
[19:44:54] (PS2) Reedy: group0 to 1.24wmf21 [mediawiki-config] - https://gerrit.wikimedia.org/r/159789
[19:45:01] (CR) Reedy: [C: 2] group0 to 1.24wmf21 [mediawiki-config] - https://gerrit.wikimedia.org/r/159789 (owner: Reedy)
[19:45:05] (Merged) jenkins-bot: group0 to 1.24wmf21 [mediawiki-config] - https://gerrit.wikimedia.org/r/159789 (owner: Reedy)
[19:45:42] (CR) Yurik: ZERO: fixed meta security, adjusted zerowiki rights adjustments (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/159656 (owner: Yurik)
[19:46:14] (PS3) Yurik: ZERO: fixed meta security, adjusted zerowiki rights adjustments [mediawiki-config] - https://gerrit.wikimedia.org/r/159656
[19:46:50] Reedy, greg-g, could you push this out asap - meta is broken a bit ^
[19:46:56] or i could do it
[19:47:05] https://bugzilla.wikimedia.org/show_bug.cgi?id=70712
[19:47:36] <^d> yurikR: Thanks for fixing the \t's :)
[19:47:52] (PS3) Reedy: labswiki to 1.24wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/158291
[19:48:13] ^d, np )
[19:48:25] (CR) Reedy: [C: 2] labswiki to 1.24wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/158291 (owner: Reedy)
[19:48:29] (Merged) jenkins-bot: labswiki to 1.24wmf20 [mediawiki-config] - https://gerrit.wikimedia.org/r/158291 (owner: Reedy)
[19:48:43] Reedy: how's the deploy going? cc yurikR
[19:48:53] yurikR: we're still mid-train deploy
[19:49:03] greg-g: ipv6 sucks
[19:49:07] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf21
[19:49:12] Logged the message, Master
[19:50:25] (PS4) Reedy: ZERO: fixed meta security, adjusted zerowiki rights adjustments [mediawiki-config] - https://gerrit.wikimedia.org/r/159656 (owner: Yurik)
[19:50:29] (CR) Reedy: [C: 2] ZERO: fixed meta security, adjusted zerowiki rights adjustments [mediawiki-config] - https://gerrit.wikimedia.org/r/159656 (owner: Yurik)
[19:50:33] there ya go yurikR ;)
[19:50:34] (Merged) jenkins-bot: ZERO: fixed meta security, adjusted zerowiki rights adjustments [mediawiki-config] - https://gerrit.wikimedia.org/r/159656 (owner: Yurik)
[19:50:41] thx :))
[19:50:53] MaxSem, ^
[19:51:01] but if it blows up, i wasn't here...
[19:51:09] * yurikR hides
[19:51:14] yurikR: best bet in these situations is to see what is scheduled/going on, and ask those people if you can sync or not (in this case Reedy )
[19:51:45] greg-g, hehe, that's exactly whom i pinged :)
[19:51:58] !log reedy Synchronized wmf-config/: Fix Zero settings (duration: 00m 15s)
[19:51:59] oh, missed the one to ree-dy, heh
[19:52:04] Logged the message, Master
[19:52:05] yurikR: thank you sir! ;)
[19:52:09] !log Running manual sync-common on mw1138
[19:52:14] Logged the message, Master
[19:52:21] hehe :)
[19:52:24] Sep 11 19:50:32 10.64.16.118 apache2[6428]: PHP Fatal error: require() [function.require]: Failed opening required '/srv/mediawiki/php-1.24wmf21/api.php' (include_path='.:/usr/share/php:/usr/local/apache/common/php') in /usr/local/apache/common-local/w/api.php on line 3
[19:53:01] (PS4) Reedy: Set password default to PBKDF2 [mediawiki-config] - https://gerrit.wikimedia.org/r/158024 (https://bugzilla.wikimedia.org/68766) (owner: Parent5446)
[19:53:08] (CR) Reedy: [C: 2] Set password default to PBKDF2 [mediawiki-config] - https://gerrit.wikimedia.org/r/158024 (https://bugzilla.wikimedia.org/68766) (owner: Parent5446)
[19:53:12] (Merged) jenkins-bot: Set password default to PBKDF2 [mediawiki-config] - https://gerrit.wikimedia.org/r/158024 (https://bugzilla.wikimedia.org/68766) (owner: Parent5446)
[19:54:03] (PS2) Reedy: Bump cache epoch for test.wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/159778 (owner: Aude)
[19:54:09] (CR) Reedy: [C: 2] Bump cache epoch for test.wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/159778 (owner: Aude)
[19:54:14] (Merged) jenkins-bot: Bump cache epoch for test.wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/159778 (owner: Aude)
[19:55:32] (PS2) Reedy: Alphasort extension-list-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/157831
[19:55:38] (CR) Reedy: [C: 2] Alphasort extension-list-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/157831 (owner: Reedy)
[19:55:43] (Merged) jenkins-bot: Alphasort extension-list-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/157831 (owner: Reedy)
[19:56:02] (PS2) Reedy: (bug 70616) Change rights for user groups in hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/159364 (owner: Matanya)
[19:56:05] (CR) Reedy: [C: 2] (bug 70616) Change rights for user groups in hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/159364 (owner: Matanya)
[19:56:11] (Merged) jenkins-bot: (bug 70616) Change rights for user groups in hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/159364 (owner: Matanya)
[19:56:27] thank you
[19:56:49] yurikR: Warning: API call failed trying to get remote JsonConfig: error={"code":"readapidenied","info":"You need read permission to use this module"}, query={"action":
[19:56:49] "query","titles":"Zero:310-260","prop":"revisions","rvprop":"content"}
[19:57:18] (Abandoned) MaxSem: Revert "Custom rights on zerowiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/159812 (owner: MaxSem)
[19:57:59] !log Running sync-common on mw1024
[19:58:04] Logged the message, Master
[19:58:18] (PS2) Reedy: Add import source for eswikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/159790 (https://bugzilla.wikimedia.org/70414) (owner: Calak)
[19:58:21] (CR) Reedy: [C: 2] Add import source for eswikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/159790 (https://bugzilla.wikimedia.org/70414) (owner: Calak)
[19:58:26] (Merged) jenkins-bot: Add import source for eswikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/159790 (https://bugzilla.wikimedia.org/70414) (owner: Calak)
[19:59:19] (PS2) Reedy: Add "autopatrolled" and "patrol" user groups and enable RC patrol for lvwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/159772 (https://bugzilla.wikimedia.org/70441) (owner: Calak)
[19:59:21] twentyafterfour: that sring of merges from Reedy up there ^ this is what is known as "Reedy spam" aka "when Reedy goes through the mediawiki-config backlog and MERGES ALL THE THINGS!
[19:59:23] (CR) Reedy: [C: 2] Add "autopatrolled" and "patrol" user groups and enable RC patrol for lvwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/159772 (https://bugzilla.wikimedia.org/70441) (owner: Calak)
[19:59:27] string*
[19:59:27] (Merged) jenkins-bot: Add "autopatrolled" and "patrol" user groups and enable RC patrol for lvwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/159772 (https://bugzilla.wikimedia.org/70441) (owner: Calak)
[20:00:05] Reedy, did i cause that?
[20:00:11] (PS3) Reedy: Add 'autopatrol' and 'patrol' rights to "editor" group on plwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/159732 (https://bugzilla.wikimedia.org/70459) (owner: Calak)
[20:00:21] yurikR: I've no idea. It's your code presumably
[20:00:27] (CR) Reedy: [C: 2] Add 'autopatrol' and 'patrol' rights to "editor" group on plwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/159732 (https://bugzilla.wikimedia.org/70459) (owner: Calak)
[20:00:31] (Merged) jenkins-bot: Add 'autopatrol' and 'patrol' rights to "editor" group on plwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/159732 (https://bugzilla.wikimedia.org/70459) (owner: Calak)
[20:00:45] * Reedy migrates to the kitchen to find food
[20:01:02] (PS2) Reedy: Change autoconfirmed settings on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/159730 (https://bugzilla.wikimedia.org/70128) (owner: Calak)
[20:01:06] (CR) Reedy: [C: 2] Change autoconfirmed settings on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/159730 (https://bugzilla.wikimedia.org/70128) (owner: Calak)
[20:01:11] oh, Reedy, yes, i know about that issue, working on it - you shouldn't have seen it more than a few times (me testing)
[20:01:26] btw, where did you see it?
[20:01:37] i couldn't see it in logstash
[20:01:37] apache syslogs
[20:01:49] do you know why it's not in logstash?
[20:02:43] Reedy: https://gerrit.wikimedia.org/r/#/c/159816/
[20:03:02] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 7 unmerged changes in mediawiki_config (dir /a/common/).
[20:03:30] thanks
[20:04:39] (PS1) Rush: WIP [puppet] - https://gerrit.wikimedia.org/r/159820
[20:05:17] (CR) jenkins-bot: [V: -1] WIP [puppet] - https://gerrit.wikimedia.org/r/159820 (owner: Rush)
[20:05:36] * Reedy waits for jenkins
[20:05:39] ok
[20:06:03] for merging mediawiki-config stuff ;)
[20:06:13] andrewbogott: should be good to sync-common for wikitech
[20:06:52] i am puzzled why wikibase is broken on beta and not test.wikidata
[20:07:01] both have hhvm now
[20:07:16] or might have an idea
[20:08:17] !log reedy Synchronized php-1.24wmf21/extensions/Wikidata/: (no message) (duration: 00m 17s)
[20:08:21] Logged the message, Master
[20:09:01] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[20:09:37] wth is jenkins doing
[20:09:44] (CR) Giuseppe Lavagetto: [C: -1] "Small problem, otherwise LGTM. I guess it could be less complicated than this, but I'll have to think about this." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/159820 (owner: Rush)
[20:09:46] (Merged) jenkins-bot: Change autoconfirmed settings on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/159730 (https://bugzilla.wikimedia.org/70128) (owner: Calak)
[20:09:58] oh there we go
[20:10:32] (PS2) JanZerebecki: Puppetize icinga log file permission fix. [puppet] - https://gerrit.wikimedia.org/r/158633
[20:11:29] (CR) JanZerebecki: Puppetize icinga log file permission fix. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/158633 (owner: JanZerebecki)
[20:11:44] (PS2) Reedy: Further throttle Cirrus template update jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/159485 (owner: Manybubbles)
[20:11:50] (CR) Reedy: [C: 2] Further throttle Cirrus template update jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/159485 (owner: Manybubbles)
[20:11:57] (Merged) jenkins-bot: Further throttle Cirrus template update jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/159485 (owner: Manybubbles)
[20:12:06] thanks reedy
[20:12:22] (PS2) Reedy: Add comma between active MW versions [mediawiki-config] - https://gerrit.wikimedia.org/r/158043
[20:12:28] (CR) Reedy: [C: 2] Add comma between active MW versions [mediawiki-config] - https://gerrit.wikimedia.org/r/158043 (owner: Reedy)
[20:12:34] (Merged) jenkins-bot: Add comma between active MW versions [mediawiki-config] - https://gerrit.wikimedia.org/r/158043 (owner: Reedy)
[20:13:16] (PS2) Reedy: Clean up and re-organise VisualEditor configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/157845 (owner: Jforrester)
[20:13:20] (CR) Reedy: [C: 2] Clean up and re-organise VisualEditor configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/157845 (owner: Jforrester)
[20:13:25] (Merged) jenkins-bot: Clean up and re-organise VisualEditor configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/157845 (owner: Jforrester)
[20:14:10] (CR) Reedy: [C: -1] "Needs rebasing. Or redundant?" [mediawiki-config] - https://gerrit.wikimedia.org/r/157485 (owner: Ori.livneh)
[20:14:13] ori: ^^
[20:14:42] (Abandoned) Reedy: Disable RelatedSites Extension [mediawiki-config] - https://gerrit.wikimedia.org/r/150301 (https://bugzilla.wikimedia.org/68815) (owner: Reedy)
[20:14:57] !syncing virt1000, again in hopes of moving to wmf20
[20:15:10] um...
[20:15:15] !log syncing virt1000, again in hopes of moving to wmf20
[20:15:22] Logged the message, Master
[20:15:52] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 13s)
[20:16:00] Logged the message, Master
[20:16:54] (PS1) Ottomata: Grant Fabian Kaelin Analytics Cluster access [puppet] - https://gerrit.wikimedia.org/r/159825
[20:17:17] (CR) Ottomata: [C: -1] "Waiting for approval from legal" [puppet] - https://gerrit.wikimedia.org/r/159825 (owner: Ottomata)
[20:19:51] (PS1) Yurik: Typpo in the word 'false' :) [mediawiki-config] - https://gerrit.wikimedia.org/r/159826
[20:19:59] bd808: "" is not a valid magic word for "smwdoc", referer: https://wikitech.wikimedia.org/wiki/Main_Page
[20:19:59] (CR) Ori.livneh: WIP (3 comments) [puppet] - https://gerrit.wikimedia.org/r/159820 (owner: Rush)
[20:20:25] Reedy, sorry to bug you again, a tiny typpo https://gerrit.wikimedia.org/r/159826
[20:20:43] andrewbogott: WFM?
[20:20:56] andrewbogott: I get a different error about the echo db tables needing to be updated
[20:21:07] hm, so do I, now
[20:21:10] (the different error)
[20:21:18] Do I need to run update.php, or did sync-common do that already?
[20:21:19] run update.php then ;)
[20:21:20] mwscript update.php --quick
[20:21:55] 1.24wmf20! https://wikitech.wikimedia.org/wiki/Special:Version
[20:21:58] I take it sync-common /doesn't/ do that?
[20:22:14] andrewbogott: Nope. db updates are out of scope for syncing
[20:22:30] in prod s.pringle would do the safer equivalent
[20:22:42] bd808: how does the deployment train work, then, when there are db changes?
[20:22:44] Hm
[20:22:50] ok, well… seems to be working now!
[20:23:33] Patches that change the db are flagged to s.pringle in bugzilla and he updates things before the patch is merged (or before it is deployed at least)
[20:23:52] At least that's how I know to get them done
[20:24:03] ok...
[20:24:10] I guess I'll be doing these by hand for a while yet then.
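The by-hand routine described above for wikitech is two steps: run sync-common, then run the schema update with mwscript update.php --quick, since syncing does not touch the database. A minimal sketch of chaining the two follows. The function name sync_then_update and the SYNC_CMD / UPDATE_CMD override variables are hypothetical and exist only so the wrapper can be dry-run; the two default commands are the ones quoted in the channel.

```shell
# Hypothetical wrapper: sync the code first, then run the schema update
# only if the sync succeeded. The defaults are the commands quoted in the
# channel; the override variables are an assumption for dry runs.
sync_then_update() {
    # Left unquoted on purpose so each command string splits into words.
    ${SYNC_CMD:-sync-common} && ${UPDATE_CMD:-mwscript update.php --quick}
}
```

For a dry run, `SYNC_CMD="echo sync" UPDATE_CMD="echo update" sync_then_update` prints the two steps instead of executing them. Because of the `&&`, a failed sync stops the chain before any schema change is attempted.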
[20:24:18] * bd808 nods
[20:24:32] bd808: Can we write an addon module to scap that runs update.php only on wikitech?
[20:24:45] In theory yes
[20:25:06] (PS1) Dzahn: beta - IRC notifications for the QA team [puppet] - https://gerrit.wikimedia.org/r/159827
[20:25:29] It would be a bit of work but possible. scap allows for per-machine config files and that could be used to describe additional things to do
[20:25:59] I guess "sync-common; mwscript update.php --quick" is easier
[20:26:03] Or just write a shell script that says "sync-common && mwscript update.php --quick"
[20:26:11] jinx
[20:26:29] yep, that's simple enough
[20:27:07] unrelatedly… is there a way for me to ask the jobqueue how full it is (or better yet how old the oldest job is) without actually running everything?
[20:27:12] yup
[20:27:17] mwscript showJobs.php
[20:27:33] mwscript showJobs.php --wiki=labswiki
[20:28:00] so that's 'how full' but not 'how old' right?
[20:28:21] (CR) BryanDavis: [C: 1] beta - IRC notifications for the QA team [puppet] - https://gerrit.wikimedia.org/r/159827 (owner: Dzahn)
[20:28:27] I think there's a parameter for that
[20:28:45] hmm, nope
[20:29:00] (CR) Ottomata: [C: 2 V: 2] "Confirmed by legal. :)" [puppet] - https://gerrit.wikimedia.org/r/159825 (owner: Ottomata)
[20:29:09] andrewbogott: --list will just list them
[20:29:14] and you have to sort through them yourself
[20:29:20] yeah, with timestamps, that'll work.
[20:29:27] Every time I run mwscript I get PHP Notice: Undefined index: REQUEST_URI in /usr/local/apache/common-local/wmf-config/CommonSettings.php on line 2487
[20:29:31] there are some other parameters which do other things but nothing that groups them well
[20:29:38] well, like you want that is
[20:29:39] <^d> --list --group
[20:29:50] <^d> --group is useful.
[20:29:54] andrewbogott: That's happening in beta now too. I haven't looked for the culprit
[20:30:10] ok, great. As long as I'm not alone
[20:31:12] yurikR: you're spamming the logs
[20:31:12] a lot
[20:31:20] (CR) Rush: WIP (2 comments) [puppet] - https://gerrit.wikimedia.org/r/159820 (owner: Rush)
[20:31:34] Reedy, i already fixed the typo
[20:31:37] see above
[20:31:45] http://p.defau.lt/?lSZNp_Z8En1D_Fi8rbitlA
[20:32:04] (PS2) Rush: WIP [puppet] - https://gerrit.wikimedia.org/r/159820
[20:32:05] Reedy, https://gerrit.wikimedia.org/r/#/c/159826/
[20:32:08] ...
[20:32:11] Waaait
[20:32:12] andrewbogott: Looks like it is caused by this patch from Ori -- https://gerrit.wikimedia.org/r/#/c/158948/
[20:32:26] we were able to sync that?
[20:32:38] (CR) Reedy: [C: 2] Typpo in the word 'false' :) [mediawiki-config] - https://gerrit.wikimedia.org/r/159826 (owner: Yurik)
[20:32:48] Coren: Have a moment? I'm still thwarted by the jobqueue on virt1000. If you look at apache's crontab you can see the job that should be running, and when I run it by hand it works fine...
[20:32:51] possible something with cwd
[20:33:20] Reedy: It's not a php syntax error. Php would treat that as an undefined constant
[20:33:23] andrewbogott: I can take a look in a bit? I'm in the middle of coding.
[20:33:26] bd808: :9
[20:33:28] *:(
[20:33:34] Coren: sure, anytime. thx
[20:33:41] bd808: what was caused by that patch?
[20:33:43] RECOVERY - ElasticSearch health check on logstash1003 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[20:33:57] ori: Every time I run mwscript I get PHP Notice: Undefined index: REQUEST_URI in /usr/local/apache/common-local/wmf-config/CommonSettings.php on line 2487
[20:34:39] let's just revert it, i wanted feedback from aaron / tim on that anyhow
[20:34:51] (PS1) Ori.livneh: Revert "Scribunto: double the Lua CPU limit on the job runners" [mediawiki-config] - https://gerrit.wikimedia.org/r/159830
[20:35:04] (PS2) Ori.livneh: Revert "Scribunto: double the Lua CPU limit on the job runners" [mediawiki-config] - https://gerrit.wikimedia.org/r/159830
[20:35:16] bd808: ^ +1 ?
[20:35:19] (CR) Rush: "note:" [puppet] - https://gerrit.wikimedia.org/r/159820 (owner: Rush)
[20:35:43] (CR) BryanDavis: [C: 1] "better for cli usage" [mediawiki-config] - https://gerrit.wikimedia.org/r/159830 (owner: Ori.livneh)
[20:35:52] (PS2) Dzahn: beta - IRC notifications for the QA team [puppet] - https://gerrit.wikimedia.org/r/159827
[20:35:55] (CR) Ori.livneh: [C: 2] Revert "Scribunto: double the Lua CPU limit on the job runners" [mediawiki-config] - https://gerrit.wikimedia.org/r/159830 (owner: Ori.livneh)
[20:37:17] (PS1) Andrew Bogott: /fully/qualify/path/to/mwscript [puppet] - https://gerrit.wikimedia.org/r/159831
[20:37:48] hey chasemp, what's the proper bastion group to add a new user to
[20:37:56] new user wants to access a .eqiad.wmnet host
[20:38:03] do they need to be added to the bastiononly group too?
[20:38:24] so that gets them onto bastions yes
[20:38:27] and then whatever else they need
[20:38:41] ah ok
[20:39:23] (PS1) Ottomata: Fabian also needs access to bastion hosts [puppet] - https://gerrit.wikimedia.org/r/159832
[20:39:40] considered having all groups dump into bastions or something
[20:39:44] but seemed too magic and prone to fail
[20:40:20] * Reedy kicks jenkins
[20:40:29] (CR) Ottomata: [C: 2 V: 2] Fabian also needs access to bastion hosts [puppet] - https://gerrit.wikimedia.org/r/159832 (owner: Ottomata)
[20:40:30] (CR) Dzahn: "no jenkins..." [puppet] - https://gerrit.wikimedia.org/r/159827 (owner: Dzahn)
[20:42:01] (CR) Reedy: [V: 2] Typpo in the word 'false' :) [mediawiki-config] - https://gerrit.wikimedia.org/r/159826 (owner: Yurik)
[20:42:57] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 14s)
[20:43:02] Logged the message, Master
[20:44:46] (PS3) Rush: WIP [puppet] - https://gerrit.wikimedia.org/r/159820
[20:45:59] chasemp: you need to update the $full_site_name = '' in cache.pp to be $full_site_name = undef, in that case
[20:46:28] Krinkle: could you take a look at jenkins being ok?
[20:47:01] done and thanks, too many things at once
[20:47:08] (PS4) Rush: WIP [puppet] - https://gerrit.wikimedia.org/r/159820
[20:47:10] i know the feeling :)
[20:47:52] (CR) Ori.livneh: [C: 1] "Haven't tested, but looks right." [puppet] - https://gerrit.wikimedia.org/r/159820 (owner: Rush)
[20:48:23] (PS3) Ori.livneh: Revert "Scribunto: double the Lua CPU limit on the job runners" [mediawiki-config] - https://gerrit.wikimedia.org/r/159830
[20:48:28] (CR) Ori.livneh: [V: 2] Revert "Scribunto: double the Lua CPU limit on the job runners" [mediawiki-config] - https://gerrit.wikimedia.org/r/159830 (owner: Ori.livneh)
[20:48:45] (PS1) Dzahn: toollabs - IRC notifications for labs [puppet] - https://gerrit.wikimedia.org/r/159834
[20:48:46] !log ori updated /a/common to {{Gerrit|I1f3234746}}: Revert "Scribunto: double the Lua CPU limit on the job runners"
[20:48:51] Logged the message, Master
[20:49:08] !log ori Synchronized wmf-config/CommonSettings.php: I1f3234746: Revert Scribunto: double the Lua CPU limit on the job runners (duration: 00m 05s)
[20:49:13] Logged the message, Master
[20:49:44] andrewbogott: i have a few mins if you want me to look at anything
[20:49:58] ori: Thanks. I may have a solution, but… waiting on jenkins
[20:50:05] * ori nods
[20:50:22] andrewbogott: I think jenkins might have finished for the weekend
[20:50:32] yeah :(
[20:51:16] confirmed. no jenkins, already pinged about it
[20:51:52] (CR) Andrew Bogott: [C: 1] "I have some concern that this will drown out newbies in the channel, but we can give it a chance." [puppet] - https://gerrit.wikimedia.org/r/159834 (owner: Dzahn)
[20:52:39] can i has +1 for https://gerrit.wikimedia.org/r/#/c/159793/ and https://gerrit.wikimedia.org/r/#/c/159795/ ?
[20:52:41] both straightforward
[20:53:26] apergos: ping
[20:53:27] (CR) Dzahn: "depends how many services we add this to, let me stress this is NOT going to output all the ops stuff, only whichever service this con" [puppet] - https://gerrit.wikimedia.org/r/159834 (owner: Dzahn)
[20:53:51] cajoel: yes? (pretty much gone now, it's midnight)
[20:53:54] Is someone working on jenkins? Or are we waiting for ^d to return from lunch?
[20:54:02] why don't we just restart it?
[20:54:06] apergos: I need someone from Ops to ACK that we're starting a scan
[20:54:10] I thought reedy tried that already?
[20:54:13] and you're RT master at the moment
[20:54:28] RT 8339
[20:54:43] is there a better way to mass spam all Ops with an FYI?
[20:54:49] I am but I'm also about to be not around.. I tried to ack you a couple hours earlier but I guess you were gone
[20:55:01] sorry -- missed that
[20:55:07] I haven't touched jenkins
[20:55:16] in ticket would be nice
[20:55:21] hmmm, Reedy, seems something else is amiss, figuring it out...
[20:55:21] oh, when you said 'kick jenkins' you were just… recommending
[20:55:24] well let's just see if someone can keep an eye out from the sf timezone now
[20:55:28] ori, will you restart if you know how?
[20:55:34] Otherwise I'll look it up :)
[20:55:44] jgage: can you be the scanning wingman
[20:55:55] andrewbogott: dunno how, sorry
[20:56:01] 'k
[20:56:08] it's on wikitech!
[20:56:44] !log graceful'd apache on mw1053, missed it earlier
[20:56:48] Logged the message, Master
[20:57:40] mutante: around?
[20:58:56] !log restarted jenkins, maybe
[20:59:00] Logged the message, Master
[20:59:34] Reedy, fixed
[21:00:05] superm401, rmoen: Respected human, time to deploy Growth (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140911T2100). Please do the needful.
[21:00:12] apparently jenkins takes a looooong time to start up
[21:00:18] yup
[21:00:30] thanks andrewbogott Jenkins does take a while to restore itself
[21:03:10] <^d> andrewbogott: I've been back from lunch for >1h. Was I looking at Jenkins?
[21:04:42] bblack, around?
[21:04:54] yeah
[21:05:03] ^d: No worries, turns out everyone was kvetching about Jenkins but no one thought to tell you, or do anything (including me until just a minute ago)
[21:05:08] I'm preparing to break all the things :)
[21:05:12] I restarted, as yet unclear if it helped.
[21:05:15] bblack, hi, i changed a few security settings, not sure if Netmapper is still downloading stuff ok. can you check? [21:05:26] bblack, that is a noble task! [21:05:46] ottomata: do you mean to still be hammering elastic1016? [21:05:48] ^d: signs point to 'did not help' [21:05:52] i fully support you in that endeavour [21:06:23] ^d: elastic1016? [21:06:28] <^d> Not I. [21:06:29] yurikR: are these security changes related to IP addresses, or does checking just one host do? [21:06:30] <^d> andrewbogott: Well I was complaining about it too. Maybe I shouldn't lick the cookie :p [21:06:49] bblack, sec checking as far as what user can do what things [21:07:12] netmapper user should be able to pull ips, but i need to double check that it still can [21:07:44] yurikR: I meant, are you trying to restrict what networks can make requests by-address? [21:08:08] bblack, no, just reorganizing the security groups for MW [21:08:18] and setting the rights [21:08:27] yup [21:08:29] {"servedby": "mw1200", "error": {"info": "You need read permission to use this module", "code": "readapidenied"}} [21:08:31] manybubbles: [21:08:34] ^ been getting that for like an hour [21:08:40] i was running 6 hours of enwiki searches on it [21:08:41] its still going [21:08:45] (03CR) 10JanZerebecki: "You can read the NDA at https://wikitech.wikimedia.org/wiki/File:Volunteer_Non-disclosure_Agreement_Template.pdf . (IANAL, but:) It is pre" [puppet] - 10https://gerrit.wikimedia.org/r/159419 (owner: 10Dzahn) [21:08:45] i did 1012 earlier [21:08:49] oh [21:08:49] no [21:08:54] just checked now manybubbles [21:08:55] it is done [21:08:57] !log restarting zuul on gallium [21:09:02] Logged the message, Master [21:09:06] ottomata: are you sure? elastic1016 is hammered [21:09:11] <^d> manybubbles: Looks like it started about an hour and a half ago. 
[21:09:37] ja, i would have started about that long ago [21:09:40] and it looks like it jsut finished [21:09:40] ^d: I'm having some connection issues - can you check if we're actually hurting anyone right now? [21:09:45] real 85m38.684s [21:09:59] you know you didn't cause any io [21:10:12] yeah, godog was pointing that out earlier [21:10:43] ottomata: just finished, yeah [21:10:47] (03PS4) 10BBlack: Remove LVS/SSL defs for unused project-lb IPs [puppet] - 10https://gerrit.wikimedia.org/r/157978 [21:10:52] can you email me what you did? [21:10:54] makes sense, not really sure how we would cause it then thoug...i guess restart es like we did last week and THEN hammer them? [21:10:54] ja [21:10:55] (03PS4) 10BBlack: LVS/Protoproxy cleanup [puppet] - 10https://gerrit.wikimedia.org/r/158297 [21:10:57] so I can review why it didn't make io? [21:11:01] manybubbles: [21:11:02] https://gist.github.com/ottomata/fb03fd03267aa0eb1767 [21:11:03] <^d> manybubbles: I'm not seeing any problems with search anywhere [21:11:10] <^d> (user facing, at least) [21:11:42] yurikR: {"servedby": "mw1200", "error": {"info": "You need read permission to use this module", "code": "readapidenied"}} (forgot to tag your name earlier) [21:11:56] ^d: welp, now zuul is waiting..........................................................................................................................................................etc [21:12:01] mutante: Hi [21:12:05] bblack, sec, adjusting [21:12:23] bblack, try it now [21:12:35] ^d: cool [21:12:38] <^d> andrewbogott: Did you try kicking zuul? [21:12:51] ottomata: I don't _think_ your rebuilding cirrus's searches though - your just running some queries [21:12:51] bblack, btw, was that error just now or about 15+ min ago? [21:13:01] PROBLEM - check configured eth on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:13:01] PROBLEM - nutcracker port on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:13:01] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:13:02] if you rebuild what cirrus does you'll hammer it [21:13:03] ^d: That's what I mean. https://dpaste.de/k0g0 [21:13:14] now I don't know if I should ctrl-c or wait or what [21:13:20] apergos: ellery has a wikitech account and a UID [21:13:21] PROBLEM - nutcracker process on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:13:21] PROBLEM - SSH on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:22] PROBLEM - check if dhclient is running on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:13:22] i updated the ticket [21:13:32] PROBLEM - Disk space on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:13:33] PROBLEM - DPKG on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:13:40] andrewbogott: restart doesn't do what you think it does [21:13:41] manybubbles: very possible......... [21:13:41] PROBLEM - puppet last run on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:13:57] haha, that's why I asked ^d to review and if q=search was enough! oh well :p [21:14:00] i gotta run though [21:14:01] Krinkle: ok, I'm just following (trying to follow) https://www.mediawiki.org/wiki/Continuous_integration/Zuul [21:14:02] !log Stopping/starting zuul [21:14:06] but, I will accept all offers! 
[21:14:07] Logged the message, Master [21:14:11] q=search will just search a single field [21:14:25] yurikR: it was all the automated attempts from about 1h20m ago through about 15m ago, when it apparently stopped erroring [21:14:28] manybubbles: if you get a chance to let me know what request I should send, I will try again tomorrow [21:14:44] yurikR, I see a ton of "No content is available, caching empty '480:470-01' for 10 seconds [Called from JsonConfig\JCCache::memcSet in /srv/mediawiki/php-1.24wmf20/extensions/JsonConfig/includes/JCCache.php at line 100] " in fatalmonitor [21:14:46] bblack, oh, so maybe i shouldnat' have change it [21:14:47] yurikR: currently it looks ok, although I'd be more-positive if there were an actual data change to see happen [21:15:05] currently it's seeming to succeed, but nothing has changed in hours [21:15:12] MaxSem, i fixed it - had to add an account to the right sec group [21:15:12] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/159831 (owner: 10Andrew Bogott) [21:15:25] Krinkle: can you fill me in? [21:15:26] bblack, i have deployed some setting changes [21:15:31] SHOULD BE PUPPETIZED :P [21:15:45] ok, laters all! [21:15:51] yurikR: last change was to carriers.json a little over 3 hours ago [21:16:14] well close to 4 hours [21:16:29] bblack, i haven't changed the data, only the security settings [21:16:37] right [21:16:41] but i must have misconfigured them, hence the issues [21:16:57] now i'm adding the user accts back to the right groups [21:16:57] ottomata: use this as a template: https://en.wikipedia.org/wiki/Special:Search/test%20some%20stuff?srbackend=CirrusSearch&cirrusDumpQuery=yes [21:17:11] yurikR: as best I can tell, it's currently fetching correctly, but the correct fetch induces no change in the data :) But I'm no longer seeing error reports [21:17:14] bblack, can you manually run it on one of the server? 
[21:17:22] if you can make some trivial change to a test carrier, it would be more-obvious that it's definitely working [21:17:25] I did [21:17:43] i might have given too many rights to your acct, want to revese it a bit [21:17:50] ok [21:17:52] would you be able to test it again? [21:18:02] (03CR) 10Andrew Bogott: [C: 032] /fully/qualify/path/to/mwscript [puppet] - 10https://gerrit.wikimedia.org/r/159831 (owner: 10Andrew Bogott) [21:18:15] andrewbogott: https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Restart [21:18:21] bblack, ok, removed sysop, can you check again? [21:18:44] spike again [21:18:51] Krinkle: yes, that was exactly what I did. As you can see from my previous message that included this link: https://dpaste.de/k0g0 [21:18:53] What did /you/ do? [21:19:11] andrewbogott: refresh the link [21:19:15] Oh, did you just rewrite? [21:19:29] Ah! [21:19:32] OK, thanks for documenting :) [21:19:49] yurikR, https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [21:19:53] yurikR: it's failing again now [21:20:08] woo merges [21:21:38] bblack, ok, guess you need sysop for now :( Will see what i can do later. Testing it outside of production is painful ) [21:21:58] it's working again now [21:22:16] xthx [21:22:17] thx [21:22:24] also, I found a bug in the zerofetcher error-handling (which just replaces the intended exception with a different one, but still) [21:22:27] MaxSem, i have no idea what causes dberror [21:22:41] andrewbogott: So why was it restarted / what's going on? [21:22:46] bblack, hehe, something positive :) [21:23:12] Krinkle: Nothing was happening. I restarted jenkins, and it came up and was waiting for jobs but not getting any. [21:23:16] So I blamed zuul... [21:23:20] which seems to've been right [21:23:29] k [21:23:44] MaxSem, and why adding account to the sysop group solves it... I wonder when we will start seeing proper stacktraces in logstash... 
[21:24:01] RECOVERY - check configured eth on mw1053 is OK: NRPE: Unable to read output [21:24:02] RECOVERY - nutcracker port on mw1053 is OK: TCP OK - 0.000 second response time on port 11212 [21:24:02] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [21:24:21] yurikR: sometime after this patch lands https://gerrit.wikimedia.org/r/#/c/119940/ [21:24:22] RECOVERY - nutcracker process on mw1053 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [21:24:22] RECOVERY - SSH on mw1053 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [21:24:23] RECOVERY - check if dhclient is running on mw1053 is OK: PROCS OK: 0 processes with command name dhclient [21:24:42] RECOVERY - Disk space on mw1053 is OK: DISK OK [21:24:42] RECOVERY - DPKG on mw1053 is OK: All packages OK [21:24:47] (03PS1) 10BBlack: zerofetch.py: Fix str+int error in exception output [puppet] - 10https://gerrit.wikimedia.org/r/159843 [21:25:28] bd808, looks complicated, good luck :) [21:25:55] (03CR) 10BBlack: [C: 032] zerofetch.py: Fix str+int error in exception output [puppet] - 10https://gerrit.wikimedia.org/r/159843 (owner: 10BBlack) [21:26:50] yurikR: Just waiting on some Jenkins/zuul work by hashar at this point I think :) I only started on all that in January. 
:P [21:27:29] oh, that's nothing [21:28:09] (03PS5) 10BBlack: Remove LVS/SSL defs for unused project-lb IPs [puppet] - 10https://gerrit.wikimedia.org/r/157978 [21:28:16] (03CR) 10Dzahn: [C: 031] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/159827 (owner: 10Dzahn) [21:28:24] (03CR) 10Dzahn: [C: 031] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/159834 (owner: 10Dzahn) [21:29:41] ori: ok, now i have a different, related question… if you log into virt1000, showJobs reports 990 jobs left to run, but runJobs does nothing [21:29:48] or (more likely) i'm misinterpreting showJobs [21:30:02] (03Abandoned) 10Dzahn: install wmfusercontent SSL cert on phab [puppet] - 10https://gerrit.wikimedia.org/r/159809 (owner: 10Dzahn) [21:31:43] m i allowed to do debugging from a particular host, perhaps osmium? Specifically for beta labs i put together a small environment where i can boot up an hhvm instance with debugging enabled and then use a small client i wrote to send fastcgi requests directly to that hhvm instance. This allows me to use a GDB like debugger to walk through php thats breaking on beta labs [21:31:51] would it be ok to do the same at osmium, or somewhere else? 
[21:32:01] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [21:33:22] ebernhardson: use osmium [21:33:40] ebernhardson: it's good to ask, since it's shared by multiple people testing HHVM stuff, but you can have it for yourself right now [21:34:08] ori: ok excellent, thanks [21:35:18] (03PS1) 10ArielGlenn: new ellery user, add to analytics hadoop and stat1003, rt # 8283 [puppet] - 10https://gerrit.wikimedia.org/r/159846 [21:35:58] not sure that access is right, I'm adding them to 'restricted' to get them cluster access [21:36:12] ottomata is gone unfortuntely [21:36:59] (03PS4) 10Mattflaschen: Enable the Task Recommendations experiment v1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/156282 (owner: 10Phuedx) [21:37:54] (03PS5) 10BBlack: LVS/Protoproxy cleanup [puppet] - 10https://gerrit.wikimedia.org/r/158297 [21:38:23] Sorry, we're running a little behind. [21:41:00] (03PS2) 10ArielGlenn: new ellery user, add to analytics hadoop and stat1003, rt # 8283 [puppet] - 10https://gerrit.wikimedia.org/r/159846 [21:41:13] we'll try it this way [21:42:43] AaronS: _joe_ noted that load on the job runners has had a few spikes since tues: [21:42:43] (03PS6) 10BBlack: Remove LVS/SSL defs for unused project-lb IPs [puppet] - 10https://gerrit.wikimedia.org/r/157978 [21:42:45] <_joe_> AaronS: the hhvm jr is running fine but - something is hammering the JRs, and that made the HHVM JR OOM [21:43:08] <_joe_> so, we need to find a way to limit the number of threads it runs [21:43:45] (03PS6) 10BBlack: LVS/Protoproxy cleanup [puppet] - 10https://gerrit.wikimedia.org/r/158297 [21:43:56] Reedy: ParsoidCacheUpdateJobOnDependencyChange: 990 queued; 0 claimed (0 active, 0 abandoned); 0 delayed <- those never seem to run [21:44:26] <_joe_> see https://ganglia.wikimedia.org/latest/?c=Jobrunners%20eqiad&m=cpu_report&r=4hr&s=by%20name&hc=4&mc=2 [21:44:50] (03CR) 10Mattflaschen: [C: 032] Enable the Task Recommendations 
experiment v1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/156282 (owner: 10Phuedx) [21:44:57] (03Merged) 10jenkins-bot: Enable the Task Recommendations experiment v1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/156282 (owner: 10Phuedx) [21:45:13] andrewbogott: Probably should disable the parsoid extension if it's enabled [21:46:28] (03PS1) 10Andrew Bogott: Disable parsoid on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159850 [21:46:29] Reedy: ^ ? [21:46:47] (03PS2) 10Reedy: Disable parsoid on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159850 (owner: 10Andrew Bogott) [21:46:53] (03CR) 10Reedy: [C: 031] Disable parsoid on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159850 (owner: 10Andrew Bogott) [21:47:02] andrewbogott: You can then just delete the jobs from the database table [21:47:40] (03CR) 10Andrew Bogott: [C: 032] Disable parsoid on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159850 (owner: 10Andrew Bogott) [21:47:44] (03Merged) 10jenkins-bot: Disable parsoid on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159850 (owner: 10Andrew Bogott) [21:49:40] Reedy, ok, seems all better. Thanks! [21:49:43] !log mattflaschen Started scap: Deploy new GettingStarted recommendations A/B test [21:49:48] Logged the message, Master [21:52:37] _joe_: is it fine? It bloated to 6.3Gb [21:52:59] I don't think that happens immediately [21:54:23] andrewbogott: Is it working OK now? [21:54:33] Krinkle: seems so! [21:54:59] there were 50 or so threads, I wonder what the limit is [21:55:35] (03PS3) 10Rush: new ellery user, add to analytics hadoop and stat1003, rt # 8283 [puppet] - 10https://gerrit.wikimedia.org/r/159846 (owner: 10ArielGlenn) [21:56:09] (03CR) 10Rush: [C: 031] "from a technical stand point this is good to me, from an analytics stand point, I think it's the right stuff?" 
[puppet] - 10https://gerrit.wikimedia.org/r/159846 (owner: 10ArielGlenn) [21:56:49] (03PS1) 10Dzahn: install wmfusercontent.org cert on protoproxies [puppet] - 10https://gerrit.wikimedia.org/r/159867 [21:57:27] https://github.com/facebook/hhvm/wiki/runtime-options [21:57:28] (03CR) 10jenkins-bot: [V: 04-1] install wmfusercontent.org cert on protoproxies [puppet] - 10https://gerrit.wikimedia.org/r/159867 (owner: 10Dzahn) [21:57:34] that matches the defaults, makes sense [21:57:49] (03CR) 10ArielGlenn: [C: 032] new ellery user, add to analytics hadoop and stat1003, rt # 8283 [puppet] - 10https://gerrit.wikimedia.org/r/159846 (owner: 10ArielGlenn) [22:00:19] (03PS2) 10Dzahn: install wmfusercontent.org cert on protoproxies [puppet] - 10https://gerrit.wikimedia.org/r/159867 [22:00:44] Got a couple errors from sync-common: [22:00:46] 21:57:57 ['sync-common', '--no-update-l10n', 'mw1010.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on tmh1001 returned [255]: Permission denied (publickey). [22:00:55] (03CR) 10jenkins-bot: [V: 04-1] install wmfusercontent.org cert on protoproxies [puppet] - 10https://gerrit.wikimedia.org/r/159867 (owner: 10Dzahn) [22:01:05] (03PS7) 10BBlack: LVS/Protoproxy cleanup [puppet] - 10https://gerrit.wikimedia.org/r/158297 [22:03:10] (03PS3) 10Dzahn: install wmfusercontent.org cert on protoproxies [puppet] - 10https://gerrit.wikimedia.org/r/159867 [22:06:34] (03PS7) 10BBlack: Remove LVS/SSL defs for unused project-lb IPs, and cleanup several older service IPs that have not been published in DNS in a very long time. 
[puppet] - 10https://gerrit.wikimedia.org/r/157978 [22:06:39] (03CR) 10Dzahn: [C: 032] beta: add deployment-mediawiki03 to scap targets [puppet] - 10https://gerrit.wikimedia.org/r/159520 (https://bugzilla.wikimedia.org/70181) (owner: 10Dduvall) [22:07:33] (03PS8) 10BBlack: LVS/Protoproxy cleanup [puppet] - 10https://gerrit.wikimedia.org/r/157978 [22:08:10] wow, ok, bblack is cleaning it all [22:09:42] (03Abandoned) 10BBlack: LVS/Protoproxy cleanup [puppet] - 10https://gerrit.wikimedia.org/r/158297 (owner: 10BBlack) [22:10:26] mutante: I told you I was preparing to break all the things :) [22:11:20] bblack: that's a good thing :) [22:17:03] apergos, I think scap may have hung. [22:17:16] superm401: 1 left? [22:17:22] No, 72 [22:17:26] It's on sync-common and it's been over 4 minutes (probably longer), and no servers have changed in that time: [22:17:33] sync-common: 68% (ok: 155; fail: 2; left: 72) [22:17:37] hmmm... [22:18:22] It looks like it finished mw-update-l10n normally. Do you think I should cancel and do sync-dirs? [22:20:22] "mw1053 INFO - Finished rsync common (duration: 22m 14s)" [22:20:25] ouch [22:20:43] superm401: progressing again now? [22:20:55] (03PS1) 10Jackmcbarn: Re-enable the Lua profiler on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159903 [22:21:11] bd808, yeah, where did you get that status info? [22:21:24] I'm tailing logs on fluorine [22:21:36] $ tail -1000f /a/mw-log/scap.log | python ~bd808/scaplog.py [22:21:38] Cool, thanks. [22:21:41] (03PS9) 10BBlack: LVS/Protoproxy cleanup [puppet] - 10https://gerrit.wikimedia.org/r/157978 [22:23:23] (03CR) 10Jackmcbarn: "Why was this done?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/159830 (owner: 10Ori.livneh) [22:23:47] Sync times like that make me think that a single rsync server caught a huge pile of concurrent requests [22:24:11] That happened once last week as I recall (or maybe the week before) [22:24:48] "mw1028 INFO - Finished rsync common (duration: 25m 45s)" [22:26:25] mw1053 is HHVM and special in that somebody just seems to have fixed it earlier [22:26:31] re: long sync time on just that one [22:26:40] maybe it had just more to sync [22:30:46] superm401: ! failure due to rsync timeout: mw1010.eqiad.wmnet copying to itself :/ [22:31:03] <_joe_> oh bd808 at what time did you perform scap? [22:31:04] Yeah, that's weird. [22:31:10] * _joe_ checks the sal [22:31:17] Is it intended that the proxies are also regular web servers? [22:31:59] _joe_: It's running now [22:32:00] https://ganglia.wikimedia.org/latest/?c=Jobrunners%20eqiad&h=mw1010.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [22:32:03] _joe_, I started 21:50 UTC [22:32:13] <_joe_> ok thanks [22:32:27] <_joe_> the jobrunners are running at 100% niced cpu [22:32:31] i think mw1053 might just be broken [22:32:33] <_joe_> so they will take time to sync [22:32:38] like it was in "7408 Hardware repair and reinstall mw1053.eqiad.wmnet" [22:32:48] <_joe_> mutante: uh? [22:32:52] <_joe_> I don't think so [22:32:57] <_joe_> what makes you think that? [22:33:11] all the icinga checks on it recovered earlier [22:33:27] it keeps popping up in icinga and when people sync [22:33:47] as being extra slow [22:34:32] <_joe_> mutante: the icinga popup was due to hhvm OOM'ing [22:34:53] ah, i just saw you re-enabled puppet on it earlier. 
ok [22:38:43] (03PS1) 10RobH: db2031 has disk shelf, shouldn't use stock db.cfg [puppet] - 10https://gerrit.wikimedia.org/r/159914 [22:39:31] (03CR) 10RobH: [C: 032] db2031 has disk shelf, shouldn't use stock db.cfg [puppet] - 10https://gerrit.wikimedia.org/r/159914 (owner: 10RobH) [22:40:57] Reedy, solved the bug with the imagemagick - it was the double ## in the color code [22:41:03] atglenn:ping [22:41:14] without it, works fine [22:41:23] apergos:ping [22:43:20] Dibs SWAT. [22:43:24] We have a patch in anyway [22:43:52] Still crawling along [22:43:53] sync-common: 99% (ok: 225; fail: 3; left: 1) [22:43:57] Oh, hm, RoanKattouw claimed it. [22:44:16] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/159827 (owner: 10Dzahn) [22:44:28] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/159834 (owner: 10Dzahn) [22:45:00] Yeah I claimed it for Ed [22:45:03] So he can practice [22:45:48] Neato. [22:45:56] RoanKattouw: New SWATter? :D [22:46:24] Well, maybe eventually [22:46:29] *nod* [22:46:32] But right now he has deploy rights and has never used them before [22:46:49] Time to learn how to break the site [22:46:56] And we figured our team should have more people than just me that know how to do this stuff [22:47:02] ok good, csap on his first day depoying :) [22:47:05] s/csap/scap/ [22:47:11] * ebernhardson cant type for anything some days ... [22:47:38] (03CR) 10Dzahn: [C: 032] beta - IRC notifications for the QA team [puppet] - 10https://gerrit.wikimedia.org/r/159827 (owner: 10Dzahn) [22:48:12] (03CR) 10Dzahn: [C: 032] toollabs - IRC notifications for labs [puppet] - 10https://gerrit.wikimedia.org/r/159834 (owner: 10Dzahn) [22:48:48] bd808, still one left... 
[22:48:59] * bd808 bets on fenari [22:49:22] won't bet against that [22:50:59] (03PS6) 10Dduvall: beta: varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) [22:54:33] (03PS7) 10Dduvall: beta: varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) [22:59:58] RoanKattouw: edsanders: I'll wait till after your scap stuff again [23:00:05] RoanKattouw, edsanders: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140911T2300). [23:00:21] superm401: omg. still one left? [23:00:32] This is fubar [23:00:35] bd808, I know, really. [23:00:43] !log restarting icinga-wm for config change [23:00:48] Logged the message, Master [23:04:35] (03CR) 10Dduvall: "Compilation looks ok against prod and labs." [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) (owner: 10Dduvall) [23:05:29] (03CR) 10Dzahn: "16:02 -!- icinga-wm [~icinga-wm@neon.wikimedia.org] has joined #wikimedia-labs" [puppet] - 10https://gerrit.wikimedia.org/r/159834 (owner: 10Dzahn) [23:06:17] (03CR) 10Dzahn: "16:02 -!- icinga-wm [~icinga-wm@neon.wikimedia.org] has joined #wikimedia-qa" [puppet] - 10https://gerrit.wikimedia.org/r/159827 (owner: 10Dzahn) [23:08:50] superm401: is it just hanging on fernari? [23:09:16] greg-g, I don't know, which number is that? [23:09:31] Actually, I don't think I can tell anyway. [23:09:38] it doesnt have a number, it's just called 'fenari' [23:09:38] It only tells you which it just finished, not what it's currently doing. 
[23:10:03] James_F: notice icinga-wm in -qa now [23:15:10] !log esanders scap failed: LockFailedError Failed to lock /var/lock/scap: [Errno 11] Resource temporarily unavailable (duration: 00m 00s) [23:15:15] Logged the message, Master [23:15:34] RoanKattouw, edsanders, first lesson, check here before you try to deploy [23:15:57] Yeah, it's been running for over an hour. Sorry. [23:16:18] greg-g, bd808, should I just cancel it? [23:16:22] They're going to have to start a new one anyway. [23:16:30] And there's only one left. [23:17:12] superm401: Let me find out which one [23:17:13] superm401: Up to you. Did you need l10n updates? If so they aren't built on any hosts yet. [23:17:27] bd808, oh, I thought mw-update-l10n did that. Yeah, we do. [23:17:29] Thanks, RoanKattouw [23:17:31] We do too [23:18:10] (03CR) 10Dzahn: [C: 032] "yes, thanks, these are unused. checked on neon" [puppet] - 10https://gerrit.wikimedia.org/r/159793 (owner: 10Ori.livneh) [23:18:11] https://doc.wikimedia.org/mw-tools-scap/api.html#scap -- you are in step 7 [23:18:42] mw1010 is the holdout [23:18:58] look at that, scap docs [23:18:58] So if you kill now and they follow with a full scap it will fix all the things [23:19:03] (03CR) 10Dzahn: "also, "nagios-fedora-plugins" fedora *g*" [puppet] - 10https://gerrit.wikimedia.org/r/159793 (owner: 10Ori.livneh) [23:19:11] generated from code even :) [23:19:19] It's a job runner that is running at basically 100% CPU [23:19:24] eek [23:19:42] which is probably the hhvm jr issue [23:19:50] _joe_, AaronS, ori: ^^ [23:20:05] Which HHVM issue is this? [23:20:20] bd808: Would you like me to kill the mw1010 process so scap will move ahead with one failure? [23:20:25] RoanKattouw, bd808, there was already a failure from that mw1010 earlier. It must be retrying or something: [23:20:27] superm401: ! 
failure due to rsync timeout: mw1010.eqiad.wmnet copying to itself :/ [23:20:41] and mw1010 is a rsync slave [23:20:45] so extra load [23:20:50] No, sorry [23:20:52] I wa swrong [23:20:54] mw1063 [23:20:58] Which is slaving from mw1010 [23:21:19] It's not doing anything weird [23:21:46] RoanKattouw: The HHVM issue is that the hhvm jobrunners are taking more work than they can handle at times. [23:21:56] RoanKattouw: What else is going on on mw1010? [23:22:19] I don't know, I can't find the process on mw1063 that's supposed to be trying to pull from 1010 [23:22:19] load average: 48.09, 42.97, 45.35 [23:22:37] There's no sync-common process [23:23:02] Even though this is in ps: [23:23:04] 2821 4769 4769 4769 ? -1 S The jobrunner on mw1010 is eating all the cpu though and starving the rsync process [23:23:23] Hmm, what are the other server names there? Are those all the proxies? [23:23:30] yes [23:23:37] note that hhvm is not running there [23:23:56] bd808: those procs are heavily niced [23:24:05] what is the rsync nice? [23:24:22] bd808: OK so should I just kill the ssh mw1063 process on tin so the scap will move on with 1 failure? [23:24:23] AaronS: So this may be an issue in general? Or we need to not use job runners as rsync slaves anymore? [23:24:30] RoanKattouw: Sure [23:24:33] OK [23:25:16] Done [23:25:34] As long as load on mw1010 is that high the next scap will probably be horrible too [23:25:44] Yeah now it's sshing to lots of stuff again, it seems to have moved on to the next step [23:25:52] superm401: Your scap should have come back to life now [23:25:55] RoanKattouw: I got {{ec}}'d when trying to add my patch, you may or may not have seen it [23:26:02] Yep, it's showing scap-rebuild-cbs. Thanks. 
[23:26:07] marktraceur: Didn't see it [23:26:14] Don't worry, I have to add a patch of my own still [23:26:19] 23:25:11 4 apaches had sync errors [23:26:20] 23:25:11 Finished sync-apaches (duration: 87m 15s) [23:26:45] sudo -u mwdeploy -n -- scap-rebuild-cdbs on tmh1002 returned [255]: Permission denied (publickey). [23:26:55] bd808: it should perhaps just be another apache doing proxying, though this sounds like an old rsync/ssh nice issue [23:26:57] And on more publickey error on a different server but otherwise the same. [23:26:57] RoanKattouw: It may be worth hiding mw1010 from scap by commenting it out in /etc/dsh/group/scap-proxies [23:27:27] AaronS: It has worked well until the last 7-10 days [23:27:35] but meh [23:27:49] we need rsync servers that are mostly bored [23:28:06] the rsync process is costly while it runs [23:28:33] I have to leave for a dinner date. [23:28:56] scap-rebuild-cdbs: 99% (ok: 226; fail: 2; left: 1) [23:29:02] Alright, thanks for your help, bd808. [23:29:15] superm401: MOstly moral support, but yw [23:29:17] !log mattflaschen Finished scap: Deploy new GettingStarted recommendations A/B test (duration: 99m 34s) [23:29:23] Logged the message, Master [23:29:24] :) [23:30:57] andrewbogott: were you concerned about Parsoid jobs using CPU & memory? [23:31:11] (03CR) 10Dzahn: [C: 031] lucene: update path reference for /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159798 (owner: 10Ori.livneh) [23:31:33] !log esanders Started scap: SWAT deploy [23:31:38] Logged the message, Master [23:32:25] (03CR) 10BryanDavis: "The patch led to cli errors from $_SERVER['REQUEST_URI'] being undefined in a non-web context. Also Ori said it needed more review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/159830 (owner: 10Ori.livneh) [23:33:18] (03PS2) 10Dzahn: remove pmtpa subnets from install-server [puppet] - 10https://gerrit.wikimedia.org/r/159438 [23:34:18] (03CR) 10Dzahn: [C: 032] "we really should not be installing new things in pmtpa anymore" [puppet] - 10https://gerrit.wikimedia.org/r/159438 (owner: 10Dzahn) [23:34:50] mutante: could i ask you to check out the dependent change as well, ? it's also very straightforward [23:35:59] ori: i'm not sure if it really is, that contains pybal [23:36:17] which _joe_ was going to move [23:36:30] doesnt it [23:36:41] (03PS1) 10Ori.livneh: Wikitech: update paths for /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159925 [23:37:00] mutante: sure, but it's just referring to the same file by another name [23:37:07] it's not actually relocating them [23:37:36] also, mutante: where do you see pybal? [23:37:48] (03PS1) 10Yurik: Updated login-logout whitelisted pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159926 [23:37:53] @fenari:/srv/mediawiki/docroot/noc/pybal [23:38:16] mutante: that's a symlink to /home/wikipedia/conf/pybal [23:38:22] that doesn't change at all [23:38:48] andrewbogott: https://gerrit.wikimedia.org/r/#/c/159925/ [23:39:05] Some of our messages are missing (show up on MW, but not in actual messages). [23:39:13] May have gotten cached as missing during the long scap. [23:39:35] I mean they're on the MediaWiki: namespace pages, but RL is apparently serving them as missing. [23:39:56] superm401: hm. touch a js file and re-scap? [23:40:56] edsanders, your scap is running, right? [23:41:02] yup [23:41:20] edsanders, alright, please let me know when it's done. I'll retest and may have to do another i18n-only scap/deploy. 
[23:44:41] (03CR) 10Dzahn: [C: 031] "0 lrwxrwxrwx 1 root root 30 May 29 21:35 common -> /usr/local/apache/common-local" [puppet] - 10https://gerrit.wikimedia.org/r/159795 (owner: 10Ori.livneh) [23:44:56] thank you :) [23:45:05] (03PS2) 10Ori.livneh: Update docroot for noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/159795 [23:45:11] I'm putting off my LVS-related changes. I didn't think the other stuff in this window would get complicated or take long (well it was empty earlier!), and it's the final deploy of the week and late here, etc. [23:45:12] (03CR) 10Ori.livneh: [C: 032 V: 032] Update docroot for noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/159795 (owner: 10Ori.livneh) [23:45:18] (03PS2) 10Ori.livneh: lucene: update path reference for /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159798 [23:45:21] I'll try again on Monday [23:45:23] (03CR) 10Ori.livneh: [C: 032 V: 032] lucene: update path reference for /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159798 (owner: 10Ori.livneh) [23:51:47] test2.wikipedia.org is down. (Apologies if this is a known issue.) [23:51:53] hrmmmm, no [23:52:21] edsanders: http://test2.wikipedia.org/wiki/Main_Page [23:52:31] oh, it's wikibase [23:52:38] aude: awake? :) [23:54:00] (03CR) 10Andrew Bogott: [C: 032] Wikitech: update paths for /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159925 (owner: 10Ori.livneh) [23:54:23] greg-g: i can look [23:54:25] andrewbogott: thanks! 
[23:54:43] greg-g: that test2 error looks sort of like the error beta labs was having earlier today [23:55:59] greg-g: No, beta labs was https://integration.wikimedia.org/ci/view/BrowserTests/job/browsertests-Flow-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/192/artifact/log/Actions_menu_Permalink%3A_Topic_Actions_menu_Permalink.png [23:56:03] sorta, that's https://bugzilla.wikimedia.org/show_bug.cgi?id=70740 [23:56:10] yeah [23:56:14] just reported https://bugzilla.wikimedia.org/show_bug.cgi?id=70747 [23:56:39] PROBLEM - puppet last run on mw1010 is CRITICAL: CRITICAL: Puppet has 1 failures [23:56:53] it's only with hhvm=true, i think [23:57:11] greg-g: they're both from LanguageLinkBadgeDisplay [23:57:12] oh no, never mind [23:57:39] spagewmf: oh? [23:58:13] edsanders, RoanKattouw - stuck in the weeds? [23:58:29] spagewmf: ugh, yeah [23:59:11] scap is still running [23:59:18] test2 has mysteriously exploded in a ball of Wikibase fire [23:59:26] ori: it's probably this: https://gerrit.wikimedia.org/r/#/c/159818/ [23:59:29] i think this is the key: testwikidatawiki/WBL-1.24wmf21:WikiPageEntityRevisionLookup:Q33 [23:59:34] i'm going to delete it [23:59:39] ori: as in, that's what aude did to fix it in beta, but... [23:59:46] ok [23:59:53] RoanKattouw, edsanders : does the scap running ncorporate either Flow bump? [23:59:58] ori: to be clear: I'm not 100% both are the same issue