[00:00:16] !log ori Synchronized php-1.24wmf18/extensions/WikimediaEvents: Update WikimediaEvents for cherry-picks (duration: 00m 03s)
[00:00:22] Logged the message, Master
[00:00:38] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 0B: /srv/deployment/ocg/output 5.5GB (= 5.0GB critical): /srv/deployment/ocg/postmortem 990482B: ocg_job_status 9897 msg: ocg_render_job_queue 0 msg
[00:01:11] (PS1) coren: Tool Labs: add php5-xdebug to dev environment [puppet] - https://gerrit.wikimedia.org/r/158032
[00:01:14] aww, OCG ?
[00:05:18] disks are cheap, why not give it a terabyte /tmp? :P
[00:06:21] they are not _that_ cheap if you get them for actual servers in data centers
[00:06:30] as opposed to home computers
[00:07:51] but i guess.. if it needs it ,it needs it
[00:08:28] icinga-wm: that looks resolved already? what?
[00:08:50] the_nobodies: yet when checking on ocg1001 it looks already resolved.. and .. well .. "tmp"
[00:09:30] (PS2) coren: Tool Labs: add php5-xdebug to dev environment [puppet] - https://gerrit.wikimedia.org/r/158032 (https://bugzilla.wikimedia.org/70313)
[00:09:34] oh, or it's the output size
[00:09:40] 5.6G /srv/deployment/ocg/output/
[00:10:52] dat oits.
[00:11:13] (PS1) Legoktm: Install "php5-xdebug" on tool labs [puppet] - https://gerrit.wikimedia.org/r/158033 (https://bugzilla.wikimedia.org/70313)
[00:11:28] :OO
[00:11:31] * legoktm hugs Coren
[00:12:16] (PS9) Ori.livneh: Clean up salt::minion [puppet] - https://gerrit.wikimedia.org/r/153727
[00:12:25] Coren: does installing it in dev also make it available on the exec nodes?
[00:12:51] legoktm: No; it's the other way 'round.
[00:13:03] ok, well I need it on the exec nodes
[00:13:11] ... I'm not sure I like that idea.
[00:13:26] https://phpunit.de/manual/3.7/en/code-coverage-analysis.html requires it
[00:13:43] Ah. Ew.
[00:14:01] I don't actually need it for debugging, just as a dependency :P
[00:14:56] (PS3) coren: Tool Labs: add php5-xdebug to exec environment [puppet] - https://gerrit.wikimedia.org/r/158032 (https://bugzilla.wikimedia.org/70313)
[00:15:17] (Abandoned) Legoktm: Install "php5-xdebug" on tool labs [puppet] - https://gerrit.wikimedia.org/r/158033 (https://bugzilla.wikimedia.org/70313) (owner: Legoktm)
[00:15:38] legoktm: Dude, I already got a patch. :-)
[00:15:49] I wasn't paying attention :P
[00:15:58] Clearly. :-P
[00:16:02] (CR) Legoktm: [C: 1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/158032 (https://bugzilla.wikimedia.org/70313) (owner: coren)
[00:16:13] ori: any luck with virt1000?
[00:16:37] RECOVERY - OCG health on ocg1001 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 3711649432B: /srv/deployment/ocg/postmortem 990482B: ocg_job_status 9897 msg: ocg_render_job_queue 0 msg
[00:16:52] Reedy: i didn't get around to looking; will do now. just a sec
[00:17:00] np, thanks :)
[00:17:50] Reedy: it's the wrong server
[00:17:58] eh?
[00:18:01] you have it both listening on 11212 and connecting to 11212
[00:18:06] !log deleted PDF files older than 3d and a huge 1G one on ocg1001 in reaction to monitoring complaints
[00:18:13] Logged the message, Master
[00:18:21] why do we monitor that being over 5G?
[00:18:26] since we have > 450G of space
[00:18:40] i guess first there was monitoring, then new disk?
[00:18:43] ori: eh? :/
[00:18:53] listen => '127.0.0.1:11212',
[00:18:57] '127.0.0.1:11211:1',
[00:19:04] 'servers' => array( '127.0.0.1:11212' ),
[00:19:32] (PS1) Ori.livneh: wikitech: specify correct port for memcached instance [puppet] - https://gerrit.wikimedia.org/r/158034
[00:19:37] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC
[00:19:37] ^ Reedy
[00:20:26] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/mediawiki.pp#L13
[00:20:27] and Coren, if you can +1
[00:20:34] It's 11211 on the others
[00:20:49] well, andrewbogott_afk doesn't want to change it
[00:21:19] which is fine
[00:21:37] the following will be true for all app servers, wikitech included:
[00:21:39] andrewbogott_afk: is the clock on your laptop/pc out by a few days? :/
[00:21:53] * mediawiki connects to memcached on port 11212
[00:22:05] * nutcracker runs locally and listens on port 11212
[00:22:08] (CR) Reedy: [C: 1] "Per https://gerrit.wikimedia.org/r/#/c/158001/1/wmf-config/wikitech.php,unified" [puppet] - https://gerrit.wikimedia.org/r/158034 (owner: Ori.livneh)
[00:22:34] on wikitech, nutcracker will proxy to localhost:11000 (the local memcached instance)
[00:22:35] (CR) coren: [C: 2] "The number, she is better that way." [puppet] - https://gerrit.wikimedia.org/r/158034 (owner: Ori.livneh)
[00:22:42] right
[00:22:43] yadda yadda yadda, thanks :)
[00:23:00] (CR) coren: [C: 2] Tool Labs: add php5-xdebug to exec environment [puppet] - https://gerrit.wikimedia.org/r/158032 (https://bugzilla.wikimedia.org/70313) (owner: coren)
[00:23:07] note puppet is disabled on virt1000 atm
[00:23:11] so might want to fix it manually ;)
[00:23:16] sure, easy to do
[00:23:54] Reedy: try now
[00:24:29] A database query error has occurred. This may indicate a bug in the software.
[00:24:30] Function: AccountAudit::updateLastLogin
[00:24:30] Error: 1146 Table 'labswiki.accountaudit_login' doesn't exist (208.80.154.18)
[00:24:30] hahaha
[00:24:38] more fscking extensions
[00:25:09] disable the extension, or run update.php?
[00:25:40] it's on the same version as wikitech, so at worst, update.php is only going to add any needed extension tables
[00:25:53] update.php IMHO
[00:25:59] mind running it please?
[00:26:10] but yeah, login works now. CC andrewbogott_afk :D
[00:26:12] thanks
[00:26:41] (PS2) Reedy: Merge virt1000 apache config back into wikitech apache config [puppet] - https://gerrit.wikimedia.org/r/157853
[00:26:44] Reedy: how should i run it? the mwscript wrapper isn't present afaict
[00:27:00] php multiversion/MWMultiVersion.php update.php --wiki=labswiki
[00:27:01] I think
[00:27:15] oh, no
[00:27:18] MWScript.php
[00:27:31] php multiversion/MWScript.php update.php --wiki=labswiki
[00:27:43] obviously from /usr/local/apache/common
[00:29:20] are you taking responsibility for this? i'm not sure who needs to sign off or whatever
[00:29:46] let's get another opinion from Coren maybe?
[00:29:53] Hmm?
[00:30:18] Error: 1146 Table 'labswiki.accountaudit_login' doesn't exist (208.80.154.18)
[00:30:21] disable the extension, or run update.php?
[00:30:29] it's on the same version as wikitech, so at worst, update.php is only going to add any needed extension tables
[00:30:43] (i agree, fwiw)
[00:30:58] Should be, though tha table doesn't ring a bell.
[00:31:10] it records the last login time
[00:31:22] it was installed a while back as part of the SUL switch over
[00:31:34] so that very inactive accounts could be identified
[00:31:58] so we probably don't need it on wikitechwiki, do we?
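Note on the memcached port discussion (00:17:50 to 00:22:34): ori describes a layout where MediaWiki connects to port 11212, nutcracker listens on 11212 and, on wikitech, proxies to the local memcached on 11000. The sketch below checks a layout like that for two obvious mistakes. It is only an illustration: the helper name, the argument shapes and the ":1" weight suffix on the 11000 backend are assumptions made for this example and are not taken from the puppet or mediawiki-config repositories.

```python
def check_proxy_topology(client_servers, proxy_listen, proxy_backends):
    """Return a list of problems with a client -> proxy -> memcached layout.

    Hypothetical helper: addresses are "host:port" strings; backends may
    carry a trailing ":weight" field, which is ignored here.
    """
    problems = []
    if proxy_listen not in client_servers:
        problems.append("client does not connect to the proxy at " + proxy_listen)
    for backend in proxy_backends:
        host, port = backend.split(":")[:2]
        if host + ":" + port == proxy_listen:
            problems.append("proxy forwards to itself via " + backend)
    return problems


# Layout as described in the channel (the ":1" weight is an assumption).
intended = check_proxy_topology(
    client_servers=["127.0.0.1:11212"],
    proxy_listen="127.0.0.1:11212",
    proxy_backends=["127.0.0.1:11000:1"],
)

# A proxy that both listens on and connects to 11212 would loop back on itself.
looped = check_proxy_topology(
    client_servers=["127.0.0.1:11212"],
    proxy_listen="127.0.0.1:11212",
    proxy_backends=["127.0.0.1:11212:1"],
)
```

Here `intended` comes back empty, while `looped` reports the self-referencing backend, which is the kind of mix-up the "listening on 11212 and connecting to 11212" remark is about.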
[00:32:17] * jamesofur shrugs
[00:32:22] possibly not
[00:32:22] update.php sounds safer, io
[00:32:23] imo
[00:32:38] though I've got a feeling we might end up in circle disabling these extensions that cause no harm being there
[00:32:41] it's fairly light weight, it likely doesn't hurt much
[00:33:05] (PS1) Reedy: Merge virt1000.wikimedia.org back into wikitech.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/158035
[00:33:33] ok i'll run it in 3 mins if nobody speaks up against the idea
[00:34:05] * ori is surprised that update.php doesn't have a --dry-run option
[00:34:43] heh
[00:34:47] might be worth BZ-ing that one
[00:35:00] if you BZ, i'll run :P
[00:36:10] ah, dupe
[00:36:11] https://bugzilla.wikimedia.org/show_bug.cgi?id=59857
[00:37:23] ori: ah
[00:37:33] (PS1) Dzahn: OCG-raise monitor threshold for output file size [puppet] - https://gerrit.wikimedia.org/r/158036
[00:37:34] --schema
[00:37:42] "There is a --schema option, to this file is only written, when there actually changes, so having a empty file means no changes."
[00:37:43] Reedy: not the same, really
[00:37:56] gives you an idea at least
[00:38:08] oh, that's --quiet
[00:38:10] well, you really want to know what *would* change when there are changes to make
[00:38:11] sort of
[00:39:02] SAL doesnt have CSS anymore?
[00:39:19] wikitech works but no skin? i mean
[00:39:36] Reedy: known issue related to the ongoing thing?
[00:40:56] ooh
[00:41:04] that was working fine not so long ago...
[00:41:40] I see the css/js returning but it's not applied?
[00:42:39] Resource interpreted as Document but transferred with MIME type text/javascript: "https://wikitech.wikimedia.org/w/load.php?debug=false&lang=en&modules=startup&only=scripts&skin=vector&*".
[00:43:46] (CR) Krinkle: "fixme: Tabs instead of spaces." [mediawiki-config] - https://gerrit.wikimedia.org/r/157993 (owner: Reedy)
[00:44:55] (CR) Dzahn: [C: 2] "00:18 mutante: deleted PDF files older than 3d and a huge 1G one on ocg1001 in reaction to monitoring complaints" [puppet] - https://gerrit.wikimedia.org/r/158036 (owner: Dzahn)
[00:45:41] (CR) Dzahn: "/dev/md2 456G 3.8G 452G 1% /srv" [puppet] - https://gerrit.wikimedia.org/r/158036 (owner: Dzahn)
[00:45:52] bd808: That MIME is false positive. Only happens when the chrome dev tools are open. Harmless either way (the js request is fine, css is missing)
[00:45:54] been in chrome for years.
[00:45:56] https://wikitech.wikimedia.org/w/load.php?debug=false&lang=en&modules=site&only=styles&skin=vector&*
[00:46:00] that looks garbled
[00:46:02] mysql chartype?
[00:46:52] Related to running update.php maybe?
[00:47:11] I think it might be memcached
[00:47:21] if virt1000 and wikitech are using the same keys
[00:47:27] but different access libraries
[00:47:30] No
[00:47:34] it's not just css
[00:47:45] oh nvm, it is just css
[00:47:55] I thought it was Vector missing, but our base styles just lok better these days
[00:48:18] Reedy: One with compression and one without or something?
[00:48:27] https://wikitech.wikimedia.org/w/load.php?debug=false&lang=en&modules=ext.echo.badge%7Cext.gadget.enwp-boxes%7Cext.visualEditor.viewPageTarget.noscript%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.interface%7Cmediawiki.ui.button%7Cskins.vector.styles&only=styles&skin=vector&*
[00:48:31] That's the skin css request
[00:48:34] linked from html
[00:48:38] also contains garbled characters
[00:48:47] ori: Does that ring any bells? mixed memcached client libs causing corruption?
[00:49:01] so it's not just for modules with contents from wiki pages (mysql), but indeed memcached, php or apache in general
[00:49:24] k, leaving for real this time
[00:50:09] https://wikitech.wikimedia.org/wiki/Main_Page?debug=true looks right
[00:50:27] without debug not so much
[00:51:02] https://noc.wikimedia.org/conf/highlight.php?file=mc.php
[00:51:18] versus the stuff using our php memcached client etc
[00:54:26] yes, i've seen that before
[00:54:27] we could switch the normal wikitech over to using the newer memcached config via nutcracker etc
[00:54:37] why is it not using the same memcached driver?
[00:54:37] How do we get the right things back into cache? Touch all the resource files in the "real" file tree for wikitech?
[00:54:37] but that's more fiddling around
[00:55:20] (CR) Dzahn: [C: 1] "thanks, this came up because i recently wanted to add the same thing. then ended up wondering why we _don't_ set those. misc hostgroups in" [puppet] - https://gerrit.wikimedia.org/r/157658 (owner: Alexandros Kosiaris)
[00:55:28] ori: wikitech is using the old one... to save even more config variation on virt1000 (ie going forward), I updated it to using the newer config etc
[00:58:14] if we switch back to CACHE_MEMCACHED on the various $wg, and then set $wgMemCachedServers = array( '127.0.0.1:11000' ); ...
[01:01:00] (PS1) Reedy: Use old style memcached access on wikitech to stop cache pollution [mediawiki-config] - https://gerrit.wikimedia.org/r/158038
[01:01:03] ala that
[01:04:30] why not newer style on old wikitech?
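Note on the garbled CSS (00:46:00 to 00:54:37): the working theory in the channel is that two wikis share cache keys but use different memcached access libraries, possibly "one with compression and one without". The sketch below shows how that kind of mismatch can produce garbage. It is a toy model only: a dict stands in for memcached, and the flag value, function names and cache key are hypothetical, not the behaviour of any specific PHP memcached driver.

```python
import zlib

COMPRESSED = 0x2  # hypothetical flag bit; real client libraries define their own


def new_driver_set(cache, key, text):
    """Store text zlib-compressed and mark it with a flag, as one driver might."""
    cache[key] = (COMPRESSED, zlib.compress(text.encode("utf-8")))


def new_driver_get(cache, key):
    """Read a value back, honouring the compression flag."""
    flags, raw = cache[key]
    if flags & COMPRESSED:
        raw = zlib.decompress(raw)
    return raw.decode("utf-8")


def old_driver_get(cache, key):
    """Read a value while ignoring flags, as a mismatched driver would."""
    _flags, raw = cache[key]
    return raw.decode("utf-8", errors="replace")


shared_cache = {}  # stands in for the memcached instance both wikis talk to
css = "body { font-family: sans-serif; }"
new_driver_set(shared_cache, "resourceloader:styles", css)

good = new_driver_get(shared_cache, "resourceloader:styles")
garbled = old_driver_get(shared_cache, "resourceloader:styles")
```

In this model `good` round-trips the stylesheet, while `garbled` is the compressed bytes misread as text. Keeping the two setups on separate keys or on the same driver, as discussed above, avoids one side reading what the other wrote.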
[01:07:13] i suggested that originally ;)
[01:07:23] just I can't do it, so someone else would have to
[01:16:21] Error: 1146 Table 'labswiki.betafeatures_user_counts' doesn't exist (208.80.154.18)
[01:16:23] another one ;)
[01:19:11] andrewbogott_afk: Coren load on labmon seems ok
[01:19:26] I see 3 - 5% waiting on IO now with everything sending them stats
[01:19:29] which is fine
[01:19:42] Reedy: ^d: http://www.bullshit.wiki/api.php :)
[01:20:19] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00332225913621
[01:21:07] (PS1) Rush: iridium exim roled for phab [puppet] - https://gerrit.wikimedia.org/r/158039
[01:25:17] (PS1) Rush: phab::mail renamed to phab::mailrelay [puppet] - https://gerrit.wikimedia.org/r/158040
[01:25:23] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[01:25:46] (CR) Rush: [C: 2] "already there, was removed erroneously previously" [puppet] - https://gerrit.wikimedia.org/r/158039 (owner: Rush)
[01:26:13] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 23:24:49 UTC
[01:26:30] (CR) Rush: [C: 2] "mail is a terrible name as this just sets up a relay for ticket interaction not actual mail" [puppet] - https://gerrit.wikimedia.org/r/158040 (owner: Rush)
[01:32:41] I'm done for today
[01:32:53] (CR) Dzahn: "why "certainly not via templates"? it's a config file, don't we usually use puppet templates for config files?" [puppet] - https://gerrit.wikimedia.org/r/157294 (https://bugzilla.wikimedia.org/69979) (owner: Dzahn)
[01:38:18] (CR) Dzahn: "i see manifests/init.pp: core_dump_report_directory => '/var/log/hhvm'," [puppet] - https://gerrit.wikimedia.org/r/157294 (https://bugzilla.wikimedia.org/69979) (owner: Dzahn)
[01:38:23] (Abandoned) Dzahn: hhvm - make debug path configurable [puppet] - https://gerrit.wikimedia.org/r/157294 (https://bugzilla.wikimedia.org/69979) (owner: Dzahn)
[01:39:53] (CR) Dzahn: "don't know if there is another labs solution, bug doesnt say such thing.." [puppet] - https://gerrit.wikimedia.org/r/157294 (https://bugzilla.wikimedia.org/69979) (owner: Dzahn)
[01:40:32] (PS1) Rush: phabricator email relay to task creation [puppet] - https://gerrit.wikimedia.org/r/158042
[01:41:33] (CR) Rush: [C: 2] phabricator email relay to task creation [puppet] - https://gerrit.wikimedia.org/r/158042 (owner: Rush)
[01:50:05] (PS1) Reedy: Add comma between active MW versions [mediawiki-config] - https://gerrit.wikimedia.org/r/158043
[02:11:13] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3614 MB (3% inode=99%):
[02:16:39] (CR) BBlack: [C: -1] "In both cases this stuff should be conditional as labs-only, I think." [puppet] - https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) (owner: Dduvall)
[02:20:03] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC
[02:29:53] RECOVERY - swift eqiad-prod object availability on tungsten is OK: OK: Less than 1.00% under the threshold [95.0]
[02:39:56] Is there a bug tracking the wikitech CSS thing?
[02:40:25] I think Reedy was working on something to do with that earlier.
[02:40:54] But he's probably asleep now. Sorry scfc_de
[02:42:39] Yep, that's why I'm interested in something I can subscribe to so that I can sleep as well :-).
[02:43:27] !log LocalisationUpdate completed (1.24wmf15) at 2014-09-03 02:42:24+00:00
[02:43:33] Logged the message, Master
[03:00:13] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:07:53] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:07:53] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:09:03] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:16:14] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:17:40] !log LocalisationUpdate completed (1.24wmf18) at 2014-09-03 03:16:37+00:00
[03:17:47] Logged the message, Master
[03:24:53] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[03:25:53] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[03:27:03] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 23:24:49 UTC
[03:27:03] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[03:33:59] Krenair: Which CSS thing?
[03:34:14] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[03:34:26] Oh, hmm.
[03:34:28] Carmela, read up the chat log
[03:34:39] Less exciting than I'd hoped.
[03:36:03] PROBLEM - puppet last run on mw1091 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:37:13] we could throw in some helvetica, make it interesting
[03:37:37] Uh, you mean Helvetica Neue.
[03:37:46] helvetica neue light
[03:38:14] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:45:03] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:46:23] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:46:53] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:51:20] !log LocalisationUpdate completed (1.24wmf19) at 2014-09-03 03:50:17+00:00
[03:51:26] Logged the message, Master
[03:51:43] PROBLEM - puppet last run on mw1079 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:55:03] RECOVERY - puppet last run on mw1091 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[03:56:14] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[04:03:03] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[04:03:53] RECOVERY - puppet last run on mw1160 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[04:04:23] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[04:09:43] RECOVERY - puppet last run on mw1079 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[04:21:03] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC
[04:54:56] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Sep 3 04:53:50 UTC 2014 (duration 53m 49s)
[04:55:02] Logged the message, Master
[04:59:13] (PS2) Giuseppe Lavagetto: Add pcre overflow patch that should prevent beta from crashing regularly [debs/hhvm] - https://gerrit.wikimedia.org/r/157790
[05:01:55] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "Made the patch apply cleanly so that the debian build system won't argue about it" [debs/hhvm] - https://gerrit.wikimedia.org/r/157790 (owner: Giuseppe Lavagetto)
[05:21:30] (PS1) Springle: ignore oai.% for s3 sanitarium replication [puppet] - https://gerrit.wikimedia.org/r/158053
[05:22:34] (CR) Springle: [C: 2] ignore oai.% for s3 sanitarium replication [puppet] - https://gerrit.wikimedia.org/r/158053 (owner: Springle)
[05:28:03] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 23:24:49 UTC
[05:32:03] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[06:22:03] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC
[06:28:24] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:24] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:44] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:04] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:23] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:23] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:23] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:33] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:33] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:03] PROBLEM - Disk space on elastic1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%):
[06:42:24] PROBLEM - puppet last run on db1007 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:43:34] (CR) Alexandros Kosiaris: [C: 2] "After some discussion in IRC with Giuseppe we concluded that previous behavior was old and the new one seems better. So merging this." [puppet] - https://gerrit.wikimedia.org/r/157658 (owner: Alexandros Kosiaris)
[06:45:33] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:45:44] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:46:03] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[06:46:03] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:46:23] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:46:24] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:46:24] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:46:24] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:46:24] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:46:33] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:46:56] damn you passenger...
[06:47:19] it's punctual
[06:47:34] yes... on apache logrotate
[06:47:43] despire graceful
[06:47:47] despite*
[06:48:17] why don't we use http://cronolog.org/
[06:48:25] godog mentioned it recently so i read up about it
[06:48:35] Wikitech looks a bit silly at the moment.
[06:49:01] Carmela: indeed
[06:49:12] it's cache pollution from the migration work
[06:49:28] both old (current) and new wikis are using the same memcached infrastructure
[06:49:57] And you can't just flush memcached?
[06:50:58] we probably could, but if it goes south i don't want to be on the hook
[06:51:16] i guess i can tcpdump memcached keys and reset the resourceloader ones
[06:52:24] * ori keeps this one-liner handy for such occasions
[06:52:25] tcpdump -i eth0 -s 65535 -A -ttt port 11211 | cut -c 9- | grep -i '^get' | cut -d' ' -f2
[06:55:35] it's better now
[06:56:07] Ah, nice. Thanks.
[06:56:57] !log restarted memcached on virt1000 due to cache pollution from migration (different memc drivers w/different encoding)
[06:57:03] Logged the message, Master
[06:57:58] thanks ori
[06:58:41] np.
[06:58:43] * ori sleeps
[07:00:24] RECOVERY - puppet last run on db1007 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[07:00:29] <_joe_> night ori
[07:04:11] and the new cluster is named
[07:04:19] sca - from service cluster a
[07:04:47] mathoid probably going into production early next week
[07:11:05] yep cronolog might be worth a try
[07:11:44] kind of hesitant tbh. I have seen piped logs in apache misbehaving...
[07:12:12] ending up with a couple or more lines of garbled logs
[07:12:26] which is not much
[07:12:49] but still. Plus we would do that to avoid sending a signal to passenger ?
[07:13:09] well the entire apache + mod_passenger to be more precise
[07:13:29] ah! I've never seen apache mispiping but it might have been luck
[07:14:28] yeah that'd avoid apache doing the rotation, well anything that fixes the puppet o'clock alert storm would do :)
[07:14:46] possibly trusty's version of mod_passenger is better?
[07:14:57] could be.
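Note on the tcpdump one-liner pasted at 06:52:25: the three filters after tcpdump (drop the first 8 characters, keep lines that start with "get" regardless of case, take the second space-separated field) can be mirrored in Python for dump text that has already been saved. This is a rough sketch rather than a replacement for the one-liner. The function name and the sample keys are made up for the example, and unlike `cut` it skips lines that contain no key at all.

```python
def memcached_get_keys(lines):
    """Yield the first key of each memcached `get` command found in dump lines."""
    for line in lines:
        payload = line.rstrip("\r\n")[8:]            # like `cut -c 9-`
        if not payload.lower().startswith("get"):    # like `grep -i '^get'`
            continue
        fields = payload.split(" ")
        if len(fields) > 1 and fields[1]:
            yield fields[1]                          # like `cut -d' ' -f2`


# Hypothetical dump lines; the 8-character prefix and the key names are invented.
sample = [
    "12345678get wikitech:resourceloader:foo\r\n",
    "12345678set other 0 0 5\r\n",
    "12345678GET labswiki:bar baz\r\n",
    "short",
]
keys = list(memcached_get_keys(sample))
```

Filtering `keys` for a resourceloader-style prefix would give the candidates to reset, which is the approach ori mentions at 06:51:16.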
[07:15:04] it is worth a shot anyway
[07:15:13] and reading up on passenger docs as well
[07:15:29] which I intend to do as soon as I get some time
[07:16:27] (PS2) Giuseppe Lavagetto: Fix startup of hhvm [debs/hhvm] - https://gerrit.wikimedia.org/r/157798
[07:16:59] <_joe_> or, we could upgrade to trusty, ditch apache on the puppetmasters and use nginx
[07:17:09] indeed, I mentioned mod_passenger because when it crashed a while ago it seemed a lot of people were seeing the same crashes
[07:17:24] <_joe_> (never had a problem with passenger+nginx managing a node app in prod)
[07:17:48] <_joe_> godog: ^^ thanks for the CR on that btw
[07:17:51] indeed that would work too
[07:18:09] I actually dislike mod_passenger
[07:18:31] <_joe_> passenger in its most recent incarnation is actually nice
[07:18:41] <_joe_> we were managing nodejs with it
[07:18:56] I meant the apache module
[07:19:12] <_joe_> well, I dislike apache in general :)
[07:19:17] ahahahaha
[07:19:29] well somebody has to :P
[07:19:31] <_joe_> but you know that from my interview :P
[07:19:42] :-)
[07:20:02] <_joe_> well, it's unfair to say I dislike it.
[07:20:17] <_joe_> I prefer nginx for 99.9% of interesting things you can do
[07:21:14] (CR) Alexandros Kosiaris: "@Daniel. Seems like they are in use since puppet-master is complaining about non @-prepended variables contained in these files in logs." [puppet] - https://gerrit.wikimedia.org/r/157685 (owner: Matanya)
[07:21:14] <_joe_> also, once you get used to the nginx paradigm, you'll end up thinking "how could I live with all those messy rewrites all this time?"
[07:21:38] (CR) Filippo Giunchedi: "we went from 4*1G to 4*10G networking on the frontends, to 10x increase (in theory, in practice I don't think swift could do 10G before ru" [puppet] - https://gerrit.wikimedia.org/r/157678 (owner: Filippo Giunchedi)
[07:21:45] (PS3) Alexandros Kosiaris: deployment: qualify vars [puppet] - https://gerrit.wikimedia.org/r/157685 (owner: Matanya)
[07:22:13] (CR) Giuseppe Lavagetto: Fix startup of hhvm (3 comments) [debs/hhvm] - https://gerrit.wikimedia.org/r/157798 (owner: Giuseppe Lavagetto)
[07:22:47] (PS3) Giuseppe Lavagetto: Fix startup of hhvm [debs/hhvm] - https://gerrit.wikimedia.org/r/157798
[07:22:56] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Fix startup of hhvm [debs/hhvm] - https://gerrit.wikimedia.org/r/157798 (owner: Giuseppe Lavagetto)
[07:23:19] (CR) Alexandros Kosiaris: [C: 2] "Ran through catalogcompiler, noop" [puppet] - https://gerrit.wikimedia.org/r/157685 (owner: Matanya)
[07:23:49] <_joe_> akosiaris: someone fixed the compiler?
[07:23:52] <_joe_> nice to know
[07:23:59] <_joe_> I need to add a cleanup job
[07:24:07] eeeh no
[07:24:21] I fell back to the old one
[07:24:34] what is there needed to be done about that one ?
[07:25:20] <_joe_> cleaning of old jobs leftovers
[07:29:03] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 23:24:49 UTC
[07:31:12] <_joe_> done
[07:34:20] (CR) Alexandros Kosiaris: [C: -1] "A couple of missing stuff, otherwise looks pretty good" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/157856 (owner: Matanya)
[07:39:02] (CR) Filippo Giunchedi: mediawiki::monitoring::errors: report to statsd (2 comments) [puppet] - https://gerrit.wikimedia.org/r/158008 (owner: Ori.livneh)
[07:40:35] (PS5) Filippo Giunchedi: elasticsearch: handle request timeout and increase timeout [puppet] - https://gerrit.wikimedia.org/r/157805
[07:40:41] (CR) Filippo Giunchedi: [C: 2 V: 2] elasticsearch: handle request timeout and increase timeout [puppet] - https://gerrit.wikimedia.org/r/157805 (owner: Filippo Giunchedi)
[07:42:30] (PS2) Matanya: protoproxy: qualify vars [puppet] - https://gerrit.wikimedia.org/r/157856
[07:49:32] (PS1) Giuseppe Lavagetto: Version bump for repackaging. [debs/hhvm] - https://gerrit.wikimedia.org/r/158069
[07:50:43] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Version bump for repackaging. [debs/hhvm] - https://gerrit.wikimedia.org/r/158069 (owner: Giuseppe Lavagetto)
[07:55:26] <_joe_> !log re-enabling mw1192, what we were seeing was probably load and not anything else
[07:55:33] Logged the message, Master
[08:05:20] gerrit doesn't seem to have anchors to single comments I could link to? sadness
[08:08:10] (PS1) Matanya: gerrit: qualify vars [puppet] - https://gerrit.wikimedia.org/r/158071
[08:13:53] <_joe_> is anyone around here proficient with cmake?
[08:14:13] PROBLEM - puppet last run on amslvs3 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:23:03] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC
[08:23:03] PROBLEM - Puppet freshness on elastic1004 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 06:22:45 UTC
[08:31:13] RECOVERY - puppet last run on amslvs3 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[08:42:22] (PS1) Giuseppe Lavagetto: fix install path [debs/hhvm] - https://gerrit.wikimedia.org/r/158072
[08:45:00] (CR) Giuseppe Lavagetto: [C: 2 V: 2] fix install path [debs/hhvm] - https://gerrit.wikimedia.org/r/158072 (owner: Giuseppe Lavagetto)
[09:05:08] (PS2) Filippo Giunchedi: swift: check high load average on backend machines [puppet] - https://gerrit.wikimedia.org/r/157672
[09:05:15] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: check high load average on backend machines [puppet] - https://gerrit.wikimedia.org/r/157672 (owner: Filippo Giunchedi)
[09:07:33] (CR) Filippo Giunchedi: [C: 2 V: 2] releases: do not include hostname in sudoers.d [puppet] - https://gerrit.wikimedia.org/r/157797 (owner: Filippo Giunchedi)
[09:30:03] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 23:24:49 UTC
[09:32:13] (PS1) Giuseppe Lavagetto: Use ini-style comments [debs/hhvm] - https://gerrit.wikimedia.org/r/158075
[09:33:52] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Use ini-style comments [debs/hhvm] - https://gerrit.wikimedia.org/r/158075 (owner: Giuseppe Lavagetto)
[09:36:22] sigh, I'm taking a look at cronspam on tin
[09:43:00] (PS1) Filippo Giunchedi: releases: fully qualify sudo command [puppet] - https://gerrit.wikimedia.org/r/158076
[09:43:39] (CR) Filippo Giunchedi: [C: 2 V: 2] releases: fully qualify sudo command [puppet] - https://gerrit.wikimedia.org/r/158076 (owner: Filippo Giunchedi)
[09:48:37] #thereifixedit
[09:53:03] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[09:53:39] icinga-wm: shush!
[09:54:03] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[10:23:43] PROBLEM - puppet last run on elastic1004 is CRITICAL: CRITICAL: Puppet last ran 14445 seconds ago, expected 14400
[10:24:03] PROBLEM - Puppet freshness on elastic1004 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 06:22:45 UTC
[10:24:03] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC
[10:45:47] (PS4) Giuseppe Lavagetto: beta: use HHVM for all requests [puppet] - https://gerrit.wikimedia.org/r/157823
[10:46:28] (PS1) JanZerebecki: icinga check wikidata: escape arguments [puppet] - https://gerrit.wikimedia.org/r/158081
[10:49:23] anyone up for merging a 1 line patch that hopefully corrects an icinga check so it goes green?^^
[10:49:45] <_joe_> jzerebecki: let me see
[10:50:56] (CR) Giuseppe Lavagetto: [C: 2] icinga check wikidata: escape arguments [puppet] - https://gerrit.wikimedia.org/r/158081 (owner: JanZerebecki)
[10:51:05] <_joe_> simple enough
[10:52:27] <_joe_> jzerebecki: and thanks once again :)
[10:54:06] thank you :)
[11:00:38] _joe_: can I trouble you please ?
[11:06:13] <_joe_> matanya: yes
[11:06:16] <_joe_> :)
[11:06:26] can you please update https://etherpad.wikimedia.org/p/puppet3
[11:06:35] with up to date data at some point ?
[11:06:59] <_joe_> yes! sorry I left that back
[11:07:39] no rush, i still have some modules i didn't finish
[11:07:50] was very busy lately
[11:17:19] !log run gmond on elastic1001 manually to debug ES collector issues
[11:17:24] Logged the message, Master
[11:21:31] <_joe_> matanya: done btw
[11:21:38] thank you!
[11:22:00] <_joe_> and I'll be back later :) [11:22:25] <_joe_> we're down to less than 500 warnings, we had more than 700 [11:22:48] <_joe_> I'll work on this sooner than later [11:23:23] akosiaris: Hi, are there any updates regarding the Mathoid deployment? [11:23:48] !log run gmond on elastic1002 manually to debug ES collector issues [11:23:54] Logged the message, Master [11:24:14] physikerwelt: yes, machines are almost assigned. Probable names sca1001, sca1002 [11:24:58] physikerwelt: almost == barring any unexpected changes [11:27:39] akosiaris: Thank you very much for supporting Mathoid. Just ping me if it makes sense to update https://gerrit.wikimedia.org/r/#/c/156576/ [11:29:32] physikerwelt: ok, will do [11:31:03] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 23:24:49 UTC [11:54:49] (03PS1) 10Matanya: pybal: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/158086 [12:25:03] PROBLEM - Puppet freshness on elastic1004 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 06:22:45 UTC [12:25:03] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [12:41:07] <_joe_> !log mw1120: remove from pybal, schedule downtime, reimage to HAT [12:41:13] Logged the message, Master [12:42:03] <_joe_> !log typo: mw1020, not mw1120 [12:42:10] Logged the message, Master [12:46:36] anyone any idea why my icinga check gives me: 'The command defined for service ... does not exist' https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wikidata&service=check+if+wikidata.org+dispatch+lag+is+higher+than+2+minutes [12:53:03] <_joe_> jzerebecki: sorry, can't help you now [12:54:50] Did wikitech CSS re-break? [12:55:06] Reedy: Ori restarted memcache, but it seems re-broken.
[12:55:16] It's quite likely [12:55:22] If people hit virt1000 and pollute the cache [12:57:29] (03PS1) 10Giuseppe Lavagetto: reimage: fix unbound variable [puppet] - 10https://gerrit.wikimedia.org/r/158089 [12:59:14] Hopefully we can get it finished up today [13:00:05] K4-713: Respected human, time to deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140903T1300). Please do the needful. [13:13:11] (03CR) 10Giuseppe Lavagetto: [C: 032] reimage: fix unbound variable [puppet] - 10https://gerrit.wikimedia.org/r/158089 (owner: 10Giuseppe Lavagetto) [13:30:20] nostalgia skin on wikitech wiki? [13:30:35] aude: borked css [13:30:40] :( [13:30:48] virt1000 is seemingly polluting memcached due to the different driver used etc [13:31:04] though, bar missing tables for extensions, I think virt1000 is ready to go [13:31:12] <_joe_> I actually prefer it this way, btw [13:31:49] heh [13:32:03] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 23:24:49 UTC [13:33:07] Didn't we have a skin for this before? 
;) [13:36:27] (03PS4) 10BBlack: Remove unused foo-lb.(eqiad|esams).wm.o A/AAAA recs [dns] - 10https://gerrit.wikimedia.org/r/157979 [13:38:13] (03CR) 10Mark Bergsma: [C: 031] Remove unused foo-lb.(eqiad|esams).wm.o A/AAAA recs [dns] - 10https://gerrit.wikimedia.org/r/157979 (owner: 10BBlack) [13:38:56] (03CR) 10BBlack: [C: 031] Remove unused foo-lb.(eqiad|esams).wm.o A/AAAA recs [dns] - 10https://gerrit.wikimedia.org/r/157979 (owner: 10BBlack) [13:39:03] bblack: <3 [13:39:15] hi :) [13:39:21] many many thanks for dealing with this :) [13:39:36] and for doing it properly [13:39:51] my patchsets were a bit sloppy I guess :) [13:40:14] fortunately you've got some discipline slammed into you now [13:40:19] haha [13:40:19] ;) [13:40:22] heh [13:42:21] (03CR) 10BBlack: [C: 032] Remove unused foo-lb.(eqiad|esams).wm.o A/AAAA recs [dns] - 10https://gerrit.wikimedia.org/r/157979 (owner: 10BBlack) [13:42:38] andrewbogott_afk: ping me when you're around [13:42:38] :) [13:43:58] wow, gerrit is slow [13:44:58] (03PS3) 10BBlack: Remove revdns for unused project-lb.site hostnames [dns] - 10https://gerrit.wikimedia.org/r/157980 [13:45:06] (03PS3) 10BBlack: Remove references to deprecated $project-lb.wm.o names [dns] - 10https://gerrit.wikimedia.org/r/157981 [13:45:11] (03PS3) 10BBlack: Remove actual $project-lb.wm.o domainnames [dns] - 10https://gerrit.wikimedia.org/r/157982 [13:45:51] ^ I have some puppet bits and more looking at real traffic to do before those, don't worry! [13:54:49] Reedy, ori, bd808|BUFFER, can someone catch me up on what's happening with wikitech? [13:55:30] andrewbogott: So AFAIK we only have 2 outstanding issues [13:55:46] ori sorted the login/memcached issue [13:55:50] I'd used the wrong port in the config [13:56:04] Are wikitech and virt1000 now using the same memcache? [13:56:09] yup [13:56:14] Because I thought that previously we had twemcache for virt1000? [13:56:19] (Or does it use both?) 
[13:56:24] twemcache is a memcached proxy [13:56:27] ok [13:56:36] which is causing the problems you may have seen on wikitech (no css/js etc) [13:56:38] So, is cache collision presumed to be the reason why wikitech is busted now? [13:56:41] that's not a big deal [13:56:42] Right [13:56:52] Hm, I would've thought that the memcache keys included a unique key... [13:56:59] it uses the dbname [13:57:26] oh, of course. [13:57:38] * Reedy facepalms [13:57:41] I didn't think of this last night [13:57:44] OK, anyway -- you think we can get virt1000 working fast enough that we don't need to worry about fixing wikitech? [13:57:58] We could just set $wgCachePrefix for virt1000 and restart memcached to clear the bad stuff [13:58:14] ok, let's do that then :) [13:58:33] I think we can fix it quickly enough though... The only problem on virt1000 (that I've seen) currently is from the extra extensions that need database tables [13:58:39] which requires running update.php to create them [13:58:47] ok, shall I do that right now? [13:59:08] please. ori tried last night but seemingly ran into some username/password issues [13:59:24] (03CR) 10Jgreen: "The 5GB threshold was set based on intended design, not disk capacity. The OCG server should keep the output dir pruned. Matt assured me t" [puppet] - 10https://gerrit.wikimedia.org/r/158036 (owner: 10Dzahn) [13:59:29] but I think that might be due to virt1000 using a mix of production and its own private settings [13:59:45] I need to use mwscript, right? [14:00:15] yeah [14:00:34] sudo -u apache php multiversion/MWScript.php php-1.24wmf15/maintenance/update.php <- that one? [14:00:54] sudo -u apache php multiversion/MWScript.php update.php --wiki=labswiki [14:01:15] can you explain the distinction in path/to/update.php there? [14:01:28] mwscript by default looks in the maintenance folder [14:01:40] because of the wiki parameter, it knows what version/folder of mediawiki to look in [14:01:49] ok, so it's the same command ultimately?
[14:02:11] yup, I think yours should work fine, but the --wiki=labswiki would've still been required [14:02:44] apparently prod expects the username to be 'wikiadmin' [14:02:50] right [14:02:54] whereas on wikitech the user is 'wikiuser' [14:02:59] so, I think the easy fix to that... [14:03:04] Can you talk me through the mysql commands to duplicate that account? [14:04:25] bleugh, where is the WikitechPrivateSettings.php actually included? [14:04:39] at the bottom of PrivateSettings.php [14:04:48] but isn't it easy enough to just add a new admin user to the db? [14:05:02] (Thus less divergence...) [14:05:10] probably, yeah [14:05:28] .oO( which db are we talking about? ) [14:05:49] springle, wikitech's wiki db uses a non-standard admin name, we're trying to standardize things. [14:05:56] springle: virt1000s install of mysql/mariadb [14:06:10] springle: if you have a moment to log in to virt1000 and create the new user that'd be great, since otherwise I'll probably mess it up :) [14:06:12] ewww, it's still mysql ;) [14:06:28] andrewbogott: can do [14:06:32] thanks springle [14:06:41] springle: we need to preserve the old admin name as well, though, if possible. [14:06:44] Rather than just rename [14:06:56] andrewbogott: while that's going on... I made both the recombined mediawiki-config commit and one for the wikitech apache config [14:07:14] Yep, I saw but haven't read them yet. thanks [14:08:30] (03CR) 10Andrew Bogott: [C: 031] "Looks good, but of course should not be merged yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158035 (owner: 10Reedy) [14:16:37] Reedy: ok, things look pretty good now... [14:16:44] OpenStackManager can't talk to ldap. [14:17:01] But I can log in and view most everything, no problems. [14:17:11] Dynamic sidebar is also not working. any ideas there? [14:17:17] (I'll investigate the ldap issue) [14:19:49] What should I see if DynamicSidebar is working? 
[14:20:05] Reedy, I just set $wgCachePrefix for wikitech and restarted memcached; still no css [14:20:19] The sidebar would have disclosure widgets [14:20:22] rather than just a list of everything [14:20:43] It looks normal to me [14:20:44] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures [14:20:47] Labs Users [14:20:47] Manage Service Groups [14:20:47] Labs Projectadmins [14:20:47] Manage Projects [14:20:47] Manage Instances [14:21:13] Do you know what I mean by 'disclosure widget'? [14:21:27] Anyway, maybe we should fix wikitech, then I can show you what I mean. [14:21:29] Nope [14:22:08] What did you set the cache prefix to? [14:22:22] Carmela: You said when ori restarted memcached last night that fixed the css, right? [14:22:44] $wgCachePrefix = 'wikitechtemp'; [14:23:14] andrewbogott: wikitech LGTM in incognito mode [14:23:25] not in normal chrome, but chrome is a bitch at over caching [14:23:35] Ah, F5 fixed it [14:23:50] You're right, it's better. [14:23:54] lemme see if I can log in and such [14:24:36] Yeah, seems better. [14:24:49] And, no disclosures there either. So I guess we'll ignore that one for now :) [14:25:01] I'll see if I can sort out the ldap issue [14:25:19] heh [14:25:21] Neaaarly there :) [14:26:03] PROBLEM - Puppet freshness on elastic1004 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 06:22:45 UTC [14:26:03] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [14:37:13] (03CR) 10Manybubbles: "Does this kind of thing need an RT ticket?" [puppet] - 10https://gerrit.wikimedia.org/r/152724 (owner: 10Hoo man) [14:38:40] manybubbles: Thanks for poking at that [14:38:44] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [14:38:59] hoo: no problem! [14:40:07] Hey SWATters, anyone already claim today? (should we set up some kind of rotation?) 
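The cache collision discussed above (two installs sharing one memcached and deriving their keys from the same dbname) can be sketched roughly as follows. This is a minimal Python illustration, not MediaWiki code: the `make_key`, `store` and `fetch` helpers, the dict standing in for memcached, the key layout, and the assumption that both installs use the dbname `labswiki` are all made up for the example. Only the `wikitechtemp` prefix value is taken from the log.

```python
# Hypothetical model: a plain dict stands in for the shared memcached instance.
shared_cache = {}

def make_key(dbname, name, prefix=None):
    """Build a cache key; an explicit prefix takes the place of the dbname."""
    return f"{prefix or dbname}:{name}"

def store(dbname, name, value, prefix=None):
    shared_cache[make_key(dbname, name, prefix)] = value

def fetch(dbname, name, prefix=None):
    return shared_cache.get(make_key(dbname, name, prefix))

# Two installs with the same dbname write the same key, so the second
# write silently replaces what the first install cached.
store("labswiki", "resourceloader:css", "css built by old host")
store("labswiki", "resourceloader:css", "css built by new host")
collided = fetch("labswiki", "resourceloader:css")

# After clearing the cache and giving one install its own prefix,
# the two installs no longer share keys.
shared_cache.clear()
store("labswiki", "resourceloader:css", "css built by old host", prefix="wikitechtemp")
store("labswiki", "resourceloader:css", "css built by new host")
old_host_view = fetch("labswiki", "resourceloader:css", prefix="wikitechtemp")
new_host_view = fetch("labswiki", "resourceloader:css")
```

The `shared_cache.clear()` call plays the role of the memcached restart mentioned in the log: entries written before the prefix change would otherwise stay in the cache under the old keys.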
[14:40:16] do i have access to datasets? [14:40:31] marktraceur: I was just about to ask manybubbles if he wanted to take it, since 2 of the 3 patches are his [14:40:39] That sounds sane [14:40:40] Reedy, I need to change venues, back later. [14:40:40] * aude doesn't need sudo but would be helpful to see logs in case something fails [14:40:47] anomie: I'll do today! [14:41:03] we pretty much just claim day of [14:41:10] but a rotation would make some degree of sense [14:41:56] marktraceur: Usually it's like what I did yesterday: ping all SWATters who're online when I think of it around 14:30 UTC or so [14:42:34] marktraceur: Or else if one of us has patches, the usual assumption is that that person will do it and I just double-check that. [14:42:57] RECOVERY - Host mw1178 is UP: PING OK - Packet loss = 0%, RTA = 4.02 ms [14:43:25] (03PS2) 10Giuseppe Lavagetto: Ensure that dependent packages for Trebuchet are installed [puppet] - 10https://gerrit.wikimedia.org/r/157299 (owner: 10BryanDavis) [14:43:33] Sure [14:43:44] RECOVERY - SSH on mw1178 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [14:43:44] RECOVERY - RAID on mw1178 is OK: OK: no RAID installed [14:43:44] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.176 second response time [14:43:53] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Ensure that dependent packages for Trebuchet are installed [puppet] - 10https://gerrit.wikimedia.org/r/157299 (owner: 10BryanDavis) [14:43:58] * marktraceur creates daily calendar events [14:45:04] greg-g: Speaking of calendar events, should you update the guest lists for the existing SWAT Deploy calendar events to reflect the current SWATters? 
[14:45:17] (03PS1) 10Filippo Giunchedi: elasticsearch: fetch ganglia stats asynchronously [puppet] - 10https://gerrit.wikimedia.org/r/158104 [14:45:44] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:45:48] (03PS1) 10Cmjohnson: Revert "Remove mw1178 from mediawiki-installation, deaded" [puppet] - 10https://gerrit.wikimedia.org/r/158105 [14:48:18] (03PS2) 10Cmjohnson: Revert "Remove mw1178 from mediawiki-installation, deaded" [puppet] - 10https://gerrit.wikimedia.org/r/158105 [14:48:43] PROBLEM - RAID on mw1020 is CRITICAL: Connection refused by host [14:48:43] PROBLEM - SSH on mw1020 is CRITICAL: Connection refused [14:48:44] PROBLEM - check if dhclient is running on mw1020 is CRITICAL: Connection refused by host [14:48:44] PROBLEM - nutcracker port on mw1020 is CRITICAL: Connection refused by host [14:48:44] (03CR) 10Manybubbles: [C: 031] elasticsearch: fetch ganglia stats asynchronously [puppet] - 10https://gerrit.wikimedia.org/r/158104 (owner: 10Filippo Giunchedi) [14:48:53] PROBLEM - puppet last run on mw1020 is CRITICAL: Connection refused by host [14:49:06] (03CR) 10Manybubbles: "Looks sane enough and if it resolves the problem then I'm happy with it." 
[puppet] - 10https://gerrit.wikimedia.org/r/158104 (owner: 10Filippo Giunchedi) [14:49:13] PROBLEM - Apache HTTP on mw1020 is CRITICAL: Connection refused [14:49:13] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Epic puppet fail [14:49:13] PROBLEM - DPKG on mw1020 is CRITICAL: Connection refused by host [14:49:23] PROBLEM - nutcracker process on mw1020 is CRITICAL: Connection refused by host [14:49:23] PROBLEM - Disk space on mw1020 is CRITICAL: Connection refused by host [14:49:33] PROBLEM - check configured eth on mw1020 is CRITICAL: Connection refused by host [14:50:34] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Puppet has 1 failures [14:50:44] PROBLEM - puppet last run on nfs1 is CRITICAL: CRITICAL: Puppet has 1 failures [14:51:08] <_joe_> ok mw1020 is being re-imaged again [14:51:13] <_joe_> so, we will wait [14:51:17] (03CR) 10Ottomata: "Looks good. Can you add some comment documentation as to why you are doing this? It wouldn't be clear to someone reading the code as is " [puppet] - 10https://gerrit.wikimedia.org/r/158104 (owner: 10Filippo Giunchedi) [14:51:37] (03CR) 10Chad: [C: 031] "Let's give it a try. Can't be worse :)" [puppet] - 10https://gerrit.wikimedia.org/r/158104 (owner: 10Filippo Giunchedi) [14:51:43] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Puppet has 1 failures [14:51:45] (03CR) 10Cmjohnson: [C: 032] Revert "Remove mw1178 from mediawiki-installation, deaded" [puppet] - 10https://gerrit.wikimedia.org/r/158105 (owner: 10Cmjohnson) [14:51:52] ok, manybubbles hiya [14:52:01] ottomata: hi! 
[14:52:05] so, elastic1016 has been running for a few days now with the newer SSDs [14:52:34] load isn't noticeably different [14:52:40] !log adding mw1178 back to pybal [14:52:43] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Epic puppet fail [14:52:46] Logged the message, Master [14:52:54] PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: Epic puppet fail [14:53:05] !log running sync-common on mw1178 [14:53:11] Logged the message, Master [14:54:07] io wait is kinda about the same too [14:54:18] manybubbles: should we try to force it to have a heavily used shard? [14:54:43] RECOVERY - SSH on mw1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [14:54:51] ottomata: oh, do you want me to try to push some traffic to that server? [14:55:03] PROBLEM - puppet last run on sanger is CRITICAL: CRITICAL: Puppet has 1 failures [14:55:19] yes? heh, we are trying to do a production perf test on these ssds, so, ideally we could compare it to a heavily and similarly loaded server [14:55:40] i *think* we just want to confirm that these SSDs are not worse, right? so we can feel confident about ordering them [14:55:43] godog, thoughts? [14:57:14] ottomata manybubbles yep definitely at least not worse, we could try some traffic on 1016 and another machine and see if we can spot differences [14:58:13] PROBLEM - puppet last run on linne is CRITICAL: CRITICAL: Puppet has 1 failures [14:58:40] ottomata: btw in https://gerrit.wikimedia.org/r/#/c/158104/ the explanation is in the commit message, I can add some comments too [14:58:43] PROBLEM - NTP on mw1020 is CRITICAL: NTP CRITICAL: No response from NTP server [14:59:17] swat time [14:59:35] give me a few minutes to provide submodule patch [15:00:05] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140903T1500). Please do the needful. 
[15:00:26] aude: k [15:00:34] gerrit is horribly extra slow for me [15:00:43] PROBLEM - puppet last run on nickel is CRITICAL: CRITICAL: Puppet has 1 failures [15:00:46] ottomata: give me a few minutes to do swat and then I'll generate some traffic [15:01:03] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [15:01:14] PROBLEM - puppet last run on es4 is CRITICAL: CRITICAL: Puppet has 1 failures [15:01:23] (03CR) 10Manybubbles: [C: 032] Cirrus: Switch group1 wikis to all fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157861 (owner: 10Manybubbles) [15:01:29] (03CR) 10Manybubbles: [C: 032] Turn on job throttling for Cirrus template jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157215 (owner: 10Manybubbles) [15:01:33] (03Merged) 10jenkins-bot: Cirrus: Switch group1 wikis to all fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157861 (owner: 10Manybubbles) [15:01:35] <_joe_> aude: mw1020 is being reinstalled now [15:01:35] (03Merged) 10jenkins-bot: Turn on job throttling for Cirrus template jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157215 (owner: 10Manybubbles) [15:01:40] ok [15:01:44] <_joe_> just FYI [15:01:55] <_joe_> it's depooled from production for the moment [15:02:04] <_joe_> (but not from scap) [15:02:05] godog, yeah i know its in the commit, but think about future folks having to use this code [15:02:19] they're not going to see that commit message when they are trying to understand motivations for doing things [15:02:37] ok, thanks manybubbles [15:02:59] ottomata: yeah that's true, I'll add some comments [15:03:10] thanks [15:03:10] !log manybubbles Synchronized wmf-config/: SWAT deploy cirrus config changes (duration: 00m 06s) [15:03:16] Logged the message, Master [15:03:54] PROBLEM - puppet last run on ms1001 is CRITICAL: CRITICAL: Puppet has 1 failures [15:04:03] so slooooooow [15:04:03] RECOVERY - 
Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:04:24] <^d> ottomata, godog, manybubbles: Maybe we could restart the node like during a rolling restart. [15:04:36] <^d> See how it behaves when shards are initializing. [15:04:41] manybubbles: https://gerrit.wikimedia.org/r/#/c/158107/ [15:04:50] hm, sure, but we should do that for a different node as well [15:04:51] to compare [15:05:08] put patch on the wiki [15:05:11] <^d> We could, sure. [15:05:24] !log https://gerrit.wikimedia.org/r/#/c/157861/ didn't work as expected - dropped everything out of using the all field...... [15:05:30] Logged the message, Master [15:05:38] ^d: could you have a look at https://gerrit.wikimedia.org/r/#/c/157861/ ? its not working properly [15:06:13] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 11783 bytes in 0.002 second response time [15:06:24] PROBLEM - puppet last run on ms1004 is CRITICAL: CRITICAL: Puppet has 1 failures [15:06:26] (03PS2) 10Filippo Giunchedi: elasticsearch: fetch ganglia stats asynchronously [puppet] - 10https://gerrit.wikimedia.org/r/158104 [15:06:34] ^d: yep anything that generates sustained disk i/o would do [15:06:38] <^d> manybubbles: Looking. [15:06:41] ^d: simplest thing is for me to replay enwiki trafic and see if that hits it [15:06:43] PROBLEM - puppet last run on tridge is CRITICAL: CRITICAL: Puppet has 1 failures [15:07:05] (03CR) 10Ottomata: [C: 032] elasticsearch: fetch ganglia stats asynchronously [puppet] - 10https://gerrit.wikimedia.org/r/158104 (owner: 10Filippo Giunchedi) [15:07:07] !log manybubbles Synchronized wmf-config/: SWAT deploy cirrus config changes - make sure to get mw1020 (duration: 00m 04s) [15:07:13] Logged the message, Master [15:07:30] !log mw1020 gets WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! 
during sync-dir call [15:07:36] Logged the message, Master [15:11:09] aude: syncing you [15:11:11] !log manybubbles Synchronized php-1.24wmf19/extensions/Wikidata/: (no message) (duration: 00m 07s) [15:11:13] ok [15:11:18] Logged the message, Master [15:12:05] <^d> manybubbles: no clue why it's not working yet :\ [15:12:45] i wonder if more things need touching [15:12:48] !log deployed throttling for Cirrus job named cirrusSearchLinksUpdate - it handles updating the index when a transcluded page changes - we'll have to check on the backlog over the next few hours/days to see if it stabilizes [15:12:55] Logged the message, Master [15:12:57] * aude hates resource loader [15:12:58] aude: probably not :) [15:13:14] I _think_ it has to do with how difficult it is to call out just the wikipedias [15:13:20] with the resource loader [15:16:18] manybubbles: when i cleared local storage, the style is fixed [15:16:24] now i can't reproduce the bug [15:17:48] aude: so good? [15:17:49] works for lydia [15:17:52] cool [15:17:59] good enough for me [15:18:15] ottomata: so I tried replaying some traffic - and that worked to cause some load but not in the places we wanted [15:19:30] (03PS3) 10Reedy: Re-enable all Math modes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/139421 (https://bugzilla.wikimedia.org/66587) [15:19:58] (03PS1) 10Manybubbles: Switch all wikis to use Cirrus' all fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158109 [15:20:23] RECOVERY - Disk space on mw1020 is OK: DISK OK [15:20:33] RECOVERY - check configured eth on mw1020 is OK: NRPE: Unable to read output [15:20:43] RECOVERY - RAID on mw1020 is OK: OK: no RAID installed [15:20:44] RECOVERY - check if dhclient is running on mw1020 is OK: PROCS OK: 0 processes with command name dhclient [15:21:13] RECOVERY - DPKG on mw1020 is OK: All packages OK [15:21:33] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:38] (03CR) 10Physikerwelt: [C: 04-1] "Blocked 
by Idb610901b17bf8a4698dff5769623591fcad01b7" [puppet] - 10https://gerrit.wikimedia.org/r/156576 (https://bugzilla.wikimedia.org/69990) (owner: 10Physikerwelt) [15:21:53] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: Puppet has 2 failures [15:22:44] RECOVERY - nutcracker port on mw1020 is OK: TCP OK - 0.000 second response time on port 11212 [15:23:02] (03CR) 10Manybubbles: [C: 04-1] Switch all wikis to use Cirrus' all fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158109 (owner: 10Manybubbles) [15:23:23] RECOVERY - nutcracker process on mw1020 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [15:23:51] so, manybubbles, can we do a restart like ^d suggested? [15:23:57] i'm not expecting much change really [15:24:11] restart the than, watch io, let it recover shareds [15:24:15] thang* [15:24:18] then do the same with another node [15:24:29] <^d> Either that or throw traffic at it like manybubbles suggested. [15:24:46] ^d: already tried throwing traffic at it - but that just failed on another node first [15:24:56] I really need to think that through too.... [15:25:07] ottomata: k - restart is fine with me [15:25:07] <^d> :\ [15:25:59] (03PS2) 10Manybubbles: Switch non-wikipedia to Cirrus weighted all field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158109 [15:26:08] ^d: can you review https://gerrit.wikimedia.org/r/#/c/158109 ? [15:26:17] might do it durring this swat if it makes sense to you [15:26:39] <^d> Heh, it should work [15:26:55] (03CR) 10Chad: [C: 032] "Mmm, sledgehammers are the best solutions :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158109 (owner: 10Manybubbles) [15:26:59] (03Merged) 10jenkins-bot: Switch non-wikipedia to Cirrus weighted all field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158109 (owner: 10Manybubbles) [15:27:15] manybubbles: shoudl we wait til you guys are not actively messing with the cluster for other stuff? 
[15:27:55] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT - Update another cirrus config - this time maybe it will work (duration: 00m 05s) [15:28:01] Logged the message, Master [15:29:48] ^d: that did the trick [15:29:49] (03CR) 10GWicke: "The clean-up is definitely happening. There were very few files older than 3 days, I'd guess the clean-up period is 4 days? Individual PDF" [puppet] - 10https://gerrit.wikimedia.org/r/158036 (owner: 10Dzahn) [15:30:51] ottomata: oh! we can use the "slow" way to push shards on to and off of the box and see how that feels. though, the best test would be to slam it with search traffic. but we can't really do that because its all routed around [15:31:33] RECOVERY - NTP on mw1020 is OK: NTP OK: Offset -0.006935477257 secs [15:32:12] (03PS3) 10Filippo Giunchedi: elasticsearch: fetch ganglia stats asynchronously [puppet] - 10https://gerrit.wikimedia.org/r/158104 [15:32:19] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] elasticsearch: fetch ganglia stats asynchronously [puppet] - 10https://gerrit.wikimedia.org/r/158104 (owner: 10Filippo Giunchedi) [15:33:03] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 23:24:49 UTC [15:34:04] (03CR) 10Dzahn: "so the password is shared between researchers yet we don't know who the researchers are. i'd recommend to change the password once we actu" [puppet] - 10https://gerrit.wikimedia.org/r/155452 (owner: 10Dzahn) [15:38:36] ottomata: feel free to do what you want with the machine so long as you only bring that one machine down. I'm going to step out for a few minutes if that is ok [15:38:53] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:39:20] (03CR) 10Dzahn: "nope, there are ' around the entire command and in the middle of it. won't work. 
escaping is really annoying here, also see other people c" [puppet] - 10https://gerrit.wikimedia.org/r/158081 (owner: 10JanZerebecki) [15:39:23] go ahead manybubbles, i'm looking at some kafka atm, maybe godog and I can sync up and make a plan in a bit... [15:39:34] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [15:41:39] ottomata: sounds good, I had in mind basically to slam the disk for long enough and see how it does [15:41:58] <_joe_> !log mw1020 correctly reimaged, putting it in the hhvm pool [15:42:05] Logged the message, Master [15:48:15] (03PS5) 10Giuseppe Lavagetto: beta: use HHVM for all requests [puppet] - 10https://gerrit.wikimedia.org/r/157823 [15:48:16] ottomata: anyways I'm almost out of the door and will be off thurs/fri, will check email every now and then though [15:49:02] (03CR) 10jenkins-bot: [V: 04-1] beta: use HHVM for all requests [puppet] - 10https://gerrit.wikimedia.org/r/157823 (owner: 10Giuseppe Lavagetto) [15:50:16] ooof, ok godog, [15:54:03] someone review my dns changgggeeee [15:54:03] https://gerrit.wikimedia.org/r/#/c/158025/ [15:54:12] win 3 [15:54:38] * robh is intentionally not solo-reviewing things but its not easy as it lacks instant gratification [15:54:55] robh: I love how you +1'd your own patch :D [15:55:23] it looks great to me, made sense [15:55:26] heh [15:56:55] (03CR) 10John F. Lewis: [C: 031] "Looks sane is all I'll say." [dns] - 10https://gerrit.wikimedia.org/r/158025 (owner: 10RobH) [15:57:14] JohnFLewis: thanks =] [15:57:18] robh: well it looks sane to me :p [15:57:23] its simple change, but a missing . 
can break things [15:57:27] so always good to have another set of eyes [15:59:11] (03CR) 10RobH: [C: 032] assign new mgmt ips to codfw cisco systems [dns] - 10https://gerrit.wikimedia.org/r/158025 (owner: 10RobH) [15:59:56] and that change was blocking me setting up dns entries for all the other servers, so success [16:00:47] robh: great :p feel free to add me to any other changes for review if you want [16:01:13] (03PS1) 10Yuvipanda: labmon: Setup icinga alerts for betalabs puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/158111 [16:01:27] mutante: can you help adding people to the contactgroups in the private repo? [16:01:44] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Epic puppet fail [16:02:56] (03PS2) 10Yuvipanda: labmon: Setup icinga alerts for betalabs puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/158111 [16:02:58] greg-g: ^ first pass at puppet failure monitoring for betalabs. alerts set only to you :) [16:03:03] should probably add more people [16:03:15] YuviPanda: yes, as long as you dont need paging [16:03:29] mutante: yeah, emails + flailing on IRC should be good enough [16:04:22] mutante: can you add yuvipanda (yuvipanda@wikimedia.org), greg-g (greg@wikimedia.org) and chrismcmahon ) [16:05:42] I like emails and flailing on IRC [16:06:39] (03PS1) 10Aude: Put wikibase cache settings together [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158112 [16:06:41] (03PS1) 10Aude: Add Wikibase properties to suggester blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158113 (https://bugzilla.wikimedia.org/70346) [16:07:08] greg-g: who else should I add? 
[16:08:52] greg-g: and long walks on the beach [16:08:55] YuviPanda: antoine [16:09:02] chrismcmahon: :) [16:10:34] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [16:10:44] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [16:12:20] greg-g: chrismcmahon cool, I'll wait for mutante to add them to the private repo and give me names [16:12:26] http://labmon.wmflabs.org/render/?width=586&height=308&_salt=1409760009.501&target=deployment-prep.*.puppetagent.failed_events.value looks clean currently tho [16:12:33] not many failures! [16:14:23] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Epic puppet fail [16:14:39] (03PS1) 10Andrew Bogott: Added some latent wikitech debug settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158114 [16:15:29] (03CR) 10Andrew Bogott: [C: 032] Added some latent wikitech debug settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158114 (owner: 10Andrew Bogott) [16:15:35] (03Merged) 10jenkins-bot: Added some latent wikitech debug settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158114 (owner: 10Andrew Bogott) [16:16:34] !log andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 05s) [16:16:35] (03CR) 10Manybubbles: "This seems to have worked." [puppet] - 10https://gerrit.wikimedia.org/r/158104 (owner: 10Filippo Giunchedi) [16:16:43] Logged the message, Master [16:16:46] (03PS3) 10Yuvipanda: labmon: Setup icinga alerts for betalabs puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/158111 [16:16:49] YuviPanda: ok, committed in private repo. i set all 3 to the defaults [16:17:05] mutante: \o/ cool. wanna review ^? :) [16:17:19] YuviPanda: defaults means .. 24/7 notification period, notified on c,r,f ...notify by email... 
[16:17:31] c,r,f = crit, recover, flap [16:17:32] cool, since it's just by email that should be fine [16:18:41] !log andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 04s) [16:18:48] Logged the message, Master [16:19:00] (03PS4) 10Yuvipanda: labmon: Setup icinga alerts for betalabs puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/158111 [16:19:08] mutante: ^ actually applies to role to labmon [16:19:35] (03CR) 10Dzahn: labmon: Setup icinga alerts for betalabs puppet failures (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/158111 (owner: 10Yuvipanda) [16:19:41] !log andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 03s) [16:19:46] * andrewbogott has to make the same mistake at least three times [16:19:47] (03CR) 10John F. Lewis: [C: 031] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/153987 (owner: 10Dzahn) [16:19:48] Logged the message, Master [16:20:08] YuviPanda: gregg ( i wasn't sure about the - char), yuvipanda (i see you even use the short version on LinkedIn) :) [16:20:34] (03PS5) 10Yuvipanda: labmon: Setup icinga alerts for betalabs puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/158111 [16:20:46] mutante: :) I use it everywhere. HR sends me email as Yuvi as well [16:21:39] mutante: updated with new names [16:21:43] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:21:58] (03CR) 10Arlolra: [C: 031] Update Parsoid extension require path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157177 (owner: 10GWicke) [16:23:16] Reedy: ok, the ldap auth problem has to do with order of loading. If I move the private password settings to wikitech.php everything works. [16:23:43] My guess is that one of the values in OpenStackManager has a default which clobbers the value that we set in PrivateSettings [16:23:53] …and that the extension is loaded after PrivateSettings, which seems likely. 
[16:24:05] Is there a proper way to set defaults in an extension so that doesn't happen? [16:24:23] YuviPanda: you will likely hate me for this, but can you align the =>'s please :) [16:24:33] …otherwise I can just include the private settings from wikitech.php [16:24:36] nooooooo [16:24:38] fine :P [16:24:52] YuviPanda: it's a warning for each line in compiler output :) thx [16:25:00] bd808: same question :) [16:25:19] It might be easier just doing that [16:25:19] * bd808 reads backscroll [16:25:20] mutante: oh, didn't know [16:25:22] (03PS6) 10Yuvipanda: labmon: Setup icinga alerts for betalabs puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/158111 [16:25:23] mutante: done [16:25:33] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:25:43] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [16:25:46] greg-g: now we need to make puppet fail to see if it is working [16:26:41] andrewbogott: dependency resolution and ordering is scary in MW. If you can make it work by importing your overrides later in the initialization order that works for me. [16:26:54] YuviPanda: cool! 
just fyi, here they are https://integration.wikimedia.org/ci/job/operations-puppet-puppetlint-strict/7014/console -> "indentation" [16:27:06] PROBLEM - Puppet freshness on elastic1004 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 06:22:45 UTC [16:27:06] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [16:27:18] aaah, nice [16:27:53] lol @ ERROR mobile::vumi::iptables-purges [16:28:03] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC [16:28:45] (03CR) 10Dzahn: [C: 032] labmon: Setup icinga alerts for betalabs puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/158111 (owner: 10Yuvipanda) [16:28:53] \o/ [16:29:08] mutante: can you force a puppet run on the icinga host? [16:29:30] YuviPanda: yep, on it [16:29:39] bah [16:29:41] puppet failure [16:30:15] (03PS1) 10Yuvipanda: labmon: Pass 'warning' threshold to betalabs puppet check [puppet] - 10https://gerrit.wikimedia.org/r/158117 [16:30:17] mutante: ^ [16:32:03] PROBLEM - Puppet freshness on rcs1001 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:31:19 UTC [16:32:03] PROBLEM - Puppet freshness on rcs1002 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:30:54 UTC [16:32:19] YuviPanda: yes, puppet fail, but not caused by your change , heh [16:32:22] Error: Could not find any hostgroup matching 'misc_esams' [16:32:23] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:32:24] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Epic puppet fail [16:32:26] akosiaris: ^ :p [16:32:31] mutante: oh, puppet fail on labmon is from me :) [16:32:41] YuviPanda: well, more than one fail then, i meant on neon :) [16:32:46] aaaah [16:32:47] heh [16:35:00] (03PS1) 10Andrew Bogott: Split private wikitech settings files into two. 
[puppet] - 10https://gerrit.wikimedia.org/r/158118 [16:35:47] bd808: ^ for starters... [16:35:47] (03PS1) 10JanZerebecki: icinga check wikidata: avoid double quotes [puppet] - 10https://gerrit.wikimedia.org/r/158119 [16:35:57] scfc_de: see https://gerrit.wikimedia.org/r/#/c/158111/, we can have toollabs alerts the same way [16:36:10] (03PS1) 10Jforrester: Switch SpecialCite out for CiteThisPage on phase0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158120 [16:36:12] (03PS1) 10Jforrester: Switch from SpecialCite to CiteThisPage on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158121 [16:36:28] (03CR) 10Jforrester: [C: 04-1] "Not just yet. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158121 (owner: 10Jforrester) [16:36:32] Krinkle: btw, cvn is now reporting data to labmon.wmflabs.org (will soon be graphite.wmflabs.org) [16:36:35] (as are all projects [16:36:37] ) [16:36:47] what's cvn? [16:37:35] greg-g: countervandalism, one of Krinkle's projects [16:37:43] (03CR) 10BryanDavis: Split private wikitech settings files into two. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/158118 (owner: 10Andrew Bogott) [16:37:46] (03PS1) 10Andrew Bogott: Include private ldap settings in wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158122 [16:37:54] bd808: and, step two ^ [16:38:09] andrewbogott: I think you missed adding a file in step 1 [16:38:16] hm, seems so [16:38:38] (03CR) 10Jforrester: [C: 04-1] "Tentatively scheduled for mid-September." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158120 (owner: 10Jforrester) [16:39:05] I saw one in Paris the other day in a 2nd hand shop [16:39:07] (03PS1) 10Dzahn: add missing misc_esams hostgroup to icinga [puppet] - 10https://gerrit.wikimedia.org/r/158123 [16:39:33] (03PS2) 10Andrew Bogott: Split private wikitech settings files into two. 
[puppet] - 10https://gerrit.wikimedia.org/r/158118 [16:39:49] http://i.imgur.com/lkHAlTl.jpg [16:39:59] It looks distorted but it isn't. It was that way [16:40:10] I didn't know there was a non-widescreen variant [16:40:12] Krinkle: The 4:3 screen, you mean? [16:40:18] yeah [16:40:27] but... wrong channel [16:41:39] YuviPanda: What was it again with those graphs? There was a template you mentioned to read this data. [16:41:50] Krinkle: yeah, haven't gotten to that part yet [16:41:53] I see a dozen different values in CPU alone, not sure which is the one "I want". [16:41:57] probably next week [16:41:59] right now setting up alerts for betalabs [16:42:12] e.g. whatever ganglia displays looks useful [16:42:27] Krinkle: user and system, I think [16:42:41] http://labmon.wmflabs.org/render/?target=cvn.cvn-app5.cpu.total.system.value&target=cvn.cvn-app5.cpu.total.user.value&target=cvn.cvn-app5.cpu.total.idle.value [16:43:04] http://labmon.wmflabs.org/render/?target=cvn.cvn-app5.cpu.total.*.value&width=800&height=500 [16:43:27] bd808: better? [16:43:28] Not sure which are included in others, or are negative (e.g. idle is negative) [16:43:34] anyway, don't tell me. [16:43:43] Krinkle: :) ok [16:43:49] shall poke you again when the other dashboard is done [16:43:53] akosiaris: so you added the servers to all the groups again, right. we just were missing a group.. [16:44:22] mutante: which one ? [16:44:36] (03CR) 10BryanDavis: [C: 031] Split private wikitech settings files into two. 
[puppet] - 10https://gerrit.wikimedia.org/r/158118 (owner: 10Andrew Bogott) [16:44:37] mutante: misc_esams, I just say [16:44:38] mutante: https://gerrit.wikimedia.org/r/#/c/158111/ fixes the other puppet fail [16:44:41] akosiaris: misc_esams https://gerrit.wikimedia.org/r/#/c/158123/ [16:44:42] s/say/saw [16:45:16] (03PS2) 10Dzahn: add missing misc_esams hostgroup to icinga [puppet] - 10https://gerrit.wikimedia.org/r/158123 [16:45:20] (03CR) 10BryanDavis: [C: 031] Include private ldap settings in wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158122 (owner: 10Andrew Bogott) [16:45:23] mutante: sigh.. I should have noticed... we must create that icinga has not reloaded check... [16:45:45] YuviPanda: that's already merged [16:45:50] oh [16:46:05] mutante: no it isn't [16:46:18] (03CR) 10Dzahn: [C: 032] add missing misc_esams hostgroup to icinga [puppet] - 10https://gerrit.wikimedia.org/r/158123 (owner: 10Dzahn) [16:46:49] ack, ori, just noticed that all varnishkafka rsyslog.d files are gone :/ [16:46:51] YuviPanda: no it is [16:46:56] been that way for a while i guess, eh? [16:47:02] YuviPanda: And notifications/monitoring is also on the agenda? [16:47:10] mutante: says 'status: Review in Progress' [16:47:12] fine, i'll find it by user name in gerrit [16:47:33] mutante: bah, wrong link [16:47:34] mutante: https://gerrit.wikimedia.org/r/#/c/158117/ [16:47:39] we were looking at two different patchsets [16:47:47] (03CR) 10Andrew Bogott: [C: 032] Split private wikitech settings files into two. [puppet] - 10https://gerrit.wikimedia.org/r/158118 (owner: 10Andrew Bogott) [16:48:07] (03CR) 10Ori.livneh: mediawiki::monitoring::errors: report to statsd (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/158008 (owner: 10Ori.livneh) [16:48:11] Krinkle: yes, kinda. Only doing betalabs and toollabs for now, but yes, you can add checks for cvn if you want. 
I'd like cvn itself to be puppetized for that, tho [16:48:12] YuviPanda: you had the wrong number, but i found it. are you sure you want warn to be the same value ? [16:48:30] mutante: a puppet fail is a puppet fail, no? even 1 is critical, and I can't set 0 to warn [16:48:52] YuviPanda: What's the rough sketch for monitoring? E.g. what does it, and where does it get data from (graphite? or from its source directly) [16:48:54] (03CR) 10Andrew Bogott: [C: 032] Include private ldap settings in wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158122 (owner: 10Andrew Bogott) [16:48:59] (03Merged) 10jenkins-bot: Include private ldap settings in wikitech.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158122 (owner: 10Andrew Bogott) [16:49:12] Krinkle: right now, it just checks a particular graphite value for a threshold and warns. [16:49:36] ack, ori, this is supposed to say varnishkafka, right? [16:49:37] https://github.com/wikimedia/operations-puppet/commit/d0ae0114#diff-341c32b25102f10e9df234286ed480e6R478 [16:49:46] YuviPanda: it? [16:50:08] Krinkle: 'it' as in 'plan for icinga checks for things in labs' [16:50:08] ottomata: say that where? 
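The check described at [16:49:12] ("it just checks a particular graphite value for a threshold and warns") can be sketched roughly as follows. This is a minimal illustration, not the actual Icinga plugin used on neon: the function names, the `percentage` parameter, and the sample data are assumptions, and the status strings only imitate the alert text seen in this log. The datapoint shape ([value, timestamp] pairs with possible nulls) follows Graphite's JSON render output.

```python
from typing import List, Optional, Sequence, Tuple

# One Graphite datapoint: (value or None, unix timestamp).
Datapoint = Tuple[Optional[float], int]


def percent_over(datapoints: Sequence[Datapoint], threshold: float) -> Optional[float]:
    """Return the percentage of non-null values strictly above threshold.

    Returns None when the series holds no usable values.
    """
    values: List[float] = [v for v, _ts in datapoints if v is not None]
    if not values:
        return None
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)


def check_threshold(
    datapoints: Sequence[Datapoint],
    warning: float,
    critical: float,
    percentage: float = 1.0,  # hypothetical knob: % of points that must exceed a threshold
) -> Tuple[str, str]:
    """Map a series onto an (status, message) pair in Nagios/Icinga style."""
    crit_pct = percent_over(datapoints, critical)
    if crit_pct is None:
        return "UNKNOWN", "No valid datapoints found"
    if crit_pct >= percentage:
        return "CRITICAL", f"{crit_pct:.2f}% of data above the critical threshold [{critical}]"
    warn_pct = percent_over(datapoints, warning)
    if warn_pct is not None and warn_pct >= percentage:
        return "WARNING", f"{warn_pct:.2f}% of data above the warning threshold [{warning}]"
    return "OK", f"Less than {percentage:.2f}% above the threshold [{warning}]"


if __name__ == "__main__":
    # Made-up sample: one spike out of five valid points, one null gap.
    sample = [(600.0, 1), (100.0, 2), (100.0, 3), (100.0, 4), (None, 5), (100.0, 6)]
    print(check_threshold(sample, warning=250.0, critical=500.0))
```

For a counter such as puppet `failed_events`, where a single failure should already alert, warning and critical would presumably be set to the same value, which matches the question raised at [16:48:12].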
[16:50:19] YuviPanda: Ah, so it'll be in icinga [16:50:19] cool [16:50:27] ori, i think copy/paste error happened [16:50:34] file { '/etc/rsyslog.d/75-kafkatee.conf': [16:50:40] you meant to put 70-varnishkafka.conf [16:50:40] i think [16:50:43] in cache.pp [16:51:00] (03CR) 10Dzahn: [C: 032] labmon: Pass 'warning' threshold to betalabs puppet check [puppet] - 10https://gerrit.wikimedia.org/r/158117 (owner: 10Yuvipanda) [16:51:23] ori: https://gerrit.wikimedia.org/r/#/c/135447/12/manifests/role/cache.pp [16:51:27] !log andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 03s) [16:51:40] (03PS2) 10Ori.livneh: mediawiki::monitoring::errors: report to statsd [puppet] - 10https://gerrit.wikimedia.org/r/158008 [16:51:45] ottomata: could well be, let me see the package [16:51:51] !log andrew Synchronized private/WikitechPrivateSettings.php: (no message) (duration: 00m 05s) [16:51:56] Logged the message, Master [16:51:59] !log andrew Synchronized private/WikitechPrivateLdapSettings.php: (no message) (duration: 00m 03s) [16:51:59] kafkatee has nothing to do with varnishkafka [16:52:05] it isn't on cache servers at all [16:52:06] Logged the message, Master [16:52:17] i think it was copy/paste error from here: [16:52:18] https://gerrit.wikimedia.org/r/#/c/135447/12/manifests/role/analytics/kafkatee.pp [16:52:29] ottomata: yes, sounds right [16:52:46] !log andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 03s) [16:52:51] ok, so, since the file is gone now...should I just put it back in place from the package on all the servers, and then declare teh file in the same way [16:52:56] or should I commit the file to puppet and render it? [16:53:59] ottomata: if you could [16:54:04] which, the latter? [16:54:06] render it? 
[16:54:18] doesn't matter much either way [16:54:31] (03PS1) 10Yuvipanda: icinga: Set default value for from in graphite threshold checks [puppet] - 10https://gerrit.wikimedia.org/r/158125 [16:54:36] mutante: ^ doc fail this time, fixed at source [16:54:56] hm, i wouldn't really be sure where to put it, i guess in varnishkafka module...but that's annoying because then we'd have to make sure to always sync package changes with puppet changes to get the right file [16:55:00] ok i will try to put it in place [16:55:12] manally [16:55:30] YuviPanda: missing another misc group this time :) also fixing puppet fail #2 [16:55:44] heh [16:55:53] ottomata: give me five minutes to properly wake up and i'll help you figure out a way to fix that easily [16:56:08] (03PS1) 10Ottomata: Ensure varnishkafka rsyslog file does not get removed by rsyslog module [puppet] - 10https://gerrit.wikimedia.org/r/158126 [16:56:11] ha, its ok, i think I know, i just need to do what was meant to be done, and then manually copy the correct file back in place [16:56:14] like that, right? ^ [16:56:26] (03PS2) 10Ottomata: Ensure varnishkafka rsyslog file does not get removed by rsyslog module [puppet] - 10https://gerrit.wikimedia.org/r/158126 [16:56:32] ottomata: but if the file isn't there, you'll get failures [16:56:36] let's do it another way [16:56:55] eh? 
I will put the file there, the package puts it there [16:57:13] okay, if you can time that right (or don't mind a couple of ephemeral puppet failures), that's fine [16:57:25] (03PS1) 10Dzahn: add missing misc_ulsfo hostgroup to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/158127 [16:57:33] (03CR) 10jenkins-bot: [V: 04-1] add missing misc_ulsfo hostgroup to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/158127 (owner: 10Dzahn) [16:57:39] i don't mind a couple of failures, i'll force puppet run on all varnish nodes once it is ready [16:59:10] (03PS2) 10Dzahn: add missing misc_ulsfo hostgroup to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/158127 [16:59:43] (03CR) 10Dzahn: [C: 032] add missing misc_ulsfo hostgroup to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/158127 (owner: 10Dzahn) [17:00:59] (03PS1) 10Ori.livneh: nutcracker: make configs notify service again [puppet] - 10https://gerrit.wikimedia.org/r/158129 [17:01:54] YuviPanda: Re Icinga & Tools, make it so! :-) [17:02:57] ottomata: quick question; what is the status regarding https://gerrit.wikimedia.org/r/#/c/155452/? Looking at the RT ticket just waiting for people to say they need access yet I don't see an email from you to the list about it :) [17:03:02] mutante: ^ [17:03:32] bd808: andrewbogott how's progress? [17:03:37] mutante: also merge https://gerrit.wikimedia.org/r/#/c/158125/ [17:03:37] ? [17:03:54] Reedy: pretty good! Doing some testing now, but close to ready to switch this live [17:04:03] sweet :D [17:04:29] Reedy: I had to split the private settings into two sections, which feels a bit dirty. 
https://gerrit.wikimedia.org/r/#/c/158118/ [17:05:55] andrewbogott: Unrelated (sort of) but https://virt1000.wikimedia.org/wiki/Nova_Resource:I-000005aa.eqiad.wmflabs hardcodes wikitech in links [17:06:09] in the "Actions" panel [17:06:23] That's probably hardcoding onwiki [17:06:38] yeah [17:06:44] as they're external links in an infobox [17:06:47] JohnFLewis: pretty sure I sent an email... checking [17:07:00] yeah, I those are created by a bot which probably hard-codes things [17:07:04] Yup, https://virt1000.wikimedia.org/w/index.php?title=Template:InstanceStatus&action=edit [17:07:20] ottomata: I might have missed it or looking in the wrong place :) [17:07:41] JohnFLewis: I think I sent it to an internal mailing list [17:07:57] ottomata: That's why then :) [17:08:18] will poke it [17:08:25] Thanks :) [17:08:25] (03PS2) 10Dzahn: icinga: Set default value for from in graphite threshold checks [puppet] - 10https://gerrit.wikimedia.org/r/158125 (owner: 10Yuvipanda) [17:10:44] (03CR) 10Ottomata: [C: 032 V: 032] Ensure varnishkafka rsyslog file does not get removed by rsyslog module [puppet] - 10https://gerrit.wikimedia.org/r/158126 (owner: 10Ottomata) [17:10:46] (03PS1) 10RobH: setting up dns for acamar/achernar, codfw recursive dns servers [dns] - 10https://gerrit.wikimedia.org/r/158130 [17:11:56] (03CR) 10RobH: [C: 031] setting up dns for acamar/achernar, codfw recursive dns servers [dns] - 10https://gerrit.wikimedia.org/r/158130 (owner: 10RobH) [17:12:27] JohnFLewis: Want to review another dns change? https://gerrit.wikimedia.org/r/#/c/158130/ ;] [17:12:31] mutante: did you force a puppet run on the icinga host? [17:12:35] (neon, I think?) 
[17:12:51] (anyone can review my dns change) [17:13:00] YuviPanda: neon is icinga yetp [17:13:04] yep [17:13:21] ah cool, thanks [17:13:51] * robh is totally going to self review and merge now cuz mehhhhhhh [17:13:51] (03PS2) 10Ori.livneh: nutcracker: make configs notify service again [puppet] - 10https://gerrit.wikimedia.org/r/158129 [17:14:05] (03CR) 10Ori.livneh: [C: 032] nutcracker: make configs notify service again [puppet] - 10https://gerrit.wikimedia.org/r/158129 (owner: 10Ori.livneh) [17:14:12] (03CR) 10Ori.livneh: [V: 032] nutcracker: make configs notify service again [puppet] - 10https://gerrit.wikimedia.org/r/158129 (owner: 10Ori.livneh) [17:14:14] (03CR) 10RobH: [C: 032] "this isnt the solo review you are looking for" [dns] - 10https://gerrit.wikimedia.org/r/158130 (owner: 10RobH) [17:14:16] (03CR) 10John F. Lewis: "Looks good." [dns] - 10https://gerrit.wikimedia.org/r/158130 (owner: 10RobH) [17:14:21] robh: : done. [17:14:21] heh [17:14:27] im so impatient! ;D [17:14:55] robh: I should have said 'there is a problem but I'm not saying where' :p [17:15:06] ah i was about to hit +1! too fast! [17:15:07] :p [17:15:21] The proxy server could not handle the request GET [17:15:22] godog: can i has https://gerrit.wikimedia.org/r/#/c/158008/ re-review? [17:15:45] mutante: uh, did it fail with that? 
[17:16:19] YuviPanda: i'm not sure yet, i dont think so [17:16:39] mutante: it will eventually fail anyway since https://gerrit.wikimedia.org/r/#/c/158125/ will cause a puppet fail [17:16:46] it probably won't fail on the icinga machine but it won't pick up the new things [17:17:23] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Puppet has 1 failures [17:17:43] PROBLEM - puppet last run on amssq40 is CRITICAL: CRITICAL: Puppet has 1 failures [17:17:44] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures [17:18:24] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [17:18:55] YuviPanda: the error was temp. on the puppetmaster, it finished ok [17:19:13] mutante: cool. still failing on labmon tho [17:19:24] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:19:43] YuviPanda: just setting a value for a parameter = puppet fail ?? odd [17:20:01] mutante: not setting a value, apparently. I just added a default [17:20:19] eh, yea, well "default value" :) [17:20:39] true [17:20:50] mutante: the docs said '-10m is default' but yet it fails [17:21:04] fails how [17:21:20] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Must pass from to Monitor_graphite_threshold[betalabs-puppet-fail] at /etc/puppet/manifests/role/beta.pp:158 on node labmon1001.eqiad.wmnet [17:21:46] mutante: because $from is a required variable in the class definition [17:22:43] it starts with "beta" being a role :) [17:22:50] hmm? [17:23:01] YuviPanda: yep, well, let's set it in the role then? 
[17:23:14] mutante: the docs for the class say '-10m' (default) [17:23:21] so I made the code match the docs :) [17:23:22] beta being a role is like if "production" would be a role [17:23:28] but that's another thing [17:25:10] (03PS1) 10Dzahn: set required variable "from" for labs graphite [puppet] - 10https://gerrit.wikimedia.org/r/158132 [17:25:40] (03PS2) 10Dzahn: set required variable "from" for labs graphite [puppet] - 10https://gerrit.wikimedia.org/r/158132 [17:25:50] (03PS3) 10Dzahn: set required variable "from" for labs graphite [puppet] - 10https://gerrit.wikimedia.org/r/158132 [17:26:42] (03CR) 10Dzahn: [C: 032] set required variable "from" for labs graphite [puppet] - 10https://gerrit.wikimedia.org/r/158132 (owner: 10Dzahn) [17:27:33] mutante: heh, that's ok too [17:29:24] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:29:43] RECOVERY - puppet last run on amssq40 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:29:45] mutante: cool, puppet fixed! I think neon needs another puppet run to pick up the changes? 
[17:30:31] (03PS1) 10Ori.livneh: rsyslog: ignore 50-default.conf instead of provisioning it [puppet] - 10https://gerrit.wikimedia.org/r/158135 [17:30:39] (03PS2) 10Dzahn: icinga check wikidata: avoid double quotes [puppet] - 10https://gerrit.wikimedia.org/r/158119 (owner: 10JanZerebecki) [17:32:15] (03CR) 10Dzahn: [C: 032] "trying, on neon for Yuvi's change anyways:)" [puppet] - 10https://gerrit.wikimedia.org/r/158119 (owner: 10JanZerebecki) [17:33:06] woot [17:34:03] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 23:24:49 UTC [17:34:09] (03CR) 10Aaron Schulz: [C: 031] image scalers: bump workers limits [puppet] - 10https://gerrit.wikimedia.org/r/157678 (owner: 10Filippo Giunchedi) [17:34:44] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:35:24] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:35:43] (03PS1) 10RobH: setting mgmt ip for baham - future auth dns server for codfw [dns] - 10https://gerrit.wikimedia.org/r/158136 [17:37:01] wtf gerrit why so slow [17:37:56] robh: ha :P want another review? [17:38:01] (03CR) 10RobH: [C: 032] "just mgmt dns changes" [dns] - 10https://gerrit.wikimedia.org/r/158136 (owner: 10RobH) [17:38:24] robh: impatient :( [17:38:24] nah just mgmt on that [17:38:30] YuviPanda: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon [17:38:34] im not as paranoid when it isnt touching wikimedia.org =] [17:38:35] ah, alright then [17:38:39] or a ton of servers [17:38:53] one server's mgmt on a system that i requested myself and if it doesn't work meh? =D [17:38:55] YuviPanda: maybe we should change the name of the check to be more obvious? 
[17:39:02] but i appreciate the offer =] [17:39:08] Break your stuff - not others :D [17:39:19] mutante: the check doesn't show up there [17:39:35] it should say something beta [17:39:41] betalabs-puppet-fail [17:40:08] do we need to nudge icinga? [17:40:27] (03PS1) 10Andrew Bogott: Mark out a bunch of code for wikitech. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 [17:40:32] YuviPanda: we just wait.. neon is slow [17:40:32] * andrewbogott stabs about randomly ^ [17:40:40] bd808: ^ [17:41:11] * bd808 reels from stab wounds [17:41:49] mutante: aaah, heh [17:41:55] Does just marking out those bits work, or do I have to actually specify values? I imagine there are sensible defaults in most cases... [17:41:58] mutante: I've go to now, brb in about 15 min [17:46:43] PROBLEM - DPKG on mw1019 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:47:03] no, it's not. it's just upgrading [17:47:28] (03CR) 10BryanDavis: "Some ideas about a slightly different approach inline." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 (owner: 10Andrew Bogott) [17:47:43] RECOVERY - DPKG on mw1019 is OK: All packages OK [17:49:37] andrewbogott: I assume you're aware about search issues on Wikitech? :) [17:49:49] JohnFLewis: nope! [17:50:08] Well; now you are :p [17:50:15] * Reedy kicks JohnFLewis [17:50:33] Reedy: sorry :( [17:51:26] JohnFLewis: What's the issue? [17:51:37] (which was the reason for the kick :P) [17:51:42] (03PS2) 10Andrew Bogott: Mark out a bunch of code for wikitech. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 [17:52:15] Reedy: Oh :p Searching anything on Wikitech gives 'An error has occurred while searching: We could not complete your search due to a temporary problem.' Best I have really error wise [17:53:18] andrewbogott: :) [17:53:25] (03PS3) 10Andrew Bogott: Mark out a bunch of code for wikitech. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 [17:53:40] JohnFLewis: I'm breaking a bunch of things right now, so setting that one aside for the moment [17:54:20] andrewbogott: alright. Just wanted to make sure you were aware [17:54:25] yep, thanks [17:54:44] bd808: one last chance ^ [17:55:25] (03CR) 10Reedy: Mark out a bunch of code for wikitech. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 (owner: 10Andrew Bogott) [17:56:10] andrewbogott: I think that'll remove all the db config from wikitech too... [17:56:32] Reedy: which bit? [17:56:33] With your wgDefaultExternalStore = false, the db.php should be ok again [17:56:41] only conditionally including db.php [17:58:37] (03PS4) 10Andrew Bogott: Mark out a bunch of code for wikitech. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 [17:59:27] Reedy: sorry, when you say 'remove all the db config' I can't tell if that's a good or a bad thing [17:59:55] Anyway, y'all have to go, I'll leave that patch for post-meeting review. [17:59:58] And maybe go have some lunch [18:00:05] yurik: Dear anthropoid, the time has come. Please deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140903T1800). [18:01:46] (03CR) 10BryanDavis: Mark out a bunch of code for wikitech. (036 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 (owner: 10Andrew Bogott) [18:02:47] (03PS1) 10Ori.livneh: mediawiki::sync: Don't ensure /a/common [puppet] - 10https://gerrit.wikimedia.org/r/158141 [18:03:38] (03CR) 10Reedy: Mark out a bunch of code for wikitech. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 (owner: 10Andrew Bogott) [18:04:00] mutante|away: back, still nothing in neon [18:04:02] and awww, he's away [18:04:54] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:09:13] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Epic puppet fail [18:09:27] (03PS5) 10Andrew Bogott: Mark out a bunch of code for wikitech. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 [18:10:44] (03PS2) 10Ori.livneh: mediawiki: move definition of /a/common to misc::deployment [puppet] - 10https://gerrit.wikimedia.org/r/158141 [18:10:47] So many new globals [18:14:31] (03CR) 10Ori.livneh: [C: 032] mediawiki: move definition of /a/common to misc::deployment [puppet] - 10https://gerrit.wikimedia.org/r/158141 (owner: 10Ori.livneh) [18:19:35] YuviPanda: there seems to be nothing in your recent change that actually adds a service on icinga though.. i'll be back later [18:19:54] oh? [18:20:06] I added a monitor_graphite_threshold [18:20:21] that does setup a monitor_service [18:20:39] but what adds it to neon [18:21:09] uh, resource collection? [18:22:40] it's in role::beta, and that isnt used in production ? 
[18:22:44] srry, need to run, i'll look later [18:23:18] mutante|away: cool, thanks [18:23:23] it's included in labmon, tho [18:23:27] so resource collection should pick it up [18:23:31] I'll go in about 10min as well [18:23:53] PROBLEM - puppet last run on mw1040 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:53] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [18:27:14] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 7682.26433569 [18:28:03] PROBLEM - Puppet freshness on elastic1004 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 06:22:45 UTC [18:28:03] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [18:28:13] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:29:03] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC [18:30:03] RECOVERY - Disk space on elastic1004 is OK: DISK OK [18:30:24] RECOVERY - Puppet freshness on elastic1004 is OK: puppet ran at Wed Sep 3 18:30:18 UTC 2014 [18:30:43] RECOVERY - puppet last run on elastic1004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:33:03] PROBLEM - Puppet freshness on rcs1001 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:31:19 UTC [18:33:03] PROBLEM - Puppet freshness on rcs1002 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:30:54 UTC [18:35:24] hm, search appears to be broken on wikitech. known issue? [18:35:26] An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later. 
[18:35:53] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 16.0 [18:35:59] <^d> jgage: No good, let's find out. [18:36:03] jgage: I believe andrewbogott is aware but not sure if he's figured out what may be breaking it. [18:36:12] I'm watching the kafka stuff, btw [18:36:16] trying to troubleshoot something [18:36:18] <^d> I'll have a look. [18:36:18] I'm aware but I haven't investigated. [18:36:20] ok ottomata :) [18:36:31] I suspect it's unrelated to what I've been doing and a result of recent search deploy. But that's just a guess. [18:36:47] It's probably not very useful to troubleshoot since we're going to change everything soon (today, I hope) [18:36:54] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 16.0 ottomata Attempting to troubleshoot loss during leader elections. [18:36:54] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on analytics1018 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 12.0 ottomata Attempting to troubleshoot loss during leader elections. [18:37:02] andrewbogott, ok [18:37:03] <^d> andrewbogott: Well this is why we're happy about moving it more in-cluster. Makes it easier for our search deploys to not break things :) [18:37:43] (03CR) 10Jgreen: [V: 04-1] Added the bouncehandler router to catch in all bounce emails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/155753 (owner: 1001tonythomas) [18:37:47] <^d> Hmm, mapping and so forth are ok. 
[18:39:17] <^d> demon@terbium:~$ mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki=labswiki [18:39:17] <^d> DB connection error: Can't connect to MySQL server on '208.80.154.18' (4) (208.80.154.18) [18:39:23] <^d> That might be part of it ^ [18:44:09] (03PS1) 10Yurik: Enabled Graph ext on zerowiki & collabwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158149 [18:44:40] (03PS23) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - 10https://gerrit.wikimedia.org/r/155753 [18:44:45] (03PS1) 10Chad: Disable LuceneSearch always on wikis that it doesn't work on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158150 [18:44:53] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1012 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [18:45:05] (03PS24) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - 10https://gerrit.wikimedia.org/r/155753 [18:45:14] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [18:45:28] (03Abandoned) 10Chad: Disable LuceneSearch always on wikis that it doesn't work on [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158150 (owner: 10Chad) [18:46:41] <^d> andrewbogott: I'm guessing error logs still all go to virt1000? [18:46:58] ^d: for your purposes nothing has changed. [18:47:11] we have a test wiki running at a different address… wikitech.wikimedia.org should be just the same as always [18:47:17] !log yurik Synchronized php-1.24wmf19/extensions/Graph/: (no message) (duration: 01m 05s) [18:47:21] <^d> Hmmm. [18:47:22] Logged the message, Master [18:47:49] Confusingly, the test wiki also uses the 'labswiki' database. Possible there's a collision happening there... 
[18:48:13] (03CR) 10Yurik: [C: 032] Enabled Graph ext on zerowiki & collabwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158149 (owner: 10Yurik) [18:48:17] (03Merged) 10jenkins-bot: Enabled Graph ext on zerowiki & collabwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158149 (owner: 10Yurik) [18:48:18] actually… ^d, the test wiki is at virt1000.wikimedia.org [18:48:22] and search works just fine there :) [18:48:27] So maybe we ate your search somehow [18:48:37] <^d> Indeed it does :) [18:48:46] <^d> Probably getting confused with the duplicated dbnames. [18:48:48] !log yurik Synchronized php-1.24wmf18/extensions/Graph/: (no message) (duration: 01m 09s) [18:48:54] Logged the message, Master [18:49:03] In case this matters… it's not a duplicate db [18:49:12] it's the same db. Two different mw installs pointed at the same db [18:49:47] <^d> MediaWiki is probably confused as hell by that :p [18:50:41] hey, mw1163 seems FUBAR [18:51:15] ^d: Probably! [18:51:30] ^d: I'm about to go to lunch, need anything before I vanish? [18:51:58] !log Running sync-common on mw1163 [18:51:59] <^d> Na. I'm about to dip out and grab an early lunch myself. 
[18:52:05] Logged the message, Master [18:52:08] 'k [18:52:31] mark: when you got time - https://gerrit.wikimedia.org/r/#/c/155753/ [18:52:42] !log yurik Synchronized wmf-config: enabling graph ext on zerowiki & collabwiki (duration: 01m 06s) [18:52:49] Logged the message, Master [18:54:14] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2686.38448163 [18:57:04] hey guys, mw1163 was severely broken due to being outta sync - please either add it back to dsh or depool if it needs more fixage [19:07:04] (03PS1) 10BBlack: Fix ipv6 service subnet comments in esams/ulsfo/codfw [dns] - 10https://gerrit.wikimedia.org/r/158151 [19:08:00] (03CR) 10BBlack: [C: 032] Fix ipv6 service subnet comments in esams/ulsfo/codfw [dns] - 10https://gerrit.wikimedia.org/r/158151 (owner: 10BBlack) [19:08:21] (03PS1) 10Ottomata: Buffer up to 10 seconds on varnishkafka to account for possible long leader election times [puppet] - 10https://gerrit.wikimedia.org/r/158152 [19:09:19] (03PS2) 10Ottomata: Buffer up to 10 seconds on varnishkafka to account for possible long leader election times [puppet] - 10https://gerrit.wikimedia.org/r/158152 [19:09:32] (03CR) 10Ottomata: [C: 032 V: 032] Buffer up to 10 seconds on varnishkafka to account for possible long leader election times [puppet] - 10https://gerrit.wikimedia.org/r/158152 (owner: 10Ottomata) [19:27:42] <^d> ottomata: So, when do we want to play with 1016? [19:35:03] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 23:24:49 UTC [19:39:27] (03PS1) 10RobH: temp assigning papaul a mgmt ip so he can test drac [dns] - 10https://gerrit.wikimedia.org/r/158158 [19:40:52] (03CR) 10RobH: [C: 032] "this is a temp measure for papaul to test mgmt network config changes. once we have onsite networking fully deployed, this will go away." 
[dns] - 10https://gerrit.wikimedia.org/r/158158 (owner: 10RobH) [19:40:56] ^d, hmmmMmm [19:40:57] now! [19:41:47] <^d> ottomata: Sounds good to me :) How we gonna bench this? [19:41:55] good q, i don't really know... [19:42:01] <^d> I'm assuming something a little more scientific than "look at ganglia" :p [19:42:30] we can capture some iostat output during [19:42:31] and compare [19:44:27] <^d> ottomata: Sounds like a plan. [19:45:45] <^d> Did we decide if we're going to do move shards off (wait), move shards on? Or the faster way? [19:49:01] ^d, which generates more io :) ? [19:49:29] <^d> Probably about the same, but the former takes wayyyy longer. [19:49:38] <^d> Latter we can do inside an hour. [19:49:44] let's do latter then [19:50:39] ^d: tail -f /tmp/iostat.out on elastic1016 [19:51:30] <^d> Ok, going through the paces on 1016. [19:51:46] <^d> Non primary allocation disabled. [19:52:18] <^d> elasticsearch restarted, waiting for it to rejoin. [19:52:33] ok, ^d, if we wait longer to start elasticsearch, we should get more io, right? [19:52:40] like, wait a whole minute, or 5 [19:52:41] ? [19:53:06] <^d> Probably yeah. We can hold off on re-enabling allocation. [19:53:41] instead of stopping elasticsearch altogether? [19:54:37] <^d> Stopped entirely. [19:55:03] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.13 [19:55:18] <^d> Ok, it's hovering at > 99% idle now [19:56:31] ok, let's get the timing about right [19:56:38] what minute did you stop it? :54? [19:57:03] PROBLEM - ElasticSearch health check for shards on elastic1016 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.13:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [19:57:14] <^d> ottomata: [2014-09-03 19:54:32,686][INFO ][node ] [elastic1016] closed [19:57:53] great ok [19:58:00] so, how long should we leave it off... 5 mins? 
[19:58:24] <^d> No more than that, icinga's going to keep complaining :p [19:58:47] we can ack it [19:58:49] let's do 5. [19:59:03] <^d> Well 5 minutes is only another 30 seconds or so anyway :) [19:59:04] RECOVERY - ElasticSearch health check for shards on elastic1016 is OK: OK - elasticsearch status production-search-eqiad: {status: yellow, number_of_nodes: 18, unassigned_shards: 363, timed_out: False, active_primary_shards: 2021, cluster_name: production-search-eqiad, relocating_shards: 2, active_shards: 5695, initializing_shards: 0, number_of_data_nodes: 18} [19:59:16] op [19:59:19] puppet do that? [19:59:40] <^d> Must've :p [20:00:05] gwicke, subbu, cscott: Respected human, time to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140903T2000). Please do the needful. [20:00:06] ooook, well 4 minutes then... :p [20:00:10] <^d> Yep, syslog confirmed it. [20:01:38] <^d> ottomata: Ok, want to reenable allocation and put it back under load? [20:02:01] ok, 2 shards are moving in, that is what we are curious about, right? [20:02:38] <^d> Somewhat that, but that's also throttled at the network level. [20:02:46] hm [20:02:50] <^d> I'm more interested in the 18 or so it'll try to restore from disk. [20:02:57] <^d> *18 at a time [20:03:16] i think I don't know how this works well [20:03:33] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2024: active_shards: 5698: relocating_shards: 2: initializing_shards: 3: unassigned_shards: 372 [20:03:33] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running.
status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2024: active_shards: 5698: relocating_shards: 2: initializing_shards: 3: unassigned_shards: 372 [20:03:33] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2024: active_shards: 5698: relocating_shards: 2: initializing_shards: 3: unassigned_shards: 372 [20:03:43] we turn it off for a few minutes so that it's shards can get stale, right? [20:03:49] <^d> Ok ok, it went red. Time to turn it back. [20:03:51] and then we turn it back on, and it has to sync its replcas [20:03:53] hah, ok do it [20:04:46] <^d> There we go. [20:05:27] (03PS1) 10RobH: setting mgmt/production ip addresses for sca1001-1002 [dns] - 10https://gerrit.wikimedia.org/r/158165 [20:05:35] why'd it go red? [20:06:12] <^d> It's yellow again, icinga hasn't caught up yet. [20:06:37] ok so, yeah, ^d, I think I'm confused here...i thought we were shutting this down so it would have to sync a bunch over recent changes [20:06:47] and that was how we were going to generate io [20:07:31] <^d> That doesn't generate nearly as much io as initializing the shards from disk. [20:07:42] <^d> Because we throttle it at the network level to *keep* from thrashing the disks. [20:08:24] hm, ok, so what we want it searches to go to it, so they have to pull all the existing shards there into memory? [20:09:35] <^d> I'm not sure how the routing works there, if another replica has the data it probably wouldn't initialize a shard to do that. [20:10:02] <^d> We could force routing, but that doesn't simulate our real traffic patterns at all. [20:11:37] that's ok, ^d, if we can force routing, maybe we should. we just want to do the same thing to this node and another node, and compare [20:14:36] <^d> ottomata: We can do the same thing to a second host. 
Disable non-primary allocation, kick it for about 4-5 minutes, bring it up. Wait another 3-4 minutes and re-enable all allocation. [20:14:49] ok [20:15:08] so, 1016 has queries on it now, right? [20:15:34] <^d> Yes, plus enwiki and a bunch of other shards initializing on it. [20:16:34] how can you see initializing? [20:16:38] !log deployed Parsoid version 78e55c6b (deploy repo sha c0761179) [20:16:44] Logged the message, Master [20:16:45] oh i see nik's function [20:16:58] <^d> Yeah, that crazy thing. [20:16:59] <^d> https://wikitech.wikimedia.org/wiki/Search#Other_useful_functions [20:17:24] <^d> I could write a whole page on wikitech about shard allocation. Maybe I should. [20:17:27] ok, so it is pulling that from disk now? [20:17:34] maybe you should! [20:17:48] <^d> Yeah, from=elastic1016 on enwiki_content [20:19:32] <^d> Getting over 100k wKB/s at peak on md2. Not bad. [20:19:48] bd808, Reedy, if back from your meeting can confirm that https://gerrit.wikimedia.org/r/#/c/158138/5 is (at worst) harmless? [20:19:59] * bd808 looks again [20:21:39] (CR) Chad: "Looks mostly harmless, minus one inline nit. Maybe we should annotate these with some comments about how the ideal world has us getting ri" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/158138 (owner: Andrew Bogott) [20:23:14] (CR) Andrew Bogott: "I disagree about the ideal world -- I want wikitech to still work if one of the cluster services collapses. Self-sufficiency seems like a" [mediawiki-config] - https://gerrit.wikimedia.org/r/158138 (owner: Andrew Bogott) [20:23:19] so ^d, should we wait until this en shard is done initializing? [20:23:34] oh, there's 2 enwiki shards initing there now, but one is _general? [20:23:44] <^d> Yeah, _content and _general. [20:23:47] <^d> We should wait until they're all done before we do a different host. [20:23:49] ok [20:23:51] <^d> So the conditions are the same.
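[editor's note] The plan above is to put a second host through the same stop/restart cycle and compare the iostat captures. A minimal sketch of that comparison follows. It assumes a simplified, hypothetical capture format (one "device wkB/s" pair per line) rather than real iostat output, and the sample numbers are made up purely to show the shape of the comparison.

```python
from statistics import mean


def parse_capture(text, device):
    """Return the write-throughput samples (kB/s) recorded for one device.

    Assumes a hypothetical two-column format: "<device> <wkB/s>" per line.
    Real iostat output has more columns and would need a different parser.
    """
    samples = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == device:
            samples.append(float(parts[1]))
    return samples


def summarize(samples):
    """Peak and mean of a list of samples; zeros if the list is empty."""
    if not samples:
        return {"peak": 0.0, "mean": 0.0}
    return {"peak": max(samples), "mean": mean(samples)}


# Made-up numbers, not measurements from elastic1016 or any other host.
capture_a = "md2 1200.0\nmd2 98000.5\nmd2 101500.0\nsda 10.0\n"
capture_b = "md2 900.0\nmd2 60000.0\nmd2 75000.0\n"

a = summarize(parse_capture(capture_a, "md2"))
b = summarize(parse_capture(capture_b, "md2"))
print("host A peak %.1f mean %.1f" % (a["peak"], a["mean"]))
print("host B peak %.1f mean %.1f" % (b["peak"], b["mean"]))
```

Comparing peak and mean over the same window only means something if both hosts were stopped for the same length of time and had all shards finish initializing first, which is the point ^d makes about keeping the conditions the same.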
[20:24:36] (03CR) 10BryanDavis: "Couple of small issues but really close I think." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 (owner: 10Andrew Bogott) [20:25:43] andrewbogott: Yeah, I think per bd808 it's just about there [20:26:02] andrewbogott: You're so greedy. :) You want our code updates but you want to be able to not crash too? Crazy. [20:28:21] (03CR) 10RobH: [C: 032] setting mgmt/production ip addresses for sca1001-1002 [dns] - 10https://gerrit.wikimedia.org/r/158165 (owner: 10RobH) [20:29:03] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [20:29:40] (03PS6) 10Andrew Bogott: Mark out a bunch of code for wikitech. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 [20:30:03] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC [20:31:06] (03CR) 10Reedy: Mark out a bunch of code for wikitech. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 (owner: 10Andrew Bogott) [20:31:17] (03CR) 10Reedy: [C: 04-1] Mark out a bunch of code for wikitech. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 (owner: 10Andrew Bogott) [20:32:39] (03PS7) 10Andrew Bogott: Mark out a bunch of code for wikitech. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 [20:33:42] (03CR) 10Reedy: [C: 031] "LGTM now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 (owner: 10Andrew Bogott) [20:34:03] PROBLEM - Puppet freshness on rcs1001 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:31:19 UTC [20:34:03] PROBLEM - Puppet freshness on rcs1002 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:30:54 UTC [20:36:42] (03CR) 10Andrew Bogott: [C: 032] Mark out a bunch of code for wikitech. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 (owner: 10Andrew Bogott) [20:36:48] (03Merged) 10jenkins-bot: Mark out a bunch of code for wikitech. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158138 (owner: 10Andrew Bogott) [20:37:49] !log andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 04s) [20:38:02] !log andrew Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 04s) [20:38:08] Logged the message, Master [20:38:13] !log andrew Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 05s) [20:38:19] Logged the message, Master [20:39:12] (03PS1) 10RobH: setting sca1001-1002 install params [puppet] - 10https://gerrit.wikimedia.org/r/158232 [20:40:40] Reedy, bd808, I'm back to getting a perpetual cookie failure on login [20:41:08] (03PS3) 10Dzahn: add missing misc_ulsfo hostgroup to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/158127 [20:41:55] (03CR) 10Dzahn: [C: 032] "thought it was merged earlier" [puppet] - 10https://gerrit.wikimedia.org/r/158127 (owner: 10Dzahn) [20:42:03] andrewbogott: oh, bah [20:42:06] Y'all aware that images are broken, like, everywhere? [20:42:13] marktraceur: behave [20:42:17] (03CR) 10RobH: [C: 032] "self review of install server params for new servers (and thus non-service impacting) is one of the few acceptable solo-reviews." [puppet] - 10https://gerrit.wikimedia.org/r/158232 (owner: 10RobH) [20:42:23] * marktraceur stands in corner [20:42:42] robh: merge conflict, i'll take them :) [20:42:54] did it already without asking =P [20:43:00] andrewbogott: we want to use cluster mc config (because the default config is to point at nutcracker on the local host), but not cluster session config [20:43:08] 'if he didnt want it live he wouldnt have merged it on gerrit' ;D [20:43:19] heh [20:43:21] ok, was sitting at the yes/no [20:43:26] sorry about that [20:43:29] "No backend defined with the name `global-multiwrite`." 
[20:43:48] np, i won't try to say yes or no, i'll ctrl +c :) [20:43:52] Reported from tons of mw servers [20:43:57] Reedy: ^ [20:44:08] That would explain it [20:44:13] ...probably [20:44:27] wmgUseClusterFileBackend is true in the config [20:44:27] local-multiwrite, global-multiwrite [20:44:45] Sync order of files? [20:44:53] * bd808 looks at SAL [20:44:53] > var_dump( $wmgUseClusterFileBackend ); [20:44:54] NULL [20:45:19] !log reedy Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 15s) [20:45:24] Logged the message, Master [20:45:40] Only greps in CommonSettings [20:46:01] Case mismatch! [20:46:10] boom [20:46:16] andrewbogott: also, is your pc clock completely wrong? [20:46:17] Date: Sat Aug 30 07:38:49 2014 -0500 [20:46:45] (PS1) Andrew Bogott: Fix var case mismatch [mediawiki-config] - https://gerrit.wikimedia.org/r/158235 [20:47:04] (CR) BryanDavis: [C: 1] Fix var case mismatch [mediawiki-config] - https://gerrit.wikimedia.org/r/158235 (owner: Andrew Bogott) [20:47:06] (PS2) Reedy: Fix var case mismatch [mediawiki-config] - https://gerrit.wikimedia.org/r/158235 (owner: Andrew Bogott) [20:47:10] (CR) Reedy: [C: 2] Fix var case mismatch [mediawiki-config] - https://gerrit.wikimedia.org/r/158235 (owner: Andrew Bogott) [20:47:14] (Merged) jenkins-bot: Fix var case mismatch [mediawiki-config] - https://gerrit.wikimedia.org/r/158235 (owner: Andrew Bogott) [20:47:18] I'm working on a local VM, sometimes the clock drifts while my laptop is shut [20:47:35] OK What broke? [20:47:37] syncing [20:47:40] qcoder00: MediaWiki [20:47:41] duh [20:47:48] !log reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 15s) [20:47:55] Logged the message, Master [20:47:55] As in specifically what went wrong?
[20:48:05] Typo in a config file change [20:48:10] reedy@tin:/a/common$ mwscript eval.php enwiki [20:48:10] > var_dump( $wmgUseClusterFileBackend ); [20:48:10] bool(true) [20:48:12] That's better [20:48:16] The sort of thing that should not happen [20:48:21] andrewbogott: i was able to solve that clock-in-vm problem by installing open-vm-dkms [20:48:21] * qcoder00 sighs [20:48:24] (CR) Dzahn: "wmf-config/CommonSettings.php:if ( $wmgUseClusterFileBackend ) {" [mediawiki-config] - https://gerrit.wikimedia.org/r/158235 (owner: Andrew Bogott) [20:48:33] qcoder00: Shit happens, unfortunately [20:48:47] Yes, but it makes Wikimedia projects look unprofessional [20:48:53] We're professional? [20:49:04] qcoder00: worry about the post mortem after the issue is fixed. [20:49:05] Reedy: behave [20:49:11] WMF has 200 paid staff [20:49:27] So yes i would expect some degree of reasonable competence [20:49:46] and not letting changes go live until they were KNOWN to be OK [20:49:48] and 10+ years of legacy tech debt and human employees and ... [20:49:48] if you think you can do better, you should apply! ;) [20:49:56] defects happen even in competent environments [20:50:01] qcoder00: And how are we supposed to KNOW they're ok without putting them on the cluster? [20:50:08] The fact is that it gets fixed, and done quickly [20:50:15] Reedy: There are various techniques [20:50:25] qcoder00: good, you're going to tell us how to test changes. Do carry on. [20:50:41] It was syntactically fine [20:50:44] NotASpy: You have 2 systems [20:50:45] qcoder00: the WMF might have 200 paid staff; but far away from 200 paid operations staff. [20:50:50] Did that mismatch cause visible symptoms or just alarms?
(I didn't see any noticeable change in en but I'm not much of a test case) [20:50:52] The fact it's was FileBackend vs Filebackend [20:50:54] * bd808 thinks everyone should chill [20:50:54] One is the test systems, the other is production [20:51:04] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [20:51:13] qcoder00: this is an operational channel, please move this discussion to #wikimedia-tech [20:51:17] And you don't put things on "production" until it's obvious they are OK [20:51:27] Nemo-bis: Will do so [20:51:28] bd808: The community like moaning, that's life :D [20:51:33] qcoder00: thanks [20:51:37] andrewbogott: Images not being served from cache would've been broken. Uploads too [20:51:39] qcoder00: it's called beta, it's pretty complex, it's way more than "just 2 machines", we are trying to catch all sorts of errors and typos with jenkins, but it can only get better [20:51:50] hm, ok. [20:52:01] I'm good with that but us vs them and strawman arguments from all sides in not very productive [20:52:54] * bd808 orders andrewbogott's t-shirt [20:53:09] We missed it too but better now [20:53:11] \o/ [20:54:22] [front] I broke Wikipedia... :( [back] But I fixed it! :) [20:54:35] haha [20:54:42] (03PS1) 10Jforrester: Enable Flow on [[mw:Talk:MediaWiki 1.25]] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158237 [20:55:12] (03CR) 10Bartosz Dziewoński: "Eww." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158237 (owner: 10Jforrester) [20:55:47] MatmaRex: Yeah, manually specifying per-page is a real pain. [20:56:00] James_F: I like the idea :p [20:56:00] that too [20:56:09] JohnFLewis: Dogfooding is a good plan. 
[20:56:12] i hear we are 6 weeks into a ~10 week db migration involving billions of rows :P [20:56:16] once thats done, it can be better [20:56:17] So, to recap, php is case-sensitive when it comes to variables, but insensitive when it comes to functions? [20:56:33] (03PS1) 10Reedy: We always want to use the cluster memcached config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158238 [20:56:36] Moving forward with Flow at the same time moving forward with MW. Perfect idea James_F :D [20:56:37] but, i have a major breaking change pending for 1.25 and now not only am i going to have to respond to people bitchin', i am also going to have to do it in flow? :P [20:56:41] * andrewbogott knows this but is still perpetually surprised [20:56:59] MatmaRex: Welcome to the future. [20:57:15] bd808: Order andrewbogott a php sucks shirt while you're at it [20:57:16] andrewbogott: yes. classes are insensitive too! :D [20:57:21] MatmaRex: Hey - you can tell them to go with the Flow now :D (/me goes to door himself) [20:58:01] JohnFLewis: https://www.mediawiki.org/wiki/User:MZMcBride/Flow [20:58:04] (03CR) 10Andrew Bogott: [C: 031] We always want to use the cluster memcached config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158238 (owner: 10Reedy) [20:58:49] (03CR) 10Reedy: [C: 032] We always want to use the cluster memcached config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158238 (owner: 10Reedy) [20:58:52] MatmaRex: added :p [20:58:53] (03Merged) 10jenkins-bot: We always want to use the cluster memcached config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158238 (owner: 10Reedy) [20:59:06] andrewbogott: that should fix login again [20:59:13] RECOVERY - Unmerged changes on repository puppet on virt0 is OK: No changes to merge. 
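[editor's note] The outage discussed above came down to one variable spelled with two different capitalizations (FileBackend vs Filebackend), which PHP treats as two distinct variables. The change was syntactically fine, so a syntax check would not catch it. A rough sketch of a check that might: scan config text for identifiers that differ only by case. The regex, the function name `case_collisions`, and the two snippets are illustrative assumptions, not the real wmf-config contents or any existing Jenkins job.

```python
import re
from collections import defaultdict

# Matches PHP-style variable references such as $wmgUseClusterFileBackend.
VAR_RE = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def case_collisions(sources):
    """Map lowercased name -> set of spellings, for names spelled more than one way.

    `sources` is a dict of {filename: file contents}.
    """
    spellings = defaultdict(set)
    for text in sources.values():
        for name in VAR_RE.findall(text):
            spellings[name.lower()].add(name)
    return {k: v for k, v in spellings.items() if len(v) > 1}


# Illustrative snippets only; which file held which spelling is not stated in the log.
sources = {
    "InitialiseSettings.php": "$wmgUseClusterFilebackend = true;",
    "CommonSettings.php": "if ( $wmgUseClusterFileBackend ) { /* ... */ }",
}
for variants in case_collisions(sources).values():
    print("possible case mismatch:", ", ".join(sorted(variants)))
```

A check like this would produce false positives wherever two differently cased names are intentional, so it is better suited to a warning than a hard gate.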
[20:59:15] well at least facebook was down as well today :) [20:59:34] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 15s) [20:59:40] Logged the message, Master [20:59:43] compared to that we were way less down :) [20:59:49] thedjNotWMF: One day; people will go 'remember the day Facebook and Wikimedia had issues' [21:01:51] (03PS4) 10BBlack: Remove revdns for unused project-lb.site hostnames [dns] - 10https://gerrit.wikimedia.org/r/157980 [21:03:06] (03CR) 10BBlack: [C: 032] Remove revdns for unused project-lb.site hostnames [dns] - 10https://gerrit.wikimedia.org/r/157980 (owner: 10BBlack) [21:03:10] Reedy: I have logins again. Editing a page throws an error still. [21:03:19] what error? [21:03:46] 'Sorry! This site is experiencing technical difficulties.' [21:03:56] Actually the edit took effect despite the error [21:04:25] (Cannot contact the database server: Access denied for user 'wikiuser'@'208.80.154.18' (using password: YES) (10.64.16.29)) [21:04:53] db1040, s4 master [21:05:18] no images on my user page though [21:05:38] I couldn't figure out where that was being pulled in earlier (s4) [21:06:43] I presumed it was our "instant commons" via ForeignDBViaLBRepo before [21:06:58] (03PS1) 10BBlack: Fix ipv6 service subnet comments/origin in eqiad [dns] - 10https://gerrit.wikimedia.org/r/158242 [21:07:25] hm, globalusage? 
[21:07:45] Yeah, let's disable that [21:07:57] (03CR) 10BBlack: [C: 032] Fix ipv6 service subnet comments/origin in eqiad [dns] - 10https://gerrit.wikimedia.org/r/158242 (owner: 10BBlack) [21:08:13] (03CR) 10Bartosz Dziewoński: [C: 04-1] "And now for a more constructive comment, I think we shouldn't flowify any more pages anywhere until that can be done on-wiki rather than i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158237 (owner: 10Jforrester) [21:08:25] (03PS1) 10Reedy: Disable GlobalUsage on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158243 [21:08:34] $wgGlobalUsageDatabase = 'commonswiki'; [21:08:35] $wgGlobalUsageSharedRepoWiki = 'commonswiki'; [21:08:35] etc [21:08:56] (03CR) 10Reedy: [C: 032] Disable GlobalUsage on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158243 (owner: 10Reedy) [21:09:00] (03Merged) 10jenkins-bot: Disable GlobalUsage on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158243 (owner: 10Reedy) [21:09:04] what is 'global usage'? [21:09:27] It tracks where images are used [21:09:32] Our extension that reports back to commons when a wiki uses an image from it [21:09:35] !log reedy Synchronized wmf-config/InitialiseSettings.php: Disable GlobalUsage on labswiki (duration: 00m 15s) [21:09:41] Logged the message, Master [21:09:46] MatmaRex: Blocking improvements on DB re-orgs seems a bit unhelpful. :-( [21:09:54] ok, I just got a clean edit of my user page. So, much better! [21:10:36] James_F: i could list a number of improvements blocked on database re-orgs ;) [21:10:40] back in ten! [21:10:50] MatmaRex: Sure, but this doesn't need to be one of them. :-) [21:10:51] James_F: besides, the content handler columns are (almost) done already, aren't they? [21:11:03] MatmaRex: ebernhardson is the expert. [21:16:03] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:16:33] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:16:33] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:16:33] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:17:11] oh, ^d, 1016 looks cool again [21:17:17] i'm going to stop the iostat output [21:17:20] <^d> Okie dokie [21:17:24] shall we do the same thing to another node tomorrow? [21:17:46] <^d> Sounds good. [21:30:38] * YuviPanda pokes mutante [21:30:41] Around? [21:31:51] Here's one everyone will jump on: Fatal error: Cannot access protected property EditPage::$mTokenOk in /usr/local/apache/common-local/php-1.24wmf15/extensions/SemanticForms/includes/SF_AutoeditAPI.php on line 459 [21:32:33] wmf15? [21:32:33] You need a newer semantic forms I think [21:33:03] Pretty sure it's all the same as on wikitech... 
[21:33:03] * andrewbogott checks [21:33:13] That variable was made protected recently and I remember seeing a thread on wikitech-l about this breaking sf [21:33:42] andrewbogott: LOL, we fixed that in master recently [21:33:49] hm, actually it's too new [21:34:00] well, too new and/or too old :) [21:34:08] andrewbogott: do you have https://gerrit.wikimedia.org/r/#/c/157178/ ? [21:34:30] bd808, looks like that branch isn't tracked correctly. wikitech uses 2.7 but virt1000 3.0 alpha [21:34:56] legoktm: Since I'm trying to replicate wikitech I'll probably address this by moving backwards for now [21:35:19] well that change is not in any deployment branch yet [21:35:24] andrewbogott: That particular bug has a pending gerrit change I think. [21:35:26] maybe we should cherry-pick it? [21:35:36] bd808: yes, but... [21:35:41] legoktm is way ahead of me [21:36:03] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 23:24:49 UTC [21:36:05] andrewbogott: Yeah I get your point too [21:36:12] but ori is right that wmf15 is a bit old... [21:36:30] Since we're committed to an older SMW it'd be nice for the semanticforms to be pinned as well. [21:36:36] Otherwise it will surely drift into incompatibility [21:37:46] * andrewbogott presumes that wikitech is the only thing using semanticforms [21:38:00] andrewbogott: Semantic forms isn't pinned. -- https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-wmf-branch/default.conf#L174 [21:38:33] Would it be harmful to pin it? Is anyone else using it? [21:38:37] So if we need to pin it we should change that config (cc Reedy) [21:38:57] I'd guess not -- https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-wmf-branch/default.conf#L159-L160 [21:39:00] nothing else on the cluster uses SemanticForms [21:41:54] :/ [21:42:27] Making a patch to change the commit that is used in wmf15 seems safe. 
Getting wikitech running master seems a good goal for the mid-term [21:42:48] mid-term being ~1 month in my brain [21:43:08] I have a patch, just wrestling with gerrit [21:43:23] (03PS1) 10Ori.livneh: snapshot: change $apachedir to /usr/local/apache/common [puppet] - 10https://gerrit.wikimedia.org/r/158250 [21:44:58] (03PS4) 10BBlack: Remove references to deprecated $project-lb.wm.o names [dns] - 10https://gerrit.wikimedia.org/r/157981 [21:47:05] Reedy: https://gerrit.wikimedia.org/r/#/c/158253/ [21:47:54] (03PS5) 10BBlack: Remove references to deprecated $project-lb.wm.o names [dns] - 10https://gerrit.wikimedia.org/r/157981 [21:48:08] (03PS1) 10Ori.livneh: mediawiki: Create /srv/mediawiki symbolic link [puppet] - 10https://gerrit.wikimedia.org/r/158254 [21:49:47] (03CR) 10BBlack: [C: 031] "This is good to go and should cause zero functional change, but I'd like some outside review since it's rather large." [dns] - 10https://gerrit.wikimedia.org/r/157981 (owner: 10BBlack) [21:52:58] Reedy, bd808, my understanding is that that patch will only pin for future wmf branches, right? So we also need a commit on the wmf15 branch? [21:54:43] andrewbogott: Yes. You need a patch to change the .gitmodules to the version you need on the 1.24wmf15 branch of core [22:06:32] (03PS1) 10BBlack: remove dead comment [dns] - 10https://gerrit.wikimedia.org/r/158256 [22:06:45] (03CR) 10BBlack: [C: 032] remove dead comment [dns] - 10https://gerrit.wikimedia.org/r/158256 (owner: 10BBlack) [22:07:00] (03CR) 10Dzahn: [C: 031] "http://www.pathname.com/fhs/pub/fhs-2.3.html#SRVDATAFORSERVICESPROVIDEDBYSYSTEM" [puppet] - 10https://gerrit.wikimedia.org/r/158254 (owner: 10Ori.livneh) [22:08:21] I'd never really thought about it before [22:08:27] ie why it wasn't but the rest were [22:09:28] Now I'm confused… https://wikitech.wikimedia.org/wiki/Special:Version says 2.7 [22:09:41] but when I look at the source it says Version 3.0-alpha [22:09:59] oh, hm... 
[22:10:57] Yeah, they report different versions in Special:Version but the same date [22:11:11] mutante: thanks [22:12:14] um... [22:12:35] reedy, how do you explain this? https://wikitech.wikimedia.org/wiki/Special:UserRights/Meshr https://virt1000.wikimedia.org/wiki/Special:UserRights/Meshr [22:12:52] shell is checked for one and not the other... [22:13:05] also, many more rights available on virt1000 (which I thought I eliminated previously) [22:13:24] I don't have the permissions to view the page [22:14:57] Reedy: how about now? [22:15:11] I was just looking at it on w/index.php?title=Special%3AListUsers&username=Meshr&group=&limit=50 and can see it [22:15:17] The log entry is there [22:15:19] (03CR) 10Dzahn: [C: 04-1] "root@snapshot1001:~# file /apache/common" [puppet] - 10https://gerrit.wikimedia.org/r/158250 (owner: 10Ori.livneh) [22:15:31] ori: /apache/common: symbolic link to `/usr/local/apache/common-local' [22:15:41] but in that change you have just "common" [22:16:10] /usr/local/apache/common exists too and is a symlink to common-local, but regardless i think you're right, it's better to just directly link to the proper target [22:16:12] andrewbogott: I think that might be cache separation [22:16:13] eh. wait..or i'm confused [22:16:13] i'll amend [22:16:21] Reedy: ah, that's possible. [22:16:26] i was about to say, they exist both..
ok ori :) [22:16:31] andrewbogott: without logging out/in, I have the rights on wikitech but not virt1000 [22:16:35] let me login cycle [22:16:50] yeah, I gave you the rights on wikitech but not virt1000 [22:16:59] thing is, it should be the same field in the same db [22:17:16] cache and user_touched etc [22:18:38] (03PS2) 10Ori.livneh: snapshot: change $apachedir to /usr/local/apache/common [puppet] - 10https://gerrit.wikimedia.org/r/158250 [22:19:17] (03PS3) 10Ori.livneh: snapshot: change $apachedir to /usr/local/apache/common-local [puppet] - 10https://gerrit.wikimedia.org/r/158250 [22:19:21] mutante: ^ [22:20:29] (03PS2) 10Ori.livneh: rsyslog: ignore 50-default.conf instead of provisioning it [puppet] - 10https://gerrit.wikimedia.org/r/158135 [22:20:37] (03CR) 10Ori.livneh: [C: 032 V: 032] rsyslog: ignore 50-default.conf instead of provisioning it [puppet] - 10https://gerrit.wikimedia.org/r/158135 (owner: 10Ori.livneh) [22:21:47] andrewbogott: noting that this double entry point to the same wiki/db, with different caches isn't a tried, tested nor recommended solution! ;) [22:22:18] Reedy: yes, agreed! [22:22:27] So, back to my original puzzle over SemanticForms versioning... [22:22:38] (03CR) 10Dzahn: [C: 031] snapshot: change $apachedir to /usr/local/apache/common-local [puppet] - 10https://gerrit.wikimedia.org/r/158250 (owner: 10Ori.livneh) [22:22:44] mutante: <3 thanks! [22:22:46] via 'git log' it looks like the same version is in both places. But that's clearly not really the case. 
[22:22:56] (03PS4) 10Ori.livneh: snapshot: change $apachedir to /usr/local/apache/common-local [puppet] - 10https://gerrit.wikimedia.org/r/158250 [22:23:03] andrewbogott: IMHO, we get it so that it's nearly working in pretty much all cases, switch over, then fix issues as they come up [22:23:06] (03CR) 10Ori.livneh: [C: 032 V: 032] snapshot: change $apachedir to /usr/local/apache/common-local [puppet] - 10https://gerrit.wikimedia.org/r/158250 (owner: 10Ori.livneh) [22:23:15] (03PS2) 10Ori.livneh: mediawiki: Create /srv/mediawiki symbolic link [puppet] - 10https://gerrit.wikimedia.org/r/158254 [22:23:18] andrewbogott: I bet I know the answer to that. You can "blame" bd808 [22:23:22] (03CR) 10Ori.livneh: [C: 032 V: 032] mediawiki: Create /srv/mediawiki symbolic link [puppet] - 10https://gerrit.wikimedia.org/r/158254 (owner: 10Ori.livneh) [22:23:26] how so? [22:23:28] We have a gitinfo cache for hashes and alike [22:23:52] so if the cache hasn't been updated, it'll get the git info from cache, but the version number from php [22:24:04] * bd808 nods [22:24:18] ok... [22:24:22] better than no gitinfo at all but sometimes confusing [22:24:25] heh [22:24:32] does it need scap to fix it? I don't recall [22:24:36] that explains the wrong version number but not why it works in one place but not the other [22:24:39] i'm not sure that it's better than not gitinfo at all [22:24:42] yeah scap rebuilds the cache [22:24:54] better to have no information than be actively misled [22:25:15] andrewbogott: wikitech won't use the gitinfo cache, virt1000 will from production [22:25:38] $wgGitInfoCacheDirectory [22:25:53] We only get "bad" data from non-scap syncs. And thats on my backlog of things to fix (recompute with sync-*) [22:25:54] wikitech will also have the correct .git files in place [22:25:55] how can I find out what is the latest patch running on virt1000? [22:26:22] look at tin? 
[22:26:35] * andrewbogott looks at tin [22:27:15] it's the same [22:27:32] so: SemanticForms version is not the explanation for the issue I'm seeing [22:27:46] Even though everyone agrees that it's a known bug in this version of SemanticForms :( [22:29:01] That bug is due to a core change, IIRC [22:29:02] It's actually a known regression in core [22:29:32] Somebody got a little wild with converting var $foo to protected $foo [22:29:46] FWIW, pinning forms going forward is still a good idea [22:30:03] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [22:31:03] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC [22:31:37] OK, so… the latest patch on wikitech (on "wmf/1.24wmf15" ) is Date: Thu Jul 24 21:07:09 2014 +0100 [22:31:56] Whereas the latest patch on tin in php-1.24wmf15 is Fri Jun 27 00:15:03 2014 +0000 [22:31:58] James_F, do you want that flow change to be swatted? [22:32:04] So, the fix to that bug is probably in there. But why the difference? [22:32:09] Cherry-picks on wikitech? [22:32:15] * andrewbogott does not recall doing any cherry-picking [22:32:16] MaxSem: No, spagewmf will have to do it manually, I think. [22:32:18] MaxSem: But thanks. [22:33:28] andrewbogott: does git suggest any outstanding working copy changes etc? [22:34:08] That latest patch on wikitech is 'Update Translate to 1.24wmf15 HEAD' [22:34:22] No local changes that I can see [22:35:03] PROBLEM - Puppet freshness on rcs1002 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:30:54 UTC [22:35:03] PROBLEM - Puppet freshness on rcs1001 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:31:19 UTC [22:35:13] my local checkout of wmf15 core has yet more changes on it. [22:35:23] So it seems like the wmf15 checkout on tin is just out of date. 
[22:35:47] oh, it's possible [22:35:50] or rather: php-1.24wmf15 does not have the latest checkout of wmf15 [22:35:53] MaxSem accidentally merged something [22:36:14] * Reedy runs git fetch --all in php-1.24wmf15 [22:36:25] thanks [22:37:31] reedy@tin:/a/common/php-1.24wmf15$ git status [22:37:31] # On branch wmf/1.24wmf15 [22:37:31] # Your branch and 'origin/wmf/1.24wmf15' have diverged, [22:37:31] # and have 3 and 1 different commit each, respectively. [22:37:47] mmm, the 3 different are security fixes [22:37:56] Which wikitech might not have had (yay, bonuses) [22:40:24] php-1.24wmf15 is still pretty different from what I see when I do a local checkout of that branch [22:40:28] andrewbogott: So yeah, the security live hacks are part of the difference [22:41:01] it should be good to update virt1000 now [22:41:21] What am I doing wrong such that php-1.24wmf15 differs from my local copy? [22:41:33] not sure [22:41:40] git fetch --all [22:41:42] then git status [22:41:43] I see 4f172bb29487ceb79f7b11f3fecf17adbc0c9385 as the head of origin/wmf/1.24wmf15 [22:41:45] do you not? [22:41:46] what's it say? [22:42:12] I updated my branch 30 minutes ago [22:42:23] I just made a revert to MaxSems accidental commit [22:42:41] 4f172bb29487ceb79f7b11f3fecf17adbc0c9385 is that commit [22:42:47] My revert is d8e5f941ad17634f946dcdf1e7a945a8fc52efe3 [22:43:02] There's then 3 more commits on tins php-1.24wmf15 [22:43:15] https://gerrit.wikimedia.org/r/#/q/I2f46e623c1f541dbbafb6e8333e0929055098b15,n,z [22:43:28] oh, nevermind, the dates are just out of order. [22:43:31] https://gerrit.wikimedia.org/r/#/q/I17d2720fb94bb383a92059e5adbf6c16ee3e9ef4,n,z [22:43:33] They roughly agree. I'll sync. [22:43:37] There shouldn't be any security patches on the cluster atm... 
[22:43:40] https://gerrit.wikimedia.org/r/#/q/I19e2bf3af017a37c35cbadce9a70194aac693f33,n,z [22:43:45] csteipp: 1.24wmf15 :) [22:44:08] wikitech is retro [22:44:15] (03PS1) 10Ori.livneh: Consolidate python-redis package declarations in ::redis::client::python [puppet] - 10https://gerrit.wikimedia.org/r/158262 [22:44:20] bd808: ^ [22:44:23] (03PS1) 10ArielGlenn: reduce number of dumps we keep, need moar space [puppet] - 10https://gerrit.wikimedia.org/r/158263 [22:44:26] fixes puppet on silver, which was broken by your change [22:44:38] bah, still errors out when I edit with form :( [22:44:46] what now? :/ [22:44:53] exactly the same [22:45:06] semanticforms error? [22:45:13] did you need to sync on tin before I did my sync? [22:45:36] (03CR) 10ArielGlenn: [C: 032] reduce number of dumps we keep, need moar space [puppet] - 10https://gerrit.wikimedia.org/r/158263 (owner: 10ArielGlenn) [22:45:43] nope, I'm done [22:46:30] The fix for that issue in core is a contentious issue [22:46:40] (03CR) 10BryanDavis: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/158262 (owner: 10Ori.livneh) [22:46:56] bd808: thanks [22:47:02] andrewbogott: I note that it's only appearing on virt1000 because we have the debugging stuff turned on [22:47:14] Coren: could you +1 possibly? (pinging because RT because puppet breakage) [22:47:30] Reedy: but it's also crashing... [22:47:49] but it works on wikitech? :/ [22:47:52] ori: Just a happy accident that that didn't cause a problem before [22:47:55] yeah [22:47:59] that's... [22:48:15] bd808: yes, not really your fault at all [22:48:25] bd808: that's why i tried fixing it everywhere rather than just the one conflict [22:49:01] puppet really needs something internally for resolving package conflicts [22:49:13] Reedy: ok, if this is a known upstream bug anyway… maybe I should step away and check for other issues.
[22:49:18] I know there are add-ons for it but seriously [22:49:21] If this is the worst of it then we're probably ready to forge ahead [22:49:53] (03PS1) 10Yurik: enabled wgRawHtml on zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158265 [22:50:01] bd808: "on the one hand, this manifest wants me to install python-redis. on the other hand, this other manifest wants me to install python-redis, too. WHAT TO DO?!" [22:50:18] exactly [22:50:30] Stupid puppet [22:50:55] It seems like merging identical nodes in the graph would be simple enough [22:50:57] * YuviPanda floats ensure_package being used more often [22:51:11] apt-get install cowsay cowsay [22:51:12] works fine [22:51:47] bd808: the answer that you'd probably get is that puppet doesn't want to mask possible errors for you ("In the face of ambiguity, refuse the temptation to guess.") [22:51:48] gwicke, are you done with depl? greg-g i would like to push https://gerrit.wikimedia.org/r/#/c/158265/ (rawhtml on zerowiki) [22:51:51] arguably there isn't any ambiguity [22:52:04] but puppet would presumably claim that the ambiguity is about whether the duplication is intentional or not [22:52:14] (03PS1) 10Rush: phabricator better email relay settings [puppet] - 10https://gerrit.wikimedia.org/r/158266 [22:52:45] Next time I'm in Portland I really want to go out for a beer with some Puppet folks and then leave them with the bill :) [22:53:06] yurikR1: subbu is done [22:53:13] They owe me (and the rest of us who use their product) [22:53:17] the window is from 1-2 normally [22:53:17] (03CR) 10jenkins-bot: [V: 04-1] phabricator better email relay settings [puppet] - 10https://gerrit.wikimedia.org/r/158266 (owner: 10Rush) [22:53:31] (03PS2) 10Rush: phabricator better email relay settings [puppet] - 10https://gerrit.wikimedia.org/r/158266 [22:53:46] bd808: just one beer?
[22:54:14] There are some at Rogue that cost enough that I'd be happy [22:54:15] (03CR) 10jenkins-bot: [V: 04-1] phabricator better email relay settings [puppet] - 10https://gerrit.wikimedia.org/r/158266 (owner: 10Rush) [22:54:46] bd808, Reedy, I have one more day to work on this, and then I'm out for F,S,S,M,T. The question is -- do we turn this thing live tonight and hope that it's adequate by the time I'm gone, or turn off virt1000 until I get back and then make it live? [22:54:58] So far apart from edit-with-form things seem to be working ok. [22:55:06] There are a few weird issues which are probably dueling-cache but it's hard to know [22:55:53] andrewbogott: Your call for sure, but if you can rope a few other roots into being Reedy and my eyes and hands on virt1000 it can probably be made to work. [22:56:24] Or give up and let mortals on the host ;) [22:56:25] 'few other roots' probably means Coren and mutante. [22:59:50] Coren, mutante, interested in helping with this during my absence? (There may be hundreds of issues, or just the one. Unclear.) [23:00:05] RoanKattouw, ^d, marktraceur, MaxSem: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140903T2300). [23:00:06] I'll do the SWAT, it's my stuff anyway [23:00:26] andrewbogott: Keeping in mind that I'm on RT duty so very interruptible (and interrupted) [23:01:05] Sweet thanks Max [23:01:28] Coren, you mean this week, or next? [23:01:43] andrewbogott: This week. Did you mean next? [23:01:55] (03CR) 10MaxSem: [C: 032] enabled wgRawHtml on zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158265 (owner: 10Yurik) [23:01:56] I predict (and bd808 may correct) that your mail role will be to a) notice bugs, and b) run 'sync-common' on virt1000 when they ask you to. 
[23:02:00] (03Merged) 10jenkins-bot: enabled wgRawHtml on zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158265 (owner: 10Yurik) [23:02:08] Coren: "I have one more day to work on this, and then I'm out for F,S,S,M,T. The question is -- do we turn this thing live tonight and hope that it's adequate by the time I'm gone, or turn off virt1000 until I get back and then make it live?" [23:02:12] andrewbogott: That seems manageable. [23:02:21] (03CR) 10MaxSem: [C: 032] Disable mobile site notices, too intrusive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157855 (owner: 10MaxSem) [23:02:26] (03Merged) 10jenkins-bot: Disable mobile site notices, too intrusive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157855 (owner: 10MaxSem) [23:02:37] andrewbogott: It seems usable enough to me, even if there are kinks left. [23:02:48] thx MaxSem [23:03:15] Is it possible to /merge/ RT tickets? [23:03:24] Reedy, willing to work for another hour or so while we switch this live? [23:03:44] andrewbogott: I'm just eating now (a bit late ;)), but yeah [23:03:56] cool [23:04:14] Coren: not as far as I know (on merging) [23:04:49] chasemp: Actually yes, I just found it: it's under "links" [23:04:54] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/157855/ https://gerrit.wikimedia.org/r/#/c/158265/ (duration: 00m 04s) [23:05:00] Logged the message, Master [23:05:10] * bd808 checks clock and sends Reedy some antacids [23:05:18] Coren: nice! good deal [23:06:51] (03PS2) 10Reedy: Merge virt1000.wikimedia.org back into wikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158035 [23:06:55] woo, it rebased [23:09:24] So… where does 'mwscript' come from? It seems like we're going to want one of those on wikitech [23:09:54] it's in puppet.
Let me find the class for it [23:10:21] misc::deployment::common_scripts [23:10:22] manifests/misc/deployment.pp [23:10:29] think it's safe to dump that whole class on wikitech? [23:10:48] (03PS1) 10Dzahn: add ca.wm and ca.m.wm for Canada chapter wiki [dns] - 10https://gerrit.wikimedia.org/r/158270 [23:11:52] there's quite a few scripts it doesn't need, but they shouldn't cause any harm [23:11:58] couple of extra packages [23:12:00] I don't see anything scary in there. Other than maybe the /a and /a/common fiels [23:12:03] *files [23:12:15] /a/common being a symlink [23:12:23] (03PS1) 10Andrew Bogott: Add misc::deployment::common_scripts to nova controller. [puppet] - 10https://gerrit.wikimedia.org/r/158272 [23:12:34] You said virt1000 had a /a right? Puppet managed or home brew? [23:13:02] Eeew. /a [23:13:09] mutante: did you mean to add the other m subdomains? [23:13:29] Reedy: no :/ [23:13:43] (03CR) 10Reedy: [C: 04-1] add ca.wm and ca.m.wm for Canada chapter wiki (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/158270 (owner: 10Dzahn) [23:14:23] If I run mwscript on an extension maintenance script... [23:14:32] Would that look like mwscript extensions/Echo/maintenance/processEchoEmailBatch.php ? [23:14:37] yup, exactly [23:14:43] with --wiki=foowiki [23:15:06] (03PS2) 10Dzahn: add ca.wm and ca.m.wm for Canada chapter wiki [dns] - 10https://gerrit.wikimedia.org/r/158270 [23:15:46] (03PS1) 10Andrew Bogott: Change our mediawiki crons to use the new mw install in /usr/local [puppet] - 10https://gerrit.wikimedia.org/r/158274 [23:16:55] Reedy: do you need a break to eat, or are you good to go? [23:17:26] (03CR) 10Reedy: [C: 04-1] Change our mediawiki crons to use the new mw install in /usr/local (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/158274 (owner: 10Andrew Bogott) [23:17:29] I've eaten [23:17:35] (03PS2) 10Andrew Bogott: Add misc::deployment::common_scripts to nova controller. 
[puppet] - 10https://gerrit.wikimedia.org/r/158272 [23:18:22] (03CR) 10Andrew Bogott: Change our mediawiki crons to use the new mw install in /usr/local (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/158274 (owner: 10Andrew Bogott) [23:18:29] (03PS2) 10Andrew Bogott: Change our mediawiki crons to use the new mw install in /usr/local [puppet] - 10https://gerrit.wikimedia.org/r/158274 [23:19:07] (03PS1) 10Dzahn: align the misc wiki section of wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/158275 [23:20:27] (03PS3) 10Andrew Bogott: Add misc::deployment::common_scripts to nova controller. [puppet] - 10https://gerrit.wikimedia.org/r/158272 [23:20:29] (03PS3) 10Andrew Bogott: Change our mediawiki crons to use the new mw install in /usr/local [puppet] - 10https://gerrit.wikimedia.org/r/158274 [23:20:44] (03CR) 10Andrew Bogott: [C: 032] Add misc::deployment::common_scripts to nova controller. [puppet] - 10https://gerrit.wikimedia.org/r/158272 (owner: 10Andrew Bogott) [23:22:13] (03CR) 10Reedy: Change our mediawiki crons to use the new mw install in /usr/local (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/158274 (owner: 10Andrew Bogott) [23:23:11] (03PS1) 10Dzahn: align the misc. services section of wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/158276 [23:23:43] $mutante->ocd++; [23:23:50] (03CR) 10Andrew Bogott: Change our mediawiki crons to use the new mw install in /usr/local (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/158274 (owner: 10Andrew Bogott) [23:24:29] (03PS1) 10Dzahn: blog/techblog - TTL back to 1H [dns] - 10https://gerrit.wikimedia.org/r/158277 [23:24:31] (03CR) 10Reedy: [C: 031] Change our mediawiki crons to use the new mw install in /usr/local [puppet] - 10https://gerrit.wikimedia.org/r/158274 (owner: 10Andrew Bogott) [23:24:37] Reedy: ok, next we merge your config change, and sync… (scap needed?) 
[23:24:53] shouldn't need scap, no [23:24:58] And then merge your puppet change to change the apache conf, and then I re-enable puppet [23:25:00] and everything breaks! [23:25:03] ready? [23:25:33] (03PS3) 10Reedy: Merge virt1000 apache config back into wikitech apache config [puppet] - 10https://gerrit.wikimedia.org/r/157853 [23:25:43] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures [23:25:49] yup, ready I think [23:26:04] "new mw install in /usr/local"? [23:26:08] (03PS1) 10Ori.livneh: Relativize all absolute symlinks that point to /usr/local/apache/* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158278 [23:26:19] (03CR) 10jenkins-bot: [V: 04-1] Relativize all absolute symlinks that point to /usr/local/apache/* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158278 (owner: 10Ori.livneh) [23:26:19] ^ bd808 :) [23:26:22] ori was about to move it to /srv [23:26:23] fucking jenkins [23:27:39] (03CR) 10Reedy: [C: 032] Merge virt1000.wikimedia.org back into wikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158035 (owner: 10Reedy) [23:27:43] (03Merged) 10jenkins-bot: Merge virt1000.wikimedia.org back into wikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158035 (owner: 10Reedy) [23:28:08] blergh, it's right, i think [23:28:11] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 14s) [23:28:17] Logged the message, Master [23:28:36] (03CR) 10Reedy: [C: 031] add ca.wm and ca.m.wm for Canada chapter wiki [dns] - 10https://gerrit.wikimedia.org/r/158270 (owner: 10Dzahn) [23:28:59] (03CR) 10Reedy: [C: 031] align the misc wiki section of wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/158275 (owner: 10Dzahn) [23:29:07] Reedy: do we not need $wgCookieDomain = "wikitech.wikimedia.org"; ? [23:29:14] (03CR) 10Reedy: [C: 031] align the misc. 
services section of wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/158276 (owner: 10Dzahn) [23:29:17] ori: docroot is in common. I think your depth is wrong for the directory traversal [23:30:21] andrewbogott: Uh.. I presumed it defaulted to the domain, apparently not.. I'll re-add it [23:31:46] (03PS1) 10Reedy: Re-add cookie domain for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158280 [23:31:51] (03PS3) 10Dzahn: add ca.wm and ca.m.wm for Canada chapter wiki [dns] - 10https://gerrit.wikimedia.org/r/158270 [23:32:08] (03CR) 10Reedy: [C: 032] Re-add cookie domain for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158280 (owner: 10Reedy) [23:32:13] (03Merged) 10jenkins-bot: Re-add cookie domain for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158280 (owner: 10Reedy) [23:32:20] Reedy: wikitech [23:32:34] lmfao [23:33:07] (03PS1) 10Reedy: virt1000 -> wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158281 [23:33:44] (03CR) 10Reedy: [C: 032] virt1000 -> wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158281 (owner: 10Reedy) [23:33:49] (03Merged) 10jenkins-bot: virt1000 -> wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158281 (owner: 10Reedy) [23:34:20] (03CR) 10CSteipp: "At a cost of 200k, it's taking 2 seconds to generate the hash on my laptop, which isn't much slower than any of our servers. 
Type :B: has" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158024 (https://bugzilla.wikimedia.org/68766) (owner: 10Parent5446) [23:34:30] ok… goodbye sweet virt1000.wikimedia.org, we hardly knew you [23:36:54] (03PS2) 10Ori.livneh: Relativize all absolute symlinks that point to /usr/local/apache/* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158278 [23:37:03] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 23:24:49 UTC [23:37:13] (03CR) 10Dzahn: [C: 032] add ca.wm and ca.m.wm for Canada chapter wiki [dns] - 10https://gerrit.wikimedia.org/r/158270 (owner: 10Dzahn) [23:37:26] (03CR) 10Andrew Bogott: [C: 032] Merge virt1000 apache config back into wikitech apache config [puppet] - 10https://gerrit.wikimedia.org/r/157853 (owner: 10Reedy) [23:37:39] (03PS3) 10Ori.livneh: Relativize all absolute symlinks that point to /usr/local/apache/* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158278 [23:37:47] (03CR) 10Ori.livneh: [C: 032] Relativize all absolute symlinks that point to /usr/local/apache/* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158278 (owner: 10Ori.livneh) [23:37:53] (03Merged) 10jenkins-bot: Relativize all absolute symlinks that point to /usr/local/apache/* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158278 (owner: 10Ori.livneh) [23:38:14] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 11 failures [23:38:22] 11? :/ [23:38:33] PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: Puppet has 1 failures [23:38:45] andrewbogott: we should probably restart memcached on virt1000 when things have settled to give things a clean slate [23:38:53] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Wed Sep 3 23:38:46 UTC 2014 [23:39:00] yay [23:40:01] well... [23:40:04] wikitech is still working? 
[23:40:13] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [23:41:29] Reedy: I think that… worked? [23:41:44] (03PS1) 10BBlack: s/ed1a::0/ed1a::1/ in root-level AAAA recs intended for text-lb.eqiad [dns] - 10https://gerrit.wikimedia.org/r/158283 [23:42:06] andrewbogott: I wonder how we tell? :) [23:42:23] The versions changed in special::version [23:42:32] I tink [23:42:33] think [23:43:10] yeah [23:43:13] (03PS2) 10BBlack: s/ed1a::0/ed1a::1/ in root-level AAAA recs intended for text-lb.eqiad [dns] - 10https://gerrit.wikimedia.org/r/158283 [23:43:16] 1.24wmf15 (85dbd05) [23:43:25] that hash is a security fix :) [23:43:29] (03CR) 10BBlack: [C: 032 V: 032] s/ed1a::0/ed1a::1/ in root-level AAAA recs intended for text-lb.eqiad [dns] - 10https://gerrit.wikimedia.org/r/158283 (owner: 10BBlack) [23:43:31] https://git.wikimedia.org/commit/mediawiki%2Fcore.git/85dbd05aa1717de3e5ac93948c7f065790df36c3 [23:43:34] The fact that it seems to be working seems like proof that we didn't change anything though [23:43:43] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [23:43:55] !log ori Synchronized docroot and w: (no message) (duration: 00m 05s) [23:44:13] There's also more extensions enabled than before [23:44:35] * Reedy high fives andrewbogott and bd808 [23:44:53] well, huh. [23:44:55] Nice work Reedy! [23:45:07] kewl [23:45:19] * andrewbogott tries to edit with form [23:45:36] Yeah, still fails as virt1000 did. [23:45:39] So, that's good news? [23:45:40] sort of [23:45:52] I guess we should put some sort of "fix" in place for that [23:46:02] even if it's just a live hack to revert the protection of that variable [23:46:26] The patch is merged I think.
Just needs a cherry-pick [23:46:28] (03PS1) 10Dzahn: add cawikimedia to dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158284 [23:46:45] https://gerrit.wikimedia.org/r/#/c/151370/ ? [23:47:01] (03PS1) 10Ori.livneh: mediawiki::hhvm: set docroot to /srv/mediawiki/docroot [puppet] - 10https://gerrit.wikimedia.org/r/158285 [23:47:03] https://gerrit.wikimedia.org/r/#/c/157178/ [23:47:28] aha [23:47:36] The other one is scary. Better to revert the whole original patch than that [23:47:47] mutante: easy-peasy: https://gerrit.wikimedia.org/r/#/c/158285/ [23:48:03] cherry picked, just waiting for jenkins now [23:49:44] (03CR) 10Dzahn: [C: 031] "/srv/mediawiki -> /usr/local/apache/common-local" [puppet] - 10https://gerrit.wikimedia.org/r/158285 (owner: 10Ori.livneh) [23:49:52] mutante: <3 <3 <3 [23:50:03] (03CR) 10Ori.livneh: [C: 032] mediawiki::hhvm: set docroot to /srv/mediawiki/docroot [puppet] - 10https://gerrit.wikimedia.org/r/158285 (owner: 10Ori.livneh) [23:50:51] (03PS4) 10Andrew Bogott: Change our mediawiki crons to use the new mw install in /usr/local [puppet] - 10https://gerrit.wikimedia.org/r/158274 [23:52:08] !log reedy Synchronized php-1.24wmf15/includes/EditPage.php: (no message) (duration: 00m 14s) [23:52:09] andrewbogott: ^^ [23:53:06] (03CR) 10Andrew Bogott: [C: 032] Change our mediawiki crons to use the new mw install in /usr/local [puppet] - 10https://gerrit.wikimedia.org/r/158274 (owner: 10Andrew Bogott) [23:53:33] Reedy: lemme fix my crons, then I'll do a sync [23:55:33] RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [23:56:08] * YuviPanda vaguely pokes mutante again [23:56:24] (03PS1) 10Reedy: labswiki to 1.24wmf18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158291 [23:56:50] Reedy, bd808, edit with form works now [23:56:55] sweet [23:58:03] PROBLEM - puppet last run on amssq50 is CRITICAL: CRITICAL: Epic puppet fail [23:59:12] https://dpaste.de/0fx1 <- 
suggests that the jobqueue config is different now [23:59:21] (03PS2) 10Ori.livneh: Consolidate python-redis package declarations in ::redis::client::python [puppet] - 10https://gerrit.wikimedia.org/r/158262 [23:59:29] bd808, Does it run jobs immediately by default?