[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160223T0000). Please do the needful.
[00:00:04] RoanKattouw bd808 James_F bblack: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:16] * James_F waves.
[00:01:10] RoanKattouw: James_F: we're running over a bit with our ad-hoc reserved perf patches deploy window.
[00:01:15] We need another 20min or so
[00:01:27] K-o.
[00:01:33] I'm here
[00:01:44] OK WFM
[00:01:44] My meeting is going over anyway
[00:03:15] (PS8) Andrew Bogott: Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - https://gerrit.wikimedia.org/r/271797 (https://phabricator.wikimedia.org/T124680)
[00:05:40] (PS1) Volans: mariadb: Moved error logs to syslog [puppet] - https://gerrit.wikimedia.org/r/272639 (https://phabricator.wikimedia.org/T127636)
[00:08:54] (CR) Krinkle: [C: 1] Bugfix: wrong value format for wgReferrerPolicy [mediawiki-config] - https://gerrit.wikimedia.org/r/272517 (https://phabricator.wikimedia.org/T87276) (owner: BBlack)
[00:10:08] Operations, ops-eqiad: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2053650 (RobH) IRC Update: We should also test using a 6Gbps external SAS controller in an HP DL360. Once the new restbase systems arrive and restbase1001-1006 goes to spare, we can u...
[00:10:43] !log Depooling mw1099 for debugging
[00:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:16:53] ok I'm out for now. please merge https://gerrit.wikimedia.org/r/272517 when the time comes, someone :P
[00:20:47] Krinkle: Please ping me when you're done
[00:21:07] what time is that, next SWAT window?
[00:21:19] * apergos wonders vaguely if they'll still be in here awake then
[00:21:27] RoanKattouw: Yeah, currently waiting to resolve a cherry-pick conflict
[00:21:34] apergos: It's meant to be now, but Krinkle is running over.
[00:21:34] apergos: I'll superintend, don't worry.
[00:21:35] related to sessionmanager perf regressions
[00:22:16] awesome James_F... I'm happy to be a pair of hands but I'm really not safe around the cluster at this state of tired
[00:22:28] apergos: No worries at all. :-)
[00:22:37] :-)
[00:23:02] fyi my patch in SWAT is beta cluster only if that changes anything :)
[00:33:42] ori: You know after what happened earlier, we should have some (at least rudimentary) test that ensures those links all point to the right place.
[00:34:06] (or kill more of them :P :P)
[00:34:23] or have a test so we can be more confident about killing more of them! :)
[00:34:36] Hehe
[00:42:05] kill kill
[00:44:23] RoanKattouw: lock.release()
[00:44:36] RoanKattouw: thanks for your patience
[00:44:52] lock.acquire()
[00:44:56] :)
[00:45:00] Thanks for the pings guys
[00:45:07] It was more convenient for me to postpone anyway
[00:47:21] (CR) Catrope: [C: 2] Freeze LQT on fiwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/271942 (https://phabricator.wikimedia.org/T127576) (owner: Catrope)
[00:48:08] (Merged) jenkins-bot: Freeze LQT on fiwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/271942 (https://phabricator.wikimedia.org/T127576) (owner: Catrope)
[00:51:48] ori: Didn't you say you had an idea to make sync-masters not be super slow?
[00:51:58] It used to be that sync-file consistently took ~50s and now it consistently takes ~2 mins
[00:52:11] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Freeze LQT on fiwikimedia (duration: 01m 39s)
[00:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:52:36] sync-masters alone is 1m11s out of the 1m39s sync-file takes
[00:52:51] RoanKattouw: there's a pending patch that may fix things
[00:54:09] RoanKattouw: we found an mtime bug in python caused by numeric truncation -- https://phabricator.wikimedia.org/D132
[00:54:42] right now mira is seeing all l10n caches as changed on each sync and rebuilding all the CDB files
[00:55:08] aha
[00:55:13] (CR) Catrope: [C: 2] Add Tool namespace to wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/268616 (https://phabricator.wikimedia.org/T122865) (owner: BryanDavis)
[00:55:24] but updating scap is more complicated these days because it's packaged as a deb now
[00:55:36] (PS5) Bmansurov: Enable the structured language overlay and increase the instrumentation rate [mediawiki-config] - https://gerrit.wikimedia.org/r/271264 (https://phabricator.wikimedia.org/T123980)
[00:55:53] (Merged) jenkins-bot: Add Tool namespace to wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/268616 (https://phabricator.wikimedia.org/T122865) (owner: BryanDavis)
[00:55:57] * bd808 rubs hands greedily for Tool namespace
[00:56:16] ohhhh
[00:56:18] nice change
[00:56:46] next I will have to convince people to actually document their tools...
[00:56:52] details details
[00:59:19] * RoanKattouw almost typed "Add Toll namespace on wikitech"
[00:59:28] Means something very different depending on whether you read that in English or German, as well
[01:00:35] RoanKattouw, Troll!
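On the mtime bug mentioned at 00:54: the following is a minimal sketch of how numeric truncation can make every file look changed on each sync. It assumes, hypothetically, that a tool records a file's mtime as a truncated integer and later compares it against the full floating-point value. The function names are illustrative only and are not taken from scap or from D132.

```python
# Hypothetical illustration only; this is not scap's actual code.

def record_mtime_truncated(mtime: float) -> int:
    """Store an mtime the lossy way: int() drops the fractional seconds."""
    return int(mtime)


def looks_changed(recorded: float, current: float) -> bool:
    """Naive staleness check: any difference counts as a change."""
    return recorded != current


def looks_changed_whole_seconds(recorded: float, current: float) -> bool:
    """Safer check: compare both sides at the same whole-second precision."""
    return int(recorded) != int(current)


# An unmodified file whose mtime has a sub-second component.
current_mtime = 1456188849.5
recorded_mtime = record_mtime_truncated(current_mtime)

# The file never changed, yet the naive check reports a change on every run.
print(looks_changed(recorded_mtime, current_mtime))                # True
print(looks_changed_whole_seconds(recorded_mtime, current_mtime))  # False
```

Comparing at whole-second precision trades sensitivity for stability: an edit within the same second would go unnoticed, so a content hash is likely the more robust signal where that matters.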
[01:00:44] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Add Tool namespace on wikitech (duration: 01m 37s)
[01:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:01:56] Ahm, wat
[01:02:08] Do maintenance scripts against wikitech wikis have to be run on silver or something?
[01:02:15] $ mwscript namespaceDupes.php --wiki=labswiki
[01:02:18] Warning: require_once(/etc/mediawiki/WikitechPrivateSettings.php): failed to open stream: No such file or directory in /srv/mediawiki-staging/wmf-config/wikitech.php on line 171
[01:02:30] oh they probably do RoanKattouw
[01:03:32] And guess who doesn't have access to that box
[01:03:40] bd808: Mind running namespaceDupes.php yourself?
[01:04:15] (CR) Catrope: [C: 2] Bugfix: wrong value format for wgReferrerPolicy [mediawiki-config] - https://gerrit.wikimedia.org/r/272517 (https://phabricator.wikimedia.org/T87276) (owner: BBlack)
[01:04:26] I'll do it
[01:04:54] (Merged) jenkins-bot: Bugfix: wrong value format for wgReferrerPolicy [mediawiki-config] - https://gerrit.wikimedia.org/r/272517 (https://phabricator.wikimedia.org/T87276) (owner: BBlack)
[01:04:55] !log ran `mwscript namespaceDupes.php --wiki=labswiki` on silver
[01:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:05:10] ori: Don't forget labtestwik
[01:05:11] i
[01:05:25] do I need --fix btw?
[01:05:46] Not likely there's anything there :) but it's also in the wikitech group, so for completeness
[01:05:46] Oh, ahm yes you do
[01:06:25] ok, done, 1 links to fix, 1 were resolvable. (and repeating the invocation shows nothing to fix)
[01:06:40] but labtestwiki isn't on silver
[01:07:00] labtestwiki did fail on tin
[01:07:02] It probably doesn't matter
[01:07:06] Who cares about test wikis, right :P
[01:07:20] is labtestwiki on labs?
[01:07:28] ori: OK yeah that sounds like it's fixed things, thanks
[01:07:45] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Fix wgReferrerPolicy (duration: 01m 36s)
[01:07:47] Oh! Maybe
[01:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:08:02] * RoanKattouw is confused as to why https://wikitech.wikimedia.org/wiki/Special:AllPages?from=&to=&namespace=116 is empty
[01:08:09] RoanKattouw: yeah. I'll run it
[01:08:38] Ori says he ran it
[01:08:39] With --fix, and that repeating it showed nothing
[01:08:44] I suppose it might have cleaned up things in a different namespace though
[01:08:55] ori: Do you still have the output of that script handy?
[01:09:02] I think so, hang on
[01:09:31] I don't know that there were any Tool:foo pages in the main namespace
[01:10:07] here you go: https://dpaste.de/ZSSk/raw
[01:10:12] RoanKattouw: ori i need to head off soon - is there any chance i could get https://gerrit.wikimedia.org/r/#/q/271322,n,z fast-tracked?
[01:10:41] Sure
[01:10:42] I'm not doing the SWAT; RoanKattouw is
[01:10:48] (CR) Catrope: [C: 2] Revert "Strip references for experimentation" [mediawiki-config] - https://gerrit.wikimedia.org/r/271322 (https://phabricator.wikimedia.org/T126390) (owner: Jdlrobson)
[01:11:01] Sorry for the delay there jdlrobson , I got distracted with the wikitech confusion there
[01:11:16] thx RoanKattouw (ori merely mentioned you since you and RoanKattouw seemed involved in important stuffz)
[01:11:35] Might take 15ish mins for that change to make its way over to labs though because that deploy runs on a timer largely outside of human control
[01:11:37] I'll check on its status
[01:12:28] Yeah maybe 5-10, I'll update you
[01:12:49] ori: hmm so your run showed a page named Tool:Glamtools that was supposedly fixed?
[01:12:50] I should really do labs changes first to mitigate that problem, because merging production config patches makes it worse
[01:13:24] (Merged) jenkins-bot: Revert "Strip references for experimentation" [mediawiki-config] - https://gerrit.wikimedia.org/r/271322 (https://phabricator.wikimedia.org/T126390) (owner: Jdlrobson)
[01:14:22] bd808: yeah, I copy-pasted the output verbatim.
[01:15:18] ori: Oh! That page was a pagelink ref from a prior move
[01:15:25] https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Tools/Tools/glamtools&action=history
[01:15:28] all good
[01:15:48] (Tim Landscheidt moved page Tool:glamtools to Nova Resource:Tools/Tools/glamtools without leaving a redirect)
[01:16:31] jdlrobson: ETA 5 mins till your change is on beta
[01:16:41] (Not very scientific)
[01:16:50] RoanKattouw: yup thanks am watching https://integration.wikimedia.org/ci/job/beta-scap-eqiad/
[01:17:17] Sadly the job that's currently running doesn't have your change
[01:17:26] The job that's going to pull in your change is first waiting for the current job to finish
[01:17:32] Jenkins Logic (TM)
[01:17:53] So then another beta-scap-eqiad will run soon after and that one will have your change
[01:22:50] jdlrobson: OK once https://integration.wikimedia.org/ci/job/beta-scap-eqiad/90729/console finishes yours should be done
[01:23:07] That should take ~2 minutes
[01:25:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[01:25:55] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[01:27:34] thanks RoanKattouw all good!
[01:27:50] Excellent
[01:28:44] !log catrope@tin Synchronized php-1.27.0-wmf.14/extensions/VisualEditor: SWAT (duration: 01m 34s)
[01:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:29:43] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[01:31:05] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[01:31:15] is swat on now?
[01:31:27] bblack: It just finished
[01:31:52] The referrer patch went out at :07
[01:32:33] ok thanks, back to dinner for me!
[01:32:48] James_F: Your VE patch should be live too
[01:32:55] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[01:33:28] (PS1) Chad: Remove skel-1.5 symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/272649
[01:35:34] (CR) Catrope: [C: -1] "Per Krinkle's comments" [mediawiki-config] - https://gerrit.wikimedia.org/r/270460 (https://phabricator.wikimedia.org/T49662) (owner: Mattflaschen)
[01:36:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:36:45] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:38:15] (PS5) Catrope: Add Echo site icons for all of the remaining families. [mediawiki-config] - https://gerrit.wikimedia.org/r/270460 (https://phabricator.wikimedia.org/T49662) (owner: Mattflaschen)
[01:39:44] (PS6) Catrope: Add Echo site icons for all of the remaining families. [mediawiki-config] - https://gerrit.wikimedia.org/r/270460 (https://phabricator.wikimedia.org/T49662) (owner: Mattflaschen)
[01:49:49] (CR) Gergő Tisza: [C: -2] "Scheduling this for next Tuesday so that current Gather users can be notified first. See task for details."
[mediawiki-config] - https://gerrit.wikimedia.org/r/271932 (https://phabricator.wikimedia.org/T127509) (owner: MarcoAurelio)
[01:55:01] RoanKattouw: Ta.
[02:05:33] Operations, Editing-Department, Performance-Team, Performance, user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2053929 (Anomie) >>! In T126700#2052198, @Krinkle wrote: > 1/4th of the backend time (or 170ms) is being spent in WebRequest::getSession() and unde...
[02:18:04] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[02:19:27] Hm.. mwgrep isn't working as expected.
[02:19:28] https://pl.wikipedia.org/w/index.php?search=UsabilityInitiative&profile=advanced&ns8=1
[02:19:42] That one is not found by "$ mwgrep UsabilityInitiative"
[02:21:47] Krinkle: are you sure? It works for me from tin
[02:21:59] it's the first result
[02:23:36] ori: Yeah, it shows up now
[02:23:40] I purged/null edited the page
[02:23:47] but it wasn't in there at first
[02:23:52] ran it several times even
[02:24:05] a gap in CirrusSearch? dunno
[02:24:07] I did that before I mentioned it here
[02:24:11] unlikely to be a bug in mwgrep
[02:24:13] So I thought the purge didn't work
[02:24:18] but probably delayed
[02:24:23] yeah, it seems like the index was having a gap
[02:24:27] yeah, it goes through the job queue
[02:24:31] iirc
[02:24:45] I think there are many gaps
[02:25:00] it's because you live in london and take the tube
[02:25:06] something is producing a fair number of requests to /w/extensions/UsabilityInitiative/css/combined.min.css?3
[02:25:11] ori: hehe
[02:25:17] and they all have on-wiki referrers
[02:25:21] so likely a gadget or site script
[02:25:24] but can't find it
[02:25:41] it's not zhwiki MediaWiki:MainPageScript.js ?
[02:26:45] anyhow I'm off, bye!
[02:26:51] (PS1) CSteipp: Password policies for advanced permission groups [mediawiki-config] - https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100)
[02:27:27] !log switching mw1017 to wmf.12 for perf tests
[02:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:27:44] ori: nope, that one is unused. It matches because of a comment I left in that file just now
[02:29:39] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.13) (duration: 13m 34s)
[02:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:44:16] ori: Filed as https://phabricator.wikimedia.org/T127788 - got one live one remaining
[02:45:44] (PS2) CSteipp: [WIP] Password policies for advanced permission groups [mediawiki-config] - https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100)
[02:46:19] (CR) jenkins-bot: [V: -1] [WIP] Password policies for advanced permission groups [mediawiki-config] - https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: CSteipp)
[02:53:53] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.14) (duration: 11m 44s)
[02:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:54:33] Operations, Editing-Department, Performance-Team, Performance, user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2053999 (Tgr) {F3402594} {F3402595} {F3402596} Tests valid vs. valid with expired backend vs. no session on a few request types. action=raw with a...
[02:56:07] Who's on ops clinic duty right now? The topic isn't showing anyone.
[03:03:06] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Feb 23 03:03:06 UTC 2016 (duration 9m 13s)
[03:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:09:38] (CR) Alex Monk: "Perhaps a topic for another day, but: since those global groups can be renamed on-wiki, I wonder if we should be setting up policies as pa" (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: CSteipp)
[03:46:18] (PS2) BBlack: parsoidcache: remove from LVS [puppet] - https://gerrit.wikimedia.org/r/272322 (https://phabricator.wikimedia.org/T110472)
[03:50:28] (PS2) BBlack: cache_parsoid: remove from DNS [dns] - https://gerrit.wikimedia.org/r/272484 (https://phabricator.wikimedia.org/T110474)
[03:51:00] (CR) BBlack: [C: 2] parsoidcache: remove from LVS [puppet] - https://gerrit.wikimedia.org/r/272322 (https://phabricator.wikimedia.org/T110472) (owner: BBlack)
[03:57:44] RECOVERY - PyBal backends health check on lvs1008 is OK: PYBAL OK - All pools are healthy
[04:03:15] PROBLEM - PyBal backends health check on lvs1008 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1002.eqiad.wmnet because of too many down!
[04:03:39] it's been in that state for a while, it just cycled through ok back to unhealthy due to pybal restart
[04:04:17] PROBLEM - Host parsoid-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[04:04:22] PROBLEM - Host parsoid-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:04:41] ^ expected
[04:05:03] !log parsoid-lb.eqiad.wikimedia.org turned off
[04:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:06:56] (CR) BBlack: [C: 2] cache_parsoid: remove from DNS [dns] - https://gerrit.wikimedia.org/r/272484 (https://phabricator.wikimedia.org/T110474) (owner: BBlack)
[04:08:03] (PS1) Dzahn: fix whitespace-related lint issues [puppet] - https://gerrit.wikimedia.org/r/272666
[04:09:23] (CR) Dzahn: [C: 1] "git review -d 272666 ; git show -w <-- will show how this is no diff" [puppet] - https://gerrit.wikimedia.org/r/272666 (owner: Dzahn)
[04:10:44] Operations, Services, Traffic, Patch-For-Review: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#2054102 (BBlack)
[04:10:47] Operations, Parsoid, Traffic, VisualEditor, and 2 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2054101 (BBlack) stalled>Resolved
[04:12:42] (CR) Tim Landscheidt: "http://tools.wmflabs.org/watroles/project/ores: ores-staging-01, ores-web-01, ores-worker-01, ores-worker-02, ores-lb-02, ores-worker-03, " [puppet] - https://gerrit.wikimedia.org/r/270102 (owner: Tim Landscheidt)
[04:14:49] (CR) Dzahn: "TIL that watroles also has "project" not just "roles" :)" [puppet] - https://gerrit.wikimedia.org/r/270102 (owner: Tim Landscheidt)
[04:23:20] (PS2) BBlack: decom cache_parsoid [puppet] - https://gerrit.wikimedia.org/r/272323 (https://phabricator.wikimedia.org/T110472)
[04:23:59] (PS1) Dzahn: phabricator: move roles to module/role/ [puppet] -
https://gerrit.wikimedia.org/r/272667
[04:31:10] (PS1) Dzahn: backup: move roles to modules/role/ [puppet] - https://gerrit.wikimedia.org/r/272668
[04:34:04] (PS3) BBlack: upload/misc VCL: remove t2-be bypass trick [puppet] - https://gerrit.wikimedia.org/r/271966 (https://phabricator.wikimedia.org/T127481)
[04:34:06] (PS2) BBlack: vcl_(hit|miss|pass|deliver): single definition [puppet] - https://gerrit.wikimedia.org/r/271991 (https://phabricator.wikimedia.org/T127481)
[04:34:08] (PS2) BBlack: vcl_recv: single definition in wikimedia.vcl [puppet] - https://gerrit.wikimedia.org/r/271990 (https://phabricator.wikimedia.org/T127481)
[04:34:10] (PS2) BBlack: cache_misc: call recv_purge like others [puppet] - https://gerrit.wikimedia.org/r/271987 (https://phabricator.wikimedia.org/T127481)
[04:34:12] (PS2) BBlack: Remove sub vcl_foo from sub-includes [puppet] - https://gerrit.wikimedia.org/r/271986 (https://phabricator.wikimedia.org/T127481)
[04:34:24] (Abandoned) BBlack: cache_parsoid: explicit default vcl_recv behavior [puppet] - https://gerrit.wikimedia.org/r/271989 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[04:34:29] (Abandoned) BBlack: cache_parsoid: use recv_purge [puppet] - https://gerrit.wikimedia.org/r/271988 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[04:37:37] (CR) BBlack: [C: 1] gdash: decom [puppet] - https://gerrit.wikimedia.org/r/272427 (https://phabricator.wikimedia.org/T104365) (owner: Ori.livneh)
[04:48:50] (PS3) Ori.livneh: gdash: decom [puppet] - https://gerrit.wikimedia.org/r/272427 (https://phabricator.wikimedia.org/T104365)
[04:49:13] (CR) Ori.livneh: [C: 2 V: 2] gdash: decom [puppet] - https://gerrit.wikimedia.org/r/272427 (https://phabricator.wikimedia.org/T104365) (owner: Ori.livneh)
[05:06:45] !log Deleted gdash docroot on graphite2001 and krypton
[05:06:49] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:16:03] (PS1) Ori.livneh: remove gdash.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/272670 (https://phabricator.wikimedia.org/T104365)
[05:16:56] (PS2) Ori.livneh: xhgui: profile 1:10,000 requests [mediawiki-config] - https://gerrit.wikimedia.org/r/272616
[05:17:04] (CR) Ori.livneh: [C: 2] xhgui: profile 1:10,000 requests [mediawiki-config] - https://gerrit.wikimedia.org/r/272616 (owner: Ori.livneh)
[05:17:32] (Merged) jenkins-bot: xhgui: profile 1:10,000 requests [mediawiki-config] - https://gerrit.wikimedia.org/r/272616 (owner: Ori.livneh)
[05:20:51] !log ori@tin Synchronized wmf-config/StartProfiler.php: (no message) (duration: 01m 46s)
[05:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:25:03] !log StartProfiler.php sync was of Ic952fab90f: xhgui: profile 1:10,000 requests
[05:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:45:42] (PS1) Ori.livneh: Add a comment about manually-created MongoDB indexes [puppet] - https://gerrit.wikimedia.org/r/272673
[05:46:41] (PS2) Ori.livneh: Add a comment about manually-created MongoDB indexes [puppet] - https://gerrit.wikimedia.org/r/272673
[05:47:02] (CR) Ori.livneh: [C: 2 V: 2] Add a comment about manually-created MongoDB indexes [puppet] - https://gerrit.wikimedia.org/r/272673 (owner: Ori.livneh)
[06:04:44] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[06:29:47] ori, who said MONGO? :P
[06:31:40] Mango, I said mango!
[06:31:43] Mangos are delicious.
[06:31:54] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:59] Mungos eat snakes
[06:32:01] (PS1) Dzahn: ferm: fix "not documented" warnings [puppet] - https://gerrit.wikimedia.org/r/272674
[06:32:30] yeah mongooses are pretty cool
[06:33:13] ah:) it's Mungo in German
[06:33:47] heh, that's nicer
[06:34:54] PROBLEM - puppet last run on mw1157 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:40:45] (PS1) Dzahn: logstash: fix top-scope var w/o namespace [puppet] - https://gerrit.wikimedia.org/r/272675
[06:55:28] Operations, Puppet: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797#2054254 (Dzahn)
[06:56:25] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:37] Operations, Puppet: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797#2054254 (Dzahn) p:Triage>Low
[07:00:00] <_joe_> mutante: can we mark such tasks as needing volunteers in some way?
[07:00:22] <_joe_> it's the typical relatively easy thing to do and we should start trying to involve volunteers more
[07:01:24] RECOVERY - puppet last run on mw1157 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[07:01:53] Operations, Puppet, Need-volunteer: document all puppet classes / defined types!?
- https://phabricator.wikimedia.org/T127797#2054281 (Dzahn)
[07:01:58] _joe_: yes, i like that. done :)
[07:02:14] _joe_: i'm gonna look for volunteers at wikimania specifically
[07:02:23] it will be good to have such a list
[07:03:00] used existing tag
[07:03:39] <_joe_> nice
[07:03:47] we also need an ops section on
[07:03:50] https://www.mediawiki.org/wiki/Annoying_little_bugs
[07:03:53] or something like that
[07:04:03] <_joe_> or we create our own on wikitech
[07:04:14] <_joe_> or we just create a custom search on phab
[07:04:25] see the links from that page, they are already deep links into phabricator
[07:04:33] to tickets that are marked "easy" somehow
[07:06:49] <_joe_> if we want that to work, it should be known to all ops and the onduty person
[07:07:09] right, yes
[07:09:44] gotta say good night for now, tbc
[07:11:35] etherpad is disconnecting me almost constantly, can someone poke it?
[07:29:49] (CR) Giuseppe Lavagetto: "@Faidon I concur with you, actually it seems like a good idea to rename the function sooner than later :)" [puppet] - https://gerrit.wikimedia.org/r/271259 (owner: Giuseppe Lavagetto)
[07:53:41] mutante _joe_ There is also a documentation project
[07:53:48] that you can tag
[07:53:57] (Supersedes T2001)
[07:59:22] (PS2) Muehlenhoff: Remove outdated comment [puppet] - https://gerrit.wikimedia.org/r/272495
[07:59:31] (CR) Muehlenhoff: [C: 2 V: 2] Remove outdated comment [puppet] - https://gerrit.wikimedia.org/r/272495 (owner: Muehlenhoff)
[08:01:43] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[08:03:43] (PS2) Jcrespo: Depooled es2010, controller issue [mediawiki-config] - https://gerrit.wikimedia.org/r/272628 (https://phabricator.wikimedia.org/T127769) (owner: Volans)
[08:04:07] ^let's apply this
[08:04:23] (I will do it if you are not around)
[08:07:41] (CR) Jcrespo: [C: 2] Depooled es2010, controller issue [mediawiki-config] - https://gerrit.wikimedia.org/r/272628 (https://phabricator.wikimedia.org/T127769) (owner: Volans)
[08:11:44] does this look like a controller issue, or could it just be the RAID being broken because too many disks failed? https://phabricator.wikimedia.org/T127769
[08:14:38] (PS1) Giuseppe Lavagetto: [WiP] Add ipvs-related FSM [debs/pybal] - https://gerrit.wikimedia.org/r/272679
[08:15:56] (CR) jenkins-bot: [V: -1] [WiP] Add ipvs-related FSM [debs/pybal] - https://gerrit.wikimedia.org/r/272679 (owner: Giuseppe Lavagetto)
[08:15:58] (CR) Jcrespo: "Before my +1, can you test this extensively? Even if I suggested it, this is the kind of change that could have bad consequences like:" [puppet] - https://gerrit.wikimedia.org/r/272639 (https://phabricator.wikimedia.org/T127636) (owner: Volans)
[08:24:36] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[08:24:43] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[08:26:03] jynus: ^^^ it's you?
[08:26:12] yes
[08:27:23] do you want to merge yourself?
[08:27:35] as the patch was yours
[08:27:59] ok I can take care
[08:29:01] I have already started
[08:29:34] yeah, saw your sync file on tin :)
[08:30:04] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[08:30:04] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[08:30:23] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool es2010 (duration: 01m 34s)
[08:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:32:23] PROBLEM - Disk space on ms-be2003 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdh1 is not accessible: Input/output error
[08:33:05] PROBLEM - RAID on ms-be2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[08:42:00] (PS1) Giuseppe Lavagetto: wmflib: fix failing test [puppet] - https://gerrit.wikimedia.org/r/272682
[08:44:05] (CR) Giuseppe Lavagetto: [C: 2] wmflib: fix failing test [puppet] - https://gerrit.wikimedia.org/r/272682 (owner: Giuseppe Lavagetto)
[08:47:43] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:55:38] (CR) Hoo man: [C: 1] Non critical DBA pages should not send an sms to the DBA group [puppet] - https://gerrit.wikimedia.org/r/272478 (owner: Jcrespo)
[09:02:44] RECOVERY - Disk space on ms-be2003 is OK: DISK OK
[09:13:10] (PS4) Giuseppe Lavagetto: ipresolve: add PTR resolution, tests [puppet] - https://gerrit.wikimedia.org/r/271259
[09:14:38] (PS1) Muehlenhoff: Backport upstream fix 062c189fee20c18fae5ac3716a7379143d64150e which deals with changes in OpenSSL's SSL_shutdown() function during SSL handshakes introduced in 1.0.2f (causing false positive critical errors) Bug: T126616 [software/nginx] - https://gerrit.wikimedia.org/r/272685 (https://phabricator.wikimedia.org/T126616)
[09:16:37] (PS5) Giuseppe Lavagetto: ipresolve: add PTR resolution, tests [puppet] - https://gerrit.wikimedia.org/r/271259
[09:18:15] (CR) Hashar: [C: -1] Move ORES settings to beta features part (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/272526 (owner: Ladsgroup)
[09:20:21] Blocked-on-Operations, Operations, Continuous-Integration-Infrastructure, HHVM: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so no such file or directory -
https://phabricator.wikimedia.org/T126658#2054518 (ema) p:Triage>Normal
[09:21:55] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail
[09:23:44] Blocked-on-Operations, Operations, Continuous-Integration-Infrastructure, HHVM: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so no such file or directory - https://phabricator.wikimedia.org/T126658#2020104 (Joe) @hashar no we definitely don't want the debian package to deal with this...
[09:27:28] (CR) Giuseppe Lavagetto: [C: 2] ipresolve: add PTR resolution, tests [puppet] - https://gerrit.wikimedia.org/r/271259 (owner: Giuseppe Lavagetto)
[09:28:23] (CR) Jcrespo: "Thoughts for non-dbas?" [puppet] - https://gerrit.wikimedia.org/r/272478 (owner: Jcrespo)
[09:29:28] Operations, Traffic, Patch-For-Review: Create separate packages for required vmods - https://phabricator.wikimedia.org/T124281#2054535 (ema) p:Triage>Normal
[09:30:05] Operations, Traffic, Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2054536 (ema) p:Triage>Normal
[09:30:46] (CR) Giuseppe Lavagetto: [C: 1] "Seems reasonable; whoever wants pages from databases should just add her/himself to the dba contact group." [puppet] - https://gerrit.wikimedia.org/r/272478 (owner: Jcrespo)
[09:35:51] (PS3) Jcrespo: Non critical DBA pages should not send an sms to the DBA group [puppet] - https://gerrit.wikimedia.org/r/272478
[09:36:03] (CR) Giuseppe Lavagetto: role::memcached: add cross-dc Ipsec for the various shards. (3 comments) [puppet] - https://gerrit.wikimedia.org/r/271260 (https://phabricator.wikimedia.org/T126470) (owner: Giuseppe Lavagetto)
[09:36:38] (CR) Jcrespo: [C: 2] "Merging this now, we can later setup/fix the email issue as a further iteration."
[puppet] - 10https://gerrit.wikimedia.org/r/272478 (owner: 10Jcrespo) [09:37:20] (03PS2) 10Filippo Giunchedi: disable package-installed initscript [puppet] - 10https://gerrit.wikimedia.org/r/272612 (https://phabricator.wikimedia.org/T127365) (owner: 10Eevans) [09:39:01] (03PS2) 10Filippo Giunchedi: Enable async secondary swift writes for non-"big" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272611 (https://phabricator.wikimedia.org/T91869) (owner: 10Aaron Schulz) [09:39:27] (03CR) 10Filippo Giunchedi: [C: 031] Enable async secondary swift writes for non-"big" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272611 (https://phabricator.wikimedia.org/T91869) (owner: 10Aaron Schulz) [09:40:32] hoo, volans after merging 272478, I will want to do some page tests, so that we can check that it is effectively working (no more spam, but critical checks still work) [09:41:10] sure, go ahead [09:41:53] PROBLEM - puppet last run on db2042 is CRITICAL: CRITICAL: puppet fail [09:42:02] ops [09:42:03] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, will merge in coordination with services" [puppet] - 10https://gerrit.wikimedia.org/r/272536 (https://phabricator.wikimedia.org/T127747) (owner: 10Eevans) [09:42:37] jynus: Sounds good... breaking pages might not get noticed in time... 
[09:43:04] I will log it when I do it (I am checking first I have not broken puppet) [09:43:09] 6Operations, 10hardware-requests, 13Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2054563 (10fgiunchedi) @Cmjohnson we'd still need to upgrade ram/cpu on restbase1008 / restbase1009, let's coordinate that for today [09:43:30] moritzm: I've heard that you are planning an OS upgrade on logstash for tomorrow [09:43:44] RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [09:44:01] seems like a good time to upgrade elastic search to 1.7.5 at the same time (I talked about it with bd808) [09:44:02] no, db2042 unrelated [09:44:13] moritzm: how do we coordinate this? [09:44:28] (03PS3) 10Filippo Giunchedi: alert only on external (parity w/ dashboards) [puppet] - 10https://gerrit.wikimedia.org/r/272498 (owner: 10Eevans) [09:44:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] alert only on external (parity w/ dashboards) [puppet] - 10https://gerrit.wikimedia.org/r/272498 (owner: 10Eevans) [09:47:48] 6Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics: access for nikerabbit to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T127808#2054571 (10Nikerabbit) [09:47:48] gehel: not a full OS upgrade, but I'll upgrade kernels. we can certainly bundle this if 1.7.5 is sufficiently tested in labs [09:48:30] do you have a ticket for the general ES upgrade? [09:48:36] moritzm: I'm still trying to understand what "sufficiently tested" means in this context. 
[09:49:25] icinga failed to restart, but I think the config is ok [09:49:27] moritzm: T122697, but its initial context was more about upgrading ES for the Discovery cluster [09:49:27] i'd say "tested in labs to the extent that you feel comfortable to move this to the production logstash ES" [09:49:33] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:49:40] 6Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics: access for nikerabbit to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T127808#2054590 (10Arrbee) This request is approved. Thanks. [09:51:36] 6Operations, 6Discovery, 10Wikimedia-Logstash, 3Discovery-Search-Sprint, 7Elasticsearch: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697#2054593 (10Gehel) v 1.7.5 has been applied on Discovery's elasticsearch cluster in labs. With the help of @bd808 the version 1.7.5 has been... [09:51:43] (03PS4) 10Giuseppe Lavagetto: role::memcached: add cross-dc IPsec for the various shards [puppet] - 10https://gerrit.wikimedia.org/r/271260 (https://phabricator.wikimedia.org/T126470) [09:51:57] moritzm: I'll check with bd808 this evening if he can think of additional tests. [09:52:42] but since logstash* is currently at 1.7.1 and this is supposed to be a compatible 1.7.x series and it's running in labs let's just bundle it in, we can start with one of the 100[1-3] nodes (which are not master-eligible) [09:54:01] the 1.7.5 .deb is not yet in reprepro. If I understood correctly, we prefer to do a first deployment "manually" and update reprepro once all nodes are up to date. [09:54:08] 6Operations, 6Discovery, 10Wikimedia-Logstash, 3Discovery-Search-Sprint, 7Elasticsearch: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697#1911562 (10MoritzMuehlenhoff) We can bundle tomorrow's logstash* kernel upgrade to also make the upgrade, logstash* is currently at 1.7.1 and... 
[09:56:05] gehel: you can just as well already add it there (since it has gotten some testing now, we mostly avoid adding completely untested packages to the repo). we don't use any automated upgrades for such packages and if it turns out to be broken, we can easily drop it again [09:56:32] moritzm: ok, I'll see if I can manage to add it ... [09:57:28] 6Operations, 6Discovery, 10Wikimedia-Logstash, 3Discovery-Search-Sprint, 7Elasticsearch: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697#2054645 (10Reedy) >>! In T122697#2054635, @MoritzMuehlenhoff wrote: > We can bundle tomorrow's logstash* kernel upgrade to also make the upgr... [09:57:36] gehel: I'll plan to upgrade logstash* tomorrow at 21:00 CET, if you want to be around and are available at that time (scheduled for the evening so that Bryan is around just in case) [09:57:36] moritzm: how do we coordinate for the update? You'll just upgrade ES as part of the kernel upgrade? Should I do it? Be available during that time? [09:58:03] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [09:58:08] I'll be there! [09:58:37] moritzm: is there a calendar meeting? A list of planned operations? Something similar? [09:58:38] ok, I'll ping you on IRC and we just sort out how to split the steps tomorrow, then [09:58:50] sounds good to me [09:59:04] (wrt calendar) not that I'm aware of [09:59:14] if you run into any problems with reprepro, ping me [09:59:33] https://wikitech.wikimedia.org/wiki/Reprepro has the basics, but it has ugly pitfalls [10:00:31] I already saw that the update config for elastic search seems to be only done for 1.6*. So the current 1.7.1 that we have has probably been uploaded "by hand". I will probably try to fix that. 
[10:01:27] especially the sudo remark at the end is a common gotcha (as it will fail to sign the repo in a non-obvious way) [10:01:55] (03PS1) 10Filippo Giunchedi: remove gdash record [dns] - 10https://gerrit.wikimedia.org/r/272692 (https://phabricator.wikimedia.org/T104365) [10:02:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] remove gdash record [dns] - 10https://gerrit.wikimedia.org/r/272692 (https://phabricator.wikimedia.org/T104365) (owner: 10Filippo Giunchedi) [10:03:07] 6Operations, 6Discovery, 10Wikimedia-Logstash, 3Discovery-Search-Sprint, 7Elasticsearch: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697#2054662 (10dcausse) [10:10:46] (03PS1) 10Filippo Giunchedi: uwsgi: don't declare uwsgi-startup service [puppet] - 10https://gerrit.wikimedia.org/r/272694 (https://phabricator.wikimedia.org/T127684) [10:12:37] (03PS2) 10Filippo Giunchedi: uwsgi: don't declare uwsgi-startup service [puppet] - 10https://gerrit.wikimedia.org/r/272694 (https://phabricator.wikimedia.org/T127684) [10:12:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] uwsgi: don't declare uwsgi-startup service [puppet] - 10https://gerrit.wikimedia.org/r/272694 (https://phabricator.wikimedia.org/T127684) (owner: 10Filippo Giunchedi) [10:16:28] 7Puppet, 13Patch-For-Review: Service_unit[uwsgi-startup] causes log churn - https://phabricator.wikimedia.org/T127684#2054675 (10fgiunchedi) 5Open>3Resolved ``` graphite1001:~$ pat Info: Retrieving plugin Notice: /File[/var/lib/puppet/lib]/mode: mode changed '0755' to '0775' Notice: /File[/var/lib/puppet/l... 
[10:23:41] (03PS1) 10Muehlenhoff: Backport upstream fix 062c189fee20c18fae5ac3716a7379143d64150e which deals with changes in OpenSSL's SSL_shutdown() function during SSL handshakes introduced in 1.0.2f (causing false positive critical errors) Bug: T126616 [software/nginx] - 10https://gerrit.wikimedia.org/r/272695 (https://phabricator.wikimedia.org/T126616) [10:35:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 696 [10:40:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 996 [10:45:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 3006415 Threads: 1 Questions: 22799383 Slow queries: 20072 Opens: 6620 Flush tables: 2 Open tables: 406 Queries per second avg: 7.583 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:46:00] !log Executing /opt/wmf-mariadb10/install on not-yet-production es2011-es2019 [T127330] [10:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:46:50] (03PS1) 10Muehlenhoff: Silence openssl shutdown messages [software/nginx] (wmf-1.9.4-1) - 10https://gerrit.wikimedia.org/r/272696 (https://phabricator.wikimedia.org/T126616) [10:49:31] (03CR) 10Muehlenhoff: [C: 032 V: 032] Silence openssl shutdown messages [software/nginx] (wmf-1.9.4-1) - 10https://gerrit.wikimedia.org/r/272696 (https://phabricator.wikimedia.org/T126616) (owner: 10Muehlenhoff) [10:53:33] 6Operations, 10Traffic, 13Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2054764 (10MoritzMuehlenhoff) The new packages have been installed on cp1008 and seem to work fine there, so I also copied these to carbon. I'll leave picking a proper candid... 
[11:02:11] 6Operations, 13Patch-For-Review, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2054813 (10ema) p:5Triage>3High [11:08:45] !log Starting MariaDB on es2011 (not yet in production) [ T127330 ] [11:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:11:10] RECOVERY - mysqld processes on es2011 is OK: PROCS OK: 1 process with command name mysqld [11:15:25] !log re-enabling puppet agent on scandium [11:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:21:53] 6Operations, 6Discovery, 10Wikimedia-Logstash, 3Discovery-Search-Sprint, 7Elasticsearch: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697#2054872 (10MoritzMuehlenhoff) >>! In T122697#2054645, @Reedy wrote: > Sounds good. I presume they'll be done at one at a time, rebooted, wait... [11:27:19] !log re-enabling puppet agent on mc2016 [11:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:28:30] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [11:28:52] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:29:25] taking a look [11:30:47] huge spike in network [11:30:47] yeah, high load on labstore1001 [11:30:49] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.683 second response time [11:31:14] <_joe_> again? [11:31:17] (03PS5) 10Giuseppe Lavagetto: role::memcached: add cross-dc IPsec for the various shards [puppet] - 10https://gerrit.wikimedia.org/r/271260 (https://phabricator.wikimedia.org/T126470) [11:31:18] <_joe_> sigh [11:31:33] <_joe_> godog: need assistance? 
[11:34:20] _joe_: not atm, I'll ping in case [11:34:20] there is also a 'CRITICAL - Expecting active but unit nfs-exports is failed' on labstore1001, not sure if it's related [11:34:20] <_joe_> ema: it probably is [11:34:20] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [24.0] [11:34:32] yeah looks like iowait spiked and network dropped [11:35:18] <_joe_> godog: yup [11:35:28] <_joe_> so what's up exactly? [11:36:10] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 790791 bytes in 6.583 second response time [11:36:54] <_joe_> someone did something? [11:37:43] not me, no [11:39:07] I'm assuming a heavy reader/writer, judging by the initial network spike https://ganglia.wikimedia.org/latest/?c=Labs%20NFS%20cluster%20eqiad&h=labstore1001.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [11:45:42] moritzm: I'm having a look at updating reprepro with elasticsearch 1.7.5. We already have an entry for 1.6. Should I just replace it? Or create a new one. [11:46:22] moritzm: also, it seems to me that the `updates` file is not managed by puppet. Is that correct? Should I just modify it locally? [11:46:35] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [11:47:20] 6Operations, 7Icinga: Icinga errors on neon: Contact group 'admin' specified in service XXX is not specified anywhere - https://phabricator.wikimedia.org/T127821#2054926 (10ema) [11:48:05] each suite/distro can only contain one version and since logstash* uses jessie and elastic* uses trusty we'll have to add it to both jessie-wikimedia and trusty-wikimedia [11:48:19] but where do you see 1.6? 
the repo currently contains 1.7.1 [11:48:36] under pool/thirdparty/e/elasticsearch [11:49:00] it seems that 1.7.1 was uploaded manually, but there is a config for 1.6 in /srv/wikimedia/conf/updates [11:49:49] I have no experience with reprepro, so I might just be trying to make things more complicated than they need to be ... [11:53:16] 6Operations, 10MediaWiki-extensions-CentralAuth, 10MobileFrontend, 10Traffic: S:UL requests with crazy encoding of mobile parameters - https://phabricator.wikimedia.org/T127823#2054955 (10BBlack) [11:56:08] 6Operations, 10RESTBase, 10hardware-requests: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#2054972 (10fgiunchedi) I was reviewing the size or codfw vs eqiad and 7x 1TB for each machine in codfw appears to be too big of a 'blast radius'. In the inter... [11:56:32] gehel: sure, you can also use the internal reprepro mechanism, the current elasticsearch config seems outdated, though, it seems their repo is now at elastic.co instead of elasticsearch.org and the apt key has also changed. the updates file is managed in puppet in modules/install_server/files/reprepro/updates [12:02:56] 6Operations, 10Monitoring, 7Graphite, 13Patch-For-Review: deprecate gdash - https://phabricator.wikimedia.org/T104365#2054997 (10fgiunchedi) 5Open>3Resolved static site removed, dns entry removed, resolving [12:03:48] 6Operations, 7Icinga: Icinga errors on neon: Contact group 'admin' specified in service XXX is not specified anywhere - https://phabricator.wikimedia.org/T127821#2054999 (10jcrespo) a:3jcrespo [12:04:27] ACKNOWLEDGEMENT - PyBal backends health check on lvs1008 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1002.eqiad.wmnet because of too many down! 
Brandon Black T112781 [12:04:54] 6Operations, 7Icinga: Icinga errors on neon: Contact group 'admin' specified in service XXX is not specified anywhere - https://phabricator.wikimedia.org/T127821#2054926 (10jcrespo) p:5Triage>3Unbreak! My fault, creating patch: https://gerrit.wikimedia.org/r/272478 [12:05:42] (03PS3) 10BBlack: decom cache_parsoid [puppet] - 10https://gerrit.wikimedia.org/r/272323 (https://phabricator.wikimedia.org/T110472) [12:08:05] (03PS1) 10Jcrespo: Fix typo s/admin/admins/ on MariaDB's icinga config [puppet] - 10https://gerrit.wikimedia.org/r/272702 (https://phabricator.wikimedia.org/T127821) [12:11:17] 6Operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=7 dev=sdh failed - https://phabricator.wikimedia.org/T127824#2055015 (10fgiunchedi) 3NEW [12:12:14] ACKNOWLEDGEMENT - RAID on ms-be2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Filippo Giunchedi https://phabricator.wikimedia.org/T127824 [12:12:14] ACKNOWLEDGEMENT - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi https://phabricator.wikimedia.org/T127824 [12:12:34] RECOVERY - RAID on ms-be2003 is OK: OK: optimal, 13 logical, 13 physical [12:12:51] (03CR) 10BBlack: [C: 032] decom cache_parsoid [puppet] - 10https://gerrit.wikimedia.org/r/272323 (https://phabricator.wikimedia.org/T110472) (owner: 10BBlack) [12:16:50] <_joe_> bblack: \o/ [12:17:18] yeah :) [12:17:21] (03CR) 10Ema: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/272702 (https://phabricator.wikimedia.org/T127821) (owner: 10Jcrespo) [12:18:49] 6Operations: Re-add intel-microcode - https://phabricator.wikimedia.org/T127825#2055036 (10MoritzMuehlenhoff) [12:19:42] (03PS2) 10Jcrespo: Fix typo s/admin/admins/ on MariaDB's icinga config [puppet] - 10https://gerrit.wikimedia.org/r/272702 (https://phabricator.wikimedia.org/T127821) [12:21:16] jynus: heheh I was about to say! 
[12:21:43] it is all joe's fault, he gave it +1 [12:21:52] it is always joe's fault :-) [12:22:43] :-P [12:22:59] (03CR) 10Jcrespo: [C: 032] Fix typo s/admin/admins/ on MariaDB's icinga config [puppet] - 10https://gerrit.wikimedia.org/r/272702 (https://phabricator.wikimedia.org/T127821) (owner: 10Jcrespo) [12:24:53] (03PS1) 10Giuseppe Lavagetto: Brandon knows what he's doing [software/conftool] - 10https://gerrit.wikimedia.org/r/272704 [12:24:57] <_joe_> bblack: ^^ [12:25:32] <_joe_> :D [12:25:32] (03CR) 10jenkins-bot: [V: 04-1] Brandon knows what he's doing [software/conftool] - 10https://gerrit.wikimedia.org/r/272704 (owner: 10Giuseppe Lavagetto) [12:25:39] <_joe_> oh jenkins, come on [12:25:55] <_joe_> you have no sense of humour [12:26:05] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [12:26:07] lol [12:26:18] lol [12:26:34] jenkins knows, jenkins -1's me all the time :P [12:26:35] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 20 connecting: cp1045_v4, cp1045_v6, cp1058_v4, cp1058_v6 [12:26:41] ^ that's me [12:26:50] <_joe_> yeah I guessed that [12:26:57] py27: commands failed ? [12:27:34] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: puppet fail [12:27:42] <_joe_> AssertionError: SystemExit not raised [12:27:52] netmon1001 is probably me too [12:28:05] I tend to ignore torrus in my rearrange-cache-clusters work, and it tends to break [12:28:43] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 24 ESP OK [12:30:28] i think we can do without torrus for that now right [12:30:35] assuming it's in graphite? 
[12:31:09] I think some people still use torrus for some things [12:31:27] it's just the cache-related part isn't really in use and the config for it keeps falling apart because it's out of sync with reality [12:31:34] like greenpeace [12:31:49] there was something about aggregated graphs raised the last time folks thought about phasing out, iirc [12:32:00] the only positive thing thy had to say about wmf us torrus :p [12:32:27] they, is [12:33:26] godog: about scap / gbp config, you might want to fill a new task copy pasting https://phabricator.wikimedia.org/T127762#2054924 :D [12:33:40] bblack: my impression is that i'm the only one still requiring torrus, and that's only for power aggregates [12:33:47] and I -think- we're now able to do that in librenms as well [12:34:06] i used to use torrus mainly for the caching aggregates, but that's been broken for so long and we have graphite now which is way more flexible [12:34:26] godog: scap has no debian/gbp.conf file and gbp ends up defaulting to 'master' [12:34:29] in the early years, the request rate aggregate graphs were my main dashboard for gauging site stability ;) [12:34:47] :) [12:35:31] mainly these: https://commons.wikimedia.org/wiki/Category:Wikimedia_statistics#/media/File:Reqstats-daily-2007-10-25.png [12:35:47] yaseo [12:35:48] wow [12:35:52] yup [12:36:00] yahoo seoul [12:36:54] someday, we'll get back to having stuff in asia :) [12:37:26] 6Operations, 6Services, 10Traffic, 13Patch-For-Review: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#2055094 (10BBlack) 5Open>3Resolved [12:37:28] 6Operations, 10ops-eqiad, 10Traffic, 13Patch-For-Review: eqiad cache cluster re-arrangements - https://phabricator.wikimedia.org/T125486#2055095 (10BBlack) [12:38:04] can we do it with librenms? [12:38:10] I guess I could write something a la ifdescr [12:38:16] !log installing cpio security updates [12:38:19] unless they added that kind of functionality already? 
[12:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:38:25] paravoid: that's what I was assuming [12:38:27] but not sure :) [12:39:09] apergos: we even had a few squids in Paris at one point [12:39:42] (03PS6) 10Giuseppe Lavagetto: role::memcached: add cross-dc IPsec for the various shards [puppet] - 10https://gerrit.wikimedia.org/r/271260 (https://phabricator.wikimedia.org/T126470) [12:40:17] hashar: that's well before my time :-) I remember the removal of the last references to yaseo though [12:41:48] apergos: https://wikitech.wikimedia.org/wiki/Obsolete:Lopar_cluster [12:43:25] 6Operations, 7Puppet, 7Documentation, 7Need-volunteer: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797#2055116 (10Peachey88) [12:44:59] hashar: yeah you are right, thanks and {{done}} [12:46:44] godog: we even have task to have the .deb generated automagically ( https://phabricator.wikimedia.org/T127741 ) [12:46:45] touched by hagger [12:47:19] oooh I like "and potentially any other debs we maintain" [12:47:35] is wikipedia broken? [12:47:38] https://en.wikipedia.org/wiki/Folk_Art_(album) [12:47:50] was just clicking random a bunch of times [12:47:51] I see it [12:48:00] well we have all the logic on CI to build .deb packages. I lack interest to pursue that pet project though [12:48:11] what are you seeing, aude? [12:48:26] aude: maybe the error page is part of the corpus of random pages ? [12:48:33] 503 error [12:48:41] Request from 10.20.0.112 via cp1052 cp1052 ([10.64.32.104]:3128), Varnish XID 429037239 [12:48:52] hm nope, not logged in or out, worksforme [12:49:01] (checked in different browsers to be sure) [12:49:22] could just be SPecial:Random randomly broken [12:49:26] aude: do you have the date / time ? 
[12:49:35] Error: 503, Service Unavailable at Tue, 23 Feb 2016 12:47:19 GMT [12:49:42] ie right now [12:50:26] not much luck in fatalmonitor :D [12:50:28] clicking random some more is ok [12:50:33] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:50:34] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:50:43] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [12:51:45] in general, the 5xx rate doesn't seem excessive right now [12:52:15] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:52:15] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [13:00:02] 6Operations, 7Icinga, 13Patch-For-Review: Icinga errors on neon: Contact group 'admin' specified in service XXX is not specified anywhere - https://phabricator.wikimedia.org/T127821#2055163 (10jcrespo) I think this is finally fixed. Will test the changes soon, but we can probably close this, with your permis... 
[13:01:30] (03PS1) 10BBlack: torrus: remove cache_parsoid refs [puppet] - 10https://gerrit.wikimedia.org/r/272710 [13:02:12] (03CR) 10BBlack: [C: 032 V: 032] torrus: remove cache_parsoid refs [puppet] - 10https://gerrit.wikimedia.org/r/272710 (owner: 10BBlack) [13:02:22] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [13:04:44] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [13:15:42] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Puppet has 1 failures [13:19:32] (03PS1) 10Muehlenhoff: Assign salt grains for debdeploy for salt masters [puppet] - 10https://gerrit.wikimedia.org/r/272713 [13:19:34] (03PS1) 10Muehlenhoff: Assign salt grain for zuul::merger role and use in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272714 [13:19:36] (03PS1) 10Muehlenhoff: Add salt grain for alsafi and use it in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272715 [13:19:38] (03PS1) 10Muehlenhoff: Add salt grains for main kafka brokers and use in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272716 [13:19:40] (03PS1) 10Muehlenhoff: Also split video scaler in eqiad/codfw-specific grains as already done for the other mediawiki grains used in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272717 [13:19:42] (03PS1) 10Muehlenhoff: Also use logstash grain for logstash100[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/272718 [13:19:44] (03PS1) 10Muehlenhoff: Also use the canary grain in the debdeploy group for redis/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/272719 [13:19:52] 6Operations: Gerritbot didn't notify about patch in task. 
- https://phabricator.wikimedia.org/T127830#2055177 (10Danny_B) [13:21:30] (03PS1) 10Gehel: Upgrade elastic search to 1.7.5 [puppet] - 10https://gerrit.wikimedia.org/r/272721 (https://phabricator.wikimedia.org/T122697) [13:24:53] 6Operations, 6Discovery, 10MediaWiki-Vendor, 10Wikimedia-Logstash, and 2 others: Upgrade ruflin/elastica to 2.3.1 - https://phabricator.wikimedia.org/T127831#2055202 (10Reedy) [13:25:35] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me (my earlier IRC comment about a new apt key was wrong, I had been looking at the wrong entry)" [puppet] - 10https://gerrit.wikimedia.org/r/272721 (https://phabricator.wikimedia.org/T122697) (owner: 10Gehel) [13:26:19] moritzm: I was trying (and failing) to add you as a reviewer, but seems you are faster than I am... [13:27:23] moritzm: who should I ask for additional review (and +2) ? [13:28:10] the wikitech/gerrit names are sometimes confusing, mine is different from my cluster username [13:28:58] yep, too easy to get lost... [13:29:51] do you have +2 in gerrit? you can either merge yourself then or solicit further reviews (but seems like overkill for that kind of change) [13:30:56] (03CR) 10ArielGlenn: [C: 031] Assign salt grains for debdeploy for salt masters [puppet] - 10https://gerrit.wikimedia.org/r/272713 (owner: 10Muehlenhoff) [13:31:15] I do have +2, I was wondering if it was appropriate to +2 yourself [13:32:56] gehel: it's possible, especially in trivial cases, but if you end up causing pain, people are going to say "wtf nobody reviewed that and you self +2'd" [13:33:27] (as in, self+2 with no review from others) [13:33:29] ok, I have moritzm who plus oned me, so I feel reasonably safe... 
[13:34:01] having others review and +1 at least raises the odds the change is ok, and spreads a small percentage of the blame for not noticing the breakage :) [13:34:18] but the bulk of the blame always lies on the +2, IMHO [13:36:59] bblack: in this case, I wrote the patch, if I +2 it at least it is clear whose fault it is if it breaks anything ... [13:37:26] gehel: and as for finding reviewers in general, that depends a lot on the change, but usually running "git log" on the file in question will be a good indication [13:37:44] (03CR) 10Gehel: [C: 032] Upgrade elastic search to 1.7.5 [puppet] - 10https://gerrit.wikimedia.org/r/272721 (https://phabricator.wikimedia.org/T122697) (owner: 10Gehel) [13:40:41] !log updating reprepro configuration on carbon.eqiad.wmnet to include elasticsearch 1.7 repo [13:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:59] gehel: did you merge to ops/puppet before? there's an additional manual merge step involved for it: https://wikitech.wikimedia.org/wiki/Puppet#Updating_operations.2Fpuppet_for_production_nodes [13:41:32] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [13:41:41] moritzm: yep, I did that (I had a crash course by _joe_ last week) [13:41:47] ok [13:42:57] (03PS1) 10BBlack: cache_misc: remove cp1056, cp1069 [puppet] - 10https://gerrit.wikimedia.org/r/272723 (https://phabricator.wikimedia.org/T125486) [13:45:29] !log elasticsearch 1.7.5 now available on apt repository [13:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:46:10] moritzm: thanks for the help! 
Looks like we are ready to roll for tomorrow [13:46:35] nice [13:53:43] 6Operations, 7Icinga, 13Patch-For-Review: Icinga errors on neon: Contact group 'admin' specified in service XXX is not specified anywhere - https://phabricator.wikimedia.org/T127821#2055262 (10ema) 5Open>3Resolved This bug is indeed fixed. Thanks! [13:54:12] (03CR) 10Gehel: logstash: fix top-scope var w/o namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/272675 (owner: 10Dzahn) [13:57:49] (03PS2) 10Muehlenhoff: Assign salt grains for debdeploy for salt masters [puppet] - 10https://gerrit.wikimedia.org/r/272713 [13:57:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for debdeploy for salt masters [puppet] - 10https://gerrit.wikimedia.org/r/272713 (owner: 10Muehlenhoff) [13:58:10] (03PS2) 10Ladsgroup: Move ORES settings to beta features part [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272526 [13:58:46] (03CR) 10Gehel: [C: 031] "Seems trivial and looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/272666 (owner: 10Dzahn) [14:00:18] (03PS6) 10Bmansurov: Enable the structured language overlay and increase the instrumentation rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271264 (https://phabricator.wikimedia.org/T123980) [14:00:20] (03PS1) 10Bmansurov: Reduce sampling rate for language switcher [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272724 (https://phabricator.wikimedia.org/T127212) [14:00:43] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures [14:01:19] (03CR) 10Ema: [C: 031] "+1" [puppet] - 10https://gerrit.wikimedia.org/r/272723 (https://phabricator.wikimedia.org/T125486) (owner: 10BBlack) [14:04:24] 6Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics: access for nikerabbit to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T127808#2055302 (10ema) p:5Triage>3Normal [14:05:14] 6Operations, 10Ops-Access-Requests, 10Analytics, 
10ContentTranslation-Analytics: access for nikerabbit to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T127808#2054571 (10ema) [14:05:17] 6Operations, 6Discovery, 10MediaWiki-Vendor, 10Wikimedia-Logstash, and 2 others: Upgrade ruflin/elastica to 2.3.1 - https://phabricator.wikimedia.org/T127831#2055305 (10Gehel) a:5Gehel>3dcausse [14:05:46] (03PS2) 10Muehlenhoff: Assign salt grain for zuul::merger role and use in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272714 [14:06:19] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grain for zuul::merger role and use in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272714 (owner: 10Muehlenhoff) [14:06:22] (03PS2) 10BBlack: cache_misc: remove cp1056, cp1069 [puppet] - 10https://gerrit.wikimedia.org/r/272723 (https://phabricator.wikimedia.org/T125486) [14:06:24] (03PS1) 10BBlack: cp10[56]1: upload->misc [puppet] - 10https://gerrit.wikimedia.org/r/272725 (https://phabricator.wikimedia.org/T125486) [14:06:26] (03PS1) 10BBlack: cache_misc: remove cp1057, cp1070 [puppet] - 10https://gerrit.wikimedia.org/r/272726 (https://phabricator.wikimedia.org/T125486) [14:07:25] poor grrrit-wm [14:08:23] (03CR) 10jenkins-bot: [V: 04-1] cache_misc: remove cp1057, cp1070 [puppet] - 10https://gerrit.wikimedia.org/r/272726 (https://phabricator.wikimedia.org/T125486) (owner: 10BBlack) [14:09:26] 6Operations, 10Swift, 13Patch-For-Review: swift upgrade plans - https://phabricator.wikimedia.org/T117972#1788316 (10faidon) So the uid issue we should definitely fix at some point, but I don't understand exactly why it's a blocker to a jessie upgrade (= reinstall). We can install with jessie and manually ad... 
[14:09:50] godog: ^ :) [14:10:09] (03PS2) 10BBlack: cache_misc: remove cp1057, cp1070 [puppet] - 10https://gerrit.wikimedia.org/r/272726 (https://phabricator.wikimedia.org/T125486) [14:13:02] (03Abandoned) 10BBlack: Brandon knows what he's doing [software/conftool] - 10https://gerrit.wikimedia.org/r/272704 (owner: 10Giuseppe Lavagetto) [14:13:35] (03PS4) 10BBlack: upload/misc VCL: remove t2-be bypass trick [puppet] - 10https://gerrit.wikimedia.org/r/271966 (https://phabricator.wikimedia.org/T127481) [14:13:55] (03PS1) 10Volans: Depool of es1016 for RAID perf comparison [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272727 (https://phabricator.wikimedia.org/T127330) [14:17:24] (03CR) 10Jcrespo: [C: 031] "Good choice, that way you avoid replication complication. Keep an eye on load for es2012 and es1018." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272727 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [14:20:11] (03CR) 10Volans: [C: 032] Depool of es1016 for RAID perf comparison [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272727 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [14:20:38] (03Merged) 10jenkins-bot: Depool of es1016 for RAID perf comparison [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272727 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [14:22:33] (03CR) 10Eevans: disable package-installed initscript (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/272612 (https://phabricator.wikimedia.org/T127365) (owner: 10Eevans) [14:23:15] (03PS2) 10Muehlenhoff: Add salt grain for alsafi and use it in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272715 [14:23:33] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grain for alsafi and use it in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272715 (owner: 10Muehlenhoff) [14:23:45] !log volans@tin Synchronized wmf-config/db-eqiad.php: Depool to compare RAID perfs with new es201* T127330 (duration: 01m 42s) [14:23:48] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:31] 6Operations, 10MediaWiki-Logging, 10Wikimedia-IRC-RC-Server, 10Wikimedia-Stream, and 2 others: Verify that logs, irc, rcstream changes can flow from codfw to eqiad - https://phabricator.wikimedia.org/T126472#2015334 (10Gehel) a:3Gehel [14:26:33] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:26:39] paravoid: thanks! I'll probably get to reply tomorrow [14:27:54] !log fixed apt config on es2011 (also affected by T125044) [14:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:28:46] (03CR) 10BBlack: [C: 032] upload/misc VCL: remove t2-be bypass trick [puppet] - 10https://gerrit.wikimedia.org/r/271966 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [14:28:52] (03PS3) 10Filippo Giunchedi: disable package-installed initscript [puppet] - 10https://gerrit.wikimedia.org/r/272612 (https://phabricator.wikimedia.org/T127365) (owner: 10Eevans) [14:28:55] (03PS5) 10BBlack: upload/misc VCL: remove t2-be bypass trick [puppet] - 10https://gerrit.wikimedia.org/r/271966 (https://phabricator.wikimedia.org/T127481) [14:29:03] (03CR) 10Filippo Giunchedi: disable package-installed initscript (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/272612 (https://phabricator.wikimedia.org/T127365) (owner: 10Eevans) [14:29:09] (03CR) 10BBlack: [V: 032] upload/misc VCL: remove t2-be bypass trick [puppet] - 10https://gerrit.wikimedia.org/r/271966 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [14:30:12] I think there may be some temporary cpNNNN icinga alerts as a result of the above [14:30:22] I'm trying to minimize them, but it's an unpredictable mess [14:36:33] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 1 failures [14:38:13] (03PS3) 10BBlack: cache_misc: remove cp1056, cp1069 [puppet] - 10https://gerrit.wikimedia.org/r/272723 
(https://phabricator.wikimedia.org/T125486) [14:38:23] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [14:38:31] 6Operations, 10MediaWiki-Logging, 10Wikimedia-IRC-RC-Server, 10Wikimedia-Stream, and 2 others: Verify that logs, irc, rcstream changes can flow from codfw to eqiad - https://phabricator.wikimedia.org/T126472#2055348 (10Gehel) My understanding of the flows is summarized in the following 2 diagrams: * https... [14:39:24] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: remove cp1056, cp1069 [puppet] - 10https://gerrit.wikimedia.org/r/272723 (https://phabricator.wikimedia.org/T125486) (owner: 10BBlack) [14:40:58] (03PS2) 10Muehlenhoff: Add salt grains for main kafka brokers and use in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272716 [14:42:14] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 1 failures [14:43:20] 6Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics: access for nikerabbit to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T127808#2054571 (10ema) Hi, it looks like @Nikerabbit is already a member of the statistics-users group. Perhaps something is wrong with the S... 
[14:43:43] !log depool cp1051 (upload eqiad) - T125486 [14:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:03] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Puppet has 1 failures [14:44:03] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:44:39] !log Run sysbench on es1016 (already depooled) T127330 [14:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:47] 6Operations, 10MediaWiki-Logging, 10Wikimedia-IRC-RC-Server, 10Wikimedia-Stream, and 2 others: Verify that logs, irc, rcstream changes can flow from codfw to eqiad - https://phabricator.wikimedia.org/T126472#2055354 (10Krenair) >>! In T126472#2055348, @Gehel wrote: > * unencrypted, might contain PII The M... [14:45:54] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:46:18] ema: ping? :D [14:46:33] !log testing database paging changes by stopping mysql slave on db1021 (it should page dbas) [14:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:46:46] Nikerabbit: pong! [14:47:17] ema: you saying that I should already be able to access stat1002? [14:47:43] Nikerabbit: yup, or at least you're already listed among the members of statistics-users [14:48:36] 6Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics: access for nikerabbit to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T127808#2054571 (10Krenair) statistics-users provides stat1003 access, not stat1002. It also does not provide access to the password needed to a... [14:49:24] okay, I can confirm access to stat1003 but not 1002 [14:49:41] statistics-users should not provide access to stat1002. 
[14:49:52] I don't think you need access to stat1002 anyway [14:50:13] oh, I thought so because the request mentioned https://phabricator.wikimedia.org/T122524 [14:50:32] Krenair: right https://phabricator.wikimedia.org/T122524#1916417 [14:51:08] ema, I would generally never trust past access requests [14:51:59] statistics-users is basically a subset of researchers [14:52:13] AFAIK it just does not provide you with the research password [14:53:27] so I am actually asking access to 'researches' (?) [14:54:31] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for main kafka brokers and use in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272716 (owner: 10Muehlenhoff) [14:54:38] PROBLEM - MariaDB Slave Lag: s2 on db1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 325.400345 seconds [14:55:01] ok, paging works, hoo, volans disagree? [14:55:07] got page too [14:55:12] and the logging here, too [14:55:13] ^ [14:55:20] email too [14:55:37] so far looks good [14:55:39] ema: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups [14:56:03] (03CR) 10Eevans: [C: 031] disable package-installed initscript [puppet] - 10https://gerrit.wikimedia.org/r/272612 (https://phabricator.wikimedia.org/T127365) (owner: 10Eevans) [14:56:52] (03PS3) 10BBlack: Remove sub vcl_foo from sub-includes [puppet] - 10https://gerrit.wikimedia.org/r/271986 (https://phabricator.wikimedia.org/T127481) [14:57:34] (03CR) 10BBlack: [C: 032 V: 032] Remove sub vcl_foo from sub-includes [puppet] - 10https://gerrit.wikimedia.org/r/271986 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [14:58:36] 6Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics: access for nikerabbit to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T127808#2054571 (10Ottomata) https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups [14:58:56] thanks Krenair and ottomata [14:59:19] Nikerabbit, what are you currently able to do with your 
statistics-users access? [14:59:43] (other than log in to stat1003) [14:59:52] Krenair: I don't know if anything else [15:00:11] (03PS3) 10BBlack: cache_misc: call recv_purge like others [puppet] - 10https://gerrit.wikimedia.org/r/271987 (https://phabricator.wikimedia.org/T127481) [15:00:18] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: call recv_purge like others [puppet] - 10https://gerrit.wikimedia.org/r/271987 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [15:00:35] (03PS4) 10Filippo Giunchedi: disable package-installed initscript [puppet] - 10https://gerrit.wikimedia.org/r/272612 (https://phabricator.wikimedia.org/T127365) (owner: 10Eevans) [15:00:39] statistics-users exists so folks can log into stat1003, and access log files stored on disk there [15:00:43] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] disable package-installed initscript [puppet] - 10https://gerrit.wikimedia.org/r/272612 (https://phabricator.wikimedia.org/T127365) (owner: 10Eevans) [15:00:46] i think some people used to use it to access research dbs [15:00:54] maybe with some other creds than the research one, but i'm not sure about that [15:01:59] ottomata: according to wikitech people in statistics-users should be able to access the SQL research slaves [15:02:22] yeah, i dunno why it says that. it is technically true, if they have a mysql user/pw that allows them access [15:02:41] usually people get researcher group access for that. it allows them to read a mysql.conf file that has the research user pw [15:03:13] Nikerabbit: it sounds like you want researcher access [15:03:28] ema: i'm editing that page [15:03:34] when the alarm clears, I will do the same test on codfw [15:03:44] if I remember correctly I got statistics-users for eventlogging originally [15:03:47] ottomata: cheers [15:08:01] Nikerabbit: for eventlogging log files? [15:08:02] ottomata: the sql table for events [15:08:02] what user did you use to connect to mysql? [15:08:02] and...how long ago was this? 
[15:08:03] ottomata: many years, ori did that I think :) [15:08:03] oh wait T122524 actually mentions stat1003 https://phabricator.wikimedia.org/T122524#1916417 [15:08:03] RECOVERY - MariaDB Slave Lag: s2 on db1021 is OK: OK slave_sql_lag Replication lag: 0.111194 seconds [15:08:04] ottomata: user is research_prod [15:08:04] yeah, Nikerabbit that sounds like the very wild west days of mysql db access [15:08:05] we are only in wild west now [15:08:05] jynus: paged now with the recovery [15:08:05] actually 30 secs ago [15:08:05] ja, so, Nikerabbit you want to be in the researchers group, and you will use it to access a file on stat1003 to connect to mysql dbs [15:08:05] so trying now on codfw [15:08:05] Nikerabbit: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_slaves [15:08:30] 6Operations, 10MediaWiki-Logging, 10Wikimedia-IRC-RC-Server, 10Wikimedia-Stream, and 2 others: Verify that logs, irc, rcstream changes can flow from codfw to eqiad - https://phabricator.wikimedia.org/T126472#2055382 (10Gehel) >>! In T126472#2055354, @Krenair wrote: >>>! In T126472#2055348, @Gehel wrote: >>... [15:09:02] (03PS2) 10Muehlenhoff: Also split video scaler in eqiad/codfw-specific grains as already done for the other mediawiki grains used in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272717 [15:10:08] (03CR) 10Muehlenhoff: [C: 032 V: 032] Also split video scaler in eqiad/codfw-specific grains as already done for the other mediawiki grains used in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272717 (owner: 10Muehlenhoff) [15:10:53] !log restarted pdns on labservices1001 [15:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:10] but I am confused about one thing: ema says I am in 'statistics-users' already, but I don't have access to stat1003 per https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups ? 
[15:12:54] ah, but I do, mystery solved [15:13:00] (03PS7) 10Giuseppe Lavagetto: role::memcached: add cross-dc IPsec for the various shards [puppet] - 10https://gerrit.wikimedia.org/r/271260 (https://phabricator.wikimedia.org/T126470) [15:13:26] 6Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics: access for nikerabbit to researches - https://phabricator.wikimedia.org/T127808#2055392 (10Nikerabbit) [15:14:14] 6Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics: access for nikerabbit to researches - https://phabricator.wikimedia.org/T127808#2054571 (10Nikerabbit) Ottomata summarised this well: > Nikerabbit you want to be in the researchers group, and you will use it to access a file o... [15:17:22] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exports is active [15:20:23] !log shutting down analytics (hadoop) cluster for CDH 5.5 upgrade [15:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:47] goddog: lmk when it's okay to upgrade restabse1008 [15:22:48] cmjohnson1: yup, machine halted, should be shutting down shortly [15:22:55] !log halt restbase1008 for cpu/mem upgrade [15:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:08] 6Operations, 10MediaWiki-Logging, 10Wikimedia-IRC-RC-Server, 10Wikimedia-Stream, and 2 others: Verify that logs, irc, rcstream changes can flow from codfw to eqiad - https://phabricator.wikimedia.org/T126472#2055414 (10Krenair) I'm not aware of any extra filters applied after MediaWiki, but if you think th... [15:25:36] cmjohnson1: should be down by now [15:25:48] okay [15:25:48] thx [15:28:26] (03PS1) 10Alex Monk: admin: Clarify researchers vs. 
statistics-users rights [puppet] - 10https://gerrit.wikimedia.org/r/272736 [15:29:19] (03PS1) 10Filippo Giunchedi: add restbase100[89]-[bc] instances [dns] - 10https://gerrit.wikimedia.org/r/272737 [15:29:44] salt-minion is not running on mw2173, any specific reasons for that? [15:30:15] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] add restbase100[89]-[bc] instances [dns] - 10https://gerrit.wikimedia.org/r/272737 (owner: 10Filippo Giunchedi) [15:30:31] (03PS1) 10Aude: Enable WikibaseClient on Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272738 (https://phabricator.wikimedia.org/T109675) [15:31:15] (03CR) 10Ottomata: admin: Clarify researchers vs. statistics-users rights (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/272736 (owner: 10Alex Monk) [15:31:38] (03PS2) 10Muehlenhoff: Also use logstash grain for logstash100[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/272718 [15:32:53] (03CR) 10Muehlenhoff: [C: 032 V: 032] Also use logstash grain for logstash100[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/272718 (owner: 10Muehlenhoff) [15:34:32] (03PS2) 10Alex Monk: admin: Clarify researchers vs. statistics-users rights [puppet] - 10https://gerrit.wikimedia.org/r/272736 [15:35:45] (03PS1) 10Filippo Giunchedi: cassandra: add restbase100[89]-b to seeds [puppet] - 10https://gerrit.wikimedia.org/r/272739 [15:35:59] (03PS2) 10Filippo Giunchedi: cassandra: add restbase100[89]-b to seeds [puppet] - 10https://gerrit.wikimedia.org/r/272739 [15:36:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase100[89]-b to seeds [puppet] - 10https://gerrit.wikimedia.org/r/272739 (owner: 10Filippo Giunchedi) [15:37:26] !log stopping db2035 replication to test no paging [15:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:32] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:39:42] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:40:08] (03PS2) 10Muehlenhoff: Also use the canary grain in the debdeploy group for redis/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/272719 [15:40:19] (03CR) 10Muehlenhoff: [C: 032 V: 032] Also use the canary grain in the debdeploy group for redis/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/272719 (owner: 10Muehlenhoff) [15:41:04] apergos: what can you tell me about labs instance 'dumps-stats' [15:41:17] specifically: why is it so big? And, can I migrate it and cause some downtime? [15:42:18] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This is blocked until all mc1* hosts are on jessie" [puppet] - 10https://gerrit.wikimedia.org/r/271260 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [15:43:45] 6Operations, 10Analytics-Wikistats, 7Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2055478 (10Krinkle) Fair enough, but I'm making the case we don't need statistics. This worked and was used and linked to. There's... 
[15:44:22] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [15:45:21] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [15:45:44] (03PS1) 10Muehlenhoff: Move role declaration to the top of the site.pp entry to fix Hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/272740 [15:45:48] godog restbase1008 is back [15:46:50] (03CR) 10Giuseppe Lavagetto: [C: 031] Move role declaration to the top of the site.pp entry to fix Hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/272740 (owner: 10Muehlenhoff) [15:48:01] 6Operations, 10Analytics-Wikistats, 7Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (10Krenair) I don't care about this redirect enough to upload the patch, but I imagine this is because stats.wikipedia.org... [15:48:05] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move role declaration to the top of the site.pp entry to fix Hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/272740 (owner: 10Muehlenhoff) [15:50:29] cmjohnson1: can't ssh to either mgmt or the machine yet, seeing the same? [15:51:37] try now [15:53:58] godog: good now [15:54:47] cmjohnson1: yup! checking and we can move to 1009 shortly [15:56:27] jdlrobson: can you look at your item for the next swat and possibly correct it? [15:59:50] thcipriani: when can we get an updated scap with the mtime fix? [15:59:59] !log shut restbase1009 for cpu/mem upgrade [16:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160223T1600). [16:00:04] bmansurov James_F Jdlrobson jzerebecki: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. 
Please be available during the process. [16:00:13] here [16:00:31] o/ [16:00:55] cmjohnson1: 1009 should be shutting now [16:01:11] FYI SWATers, I'm on tin right now [16:01:36] so don't start trying to merge/sync things quite yet [16:01:48] okay [16:01:57] 7Blocked-on-Operations, 10RESTBase: Separate metrics & logs between staging and production - https://phabricator.wikimedia.org/T103124#2055531 (10GWicke) [16:02:15] I can SWAT. bd808 ack, made a ticket for the new scap package that contained the mtime update yesterday. May already be done... [16:02:23] (03CR) 10GWicke: "This misses icinga and ganglia." [puppet] - 10https://gerrit.wikimedia.org/r/272536 (https://phabricator.wikimedia.org/T127747) (owner: 10Eevans) [16:02:24] I should be done in 5 minutes... [16:03:30] * James_F is here too. [16:05:32] thcipriani: ok, all clear on tin [16:05:49] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271264 (https://phabricator.wikimedia.org/T123980) (owner: 10Bmansurov) [16:05:56] bd808: cool, thanks [16:06:12] (03PS1) 10Glaisher: Enable uploader group at Simple English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272744 (https://phabricator.wikimedia.org/T127826) [16:06:33] thcipriani: is it okay if I add that to SWAT? ^ [16:07:00] Glaisher: sure, go for it [16:07:08] great, thanks [16:07:10] (03Merged) 10jenkins-bot: Enable the structured language overlay and increase the instrumentation rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271264 (https://phabricator.wikimedia.org/T123980) (owner: 10Bmansurov) [16:08:34] andrewbogott: no clue [16:08:51] apergos: really, the ‘dumps’ project isn’t you? [16:09:48] apergos: oh, it isn’t. 
Ok, sorry [16:10:11] I was just going to look to see if somehow I owned something I didn't know about [16:10:20] but for sure nothing with stats inthe name [16:10:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270346 (https://phabricator.wikimedia.org/T126801) (owner: 10Jforrester) [16:10:45] Whee. [16:10:53] I have some salt related stuff over there yo ucan ask me about (especially since I've likely still got puppet disabled on those instances) [16:11:06] (03Merged) 10jenkins-bot: VisualEditor: Switch to Single Edit Tab mode on Hungarian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270346 (https://phabricator.wikimedia.org/T126801) (owner: 10Jforrester) [16:11:30] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable the structured language overlay and increase the instrumentation rate [[gerrit:271264]] (duration: 01m 32s) [16:11:32] ^ bmansurov check please [16:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:47] ah, Nemo. I see [16:12:13] thcipriani: looking [16:14:00] thcipriani: actually, my change will be visible on thursday when the dependency is pushed to wikipedias [16:14:19] bmansurov: kk, although I'm not sure that train is on schedule this week, FYI. [16:14:48] ok [16:14:56] jzerebecki: whew, that's a big update :) [16:15:13] (03PS2) 10Ottomata: Make MySQL instance on analytics1015 the master [puppet] - 10https://gerrit.wikimedia.org/r/272606 (https://phabricator.wikimedia.org/T110090) [16:15:14] thcipriani: needs a full scap :( [16:15:26] indeed. 
[16:15:55] things that need a full scap really shouldn't be in swat [16:15:58] thcipriani: so if you want to not do that I would do it myself during the train slot [16:16:00] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: VisualEditor: Switch to Single Edit Tab mode on Hungarian Wikipedia [[gerrit:270346]] (duration: 01m 32s) [16:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:06] ^ James_F check please [16:16:07] we are abusing sawt pretty badly these days [16:16:12] Doing so. [16:16:14] jzerebecki: If you're scaping today, make sure to also pick up aude's WikimediaMessage changes [16:16:19] (03PS3) 10Ottomata: Make MySQL instance on analytics1015 the master [puppet] - 10https://gerrit.wikimedia.org/r/272606 (https://phabricator.wikimedia.org/T110090) [16:16:30] thcipriani: ok please skip my patch [16:16:37] without the train, SWAT abuse happens :( [16:16:44] jzerebecki: ack, thanks. [16:16:56] godog: all good with restabse1009. We'll have to fix 1007 [16:17:00] (03CR) 10Ottomata: [C: 032 V: 032] Make MySQL instance on analytics1015 the master [puppet] - 10https://gerrit.wikimedia.org/r/272606 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata) [16:17:16] cmjohnson1: ack, checking 1009 and then we'll move to 1007 [16:17:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272744 (https://phabricator.wikimedia.org/T127826) (owner: 10Glaisher) [16:17:29] thcipriani: Yup, seems to work well. [16:17:37] (03CR) 10Giuseppe Lavagetto: Rationalize services definitions for labs too. 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 (owner: 10Giuseppe Lavagetto) [16:17:44] James_F: thanks for checking [16:18:08] (03Merged) 10jenkins-bot: Enable uploader group at Simple English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272744 (https://phabricator.wikimedia.org/T127826) (owner: 10Glaisher) [16:19:04] (Yay.) [16:19:37] :D [16:19:56] (03PS5) 10Giuseppe Lavagetto: Rationalize services definitions for labs too. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 [16:20:22] (03PS4) 10Ottomata: Move Hive and Oozie to analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/272605 (https://phabricator.wikimedia.org/T110090) [16:20:57] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable uploader group at Simple English Wikipedia [[gerrit:272744]] (duration: 01m 32s) [16:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:21:03] ^ Glaisher check please [16:21:10] (03PS5) 10Ottomata: Move Hive and Oozie to analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/272605 (https://phabricator.wikimedia.org/T110090) [16:21:13] doing [16:22:07] (03CR) 10Hoo man: [C: 031] Enable WikibaseClient on Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272738 (https://phabricator.wikimedia.org/T109675) (owner: 10Aude) [16:22:27] 6Operations, 10ops-eqiad: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2055632 (10Cmjohnson) The card will not work for the server. The controller card is too large for the space and will not fit. [16:22:35] thcipriani: looks good, thanks again [16:22:50] Glaisher: awesome. 
Thanks for checking :) [16:23:00] !log depool restbase1007 and shut for ram upgrade [16:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:24] (03CR) 10Ottomata: [C: 032] Move Hive and Oozie to analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/272605 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata) [16:24:14] (03CR) 10Giuseppe Lavagetto: Define service entries for InitialiseSettings (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [16:24:48] cmjohnson1: restbase1007 should be shutting shortly [16:24:56] great thx [16:28:48] (03PS7) 10Giuseppe Lavagetto: Reduce poolcounter configuration complexity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266511 (https://phabricator.wikimedia.org/T114273) [16:28:50] (03PS16) 10Giuseppe Lavagetto: Define service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [16:29:12] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:30:33] (03CR) 10BryanDavis: "This looks like a cut-n-paste problem from when the scripts module was split out of logstash::output::elasticsearch when it was changed to" [puppet] - 10https://gerrit.wikimedia.org/r/272675 (owner: 10Dzahn) [16:31:43] (03PS1) 10Ottomata: spark-core now depends on flume, mirror it from CDH to our apt [puppet] - 10https://gerrit.wikimedia.org/r/272753 (https://phabricator.wikimedia.org/T119646) [16:32:15] (03CR) 10Ottomata: [C: 032 V: 032] spark-core now depends on flume, mirror it from CDH to our apt [puppet] - 10https://gerrit.wikimedia.org/r/272753 (https://phabricator.wikimedia.org/T119646) (owner: 10Ottomata) [16:34:22] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:35:42] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: puppet fail [16:35:47] (03PS1) 10Ottomata: Use proper flume package naems in reprepro updates [puppet] - 10https://gerrit.wikimedia.org/r/272754 (https://phabricator.wikimedia.org/T119646) [16:36:06] (03CR) 10Ottomata: [C: 032 V: 032] Use proper flume package naems in reprepro updates [puppet] - 10https://gerrit.wikimedia.org/r/272754 (https://phabricator.wikimedia.org/T119646) (owner: 10Ottomata) [16:36:21] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:37:12] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [16:37:31] godog: restbase1007 is up [16:38:22] (03PS9) 10Andrew Bogott: Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - 10https://gerrit.wikimedia.org/r/271797 (https://phabricator.wikimedia.org/T124680) [16:38:47] cmjohnson1: thanks, checking [16:41:19] 6Operations, 10MediaWiki-Logging, 10Wikimedia-IRC-RC-Server, 10Wikimedia-Stream, and 2 others: Verify that logs, irc, rcstream changes can flow from codfw to eqiad - https://phabricator.wikimedia.org/T126472#2055695 (10Gehel) @Krenair sorry for the subscription, I got confused by the multiple nicks. I rev... 
[16:46:50] (03CR) 10Andrew Bogott: [C: 032] Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - 10https://gerrit.wikimedia.org/r/271797 (https://phabricator.wikimedia.org/T124680) (owner: 10Andrew Bogott) [16:49:05] !log grow raid0 on restbase1009 T119935 [16:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:19] (03PS3) 10Jcrespo: Small formatting fixes for replication lag check [puppet] - 10https://gerrit.wikimedia.org/r/271815 (https://phabricator.wikimedia.org/T114752) [16:53:42] (03CR) 10Jcrespo: [C: 032] Small formatting fixes for replication lag check [puppet] - 10https://gerrit.wikimedia.org/r/271815 (https://phabricator.wikimedia.org/T114752) (owner: 10Jcrespo) [16:54:42] (03PS1) 10Ottomata: Include analytics_cluster* database classes on analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/272759 [16:54:55] (03PS2) 10Ottomata: Include analytics_cluster* database classes on analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/272759 [16:55:25] (03CR) 10Ottomata: [C: 032 V: 032] Include analytics_cluster* database classes on analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/272759 (owner: 10Ottomata) [16:58:33] !log stopping again db2035 replication [16:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:59:45] yes, gerrit patch does exactly what it should do, for a change [17:00:05] _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160223T1700). 
[17:00:32] PROBLEM - designate-pool-manager process on labservices1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-pool-manager [17:01:02] <_joe_> oh the puppetswat is empty [17:01:08] <_joe_> niiiice [17:01:10] 6Operations, 10hardware-requests, 13Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2055818 (10fgiunchedi) update: restbase100[789] have ram/cpu upgraded and in line with codfw. restbase1009 is growing its raid0 and restbase1008-b will... [17:01:28] 6Operations, 10hardware-requests, 13Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2055819 (10Eevans) @fgiunchedi During the last expansion (1008), we saw elevated 99p latencies. https://grafana-admin.wikimedia.org/dashboard/db/restba... [17:02:06] !log rebooting analytics1027 for kernel upgrae [17:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:03:21] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:03:49] (03PS1) 10Volans: Repool es1016 after RAID perf test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272761 (https://phabricator.wikimedia.org/T127330) [17:03:57] 6Operations, 10ops-eqiad: db1021 degraded RAID - https://phabricator.wikimedia.org/T126451#2055825 (10Cmjohnson) It is a possibility...i used a "used disk" Do you want to replace it with another? 
[17:04:13] (03PS2) 10Volans: Repool es1016 after RAID perf test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272761 (https://phabricator.wikimedia.org/T127330) [17:04:45] (03PS1) 10Andrew Bogott: Revert "Updates to designate/mdns/pdns setup for Labs internal dns" [puppet] - 10https://gerrit.wikimedia.org/r/272764 [17:04:51] 6Operations, 10hardware-requests, 13Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2055846 (10fgiunchedi) @eevans rebuild is currently throttled at 6MB/s max and I'm seeing around 500ms p99, will keep an eye on that and if it increase... [17:04:58] (03PS2) 10Andrew Bogott: Revert "Updates to designate/mdns/pdns setup for Labs internal dns" [puppet] - 10https://gerrit.wikimedia.org/r/272764 [17:06:19] (03CR) 10Andrew Bogott: [C: 032 V: 032] Revert "Updates to designate/mdns/pdns setup for Labs internal dns" [puppet] - 10https://gerrit.wikimedia.org/r/272764 (owner: 10Andrew Bogott) [17:08:22] RECOVERY - designate-pool-manager process on labservices1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/designate-pool-manager [17:10:35] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team: deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb - https://phabricator.wikimedia.org/T127845#2055880 (10thcipriani) [17:14:19] 6Operations, 10ops-eqiad: db1021 degraded RAID - https://phabricator.wikimedia.org/T126451#2055904 (10jcrespo) a:3jcrespo Not really- I reimagined. I would understand the disk braking, but the controler should have managed it, not create corruption at application level. Let me check the RAID status, to see... [17:14:36] (03PS4) 10Mattflaschen: Exclude fishbowl and add computed dblist for Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250460 [17:15:01] (03CR) 10Mattflaschen: "Rebased to exclude labswiki and labtestwiki as master does." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/250460 (owner: 10Mattflaschen) [17:16:33] !log depool cp1061 (eqiad upload) - T125486 [17:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:18:23] btw there was a considerable (but still small) jump in 5xx start around 16:00 [17:19:43] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1456237170805&to=1456247970805&var-site=eqiad&var-cache_type=text&var-status_type=5 [17:20:44] basically a doubling of the rate, but the absolute rate is still pretty small [17:23:23] PROBLEM - NTP on restbase1008 is CRITICAL: NTP CRITICAL: Offset unknown [17:23:32] 6Operations, 10hardware-requests, 13Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2055923 (10Eevans) >>! In T119935#2055846, @fgiunchedi wrote: > @eevans rebuild is currently throttled at 6MB/s max and I'm seeing around 500ms p99, wi... [17:25:25] (03PS1) 10Andrew Bogott: Change master=no for labs pdns [puppet] - 10https://gerrit.wikimedia.org/r/272769 (https://phabricator.wikimedia.org/T124680) [17:25:27] (03PS1) 10Andrew Bogott: Remove a couple of obsolete designate settings that no longer have any effect. [puppet] - 10https://gerrit.wikimedia.org/r/272770 [17:25:29] (03PS1) 10Andrew Bogott: Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - 10https://gerrit.wikimedia.org/r/272771 (https://phabricator.wikimedia.org/T124680) [17:26:07] 6Operations, 10ops-eqiad: db1021 degraded RAID - https://phabricator.wikimedia.org/T126451#2055926 (10jcrespo) a:5jcrespo>3Cmjohnson Could you "replace" these 2 disks which are marked as critical (it is ok to waste two old disks on this, it will be decommissioned soon- replacement is on its way), but I nee... 
[17:26:16] paravoid: I'm trying to understand how traffic flows from mediawiki to IRC (context: DC switchover) [17:26:16] 6Operations, 10hardware-requests, 13Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2055931 (10Eevans) >>! In T119935#2055818, @fgiunchedi wrote: > update: restbase100[789] have ram/cpu upgraded and in line with codfw. restbase1009 is... [17:26:32] (03PS1) 10Filippo Giunchedi: cassandra: add restbase1008-b instance [puppet] - 10https://gerrit.wikimedia.org/r/272772 [17:27:05] (03PS2) 10Filippo Giunchedi: cassandra: add restbase1008-b instance [puppet] - 10https://gerrit.wikimedia.org/r/272772 [17:27:14] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase1008-b instance [puppet] - 10https://gerrit.wikimedia.org/r/272772 (owner: 10Filippo Giunchedi) [17:27:41] (03PS2) 10Andrew Bogott: Change master=no for labs pdns [puppet] - 10https://gerrit.wikimedia.org/r/272769 (https://phabricator.wikimedia.org/T124680) [17:27:43] (03CR) 10Jcrespo: [C: 031] Repool es1016 after RAID perf test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272761 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [17:28:53] gehel: udp, iirc [17:28:58] paravoid: in particular, the UDP flow from mediawiki to udpmxircecho. It is now directed to 208.80.154.160. Do you know if the flow from codfw goes through the wild internet? How can I check if we can pass that flow through a more controlled env ? [17:29:18] I'm in a meeting, talk to you in 30mins? 
[17:29:27] sure [17:29:30] !log bootstrap restbase1008-b T119935 [17:29:32] communication between our two datacenters is not over the internet, we have private links between them [17:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:29:47] between all of our DCs in fact [17:30:00] gehel: in CommonSettings.php, there is code to configure a UDPRCFeedEngine: https://github.com/iSCInc/mediawiki-config/blob/da96f0db2f5c48c43f5f34a9df61b8252c3c1b46/wmf-config/CommonSettings.php#L2983-L2988 [17:30:03] (03CR) 10Volans: [C: 032] Repool es1016 after RAID perf test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272761 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [17:30:47] paravoid: I wanted to make sure that network flow is routed through our private links and that it is OK to keep UDP traffic flowing through those links (what do we expect in terms of packet loss). [17:30:49] $wmgRC2UDPAddress is configured in InitialiseSettings.php to be 208.80.154.160 [17:31:05] the feeds are public [17:31:11] ori: Yep, I read that part. [17:31:16] (03Merged) 10jenkins-bot: Repool es1016 after RAID perf test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272761 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [17:31:30] ok, meeting finished early [17:31:42] paravoid: that was a fast 30' ! [17:31:49] heh [17:32:37] (03CR) 10Andrew Bogott: [C: 032] "I've tested this with a hotfix and it doesn't break anything -- it was also recommended by Kiall @ designate." [puppet] - 10https://gerrit.wikimedia.org/r/272769 (https://phabricator.wikimedia.org/T124680) (owner: 10Andrew Bogott) [17:33:41] gehel: the rcfeed implementation has independently-configurable formatters and transports. So you can send the IRC-bound stream via redis, if you adapted udpmxircecho to use redis pub/sub instead of udp [17:33:41] so ... at the moment, this UDP flow is going to 208.80.154.160 from both DCs.
That should probably be routed from codfw to eqiad through our private link. How can I check? [17:34:01] !log volans@tin Synchronized wmf-config/db-eqiad.php: Repool es1016 T127330 (duration: 01m 39s) [17:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:34:31] ori: Ok, that could be an option. I'm still trying to see if we need to change anything or if the current configuration is sufficient. [17:35:35] right. I'm really glad you're looking into this, by the way. This is one of those seemingly superficial details that end up being critical; vandalism gets out of hand when the IRC feed is disrupted [17:35:47] We have UDP traffic flowing from codfw to eqiad. There's no PII in it as far as I can tell. So there is just a question of reliability of the link (as it is UDP and it seems we really do not want to lose this update traffic). [17:37:03] Further down the timeline, we probably want to have redundancy in this stream, but that seems to be out of scope at the moment. [17:37:38] gehel: you could tcpdump, but obviously mediawiki instances in codfw are not active [17:37:44] however, I don't see why it would not work [17:37:49] (03PS2) 10BBlack: cp10[56]1: upload->misc [puppet] - 10https://gerrit.wikimedia.org/r/272725 (https://phabricator.wikimedia.org/T125486) [17:40:10] paravoid: I'm brand new here and I have no idea how our network is set up, so I might be asking stupid questions... [17:40:28] no worries [17:40:37] (03CR) 10BBlack: [C: 032] cp10[56]1: upload->misc [puppet] - 10https://gerrit.wikimedia.org/r/272725 (https://phabricator.wikimedia.org/T125486) (owner: 10BBlack) [17:40:56] paravoid: I expect that not all traffic is open to 208.80.154.160 (there is no authentication on the udp echo to irc) [17:41:05] who's our resident ldap pro?
[17:41:35] paravoid: so I'm expecting some firewalling, so I want to make sure the path is opened from codfw as well [17:42:17] paravoid: _joe_ was wondering if packet loss is low enough that we can keep UDP for inter DC changelog publishing [17:42:41] there is no packet loss between the two DCs [17:42:47] (under normal circumstances) [17:43:07] so, regarding "all traffic open" [17:43:14] you can start from site.pp [17:43:28] # irc.wikimedia.org [17:43:28] node 'argon.wikimedia.org' { role mw_rc_irc [17:43:29] but that doesn't change the fact that UDP is unreliable. there will be no real indicator that you lost something if e.g. there's a CPU/net spike on a related host and a UDP packet falls off of a buffer [17:43:46] which points you to manifests/role/mw_rc_irc.pp [17:43:55] which has this ferm (= iptables management tool) rule: [17:43:58] # IRC RecentChanges bot - gets updates from appservers [17:43:59] ferm::service { 'udpmxircecho': [17:44:01] srange => '$MW_APPSERVER_NETWORKS', [17:44:20] $MW_APPSERVER_NETWORKS is defined under modules/base/templates/firewall/defs.erb [17:44:33] to puppet's $mw_appserver_networks [17:44:38] which in turn is defined under network.pp [17:44:56] and indeed includes all of codfw as well [17:46:03] Kool, so we define firewalling in puppet as well! [17:47:19] yes [17:47:34] typically everything in a machine is puppetized [17:48:10] paravoid: https://gerrit.wikimedia.org/r/#/c/272670/ [17:48:14] Oh, that's host firewalling. Do we have something at network level as well? [17:48:14] decided against a 301 [17:49:04] (03CR) 10Catrope: [C: 031] Exclude fishbowl and add computed dblist for Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250460 (owner: 10Mattflaschen) [17:49:24] last question: the assumption was that UDP is reliable enough inside a DC to publish change logs. Does this assumption change in the multi DC scenario? 
[17:49:35] gehel: not typically no, with some exceptions (all of analytics is one) [17:50:33] (03PS2) 10Andrew Bogott: Remove a couple of obsolete designate settings that no longer have any effect. [puppet] - 10https://gerrit.wikimedia.org/r/272770 [17:51:29] paravoid: thanks a lot for the help [17:51:41] * gehel is now going to read those slides on networking [17:51:57] * gehel has just learn they exist [17:52:32] (03CR) 10Andrew Bogott: [C: 032] "Tested and deemed harmless" [puppet] - 10https://gerrit.wikimedia.org/r/272770 (owner: 10Andrew Bogott) [17:58:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] [17:59:56] ^ that's real [18:00:05] yurik gwicke cscott arlolra subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160223T1800). Please do the needful. [18:00:05] (03PS1) 10Jforrester: Follow-up 0fa358a: Set the correct value for SET secondary default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272778 [18:00:15] (03PS1) 10Ottomata: Add accidentally uncommitted hiera cdh users.yaml [puppet] - 10https://gerrit.wikimedia.org/r/272779 (https://phabricator.wikimedia.org/T119646) [18:00:21] what just happened circa 17:50? [18:00:34] (03CR) 10Ottomata: [C: 032 V: 032] Add accidentally uncommitted hiera cdh users.yaml [puppet] - 10https://gerrit.wikimedia.org/r/272779 (https://phabricator.wikimedia.org/T119646) (owner: 10Ottomata) [18:00:44] greg-g: Hey, sorry, SWAT follow-up – can we push out https://gerrit.wikimedia.org/r/272778 right now? CC Krenair. 
[18:00:45] https://grafana-admin.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1456239635609&to=1456250435609&var-site=eqiad&var-cache_type=text&var-status_type=5 [18:00:56] I see nothing that happened in here anyways [18:01:05] ^ we had the doubling of the usual background 5xx rate back at 16:00, now it's jumped up quite a bit more at 17:50 [18:01:38] James_F: yes [18:01:41] that seems important [18:01:49] Thanks. Krenair, can you do that? [18:02:01] greg-g: Yeah, who knew, true !== false. [18:02:05] :) [18:02:13] * greg-g goes afk [18:02:16] not deploying [18:02:26] (03CR) 10Alex Monk: [C: 032] Follow-up 0fa358a: Set the correct value for SET secondary default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272778 (owner: 10Jforrester) [18:03:02] (03Merged) 10jenkins-bot: Follow-up 0fa358a: Set the correct value for SET secondary default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272778 (owner: 10Jforrester) [18:04:27] (03PS2) 10Andrew Bogott: Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - 10https://gerrit.wikimedia.org/r/272771 (https://phabricator.wikimedia.org/T124680) [18:05:13] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/272778/ (duration: 01m 31s) [18:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:05:33] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [18:05:33] Thank you Krenair. [18:05:35] greg-g: all clear. [18:05:43] PROBLEM - cassandra-b CQL 10.64.32.195:9042 on restbase1008 is CRITICAL: Connection refused [18:05:52] RECOVERY - NTP on restbase1008 is OK: NTP OK: Offset -0.001382231712 secs [18:11:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:11:37] and then it's gone [18:11:46] what causes a blip like that?
[18:12:04] some application service problem that mysteriosly disappeared [18:12:12] 6Operations, 10Traffic, 10domains, 13Patch-For-Review: wikiknihy.cz - transfer to Wikimedia Czech Republic? - https://phabricator.wikimedia.org/T127573#2056159 (10Aklapper) [18:12:15] 6Operations, 10Huggle, 10Traffic, 7HTTPS: Huggle 2 fails on HTTP used when HTTPS expected - https://phabricator.wikimedia.org/T126357#2056160 (10Aklapper) [18:12:17] 6Operations, 10DNS, 10Traffic, 13Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#2056162 (10Aklapper) [18:12:20] 6Operations, 10Traffic, 7HTTPS: ssl certificate replacement: dumps.wikimedia.org (expires 2016-02-26) - https://phabricator.wikimedia.org/T122321#2056164 (10Aklapper) [18:12:24] possibly that cassandra-b thing? [18:12:26] 6Operations, 6Labs, 10Labs-Infrastructure, 10Traffic, 7HTTPS: add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486#2056168 (10Aklapper) [18:12:27] 6Operations, 10Traffic, 7HTTPS, 7Varnish: Mark cookies from varnish as secure - https://phabricator.wikimedia.org/T119576#2056167 (10Aklapper) [18:12:45] 6Operations, 10DNS, 10Mail, 10Traffic: Set up role accounts and feedback loops (FBL) with all providers - https://phabricator.wikimedia.org/T106664#2056184 (10Aklapper) [18:12:45] 6Operations, 10Traffic, 10Wikimedia-Blog, 7HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2056186 (10Aklapper) [18:12:47] heh I'm guessing Aklapper is working on ops phab tag reorg [18:12:48] 6Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration: Redirect yue.wikipedia.org to zh-yue.wikipedia.org - https://phabricator.wikimedia.org/T105999#2056185 (10Aklapper) [18:12:51] 6Operations, 10Traffic, 10Wikimedia-Blog, 7HTTPS: make blog links from wmfwiki front page use HTTPS links - https://phabricator.wikimedia.org/T104728#2056190 (10Aklapper) [18:13:00] guess so :-) [18:13:02] 6Operations, 
10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS, 7JavaScript: Use Upgrade Insecure Requests on Wikimedia - https://phabricator.wikimedia.org/T101002#2056199 (10Aklapper) [18:13:03] 6Operations, 10DNS, 10Traffic, 13Patch-For-Review: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well - https://phabricator.wikimedia.org/T97051#2056206 (10Aklapper) [18:13:05] is robh away? [18:13:06] 6Operations, 10DNS, 10Internet-Archive, 10Traffic, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#2056205 (10Aklapper) [18:13:20] apergos: nope [18:13:20] what;s the cassandra-b thing? [18:13:30] apergos: whats up? [18:13:31] 18:05 < icinga-wm> PROBLEM - cassandra-b CQL 10.64.32.195:9042 on restbase1008 is CRITICAL: Connection refused [18:13:34] oh. um, you wanna shove out the new cert for dataset1001 [18:13:35] ? [18:13:40] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic, 7HTTPS: Make default interwiki map links protocol-relative - https://phabricator.wikimedia.org/T33327#2056243 (10Aklapper) [18:13:42] 6Operations, 10MediaWiki-extensions-CodeReview, 10Traffic, 7HTTPS: Provide HTTPS links in CodeReview emails - https://phabricator.wikimedia.org/T31008#2056244 (10Aklapper) [18:13:46] 6Operations, 6Project-Admins, 3DevRel-February-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2056247 (10Aklapper) >>! In T119944#2044189, @faidon wrote: >> **TO SORT OUT:** >> * For #DC-Ops, tag along all of the #ops-$site + #procurement tagged tasks" -... [18:13:51] it expired today eh? 
i can push right now yep [18:13:53] https://phabricator.wikimedia.org/T122321#2056164 I guess it expires very soon [18:13:56] it lines up with about the end mark of a wide 5xx spike [18:14:00] 26th, ahh [18:14:11] apergos: yea if you are about for the next few i can push now actually [18:14:15] I am [18:14:19] so yep fire away [18:14:20] well not really, but close enough to be suspicious, if something took a while to die and depool in some sense [18:15:02] ok, plausible [18:15:04] hm [18:15:13] !log disabled puppet on dataset1001 [18:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:16:46] 6Operations, 10RESTBase, 10hardware-requests: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#1998563 (10Eevans) I've put together the following for a proposed sequence of tasks, and an estimation of the time required for each. Hopefully this will be... [18:17:26] (03PS3) 10RobH: new dumps.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/260783 [18:17:40] (03PS1) 10Joal: Update hiera heapsize for hive and oozie servers [puppet] - 10https://gerrit.wikimedia.org/r/272783 (https://phabricator.wikimedia.org/T110090) [18:17:50] ottomata: --^ [18:18:40] 6Operations, 6Project-Admins, 3DevRel-February-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2056274 (10BBlack) @Aklapper - relatedly, since we're killing the #Varnish tag, we should probably batch-edit them to include #Traffic before killing it (so that... [18:20:01] (03CR) 10RobH: [C: 032] new dumps.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/260783 (owner: 10RobH) [18:20:44] 6Operations, 10Huggle, 10Traffic, 7HTTPS: Huggle 2 fails on HTTP used when HTTPS expected - https://phabricator.wikimedia.org/T126357#2056289 (10DVdm) By the way, I haven't commented out the aborts. 
I have trapped the errors that are mere warnings about HTTPS. Corrected a few bugs too. Works pretty good now. [18:21:04] joal: also metastore.yaml [18:21:05] heapsize [18:21:05] apergos: ok, reenabling puppet and kicking it to refresh, it should simply refresh apache and work [18:21:11] or i may have to poke it to rehup but doing now [18:21:18] nginx right? [18:22:21] ottomata: hm, how much ram on an1015? [18:22:41] 48G [18:22:56] joal: recommendations for 6G hive server was 10G metastore [18:22:59] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: rv (duration: 01m 32s) [18:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:07] ok, so it should fit [18:23:12] ja [18:23:14] ottomata: yeah was looking onto that [18:24:21] (03PS3) 10Andrew Bogott: Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - 10https://gerrit.wikimedia.org/r/272771 (https://phabricator.wikimedia.org/T124680) [18:24:23] (03PS1) 10Andrew Bogott: designate: Set pool_target master to a public ip rather than 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/272784 [18:25:08] (03PS2) 10Joal: Update hiera heapsize for hive and oozie servers [puppet] - 10https://gerrit.wikimedia.org/r/272783 (https://phabricator.wikimedia.org/T110090) [18:25:27] apergos: success, its updated [18:25:30] sweet [18:25:34] had to restart nginx [18:25:40] okey dokey [18:25:51] i'll close out the assorted tasks. it already has an entry for next years expiry on the tracking calendar [18:25:53] RECOVERY - HTTPS on dataset1001 is OK: SSL OK - Certificate dumps.wikimedia.org valid until 2017-04-26 10:47:38 +0000 (expires in 427 days) [18:25:53] =] [18:26:07] (03PS3) 10Ottomata: Update hiera heapsize for hive and oozie servers [puppet] - 10https://gerrit.wikimedia.org/r/272783 (https://phabricator.wikimedia.org/T110090) (owner: 10Joal) [18:26:21] cool joal thanks. 
fingers crossed the start scripts actually respect this setting [18:26:24] 6Operations, 10Traffic, 7HTTPS: ssl certificate replacement: dumps.wikimedia.org (expires 2016-02-26) - https://phabricator.wikimedia.org/T122321#2056295 (10RobH) 5Open>3Resolved This was pushed live today, resolved! [18:26:32] (03CR) 10Ottomata: [C: 032 V: 032] Update hiera heapsize for hive and oozie servers [puppet] - 10https://gerrit.wikimedia.org/r/272783 (https://phabricator.wikimedia.org/T110090) (owner: 10Joal) [18:26:40] ottomata: I didn't check your templates for heapsize setting :) [18:26:55] hehe [18:26:56] the templates are good [18:27:02] cool [18:27:09] but i remember problems with the cdh .sh scripts that actually launch the JVM not respecting the env var [18:27:18] Arrrrrfn [18:27:19] 26/04/2017 robh? [18:27:33] Lrt's check that at restart ottomata [18:27:40] yup [18:27:48] apergos: it was a transfer from a competitor so they gave us some extra days [18:28:01] great [18:28:14] thanks1 [18:28:15] ! [18:28:17] if its less than a month or two, its no big deal to have it extend out slightly longer than a year [18:28:24] we just dont wanna do that with the large unified cert is all. [18:28:32] ottomata: I don't have access to an1015 :( [18:29:12] doh! [18:29:16] gotcha [18:30:07] (03PS1) 10Cmjohnson: Adding priavate dns entry for test server. Testing R430 with new LSI Raid Card for potential use. Changes will be removed after testing. 
[dns] - 10https://gerrit.wikimedia.org/r/272785 [18:33:29] (03PS1) 10Ottomata: Add analytics1015.yaml for analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/272788 (https://phabricator.wikimedia.org/T119646) [18:34:06] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: un-rv (duration: 01m 37s) [18:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:34:23] (03CR) 10Ottomata: [C: 032 V: 032] Add analytics1015.yaml for analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/272788 (https://phabricator.wikimedia.org/T119646) (owner: 10Ottomata) [18:35:44] (03PS3) 10BBlack: cache_misc: remove cp1057, cp1070 [puppet] - 10https://gerrit.wikimedia.org/r/272726 (https://phabricator.wikimedia.org/T125486) [18:36:21] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: remove cp1057, cp1070 [puppet] - 10https://gerrit.wikimedia.org/r/272726 (https://phabricator.wikimedia.org/T125486) (owner: 10BBlack) [18:37:16] (03PS1) 10Cmjohnson: Adding prseed and dhcp entries for test servers wmf4727-test. This is test server for raid controller card and will removed once testing is complete. [puppet] - 10https://gerrit.wikimedia.org/r/272789 [18:38:31] (03CR) 10Cmjohnson: [C: 032] Adding priavate dns entry for test server. Testing R430 with new LSI Raid Card for potential use. Changes will be removed after testing. [dns] - 10https://gerrit.wikimedia.org/r/272785 (owner: 10Cmjohnson) [18:39:03] (03PS2) 10Cmjohnson: Adding prseed and dhcp entries for test servers wmf4727-test. This is test server for raid controller card and will removed once testing is complete. [puppet] - 10https://gerrit.wikimedia.org/r/272789 [18:40:44] (03CR) 10Andrew Bogott: [C: 032] "hotfix testing shows this to be harmless." 
[puppet] - 10https://gerrit.wikimedia.org/r/272784 (owner: 10Andrew Bogott) [18:40:54] (03PS2) 10Andrew Bogott: designate: Set pool_target master to a public ip rather than 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/272784 [18:42:05] (03CR) 10Cmjohnson: [C: 032] Adding prseed and dhcp entries for test servers wmf4727-test. This is test server for raid controller card and will removed once testing is [puppet] - 10https://gerrit.wikimedia.org/r/272789 (owner: 10Cmjohnson) [18:43:20] (03PS3) 10Andrew Bogott: designate: Set pool_target master to a public ip rather than 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/272784 [18:45:12] (03CR) 10Andrew Bogott: [C: 032] designate: Set pool_target master to a public ip rather than 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/272784 (owner: 10Andrew Bogott) [18:49:05] Krinkle, ping [18:52:12] (03PS4) 10Andrew Bogott: Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - 10https://gerrit.wikimedia.org/r/272771 (https://phabricator.wikimedia.org/T124680) [18:54:41] (03PS1) 10BBlack: 2layer: remove dead nodes storage_size [puppet] - 10https://gerrit.wikimedia.org/r/272792 (https://phabricator.wikimedia.org/T125486) [18:56:01] (03PS1) 10Krinkle: wmfstatic: Allow longer client-side caching of 'nohash' responses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272793 [18:56:05] ori: ^ [18:56:24] (03PS2) 10Krinkle: Set $wgResourceBasePath to "/w" for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271710 (https://phabricator.wikimedia.org/T99096) [18:56:31] (03PS2) 10Krinkle: Set $wgResourceBasePath to "/w" for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271711 (https://phabricator.wikimedia.org/T99096) [18:57:08] !log db1021 replacing disk 7 [18:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:57:19] (03CR) 10BBlack: [C: 032] 2layer: remove dead nodes storage_size [puppet] - 
10https://gerrit.wikimedia.org/r/272792 (https://phabricator.wikimedia.org/T125486) (owner: 10BBlack) [18:58:07] db1021 is depooled (preciselly in testing due to the data issue, so no big issue) [18:59:08] 6Operations, 6Discovery, 10Maps, 10Traffic, 13Patch-For-Review: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#2056415 (10BBlack) [18:59:18] 6Operations, 6Discovery, 10Maps, 10Traffic, 13Patch-For-Review: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1542014 (10BBlack) [18:59:20] 6Operations, 10ops-eqiad, 10Traffic, 13Patch-For-Review: eqiad cache cluster re-arrangements - https://phabricator.wikimedia.org/T125486#2056416 (10BBlack) [19:00:01] (03CR) 10Ori.livneh: [C: 031] wmfstatic: Allow longer client-side caching of 'nohash' responses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272793 (owner: 10Krinkle) [19:00:04] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160223T1900). Please do the needful. [19:00:15] 6Operations, 10ops-eqiad, 10Traffic, 13Patch-For-Review: eqiad cache cluster re-arrangements - https://phabricator.wikimedia.org/T125486#1988833 (10BBlack) [19:00:39] 6Operations, 10ops-eqiad, 10Traffic, 13Patch-For-Review: eqiad cache cluster re-arrangements - https://phabricator.wikimedia.org/T125486#1988833 (10BBlack) Status update: this is basically-done other than sorting out decom/reclaim. [19:01:55] Krinkle: Krenair are you deploying now? [19:02:03] aude: I am not [19:02:08] I am trying to fix an issue with a previous deployment [19:02:09] ok [19:02:30] take your time, but let me know when you are done [19:02:30] do you have something important to deploy? 
[19:02:33] ok, thanks [19:02:37] !log download and decompress enwiki sessions dataset on labsdb1001 [19:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:03:06] we want to deploy https://phabricator.wikimedia.org/T109675 [19:03:28] we also wanted to update the wikidata branch on wmf14 [19:03:53] but can be anytime in the next few hours ... [19:04:50] PROBLEM - RAID on db1021 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [19:05:50] (03CR) 10Luke081515: [C: 04-1] "I think your second comment is wrong, but I think the config for stewards is still improvable." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [19:06:11] Krenair: FYI: ^ [19:06:36] Yeah, common passwords didn't exist when I made the stewards patch [19:07:17] I'll notify them and update the policy separately [19:07:21] (03CR) 10BBlack: [C: 031] wmfstatic: Allow longer client-side caching of 'nohash' responses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272793 (owner: 10Krinkle) [19:07:52] ok :) [19:08:32] csteipp: but the problem still exits, I guess you can override this policy by renaming the groups [19:15:19] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.32.195:9042 on restbase1008 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [19:15:51] godog: what's the cassandra-b thing about anyways? [19:17:47] apergos: each restbase machine is running multiple instance of cassandra, with cassandra-b being the second [19:18:38] and what's happening with the specific instance on there? I ask cause b black and I were lightly speculating on its connection to a 503 issue. very lightly mind you [19:19:48] apergos: it is bootstrapping, IOW joining the existing cluster, that shouldn't impact cassandra clients (i.e. 
restbase in this case) though [19:20:11] prolly completely unrelated then [19:20:23] well chalk up another 'unexplained' blip then :-) [19:20:46] I'd say so too but possible in theory, still unexplained? [19:20:59] the blip went away so [19:22:07] it's cosmic rays from a new supernova, which disturbed server memory for a 10 minute period, causing broken responses. [19:22:10] and it did not come back. over an hour ago [19:22:15] that's it! [19:22:53] I shoulda just looked to see if we had repeat blip since then but [19:23:06] shrug [19:23:19] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Puppet has 1 failures [19:23:28] doesn't seem to have recurred [19:25:08] nope [19:29:22] (03CR) 10Paladox: "recheck" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/237700 (https://phabricator.wikimedia.org/T111377) (owner: 10Hashar) [19:33:31] (03PS1) 10Alex Monk: Try to force user.defaults module to change to unbreak huwiki editing for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272799 [19:33:34] Krinkle, ^ [19:34:20] initially that was James' suggestion of 'RLsucks' [19:34:29] then I changed it without thinking much [19:34:47] Krenair: use 1/0 for user options, not boolean. 
[19:35:05] Fine otherwise :) [19:35:22] (03PS2) 10Alex Monk: Try to force user.defaults module to change to unbreak huwiki editing for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272799 [19:35:36] (03CR) 10Alex Monk: [C: 032] Try to force user.defaults module to change to unbreak huwiki editing for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272799 (owner: 10Alex Monk) [19:36:01] (03Merged) 10jenkins-bot: Try to force user.defaults module to change to unbreak huwiki editing for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272799 (owner: 10Alex Monk) [19:36:59] James_F, Krinkle: ^ syncing [19:38:24] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/272799 (duration: 01m 37s) [19:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:39:29] Krinkle, so um [19:39:31] mw.user.options.get( 'T47877-buster' ) [19:39:31] null [19:39:35] mw.user.options.get( 'visualeditor-editor' ) [19:39:35] "wikitext" [19:39:35] etc. [19:40:20] [MwRlVersionTrack] Version of "user.defaults" changed from "dR8aPl//" to "XdDH83DW". [19:40:20] mw.user.options.get( 'T47877-buster' ) [19:40:20] 1 [19:40:20] mw.user.options.get( 'visualeditor-editor' ) [19:40:20] "visualeditor" [19:40:54] open up an incognito window and try it on https://hu.wikipedia.org/w/index.php?title=David_Wenham&action=edit ? [19:41:11] Krenair: Yeah, works for me now. [19:41:15] oh [19:41:15] same here [19:41:18] just started working [19:41:21] Give it 5 min :) [19:41:30] Finally. [19:41:30] Thank you, Krinkle. 
[19:41:50] Thanks for all your help and patience Krinkle [19:47:23] 6Operations, 10MediaWiki-Logging, 10Wikimedia-IRC-RC-Server, 10Wikimedia-Stream, and 2 others: Verify that logs, irc, rcstream changes can flow from codfw to eqiad - https://phabricator.wikimedia.org/T126472#2056703 (10Gehel) Traffic from mw cluster in codfw to udpmxircecho is going through our wave (see c... [19:47:48] yw :) [19:48:09] RECOVERY - RAID on db1021 is OK: OK: optimal, 1 logical, 2 physical [19:49:31] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [20:00:04] ostriches: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160223T2000). Please do the needful. [20:00:25] jouncebot: No it isn't go away [20:02:27] Krenair: are you done and everything good? [20:02:42] aude, think so! [20:02:50] ok [20:02:59] * aude will need to run scap, etc. :/ [20:03:17] I'm still a little nervous about it truly being fixed [20:03:21] but not deploying anything right now [20:03:41] :/ [20:04:46] (03PS1) 10Ottomata: Increase analytics mysql meta innodb_buffer_pool_size to 1G, query_cache_size to 16M [puppet] - 10https://gerrit.wikimedia.org/r/272808 (https://phabricator.wikimedia.org/T119646) [20:05:15] (03PS2) 10Ottomata: Increase analytics mysql meta innodb_buffer_pool_size to 1G, query_cache_size to 16M [puppet] - 10https://gerrit.wikimedia.org/r/272808 (https://phabricator.wikimedia.org/T119646) [20:05:54] (03CR) 10Ottomata: [C: 032 V: 032] Increase analytics mysql meta innodb_buffer_pool_size to 1G, query_cache_size to 16M [puppet] - 10https://gerrit.wikimedia.org/r/272808 (https://phabricator.wikimedia.org/T119646) (owner: 10Ottomata) [20:07:17] aude: Let me know when you're done :) [20:07:30] no rush. I'm up for some things after the train (which isn't happening) [20:07:39] ok [20:10:39] tgr: ori: We're up in ~ 1h. 
Can prepare in -perf when you're available [20:11:13] i'm going to deploy some stuf for bd808 etc also [20:26:32] !log aude@tin Synchronized php-1.27.0-wmf.13/includes/session/CookieSessionProvider.php: Fix invalid key warning (duration: 01m 32s) [20:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:43] bd808: ^ [20:27:10] aude: thanks! [20:27:51] now the other patches and the wikibase stuff [20:30:02] (03PS1) 10Ottomata: Analytics MySQL Meta instance: 4G innodb_buffer_pool_size 4 instances [puppet] - 10https://gerrit.wikimedia.org/r/272813 (https://phabricator.wikimedia.org/T119646) [20:30:53] and waiting a while for jenkins, of course :/ [20:32:08] (03PS2) 10Ottomata: Analytics MySQL Meta instance: 4G innodb_buffer_pool_size 4 instances [puppet] - 10https://gerrit.wikimedia.org/r/272813 (https://phabricator.wikimedia.org/T119646) [20:32:26] (03CR) 10Ottomata: [C: 032 V: 032] Analytics MySQL Meta instance: 4G innodb_buffer_pool_size 4 instances [puppet] - 10https://gerrit.wikimedia.org/r/272813 (https://phabricator.wikimedia.org/T119646) (owner: 10Ottomata) [20:40:55] Krinkle: the backport chain is https://gerrit.wikimedia.org/r/#/c/272815/ https://gerrit.wikimedia.org/r/#/c/272816/ https://gerrit.wikimedia.org/r/#/c/272641/ (and jenkins is taking its sweet time with it) [20:42:21] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: Puppet has 110 failures [20:43:21] except I omitted the first one :( [20:50:54] !log initiating `nodetool upgradesstables -a' in staging to rewrite sstables (restored earlier compression settings) [20:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:51:22] jenkins is being especially slow :/ [20:55:36] (03PS1) 10Ottomata: Fix typo in my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/272821 [20:55:51] (03CR) 10Ottomata: [C: 032 V: 032] Fix typo in my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/272821 (owner: 10Ottomata) [21:00:04] 
Krinkle: Dear anthropoid, the time has come. Please deploy Perf deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160223T2100). [21:08:35] Krinkle: do you know if https://gerrit.wikimedia.org/r/#/c/272812/ is somehow stuck? [21:09:00] dunno see https://integration.wikimedia.org/zuul/ [21:09:05] i don't see it on https://integration.wikimedia.org/zuul/ [21:09:48] i know jenkins is slow sometimes but this is insanely slow :/ [21:09:50] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:10:24] (03CR) 10Aaron Schulz: [C: 031] Define service entries for InitialiseSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [21:10:35] now i see it :) [21:10:50] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:11:20] PROBLEM - configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:11:30] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:11:39] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:00] PROBLEM - Check size of conntrack table on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:30] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:40] PROBLEM - salt-minion processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:50] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:30] PROBLEM - dhclient process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:15:39] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:15:51] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:16:10] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:17:21] aude: zuul intentionally runs with a 5 second delay now [21:18:03] legoktm: ok [21:18:19] is someone looking at mw1114? [21:18:30] idk if just hhvm needs restarting [21:19:19] 7Blocked-on-Operations, 10RESTBase: Separate metrics, logs, and monitoring between staging and production - https://phabricator.wikimedia.org/T103124#2057260 (10Eevans) [21:23:23] aude: no, it's offline [21:23:25] not responding to ssh [21:23:28] needs opsen to reboot it [21:23:36] ok [21:23:42] yeah, i can't ssh to it [21:24:43] it's a canary api server, afaik [21:26:19] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 64.00% of data above the critical threshold [5000000.0] [21:26:44] hello ops? [21:27:04] i prefer not to scap yet [21:27:27] (03PS2) 10Ottomata: Remove $::realm conditionals around varnishkafka for labs webrequests [puppet] - 10https://gerrit.wikimedia.org/r/271685 (https://phabricator.wikimedia.org/T127369) [21:30:06] aude: because of mw1114? 
[21:30:16] pybal will have depooled it instantly [21:30:17] suppose we can do sync-common there [21:30:21] you don't have to worry about that [21:30:22] ok [21:30:32] * aude proceeds :) [21:30:39] (03CR) 10Ottomata: [C: 032] Remove $::realm conditionals around varnishkafka for labs webrequests [puppet] - 10https://gerrit.wikimedia.org/r/271685 (https://phabricator.wikimedia.org/T127369) (owner: 10Ottomata) [21:30:44] (03PS1) 10Alex Monk: Remove old WEF dashboard IP from enwiki ratelimit exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272837 (https://phabricator.wikimedia.org/T126541) [21:31:24] !log aude@tin Started scap: Add Wikidata i18n messages to WikimediaMessages, update Wikibase on wmf14, and some core backports [21:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:32:52] !log reboot mw1114 via mgmt as unresponsive [21:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:34:49] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: puppet fail [21:35:00] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [21:35:19] RECOVERY - configured eth on mw1114 is OK: OK - interfaces up [21:35:21] RECOVERY - DPKG on mw1114 is OK: All packages OK [21:35:32] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212 [21:35:49] RECOVERY - dhclient process on mw1114 is OK: PROCS OK: 0 processes with command name dhclient [21:35:50] RECOVERY - Disk space on mw1114 is OK: DISK OK [21:35:50] RECOVERY - Check size of conntrack table on mw1114 is OK: OK: nf_conntrack is 0 % full [21:36:01] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [21:36:20] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 25 processes with command name hhvm [21:36:20] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed [21:36:39] RECOVERY - salt-minion 
processes on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:36:39] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 68135 bytes in 4.417 second response time [21:37:21] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 499 bytes in 0.251 second response time [21:39:20] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [21:44:49] (03PS8) 10Phedenskog: webperf: Create new navtiming metric with higher value limit [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) [21:45:14] 6Operations, 10Analytics, 6Analytics-Kanban, 13Patch-For-Review: Increase HADOOP_HEAPSIZE (-Xmx) for hive-server2 - https://phabricator.wikimedia.org/T76343#2057411 (10Ottomata) [21:46:34] (03PS9) 10Phedenskog: webperf: Create new navtiming metric with higher value limit [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) [21:48:20] (03PS1) 10Jforrester: VisualEditor: Enable in two extra namespaces on French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272889 (https://phabricator.wikimedia.org/T127819) [21:48:40] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: puppet fail [21:48:50] (03PS2) 10Dzahn: fix whitespace-related lint issues [puppet] - 10https://gerrit.wikimedia.org/r/272666 [21:49:51] (03CR) 10Eevans: "No matter the technical merits, the outcome here is pretty awful. Anything would have been better than leaving this open for 6 months, wh" [puppet] - 10https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi) [21:54:09] (03CR) 10GWicke: "Right, so lets separate them." 
[puppet] - 10https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi) [21:55:15] (03CR) 10Dzahn: [C: 032] fix whitespace-related lint issues [puppet] - 10https://gerrit.wikimedia.org/r/272666 (owner: 10Dzahn) [21:57:13] aude: Done? [21:57:33] Krinkle: not yet :( [21:58:43] (03CR) 10Krinkle: "fixme: https://github.com/search?q=live-1.5+@wikimedia&type=Code&ref=searchresults" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272607 (owner: 10MaxSem) [22:00:08] (03PS2) 10Dzahn: phabricator: move roles to module/role/ [puppet] - 10https://gerrit.wikimedia.org/r/272667 [22:00:23] (03CR) 10Dzahn: [C: 032] "noop http://puppet-compiler.wmflabs.org/1844/" [puppet] - 10https://gerrit.wikimedia.org/r/272667 (owner: 10Dzahn) [22:00:42] 7Blocked-on-Operations, 10RESTBase: Separate metrics, logs, and monitoring between staging and production - https://phabricator.wikimedia.org/T103124#2057480 (10GWicke) I'm disappointed by the delay as well. The main issue has been nobody actually working on this. [22:02:49] (03PS1) 10Rush: labstore: stage tc script and enable on boot [puppet] - 10https://gerrit.wikimedia.org/r/272891 [22:03:00] (03CR) 10Dzahn: "confirmed on iridium" [puppet] - 10https://gerrit.wikimedia.org/r/272667 (owner: 10Dzahn) [22:03:26] (03PS2) 10Dzahn: ferm: fix "not documented" warnings [puppet] - 10https://gerrit.wikimedia.org/r/272674 [22:04:12] (03PS1) 10Cmjohnson: Fixing partman recipe that wmf4727-test uses. 
Needed gpt [puppet] - 10https://gerrit.wikimedia.org/r/272892 [22:04:19] (03CR) 10Dzahn: [C: 032] "comments only" [puppet] - 10https://gerrit.wikimedia.org/r/272674 (owner: 10Dzahn) [22:04:29] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:04:34] (03CR) 10jenkins-bot: [V: 04-1] labstore: stage tc script and enable on boot [puppet] - 10https://gerrit.wikimedia.org/r/272891 (owner: 10Rush) [22:04:50] (03CR) 10Dzahn: [V: 032] ferm: fix "not documented" warnings [puppet] - 10https://gerrit.wikimedia.org/r/272674 (owner: 10Dzahn) [22:05:23] (03PS2) 10Dzahn: backup: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/272668 [22:06:37] (03PS2) 10Rush: labstore: stage tc script and enable on boot [puppet] - 10https://gerrit.wikimedia.org/r/272891 [22:10:41] (03CR) 10Dzahn: [C: 032] "noop http://puppet-compiler.wmflabs.org/1845/" [puppet] - 10https://gerrit.wikimedia.org/r/272668 (owner: 10Dzahn) [22:11:06] (03CR) 10MaxSem: "That repository is abandoned." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272607 (owner: 10MaxSem) [22:12:31] (03CR) 10Dzahn: "confirmed on helium, lithium and heze" [puppet] - 10https://gerrit.wikimedia.org/r/272668 (owner: 10Dzahn) [22:14:59] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [22:15:24] (03PS2) 10Dzahn: role/mail: split file, move to module/role/ [puppet] - 10https://gerrit.wikimedia.org/r/271738 [22:16:18] aude: Did it crash? [22:16:24] scap hasn't taken over 15minutes in weeks [22:16:28] or months [22:16:31] it's been an hour [22:16:44] (03PS1) 10Ottomata: Use different my.cnf in labs for analytics-meta instance [puppet] - 10https://gerrit.wikimedia.org/r/272896 [22:17:08] Krinkle: looks like rebuild-cdbs is running [22:18:00] Krinkle: Not true, I had several ~30m scaps in the past few weeks [22:18:23] Hm.. 
what makes it vary [22:18:27] 6Operations, 10DNS, 10Internet-Archive, 10Traffic, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#2057570 (10Dzahn) Since we said it's going to be external and i abandoned the change to add that CNAME on our side,... [22:18:32] bd808: Oh? It runs it on each server separately? [22:18:34] 2 branches active too? where do we have .14 active [22:18:51] Krinkle: yes, to turn the json back into cdbs [22:18:57] Seeing lots of mwdeploy ssh processes to eqiad and codfw from tin all running the same [22:19:03] I thought we built on tin and synced [22:19:14] we sync the json [22:19:16] I guess it can sometimes be faster to sync a small delta [22:19:19] and built cdb [22:19:33] but we've had quick syncs that sync cdb files everywhere [22:19:34] we build cdb, turn into json, sync json and turn back into cdb [22:19:41] it's slooooooooooooooow :( [22:19:57] Yeah, I understand the rationale. Though we're not using deltas or git yet, (right?) and this is taking longer now.. [22:20:07] I can get timing data to see what was crappy [22:20:28] correct, no precomputed deltas. all rsync [22:20:42] then i need to do some config patches, but those should be fairly quick [22:20:56] (03CR) 10Andrew Bogott: [C: 04-1] "I would prefer that it also applied changes /now/ as well as pending for next reboot; otherwise it's kind of like hiding a surprise that w" [puppet] - 10https://gerrit.wikimedia.org/r/272891 (owner: 10Rush) [22:21:05] It looks like all of the l10n caches ended up being dirty too [22:21:10] bd808: and rsync full json is faster than full cdb? [22:21:14] 391 langs * 2 branches [22:21:37] Krinkle: much much faster (2 years ago when we tested and implemented) [22:21:55] (03PS2) 10Cmjohnson: Fixing partman recipe that wmf4727-test uses. 
Needed gpt [puppet] - 10https://gerrit.wikimedia.org/r/272892 [22:21:56] that's suspicious [22:22:04] * aude totally could have avoided needing scap if i added the new messages we need ~2 weeks ago :( [22:22:05] why would that matter? [22:22:09] cdb is binary and rsyncs badly [22:22:18] Hm. ok [22:22:23] i tend to forget when enabling wikibase on new wikis :/ [22:22:25] but rsync just replaces the file, right? [22:22:40] I guess rsync could compress it somewhat if json [22:22:42] no, very fancy delta computation [22:22:47] really? [22:23:05] yeah rsync is the best delta transport system out there mostly [22:23:09] !log aude@tin Finished scap: Add Wikidata i18n messages to WikimediaMessages, update Wikibase on wmf14, and some core backports (duration: 51m 45s) [22:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:23:14] alright. [22:23:24] bd808: cool [22:23:34] bd808: ori tgr please check your stuff [22:23:40] but the deltas are recomputed for each client which is not optimal for the sort of thing we are doing [22:23:41] i'm checking that my messages are ok [22:23:58] bd808: So on this subject of cdbs and scap speed, I've had some anecdotal experiences on that. [22:24:05] I'm pretty sure we don't always write the md5 files [22:24:13] And end up needlessly recomputing data as a result. [22:24:27] there's a scap bug right now in the mtime comparisons [22:24:38] Yeah I know that issue, but this isn't that afaik. [22:24:41] which is making the master-master sync slower than it should be [22:25:05] how would we skip writing an md5 checkfile? [22:25:16] or are you thinking the md5s can be corrupted? [22:25:21] I saw permission errors in the logs before. 
[22:25:24] So we couldn't write the files [22:25:28] ah [22:25:42] (03CR) 10Aude: [C: 032] Enable WikibaseClient on Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272738 (https://phabricator.wikimedia.org/T109675) (owner: 10Aude) [22:25:42] Then next run sees no files, assumes they never existed, and rebuilds everything. [22:25:47] and bad/no error reporting for that [22:26:05] lemme see if I can dig up the old logs I saw [22:26:29] (03Merged) 10jenkins-bot: Enable WikibaseClient on Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272738 (https://phabricator.wikimedia.org/T109675) (owner: 10Aude) [22:26:56] aude: what's going out exactly? [22:28:15] awww, my 30d query to kibana made it fall over [22:28:32] heh. that will happen [22:28:54] tgr: https://gerrit.wikimedia.org/r/#/c/272795/ [22:28:56] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Memcached+eqiad&h=mc1017.eqiad.wmnet&jr=&js=&v=969&m=instantaneous_ops_per_sec&vl=ops%2Fs&ti=instantaneous_ops_per_sec [22:29:04] and corresponding patch for wmf14 [22:29:25] 1.6k/sec -> 1.0/sec [22:29:27] aude: which core backports did you just sync? 
I assume we didn't just sync things that werne't merged in their wmf branch [22:29:31] (and that's one server) [22:29:32] And I don't see the merges yet [22:29:33] cdb rebuilds "only" took ~8 minutes of the scap [22:30:32] Krinkle: onlythings merged in the branches [22:30:36] only* [22:30:40] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/1846/ (note the fails are unrelated and exist in the production version too)" [puppet] - 10https://gerrit.wikimedia.org/r/271738 (owner: 10Dzahn) [22:30:43] !log aude@tin Synchronized wmf-config/Wikibase.php: Add Wikiversity site link section to Wikidata (duration: 02m 23s) [22:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:30:48] * aude checks [22:30:52] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1846/ (note the fails are unrelated and exist in the production version too)" [puppet] - 10https://gerrit.wikimedia.org/r/271738 (owner: 10Dzahn) [22:31:08] okay, so nothing from https://etherpad.wikimedia.org/p/perf-20160223 went out yet [22:31:21] Once you're done I'll stage some of them on tin and sync to mw1017. Starting with wmf13 [22:31:27] aude: that's a trivial change, should be fine [22:32:25] (03CR) 10Dzahn: "confirmed on mx1001, mx2001, .." [puppet] - 10https://gerrit.wikimedia.org/r/271738 (owner: 10Dzahn) [22:34:24] l10n build on tin took 13m 22s; sync-masters took 04m 20s; sync-proxies took 06m 16s; sync-apaches took 19m 06s; scap-rebuild-cdbs took 07m 46s [22:34:33] aude: Ready? :) [22:35:52] Krinkle: 2-3 more mintes [22:35:55] minutes* [22:36:03] (03CR) 10Dzahn: "wanna remove the -1? 
see compiler link now" [puppet] - 10https://gerrit.wikimedia.org/r/260939 (owner: 10Dzahn) [22:36:19] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Add Wikibase settings for Wikiversity (duration: 01m 31s) [22:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:38:42] (03PS5) 10Dzahn: Program Dashboard configuration for initial labs rollout [puppet] - 10https://gerrit.wikimedia.org/r/271149 (https://phabricator.wikimedia.org/T105967) (owner: 10Dduvall) [22:38:49] (03CR) 10Dzahn: [C: 032] Program Dashboard configuration for initial labs rollout [puppet] - 10https://gerrit.wikimedia.org/r/271149 (https://phabricator.wikimedia.org/T105967) (owner: 10Dduvall) [22:40:37] (03PS3) 10Rush: labstore: stage tc script and enable on boot [puppet] - 10https://gerrit.wikimedia.org/r/272891 [22:40:40] mutante: oh no! i was just going to fix something there [22:40:45] :) [22:41:47] (03PS1) 10Aude: Temporarily rv adding wikiversity to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272898 [22:41:52] Krinkle: i forgot a few things i need to do [22:41:55] (03PS1) 10Thcipriani: Beta: fix nutcracker config changes [puppet] - 10https://gerrit.wikimedia.org/r/272899 (https://phabricator.wikimedia.org/T127845) [22:41:59] marxarelli: me too :) [22:42:24] mutante: i messed up the hieradata [22:42:26] aude: OK. I started merging something in core, so beware with git-pull. It might land in any sec. [22:42:41] (03CR) 10Aude: [C: 032] Temporarily rv adding wikiversity to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272898 (owner: 10Aude) [22:42:42] so i wait on the dblist change and do these things [22:42:52] (03CR) 10Rush: "as discussed on irc a two phase process. 
first staging all config with a "salt" rollout in phases and then puppet config to follow to ens" [puppet] - 10https://gerrit.wikimedia.org/r/272891 (owner: 10Rush) [22:42:56] Krinkle: that's fine, i am not touching core now [22:43:00] k :) [22:43:00] (03PS1) 10Rush: labstore: real time application of tc setup [puppet] - 10https://gerrit.wikimedia.org/r/272900 [22:43:10] (03Merged) 10jenkins-bot: Temporarily rv adding wikiversity to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272898 (owner: 10Aude) [22:43:24] marxarelli: no problem, just make a follow-up, at least the bigger part is in now [22:43:25] * aude needs to add some db tables and double check things like the sites tables [22:43:32] marxarelli: i wanna slightly move the role too [22:43:35] mutante: no worries though. i need to do a follow-up for ssh key authorization, etc. [22:43:43] marxarelli: alright [22:43:43] (03PS2) 10Rush: labstore: real time application of tc setup [puppet] - 10https://gerrit.wikimedia.org/r/272900 [22:43:44] mutante: thanks for the merge though [22:44:06] (03CR) 10Rush: "as discussed on irc a two phase process. first staging all config with a "salt" rollout in phases and then puppet config to follow to ens" [puppet] - 10https://gerrit.wikimedia.org/r/272891 (owner: 10Rush) [22:44:29] mutante: where do you think the role should be moved to? [22:44:55] and don't think i need to sync my last thing [22:45:23] Krinkle: go ahead and let me know when you are done [22:45:42] aude: no worries. We'll take a while though, so if you need to do anything else, do it now :) [22:46:21] (03PS1) 10Dzahn: programdashboard: move role to modules/role [puppet] - 10https://gerrit.wikimedia.org/r/272903 [22:46:33] marxarelli: like this ^ [22:47:06] mutante: ah, ok. 
i like that better as well [22:47:17] (03CR) 10Dduvall: [C: 031] programdashboard: move role to modules/role [puppet] - 10https://gerrit.wikimedia.org/r/272903 (owner: 10Dzahn) [22:47:25] marxarelli: it also fixes the "not in autoload layout" lint warning [22:47:28] 'k, cool [22:47:45] (03CR) 10Dzahn: [C: 032] programdashboard: move role to modules/role [puppet] - 10https://gerrit.wikimedia.org/r/272903 (owner: 10Dzahn) [22:48:25] Krinkle: i'll let you know when i am ready again [22:49:17] k [22:49:20] (03CR) 10Andrew Bogott: [C: 031] labstore: real time application of tc setup [puppet] - 10https://gerrit.wikimedia.org/r/272900 (owner: 10Rush) [22:49:26] (03CR) 10Andrew Bogott: [C: 031] labstore: stage tc script and enable on boot [puppet] - 10https://gerrit.wikimedia.org/r/272891 (owner: 10Rush) [22:50:19] (03PS4) 10Rush: labstore: stage tc script and enable on boot [puppet] - 10https://gerrit.wikimedia.org/r/272891 [22:52:58] (03CR) 10Rush: [C: 032] labstore: stage tc script and enable on boot [puppet] - 10https://gerrit.wikimedia.org/r/272891 (owner: 10Rush) [22:55:44] (03PS1) 10Aude: Revert "Temporarily rv adding wikiversity to wikidataclient.dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272904 [22:55:46] Krinkle: think i'm ready [22:55:46] (03PS1) 10Aude: Bump cache epoch for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272905 [22:55:58] should be quick, assuming no problems [22:56:04] * Krinkle sees another patch go in :D [22:56:12] for core? [22:56:16] (PS1) Aude: Bump cache epoch for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/272905 [22:56:29] Krinkle: we need both [22:57:00] It went to gerrit after you said you thought you were done :D - no worries, but it looked like things happened out of order. [22:57:54] can i sync these now? 
[22:57:58] yep [22:58:00] k [22:58:04] exit [23:04:41] aude: there's not many of us here [23:05:52] (03CR) 10Aude: [C: 032] Bump cache epoch for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272905 (owner: 10Aude) [23:06:19] (03Merged) 10jenkins-bot: Bump cache epoch for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272905 (owner: 10Aude) [23:07:05] 503 [23:07:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:08:45] !log aude@tin Synchronized wmf-config/Wikibase.php: Bump cache epoch for Wikidata for Wikiversity sitelinks section (duration: 01m 32s) [23:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:27] think i am done now :) [23:09:44] aude: yay :) [23:09:58] i would like to deploy https://gerrit.wikimedia.org/r/#/c/271336/ [23:10:14] but krinkle is in netsplit :/ [23:10:48] Why do you need Krinkle for that? There are two +1 on that change [23:11:07] because he's deploying stuff also [23:11:33] i can always wait for swat [23:11:40] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1559 bytes in 0.231 second response time [23:11:45] ugh [23:11:45] aude: write him on tin? [23:11:52] heh [23:12:46] He's on the dark side of the network now? [23:13:11] net split! 
[23:13:20] and irccloud isn't working [23:13:30] aude: I pulled changes down to tin and synced to mw1017 [23:13:36] Krinkle_: ok [23:13:47] i am done, though might put https://gerrit.wikimedia.org/r/#/c/271336/ into swat [23:13:58] (otherwise i might forget about that change) [23:15:23] (03CR) 10Krinkle: [C: 032] wmfstatic: Allow longer client-side caching of 'nohash' responses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272793 (owner: 10Krinkle) [23:16:11] (03Merged) 10jenkins-bot: wmfstatic: Allow longer client-side caching of 'nohash' responses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272793 (owner: 10Krinkle) [23:16:47] (03PS1) 10Andrew Bogott: designate.conf: remove terminal comment that was wreaking havok [puppet] - 10https://gerrit.wikimedia.org/r/272907 [23:17:21] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1538 bytes in 0.248 second response time [23:17:36] (03PS1) 10Smalyshev: Always add Access-Control-Allow-Origin for WDQS backend response [puppet] - 10https://gerrit.wikimedia.org/r/272908 (https://phabricator.wikimedia.org/T115476) [23:19:59] (03CR) 10Andrew Bogott: [C: 032] designate.conf: remove terminal comment that was wreaking havok [puppet] - 10https://gerrit.wikimedia.org/r/272907 (owner: 10Andrew Bogott) [23:22:44] !log krinkle@tin Synchronized w/static.php: (no message) (duration: 01m 35s) [23:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:59] 6Operations: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2057999 (10ArielGlenn) [23:25:49] 6Operations: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2057999 (10Krinkle) Phabricator chat rooms! [23:27:55] (03PS3) 10Cmjohnson: Fixing partman recipe that wmf4727-test uses. 
Needed gpt [puppet] - 10https://gerrit.wikimedia.org/r/272892 [23:28:40] 6Operations: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2057999 (10greg) Can't tell if trolling or not ;) But, that's actually not a bad idea. They can be publicly viewable/joinable and everyone working on an outage will have a phab account. It's be... [23:29:25] 6Operations, 10Deployment-Systems: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2058062 (10ArielGlenn) p:5Triage>3High [23:30:11] 6Operations, 10Deployment-Systems: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2057999 (10ArielGlenn) I slapped deployment-systems on here because people doing deployments will be one of the main users of such a fallback setup. Yeha, that's not ex... [23:33:55] 6Operations, 7HHVM, 13Patch-For-Review: Rise in "parent, LightProcess exiting" fatals - https://phabricator.wikimedia.org/T124956#2058122 (10greg) Ori made some patches that fixed this for Roan during a SWAT, but I'm still seeing this in our beta-scap-eqiad jenkins job. See eg: https://integration.wikimedia... [23:35:17] (03PS1) 10Andrew Bogott: labtest: Update designate domain ids [puppet] - 10https://gerrit.wikimedia.org/r/272911 [23:36:53] (03CR) 10Andrew Bogott: [C: 032] labtest: Update designate domain ids [puppet] - 10https://gerrit.wikimedia.org/r/272911 (owner: 10Andrew Bogott) [23:37:13] 6Operations, 7HHVM, 13Patch-For-Review: Rise in "parent, LightProcess exiting" fatals - https://phabricator.wikimedia.org/T124956#1970973 (10bd808) >>! In T124956#2058122, @greg wrote: > Ori made some patches that fixed this for Roan during a SWAT, but I'm still seeing this in our beta-scap-eqiad jenkins job... 
[23:37:30] (PS1) Ottomata: Add ori to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/272912 [23:37:41] (PS2) Ottomata: Add ori to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/272912 [23:38:06] (PS5) Andrew Bogott: Updates to designate/mdns/pdns setup for Labs internal dns [puppet] - https://gerrit.wikimedia.org/r/272771 (https://phabricator.wikimedia.org/T124680) [23:38:11] (CR) Ottomata: [C: 2 V: 2] Add ori to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/272912 (owner: Ottomata) [23:38:26] bd808: now you know I didn't really look at o.ri's patch, dangit [23:38:37] :) [23:38:44] mutante: follow-up to fix the hieradata if you have a sec https://gerrit.wikimedia.org/r/#/c/272906/ [23:48:23] hello, I'm not able to login at https://hue.wikimedia.org/accounts/login/ Can someone please help me recover my password? (I'm pretty sure I didn't change my password recently though) [23:49:00] mutante, sorry I just added you to acl*phabricator then noticed you had removed yourself. If you dodn't want to be in the group I didn't mean to override that choice [23:49:06] *didn't [23:49:44] Operations, Deployment-Systems: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2058180 (Krinkle) >>! In T127904#2058053, @greg wrote: > Can't tell if trolling or not ;) But, that's actually not a bad idea. Yep. Despite my trolling earlier, this... [23:51:27] Operations, Deployment-Systems: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2058182 (ArielGlenn) Etherpad is public, which might not be cool, a chunk of what we do might want to wind up in a private space. But it could replace _operations tem...
[23:52:31] Operations, Deployment-Systems: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2058185 (greg) >>! In T127904#2058062, @ArielGlenn wrote: > I slapped deployment-systems on here because people doing deployments will be one of the main users of such... [23:53:30] Operations, Ops-Access-Requests: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#2058186 (bmansurov) Resolved>Open Hello again, I'm suddenly unable to login at https://hue.wikimedia.org/accounts/login/. The error message I get is "Invalid username... [23:54:38] Operations, Deployment-Systems: Have fallback communication channel when freenode has problems - https://phabricator.wikimedia.org/T127904#2058216 (greg) >>! In T127904#2058182, @ArielGlenn wrote: > Etherpad is public, which might not be cool, a chunk of what we do might want to wind up in a private space... [23:57:23] !log krinkle@tin Synchronized php-1.27.0-wmf.13/autoload.php: (no message) (duration: 01m 42s) [23:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:14] (CR) Aaron Schulz: Add references to wmfServices for Cirrusearch. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto) [23:59:35] (CR) Aaron Schulz: [C: 1] Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto) [23:59:50] (CR) Aaron Schulz: [C: -1] "Needs double quotes" [mediawiki-config] - https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto)