[00:08:00] MatmaRex: gerrit gc depends on the official git
[00:08:36] and the crafted paths required for the exploits aren’t rejected by JGit
[00:24:35] ytrezq: looking into it
[00:25:24] ytrezq: do you know where I can find more details? http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-2324 is reserved, but no details yet.
[00:28:17] ori: it’s the third top Git maintainer (Jeff King, alias Peff) who found it, based on my finding of CVE-2016-2315
[00:29:34] ori: basically, everything that uses name_path() or revision.c is vulnerable (the only requirement is that name_path needs to be used)
[00:30:37] ytrezq: OK -- thanks very much for letting us know. Very good of you. I'm reading up a bit. May have a question or two :)
[00:32:31] ori: name_path was used nearly everywhere a path needed to be built, which means everywhere a tree object is parsed
[00:33:51] ori: CVE-2016-2315 was fixed for git 2.6 and above (it was the easiest one to trigger). 2.7.1 put in a real fix by removing name_path() completely, so it should avoid further problems.
[00:34:55] ytrezq: so you're saying it's possible to craft a commit which will allow remote code execution next time gerrit calls git-gc?
[00:35:17] ori: Yes, and I tested it on GitHub
[00:36:52] of course any attempt to clone a repository will trigger the bug on each client computer that downloads the repository
[00:37:21] ori: another mitigating factor is if wikimedia servers use big endian binaries: little endianness might be considered a requirement
[00:41:53] can you send the details to security@wikimedia.org?
[00:45:57] TimStarling: ori: CVE-2016-2315 is detailed by me here http://pastebin.com/1grmz9Q2 and on the Google git mailing list
[00:47:05] that's irresponsible
[00:47:26] I’m not the first who did it
[00:47:53] since it’s already done with details fixed in master
[00:50:28] it's still irresponsible
[00:51:02] you should be telling vendor-sec first, not the whole world
[00:52:07] it's not appropriate to publicize before patches are ready
[00:52:20] TimStarling: which was done http://thread.gmane.org/gmane.comp.version-control.git/286008
[00:53:55] TimStarling: anyway, the point is that MITRE isn’t doing its work. I raised the issue to only 1 organization, I can’t afford the time to deal with each one https://bounty.github.com/researchers/ytrezq.html
[00:54:21] yeah, I understand that you're new to this and you haven't got much time
[00:54:31] maybe you don't know what vendor-sec is and what it is for
[00:55:23] Contrary to GitHub, and as you can read on http://pastebin.com/1grmz9Q2 I think it allows for remote code execution.
[00:56:08] if you like I can ask our security folks to follow up on this and to get it fixed generally
[00:56:17] I mean, generally in the world
[00:56:31] you won't get there by talking to each and every git user on IRC
[00:56:36] TimStarling: yes do it please
[00:59:04] ok
[00:59:57] I'll try reporting that pastebin
[00:59:58] TimStarling: The point for security@wikimedia.org is only relevant for git versions under 2.6. For versions between 2.6 and 2.7.1 I’m unable to explain, just ask Jeff King (aka Peff)
[01:00:13] for the details
[01:03:11] TimStarling: that pastebin expires in 2 weeks, and is only relevant for versions before 2.6
[01:03:28] not those before 2.7.1
[01:08:29] TimStarling: I added information to the pastebin http://pastebin.com/UX2P2jjg
[01:08:39] TimStarling: I added information to a new pastebin http://pastebin.com/UX2P2jjg
[01:09:05] you are adding more pastebins?
[01:09:23] I am halfway through requesting removal of the first one
[01:09:59] TimStarling: full disclosure was done by Peff on gmane
[01:10:31] nobody reads that
[01:10:57] similar vulnerability reports on other development mailing lists have gone unnoticed for years
[01:13:09] that ML post doesn't even say it is a vulnerability
[01:13:47] TimStarling: I’m not talking to the one who got the patches
[01:14:20] TimStarling: and https://bounty.github.com/researchers/ytrezq.html kindly suggests that path_name() is the problem (question: what involves size_t to int truncation and paths?)
[01:15:25] TimStarling: check for “Current existing way for heap buffer overflow before https://github.com/git/git/commit/34fa79a6cde56d6d428ab0d3160cb094ebad3305” on git-security@googlegroups.com
[01:17:58] TimStarling: oh sorry, the real patch thread is here http://thread.gmane.org/gmane.comp.version-control.git/286253 (18 patches to fix this but not for previous versions, which will remain unfixed)
[01:25:21] thanks for reporting this, ytrezq
[01:29:29] TimStarling: there’s one thing I don’t understand: what I found has basically the same impact as the case insensitive .GIt directory, but doesn’t seem to appear in the news (I don’t mind if it’s by the news or other things). I can’t keep up with every important system
[01:30:05] TimStarling: I also don’t know how much time it generally takes for CVE details to get published
[01:31:39] it depends on what has been published elsewhere
[01:31:47] in the meantime don’t trust google gerrit nor GitLab
[01:32:37] TimStarling: once everything is fixed into master of course
[01:38:20] you demonstrated a crash, right? not RCE?
[01:40:01] TimStarling: yes, I don’t have the knowledge for performing RCE from a heap overflow
[01:40:12] ok
[01:41:11] However there are also underflows, and I demonstrated to github how this can be used to read server memory (since path_name() is also used to update trees)
[01:43:20] TimStarling: but for the safest case, remember incorrect strcpy use due to previously incorrectly allocated memory is as RCE as gets()
[01:43:41] sure
[01:44:35] the input is a file name from a tree object?
[01:44:59] TimStarling: or several files with a nested tree
[01:45:48] TimStarling: please don’t act like GitHub staff who did nothing for 2 months after I proved the crash (I mean until I proved a buffer underflow case)
[01:51:20] ytrezq: where did you report CVE-2016-2315, to github or the git developers? there was no communication either way to any distro security mailing list
[01:51:51] moritzm: just read here https://bounty.github.com/researchers/ytrezq.html
[01:53:15] Ugh https://de.wikipedia.org/wiki/Tantalcarbid Mobile rendered content in parser cache?
[01:53:19] moritzm: concerning distros I tried to communicate a cygwin specific CVE in git and I dropped the idea
[01:55:34] legoktm: ^ think you were looking into that before
[01:56:00] moritzm: I would really like it to break the news like the case insensitive .gIt directory in 2014 (CVE still unpublished); it would really help raise the issue to everyone
[01:58:05] TimStarling: I even tried to send details to our official « Bureau des failles et de la sécurité » without any response for 2 months
[01:59:33] the published/unpublished status at MITRE is a red herring, that just reflects the writeup in their internal database, what really matters is that it gets the attention of Linux distros or the general public. In this case the bug has been fixed without even an upstream announcement by the git developers (the 2.7.0 release notes don't mention it)
[01:59:59] so you reported it to git-security@googlegroups.com ?
is that archived somewhere?
[02:00:17] moritzm: I don’t know if it’s archived
[02:00:27] moritzm, you may want to read my security@wikimedia.org post, that has a summary of the story so far
[02:00:40] although I missed the bit about git-security@googlegroups.com
[02:00:59] Operations, MobileFrontend, Traffic, MW-1.27-release, and 7 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2093337 (Legoktm) https://de.wikipedia.org/wiki...
[02:01:11] hoo|busy: thanks https://phabricator.wikimedia.org/T124356#2093337
[02:02:27] ytrezq: I reported it to GitHub in order to get rewarded for the hours I spent on researching this. Jeff King is a GitHub staff member as well as the third top contributor of Git. So he simply used his access to fix everything
[02:03:26] his access to fix everything 3 months later
[02:04:45] But I fear if wikimedia is powered by sparc or IBM power there’s no need to worry (big endian)
[02:05:44] moritzm: I reported it to GitHub in order to get rewarded for the hours I spent on researching this. Jeff King is a GitHub staff member as well as the third top contributor of Git.
So he simply used his access to fix everything
[02:07:38] moritzm: the git-security@googlegroups.com thread name is Re: Current existing way for heap buffer overflow before https://github.com/git/git/commit/34fa79a6cde56d6d428ab0d3160cb094ebad3305
[02:08:59] It’s also on this thread that the different issues were brought to MITRE after the failure to get a reply from CERT
[02:09:16] but the initial report was done here https://bounty.github.com/submit-a-vulnerability.html
[02:09:38] last october
[02:18:46] need to go to bed
[02:32:15] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.15) (duration: 14m 27s)
[02:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:10:29] (PS2) Muehlenhoff: Enable base::firewall on eventlog1001 [puppet] - https://gerrit.wikimedia.org/r/274715 (https://phabricator.wikimedia.org/T113343)
[04:47:48] Operations, Ops-Access-Requests: Access to deployment group for user madhuvishy - https://phabricator.wikimedia.org/T128666#2093430 (Nuria) Approved, sorry for the couple days delay.
[05:05:09] Operations, Traffic, Wikimedia-Blog, HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2093434 (Tbayer) Update: Followed up again on March 4 and got yet another response from yet another person: It's actually possible to use WP-CLI for replacements, but scripts can't proc...
[06:05:02] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 17.86% of data above the critical threshold [100000000.0]
[06:30:22] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:32] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: puppet fail
[06:30:33] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: puppet fail
[06:31:03] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:03] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail
[06:31:22] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:23] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:11] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:12] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:23] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:54:54] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[06:56:51] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:57:01] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:57:01] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:57:11] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:12] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:57:41] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:57:42] RECOVERY - puppet last run on cp3048 is OK:
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:01] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:01] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:12] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[08:17:08] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#2093626 (Joe)
[08:17:10] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Check the memcached and redis (sessions) configuration and functionality in codfw - https://phabricator.wikimedia.org/T124879#2093624 (Joe) Open>Resolved
[08:23:33] (PS2) Giuseppe Lavagetto: [WiP] Add ipvs-related FSM [debs/pybal] - https://gerrit.wikimedia.org/r/272679
[08:25:21] (CR) Jcrespo: [C: 1] "Try to warm up the buffer pool (as they are not replicating, they will not do it naturally)-not a blocker for repooling."
[mediawiki-config] - https://gerrit.wikimedia.org/r/275166 (https://phabricator.wikimedia.org/T127330) (owner: Volans)
[08:25:26] (CR) jenkins-bot: [V: -1] [WiP] Add ipvs-related FSM [debs/pybal] - https://gerrit.wikimedia.org/r/272679 (owner: Giuseppe Lavagetto)
[08:29:38] (PS1) Giuseppe Lavagetto: appservers: decommission permanently mw1026-69 [puppet] - https://gerrit.wikimedia.org/r/275374 (https://phabricator.wikimedia.org/T126242)
[08:31:53] (CR) Giuseppe Lavagetto: [C: 2] appservers: decommission permanently mw1026-69 [puppet] - https://gerrit.wikimedia.org/r/275374 (https://phabricator.wikimedia.org/T126242) (owner: Giuseppe Lavagetto)
[08:41:52] <_joe_> !log disabled puppet on mw1026-69, cleaning up puppet facts and certs, then shutting them down
[08:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:50:44] Operations, DBA, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2093643 (jcrespo) We only need a double certificate on one server (one slave of s3, for example, db1021) and maybe, for a short period of time, on es2/es3. Codfw can be restarted all at on...
[08:54:41] (PS1) Muehlenhoff: Update to 4.4.4 [debs/linux44] - https://gerrit.wikimedia.org/r/275376
[08:55:57] (CR) DCausse: [C: 1] Create pool counter for CirrusSearch completion suggester [mediawiki-config] - https://gerrit.wikimedia.org/r/268029 (https://phabricator.wikimedia.org/T125547) (owner: EBernhardson)
[09:09:05] Operations: Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance - https://phabricator.wikimedia.org/T128730#2093664 (elukey) Hi Aaron! >>! In T128730#2089572, @aaron wrote: > How would the queues on 1003 drain if it's depooled? Wouldn't they just be left as they were be...
[09:09:41] (PS2) Muehlenhoff: Update to 4.4.4 [debs/linux44] - https://gerrit.wikimedia.org/r/275376
[09:13:03] Operations: Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance - https://phabricator.wikimedia.org/T128730#2093668 (Joe) To better frame the issue: We need a reliable method to depool one rdb host. The sequence should be something like: # MediaWiki stops writing messa...
[09:19:10] Operations: Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance - https://phabricator.wikimedia.org/T128730#2093672 (elukey) Would it be possible that sync-file wmf-config/jobqueue-eqiad.php is not enough to clear the following configuration? https://github.com/wikimedi...
[09:23:40] (PS3) Muehlenhoff: Update to 4.4.4 [debs/linux44] - https://gerrit.wikimedia.org/r/275376
[09:34:19] (PS4) Muehlenhoff: Update to 4.4.4 [debs/linux44] - https://gerrit.wikimedia.org/r/275376
[09:35:13] (CR) Muehlenhoff: [C: 2 V: 2] Update to 4.4.4 [debs/linux44] - https://gerrit.wikimedia.org/r/275376 (owner: Muehlenhoff)
[09:35:31] Operations: Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance - https://phabricator.wikimedia.org/T128730#2093688 (aaron) During the buffered job warning flood there was also a related flood of "No configuration for partition 'rdb2-6379'" exceptions in similar volume....
[09:37:33] (PS1) Aaron Schulz: Added some jobqueue comments [mediawiki-config] - https://gerrit.wikimedia.org/r/275378
[09:42:16] <_joe_> AaronSchulz: thanks for the added comments, would you care to comment on https://phabricator.wikimedia.org/T128730#2093668 as well?
:)
[09:42:55] <_joe_> just to be sure we're doing the right thing
[09:46:22] after sunrise
[09:48:25] <_joe_> AaronSchulz: oh yes of course :)
[09:48:40] <_joe_> (I was just leaving the comment behind for you when you wake up :P)
[09:52:25] Operations, Traffic, Patch-For-Review: Port varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T128788#2093709 (elukey) @ema: I discovered an interesting thing from https://www.varnish-cache.org/docs/trunk/reference/vsl.html: > Timestamp - Timing information > Contains timing informatio...
[10:01:19] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2093712 (Joe) I have removed every reference to mw1026-1069 from puppet and conftool, and shut down the machines. I'' also opening...
[10:03:13] Operations, ops-eqiad: mw1026-69 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T129060#2093716 (Joe)
[10:03:24] Operations, ops-eqiad: mw1026-69 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T129060#2093716 (Joe) a:Joe>None
[10:04:55] !log uploaded kernel-wedge 2.93+wmf1 for jessie-wikimedia to carbon (needed to build modern kernels)
[10:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:05:19] (PS1) Giuseppe Lavagetto: scap: remove decommissioned appservers from the scap dsh group [puppet] - https://gerrit.wikimedia.org/r/275383 (https://phabricator.wikimedia.org/T126242)
[10:05:41] <_joe_> jynus: if you need to merge any mediawiki-config patch, wait for a sec until I merge ^^
[10:06:10] (CR) Giuseppe Lavagetto: [C: 2] scap: remove decommissioned appservers from the scap dsh group [puppet] - https://gerrit.wikimedia.org/r/275383 (https://phabricator.wikimedia.org/T126242) (owner: Giuseppe Lavagetto)
[10:06:19] (CR) Giuseppe Lavagetto:
[V: 2] scap: remove decommissioned appservers from the scap dsh group [puppet] - https://gerrit.wikimedia.org/r/275383 (https://phabricator.wikimedia.org/T126242) (owner: Giuseppe Lavagetto)
[10:07:44] I am not ready yet, I am working on a complex patch, that is why I need 100% dedication to it
[10:08:08] <_joe_> ok, sorry, I didn't want to make your life harder by getting a ton of errors
[10:08:38] what I mean is go on with what you are doing
[10:08:54] <_joe_> {{done}} btw
[10:10:47] one thing I will need afterwards is codfw testing
[10:11:48] so I will need your (not joe's, mediawiki people in general) to do some load testing (Will that create problems with caches?)
[10:11:59] *permission
[10:12:26] <_joe_> jynus: I guess you'll be load-testing the appservers directly, right?
[10:13:11] yes, curl localhost was what I wanted to do
[10:13:28] <_joe_> then I don't think we risk any cache pollution
[10:13:30] maybe mysql directly in some cases, but I do not need permission for that :-P
[10:13:42] <_joe_> eheh
[10:13:55] someone mentioned something about emptying memcache/parsercache
[10:14:21] that is why I asked- GET's do writes, too
[10:15:16] Operations, CirrusSearch, Discovery, Discovery-Search-Sprint, and 2 others: Create a PKI that can be used by Puppet and for general purpose certificates - https://phabricator.wikimedia.org/T128077#2093741 (Joe)
[10:18:43] Operations, Performance-Team, Availability, Epic, and 3 others: Cleanup active-DC based MW config code and make it more robust and easy to change - https://phabricator.wikimedia.org/T114273#1690157 (Joe) a:aaron>Joe
[10:24:07] !log disable puppet on graphite1001 / graphite2001 / labmon1001 before merging https://gerrit.wikimedia.org/r/#/c/274716
[10:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:24:18] (PS4) Filippo Giunchedi: graphite: switch carbon-c-relay to carbon_ch hash [puppet] - https://gerrit.wikimedia.org/r/274716
(https://phabricator.wikimedia.org/T105218)
[10:24:27] (CR) Filippo Giunchedi: [C: 2 V: 2] graphite: switch carbon-c-relay to carbon_ch hash [puppet] - https://gerrit.wikimedia.org/r/274716 (https://phabricator.wikimedia.org/T105218) (owner: Filippo Giunchedi)
[10:35:41] Operations, Discovery, Maps, kartotherian, Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#2093759 (Yurik)
[10:35:45] Blocked-on-Operations, Operations, Discovery, Maps, and 4 others: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#2093761 (Yurik)
[10:35:59] Operations, Discovery, Maps, Traffic, and 2 others: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#2093768 (Yurik)
[10:38:26] Operations, Discovery, Maps, tilerator, Discovery-Maps-Sprint: water_polygons import is broken - https://phabricator.wikimedia.org/T112831#2093798 (Yurik)
[10:38:55] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#2093807 (Joe)
[10:38:57] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2093806 (Joe) Open>stalled
[10:42:11] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2093812 (Joe) I think we can reduce the pool size further, but it's already smaller than the current pool in codfw
[10:51:04] (PS6) Jcrespo: [WIP]Prepare db-codfw.php for a live deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/267659
[10:51:14] !log reimaging iron to jessie
[10:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:56:08]
(PS7) Jcrespo: Prepare db-codfw.php for a live deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/267659
[10:56:42] Operations, codfw-rollout, codfw-rollout-Jan-Mar-2016: Reorganize mw2* servers across clusters - https://phabricator.wikimedia.org/T129062#2093842 (Joe)
[10:57:05] Operations, codfw-rollout, codfw-rollout-Jan-Mar-2016: Reorganize mw2* servers across clusters - https://phabricator.wikimedia.org/T129062#2093856 (Joe) p:Normal>High a:Joe
[10:59:01] (CR) Jcrespo: "This is ready for review, please complain about any potential syntax or logical error, but let me deploy as soon as possible to perform pr" [mediawiki-config] - https://gerrit.wikimedia.org/r/267659 (owner: Jcrespo)
[11:01:45] (CR) Jcrespo: "For future work, master configuration should be refactored into a separate file (shared by db-eqiad.php and db-codfw.php), and as an array" [mediawiki-config] - https://gerrit.wikimedia.org/r/267659 (owner: Jcrespo)
[11:08:22] joe, I am going to steal your analysis on T129062 for dbs
[11:08:49] <_joe_> go on :)
[11:09:08] I am going to leave doing some pending partitioning on codfw, and then check fr
[11:16:57] Operations: reinstall bast2001 with jessie - https://phabricator.wikimedia.org/T128899#2093894 (faidon) Resolved>Open p:Triage>Normal We've been getting RAID failures. It looks like this: ```lines=10 [ 5040.656250] INFO: task apt-get:3558 blocked for more than 120 seconds. [ 5040.663553] Not...
[11:18:30] (CR) Luke081515: [C: -1] Modify throttle settings for frwiki and cawiki due to Workshop (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) (owner: MarcoAurelio)
[11:42:09] Operations, Traffic: Images not showing up at Commons - https://phabricator.wikimedia.org/T128961#2093978 (faidon)
[11:43:12] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: convert mw2153-62 to be jobrunners [puppet] - https://gerrit.wikimedia.org/r/275389 (https://phabricator.wikimedia.org/T129062) (owner: Giuseppe Lavagetto)
[11:49:21] (PS1) Giuseppe Lavagetto: mediawiki: remove the newly-appointed jobrunners from conftool [puppet] - https://gerrit.wikimedia.org/r/275421 (https://phabricator.wikimedia.org/T129062)
[11:49:52] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: remove the newly-appointed jobrunners from conftool [puppet] - https://gerrit.wikimedia.org/r/275421 (https://phabricator.wikimedia.org/T129062) (owner: Giuseppe Lavagetto)
[11:50:00] (CR) Giuseppe Lavagetto: [V: 2] mediawiki: remove the newly-appointed jobrunners from conftool [puppet] - https://gerrit.wikimedia.org/r/275421 (https://phabricator.wikimedia.org/T129062) (owner: Giuseppe Lavagetto)
[11:54:40] (CR) Jcrespo: "Requesting the abandonment of this change because the initial problem (detecting a stale slave) is now done thanks to pt-heartbeat, which " [puppet] - https://gerrit.wikimedia.org/r/270584 (owner: Hoo man)
[11:57:15] (CR) Jcrespo: "Will +1 if it is rebased after https://gerrit.wikimedia.org/r/267659" [mediawiki-config] - https://gerrit.wikimedia.org/r/266534 (owner: Dereckson)
[12:12:26] (PS1) Faidon Liambotis: mirrors: update ftpsync to 20160306 [puppet] - https://gerrit.wikimedia.org/r/275439
[12:12:55] (PS2) Muehlenhoff: Add ferm rules for kartotherian, tilerator and tileratorui [puppet] - https://gerrit.wikimedia.org/r/274936
[12:13:01] (CR) Faidon
Liambotis: [C: 2 V: 2] mirrors: update ftpsync to 20160306 [puppet] - https://gerrit.wikimedia.org/r/275439 (owner: Faidon Liambotis)
[12:13:14] (PS2) Dereckson: Document db-codfw readOnlyBySection [mediawiki-config] - https://gerrit.wikimedia.org/r/266534
[12:13:39] !log depool ms-fe1003 for trusty upgrade T125024
[12:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:14:14] (CR) Dereckson: "Rebased. I would have imagined the comment would be redundant." [mediawiki-config] - https://gerrit.wikimedia.org/r/266534 (owner: Dereckson)
[12:21:21] (CR) Alex Monk: [C: -1] "needs moving to thursday per task" [mediawiki-config] - https://gerrit.wikimedia.org/r/275149 (https://phabricator.wikimedia.org/T128847) (owner: Dereckson)
[12:23:08] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Reorganize mw2* servers across clusters - https://phabricator.wikimedia.org/T129062#2094101 (Joe) Updated situation in codfw jobrunners: eqiad - 672 cores, 750 GB RAM, 25 hosts codfw - 720 cores. 1.1 TB RAM, 22 hosts apps...
[12:33:39] (CR) Alexandros Kosiaris: [C: 1] Add ferm rules for kartotherian, tilerator and tileratorui [puppet] - https://gerrit.wikimedia.org/r/274936 (owner: Muehlenhoff)
[12:37:34] (PS3) Dereckson: Ateneo de Manila University workshops throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/275149 (https://phabricator.wikimedia.org/T124284)
[12:38:55] (PS4) Dereckson: Ateneo de Manila University workshops throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/275149 (https://phabricator.wikimedia.org/T128847)
[12:39:34] (CR) Dereckson: "Date updated from 2016-03-08 to 2016-03-10."
[mediawiki-config] - https://gerrit.wikimedia.org/r/275149 (https://phabricator.wikimedia.org/T128847) (owner: Dereckson)
[12:40:05] (CR) Luke081515: [C: 1] Ateneo de Manila University workshops throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/275149 (https://phabricator.wikimedia.org/T128847) (owner: Dereckson)
[12:46:52] (CR) Alexandros Kosiaris: [V: -1] "gbp:error: upstream/3.3.2_r63423 is not a valid treeish" [debs/contenttranslation/lttoolbox] - https://gerrit.wikimedia.org/r/269115 (https://phabricator.wikimedia.org/T124137) (owner: KartikMistry)
[12:47:05] (CR) Alexandros Kosiaris: [V: -1] "gbp:error: upstream/0.3.0_r65318 is not a valid treeish" [debs/contenttranslation/apertium-dan] - https://gerrit.wikimedia.org/r/269912 (https://phabricator.wikimedia.org/T124137) (owner: KartikMistry)
[12:47:15] (CR) Alexandros Kosiaris: [V: -1] "gbp:error: upstream/0.5.1_r65328 is not a valid treeish" [debs/contenttranslation/apertium-nob] - https://gerrit.wikimedia.org/r/269914 (https://phabricator.wikimedia.org/T124317) (owner: KartikMistry)
[12:47:28] (CR) Alexandros Kosiaris: [V: -1] "gbp:error: upstream/1.2.2_r65301 is not a valid treeish" [debs/contenttranslation/apertium-dan-nor] - https://gerrit.wikimedia.org/r/269916 (https://phabricator.wikimedia.org/T124137) (owner: KartikMistry)
[12:47:39] (CR) Alexandros Kosiaris: [V: -1] "gbp:error: upstream/0.5.0_r65328 is not a valid treeish" [debs/contenttranslation/apertium-nno] - https://gerrit.wikimedia.org/r/269915 (https://phabricator.wikimedia.org/T124137) (owner: KartikMistry)
[12:47:56] (CR) Alexandros Kosiaris: [V: -1] "gbp:error: upstream/0.1.1_r129227 is not a valid treeish" [debs/contenttranslation/giella-core] - https://gerrit.wikimedia.org/r/270671 (https://phabricator.wikimedia.org/T120087) (owner: KartikMistry)
[12:49:59] Operations, Analytics-Cluster, hardware-requests: eqiad: New Hive / Oozie server node in
eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2094118 (mark) I approve using one of the old rb servers for this, as soon as available. Let's make sure we have disks for them?
[12:52:56] !log repool ms-fe1003
[12:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:53:43] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#2094160 (Joe)
[12:53:45] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Reorganize mw2* servers across clusters - https://phabricator.wikimedia.org/T129062#2094159 (Joe) Open>Resolved
[12:57:09] (PS1) Giuseppe Lavagetto: realm: add $::master_dc hash [puppet] - https://gerrit.wikimedia.org/r/275443 (https://phabricator.wikimedia.org/T125673)
[13:15:25] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2094197 (mark)
[13:21:03] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2094208 (Gilles) What's the caching strategy for this API? Will it simply redirect/proxy to the canonical t...
[13:21:45] akosiaris: looks like I forgot to push tags; doing it.
[13:23:12] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: puppet fail
[13:32:35] (Abandoned) Hoo man: Add --crit-stopped to check_mariadb.pl [puppet] - https://gerrit.wikimedia.org/r/270584 (owner: Hoo man)
[13:34:18] (CR) Jcrespo: [C: 1] Document db-codfw readOnlyBySection [mediawiki-config] - https://gerrit.wikimedia.org/r/266534 (owner: Dereckson)
[13:38:27] Operations, Traffic: Images not showing up at Commons - https://phabricator.wikimedia.org/T128961#2094218 (BBlack) Note that `cp3043 frontend ... Error: 403, Requested target domain not allowed` is a legitimate error from the correct (upload) cluster defined here: https://github.com/wikimedia/operations-p...
[13:42:07] !log performing schema change on db2038 (s5: T120513), lag on that server expected
[13:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:44:35] (PS1) Jcrespo: Update s5 partitioning according to current row distribution [software] - https://gerrit.wikimedia.org/r/275454 (https://phabricator.wikimedia.org/T120513)
[13:50:22] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[13:55:28] Blocked-on-Operations, RESTBase: Long-term graphite aggregation for restbase.requests.varnish_requests API request metrics not working - https://phabricator.wikimedia.org/T121580#2094246 (MoritzMuehlenhoff) a:fgiunchedi
[13:58:02] PROBLEM - MariaDB Slave IO: s5 on db2038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:58:29] mmm, I had downtimed that
[13:58:43] PROBLEM - MariaDB Slave SQL: s5 on db2038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:59:14] no, this is a related, but different error [13:59:44] which is ok because alerting works as intended [14:15:53] PROBLEM - HHVM rendering on mw2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:53] PROBLEM - HHVM rendering on mw2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:54] PROBLEM - HHVM rendering on mw2094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:54] PROBLEM - HHVM rendering on mw2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:54] PROBLEM - HHVM rendering on mw2165 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:54] PROBLEM - HHVM rendering on mw2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:54] PROBLEM - HHVM rendering on mw2089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:55] PROBLEM - HHVM rendering on mw2136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:55] PROBLEM - HHVM rendering on mw2036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:56] PROBLEM - HHVM rendering on mw2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:56] PROBLEM - HHVM rendering on mw2041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:57] PROBLEM - HHVM rendering on mw2140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:58] PROBLEM - HHVM rendering on mw2068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:58] PROBLEM - HHVM rendering on mw2190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:09] PROBLEM - HHVM rendering on mw2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:10] PROBLEM - HHVM rendering on mw2109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:10] PROBLEM - HHVM rendering on mw2099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:11] PROBLEM - HHVM rendering on mw2045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:11] PROBLEM - HHVM rendering on mw2142 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:12] PROBLEM - HHVM rendering on mw2096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:13] PROBLEM - HHVM rendering on mw2105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:13] PROBLEM - HHVM rendering on mw2122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:13] PROBLEM - HHVM rendering on mw2179 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:14] PROBLEM - HHVM rendering on mw2183 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:15] PROBLEM - HHVM rendering on mw2113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:15] PROBLEM - HHVM rendering on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:15] PROBLEM - HHVM rendering on mw2133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:16] PROBLEM - HHVM rendering on mw2164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:16] PROBLEM - HHVM rendering on mw2204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:17] PROBLEM - HHVM rendering on mw2074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:17] PROBLEM - HHVM rendering on mw2111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:18] PROBLEM - HHVM rendering on mw2052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:33] PROBLEM - HHVM rendering on mw2188 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:33] PROBLEM - HHVM rendering on mw2192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:33] PROBLEM - HHVM rendering on mw2064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:33] PROBLEM - HHVM rendering on mw2059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:33] PROBLEM - HHVM rendering on mw2163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:33] PROBLEM - HHVM rendering on mw2095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[14:16:34] PROBLEM - HHVM rendering on mw2108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:34] PROBLEM - HHVM rendering on mw2103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:35] PROBLEM - HHVM rendering on mw2145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:35] PROBLEM - HHVM rendering on mw2141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:36] PROBLEM - HHVM rendering on mw2038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:36] PROBLEM - HHVM rendering on mw2170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:47] PROBLEM - HHVM rendering on mw2144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:48] PROBLEM - HHVM rendering on mw2115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:49] PROBLEM - HHVM rendering on mw2120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:49] PROBLEM - HHVM rendering on mw2143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:49] PROBLEM - HHVM rendering on mw2102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:50] PROBLEM - HHVM rendering on mw2177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:51] PROBLEM - HHVM rendering on mw2178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:51] PROBLEM - HHVM rendering on mw2180 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:52] PROBLEM - HHVM rendering on mw2196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:52] PROBLEM - HHVM rendering on mw2210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:52] PROBLEM - HHVM rendering on mw2030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:53] PROBLEM - HHVM rendering on mw2072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:53] PROBLEM - HHVM rendering on mw2063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:54] PROBLEM - HHVM rendering on mw2025 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [14:17:11] mmhh codfw only? [14:17:39] (03PS2) 10Jcrespo: Update s5 & s6 partitioning according to current row distribution [software] - 10https://gerrit.wikimedia.org/r/275454 (https://phabricator.wikimedia.org/T120513) [14:19:59] seems so [14:20:28] still investigating why icinga would timeout, connecting from neon to one of the hosts on 5666 works [14:21:19] could it be related to my change? it seems to timeout instead of failing [14:21:55] <_joe_> yes [14:21:59] let me depool a couple of servers just in case [14:22:00] <_joe_> let me check [14:22:48] (03CR) 10Andrew Bogott: "Hotfixed on silver to no ill effect" [puppet] - 10https://gerrit.wikimedia.org/r/275147 (https://phabricator.wikimedia.org/T128747) (owner: 10Krinkle) [14:22:58] <_joe_> godog: uhm this is strange [14:23:01] I usually depool but due to a race condition (slave lag and my schema change) there is a metadata lock [14:23:05] (03PS2) 10Andrew Bogott: wikitech: Remove confusing "Alias /w" that breaks static files [puppet] - 10https://gerrit.wikimedia.org/r/275147 (https://phabricator.wikimedia.org/T128747) (owner: 10Krinkle) [14:23:26] _joe_: yeah discard my comment, I was misreading the config [14:26:33] <_joe_> no I said the situation is a bit strange [14:26:36] (03CR) 10Andrew Bogott: [C: 032] wikitech: Remove confusing "Alias /w" that breaks static files [puppet] - 10https://gerrit.wikimedia.org/r/275147 (https://phabricator.wikimedia.org/T128747) (owner: 10Krinkle) [14:27:21] <_joe_> because the main page of enwiki takes a lot to be loaded on those servers [14:27:28] <_joe_> and that's why they timeout on the check [14:27:38] <_joe_> "wgBackendResponseTime":15114 [14:28:12] <_joe_> so yeah I suppose it's related to db changes [14:30:09] there is some dependency between enwiki and wikidata on the en: main page that I did not expect, then [14:30:38] and I would add-- that is not properly timeout'ed [14:30:42] <_joe_> yeah Special:Blankpage loads much 
faster [14:30:57] <_joe_> what do you mean? [14:32:09] well, I did not depool those in the first place because I did not expect the main page check to fail if wikidata has problems [14:32:29] <_joe_> jynus: any wiki page is excruciatingly slow [14:32:37] <_joe_> I tried via Special:Random [14:35:31] (03PS1) 10Jcrespo: Depool db2038, db2039, db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275471 (https://phabricator.wikimedia.org/T120513) [14:36:14] (03CR) 10Jcrespo: [C: 032] Depool db2038, db2039, db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275471 (https://phabricator.wikimedia.org/T120513) (owner: 10Jcrespo) [14:37:41] well, that is my complaint: I have not touched s1, only s5 [14:37:55] (knowing that there are only proper checks about s1) [14:38:25] and I will check mw1033, it seems dead [14:39:58] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2038, db2039, db2040 (duration: 02m 59s) [14:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:40:13] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0] [14:41:32] PROBLEM - Disk space on ms-be2010 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sda1 is not accessible: Input/output error [14:41:43] PROBLEM - RAID on ms-be2010 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [14:41:57] !log powercycling mw1033 (unresponsive) [14:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:34] ^that is not me, I have not yet hit enter [14:43:53] it is rebooting, will check the log and either sync-common or depool it [14:44:06] (03PS1) 10BBlack: VCL: unproxy and normalize Host-header before anything else [puppet] - 10https://gerrit.wikimedia.org/r/275474 [14:47:15] !log sync-common mw1033 [14:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:08] !log installing squid security 
updates [14:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:53] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [14:50:57] 6Operations, 6Labs: revise/fix labstore replicate backup jobs - https://phabricator.wikimedia.org/T127567#2094318 (10chasemp) I found some old stale snapshots on labstore2001 from failed backups in the past on friday. I things up and the replicate jobs seem to be running fine over the weekend. At the same ti... [14:52:23] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [14:52:27] 6Operations, 10ops-eqiad: mw1026-69 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T129060#2093716 (10ArielGlenn) It looks like the salt keys for these are still around; can I delete them? [14:52:53] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [14:53:23] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: Puppet has 1 failures [14:54:09] !log clean out snapshots from the weekend on labstore1001 as load is running higher than expected [14:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:56:00] (03PS2) 10BBlack: VCL: unproxy and normalize Host-header before anything else [puppet] - 10https://gerrit.wikimedia.org/r/275474 [14:57:09] I was wondering why my fix didn't work [14:57:38] it turns out that you have to actually apply it for it to work [14:57:43] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [14:58:05] "mw1033.eqiad.wmnet returned [255]: Host key verification failed" is there a reason for that? 
[14:58:13] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [14:58:44] I will depool it for the time being [14:58:49] 6Operations, 6Language-Engineering, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare cxserver/zotero for the codfw switchover - https://phabricator.wikimedia.org/T125065#2094323 (10Joe) a:3Joe [14:58:52] RECOVERY - HHVM rendering on mw2024 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.289 second response time [14:58:53] RECOVERY - HHVM rendering on mw2019 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.304 second response time [14:59:02] RECOVERY - HHVM rendering on mw2175 is OK: HTTP OK: HTTP/1.1 200 OK - 72910 bytes in 0.302 second response time [14:59:02] RECOVERY - HHVM rendering on mw2190 is OK: HTTP OK: HTTP/1.1 200 OK - 72910 bytes in 0.299 second response time [14:59:02] RECOVERY - HHVM rendering on mw2036 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.293 second response time [14:59:02] RECOVERY - HHVM rendering on mw2165 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.293 second response time [14:59:02] RECOVERY - HHVM rendering on mw2022 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.288 second response time [14:59:26] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2038, db2039, db2040 (duration: 02m 29s) [14:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:16] RECOVERY - HHVM rendering on mw2037 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.297 second response time [15:00:16] RECOVERY - HHVM rendering on mw2058 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.290 second response time [15:00:16] RECOVERY - HHVM rendering on mw2119 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.301 second response time [15:00:16] RECOVERY - HHVM rendering on mw2139 is OK: HTTP OK: HTTP/1.1 200 OK - 72910 bytes in 0.357 second response time [15:00:24] RECOVERY - HHVM rendering on mw2199 is OK: HTTP OK: 
HTTP/1.1 200 OK - 72908 bytes in 0.249 second response time [15:00:24] RECOVERY - HHVM rendering on mw2138 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.258 second response time [15:00:24] RECOVERY - HHVM rendering on mw2128 is OK: HTTP OK: HTTP/1.1 200 OK - 72910 bytes in 0.287 second response time [15:00:24] RECOVERY - HHVM rendering on mw2135 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.295 second response time [15:00:24] RECOVERY - HHVM rendering on mw2200 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.303 second response time [15:00:24] RECOVERY - HHVM rendering on mw2035 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.248 second response time [15:00:24] RECOVERY - HHVM rendering on mw2193 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.252 second response time [15:00:25] RECOVERY - HHVM rendering on mw2039 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.291 second response time [15:00:25] RECOVERY - HHVM rendering on mw2148 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.302 second response time [15:00:26] RECOVERY - HHVM rendering on mw2017 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.292 second response time [15:00:26] RECOVERY - HHVM rendering on mw2185 is OK: HTTP OK: HTTP/1.1 200 OK - 72909 bytes in 0.262 second response time [15:00:27] RECOVERY - HHVM rendering on mw2198 is OK: HTTP OK: HTTP/1.1 200 OK - 72908 bytes in 0.292 second response time [15:01:52] (03PS3) 10Muehlenhoff: Add ferm rules for kartotherian, tilerator and tileratorui [puppet] - 10https://gerrit.wikimedia.org/r/274936 [15:02:10] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for kartotherian, tilerator and tileratorui [puppet] - 10https://gerrit.wikimedia.org/r/274936 (owner: 10Muehlenhoff) [15:02:15] !log depooling mw1033 [15:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:01] moritzm, hope you won't break it ;) [15:04:55] yurik: it's not enabled, currently only adding the rules. 
I keep you in the loop o [15:05:06] yurik: it's not enabled, currently only adding the rules. I'll keep you in the loop when it's enabled [15:05:32] moritzm, cool, we plan to enable maps on wikivoyage today, so i don't want it to break the very first day. Second day is ok [15:06:36] yurik: don't worry, later this week probably. I [15:06:44] yurik: don't worry, later this week probably. I'll ping you before enabling it [15:09:42] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [15:10:45] (03PS3) 10BBlack: VCL: unproxy and normalize Host-header before anything else [puppet] - 10https://gerrit.wikimedia.org/r/275474 [15:12:32] 6Operations, 10scap: mw1033 crashed, check it is healthy - https://phabricator.wikimedia.org/T129083#2094432 (10jcrespo) [15:13:52] yurik: we do? [15:14:15] <_joe_> jynus: mw1033? [15:14:23] <_joe_> isn't it turned off? [15:14:37] I just rebooted it [15:14:43] <_joe_> why? [15:14:47] bblack, i emailed about it last week. Don't worry - it will have no change in traffic - even if community decides to switch to it right away, they will simply swap out the wmflabs code with kartographer code, but tiles are the same [15:14:59] It failed on sync [15:15:05] I know, I seem to remember there was some conflicting phab traffic though [15:15:11] and found no ticket about it [15:15:13] <_joe_> jynus: it should be removed from the scap config [15:15:23] (03CR) 10Ema: [C: 031] VCL: unproxy and normalize Host-header before anything else [puppet] - 10https://gerrit.wikimedia.org/r/275474 (owner: 10BBlack) [15:15:25] bblack, it was mostly due to what features should be released/not released. All settled. 
[15:15:37] <_joe_> jynus: the ticket is https://phabricator.wikimedia.org/T129060 [15:16:02] <_joe_> :) [15:16:22] <_joe_> jynus: no idea why it's still in mediawiki-installation, let me verify and fix that [15:16:32] well, the person that shut it down [15:16:39] should have removed it, so I did [15:16:55] yurik: https://phabricator.wikimedia.org/T127136 ? [15:17:12] <_joe_> jynus: I removed all the entries this morning in https://gerrit.wikimedia.org/r/#/c/275383/ [15:17:15] bblack, yes [15:17:17] that just doesn't sound like a resolved conversation yet in the latter half [15:17:34] _joe_, probably is special because it is a proxy? [15:17:44] <_joe_> oh yeah... sigh [15:17:47] bblack, i spoke with greg-g and our team on IRC afterwards. [15:17:48] <_joe_> I forgot about that [15:18:18] I am not "attacking" you, I am defending myself on what I just did [15:18:25] it was pooled, in any case [15:18:30] <_joe_> I wasn't attacking you [15:18:31] <_joe_> :) [15:18:54] according to conftool [15:19:16] <_joe_> how did you find that out? [15:19:32] <_joe_> I sense a bug in conftool [15:19:32] I think there is a discrepancy between palladium copy and conftool reality [15:19:38] yes [15:19:47] it has been removed on the real conftool [15:19:47] <_joe_> oh you mean the copy that is served via web [15:19:47] what's "palladium copy" in this case? 
[15:19:54] ok [15:19:56] <_joe_> that's the horrible confd hack I created [15:20:03] but it has not been reflected on palladium [15:20:18] hence all the confusion [15:20:24] <_joe_> that's the reason why we didn't use confd for pybal :) [15:20:35] <_joe_> yes, btw let me correct that [15:20:37] we're still using it heavily for varnish though [15:21:01] <_joe_> bblack: for varnish where you don't flood it with 50 changes in a minute, it works decently [15:21:06] (03PS4) 10BBlack: VCL: unproxy and normalize Host-header before anything else [puppet] - 10https://gerrit.wikimedia.org/r/275474 [15:21:18] <_joe_> it's the rate of change for large pools that broke it down, sort of [15:21:32] <_joe_> and why I still want to write a "saner confd" [15:21:34] (03CR) 10BBlack: [C: 032 V: 032] VCL: unproxy and normalize Host-header before anything else [puppet] - 10https://gerrit.wikimedia.org/r/275474 (owner: 10BBlack) [15:21:54] <_joe_> is bast1001 being reimaged at the moment? [15:21:59] <_joe_> I can't seem to reach it [15:22:43] no, I can log in [15:22:50] and using it as bastion [15:23:21] <_joe_> I can't actually get into any of production atm [15:23:42] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:23:48] probably you are trying to proxy on iron? 
[15:24:02] with some of the 200 rules you have on your config [15:24:12] (03PS2) 10BBlack: caches: remove backend_scaled_weights [puppet] - 10https://gerrit.wikimedia.org/r/275115 (https://phabricator.wikimedia.org/T125485) [15:24:14] (03PS2) 10BBlack: wikimedia-common VCL: remove static backend weighting [puppet] - 10https://gerrit.wikimedia.org/r/275116 (https://phabricator.wikimedia.org/T127484) [15:24:16] (03PS7) 10BBlack: varnish: get rid of backend_options [puppet] - 10https://gerrit.wikimedia.org/r/275117 (https://phabricator.wikimedia.org/T127484) [15:24:18] (03PS7) 10BBlack: varnish: allow director backends to be single-value again [puppet] - 10https://gerrit.wikimedia.org/r/275118 (https://phabricator.wikimedia.org/T127484) [15:24:20] (03PS7) 10BBlack: r::c::config: remove has_ganglia [puppet] - 10https://gerrit.wikimedia.org/r/275119 (https://phabricator.wikimedia.org/T127484) [15:24:22] (03PS7) 10BBlack: r::c::config: remove parsoid (unused) [puppet] - 10https://gerrit.wikimedia.org/r/275121 (https://phabricator.wikimedia.org/T127484) [15:24:24] (03PS7) 10BBlack: r::c::config: remove lvs::configuration include [puppet] - 10https://gerrit.wikimedia.org/r/275120 (https://phabricator.wikimedia.org/T127484) [15:24:26] (03PS8) 10BBlack: r::c::config: move to hieradata [puppet] - 10https://gerrit.wikimedia.org/r/275123 (https://phabricator.wikimedia.org/T127484) [15:24:28] (03PS7) 10BBlack: r::c::config: add restbase @ codfw [puppet] - 10https://gerrit.wikimedia.org/r/275122 (https://phabricator.wikimedia.org/T127484) [15:24:30] (03PS8) 10BBlack: varnishes: control applayer DC routing from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/275124 (https://phabricator.wikimedia.org/T127484) [15:24:32] (03PS1) 10BBlack: WIP: first attempt at cache_app_route() w/ split [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) [15:28:32] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures 
[15:28:33] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures [15:28:35] (03CR) 10BBlack: [C: 032] caches: remove backend_scaled_weights [puppet] - 10https://gerrit.wikimedia.org/r/275115 (https://phabricator.wikimedia.org/T125485) (owner: 10BBlack) [15:28:43] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures [15:28:43] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [15:29:02] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 1 failures [15:29:16] (03CR) 10BBlack: [C: 032 V: 032] wikimedia-common VCL: remove static backend weighting [puppet] - 10https://gerrit.wikimedia.org/r/275116 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [15:29:34] heh, those latest two aren't puppet-merged [15:29:43] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 1 failures [15:29:45] something's wrong with the host normalization patch I think [15:30:02] can I shutdown mw1033, then? [15:30:13] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [15:30:14] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [15:30:14] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures [15:30:14] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures [15:30:23] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [15:30:52] <_joe_> jynus: I'll do it, give me 5 mins [15:30:53] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [15:31:02] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [15:31:14] <_joe_> bblack: ? [15:31:31] bblack: perhaps s/'/"/g ? 
[15:32:31] (03PS1) 10BBlack: VCL bugfix: quote chars in 8d95a5efe [puppet] - 10https://gerrit.wikimedia.org/r/275503 [15:33:03] _joe_: it's just a VCL syntax error (which is only critical in the sense of "puppet's broke", it doesn't break the caches if they fail to load new VCL) [15:33:33] if only we had VTC tests via jenkins :) [15:33:41] 6Operations: upgrade 15+4 swift servers from precise to trusty - https://phabricator.wikimedia.org/T125024#2094542 (10fgiunchedi) 5Open>3Resolved finished upgrading all `ms-fe` hosts today, resolving [15:33:43] 6Operations, 13Patch-For-Review, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2094544 (10fgiunchedi) [15:33:49] (03PS2) 10BBlack: VCL bugfix: quote chars in 8d95a5efe [puppet] - 10https://gerrit.wikimedia.org/r/275503 [15:34:04] (03CR) 10BBlack: [C: 032 V: 032] VCL bugfix: quote chars in 8d95a5efe [puppet] - 10https://gerrit.wikimedia.org/r/275503 (owner: 10BBlack) [15:34:06] 6Operations, 13Patch-For-Review, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931694 (10fgiunchedi) [15:34:17] bblack: that would be great! 
[15:34:24] bblack: CI has some troubles catching up with changes spam right now, but will get fixed tonight with a new version of Nodepool :) [15:35:18] hashar: I already compiler-checked that long series back on Friday anyways [15:36:11] (03PS2) 10BBlack: WIP: first attempt at cache_app_route() w/ split [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) [15:36:13] (03PS8) 10BBlack: varnish: get rid of backend_options [puppet] - 10https://gerrit.wikimedia.org/r/275117 (https://phabricator.wikimedia.org/T127484) [15:36:15] (03PS8) 10BBlack: varnish: allow director backends to be single-value again [puppet] - 10https://gerrit.wikimedia.org/r/275118 (https://phabricator.wikimedia.org/T127484) [15:36:17] (03PS8) 10BBlack: r::c::config: remove has_ganglia [puppet] - 10https://gerrit.wikimedia.org/r/275119 (https://phabricator.wikimedia.org/T127484) [15:36:19] (03PS8) 10BBlack: r::c::config: remove parsoid (unused) [puppet] - 10https://gerrit.wikimedia.org/r/275121 (https://phabricator.wikimedia.org/T127484) [15:36:21] (03PS8) 10BBlack: r::c::config: remove lvs::configuration include [puppet] - 10https://gerrit.wikimedia.org/r/275120 (https://phabricator.wikimedia.org/T127484) [15:36:23] (03PS9) 10BBlack: r::c::config: move to hieradata [puppet] - 10https://gerrit.wikimedia.org/r/275123 (https://phabricator.wikimedia.org/T127484) [15:36:25] (03PS8) 10BBlack: r::c::config: add restbase @ codfw [puppet] - 10https://gerrit.wikimedia.org/r/275122 (https://phabricator.wikimedia.org/T127484) [15:36:27] (03PS9) 10BBlack: varnishes: control applayer DC routing from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/275124 (https://phabricator.wikimedia.org/T127484) [15:36:55] (03CR) 10BBlack: [C: 032 V: 032] varnish: get rid of backend_options [puppet] - 10https://gerrit.wikimedia.org/r/275117 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [15:37:12] (03CR) 10BBlack: [C: 032 V: 032] varnish: allow director backends to 
be single-value again [puppet] - 10https://gerrit.wikimedia.org/r/275118 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [15:37:27] (03CR) 10BBlack: [C: 032 V: 032] r::c::config: remove has_ganglia [puppet] - 10https://gerrit.wikimedia.org/r/275119 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [15:37:42] (03CR) 10BBlack: [C: 032 V: 032] r::c::config: remove lvs::configuration include [puppet] - 10https://gerrit.wikimedia.org/r/275120 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [15:37:56] (03CR) 10jenkins-bot: [V: 04-1] r::c::config: add restbase @ codfw [puppet] - 10https://gerrit.wikimedia.org/r/275122 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [15:37:58] (03CR) 10BBlack: [C: 032 V: 032] r::c::config: remove parsoid (unused) [puppet] - 10https://gerrit.wikimedia.org/r/275121 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [15:38:00] (03CR) 10jenkins-bot: [V: 04-1] varnishes: control applayer DC routing from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/275124 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [15:38:03] (03CR) 10jenkins-bot: [V: 04-1] WIP: first attempt at cache_app_route() w/ split [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [15:38:12] bohh [15:38:31] 6Operations, 6Services, 10hardware-requests: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2094553 (10akosiaris) >>! In T128475#2089514, @RobH wrote: >>>! In T128475#2084761, @faidon wrote: >> Same as the terbium replacement concern here: these boxes aren't very specia... 
[15:38:32] rebase race [15:38:41] the -1's are, confusingly, from older PS than current due to merge fail, yes [15:39:01] anyways, this series is already tested and works right [15:39:14] (03CR) 10BBlack: [C: 032 V: 032] r::c::config: add restbase @ codfw [puppet] - 10https://gerrit.wikimedia.org/r/275122 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [15:39:37] (03CR) 10BBlack: [C: 032 V: 032] r::c::config: move to hieradata [puppet] - 10https://gerrit.wikimedia.org/r/275123 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [15:39:55] (03CR) 10BBlack: [C: 032 V: 032] varnishes: control applayer DC routing from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/275124 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [15:41:04] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [15:42:04] (03PS1) 10Mforns: Increase log verbosity on reportupdater cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/275508 (https://phabricator.wikimedia.org/T126058) [15:42:19] 6Operations, 10scap: mw1033 crashed, check it is healthy - https://phabricator.wikimedia.org/T129083#2094560 (10jcrespo) [15:42:21] 6Operations, 10ops-eqiad: mw1026-69 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T129060#2094561 (10jcrespo) [15:42:31] (03CR) 10Jdlrobson: [C: 04-1] "On hold pending discussion. 
Please do not swat Monday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274470 (https://phabricator.wikimedia.org/T126802) (owner: 10Jdlrobson) [15:42:46] (03CR) 10Jdlrobson: [C: 04-1] Enable reference storage on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275058 (owner: 10Jdlrobson) [15:44:03] 6Operations: OCG needs to migrate away from rdb1002 and get its own Redis instance - https://phabricator.wikimedia.org/T128491#2094564 (10elukey) Read some documentation: https://wikitech.wikimedia.org/wiki/OCG#Monitoring The Job Queue related to rendering is zero, meanwhile the job status kept for caching is... [15:44:46] (03PS1) 10BBlack: wikimedia-common VCL: add back truly-static backend weighting [puppet] - 10https://gerrit.wikimedia.org/r/275509 (https://phabricator.wikimedia.org/T127484) [15:44:48] (03PS2) 10Mforns: Increase log verbosity on reportupdater cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/275508 (https://phabricator.wikimedia.org/T126058) [15:45:22] (03PS3) 10BBlack: WIP: first attempt at cache_app_route() w/ split [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) [15:45:24] (03PS2) 10BBlack: wikimedia-common VCL: add back truly-static backend weighting [puppet] - 10https://gerrit.wikimedia.org/r/275509 (https://phabricator.wikimedia.org/T127484) [15:46:12] (03CR) 10BBlack: [C: 032 V: 032] wikimedia-common VCL: add back truly-static backend weighting [puppet] - 10https://gerrit.wikimedia.org/r/275509 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [15:46:32] 6Operations, 6Services, 10hardware-requests: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2094567 (10RobH) @akosiaris: The 32GB being ok is directly in contention with @mobrovac's earlier revised statement that these new systems need to match the SCA configuration. (... 
[15:48:03] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:51:15] !deploy [15:51:26] no, it wasn't that [15:51:41] 6Operations, 6Services, 10hardware-requests: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2094570 (10akosiaris) >>! In T128475#2094567, @RobH wrote: > @akosiaris: The 32GB being ok is directly in contention with @mobrovac's earlier revised statement that these new sys... [15:52:24] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 1 failures [15:52:32] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [15:53:13] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures [15:53:56] looking at Ganglia for no good reason, I see a large increase in traffic this night (http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=LVS%20loadbalancers%20codfw&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1457363422&g=network_report&z=large) [15:54:13] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:54:13] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:54:15] ^ these showing up now are just the usual "when re-enabling and re-running successfully, icinga first reports another puppet failure" [15:54:22] Just so I can learn something more today, does anyone have a pointer to what happened? [15:54:35] (03PS1) 10Elukey: Move the OCG Redis Job queue away from rdb1002 to rdb1007 for maintenance.
[puppet] - 10https://gerrit.wikimedia.org/r/275510 (https://phabricator.wikimedia.org/T128491) [15:54:54] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:56:13] gehel: if you look at lvs200[123], that jump only shows on 2002 [15:56:14] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 1 failures [15:56:19] (03PS2) 10Elukey: Move the OCG Redis Job queue away from rdb1002 to rdb1007 for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/275510 (https://phabricator.wikimedia.org/T128491) [15:56:24] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [15:56:33] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures [15:56:56] gehel: following from there, modules/role/manifests/lvs/balancer.pp says 2002 is for upload, misc_web, and dns_rec [15:57:12] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:57:17] 6Operations, 13Patch-For-Review: OCG needs to migrate away from rdb1002 and get its own Redis instance - https://phabricator.wikimedia.org/T128491#2094588 (10elukey) Monitoring; https://grafana.wikimedia.org/dashboard/db/redis-jobqueue-elukey https://logstash.wikimedia.org/#/dashboard/elasticsearch/OCG%20Back... [15:57:22] so that goes with the tuning you are doing on upload cluster ? 
[15:57:52] gehel: I don't think so [15:58:03] in any case, I wouldn't have been deploying anything in that timeframe [15:58:03] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:58:12] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:14] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:00:04] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160307T1600). [16:00:04] ebernhardson jdlrobson Dereckson kart_: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:22] Hi. [16:00:33] gehel: that's likely swiftrepl replicating from eqiad to codfw, there's a similar spike in 'swift codfw' cluster [16:00:43] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 1 failures [16:00:51] (03PS1) 10Andrew Bogott: Keystone policy: restrict get_project to admins. [puppet] - 10https://gerrit.wikimedia.org/r/275512 [16:01:26] I can SWAT this morning. Dereckson hiya. [16:01:53] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:01:55] godog, bblack: sorry for asking stupid questions, but that seems like a good way to learn about how all this works... [16:01:59] Good morning thcipriani. 
[16:02:33] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:02:54] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275155 (https://phabricator.wikimedia.org/T128354) (owner: 10Dereckson) [16:03:04] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [16:03:17] gehel: it's a good question, I'm just multitasking a lot [16:03:17] (03PS1) 10Jcrespo: Disable mw1033 as a scap proxy (it has been decommisioned) [puppet] - 10https://gerrit.wikimedia.org/r/275514 (https://phabricator.wikimedia.org/T129060) [16:03:33] gehel: my next step would be compare lvs2002 to lvs1002 and see if there was a similar spike in eqiad or not [16:03:33] _joe_, I am trying to help^ [16:03:34] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:03:43] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:03:43] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:04:36] * kart_ available [16:04:39] it is strange to have this on hiera but not the list of hosts [16:04:53] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:05:14] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Puppet has 1 failures [16:05:22] o/ akosiaris [16:05:24] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [16:05:25] (03Merged) 10jenkins-bot: Namespace configuration on wuu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275155 (https://phabricator.wikimedia.org/T128354) (owner: 10Dereckson) [16:05:27] Back in the office? 
[16:05:33] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [16:05:42] "the office" == back from vacation [16:05:46] gehel: that's fine! no worries at all, keep asking questions! [16:06:08] (03CR) 10Andrew Bogott: [C: 032] Keystone policy: restrict get_project to admins. [puppet] - 10https://gerrit.wikimedia.org/r/275512 (owner: 10Andrew Bogott) [16:06:11] (03PS1) 10Giuseppe Lavagetto: scap: remove row A7 proxy [puppet] - 10https://gerrit.wikimedia.org/r/275517 [16:06:18] jynus: _joe_ is it fine to be swatting right now? (saw your scap proxies patch) [16:06:21] <_joe_> jynus: eh sorry [16:06:25] <_joe_> thcipriani: wait a sec [16:06:29] ack. [16:06:39] abandoning mine [16:06:45] <_joe_> jynus: +1 to your patch [16:07:00] or abandon yours, you deploy! [16:07:03] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:07:13] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:07:13] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:07:15] (03PS2) 10BBlack: Drop full stop from 403 error message [puppet] - 10https://gerrit.wikimedia.org/r/274978 (owner: 10Ema) [16:07:20] (I am busy with something else) [16:07:29] (03CR) 10BBlack: [C: 031] Drop full stop from 403 error message [puppet] - 10https://gerrit.wikimedia.org/r/274978 (owner: 10Ema) [16:07:43] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [16:07:51] (03PS2) 10Giuseppe Lavagetto: Disable mw1033 as a scap proxy (it has been decommisioned) [puppet] - 10https://gerrit.wikimedia.org/r/275514 (https://phabricator.wikimedia.org/T129060) (owner: 10Jcrespo) [16:08:41] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "all row A7 appservers have been decommissioned" [puppet] - 10https://gerrit.wikimedia.org/r/275514 (https://phabricator.wikimedia.org/T129060)
(owner: 10Jcrespo) [16:09:13] _joe_: can I kill the salt keys for https://phabricator.wikimedia.org/T129060 ? [16:09:32] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:09:35] the app servers mw1026-69 [16:09:45] (03PS3) 10Ema: Drop full stop from 403 error message [puppet] - 10https://gerrit.wikimedia.org/r/274978 [16:09:57] ashley, wait until mw1033 is down again [16:09:58] (03CR) 10Ema: [C: 032 V: 032] Drop full stop from 403 error message [puppet] - 10https://gerrit.wikimedia.org/r/274978 (owner: 10Ema) [16:10:16] ashley, sorry, I meant apergos [16:10:36] can I kill the rest of em? [16:10:54] <_joe_> !log shutting down mw1033 [16:10:56] (03PS3) 10MarcoAurelio: Modify throttle settings for frwiki and cawiki due to Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) [16:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:03] well ok. all of them then :-P [16:11:15] <_joe_> thcipriani: do you have a list of scap proxies within scap? [16:11:33] <_joe_> thcipriani: let me rephrase, where do the scap clients read the list of scap proxies from? [16:11:48] the old dsh group files [16:11:49] _joe_: nope. We use the /etc/dsh/group files [16:12:19] <_joe_> oh locally on every machine? [16:12:21] <_joe_> ugh [16:12:27] no, only on the deploy server [16:12:31] yarp. [16:12:32] _joe_ whenever you have time - https://gerrit.wikimedia.org/r/#/c/275510/ [16:12:35] <_joe_> ok cool [16:12:36] <_joe_> so it [16:12:42] the list gets sent to the end node with the pull command [16:12:44] <_joe_> *it's ok for you to deploy now [16:12:52] /etc/dsh/group/scap-proxies if I read puppet correctly [16:12:52] <_joe_> bd808: thanks for that :) [16:12:52] _joe_: cool, thanks! 
[16:12:59] (03CR) 10MarcoAurelio: Modify throttle settings for frwiki and cawiki due to Workshop (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) (owner: 10MarcoAurelio) [16:13:05] halfak: yup [16:13:09] (03PS4) 10MarcoAurelio: Modify throttle settings for frwiki and cawiki due to Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) [16:13:25] and more or less up to speed, so picking up my TODOs finally [16:13:32] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275149 (https://phabricator.wikimedia.org/T128847) (owner: 10Dereckson) [16:13:38] 6Operations: reinstall bast2001 with jessie - https://phabricator.wikimedia.org/T128899#2094615 (10Dzahn) uhmm, ok. i saw no issue during the install, didn't partition or setup anything manual, no changes to partman, just jessie installer and reboot [16:14:00] (03CR) 10MarcoAurelio: [C: 031] Modify throttle settings for frwiki and cawiki due to Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) (owner: 10MarcoAurelio) [16:15:23] (03CR) 10Giuseppe Lavagetto: [C: 031] Move the OCG Redis Job queue away from rdb1002 to rdb1007 for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/275510 (https://phabricator.wikimedia.org/T128491) (owner: 10Elukey) [16:15:32] Thcipriani: FYI, I added one patch for last minute SWAT, I hope this is ok? 
[16:15:56] <_joe_> elukey: you might need to restart ocg once you're done, and I don't exclude you'd need to create the keys on redis by hand [16:16:09] ( It's https://gerrit.wikimedia.org/r/#/c/270897/) [16:16:17] (03Merged) 10jenkins-bot: Ateneo de Manila University workshops throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275149 (https://phabricator.wikimedia.org/T128847) (owner: 10Dereckson) [16:16:25] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2094625 (10Milimetric) I think 3 servers would be better than 2. We want these machines to have as much memory as possible, and 3 servers gives us the option to grow in memory if this cluster b... [16:16:37] Luke081515: we'll see what we can get through, but hopefully that'll be fine :) [16:17:04] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespace configuration on wuu.wikipedia [[gerrit:275155]] (duration: 02m 26s) [16:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:14] ^ Dereckson check please [16:17:55] _joe_: got a mw2212.codfw.wmnet port 22: Connection timed out is that host still supposed to be in the apaches list? [16:17:58] thanks [16:18:10] Testing. [16:18:17] <_joe_> thcipriani: yeah it's just down temporarily [16:18:49] 275155 ok [16:18:58] _joe_: okie doke. makes syncs slow waiting for timeout. [16:19:02] Dereckson: thank you [16:19:07] <_joe_> thcipriani: ok let me check it [16:19:15] thanks [16:20:08] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2094648 (10faidon) >>! In T124444#2087323, @EBernhardson wrote: > For configuring CirrusSearch to use https connections and utilize a speci... 
[16:21:01] 6Operations, 10Traffic: Images not showing up at Commons - https://phabricator.wikimedia.org/T128961#2094653 (10BBlack) We've fixed a some subtle issues on our end with the de-coding of proxy-style URIs which included a port number, meaning the client request from our perspective was of the form `GET https://u... [16:21:36] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275154 (https://phabricator.wikimedia.org/T127654) (owner: 10Dereckson) [16:22:48] (03CR) 10Luke081515: [C: 04-1] "Maybe I'm a bit critically today, but since the day where this is needed is not tomorrow, I think it is ok ;)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) (owner: 10MarcoAurelio) [16:22:48] _joe_ ah you mean ocg_job_status right? (Just ran keys *ocg* on rdb1002) [16:23:22] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2094663 (10EBernhardson) >>! In T124444#2094648, @faidon wrote: >>>! In T124444#2087323, @EBernhardson wrote: >> For configuring CirrusSear... [16:23:25] 6Operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: switch upload varnish backends to codfw ahead of full switch - https://phabricator.wikimedia.org/T129089#2094664 (10fgiunchedi) [16:23:51] <_joe_> elukey: yes [16:24:18] (03CR) 10MarcoAurelio: "Do we have to add this as well on $wgContentNamespaces?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/272479 (https://phabricator.wikimedia.org/T127688) (owner: 10MarcoAurelio) [16:24:33] 6Operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: switch upload varnish backends to codfw ahead of full switch - https://phabricator.wikimedia.org/T129089#2094664 (10fgiunchedi) p:5Triage>3High [16:25:04] (03Merged) 10jenkins-bot: Namespace configuration on he.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275154 (https://phabricator.wikimedia.org/T127654) (owner: 10Dereckson) [16:25:31] 6Operations, 13Patch-For-Review: OCG needs to migrate away from rdb1002 and get its own Redis instance - https://phabricator.wikimedia.org/T128491#2094689 (10elukey) Useful info: ``` elukey@rdb1002:~$ redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)" 127.0.0.1:6379> GET ocg_render... [16:27:04] _joe_ super weird that ocg_render_job_queue is not a key in each of the Redis instances [16:27:16] annnnyyyyhooow [16:27:47] <_joe_> each or any? [16:28:46] (03CR) 10Dereckson: [C: 04-1] "Use an array with the 3 IPs and not all the range." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) (owner: 10MarcoAurelio) [16:28:53] (03CR) 10Physikerwelt: [C: 031] "@hashar: what happened?" [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [16:28:55] (03PS3) 10Elukey: Move the OCG Redis Job queue away from rdb1002 to rdb1007 for maintenance. 
[puppet] - 10https://gerrit.wikimedia.org/r/275510 (https://phabricator.wikimedia.org/T128491) [16:29:56] (03CR) 10Physikerwelt: "from IRC" [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [16:30:07] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Ateneo de Manila University workshops throttle rule [[gerrit:275149]] and Namespace configuration on he.wikivoyage [[gerrit:275154]] (duration: 02m 30s) [16:30:10] ^ Dereckson check please [16:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:33] We can't really test a throttle rule. [16:30:42] Testing 275154. [16:31:08] 6Operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Setup LVS for parsoid in codfw - https://phabricator.wikimedia.org/T129090#2094708 (10Joe) [16:31:15] 6Operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Setup LVS for parsoid in codfw - https://phabricator.wikimedia.org/T129090#2094721 (10Joe) p:5Triage>3High [16:31:22] <_joe_> fun times ^^ [16:31:53] (03CR) 10MarcoAurelio: "Okay, will address all those issues. Now I have a merge conflict ;-) Best regards." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) (owner: 10MarcoAurelio) [16:31:53] <_joe_> !log powercycling mw2212 [16:31:55] (03CR) 10Elukey: [C: 032] Move the OCG Redis Job queue away from rdb1002 to rdb1007 for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/275510 (https://phabricator.wikimedia.org/T128491) (owner: 10Elukey) [16:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:59] 6Operations, 10Wikimedia-General-or-Unknown, 10vm-requests, 13Patch-For-Review: 2 Ganeti VMs for X-Wikimedia-Debug proxy - https://phabricator.wikimedia.org/T129003#2094723 (10ori) [16:32:11] thcipriani: 275154 seems good. 
Maybe we should run namespaceDupes to check if they haven't already created pages for this namespace. [16:33:29] !log moved OCG Redis Job Queue from rdb1002 to rdb1007 for maintenance. [16:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:34:11] Dereckson: namespacedupes has been run, now. Thanks! [16:34:35] Thanks for the deploy. [16:34:47] kart_: lemme know when you have pushed all the tags for me to re-review [16:36:41] yurik: ping re https://phabricator.wikimedia.org/T125686#2094732 [16:36:52] yurik: also, when are you going on vacation and when do you return? [16:37:38] yurik: also also, I'm having trusted advisors tell me this is too rushed and not ready and a bad move, you are the only one who is arguing for it and, bluntly, you have a bad track record of pushing stuff out too early. [16:38:02] thcipriani: did you sync my patch yet? I saw, it is merged? [16:38:04] !log thcipriani@tin Synchronized php-1.27.0-wmf.15/extensions/Translate/tag/PageTranslationHooks.php: SWAT: Fix regression in marking page for translation [[gerrit:275366]] (duration: 02m 20s) [16:38:05] ^ kart_ check please [16:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:38:11] oh in time. [16:38:18] :D [16:38:55] (03PS5) 10MarcoAurelio: Modify throttle settings for frwiki and cawiki due to Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) [16:38:58] ebernhardson: jdlrobson ping for SWAT if you're around. [16:39:24] (03CR) 10jenkins-bot: [V: 04-1] Modify throttle settings for frwiki and cawiki due to Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) (owner: 10MarcoAurelio) [16:39:41] (03CR) 10MarcoAurelio: "Manual rebase follows." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) (owner: 10MarcoAurelio) [16:39:58] thcipriani: \o [16:40:09] (03CR) 10Luke081515: "But patch is ok now, so only the rebase is still needed ;)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) (owner: 10MarcoAurelio) [16:40:23] _joe_ I forced puppet on ocg1001 and in the logs I can see "key not found in Redis" but also "Fetched keys from redis" too [16:40:41] so I think that after a transitioning period it should be fine [16:40:55] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, 7HTTPS: Make Magnus tools on tools.wmflabs.org work in HTTPS - https://phabricator.wikimedia.org/T102457#2094745 (10Magnus) Just FYI, treeviews is now completely https (the last remaining http element was the tree image...) https://tools.wmflabs.org/glamtools/t... [16:40:57] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268029 (https://phabricator.wikimedia.org/T125547) (owner: 10EBernhardson) [16:40:58] ebernhardson: howdy. [16:41:55] (03Merged) 10jenkins-bot: Create pool counter for CirrusSearch completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268029 (https://phabricator.wikimedia.org/T125547) (owner: 10EBernhardson) [16:42:31] thcipriani: thanks! [16:42:35] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2073537 (10Magnus) @Dzahn: Does "using http" mean "available over http", or "do not work on https"? [16:42:35] <_joe_> elukey: do you see the keys generated in redis? [16:42:44] Hi could someone turn https on http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page please. Since even though its a test website its always good to have security. [16:42:59] <_joe_> elukey: also, can you now create a pdf of a page on wiki? 
[16:43:15] <_joe_> sorry I'm still having connection issues [16:43:16] paladox: I could be wrong, but I don't believe we own a * cert for wmflabs.org [16:44:14] ebernhardson: I think we have at least a cert for somethings. For example all my instances have https, and they are at *.wmflabs.org [16:44:26] ebernhardson: Yes, there is a cert for wmflabs.org, see https://tools.wmflabs.org/ please. [16:44:28] *a cert for this domain [16:44:50] _joe_ I am checking now on rdb1007 [16:45:04] (03PS6) 10MarcoAurelio: Modify throttle settings for frwiki and cawiki due to Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) [16:45:10] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274834 (https://phabricator.wikimedia.org/T128761) (owner: 10EBernhardson) [16:45:44] (03Merged) 10jenkins-bot: Update CirrusSearch PoolCounter for cross-dc search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274834 (https://phabricator.wikimedia.org/T128761) (owner: 10EBernhardson) [16:46:03] (03CR) 10Luke081515: [C: 031] Modify throttle settings for frwiki and cawiki due to Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) (owner: 10MarcoAurelio) [16:46:13] Luke081515: Hi, would you know why the certificates won't work on a third domain but work on something like tools.wmflabs.org?
[16:46:17] (03PS7) 10MarcoAurelio: Modify throttle settings for frwiki and cawiki due to Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) [16:46:21] 6Operations, 10Wikimedia-General-or-Unknown, 10vm-requests, 13Patch-For-Review: 2 Ganeti VMs for X-Wikimedia-Debug proxy - https://phabricator.wikimedia.org/T129003#2094782 (10ori) a:3akosiaris [16:46:32] !log thcipriani@tin Synchronized wmf-config/PoolCounterSettings.php: SWAT: Create pool counter for CirrusSearch completion suggester [[gerrit:268029]] (duration: 02m 22s) [16:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:36] 7Puppet, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling: Hiera is not properly configured on Nodepool instances - https://phabricator.wikimedia.org/T129092#2094783 (10hashar) [16:46:39] ^ ebernhardson sync'd [16:46:39] (03PS1) 10Giuseppe Lavagetto: lvs: add parsoid configuration in codfw [puppet] - 10https://gerrit.wikimedia.org/r/275530 (https://phabricator.wikimedia.org/T129090) [16:46:46] thcipriani: thanks looking [16:47:04] paladox: Do you have an example domain? For example, the beta cluster? [16:47:24] I guess this certificate doesn't accept a URL with a '.' before the .wmflabs [16:47:45] thcipriani: generally looks good. thanks! [16:47:47] Luke081515: paladox this is a known request that will not be resolved now [16:47:56] greg-g: Ok. [16:48:03] ebernhardson: thanks for checking, 2nd patch going. [16:48:07] greg-g: Ok, thanks [16:48:43] paladox: I got two wikis and three phabricators running at domains without a . for the main domain, and the certs are working at my sites :D [16:48:56] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270897 (https://phabricator.wikimedia.org/T126950) (owner: 10Pmlineditor) [16:49:07] Luke081515: Oh, ok.
[16:49:17] (03PS1) 10BBlack: cache_upload: separate applayer backend for thumbs [puppet] - 10https://gerrit.wikimedia.org/r/275531 (https://phabricator.wikimedia.org/T125510) [16:49:33] (03Merged) 10jenkins-bot: Enable assignment of 'accountcreator' for maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270897 (https://phabricator.wikimedia.org/T126950) (owner: 10Pmlineditor) [16:49:54] paladox: Luke081515 if you want to subscribe to https://phabricator.wikimedia.org/T50501 that's where you'll see some updates about it. It is blocked on getting time from Operations, which is in short supply now (and always) [16:50:16] greg-g: thanks. [16:50:28] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2094810 (10EBernhardson) One downside we have happening right now is that we do not have persistent connections to the elasticsearch cluste... [16:51:00] !log thcipriani@tin Synchronized wmf-config: SWAT: Update CirrusSearch PoolCounter for cross-dc search [[gerrit:270897]] (duration: 02m 25s) [16:51:02] ^ ebernhardson check please [16:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:06] greg-g: I think this is the wrong task https://phabricator.wikimedia.org/T126950 since thats for mai.wikipedia.org [16:51:28] paladox: I have no idea what you're talking about [16:51:46] greg-g: The task you linked above. [16:51:50] https://phabricator.wikimedia.org/T50501 [16:51:53] is not the one you just linked [16:52:04] thcipriani: all looks happy. Thanks! [16:52:13] The other one was the task, where I currently waiting for deploy of a patch ;) [16:52:15] ebernhardson: cool, thanks for checking. [16:52:23] greg-g: Oh sorry wrong link. [16:52:39] I'm going into a meeting [16:52:43] paladox: Maybe you looked at the grrrit-wm? 
[16:53:03] Luke081515: I was just getting ready to sync your maiwiki patch, FYI [16:53:12] no problem ;) [16:53:22] Luke081515: It seemed to keep redirecting me to the wrong task but works now. [16:53:31] don't hurry, better it takes a bit longer than a failed deploy :D [16:54:07] (03PS2) 10Giuseppe Lavagetto: lvs: add parsoid configuration in codfw [puppet] - 10https://gerrit.wikimedia.org/r/275530 (https://phabricator.wikimedia.org/T129090) [16:54:28] Luke081515: no worries :) [16:54:28] 6Operations, 10Beta-Cluster-Infrastructure, 6Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2094832 (10Paladox) [16:54:31] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2094835 (10Papaul) [16:55:02] !log restbase deploy start of 88363c03e0 on restbase1001 [16:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:11] (03PS2) 10Giuseppe Lavagetto: realm: add $::master_dc hash [puppet] - 10https://gerrit.wikimedia.org/r/275443 (https://phabricator.wikimedia.org/T125673) [16:56:13] (03PS1) 10Giuseppe Lavagetto: restbase: make restbase configuration $master_dc [puppet] - 10https://gerrit.wikimedia.org/r/275536 (https://phabricator.wikimedia.org/T126235) [16:56:15] (03PS1) 10Giuseppe Lavagetto: cxserver: use $rb_primary in configuring restbase urls [puppet] - 10https://gerrit.wikimedia.org/r/275537 (https://phabricator.wikimedia.org/T125065) [16:56:17] (03PS1) 10Giuseppe Lavagetto: mobileapps: point to $rb_primary, not to the local restbase cluster [puppet] - 10https://gerrit.wikimedia.org/r/275538 [16:56:19] (03PS1) 10Giuseppe Lavagetto: iegreview: use $parsoid_primary [puppet] - 10https://gerrit.wikimedia.org/r/275539 (https://phabricator.wikimedia.org/T125673) [16:56:29] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable 
assignment of accountcreator for maiwiki [[gerrit:270897]] (duration: 02m 21s) [16:56:32] ^ Luke081515 check please [16:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:49] (03PS2) 10BBlack: cache_upload: separate applayer backend for thumbs [puppet] - 10https://gerrit.wikimedia.org/r/275531 (https://phabricator.wikimedia.org/T125510) [16:56:51] (03PS4) 10BBlack: WIP: first attempt at cache_app_route() w/ split [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) [16:56:59] thcipriani: Checked, works correct. Thanks for SWAT, I will close the task now :D [16:57:10] Luke081515: awesome. Thanks for checking. [16:59:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] "no conftool data ?" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/275530 (https://phabricator.wikimedia.org/T129090) (owner: 10Giuseppe Lavagetto) [16:59:51] (03CR) 10Hashar: "So hiera lookup was broken due to some files not being properly synced to disk before the snapshot is created. 
That is solved" [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [17:03:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] "makes sense in general, comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [17:10:13] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable VCL applayer datacenter-switch via confd - https://phabricator.wikimedia.org/T127485#2094977 (10BBlack) [17:10:15] 6Operations, 10Traffic, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Traffic Infrastructure support for Mar 2016 codfw rollout - https://phabricator.wikimedia.org/T125510#2094978 (10BBlack) [17:10:17] 6Operations, 10Traffic, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Refactor VCL for applayer datacenter-switching - https://phabricator.wikimedia.org/T127484#2094974 (10BBlack) 5Open>3Resolved a:3BBlack This works now, and is controlled by per-cluster hieradata in the `apps` s... 
[17:14:51] 6Operations, 10Analytics, 10Datasets-General-or-Unknown, 10Traffic: http://dumps.wikimedia.org should redirect to https:// - https://phabricator.wikimedia.org/T128587#2094989 (10ArielGlenn) [17:15:31] 6Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Enable Kafka native TLS in 0.9 and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#1881737 (10Milimetric) p:5Triage>3Normal [17:18:15] 6Operations, 10Ops-Access-Requests: Root on labtestweb for Alex Monk (Krenair) - https://phabricator.wikimedia.org/T129097#2094999 (10Andrew) [17:21:25] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable VCL source-DC switching via confd - https://phabricator.wikimedia.org/T127482#2095026 (10BBlack) [17:21:27] 6Operations, 10Ops-Access-Requests: Access to deployment group for user madhuvishy - https://phabricator.wikimedia.org/T128666#2082195 (10RobH) Operations has approved this request in the weekly meeting, so there are no approval blockers. (If no one pushes a patchset in the next hour I'll get to it later tod... [17:22:07] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable VCL applayer datacenter-switch via confd - https://phabricator.wikimedia.org/T127485#2095029 (10BBlack) [17:23:18] 6Operations, 10Ops-Access-Requests: Root on labtestweb for Alex Monk (Krenair) - https://phabricator.wikimedia.org/T129097#2095034 (10Andrew) This was approved in the Ops meeting, pending the standard 3-day approval period. [17:24:45] 6Operations, 10Traffic, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Traffic Infrastructure support for Mar 2016 codfw rollout - https://phabricator.wikimedia.org/T125510#2095039 (10BBlack) Removing the 2x confd-related blocker tasks: they'll still be open tasks tagged for #codfw-roll... 
[17:24:55] Operations, Traffic, codfw-rollout, codfw-rollout-Jan-Mar-2016: Enable VCL applayer datacenter-switch via confd - https://phabricator.wikimedia.org/T127485#2095042 (BBlack) [17:24:57] Operations, Traffic, codfw-rollout, codfw-rollout-Jan-Mar-2016: Enable VCL source-DC switching via confd - https://phabricator.wikimedia.org/T127482#2095043 (BBlack) [17:24:59] Operations, Traffic, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Traffic Infrastructure support for Mar 2016 codfw rollout - https://phabricator.wikimedia.org/T125510#2095041 (BBlack) [17:25:08] Operations, Traffic, codfw-rollout: Enable VCL source-DC switching via confd - https://phabricator.wikimedia.org/T127482#2044847 (BBlack) [17:25:11] Operations, Ops-Access-Requests: Access to deployment group for user madhuvishy - https://phabricator.wikimedia.org/T128666#2095045 (MoritzMuehlenhoff) a:MoritzMuehlenhoff>RobH [17:25:26] Operations, Ops-Access-Requests: Access to deployment group for user madhuvishy - https://phabricator.wikimedia.org/T128666#2082195 (MoritzMuehlenhoff) I hadn't created a patch yet, so reassigning to Rob [17:25:35] Operations, Traffic, codfw-rollout: Enable VCL applayer datacenter-switch via confd - https://phabricator.wikimedia.org/T127485#2044954 (BBlack) [17:27:23] Operations, Traffic, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Traffic Infrastructure support for Mar 2016 codfw rollout - https://phabricator.wikimedia.org/T125510#2095054 (BBlack) Overall status update: work is complete at the configuration level to support the necessary swi... [17:28:20] Operations, Analytics, Datasets-General-or-Unknown, Wikidata: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#2095058 (Milimetric) Open>declined So we're leaning towards declining this unless requests for dumps.wik...
[17:29:33] Operations, Labs, Tool-Labs, Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2095064 (scfc) @Magnus: In this context, it means that a request came in with `http` and was not redirected or failed. This could be by someone typing... [17:29:53] Operations, Traffic, codfw-rollout, codfw-rollout-Jan-Mar-2016: Switch ulsfo to backend to codfw rather than eqiad - https://phabricator.wikimedia.org/T127492#2095068 (BBlack) This is ready to test. Will shoot for early Tuesday (tomorrow). Should not impact anything else. [17:32:51] (PS1) Yuvipanda: uwsgi: Stop the distro provided service properly [puppet] - https://gerrit.wikimedia.org/r/275560 (https://phabricator.wikimedia.org/T124621) [17:39:16] (PS2) Yuvipanda: uwsgi: Stop the distro provided service properly [puppet] - https://gerrit.wikimedia.org/r/275560 (https://phabricator.wikimedia.org/T124621) [17:49:31] Puppet, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling: Hiera is not properly configured on Nodepool instances - https://phabricator.wikimedia.org/T129092#2095169 (mobrovac) You can add facts by passing in environment variables: ``` $ FACTER_MYVARNAME=blah puppet apply ``` This... [17:57:04] Puppet, Continuous-Integration-Infrastructure, Continuous-Integration-Scaling: Hiera is not properly configured on Nodepool instances - https://phabricator.wikimedia.org/T129092#2095218 (hashar) ``` $ facter labsproject $ FACTER_LABSPROJECT=foobar facter labsproject foobar $ ``` Super nice hack @mobr... [17:59:58] Operations, ops-codfw: ms-be2003.codfw.wmnet: slot=7 dev=sdh failed - https://phabricator.wikimedia.org/T127824#2095246 (Papaul) a:Papaul>faidon I forgot that i have 3 spares on site so no need to open a purchase ticket. I am replacing the bad drive with one of the spare that I have on site. Drive...
[18:05:30] (CR) Dzahn: [C: 1] Redirect create-task form, with or without slash (fixes T127286) [puppet] - https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286) (owner: 20after4) [18:06:43] (CR) 20after4: [C: 1] "this has no dependency and should be totally safe to deploy" [puppet] - https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286) (owner: 20after4) [18:07:14] Operations, ops-eqiad, Patch-For-Review: mw1026-69 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T129060#2095272 (ArielGlenn) Giuseppe gave me the ok so I've cleaned up their salt keys. [18:09:27] Operations, Labs, Labs-Infrastructure, Patch-For-Review: labservices1001 ran out of disk space - https://phabricator.wikimedia.org/T126572#2095280 (Andrew) Open>Resolved [18:14:28] Operations, Patch-For-Review, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2095303 (MoritzMuehlenhoff) [18:14:30] Operations: upgrade iron to jessie (or get rid of it) - https://phabricator.wikimedia.org/T125025#2095301 (MoritzMuehlenhoff) Open>Resolved iron has been reimaged with jessie. [18:17:33] Operations, ops-codfw: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2095309 (Papaul) a:Papaul>jcrespo @ jcrespo I was about to open a ticket for ordering a drive replacement for this system and I did a count down of all the db servers that are showing failed drive... [18:18:24] mutante: hi, who are the list admins for FDC mailing list ? [18:18:29] (PS1) Krinkle: mediawiki: Use [PT] instead of [L] for static.php rewrite rules [puppet] - https://gerrit.wikimedia.org/r/275582 (https://phabricator.wikimedia.org/T128747) [18:18:49] (CR) Krinkle: [C: -1] "Not yet tested in prod. For testing on silver."
[puppet] - https://gerrit.wikimedia.org/r/275582 (https://phabricator.wikimedia.org/T128747) (owner: Krinkle) [18:19:07] mutante: never mind, found it [18:19:21] (PS1) Papaul: Add DNS entries for sinistra Bug:T128796 [dns] - https://gerrit.wikimedia.org/r/275584 (https://phabricator.wikimedia.org/T128796) [18:20:38] (PS1) Elukey: Remove rdb1003.eqiad from the Redis Job Queues for maintenance. [mediawiki-config] - https://gerrit.wikimedia.org/r/275585 (https://phabricator.wikimedia.org/T128730) [18:22:38] (PS1) RobH: adding madhuvishy to deployers [puppet] - https://gerrit.wikimedia.org/r/275587 [18:23:40] Operations, ops-codfw, codfw-rollout, codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2095324 (Papaul) sinistra mgmt dns 10.193.2.246 port ge-5/0/4 rack c5 [18:24:16] Operations, Patch-For-Review: Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance - https://phabricator.wikimedia.org/T128730#2095326 (elukey) @aaron: I didn't see the partition configuration in the file and I should have probably asked twice. I am planning to add do...
[18:24:24] Operations, ops-codfw, codfw-rollout, codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2095327 (Papaul) [18:24:40] (CR) Dzahn: [C: 1] adding madhuvishy to deployers [puppet] - https://gerrit.wikimedia.org/r/275587 (owner: RobH) [18:24:52] (CR) RobH: [C: 2] adding madhuvishy to deployers [puppet] - https://gerrit.wikimedia.org/r/275587 (owner: RobH) [18:26:00] (CR) Dzahn: [C: 2] "https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=2910" [dns] - https://gerrit.wikimedia.org/r/275584 (https://phabricator.wikimedia.org/T128796) (owner: Papaul) [18:26:35] Operations: OCG needs to migrate away from rdb1002 and get its own Redis instance - https://phabricator.wikimedia.org/T128491#2095341 (elukey) [18:26:37] Operations, Ops-Access-Requests: Access to deployment group for user madhuvishy - https://phabricator.wikimedia.org/T128666#2095342 (RobH) Open>Resolved a:RobH>None @madhuvishy: Your deployment access is now live on the cluster. It will take an hour or so for all servers affected to call in a... [18:27:16] Operations, Labs: Untangle labs/production roles from labs/instance roles - https://phabricator.wikimedia.org/T119401#2095346 (yuvipanda) a:yuvipanda>None [18:28:31] Operations: OCG needs to migrate away from rdb1002 and get its own Redis instance - https://phabricator.wikimedia.org/T128491#2076647 (elukey) All right, patch merged and rdb1002's client connections dropped in favor of rdb1007. I checked on the latter and Redis queues have been created an populated correctly...
[18:29:20] (PS1) Dzahn: fix typos for labstore2004 and sinistra (wfm!=wmf) [dns] - https://gerrit.wikimedia.org/r/275589 [18:30:59] (CR) Dzahn: [C: 2] "this fix wfm :p" [dns] - https://gerrit.wikimedia.org/r/275589 (owner: Dzahn) [18:32:16] !log restbase deploy start of 5add37b16 on restbase1001 [18:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:32:32] (PS6) Dzahn: Redirect create-task form, with or without slash (fixes T127286) [puppet] - https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286) (owner: 20after4) [18:33:38] (CR) Yuvipanda: [C: 2] uwsgi: Stop the distro provided service properly [puppet] - https://gerrit.wikimedia.org/r/275560 (https://phabricator.wikimedia.org/T124621) (owner: Yuvipanda) [18:33:45] (PS3) Yuvipanda: uwsgi: Stop the distro provided service properly [puppet] - https://gerrit.wikimedia.org/r/275560 (https://phabricator.wikimedia.org/T124621) [18:33:52] (CR) Yuvipanda: [V: 2] uwsgi: Stop the distro provided service properly [puppet] - https://gerrit.wikimedia.org/r/275560 (https://phabricator.wikimedia.org/T124621) (owner: Yuvipanda) [18:34:56] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2095376 (GWicke) > What's the caching strategy for this API? Will it simply redirect/proxy The most effici...
[18:35:37] rreeeeebase [18:36:00] (PS7) Dzahn: Redirect create-task form, with or without slash (fixes T127286) [puppet] - https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286) (owner: 20after4) [18:36:30] mutante: don't puppet merge my patch for a sec maybe [18:36:57] Operations, Services, hardware-requests: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2075945 (RobH) a:RobH>mark @Akosiaris: Duly noted, that changes things! We have spares to allocate for this right now in CODFW. @Mark: Please review for approval the al... [18:37:05] ok, not merging anything yet because i have to keep rebasing over and over :) [18:37:31] haha :D [18:38:03] Operations, Services, hardware-requests: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2095396 (RobH) [18:38:24] Operations, Traffic, Patch-For-Review: Port varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T128788#2095398 (ori) [18:38:40] Operations, Services, hardware-requests: codfw: (2+2) sca & scb service clusters - https://phabricator.wikimedia.org/T128475#2075945 (RobH) [18:39:50] (CR) Filippo Giunchedi: [C: 1] "thanks Brandon!" [puppet] - https://gerrit.wikimedia.org/r/275531 (https://phabricator.wikimedia.org/T125510) (owner: BBlack) [18:46:16] Operations, Ops-Access-Requests: Access to deployment group for user madhuvishy - https://phabricator.wikimedia.org/T128666#2095436 (madhuvishy) Thanks @RobH :) I will! [18:47:28] (PS1) EBernhardson: Enable completion suggester as default prefix search on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/275593 [18:50:09] (CR) 20after4: "should I put this on puppetswat for tomorrow?"
[puppet] - https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286) (owner: 20after4) [18:50:33] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:51:03] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:52:02] Operations, codfw-rollout, codfw-rollout-Jan-Mar-2016: switch upload varnish backends to codfw ahead of full switch - https://phabricator.wikimedia.org/T129089#2095538 (fgiunchedi) see also related https://gerrit.wikimedia.org/r/#/c/275531/ [18:52:13] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [18:52:43] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [18:53:49] gonna be a bunch of puppet errors [18:53:52] that'll be me [18:55:54] Operations, Performance-Team, Traffic: Segment Navigation Timing data by continent - https://phabricator.wikimedia.org/T128709#2083445 (ori) p:Triage>High [18:56:28] Operations, ops-codfw: ms-be2010.codfw.wmnet: slot=0 dev=sda failed - https://phabricator.wikimedia.org/T129117#2095571 (fgiunchedi) NEW [18:57:13] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures [18:59:55] RECOVERY - RAID on ms-be2010 is OK: OK: optimal, 13 logical, 13 physical [19:01:12] (CR) Dzahn: "no :)" [puppet] - https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286) (owner: 20after4) [19:03:11] !log restbase deploy end of 5add37b16 [19:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:03:54] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures [19:07:05] (CR) Dzahn: [C: 2 V: 2] Redirect create-task form, with or without slash (fixes T127286) [puppet] - 
https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286) (owner: 20after4) [19:09:12] (CR) Dzahn: "https://phabricator.wikimedia.org/maniphest/task/create not a 404 anymore" [puppet] - https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286) (owner: 20after4) [19:09:33] mutante: thanks! [19:09:37] twentyafterfour: yw [19:15:26] Operations, MobileFrontend, Traffic, MW-1.27-release, and 7 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2095706 (ori) >>! In T124356#2090639, @Jdlrobso... [19:16:48] (PS2) EBernhardson: Enable completion suggester as default prefix search on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/275593 (https://phabricator.wikimedia.org/T128774) [19:17:42] !log deployed initial patch for T109140 [19:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:09] (CR) Aaron Schulz: [C: 1] Remove rdb1003.eqiad from the Redis Job Queues for maintenance. 
[mediawiki-config] - https://gerrit.wikimedia.org/r/275585 (https://phabricator.wikimedia.org/T128730) (owner: Elukey) [19:29:10] (PS1) Papaul: Add production DNS for sinistra Bug:T128796 [dns] - https://gerrit.wikimedia.org/r/275600 (https://phabricator.wikimedia.org/T128796) [19:33:14] (PS1) Dzahn: add wikitech-static-jessie for migration [dns] - https://gerrit.wikimedia.org/r/275601 [19:34:09] (PS2) Dzahn: add wikitech-static-jessie for migration [dns] - https://gerrit.wikimedia.org/r/275601 (https://phabricator.wikimedia.org/T126385) [19:35:52] (PS3) Dzahn: add wikitech-static-jessie for migration [dns] - https://gerrit.wikimedia.org/r/275601 (https://phabricator.wikimedia.org/T126385) [19:36:55] (CR) Mobrovac: Services: introduce service::packages (2 comments) [puppet] - https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: Mobrovac) [19:39:12] (PS2) Papaul: Add production DNS for sinistra Bug:T128796 [dns] - https://gerrit.wikimedia.org/r/275600 (https://phabricator.wikimedia.org/T128796) [19:41:06] (PS3) EBernhardson: Enable completion suggester as default prefix search on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/275593 (https://phabricator.wikimedia.org/T128774) [19:49:43] RECOVERY - RAID on db2018 is OK: OK: optimal, 1 logical, 6 physical [19:51:30] (PS1) EBernhardson: Enable completion suggester as default on all but top 12 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/275605 (https://phabricator.wikimedia.org/T128776) [19:52:01] (PS2) EBernhardson: Enable completion suggester as default on all but top 12 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/275605 (https://phabricator.wikimedia.org/T128775) [20:00:04] hashar andrewbogott: Respected human, time to deploy CI (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160307T2000). Please do the needful. 
[20:00:13] hashar: I’m here [20:00:59] andrewbogott: I am too :) [20:01:13] do you want to sync on this channel / private message or over google hangouts? [20:01:17] here is fine [20:01:25] k [20:01:28] I’m just here to troubleshoot if things go terribly wrong, right? [20:01:33] I have made a copy of the current nodepool .deb package in my homedir [20:01:37] oh [20:01:50] I dont have root on labnodepool1001.eqiad.wmnet :-D [20:01:57] so you get to do the reprepo dance + apt-get update && install [20:02:11] sure, ok :) [20:02:28] https://phabricator.wikimedia.org/T118573 should have the .deb [20:02:34] there https://people.wikimedia.org/~hashar/debs/nodepool_0.1.1-wmf4/ [20:02:52] !log Upgrading Nodepool from 0.1.1-wmf3 to 0.1.1-wmf.4 with andrewbogott | T118573 [20:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:04] it is merely a few cherry picks [20:03:16] so I am not worried [20:04:11] actually it is only a couple changes [20:04:35] hashar: this is running on trusty, right? [20:04:41] Jessie [20:05:00] jessie-wikimedia/thirdparty [20:05:34] reprepro is all a mystery to me. I can't even write it properly [20:07:52] !log deployed patch for T129120 [20:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:02] Operations, MobileFrontend, Traffic, MW-1.27-release, and 7 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2096022 (Jdlrobson) @ori I don't appreciate yo...
[20:08:09] ok, the package should be available [20:08:18] (PS1) Papaul: adding install params for sinistra Bug:T128796 [puppet] - https://gerrit.wikimedia.org/r/275607 (https://phabricator.wikimedia.org/T128796) [20:08:50] you want to apt-get update && apt-get upgrade [20:08:50] and that should restart nodepoo [20:09:04] (hopefully :D ) [20:09:04] oh right, the no root thing [20:09:06] ok, one moment... [20:09:09] yeah [20:09:36] I got root on gallium (jenkins / zuul scheduler) since both requires bunch of strace and other funny diagnostics [20:10:06] but for Nodepool, there was not much need for me to get root beside ninja upgrading and live hacking (which both are evil :D ) [20:12:30] andrewbogott: apparently the .deb package does not restart the service :-( [20:14:32] well, I upgraded way more things than I needed to, but that should do it. [20:14:36] Running puppet now... [20:14:38] :) [20:14:50] maybe I have instructed the .deb to not intentionally reload [20:15:02] !log stopping nodepool [20:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:15:13] It’s up now isn’t it? [20:17:07] yeah [20:17:11] but can't stop it for some reason [20:17:13] systemd reports: [20:17:13] Job for nodepool.service canceled. [20:17:18] puppet is unhappy now [20:18:07] what is sudo -H -u nodepool bash -l [20:18:07] ? [20:18:16] (I kill -9’d the nodepool service itself, I think) [20:18:33] that sudo comes from the shell script 'become-nodepool' [20:18:43] seems our sudo default is to not set $HOME [20:18:50] and that kept confusing me [20:18:59] so I went with a script helper become-nodepool that sets -H [20:19:09] this way the openstack credentials helps in /var/lib/nodepool/.profile are loaded [20:19:17] and then one can interact with the Openstack api [20:19:27] ok… are you able to do what you need to do now? 
[20:19:31] !log Nodepool restarting [20:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:19:47] (PS1) Papaul: dhcp:adding sinistra MAC address Bug:T128796 [puppet] - https://gerrit.wikimedia.org/r/275609 (https://phabricator.wikimedia.org/T128796) [20:21:07] andrewbogott: yes I think it is fine now. Will verify a few things and then close tasks [20:21:11] Operations, ops-codfw, codfw-rollout, codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2096087 (Papaul) [20:21:12] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: Puppet has 1 failures [20:22:52] RECOVERY - puppet last run on labnodepool1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [20:23:18] (CR) Hashar: [C: 2 V: 2] nodepool 0.1.1-wmf4 [debs/nodepool] (debian) - https://gerrit.wikimedia.org/r/237700 (https://phabricator.wikimedia.org/T111377) (owner: Hashar) [20:23:24] (CR) Hashar: "Deployed with Andrew" [debs/nodepool] (debian) - https://gerrit.wikimedia.org/r/237700 (https://phabricator.wikimedia.org/T111377) (owner: Hashar) [20:25:43] TimStarling: just one thing, did you reply yesterday because you knew the installed git version was below 2.7.1 ? [20:26:14] (CR) Dzahn: [C: 2] add wikitech-static-jessie for migration [dns] - https://gerrit.wikimedia.org/r/275601 (https://phabricator.wikimedia.org/T126385) (owner: Dzahn) [20:26:20] (PS1) Alexandros Kosiaris: Introduce hassaleh, hassium for debug_proxy service [dns] - https://gerrit.wikimedia.org/r/275611 (https://phabricator.wikimedia.org/T129003) [20:26:20] Operations, Wikimedia-General-or-Unknown, vm-requests, Patch-For-Review: 2 Ganeti VMs for X-Wikimedia-Debug proxy - https://phabricator.wikimedia.org/T129003#2096105 (akosiaris) Looks pretty reasonable to me. 
However it does not look like those will ever be a cluster of servers so I 'll switch to... [20:27:48] (CR) Ori.livneh: [C: 1] Introduce hassaleh, hassium for debug_proxy service [dns] - https://gerrit.wikimedia.org/r/275611 (https://phabricator.wikimedia.org/T129003) (owner: Alexandros Kosiaris) [20:28:55] (PS1) Hashar: nodepool: set delete-delay to 0 seconds [puppet] - https://gerrit.wikimedia.org/r/275612 (https://phabricator.wikimedia.org/T113359) [20:29:33] Operations, Mail: status of wikigroup@ alias - https://phabricator.wikimedia.org/T127551#2096137 (bbogaert) Hi Daniel, Can you investigate the fr-all address? These are apparently being delayed in delivery. Thank you, Byron [20:31:43] (PS2) Hashar: nodepool: set delete-delay to 0 seconds [puppet] - https://gerrit.wikimedia.org/r/275612 (https://phabricator.wikimedia.org/T113359) [20:32:57] andrewbogott: looks fine so far. Then I would need puppet change https://gerrit.wikimedia.org/r/#/c/275612/ to set the deletion delay to 0 seconds ;) [20:33:06] andrewbogott: which is the primary reason for the wmf4 upgrade [20:33:26] (CR) Andrew Bogott: [C: 2] nodepool: set delete-delay to 0 seconds [puppet] - https://gerrit.wikimedia.org/r/275612 (https://phabricator.wikimedia.org/T113359) (owner: Hashar) [20:33:28] after that we are done [20:34:22] applying puppet now... [20:34:42] hm, looks like it didn’t restart [20:34:45] do you want to do that? [20:34:54] yeah I will [20:34:54] (PS3) BBlack: cache_upload: separate applayer backend for thumbs [puppet] - https://gerrit.wikimedia.org/r/275531 (https://phabricator.wikimedia.org/T125510) [20:34:56] (PS5) BBlack: WIP: first attempt at cache_app_route() w/ split [puppet] - https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) [20:35:07] Operations, Mail: Move most (all?) 
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2096182 (bbogaert) [20:35:10] Operations, Mail: move fundraising group aliases to OIT - https://phabricator.wikimedia.org/T128647#2096181 (bbogaert) Resolved>Open [20:35:30] (CR) BBlack: [C: 2 V: 2] cache_upload: separate applayer backend for thumbs [puppet] - https://gerrit.wikimedia.org/r/275531 (https://phabricator.wikimedia.org/T125510) (owner: BBlack) [20:35:32] andrewbogott: I think Nodepool catch up with config changes all by itself [20:35:42] (PS2) Alexandros Kosiaris: Introduce hassaleh, hassium for debug_proxy service [dns] - https://gerrit.wikimedia.org/r/275611 (https://phabricator.wikimedia.org/T129003) [20:35:45] (PS1) Alexandros Kosiaris: Introduce hassaleh, hassium [puppet] - https://gerrit.wikimedia.org/r/275613 (https://phabricator.wikimedia.org/T129003) [20:35:49] (CR) Alexandros Kosiaris: [C: 2 V: 2] Introduce hassaleh, hassium for debug_proxy service [dns] - https://gerrit.wikimedia.org/r/275611 (https://phabricator.wikimedia.org/T129003) (owner: Alexandros Kosiaris) [20:36:09] Operations, Mail: move fundraising group aliases to OIT - https://phabricator.wikimedia.org/T128647#2081487 (bbogaert) Hi Daniel, Can you take a look at the fr-all? Fundraising is receiving failed delivery notices. Thanks, Byron [20:37:42] (PS2) Ori.livneh: Add debug_proxy module, for X-Wikimedia-Debug request routing [puppet] - https://gerrit.wikimedia.org/r/275307 (https://phabricator.wikimedia.org/T129000) [20:38:20] Operations, Labs, wikitech.wikimedia.org, Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2096207 (Dzahn) ``` root@wikitech-static-jessie.wikimedia.org's password: ___________________________ < wikitech-static on jessie > < T126385 > -----... [20:44:41] andrewbogott: it is broken somehow. 
Not able to spawn instances any more apparently [20:45:03] dang [20:45:08] do you know what’s failing? [20:45:12] nop [20:45:29] it loose track of instances [20:45:35] and got some old ones stalled in a delete state [20:46:03] maybe that’s an after-effect from the kill-9? [20:46:12] potentially yeah [20:46:23] oh no [20:46:27] something caught up [20:46:58] (CR) jenkins-bot: [V: -1] Add debug_proxy module, for X-Wikimedia-Debug request routing [puppet] - https://gerrit.wikimedia.org/r/275307 (https://phabricator.wikimedia.org/T129000) (owner: Ori.livneh) [20:47:56] andrewbogott: not sure what happened, but apparently a bunch of tasks (such as deleting servers / listing them etc) got unlocked [20:48:28] things that needed to be unlocked? Or things that should not have been unlocked? [20:51:03] andrewbogott: not sure. When it restarts, I guess it delete all nodes and spawn new ones [20:51:22] somehow the instances deletion requests ended up being queued somewhere as well as lot of build requests [20:51:32] with not much happening until the tasks got processed [20:51:44] but it’s working properly now? [20:51:51] seems so [20:51:56] I think that is enough [20:52:20] will monitor it a bit, but I can see instances added/removed just fine now [20:52:39] (PS3) Ori.livneh: Add debug_proxy module, for X-Wikimedia-Debug request routing [puppet] - https://gerrit.wikimedia.org/r/275307 (https://phabricator.wikimedia.org/T129000) [20:52:41] great! [20:53:29] (PS2) Alexandros Kosiaris: Introduce hassaleh, hassium [puppet] - https://gerrit.wikimedia.org/r/275613 (https://phabricator.wikimedia.org/T129003) [20:53:41] (CR) jenkins-bot: [V: -1] Add debug_proxy module, for X-Wikimedia-Debug request routing [puppet] - https://gerrit.wikimedia.org/r/275307 (https://phabricator.wikimedia.org/T129000) (owner: Ori.livneh) [20:55:18] andrewbogott: I am freeing you. I am calling it a success :} thank you! 
[20:55:23] (PS4) Ori.livneh: Add debug_proxy module, for X-Wikimedia-Debug request routing [puppet] - https://gerrit.wikimedia.org/r/275307 (https://phabricator.wikimedia.org/T129000) [20:55:32] ok! Pretty painless. [20:56:07] !log Nodepool successfully upgraded. T118573 [20:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:57:16] andrewbogott: will try to figure out why it doesn't stop properly later on [20:57:30] maybe it was stopping properly and we just got impatient [20:58:25] (CR) Ori.livneh: [C: 1] Introduce hassaleh, hassium [puppet] - https://gerrit.wikimedia.org/r/275613 (https://phabricator.wikimedia.org/T129003) (owner: Alexandros Kosiaris) [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160307T2100). [21:00:19] andrewbogott: looks like on start the instances "ready" are kept around but checked on a 15 Minutes schedule [21:00:50] andrewbogott: so when the 15 min scheduled task kicks in, it checks whether they are still reachable and if so add them back to Jenkins [21:00:58] (PS3) Alexandros Kosiaris: Introduce hassaleh, hassium [puppet] - https://gerrit.wikimedia.org/r/275613 (https://phabricator.wikimedia.org/T129003) [21:01:05] (CR) Alexandros Kosiaris: [C: 2 V: 2] Introduce hassaleh, hassium [puppet] - https://gerrit.wikimedia.org/r/275613 (https://phabricator.wikimedia.org/T129003) (owner: Alexandros Kosiaris) [21:01:27] andrewbogott: will bug fill that / fix it to upstream [21:01:45] Operations, Labs, wikitech.wikimedia.org, Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2096372 (Dzahn) @Krenair i added your ssh key to that ^ got root , like on the current wikitech-static [21:02:00] Krenair: 
https://phabricator.wikimedia.org/T126385#2096207 [21:02:03] Operations, ops-codfw, codfw-rollout, codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2096373 (RobH) robh@asw-c-codfw# show | compare [edit interfaces interface-range vlan-private1-c-codfw] member ge-1/0/1 { ... } + member ge-5/0/4; [e... [21:02:23] andrewbogott: if you still have a couple minutes, I would like to change a log rotation to happen on midnight: https://gerrit.wikimedia.org/r/#/c/269213/ :) [21:02:30] (PS2) Hashar: nodepool: rotate daily at midnight [puppet] - https://gerrit.wikimedia.org/r/269213 [21:03:57] (PS5) Ori.livneh: Add debug_proxy module, for X-Wikimedia-Debug request routing [puppet] - https://gerrit.wikimedia.org/r/275307 (https://phabricator.wikimedia.org/T129000) [21:03:59] (PS1) Ori.livneh: Update appservers_debug Varnish backend to point to debug_proxy instances [puppet] - https://gerrit.wikimedia.org/r/275621 (https://phabricator.wikimedia.org/T129000) [21:05:39] !log starting parsoid deploy [21:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:26] (PS1) MaxSem: Deploy Kartographer on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/275622 (https://phabricator.wikimedia.org/T127136) [21:07:51] !log synced code; restarted parsoid on wtp1001 as a canary [21:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:09:58] (CR) Andrew Bogott: [C: 2] nodepool: rotate daily at midnight [puppet] - https://gerrit.wikimedia.org/r/269213 (owner: Hashar) [21:10:19] andrewbogott: it happened to logrotate right in the middle of a daily operation :-} [21:12:00] looking good. restarting parsoid on all nodes [21:14:13] !log finished deploying parsoid sha 5db1d28b [21:14:15] who uses zotero again? The citoid service? 
[21:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:14:41] greg-g: yes [21:14:51] multi level services... [21:15:27] akosiaris: :) [21:15:28] ty [21:15:33] yw [21:16:41] grrrit-wm: Yes. [21:18:09] 6Operations, 10Monitoring, 10Traffic, 10Scap3 (scap3-adoption): Deploy libreNMS with scap3 - https://phabricator.wikimedia.org/T129136#2096567 (10greg) Added #monitoring and #traffic assuming that hits the right people who care about libreNMS. [21:18:19] James_F: grrrrrreat [21:18:31] Ha. [21:18:32] greg-g: Yeah. [21:19:20] (03CR) 10Yurik: [C: 031] Deploy Kartographer on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275622 (https://phabricator.wikimedia.org/T127136) (owner: 10MaxSem) [21:24:49] Maybe we should make the bots use names like "[bot]gerrit" so we don't mis-ping them. [21:24:52] 6Operations, 10Deployment-Systems, 10Monitoring, 10scap, 10Scap3 (scap3-adoption): Deploy servermon with scap3 - https://phabricator.wikimedia.org/T129152#2096645 (10greg) [21:25:54] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2096662 (10Gehel) you can check cost per call with the following oneliner in mwrepl: ``` $time = 0; for ( $i = 0; $i < 100; ++$i) { $ch =... 
[21:28:33] 6Operations, 10Deployment-Systems, 10Monitoring, 10scap, 10Scap3 (scap3-adoption): Deploy servermon with scap3 - https://phabricator.wikimedia.org/T129152#2096707 (10thcipriani) [21:29:06] 6Operations, 10Monitoring, 10Traffic, 10Scap3 (scap3-adoption): Deploy libreNMS with scap3 - https://phabricator.wikimedia.org/T129136#2096721 (10thcipriani) [21:31:10] !log starting mobileapps deploy [21:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:34:22] (03PS1) 10Nschaaf: Remove reader segmentation survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275669 (https://phabricator.wikimedia.org/T125946) [21:36:55] !log mobileapps deployed 49169e9 [21:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:38:52] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:18] bearND: ^ [21:39:23] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:46] ori, thanks. ohoh [21:39:54] that's gonna page [21:40:33] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [21:40:40] ah, that's good [21:41:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [21:41:11] good. Where can i see a stack trace or log entries? [21:41:59] if you log stuff at /srv/log/mobileapps on the 2 boxes and of course logstash [21:41:59] When I tried the endpoints earlier it looked fine to me, and the same number of node processes were running as before I restarted them [21:42:16] and please tell me you do log stuff [21:42:28] cause I 've seen my fair share of non logging applications lately [21:42:29] akosiaris: thanks. yes, we do log stuff. [21:42:41] good :-). 
/me happy [21:44:32] 6Operations, 13Patch-For-Review: Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance - https://phabricator.wikimedia.org/T128730#2096841 (10aaron) >>! In T128730#2093668, @Joe wrote: > To better frame the issue: > > We need a reliable method to depool one rdb host. The s... [21:49:31] (03PS4) 10EBernhardson: Enable completion suggester as default prefix on test/test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275593 (https://phabricator.wikimedia.org/T128774) [21:49:39] (03PS5) 10EBernhardson: Enable completion suggester as default prefix algo on test/test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275593 (https://phabricator.wikimedia.org/T128774) [21:52:32] 6Operations, 10Wikimedia-Mailing-lists: Password for list and ownership - https://phabricator.wikimedia.org/T129165#2096900 (10klove) [21:52:42] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:54:31] scb1001 is in trouble again [21:55:05] it looks like inbound traffic to that host halved since the deployment: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=scb1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Service+Cluster+B+eqiad [21:55:20] 6Operations, 10Wikimedia-Mailing-lists: Password for list and ownership - https://phabricator.wikimedia.org/T129165#2096921 (10RobH) a:3RobH [21:55:24] bearND: ^ [21:55:42] greg-g: there it goes again :( [21:57:54] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [21:58:46] bearND: have you found the cause or at least the issue? [21:58:58] greg-g: no [21:59:17] greg-g: should i revert? now scb1001 is ok again [21:59:34] I don't think it's really OK; I think it's flapping [22:00:04] yurik maxsem: Dear anthropoid, the time has come. Please deploy Kartographer extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160307T2200). 
[22:00:04] MaxSem, go go go! :D [22:00:13] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:00:22] * yurik and max take down wikivoyage [22:00:36] bearND: yes, revert, if you don't know what's going on, and it's flapping, revert [22:00:50] greg-g: ori: i'm going to revert https://gerrit.wikimedia.org/r/#/c/275626/ and redeploy. sound good? [22:01:29] 6Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: Dynamic backend selection via X-Wikimedia-Debug header - https://phabricator.wikimedia.org/T129000#2096958 (10akosiaris) [22:01:31] 6Operations, 10Wikimedia-General-or-Unknown, 10vm-requests, 13Patch-For-Review: 2 Ganeti VMs for X-Wikimedia-Debug proxy - https://phabricator.wikimedia.org/T129003#2096956 (10akosiaris) 5Open>3Resolved VMs created, added to puppet/salt. I am resolving this, @ori the VMs are yours to implement the debu... [22:01:34] if that's what you deployed, yes :) [22:01:46] mobrovac: ^^^ [22:01:53] greg-g: ok, will revert [22:02:12] (03PS6) 10Ori.livneh: Add debug_proxy module, for X-Wikimedia-Debug request routing [puppet] - 10https://gerrit.wikimedia.org/r/275307 (https://phabricator.wikimedia.org/T129000) [22:02:19] (03CR) 10Ori.livneh: [C: 032 V: 032] Add debug_proxy module, for X-Wikimedia-Debug request routing [puppet] - 10https://gerrit.wikimedia.org/r/275307 (https://phabricator.wikimedia.org/T129000) (owner: 10Ori.livneh) [22:03:27] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2096962 (10Papaul) Thanks Rob [22:03:41] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [22:04:04] 6Operations, 10Wikimedia-Mailing-lists: Password for list and ownership - https://phabricator.wikimedia.org/T129165#2096963 (10RobH) 5Open>3Resolved p:5Triage>3Normal I've reset the admin password for the list and emailed the new password to the list admins. 
Resolving task! [22:04:20] ori: just one thing, did you reply yesterday because you knew the installed git version was below 2.7.1 ? [22:04:26] 6Operations, 10Wikimedia-Mailing-lists: Password for list and ownership - https://phabricator.wikimedia.org/T129165#2096973 (10klove) Thanks! [22:05:06] ytrezq: I replied because what you reported seemed like a credible threat that should be investigated. And, yes, it was below 2.7.1. [22:05:40] 6Operations, 6Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2096990 (10Dzahn) [22:05:41] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:05:52] ori: so do you mean it is no longer the case ? if yes, good [22:06:32] ytrezq: thanks again for pointing it out [22:06:42] !log starting to revert mobileapps deploy [22:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:06:59] ori: so do you mean it is no longer the case ? [22:07:03] isn’t it ? [22:07:22] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [22:07:23] ytrezq: I'm not sure; it was handled by our security folks. 
[22:10:05] !log revert of mobileapps deploy complete [22:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:11:48] (03CR) 10Dzahn: [C: 04-1] "the line above that is using the same recipe, you can put that in one line" [puppet] - 10https://gerrit.wikimedia.org/r/275607 (https://phabricator.wikimedia.org/T128796) (owner: 10Papaul) [22:12:16] 6Operations, 10Mail: fr-all fails with error 451 - https://phabricator.wikimedia.org/T129168#2097033 (10bbogaert) [22:13:29] (03PS1) 10Ori.livneh: debug_proxy: add resolver param and set it to $::nameservers [puppet] - 10https://gerrit.wikimedia.org/r/275679 [22:13:31] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [5000000.0] [22:13:41] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [22:13:54] (03PS2) 10Ori.livneh: debug_proxy: add resolver param and set it to $::nameservers [puppet] - 10https://gerrit.wikimedia.org/r/275679 [22:14:08] (03PS3) 10Ori.livneh: debug_proxy: add resolver param and set it to $::nameservers [puppet] - 10https://gerrit.wikimedia.org/r/275679 [22:14:16] (03CR) 10Ori.livneh: [C: 032 V: 032] debug_proxy: add resolver param and set it to $::nameservers [puppet] - 10https://gerrit.wikimedia.org/r/275679 (owner: 10Ori.livneh) [22:18:32] (03PS2) 10Ori.livneh: Update appservers_debug Varnish backend to point to debug_proxy instances [puppet] - 10https://gerrit.wikimedia.org/r/275621 (https://phabricator.wikimedia.org/T129000) [22:18:41] 6Operations, 10Mail: move fundraising group aliases to OIT - https://phabricator.wikimedia.org/T128647#2097062 (10bbogaert) 5Open>3Resolved Hi All, This is complete except for the fr-all address. Due to errors with delivery when the exim alias has been removed this will be tracked in T129168. 
Thanks, Byron [22:18:41] ori: of course what would really help is to get the same result as this https://www.google.fr/webhp?ie=utf-8&oe=utf-8&channel=suggest#q=CVE-2014-9390&tbm=nws so it’s not necessary to point it out to each organizations [22:18:43] 6Operations, 10Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2097066 (10bbogaert) [22:18:54] (03PS3) 10Dzahn: Add production DNS for sinistra Bug:T128796 [dns] - 10https://gerrit.wikimedia.org/r/275600 (https://phabricator.wikimedia.org/T128796) (owner: 10Papaul) [22:19:12] (03CR) 10Dzahn: [C: 032] Add production DNS for sinistra Bug:T128796 [dns] - 10https://gerrit.wikimedia.org/r/275600 (https://phabricator.wikimedia.org/T128796) (owner: 10Papaul) [22:19:38] (03CR) 10BBlack: [C: 032 V: 032] Update appservers_debug Varnish backend to point to debug_proxy instances [puppet] - 10https://gerrit.wikimedia.org/r/275621 (https://phabricator.wikimedia.org/T129000) (owner: 10Ori.livneh) [22:20:19] (03PS2) 10Dzahn: dhcp:adding sinistra MAC address Bug:T128796 [puppet] - 10https://gerrit.wikimedia.org/r/275609 (https://phabricator.wikimedia.org/T128796) (owner: 10Papaul) [22:20:25] (03CR) 10Dzahn: [C: 032] dhcp:adding sinistra MAC address Bug:T128796 [puppet] - 10https://gerrit.wikimedia.org/r/275609 (https://phabricator.wikimedia.org/T128796) (owner: 10Papaul) [22:20:50] (03CR) 10Dzahn: [V: 032] dhcp:adding sinistra MAC address Bug:T128796 [puppet] - 10https://gerrit.wikimedia.org/r/275609 (https://phabricator.wikimedia.org/T128796) (owner: 10Papaul) [22:21:35] (03PS2) 10Dzahn: adding install params for sinistra Bug:T128796 [puppet] - 10https://gerrit.wikimedia.org/r/275607 (https://phabricator.wikimedia.org/T128796) (owner: 10Papaul) [22:21:40] (03CR) 10Dzahn: [C: 032] adding install params for sinistra Bug:T128796 [puppet] - 10https://gerrit.wikimedia.org/r/275607 (https://phabricator.wikimedia.org/T128796) (owner: 10Papaul) [22:21:59] (03CR) 10Dzahn: [V: 032] adding 
install params for sinistra Bug:T128796 [puppet] - 10https://gerrit.wikimedia.org/r/275607 (https://phabricator.wikimedia.org/T128796) (owner: 10Papaul) [22:24:21] !log maxsem@tin Synchronized php-1.27.0-wmf.15/extensions/Kartographer: Initial deploy: get files into place (duration: 02m 27s) [22:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:27:12] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:27:22] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:27:45] (03PS2) 10Andrew Bogott: mediawiki: Use [PT] instead of [L] for static.php rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/275582 (https://phabricator.wikimedia.org/T128747) (owner: 10Krinkle) [22:27:48] (03CR) 10MaxSem: [C: 032] Deploy Kartographer on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275622 (https://phabricator.wikimedia.org/T127136) (owner: 10MaxSem) [22:28:18] (03PS1) 10Ori.livneh: debug_proxy: add mw[12]099 to backend_regexp [puppet] - 10https://gerrit.wikimedia.org/r/275689 [22:28:20] (03Merged) 10jenkins-bot: Deploy Kartographer on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275622 (https://phabricator.wikimedia.org/T127136) (owner: 10MaxSem) [22:28:35] (03PS2) 10Ori.livneh: debug_proxy: add mw[12]099 to backend_regexp [puppet] - 10https://gerrit.wikimedia.org/r/275689 [22:28:48] (03CR) 10Ori.livneh: [C: 032 V: 032] debug_proxy: add mw[12]099 to backend_regexp [puppet] - 10https://gerrit.wikimedia.org/r/275689 (owner: 10Ori.livneh) [22:29:34] !log maxsem@tin Started scap: Enable Kartographer on testwiki [22:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:29:46] yurik, ^^^ [22:30:04] wooot! [22:30:23] (03CR) 10Krinkle: [C: 031] "Confirmed with Andrew to fix the issue on silver/wikitech. 
And doesn't break it for mw1017/testwiki." [puppet] - 10https://gerrit.wikimedia.org/r/275582 (https://phabricator.wikimedia.org/T128747) (owner: 10Krinkle) [22:30:23] MaxSem, if you have bad wifi, you might want to run it in nohup/screen :) [22:30:28] andrewbogott: ^ [22:30:59] (03PS3) 10Andrew Bogott: mediawiki: Use [PT] instead of [L] for static.php rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/275582 (https://phabricator.wikimedia.org/T128747) (owner: 10Krinkle) [22:34:06] yurik, MaxSem: Are you still deploying? Can we sneak in an emergency fix for VE? [22:34:19] James_F, MaxSem is scapping right now [22:34:22] 6Operations, 6Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Wikitechwiki has 4xx responses to requests for some static assets inc. poweredby_mediawiki_88x31.png and WikiEditor's button-sprite.svg - https://phabricator.wikimedia.org/T128747#2097133 (10Krinkle) Yeah, so that was the issue. On silver... [22:34:26] we're in a middle of scap, James_F [22:34:36] Ah. Fun. [22:34:41] * James_F waits then. :-( /cc RoanKattouw. [22:34:43] James_F, once scap is done, we will be testing for a bit, and you can sneak it in [22:34:48] Kk. [22:34:53] MaxSem, ^ [22:35:07] yurik: OK, ping me when it's done? 
[22:35:14] ok [22:35:33] (03PS1) 10Cmjohnson: Adding dns entries both mgmt and production for db1074-db1078 and labsdb1008 [dns] - 10https://gerrit.wikimedia.org/r/275694 [22:36:16] (03PS1) 10Ori.livneh: VCL: pass all "X-Wikimedia-Debug"-bearing requests to appservers_debug backend [puppet] - 10https://gerrit.wikimedia.org/r/275695 (https://phabricator.wikimedia.org/T129000) [22:36:56] (03PS2) 10Cmjohnson: Adding dns entries both mgmt and production for db1074-db1078 and labsdb1008 [dns] - 10https://gerrit.wikimedia.org/r/275694 [22:38:04] (03CR) 10Cmjohnson: [C: 032] Adding dns entries both mgmt and production for db1074-db1078 and labsdb1008 [dns] - 10https://gerrit.wikimedia.org/r/275694 (owner: 10Cmjohnson) [22:38:11] (03PS3) 10Jforrester: Enable VisualEditor Single Edit Tab on the Polish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274130 (https://phabricator.wikimedia.org/T128477) [22:38:22] (03PS2) 10Ori.livneh: VCL: pass all "X-Wikimedia-Debug"-bearing requests to appservers_debug backend [puppet] - 10https://gerrit.wikimedia.org/r/275695 (https://phabricator.wikimedia.org/T129000) [22:38:29] (03CR) 10Jforrester: "Planned for 15 March." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/274130 (https://phabricator.wikimedia.org/T128477) (owner: 10Jforrester) [22:38:39] (03PS3) 10Jforrester: Enable VisualEditor Single Edit Tab on the English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274131 (https://phabricator.wikimedia.org/T128478) [22:39:17] 6Operations, 10ops-eqiad: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2097159 (10Cmjohnson) [22:39:47] (03CR) 10BBlack: [C: 031] VCL: pass all "X-Wikimedia-Debug"-bearing requests to appservers_debug backend [puppet] - 10https://gerrit.wikimedia.org/r/275695 (https://phabricator.wikimedia.org/T129000) (owner: 10Ori.livneh) [22:40:27] (03CR) 10Ori.livneh: [C: 032] VCL: pass all "X-Wikimedia-Debug"-bearing requests to appservers_debug backend [puppet] - 10https://gerrit.wikimedia.org/r/275695 (https://phabricator.wikimedia.org/T129000) (owner: 10Ori.livneh) [22:41:22] bblack: woot, thanks! `X-Wikimedia-Debug: 1` reqs are already going through the debug-proxies [22:47:27] 6Operations, 10Mail: fr-all fails with error 451 - https://phabricator.wikimedia.org/T129168#2097176 (10akosiaris) I would say the 2 entries are because of the 2 different DNs that exist under ou=groups with businessCagetory=fr-all@wikimedia.org ``` ldapsearch -LLL -x businessCategory=fr-all@wikimedia.org dn... [22:50:44] hey, is mw2212 still down? [22:50:55] !log installing sinistra :new mw log host [22:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:51:05] papaul: hahaha, I love the hostname. [22:51:22] ori ; Rob did [22:51:36] pick the host name [22:52:18] why did the SSH keys for ytterbium change? [22:52:39] Was it ever up? https://grafana.wikimedia.org/dashboard/db/server-board?fullscreen&var-server=mw2212&from=now-6M [22:53:33] TimStarling: /etc/ssh/ssh_known_hosts, you mean? [22:53:35] TimStarling: maybe you have changed your ssh client ? 
[22:53:45] I mean the host key [22:54:07] hmm [22:54:10] TimStarling: the host key has been around since 2013 but the ed25519 one is from June 2015 [22:54:36] Krinkle, SAL says j o e powercycled it today. so there was a chance... :P [22:54:54] Ah yeah [22:54:55] https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&c=API+application+servers+codfw&h=mw2212.codfw.wmnet [22:55:02] Somehow it doesn't show up in ganglia autocomplete, but going there manually shows it [22:55:04] weird [22:55:12] yeah, it's been down for 24h_ [22:55:13] + [22:55:17] oh and the host key of bast2001 (codfw) changed a few days ago [22:56:04] because it was reimaged [22:58:05] !log maxsem@tin Finished scap: Enable Kartographer on testwiki (duration: 28m 31s) [22:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:58:10] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2097251 (10RobH) [22:58:22] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2097270 (10RobH) [22:58:24] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: codfw: (2) servers for redis jobrunners - https://phabricator.wikimedia.org/T126453#2014739 (10RobH) [22:58:25] MaxSem, done? can RoanKattouw go? [22:58:29] yurik, Wikipedia successfully destroyed [22:59:10] forgive me for being suspicious about SSH host key warnings from "git pull" the day after someone reported a publicly known git vulnerability which may lead to arbitrary execution [22:59:20] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2097251 (10RobH) [22:59:24] it's not actually very hard to preserve the host keys across a reimage, is it? 
[23:01:26] (03PS3) 10EBernhardson: Reduce replica count for commonswiki_file in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266658 [23:02:38] (03PS3) 10EBernhardson: [test only] Stricter avro schema tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261296 [23:03:15] (03CR) 10EBernhardson: [C: 032] [test only] Stricter avro schema tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261296 (owner: 10EBernhardson) [23:03:45] 6Operations, 10ops-eqiad: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2097296 (10Cmjohnson) @robh: I cannot get the controller bios to install..any suggestions? [23:04:03] (03Merged) 10jenkins-bot: [test only] Stricter avro schema tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261296 (owner: 10EBernhardson) [23:06:17] anyway, ytterbium wasn't reimaged, something weirder was going on [23:06:52] !log ebernhardson@tin Synchronized tests/loggingTest.php: Sync out test only change (duration: 02m 20s) [23:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:56] my comment went towards bast2001, no idea about ytterbium [23:07:54] actually it was iron that was reimaged, I forgot that even public *.wikimedia.org IPs are proxied via it [23:08:07] the key warning was about iron [23:08:56] (03PS1) 10MaxSem: Enable Kartographer in vw, test2 and mw.o [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275703 [23:09:11] (03CR) 10MaxSem: [C: 032] Enable Kartographer in vw, test2 and mw.o [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275703 (owner: 10MaxSem) [23:09:23] which was in the SAL [23:09:32] right [23:09:50] 6Operations, 10ops-eqiad: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2097318 (10RobH) IRC Update: Chris and I chatted about this and all the steps he has taken would typically suffice. The system detects the card, but otherwise isn't working. 
It is a ran... [23:09:50] (03Merged) 10jenkins-bot: Enable Kartographer in vw, test2 and mw.o [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275703 (owner: 10MaxSem) [23:11:17] (03CR) 10Jforrester: Enable Kartographer in vw, test2 and mw.o (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275703 (owner: 10MaxSem) [23:12:05] anyway, the point remains that it is pretty straightforward to preserve host keys through a reimage, and doing that would avoid encouraging people to ignore "key changed" warnings [23:12:20] (03CR) 10MaxSem: Enable Kartographer in vw, test2 and mw.o (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275703 (owner: 10MaxSem) [23:12:42] James_F, RoanKattouw are you still deploying? [23:12:46] MaxSem: :-) [23:12:54] MaxSem is deploying? :) [23:13:02] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/275703 (duration: 02m 23s) [23:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:49] yurik: Waiting for Jenkins right now [23:13:58] fun fun [23:14:03] yurik: It'll take… a while. [23:14:07] we should do better about that: either preserve the keys when there's no suspicion the host has been compromised, or publicize them via another route [23:14:23] but users probably would not check that other route so [23:15:23] 6Operations: Preserve SSH host key when re-imaging hosts - https://phabricator.wikimedia.org/T129180#2097333 (10ori) [23:15:26] ^ [23:18:49] 6Operations, 10Mail: fr-all fails with error 451 - https://phabricator.wikimedia.org/T129168#2097363 (10bbogaert) Hi Alex, I believe that might be it. This would make sense if the MX server queries against the businessCategory. I changed the businessCategory for cn=fundraising_vmail to businessCategory=fundra... 
[23:20:04] twentyafterfour: heads-up for the train: if you get loads of MediaWiki warnings about sessions being used when they shouldn't, that's due to https://gerrit.wikimedia.org/r/#/c/273372/ [23:20:14] it shouldn't happen, but just in case [23:20:23] the patch is a no-op apart from logging [23:20:38] tgr: and if it does happen, revert that change? [23:20:41] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/275703, now for realz (duration: 02m 22s) [23:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:03] if the log traffic is problematic then yes [23:21:15] RoanKattouw: https://gerrit.wikimedia.org/r/275710 [23:30:43] (03PS1) 10Yuvipanda: uwsgi: Don't attempt to stop service on non-jessie systems [puppet] - 10https://gerrit.wikimedia.org/r/275713 [23:30:59] (03PS2) 10Yuvipanda: uwsgi: Don't attempt to stop service on non-jessie systems [puppet] - 10https://gerrit.wikimedia.org/r/275713 [23:33:45] (03CR) 10Yuvipanda: [C: 032] uwsgi: Don't attempt to stop service on non-jessie systems [puppet] - 10https://gerrit.wikimedia.org/r/275713 (owner: 10Yuvipanda) [23:34:37] (03PS1) 10Andrew Bogott: Include a designate policy file for Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/275715 [23:34:38] !log Updating mw1099 (which is depooled) to HHVM 3.12 [23:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:48] (03PS2) 10Andrew Bogott: Include a designate policy file for Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/275715 [23:35:22] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [23:37:08] (03CR) 10Andrew Bogott: [C: 032] Include a designate policy file for Horizon. 
[puppet] - 10https://gerrit.wikimedia.org/r/275715 (owner: 10Andrew Bogott) [23:37:11] PROBLEM - Apache HTTP on mw1099 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.007 second response time [23:37:12] PROBLEM - HHVM rendering on mw1099 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.013 second response time [23:37:25] that's me, it's not pooled [23:39:42] PROBLEM - DPKG on mw1099 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:39:42] PROBLEM - HHVM processes on mw1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [23:39:53] Oh that's why my sync-file is hanging then? [23:40:31] No, it's mw2212 [23:40:55] !log catrope@tin Synchronized php-1.27.0-wmf.15/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.init.js: Remember editor preference in WikiEditor too (duration: 02m 21s) [23:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:11] ssh: connect to host mw2212.codfw.wmnet port 22: Connection timed out [23:41:31] RoanKattouw: Yup, working. 
[23:42:34] Awesome [23:43:11] RECOVERY - DPKG on mw1099 is OK: All packages OK [23:43:12] RECOVERY - HHVM processes on mw1099 is OK: PROCS OK: 6 processes with command name hhvm [23:44:01] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.135 second response time [23:44:11] RECOVERY - HHVM rendering on mw1099 is OK: HTTP OK: HTTP/1.1 200 OK - 72873 bytes in 2.402 second response time [23:45:11] !log ssh: connect to host mw2212.codfw.wmnet port 22: Connection timed out [23:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:57:31] 6Operations: mw2212 unresponsive - https://phabricator.wikimedia.org/T129188#2097531 (10greg) [23:58:22] greg-g: thanks [23:59:01] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [23:59:42] * James_F is here.