[00:01:23] (03CR) 10Dzahn: [C: 032] "https://meta.wikimedia.org/w/index.php?title=Requests_for_new_languages%2FWikipedia_South_Azerbaijani&type=revision&diff=12707639&oldid=12" [dns] - 10https://gerrit.wikimedia.org/r/225830 (https://phabricator.wikimedia.org/T106305) (owner: 10Mjbmr) [00:02:02] !log DNS update - adding language "azb" to langlist [00:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:06:29] 6operations, 10Wikimedia-Apache-configuration, 5Patch-For-Review: nb subdomain redirects - https://phabricator.wikimedia.org/T86924#1464445 (10Krenair) 5Open>3Resolved [00:10:46] Krenair: thanks! [00:11:03] yw [00:15:00] should the arbcom wikis have ".m." too? [00:15:30] (find some reason to touch wikipedia.org zone in DNS :) [00:16:24] mutante, I'd expect so [00:16:30] lfaraone would be able to confirm [00:16:49] (arbcom-en member) [00:17:04] https://arbcom-en.wikipedia.org/wiki/?useformat=mobile works so MF is not broken [00:17:30] mutante: I have zero need for that functionality, but... I guess, sure, why not? [00:18:09] Although the Special:History link there is broken. I guess it ignores the page whitelist [00:18:15] That's an MF bug I guess [00:19:14] t'd apply to all private sites, e.g. https://office.m.wikimedia.org/wiki/Special:History/Main_Page is broken as well [00:21:25] lfaraone: Krenair: ok, thanks, i'll upload a patch for that [00:21:56] if you wonder why .. it also solves https://phabricator.wikimedia.org/T106305#1464443 :) [00:22:17] sure i could just add a . or something but why not make a useful edit [00:30:24] PROBLEM - puppet last run on mw2120 is CRITICAL puppet fail [00:58:26] RECOVERY - puppet last run on mw2120 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [02:03:24] !log LocalisationUpdate failed (1.26wmf14) at 2015-07-20 02:03:24+00:00 [02:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:05:34] failed? 
[02:07:34] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jul 20 02:07:34 UTC 2015 (duration 7m 33s) [02:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:23:36] 6operations, 7Database, 5Patch-For-Review: db1022 duplicate key errors - https://phabricator.wikimedia.org/T105879#1464512 (10Springle) @jcrespo, ouch, so maybe something is broken in our physical backup process (was it xtrabackup?) that required the SQL_SLAVE_SKIP_COUNTER after reinstall on 2015-06-29? Scar... [02:24:21] !log l10nupdate Synchronized php-1.26wmf14/cache/l10n: (no message) (duration: 07m 07s) [02:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:03] !log LocalisationUpdate completed (1.26wmf14) at 2015-07-20 02:28:03+00:00 [02:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:55] 6operations, 7Database: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1464514 (10Springle) [02:38:19] 6operations, 7Database: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1464517 (10Springle) All s1-7 slaves are now 10.0. >>! In T105135#1437745, @jcrespo wrote: >> Which shard should we do first? > All shards are production, doesn't matter. The important thing is having a plan t... 
[02:49:27] 6operations, 7Database: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#1464530 (10Springle) 3NEW [02:49:56] 6operations, 7Database: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#1464537 (10Springle) [02:56:41] (03PS1) 10Alex Monk: Disable a bunch of extensions on loginwiki/votewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225840 (https://phabricator.wikimedia.org/T61702) [03:02:10] (03PS2) 10Alex Monk: Disable a bunch of extensions on loginwiki/votewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225840 (https://phabricator.wikimedia.org/T61702) [03:08:24] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [03:10:15] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [03:24:08] (03CR) 10Alex Monk: "Hoo man: Ping." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/118654 (https://phabricator.wikimedia.org/T58169) (owner: 10Gerrit Patch Uploader) [03:42:40] (03CR) 10GWicke: "Ping!" 
[puppet] - 10https://gerrit.wikimedia.org/r/219253 (owner: 10GWicke) [05:32:31] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jul 20 05:32:31 UTC 2015 (duration 32m 30s) [05:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:32:05] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures [06:32:15] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 1 failures [06:32:45] PROBLEM - puppet last run on mw2158 is CRITICAL Puppet has 1 failures [06:32:45] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 1 failures [06:33:04] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures [06:56:34] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:57:05] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:57:05] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:57:24] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:15] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:22] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 and eventlogging for legoktm - https://phabricator.wikimedia.org/T106184#1464628 (10Joe) a:3Joe [06:59:36] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 and eventlogging for legoktm - https://phabricator.wikimedia.org/T106184#1464631 (10Joe) We will need your manager approval apparently, after that we're GTG I guess :) [07:06:19] 10Ops-Access-Requests, 6operations, 6Reading-Admin: Requesting access to stat1002 (Hadoop / HDFS / Hue) for tbayer - https://phabricator.wikimedia.org/T105748#1464648 (10Joe) Hi, we'll need manager approval of course. 
[07:06:28] <_joe_> wow a weekend of access requests [07:10:08] 10Ops-Access-Reviews, 6operations: Review: access to stat1002 for tbayer - https://phabricator.wikimedia.org/T106317#1464649 (10Joe) [07:17:41] (03PS1) 10Muehlenhoff: Add some references [debs/linux] - 10https://gerrit.wikimedia.org/r/225847 [07:26:38] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add some references [debs/linux] - 10https://gerrit.wikimedia.org/r/225847 (owner: 10Muehlenhoff) [07:30:05] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1464665 (10Joe) Imaging wmf3543 as wdqs1001 [07:30:05] PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail [07:42:22] (03PS1) 10Giuseppe Lavagetto: wdqs: install as jessie, add partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/225848 (https://phabricator.wikimedia.org/T86561) [07:45:48] (03PS2) 10Giuseppe Lavagetto: wdqs: install as jessie, add partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/225848 (https://phabricator.wikimedia.org/T86561) [07:48:05] _joe_: hi! are you still working on https://gerrit.wikimedia.org/r/#/c/223663/ ? [07:49:10] <_joe_> SMalyshev: I will in a few [07:49:20] <_joe_> first I want to install the server though [07:49:28] _joe_: great, thanks [07:49:45] <_joe_> my plan is install server => give you access => apply the role and see what needs adjusting [07:50:15] <_joe_> then we're left with setting up LVS, basically :) [07:50:44] ok, cool [07:50:47] <_joe_> SMalyshev: wdqs needs to be directly accessible from the internet, or it will be queried via a mediawiki extension? [07:50:57] <_joe_> or both? [07:50:58] <_joe_> :) [07:51:17] _joe_: btw, I understand I need some permissions on tin for that? or on deploy beta cluster? or both? [07:51:58] <_joe_> yes, I'll take a look at all that [07:52:02] _joe_: nginx end should be accessible, but may not be necessarily directly accessible, i.e.
we may put varnish or something in front of it. [07:52:18] <_joe_> SMalyshev: yeah we'll see [07:52:38] _joe_: I'm not sure if it worth it to cache SPARQL responses (probably not, it'd do more harm than good) but caching static stuff is fine [07:52:59] <_joe_> SMalyshev: I'll try to do as much as possible today, as I need to get back to other things as well [07:53:16] I'm going to bed soon but if you need anything from me send me an email and I'll look at it first thing tomorrow [07:53:17] <_joe_> I wanted to give you a working installation in prod so you can fix things there [07:53:21] <_joe_> ok [07:53:31] <_joe_> I'll send you an email if I need something :) [07:55:32] (03PS3) 10Giuseppe Lavagetto: wdqs: install as jessie, add partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/225848 (https://phabricator.wikimedia.org/T86561) [07:56:15] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [08:00:43] (03PS4) 10Giuseppe Lavagetto: wdqs: install as jessie, add partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/225848 (https://phabricator.wikimedia.org/T86561) [08:01:24] (03CR) 10Giuseppe Lavagetto: [C: 032] wdqs: install as jessie, add partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/225848 (https://phabricator.wikimedia.org/T86561) (owner: 10Giuseppe Lavagetto) [08:09:46] (03PS1) 10Muehlenhoff: Update to 3.19.8-ckt3 [debs/linux] - 10https://gerrit.wikimedia.org/r/225850 [08:13:19] hi together :) A short question: Are CAPTCHA's for FancyCaptcha on wmf wikis manually created? Is there any log, when they are re-generated the last time? [08:23:10] (03CR) 10DCausse: [C: 032] Upgrade swift repository [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/225483 (owner: 10Manybubbles) [08:27:53] 6operations, 10Wikimedia-DNS: DNS zones do not get re-generated when adding new language - https://phabricator.wikimedia.org/T84684#1464686 (10Glaisher) Duplicate of T97051? 
[08:28:20] (03CR) 10Filippo Giunchedi: [C: 031] "what's blocking this to get a systemd service file instead?" [puppet] - 10https://gerrit.wikimedia.org/r/225836 (owner: 10GWicke) [08:43:00] 6operations, 7Database, 5Patch-For-Review: db1022 duplicate key errors - https://phabricator.wikimedia.org/T105879#1464692 (10jcrespo) @springle I did not even use xtrabackup: `stop slave; service mysql stop; tar; gz; nc; md5sum; reinstall; nc; untar; mysqld_safe; mysql_upgrade` Please note that the SKIP w... [08:45:24] (03CR) 10Glaisher: "I don't see a point in keeping CharInsert extension here as we don't edit this wiki." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225840 (https://phabricator.wikimedia.org/T61702) (owner: 10Alex Monk) [08:46:39] 10Ops-Access-Reviews, 6operations: Review access to stat1003, eventlogging for legoktm - https://phabricator.wikimedia.org/T106315#1464693 (10fgiunchedi) [08:47:03] (03CR) 10Glaisher: "loginwiki (maybe we can keep it on votewiki)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225840 (https://phabricator.wikimedia.org/T61702) (owner: 10Alex Monk) [09:10:30] 6operations, 7Database: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1464712 (10jcrespo) Things I think they are a big issues: * Multisource replication (it does not work well, bugs with parallel?) * TokuDB (the versions we use have very important bugs, but I do not know if all... 
[09:11:27] (03PS2) 10Glaisher: Remove several dead domains from redirects [puppet] - 10https://gerrit.wikimedia.org/r/225041 (https://phabricator.wikimedia.org/T105981) [09:12:09] (03CR) 10Glaisher: Remove several dead domains from redirects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/225041 (https://phabricator.wikimedia.org/T105981) (owner: 10Glaisher) [09:17:36] (03PS1) 10Muehlenhoff: Add ferm rules for dbproxy hosts [puppet] - 10https://gerrit.wikimedia.org/r/225851 (https://phabricator.wikimedia.org/T104699) [09:26:41] (03PS1) 10Gilles: Fix ICO MIME regexps [puppet] - 10https://gerrit.wikimedia.org/r/225852 (https://phabricator.wikimedia.org/T63443) [09:26:43] 10Ops-Access-Requests, 6operations: Login for jkrauska to librenms - https://phabricator.wikimedia.org/T101064#1464729 (10fgiunchedi) [09:27:28] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Sudo permissions for hdfs user madhuvishy on analytics-hadoop - https://phabricator.wikimedia.org/T104020#1464736 (10fgiunchedi) [09:28:30] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 and eventlogging for legoktm - https://phabricator.wikimedia.org/T106184#1464738 (10fgiunchedi) p:5Triage>3Normal [09:28:34] 10Ops-Access-Requests, 6operations: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1464739 (10fgiunchedi) p:5Triage>3Normal [09:28:36] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations: Provide hoo (Marius Hoch) with Hive access - https://phabricator.wikimedia.org/T106045#1464740 (10fgiunchedi) p:5Triage>3Normal [09:28:39] 10Ops-Access-Requests, 6operations, 6Reading-Admin: Requesting access to stat1002 (Hadoop / HDFS / Hue) for tbayer - https://phabricator.wikimedia.org/T105748#1464741 (10fgiunchedi) p:5Triage>3Normal [09:28:41] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Sudo permissions for hdfs user madhuvishy on analytics-hadoop - https://phabricator.wikimedia.org/T104020#1464742 (10fgiunchedi) p:5Triage>3Normal 
[09:35:33] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1464744 (10fgiunchedi) @hashar still blocked on operations? [09:37:37] 6operations, 6Discovery, 10Maps, 6Services, and 2 others: Puppetize Kartotherian & Tilerator for deployment - https://phabricator.wikimedia.org/T105074#1464746 (10fgiunchedi) a:3akosiaris [09:37:56] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: consider moving Cassandra to G1GC in production - https://phabricator.wikimedia.org/T103161#1464749 (10fgiunchedi) a:3fgiunchedi [09:39:36] 6operations, 5Patch-For-Review: Ferm rules for elasticsearch - https://phabricator.wikimedia.org/T104962#1464750 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [09:46:18] 6operations, 10ops-eqiad: wmf3543 can't install from PXE - https://phabricator.wikimedia.org/T106320#1464755 (10Joe) 3NEW [09:48:37] 6operations, 5Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#1464762 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [09:48:55] 6operations, 10ops-eqiad: wmf3543 can't install from PXE - https://phabricator.wikimedia.org/T106320#1464764 (10Joe) p:5Triage>3Normal [09:50:05] 6operations, 10ops-eqiad: wmf3543 can't install from PXE - https://phabricator.wikimedia.org/T106320#1464755 (10Joe) [09:50:07] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1464766 (10Joe) [09:53:02] 6operations, 10Citoid, 6Services, 10VisualEditor, 3VisualEditor 2015/16 Q1 blockers: Citoid is blacklisted from ncbi.nlm.nih.gov - https://phabricator.wikimedia.org/T106044#1464768 (10Joe) yes, we need to change our UA, as the other side has gone completely silent since I reiterated we were not seeing an... 
[09:56:51] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1464769 (10hashar) 5Open>3declined a:3hashar Nop, we will migrate to provide jsduck via rubygems.org . Saves us the t... [09:58:35] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1464773 (10hashar) 5declined>3Open Reopening because we need some cleanup actions. We need to remove on our apt for jes... [10:09:50] 6operations, 5Patch-For-Review: Ferm rules for backup roles - https://phabricator.wikimedia.org/T104996#1464789 (10fgiunchedi) a:3Dzahn [10:13:27] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Ferm rules for mailman - https://phabricator.wikimedia.org/T104980#1464791 (10fgiunchedi) a:3Dzahn [10:15:20] 10Ops-Access-Requests, 6operations, 6Discovery, 10SEO, 3Discovery-Analysis-Sprint: Get Oliver Keyes access to Google Webmaster Tools for all Wikimedia domains - https://phabricator.wikimedia.org/T101157#1464796 (10fgiunchedi) a:5Deskana>3chasemp moving to @chasemp [10:22:26] (03PS1) 10Giuseppe Lavagetto: citoid: change UA to avoid blockage from NIH [puppet] - 10https://gerrit.wikimedia.org/r/225856 (https://phabricator.wikimedia.org/T106044) [10:22:46] <_joe_> mobrovac: ^^ [10:22:58] <_joe_> I'd love your +1 before merging it, though :) [10:23:57] 6operations, 10Citoid, 6Services, 10VisualEditor, and 2 others: Citoid is blacklisted from ncbi.nlm.nih.gov - https://phabricator.wikimedia.org/T106044#1464802 (10Mvolz) I'm not entirely sure I'm comfortable with that for two reasons: One, it's within their rights to block us if they want to, and two, the... 
[10:27:36] 6operations, 10Citoid, 6Services, 10VisualEditor, and 2 others: Citoid is blacklisted from ncbi.nlm.nih.gov - https://phabricator.wikimedia.org/T106044#1464803 (10Joe) @mvolz while I agree you need to change the logic in Citoid, changing the UA (by still making it evident it's a) a bot b) used for wikimedi... [10:33:03] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Can't download large datasets from datasets.wikimedia.org - https://phabricator.wikimedia.org/T104004#1464806 (10fgiunchedi) confirmed this is still a problem, I think what's happening is that we're no longer caching in varnish but it will still try to fet... [10:33:24] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Can't download large datasets from datasets.wikimedia.org - https://phabricator.wikimedia.org/T104004#1464807 (10fgiunchedi) p:5Triage>3Normal [10:34:04] (03PS1) 10Giuseppe Lavagetto: hhvm: expire APC keys after 7 days [puppet] - 10https://gerrit.wikimedia.org/r/225858 (https://phabricator.wikimedia.org/T104769) [10:35:46] 6operations, 10Deployment-Systems, 5Patch-For-Review: Trebuchet doesn't like when a deployer server is also a minion, a edge case for scap - https://phabricator.wikimedia.org/T67549#1464811 (10fgiunchedi) @bd808 @arielglenn thoughts on this? [10:40:02] 7Puppet, 6operations, 6Discovery, 10Wikidata, and 2 others: Make a puppet role that sets up a query service and loads it - https://phabricator.wikimedia.org/T95679#1464813 (10fgiunchedi) a:3GLavagetto moving to @joe [10:41:55] 7Puppet, 6operations, 6Discovery, 10Wikidata, and 2 others: Make a puppet role that sets up a query service and loads it - https://phabricator.wikimedia.org/T95679#1464815 (10fgiunchedi) a:5GLavagetto>3Joe [10:44:49] 6operations, 10Deployment-Systems, 5Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#1464818 (10fgiunchedi) @arielglenn what's the status? 
[10:46:12] 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1464820 (10fgiunchedi) a:3Dzahn [10:49:44] 6operations, 10Beta-Cluster, 5Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#1464823 (10fgiunchedi) @yuvipanda what's left on this task? also is it blocked on anything? [10:50:18] 6operations, 10Traffic, 5Patch-For-Review, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1464827 (10fgiunchedi) a:3BBlack [10:51:50] 6operations, 6Services, 5Patch-For-Review, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1464836 (10fgiunchedi) a:3Joe [10:57:03] 6operations, 6Services, 3Mobile-Content-Service, 7service-deployment-requests: New Service Request mobileapps - https://phabricator.wikimedia.org/T105538#1464840 (10fgiunchedi) [10:57:06] 6operations, 6Mobile-Apps, 6Services, 3Mobile-Content-Service, 5Patch-For-Review: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1464839 (10fgiunchedi) [10:58:36] 6operations, 6Mobile-Apps, 6Services, 3Mobile-Content-Service, 5Patch-For-Review: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1464841 (10fgiunchedi) a:3bearND [11:02:44] 7Puppet, 6operations, 5Patch-For-Review: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1464847 (10fgiunchedi) current status: puppet lenient is voting, strict isn't yet. Is it worth to chase strict warnings too? 
[11:39:48] 6operations, 5Patch-For-Review: Install fonts-wqy-zenhei on all mediawiki app servers - https://phabricator.wikimedia.org/T84777#1464920 (10fgiunchedi) I think this works out the same, `ttf-wqy-zenhei` was renamed to `fonts-wqy-zenhei` in trusty/jessie ``` Package: fonts-wqy-zenhei Version: 0.9.45-5ubuntu1... [11:46:20] 6operations, 5Patch-For-Review: Mediawiki font packages: switch to Jessie - https://phabricator.wikimedia.org/T102623#1464924 (10fgiunchedi) a:3Dzahn [11:51:08] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1090 has a read-only filesystem - https://phabricator.wikimedia.org/T105835#1464929 (10fgiunchedi) a:3Cmjohnson looks like the disk is gone, @cmjohnson please swap it, machine has been depooled ``` [19329848.396526] ata1.00: error: { UNC } [19329848.582621] a... [11:54:43] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1464931 (10fgiunchedi) a:3BBlack [11:55:11] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Drop AES-256 mid/compat lists. 
- https://phabricator.wikimedia.org/T105716#1464933 (10fgiunchedi) a:3BBlack [12:15:14] (03PS4) 10Giuseppe Lavagetto: ganglia_new: switch ULSFO to new by default [puppet] - 10https://gerrit.wikimedia.org/r/225276 (owner: 10Dzahn) [12:15:30] (03CR) 10Giuseppe Lavagetto: [C: 032] ganglia_new: switch ULSFO to new by default [puppet] - 10https://gerrit.wikimedia.org/r/225276 (owner: 10Dzahn) [12:38:33] (03PS1) 10ArielGlenn: make xml{stubs,abstracts,logs}.py behave as dumpBackup.php does [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/225862 [13:05:40] godog: bongiorno you might want to remove the ruby-rkelly-remix and ruby-jsduck packages from jessie-wikimedia :D (ref: https://phabricator.wikimedia.org/T95008#1464773 ) [13:07:03] hashar: hey, sure sounds easy enough [13:07:25] godog: we will eventually look at using jsduck from rubygems instead :D [13:08:47] hashar: you still have ruby-jsduck in contint::packages btw [13:08:54] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [13:08:59] godog: yeah will clean it out for Debian [13:09:51] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1465002 (10fgiunchedi) {{done}} ``` root@carbon:~# reprepro remove jessie-wikimedia ruby-jsduck Exporting indices... Delet... 
[13:12:34] 6operations, 7Graphite, 5Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#1465004 (10fgiunchedi) a:3fgiunchedi [13:14:35] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [13:15:58] (03PS1) 10Hashar: contint: drop jsduck from Debian slaves [puppet] - 10https://gerrit.wikimedia.org/r/225866 (https://phabricator.wikimedia.org/T95008) [13:16:37] godog: dropping ruby-jsduck from Debian Jenkins slaves is https://gerrit.wikimedia.org/r/225866 :D [13:17:18] (03CR) 10Hashar: "For CI, we will have to migrate the jobs to use bundler/Gemfile. Not the end of the world :)" [puppet] - 10https://gerrit.wikimedia.org/r/225866 (https://phabricator.wikimedia.org/T95008) (owner: 10Hashar) [13:17:24] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1465010 (10fgiunchedi) from the announcement this might land in stretch, since we were already usi... [13:18:55] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] contint: drop jsduck from Debian slaves [puppet] - 10https://gerrit.wikimedia.org/r/225866 (https://phabricator.wikimedia.org/T95008) (owner: 10Hashar) [13:19:33] hashar: yup, merged [13:20:44] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1465013 (10brion) As long as whatever we switch to supports VP9 and Opus in the build I don't care... 
[13:21:13] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review: Preload HSTS - https://phabricator.wikimedia.org/T104244#1465019 (10fgiunchedi) a:3BBlack [13:21:50] 6operations, 5Patch-For-Review, 7discovery-system: Ensure alerts and notifications on confd failure modes - https://phabricator.wikimedia.org/T103360#1465021 (10fgiunchedi) a:3Joe [13:25:03] 6operations, 5Patch-For-Review: Degraded RAID-1 arrays on new logstash hosts: [UU__] - https://phabricator.wikimedia.org/T98620#1465026 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi fixed [13:25:16] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1465031 (10hashar) 5Open>3Resolved It is gone now. Thanks! [13:25:54] PROBLEM - Host labnet1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:30:18] (03CR) 10Mobrovac: [C: 031] citoid: change UA to avoid blockage from NIH [puppet] - 10https://gerrit.wikimedia.org/r/225856 (https://phabricator.wikimedia.org/T106044) (owner: 10Giuseppe Lavagetto) [13:30:48] 6operations, 5Patch-For-Review: Allow rsync traffic between analytics VLAN and fluorine - https://phabricator.wikimedia.org/T99245#1465036 (10fgiunchedi) what's the status on this? also since we're deprecating udp2log this should go away shortly after? 
[13:32:05] 6operations, 7user-notice: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1465040 (10fgiunchedi) [13:32:07] 6operations, 10Wikimedia-IRC, 7Ipv6, 5Patch-For-Review: enable IPv6 on irc.wikimedia.org - https://phabricator.wikimedia.org/T105422#1465039 (10fgiunchedi) [13:32:41] 6operations, 7user-notice: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1451943 (10fgiunchedi) [13:33:42] 6operations, 10Traffic, 5Patch-For-Review, 7discovery-system, 5services-tooling: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1465044 (10fgiunchedi) a:3Joe [13:34:45] RECOVERY - Host labnet1002 is UPING OK - Packet loss = 0%, RTA = 0.80 ms [13:36:54] 7Puppet: Write, publish and deploy puppet-lint plug-in for ensure attribute bareword check - https://phabricator.wikimedia.org/T95377#1465050 (10fgiunchedi) [13:36:56] 7Puppet, 6operations, 5Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908#1465049 (10fgiunchedi) [13:37:29] 7Puppet, 6operations, 5Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908#1098540 (10fgiunchedi) I've blocked this with {T95377} since that's where we want it fixed in the future [13:40:47] 6operations, 10ops-eqiad: db1058 (s5 master) degraded RAID - https://phabricator.wikimedia.org/T105627#1465055 (10Cmjohnson) Requested a new hard drive Congratulations: Work Order SR914069620 was successfully submitted. [13:41:00] (03CR) 10Mobrovac: [C: 031] "LGTM, but I agree with Filippo that we should switch to systemd. 
Deployment-prep runs on Jessie, and the mw-vagrant puppet module flavour " [puppet] - 10https://gerrit.wikimedia.org/r/225836 (owner: 10GWicke) [13:42:02] 6operations: Make ircecho run as its own user - https://phabricator.wikimedia.org/T76203#1465057 (10fgiunchedi) [13:42:16] 6operations, 10ops-eqiad: install 10g NIC card to labnet1002 - https://phabricator.wikimedia.org/T103849#1465059 (10Cmjohnson) Replaced the card and still having the same result. There has to be something I am missing. The server is installed on via the 1Gb link and is accessible if you want to take a look. [13:45:51] 6operations, 10Deployment-Systems, 5Patch-For-Review: Trebuchet doesn't like when a deployer server is also a minion, a edge case for scap - https://phabricator.wikimedia.org/T67549#1465060 (10ArielGlenn) fine to test on deployment-prep; if it works there it's good for prod. [14:10:26] 6operations, 6Discovery: Rollout CirrusSearch to codfw as a backup DC - https://phabricator.wikimedia.org/T105711#1465079 (10fgiunchedi) p:5Triage>3Normal [14:11:35] 6operations, 10Wikimedia-Logstash: Update Elasticsearch on logstash* - https://phabricator.wikimedia.org/T106126#1465081 (10fgiunchedi) p:5Triage>3High [14:11:53] 6operations: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099#1465083 (10fgiunchedi) p:5Triage>3Normal [14:17:22] 6operations: Track source of packages in reprepro - https://phabricator.wikimedia.org/T105385#1465087 (10fgiunchedi) p:5Triage>3Low [14:19:06] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1465090 (10JohnLewis) Mailman is now compatible with labs. All I can see with Debian Jessie is an apache change due to a version upgrade. So marking point one as done unless there is anyth... 
[14:19:45] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) - https://phabricator.wikimedia.org/T105756#1465091 (10JohnLewis) [14:20:19] 6operations: Track source of packages in reprepro - https://phabricator.wikimedia.org/T105385#1465094 (10fgiunchedi) agreed it would be nice, one way to do this would be to map the source package name with the expected component. Then we can audit what's in reprepro and upload packages in their expected component [14:21:09] (03CR) 10Manybubbles: "So this repository doesn't have a Jenkins hookup for it to V+2. And in any case C+2 here has a special meaning: "I'm deploying this to pro" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/225483 (owner: 10Manybubbles) [14:21:43] 6operations: Remove poolcounter from mw1154 for housecleaning - https://phabricator.wikimedia.org/T105380#1465095 (10fgiunchedi) p:5Triage>3Low [14:22:17] (03PS1) 10Giuseppe Lavagetto: ganglia: use ganglia_new by default, disable otherwise [puppet] - 10https://gerrit.wikimedia.org/r/225872 [14:22:19] (03PS1) 10Giuseppe Lavagetto: ganglia: move ganglia::plugins::python to the correct location [puppet] - 10https://gerrit.wikimedia.org/r/225873 [14:22:21] (03PS1) 10Giuseppe Lavagetto: ganglia: move ganglia::collector to ganglia::deprecated::collector [puppet] - 10https://gerrit.wikimedia.org/r/225874 [14:22:23] (03PS1) 10Giuseppe Lavagetto: ganglia: remove ganglia::aggregator [puppet] - 10https://gerrit.wikimedia.org/r/225875 [14:22:25] (03PS1) 10Giuseppe Lavagetto: ganglia: remove unused files [puppet] - 10https://gerrit.wikimedia.org/r/225876 [14:22:27] (03PS1) 10Giuseppe Lavagetto: ganglia: remove references to ganglia::cname [puppet] - 10https://gerrit.wikimedia.org/r/225877 [14:22:29] (03PS1) 10Giuseppe Lavagetto: ganglia: remove ganglia_class conditionals [puppet] - 10https://gerrit.wikimedia.org/r/225878 [14:22:31] (03PS1) 10Giuseppe Lavagetto: ganglia: remove ganglia.pp [puppet] - 
10https://gerrit.wikimedia.org/r/225879 [14:22:33] (03PS1) 10Giuseppe Lavagetto: ganglia: standardize has_ganglia [puppet] - 10https://gerrit.wikimedia.org/r/225880 [14:22:35] (03PS1) 10Giuseppe Lavagetto: ganglia: rename ganglia_new to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/225881 [14:22:38] 6operations: Remove poolcounter from mw1154 for housecleaning - https://phabricator.wikimedia.org/T105380#1465097 (10Joe) 5Open>3Resolved a:3Joe [14:22:45] 6operations: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1465100 (10fgiunchedi) p:5Triage>3High [14:22:51] 6operations: Remove poolcounter from mw1154 for housecleaning - https://phabricator.wikimedia.org/T105380#1442381 (10Joe) I reimaged this server last week. [14:25:05] 6operations, 10OTRS: upgrade iodine to jessie or find a new host with jessie for OTRS - https://phabricator.wikimedia.org/T105125#1465103 (10fgiunchedi) p:5Triage>3Normal [14:25:29] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Assign varnish memory-only role to maps servers - https://phabricator.wikimedia.org/T105076#1465107 (10fgiunchedi) p:5Triage>3Normal [14:26:04] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1465110 (10fgiunchedi) p:5Triage>3Normal [14:27:14] (03PS2) 10Giuseppe Lavagetto: citoid: change UA to avoid blockage from NIH [puppet] - 10https://gerrit.wikimedia.org/r/225856 (https://phabricator.wikimedia.org/T106044) [14:27:56] 6operations: Add Ferm rules for snapshot hosts - https://phabricator.wikimedia.org/T104991#1465124 (10fgiunchedi) p:5Triage>3Normal [14:27:59] 6operations, 10Wikimedia-Stream: Ferm rules for rcstream - https://phabricator.wikimedia.org/T104981#1465125 (10fgiunchedi) p:5Triage>3Normal [14:28:02] 6operations, 7Mail: Ferm rules for MX mail servers - 
https://phabricator.wikimedia.org/T104979#1465126 (10fgiunchedi) p:5Triage>3Normal [14:28:04] 6operations, 10OCG-General-or-Unknown: Ferm rules for ocg hosts - https://phabricator.wikimedia.org/T104976#1465127 (10fgiunchedi) p:5Triage>3Normal [14:28:06] 6operations: Ferm rules for job runners - https://phabricator.wikimedia.org/T104972#1465128 (10fgiunchedi) p:5Triage>3Normal [14:28:08] 6operations: Ferm rules for image scalers - https://phabricator.wikimedia.org/T104969#1465129 (10fgiunchedi) p:5Triage>3Normal [14:28:10] 6operations: Ferm rules for app servers - https://phabricator.wikimedia.org/T104968#1465130 (10fgiunchedi) p:5Triage>3Normal [14:28:12] 6operations: Ferm rules for postgres roles / labsdb - https://phabricator.wikimedia.org/T104960#1465132 (10fgiunchedi) p:5Triage>3Normal [14:31:29] 6operations, 10Analytics-Cluster: Can't download large datasets from datasets.wikimedia.org - https://phabricator.wikimedia.org/T104004#1465137 (10fgiunchedi) [14:33:40] (03CR) 10Giuseppe Lavagetto: [C: 032] citoid: change UA to avoid blockage from NIH [puppet] - 10https://gerrit.wikimedia.org/r/225856 (https://phabricator.wikimedia.org/T106044) (owner: 10Giuseppe Lavagetto) [14:41:09] 6operations, 10ops-eqiad, 10Analytics-Cluster: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1465156 (10Cmjohnson) analytics1042-1045 are racked and ready for install in row D2. Racktables has been updated. analytics1045.mgmt.eqiad.wmnet has address 10.65.4.17 analytics1044.mgmt.e... [14:46:01] 6operations, 10Deployment-Systems, 5Patch-For-Review: Trebuchet doesn't like when a deployer server is also a minion, a edge case for scap - https://phabricator.wikimedia.org/T67549#1465168 (10fgiunchedi) p:5High>3Normal doubtful this is high priority, @thcipriani looks good to test on deployment-prep an... 
[14:51:21] 6operations: irc bots should send NOTICE not PRIVMSG - https://phabricator.wikimedia.org/T101575#1465171 (10fgiunchedi) [14:51:57] 6operations, 7Graphite: graphite2001 OOM and unresponsive - https://phabricator.wikimedia.org/T101572#1465173 (10fgiunchedi) p:5High>3Normal [14:52:27] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: consider moving Cassandra to G1GC in production - https://phabricator.wikimedia.org/T103161#1465175 (10fgiunchedi) p:5Triage>3Normal [14:53:49] 6operations: Document new platform specific doc for Dell Poweredge RN30 systems - https://phabricator.wikimedia.org/T101288#1465182 (10Cmjohnson) 5Open>3Resolved Updated the page to add the 13th Gen servers RN30's. The only setup change was leaving the serial console to auto [14:55:45] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [14:55:53] <_joe_> mobrovac: ^^ [14:55:57] <_joe_> :) [14:56:08] yaaaay [14:56:33] _joe_: now you can close / de-prioritise the ticket [14:56:42] probably the latter [14:56:52] <_joe_> mobrovac: just did [14:56:55] :) [14:57:44] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [14:58:07] 6operations, 6Commons: Commons thumbnail of Pluto photo is broken at 500px - https://phabricator.wikimedia.org/T105793#1465201 (10fgiunchedi) p:5Triage>3Low I can load the image but still worth investigating if it is related to hhvm imagescalers [15:00:04] manybubbles anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150720T1500). Please do the needful. [15:04:50] (03CR) 10DCausse: [C: 031] Upgrade swift repository [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/225483 (owner: 10Manybubbles) [15:10:33] (03CR) 10GWicke: "Lets do one step at a time. 
We'll need to continue supporting the init script for third-party Ubuntu user compatibility, and it looks like" [puppet] - 10https://gerrit.wikimedia.org/r/225836 (owner: 10GWicke) [15:10:55] PROBLEM - Hadoop NodeManager on analytics1040 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:11:37] 6operations, 6Commons: Commons thumbnail of Pluto photo is broken at 500px - https://phabricator.wikimedia.org/T105793#1465226 (10Joe) I took a look at the logs, even tried to re-render the thumbnail at different sizes, but chances are the logs were already lost in the reimaging of either mw1154 or mw1155 whic... [15:12:05] PROBLEM - Hadoop NodeManager on analytics1033 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:12:09] 6operations: Track source of packages in reprepro - https://phabricator.wikimedia.org/T105385#1465227 (10fgiunchedi) the same map could be also used to aid jenkins build debian packages for us [15:13:29] (03PS2) 10Filippo Giunchedi: Don't killall $DAEMON [puppet] - 10https://gerrit.wikimedia.org/r/225836 (owner: 10GWicke) [15:13:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Don't killall $DAEMON [puppet] - 10https://gerrit.wikimedia.org/r/225836 (owner: 10GWicke) [15:14:05] RECOVERY - Hadoop NodeManager on analytics1033 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:18:41] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Release wikimedia-extra plugin for Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106161#1465230 (10Manybubbles) [15:20:14] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1465232 (10fgiunchedi) p:5Triage>3High looks like we need this to finish migrating to hhvm [15:23:12] 
6operations: salt-minion dies if /var is full - https://phabricator.wikimedia.org/T104866#1465241 (10fgiunchedi) p:5Triage>3Normal yup for operations, thanks @krenair ! [15:23:25] 6operations: salt-minion dies if /var is full - https://phabricator.wikimedia.org/T104866#1465243 (10fgiunchedi) a:3ArielGlenn [15:23:46] 6operations: on bootup, salt-minion should not start with -d - https://phabricator.wikimedia.org/T104867#1465244 (10fgiunchedi) a:3ArielGlenn [15:23:56] 6operations: on bootup, salt-minion should not start with -d - https://phabricator.wikimedia.org/T104867#1430288 (10fgiunchedi) p:5Triage>3Normal [15:24:16] 6operations: Track systems/roles for which intentionally no firewall rules are applied - https://phabricator.wikimedia.org/T104958#1465247 (10fgiunchedi) p:5Triage>3Normal [15:24:24] RECOVERY - Hadoop NodeManager on analytics1040 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:26:57] 6operations, 6Mobile-Apps, 6Services, 3Mobile-Content-Service, 5Patch-For-Review: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1465250 (10Joe) [15:34:33] (03CR) 10Mobrovac: "> We'll need to continue supporting the init script for third-party Ubuntu user compatibility" [puppet] - 10https://gerrit.wikimedia.org/r/225836 (owner: 10GWicke) [15:37:49] 6operations, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1465266 (10jcrespo) a:5Springle>3jcrespo First thing we need to determine is the future of this service: should new clusters or shards be added from the application point of view or should we be 100% transpa... 
[15:39:35] 6operations, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1465268 (10jcrespo) p:5Triage>3Normal [15:40:52] 6operations, 10ops-eqiad, 7Database, 5Patch-For-Review: Remove db1002-db1007 from production - https://phabricator.wikimedia.org/T105768#1465272 (10jcrespo) [15:40:53] 6operations, 7Database: db1002-db1007 - decom or repurpose? - https://phabricator.wikimedia.org/T103005#1465271 (10jcrespo) [15:47:49] (03CR) 10Jcrespo: [C: 031] "Shouldn't we just wait and delete the role completelly, once the masters have been migrated?" [puppet] - 10https://gerrit.wikimedia.org/r/224558 (owner: 10John F. Lewis) [15:48:50] (03CR) 10Jcrespo: [C: 04-1] "Ok with the change, only -1 to make sure this is the LAST commit we do related to 1002-1007." [dns] - 10https://gerrit.wikimedia.org/r/224560 (https://phabricator.wikimedia.org/T105768) (owner: 10John F. Lewis) [15:50:45] (03PS1) 10Filippo Giunchedi: admin: add hoo to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/225902 [15:51:14] (03PS2) 10Filippo Giunchedi: admin: add hoo to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/225902 (https://phabricator.wikimedia.org/T106045) [15:51:59] 6operations, 7Database: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1465294 (10jcrespo) p:5Triage>3Normal [16:00:44] 10Ops-Access-Requests, 6operations: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1465304 (10fgiunchedi) @tjones we'll have also to provision your shell user across the cluster, to do that we'll need the following information from https://wikitech.wikimedia.org/wiki/Requesting_she... 
[16:00:56] 6operations, 10RESTBase, 10RESTBase-Cassandra: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1465308 (10Eevans) From an IRC conversation today with Moritz regarding the use of firejail for this purpose: > 11:30 if not sure if it's the most viable opt... [16:07:11] 6operations, 7Database: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1465322 (10jcrespo) I am trying to get a list of blockers (we can edit the task description): [] Prepare a deployment and rollback plan [] Perform a table checksum on all affected servers [] Find the best node... [16:14:51] 6operations, 7Database: install/deploy dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86958#1465348 (10jcrespo) [16:16:56] 6operations, 10RESTBase, 10RESTBase-Cassandra: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1465354 (10mobrovac) >>! In T95253#1465308, @Eevans wrote: > From an IRC conversation today with Moritz regarding the use of firejail for this purpose: >> 11:30 6operations, 10RESTBase, 10RESTBase-Cassandra: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1465372 (10GWicke) >>! In T95253#1465354, @mobrovac wrote: >>>! In T95253#1465308, @Eevans wrote: >> From an IRC conversation today with Moritz regarding the use of fi... [16:21:39] (03CR) 10Jcrespo: "Also self-note: remove the nodes from icinga and ganglia first if they still reference them." [dns] - 10https://gerrit.wikimedia.org/r/224560 (https://phabricator.wikimedia.org/T105768) (owner: 10John F. 
Lewis) [16:36:34] PROBLEM - Host mw1090 is DOWN: PING CRITICAL - Packet loss = 100% [16:44:33] !log powercycle mw1090, no console no anything [16:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:49:39] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Release swift-repository for 1.7.0 - https://phabricator.wikimedia.org/T106163#1465429 (10Manybubbles) 5Open>3Resolved [16:49:41] 6operations, 10CirrusSearch, 6Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1465430 (10Manybubbles) [16:49:48] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Release wikimedia-extra plugin for Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106161#1465431 (10Manybubbles) 5Open>3Resolved [16:49:50] 6operations, 10CirrusSearch, 6Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1458711 (10Manybubbles) [16:49:55] 6operations, 10CirrusSearch, 6Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1458711 (10Manybubbles) [16:49:56] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Release experimental-highlighter for 1.7.0 - https://phabricator.wikimedia.org/T106162#1465433 (10Manybubbles) 5Open>3Resolved [16:53:48] cmjohnson1: is it you on mw1090? machine rebooted, I'll just shut it because of T105835 [16:56:59] or not [17:21:55] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [17:23:06] RECOVERY - Host mw2027 is UP: PING WARNING - Packet loss = 80%, RTA = 119.99 ms [17:48:08] godog: that was me with mw1090. https://phabricator.wikimedia.org/T105835 [17:55:50] !log canary restbase deploy of 0951a6d on restbase1001 [17:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:56:10] someone here which can look for me if a job is still running? [17:56:29] gwicke or godog? 
[17:57:02] Steinsplitter: I can check if a job class is backing up [17:57:08] if that's what you mean [17:57:54] gwicke: can you see if a job related to https://commons.wikimedia.org/w/index.php?title=Special:ListFiles/Hansmuller&ilshowall=1 this is still running (GWT JOB), seems halted. [17:59:29] Steinsplitter: I don't know much about GWT [17:59:34] ok [17:59:44] _joe_: around? [18:00:06] SMalyshev, he said he disconnected some time ago [18:00:16] jynus: thanks! [18:01:08] Steinsplitter: I do see some entries for the first of those images in the job runner logs [18:01:52] gwicke: ok, thanks. means that the job is still running (or at least queued)? [18:02:33] Steinsplitter: I'd say it's running [18:02:46] ok, thx :) [18:02:49] timestamp was only 20 minutes ago [18:06:26] 6operations, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1465554 (10demon) >>! In T105843#1465266, @jcrespo wrote: > First thing we need to determine is the future of this service: should new clusters or shards be added from the application point of view or should we... [18:08:17] 6operations, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1465555 (10demon) I'm less sure on the "can we do better in MW" with regards to better compression of blobs and so forth. That's more a question for @tstarling or @aaron... [18:09:50] jynus: tldr: mw doesn't care where we put the ES clusters, as long as they exist. [18:12:01] 6operations, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1465567 (10jcrespo) > It would require a bit of read-only time Do not worry about the operational part, thanks to replication impact would be minimal, if any. :-) [18:12:30] my question was more of: should we/do we need to create a blobs_cluster26? [18:12:45] if yes, now it is the time [18:13:35] or will it grow faster than usual with the new formats? 
[18:16:22] It's hard to see how fast it's growing beyond the past year. [18:18:18] Actually, hmm. I'm not entirely sure how we'd re-shard the existing content across a new cluster. We'd have to pick some data, copy it over, and then I get update the `text` table entries to match their new homes [18:19:57] recompressTracked might do it [18:21:11] qotd: "Automatically deletes the tracking table and starts from the start again when restarted" [18:21:15] (03CR) 10Gilles: [C: 031] Rename 'cookie_munging' VCL subroutine to 'stash_cookie' [puppet] - 10https://gerrit.wikimedia.org/r/225281 (owner: 10Ori.livneh) [18:21:24] (03PS1) 10GWicke: Revert "Don't killall $DAEMON" [puppet] - 10https://gerrit.wikimedia.org/r/225932 [18:22:08] !log deployed restbase 0951a6d to remaining nodes [18:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:29:19] ostriches, here it is the predicted output: http://ganglia.wikimedia.org/latest/graph.php?c=MySQL%20eqiad&h=es1005.eqiad.wmnet&r=year&z=small&jr=&js=&st=1437416850&v=89.7&m=part_max_used&vl=%25&ti=Maximum%20Disk%20Space%20Used&trend=1&z=xlarge [18:31:50] 6operations, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1465614 (10jcrespo) {F201547} Please note that **hardware purchase, installing and data migration takes months**. [18:32:31] thanks, BTW for the feedback [19:03:52] (03PS1) 10Gilles: Assign thumbnail access log to Monolog debug channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225935 (https://phabricator.wikimedia.org/T106323) [19:11:58] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1465674 (10TheDJ) Figured out what that Dalvik UA is likely coming from. It's the default of the [[ https://github.com/square/picasso | Picasso library ]], which is used by #Wikipedia-Andro... 
[19:36:25] 6operations, 10Beta-Cluster, 6Labs, 7Monitoring: Setup (simple) catchpoint monitoring for enwiki betacluster just like production - https://phabricator.wikimedia.org/T97865#1465714 (10hashar) Poked our internal ops mailling list. [19:40:23] !log (eevans, gwicke) removed *.hprof heap dumps from /var/lib/cassandra, freeing up a lot of space especially on 1004 & 1005 [19:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:47:35] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1465720 (10TheDJ) Another possible explanation: > I just noticed a similar pattern of User-agents for back to back requests. In my case, the first request (with the Mozilla User agent) was... [19:51:44] 10Ops-Access-Requests, 6operations, 6Reading-Admin: Requesting access to stat1002 (Hadoop / HDFS / Hue) for tbayer - https://phabricator.wikimedia.org/T105748#1465726 (10dr0ptp4kt) I emailed Tilman's manager for approval on ticket. [19:52:41] (03PS1) 1001tonythomas: Changed my blog address to new Jekyll from Wordpress [puppet] - 10https://gerrit.wikimedia.org/r/225952 [19:53:43] 6operations, 10RESTBase-Cassandra: setup an alertable threshold for Cassandra heap dumps - https://phabricator.wikimedia.org/T106346#1465732 (10Eevans) 3NEW [20:00:04] gwicke cscott arlolra subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150720T2000). Please do the needful. [20:16:25] 10Ops-Access-Requests, 6operations: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1465782 (10TJones) Signed https://phabricator.wikimedia.org/L3 wikitech profile: https://wikitech.wikimedia.org/wiki/User:Tjones preferred shell username: tjones (backup: trey) @fgiunchedi: Can I pu... 
[20:28:14] (03PS6) 10Chad: Elastic: move auto_create_index into hiera instead of role [puppet] - 10https://gerrit.wikimedia.org/r/207140 [21:30:27] (03PS1) 10Eevans: WIP: Cassanra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) [21:53:55] YuviPanda: i have a question about https://gerrit.wikimedia.org/r/#/c/143857/ [22:10:44] PROBLEM - Hadoop NodeManager on analytics1036 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [22:20:14] RECOVERY - Hadoop NodeManager on analytics1036 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [22:42:13] (03PS1) 1020after4: Ensure that phabricator/src/extensions exists [puppet] - 10https://gerrit.wikimedia.org/r/226031 (https://phabricator.wikimedia.org/T104904) [22:44:15] (03CR) 10Hoo man: "If you think this is ok, go ahead." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/118654 (https://phabricator.wikimedia.org/T58169) (owner: 10Gerrit Patch Uploader) [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150720T2300). Please do the needful. [23:00:04] gilles: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:10:14] RoanKattouw_away: ostriches: Krenair: any of you around? [23:22:21] I'm going to assume that everyone is flying :) I'll move my stuff to the next SWAT window [23:43:58] !log removed experimental nodes (1008, 1009) from system.peers on production C* nodes [23:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master