[00:03:50] (CR) Nemo bis: "WONTFIX, please abandon." [mediawiki-config] - https://gerrit.wikimedia.org/r/195197 (https://phabricator.wikimedia.org/T44894) (owner: Jalexander)
[00:06:22] operations, Patch-For-Review: contacts.wikimedia.org drupal unpuppetized - https://phabricator.wikimedia.org/T90679#1102388 (Dzahn) >>! In T90679#1102380, @AKoval_WMF wrote: > I do not even have access to contacts.wikimedia.org. And to my knowledge, no one else on my team is currently using Civi either....
[00:15:07] operations, Shinken: Setup IRC notifications for shinken - https://phabricator.wikimedia.org/T1260#1102404 (Dzahn) This is closed as resolved but on the workboard it's in "doing".
[00:22:36] (PS1) Thcipriani: Add master_key param for salt_minion module [puppet] - https://gerrit.wikimedia.org/r/195492
[00:28:39] operations: https://planet.wikimedia.org/ redirect broken - https://phabricator.wikimedia.org/T92051#1102438 (Dzahn) NEW
[00:29:11] operations: https://planet.wikimedia.org/ redirect broken - https://phabricator.wikimedia.org/T92051#1102453 (Dzahn) The Apache config for this exists on the backend and is puppetized.
[00:29:34] operations, Patch-For-Review: contacts.wikimedia.org drupal unpuppetized - https://phabricator.wikimedia.org/T90679#1102456 (AKoval_WMF) Yes, for our purposes now, Asana has replaced Civi. However, I'm sure we'd appreciate a data dump just in case I'm wrong about that! :) Also, it's my understanding tha...
[00:30:08] (CR) Dzahn: "did this not catch http://planet.wikimedia.org/ ?" [puppet] - https://gerrit.wikimedia.org/r/181419 (https://phabricator.wikimedia.org/T60048) (owner: John F.
Lewis)
[00:30:25] (CR) Thcipriani: Add /etc/mysql dir before linking inside it (1 comment) [puppet/mariadb] - https://gerrit.wikimedia.org/r/194925 (owner: Thcipriani)
[00:31:01] operations, Patch-For-Review: contacts.wikimedia.org drupal unpuppetized - https://phabricator.wikimedia.org/T90679#1102467 (Dzahn) Thank you. I believe the Fundraising team uses an entirely separate installation of CiviCRM, i will confirm though.
[00:32:46] operations, Patch-For-Review: contacts.wikimedia.org drupal unpuppetized - https://phabricator.wikimedia.org/T90679#1102468 (Dzahn) @jgreen ^ the fundraising civicrm has nothing to do with contact civi on zirconium, correct? i'm pretty sure, just to have the confirmation here on ticket.
[00:33:26] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:45:37] (CR) Springle: [C: -1] Add /etc/mysql dir before linking inside it (1 comment) [puppet/mariadb] - https://gerrit.wikimedia.org/r/194925 (owner: Thcipriani)
[00:47:53] operations, ops-codfw: label/update mgmt & settings/test eventlog2001 - https://phabricator.wikimedia.org/T90909#1102496 (Papaul) It is B5 not A5
[00:51:06] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[00:54:59] (CR) Ori.livneh: [C: -1] "Cassandra's "Getting Started" documentation says: "a newly started node needs to know of at least one other, this is called a Seed."
(http" [puppet/cassandra] - https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: GWicke)
[00:56:23] (PS2) Thcipriani: Add /etc/mysql dir before linking inside it [puppet/mariadb] - https://gerrit.wikimedia.org/r/194925
[00:58:49] (CR) Thcipriani: Add /etc/mysql dir before linking inside it (1 comment) [puppet/mariadb] - https://gerrit.wikimedia.org/r/194925 (owner: Thcipriani)
[01:04:57] (PS2) Ori.livneh: More secure permissions on conf cache [mediawiki-config] - https://gerrit.wikimedia.org/r/195196 (owner: Tim Starling)
[01:05:02] (CR) Ori.livneh: [C: 2] More secure permissions on conf cache [mediawiki-config] - https://gerrit.wikimedia.org/r/195196 (owner: Tim Starling)
[01:05:10] (Merged) jenkins-bot: More secure permissions on conf cache [mediawiki-config] - https://gerrit.wikimedia.org/r/195196 (owner: Tim Starling)
[01:06:26] !log ori Synchronized wmf-config/CommonSettings.php: I749477ac1: More secure permissions on conf cache (duration: 00m 06s)
[01:06:34] Logged the message, Master
[01:10:26] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail
[01:24:23] !log I749477ac1 follow-up: chmodded 0755 /tmp/mw-cache-* and 0644 /tmp/mw-cache-*/conf-*
[01:24:28] Logged the message, Master
[01:30:26] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[02:04:03] !log l10nupdate Synchronized php-1.25wmf19/cache/l10n: (no message) (duration: 00m 02s)
[02:04:10] Logged the message, Master
[02:05:11] !log LocalisationUpdate completed (1.25wmf19) at 2015-03-10 02:04:08+00:00
[02:05:14] Logged the message, Master
[02:05:59] !log l10nupdate Synchronized php-1.25wmf20/cache/l10n: (no message) (duration: 00m 01s)
[02:06:03] Logged the message, Master
[02:07:06] !log LocalisationUpdate completed (1.25wmf20) at 2015-03-10 02:06:03+00:00
[02:07:10] Logged the message, Master
[02:20:38] !log LocalisationUpdate
ResourceLoader cache refresh completed at Tue Mar 10 02:19:34 UTC 2015 (duration 19m 33s)
[02:20:43] Logged the message, Master
[02:41:33] Is the search-repository-swift ElasticSearch plugin in use or maintained? I got an issue trying to use it with ES 1.4.4 https://github.com/wikimedia/search-repository-swift/issues/7
[02:41:52] besides, i couldn’t find where in Phabricator I should create an issue.
[02:54:48] I think that'd be a question for manybubbles or ^demon if they were here
[03:25:59] Krenair, was this a response to my question about search-repository-swift?
[03:26:19] renoirb, yeah
[03:26:27] thx! :)
[03:26:31] not many people are around at this time, sorry
[03:26:59] that’s OK, I’m a night bird and I hammered about this for a while. I had hopes.
[03:27:42] I guess with the nick you gave me Krenair, I could find his contact info in wmf and/or Wikitech and/or phabricator directories?
[03:30:16] https://phabricator.wikimedia.org/p/demon/ or chorohoe@, https://phabricator.wikimedia.org/p/Manybubbles/ or neverett@
[03:30:54] https://github.com/demon was active on that repo
[03:33:21] thx Krenair
[03:34:24] they may both have emails about it already
[04:19:22] springle: ping?
[04:29:51] legoktm: pong
[04:31:02] springle: hi! did you see my question on https://phabricator.wikimedia.org/T91920#1101449? Is this db1068 issue an isolated incident? or is it possible other slaves for other wikis could be affected in a similar manner?
[04:31:34] no, hadn't seen that yet
[04:31:42] i'll update the ticket
[04:32:03] thanks
[04:36:26] operations, CA-team, MediaWiki-Core-Team, SUL-Finalization: db1068 (s4/commonswiki slave) is missing data about at least 6 users - https://phabricator.wikimedia.org/T91920#1103151 (Springle) >>! In T91920#1101449, @Legoktm wrote: > Thanks. How (un)likely is it that other wiki's slaves might have sim...
[04:36:51] springle: how long do you expect it'll take?
[04:41:46] legoktm: another 24h /guess
[04:59:36] Hi, I have a user account issue. Is someone available?
[05:05:02] raj: hi, what's the issue you're running into?
[05:06:19] hi legoktm, I created an account and I can't recall the password and don't think I have access to the email address any longer either
[05:07:40] raj: if that's the case, we can't really do much about that. I'd recommend creating a new account and just saying that you used to use another account, but forgot the password.
[05:08:25] can someone check which email address was used so I can ensure it's not one of the email addresses I haven't tried?
[05:55:47] (PS1) BBlack: depool cp4007,cp3015 for reinstall [puppet] - https://gerrit.wikimedia.org/r/195503
[05:56:01] (CR) BBlack: [C: 2 V: 2] depool cp4007,cp3015 for reinstall [puppet] - https://gerrit.wikimedia.org/r/195503 (owner: BBlack)
[05:58:35] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[05:58:46] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[06:03:48] (CR) GWicke: "@Ori, that default is wrong indeed, but it's also not used anywhere."
[puppet/cassandra] - https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: GWicke)
[06:07:55] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100%
[06:09:05] PROBLEM - Host cp3015 is DOWN: PING CRITICAL - Packet loss = 100%
[06:13:55] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 79.29 ms
[06:14:25] RECOVERY - Host cp3015 is UP: PING WARNING - Packet loss = 28%, RTA = 88.00 ms
[06:28:55] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:15] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:16] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:25] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:05] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:37:56] (PS1) BBlack: repool cp3015,cp4007 - update amssq33 [puppet] - https://gerrit.wikimedia.org/r/195506
[06:38:07] (CR) BBlack: [C: 2 V: 2] repool cp3015,cp4007 - update amssq33 [puppet] - https://gerrit.wikimedia.org/r/195506 (owner: BBlack)
[06:45:34] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:45:53] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:46:04] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:46:23] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:46:24] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:46:53] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently
enabled, last run 58 seconds ago with 0 failures
[07:01:00] (PS1) BBlack: faster varnish<->varnish probes [puppet] - https://gerrit.wikimedia.org/r/195508
[07:15:48] RECOVERY - Disk space on fluorine is OK: DISK OK
[07:18:07] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail
[07:18:25] (PS1) KartikMistry: Beta: Enable ContentTranslation in Xhosa (xh) [puppet] - https://gerrit.wikimedia.org/r/195509 (https://phabricator.wikimedia.org/T90126)
[07:20:32] (CR) Cwek: "I don't know how to use this thing.Sorry. :-p" [mediawiki-config] - https://gerrit.wikimedia.org/r/193827 (https://phabricator.wikimedia.org/T91223) (owner: Gerrit Patch Uploader)
[07:23:18] (CR) Liuxinyu970226: "TLDR, where's the document about how to add VisualEditor support to a specific namespace." [mediawiki-config] - https://gerrit.wikimedia.org/r/193827 (https://phabricator.wikimedia.org/T91223) (owner: Gerrit Patch Uploader)
[07:37:58] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[08:05:32] (PS8) Giuseppe Lavagetto: mediawiki: add configs to support the Dallas DC [mediawiki-config] - https://gerrit.wikimedia.org/r/194830 (https://phabricator.wikimedia.org/T91754)
[08:06:19] (PS2) KartikMistry: Beta: Enable ContentTranslation in Xhosa (xh) [puppet] - https://gerrit.wikimedia.org/r/195509 (https://phabricator.wikimedia.org/T90126)
[08:06:58] akosiaris: ^ Please :)
[08:12:29] kart_: y’know, long term I wonder if this shouldn’t be in ops/puppet
[08:12:31] (the config, that is)
[08:12:52] (CR) Yuvipanda: [C: 2] Beta: Enable ContentTranslation in Xhosa (xh) [puppet] - https://gerrit.wikimedia.org/r/195509 (https://phabricator.wikimedia.org/T90126) (owner: KartikMistry)
[08:13:04] maybe a cxserver/deploy repo
[08:13:23] <_joe_> YuviPanda: I think puppet is fine
[08:13:34] it’s basically a JSON config file.
[08:18:05] YuviPanda: that's why I asked about help sometime back ;)
[08:18:12] (Specially for Beta)
[08:18:13] right.
[08:18:23] Production must be in Puppet.
[08:18:34] I’m just doing the ‘mouth off opinions and not actually help fix the underlying problem’ act :)
[08:18:44] :)
[08:18:47] kart_: feel free to poke me as well for the beta changes
[08:18:59] Summer started. Laptop shutdown due to heat.
[08:19:46] YuviPanda: sure. Thanks!
[08:19:55] <_joe_> kart_: summer started in march?
[08:20:24] _joe_: it technically starts in Feb here when we start Fan/Air Condition etc :(
[08:21:22] _joe_: yeah, is summer here. March to… July?
[08:21:37] <_joe_> oh ok
[08:21:43] * _joe_ is ignorant
[08:22:46] (PS1) Matanya: resolv: selector outside a resource [puppet] - https://gerrit.wikimedia.org/r/195516
[08:30:03] (PS1) Matanya: selector outside a resource + 4 spaces [puppet] - https://gerrit.wikimedia.org/r/195518
[08:33:02] YuviPanda: it is March to next March ;)
[08:35:33] kart_: heh
[08:38:25] anyone with knowledge of trebuchet around?
[08:38:29] (PS1) Matanya: selector outside a resource [puppet] - https://gerrit.wikimedia.org/r/195519
[08:41:29] <_joe_> matanya: what would you suggest as an introductory reading for a wannabe-volunteer with us?
[08:41:35] (PS1) Matanya: monitoring: selector outside a resource [puppet] - https://gerrit.wikimedia.org/r/195520
[08:41:56] _joe_: puppet learning? unless he knows that
[08:42:40] _joe_: https://wikitech.wikimedia.org/wiki/Get_involved is a good place to start
[08:42:41] <_joe_> matanya: more about how to work with us, it's a friend of mine who would like to do some volunteering work
[08:43:00] <_joe_> matanya: ok I just wanted your expert opinion as a volunteer :)
[08:43:04] <_joe_> thanks
[08:43:15] sure :)
[08:44:21] good morning
[08:52:13] apergos: are you still futzing with git deploy?
[08:52:24] turns out I’m recreating tin for staging-tin atm and I’m wondering if my troubles are related :)
[08:53:12] <_joe_> btw where is our salt code located?
[08:53:23] <_joe_> do we have a salt repo with our own modules?
[08:53:43] <_joe_> I'd like to code a few of the most common salt recipes I use into a module
[08:56:55] _joe_: good question. no idea. apergos probably knows.
[08:57:03] to rephrase, apergos hopefully knows?
[09:10:51] operations: logster Debian package does not logrotate - https://phabricator.wikimedia.org/T76995#1103671 (hashar) I have upgraded the package on deployment-bastion.eqiad.wmflabs and it comes with a logrotate rule: ``` cat /etc/logrotate.d/logster...
[09:11:11] (PS1) Matanya: monitoring: selector outside a resource [puppet] - https://gerrit.wikimedia.org/r/195523
[09:16:25] (PS1) Matanya: ldap: selector outside a resource [puppet] - https://gerrit.wikimedia.org/r/195524
[09:17:28] _joe_: we don't have a single salt repo for those.
[09:19:30] if you want to look at a few examples you can check /usr/share/pyshared/salt/modules on any host, /usr/share/pyshared/salt/modules/test.py is a very simple one to look at
[09:19:46] <_joe_> apergos: where should I put a module I wrote?
[09:19:52] <_joe_> how do we install those?
[09:19:57] ah it's already written! nice
[09:20:23] <_joe_> no it's not, I was skipping to the part of the process I didn't already figure out :P
[09:22:36] oh :-D
[09:23:10] I don't know that we have a lot of salt modules just by themselves
[09:23:18] IIRC we don’t at all...
[09:23:31] modules/deployment has for example a module or two but also runners etc
[09:23:41] maybe you should just make a puppet module for it like that
[09:23:57] and it can deal with placing the file, any other things needed, etc
[09:24:32] <_joe_> ok
[09:25:20] if it turns out that a bunch of other people have recipes or whatever then maybe they would add on
[09:25:28] heh, I’ve been working today for only 2h and already touched trebuchet *and* scap..
[09:25:31] * YuviPanda dives into scap code again
[09:25:38] <_joe_> touching scap?
[09:25:39] YuviPanda: I still need to get out the trigger packages
[09:25:40] <_joe_> why?
[09:25:49] what did you do in trebuchet, just so I know?
[09:26:07] apergos: I’m setting up staging-tin, which is deployment server...
[09:26:10] also I can give *you* a workaround if you are having checkout difficulties. this is in deployment-prep right?
[09:26:16] nope
[09:26:19] this isn’t deployment-prep
[09:26:24] ah well I have not touched anything else
[09:26:41] this is ‘staging’ which is the ‘deployment-prep built from ground up where we actually go fix puppet when we can’t do something rather than hack around it'
[09:26:47] so I can look at it with you (happy to) but if it's broken it will have been broken for some time
[09:26:56] right
[09:27:07] _joe_: because scap assumes that if you’re on labs, deployment-bastion is your deployment server...
[09:27:15] even if it’s in a different project.
[09:27:22] ah right
[09:27:23] <_joe_> YuviPanda: use env variables :)
[09:27:28] apergos: yeah, it looks like I’ve to do at least one manual git deploy start / sync to ‘initialize'
[09:27:38] _joe_: yeah, so that all involves touching scap :)
[09:27:41] which is why I’m touching it
[09:28:17] <_joe_> add me as a reviewer, so that I get to play with some python again :)
[09:28:49] what's happening in trebuchet?
(if there's changesets just point me at em, it's only so I can keep track of what's going on, given that I'm playing in that sandbox these days)
[09:29:08] _joe_: :D I will!
[09:29:23] Puppet, operations, Tool-Labs: cron's puppet-run fails silently if apt-get update fails - https://phabricator.wikimedia.org/T92239#1103738 (scfc) NEW
[09:29:31] apergos: oh, nothing. I haven’t touched trebuchet’s code itself. I just had to do a manual ‘git deploy start / sync’ before puppet’s provider => trebuchet would work
[09:29:42] ah ok
[09:29:55] was this for an added host to some repo?
[09:29:59] _joe_: my current approach is basically to hack away on this mega patch https://gerrit.wikimedia.org/r/195340 (to simplify tin) and then split it up.
[09:30:09] apergos: so this is the deployment server itself...
[09:30:18] huh
[09:30:20] so this is the first ever deployment from any server...
[09:30:24] ahhhh
[09:30:30] staging-tin is equivalent in staging project of tin in prod
[09:30:35] I'm going to have to look at this from start to finish one of these days
[09:30:38] if you look at that patch it might make a bit more sense.
[09:30:40] yeaaaaah
[09:30:53] as long as you have that worked around I might... put it off for now though :-D but I'm looking at the patch, sure
[09:31:45] apergos: :) So once that patch is all merged, I’ll basically re-create staging-tin again...
[09:32:03] ah wow nice, kudos for tackling this
[09:32:06] _joe_: apergos in this case, I’m taking this as an opportunity to get our puppet config right, rather than ‘get this machine up and running faaast'
[09:32:18] +1 to that
[09:32:33] for some definition of ‘right'
[09:38:10] <_joe_> yeah
[09:38:38] <_joe_> YuviPanda: if you eliminate the "eventual consistency" exec, you get karma
[09:38:58] _joe_: I’m not touching the trebuchet code itself yet. Just the roles.
[09:39:32] aha!
hosts_allow => $::network::constants::mw_appserver_networks;
[09:39:32] <_joe_> that is in the role I guess
[09:40:17] terrible puppet question, but
[09:40:28] do I need to include network::constants before I use a global constant like that?
[09:40:42] <_joe_> YuviPanda: try to guess? :P
[09:40:51] aaaaarrrrggggh
[09:41:04] so I can’t actually put those in as default param values?
[09:41:25] although to be fair, network::constants needs to die
[09:42:04] <_joe_> no
[09:42:10] <_joe_> why should it?
[09:42:13] well
[09:42:17] be moved to hiera
[09:42:20] but maybe not that either.
[09:42:24] I’m not sure.
[09:42:37] or I could make the realm branches in the network config itself...
[09:42:40] <_joe_> the reaction you had, I had months ago
[09:43:13] heh.
[09:43:19] you’re just secretly me from the future
[09:44:01] <_joe_> ahah
[09:44:04] <_joe_> I hope not
[09:44:05] <_joe_> for you
[09:44:26] :D I hope not, for you as well :)
[09:44:39] realm branching in network.pp then
[09:45:03] although what’s the point of constants if you can’t actually include them as defaults...
[09:45:12] but it somewhat makes sense
[09:45:12] oh wait
[09:45:14] problem is
[09:45:19] network::constants doesn’t actually have constants...
[09:45:23] well
[09:45:31] depends on how you define ‘constant’ I guess
[09:45:37] <_joe_> YuviPanda: I'll take a look, I am thinking of manifests/network.pp
[09:45:45] yeah
[09:45:48] manifests/network.pp
[09:50:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 716
[09:50:42] (PS1) Yuvipanda: network: Realm branch appserver ip ranges [puppet] - https://gerrit.wikimedia.org/r/195527
[09:50:45] why are my git fetches being so slow...
[09:50:52] _joe_: ^ intermediate step...
[09:51:34] I’ll verify this on tin after merging.
[09:51:37] PROBLEM - SSH on rhenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:51:38] (CR) jenkins-bot: [V: -1] network: Realm branch appserver ip ranges [puppet] - https://gerrit.wikimedia.org/r/195527 (owner: Yuvipanda)
[09:51:47] yes yes, I’ll rebase when my fetch completes, jerkins
[09:51:52] <_joe_> YuviPanda: actually...
[09:52:18] (PS2) Yuvipanda: network: Realm branch appserver ip ranges [puppet] - https://gerrit.wikimedia.org/r/195527
[09:52:19] <_joe_> I was thinking we could define most constants as class parameters and use hiera to deviate from production
[09:52:20] (PS7) Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - https://gerrit.wikimedia.org/r/195340
[09:52:26] _joe_: +1, I agree.
[09:52:29] RECOVERY - SSH on rhenium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[09:53:06] (CR) jenkins-bot: [V: -1] network: Realm branch appserver ip ranges [puppet] - https://gerrit.wikimedia.org/r/195527 (owner: Yuvipanda)
[09:53:18] oh, interesting now.
[09:53:23] (CR) jenkins-bot: [V: -1] deployment: Combine labs/prod deployment server roles [puppet] - https://gerrit.wikimedia.org/r/195340 (owner: Yuvipanda)
[09:53:35] _joe_: but I don’t want to block this series of changes (which will end with me getting rid of about 50% of remaining beta/ module) on that.
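The idea _joe_ floats at 09:52:19, keeping production values as the defaults and letting hiera supply deviations for other realms, can be sketched roughly as follows. This is an illustration in Python rather than Puppet, and it is only an assumption about the shape of the lookup: the dictionary names, the `lookup` helper and the address ranges (RFC 5737 documentation ranges) are placeholders, not values from the real manifests.

```python
# Hypothetical production defaults; the real values live in manifests/network.pp.
PRODUCTION_DEFAULTS = {
    "mw_appserver_networks": ["192.0.2.0/24"],  # placeholder range
}

# Hypothetical per-realm overrides, standing in for hiera data.
REALM_OVERRIDES = {
    "labs": {"mw_appserver_networks": ["198.51.100.0/24"]},  # placeholder range
}


def lookup(key, realm):
    """Return the realm-specific value if one exists, else the production default."""
    return REALM_OVERRIDES.get(realm, {}).get(key, PRODUCTION_DEFAULTS[key])


print(lookup("mw_appserver_networks", "production"))
print(lookup("mw_appserver_networks", "labs"))
```

With this shape, adding a realm means adding data rather than another conditional branch, which seems to be the motivation for moving the values out of the manifest later.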
[09:53:38] <_joe_> jenkins doesn't like yuvi's changes
[09:55:01] (PS3) Yuvipanda: network: Realm branch appserver ip ranges [puppet] - https://gerrit.wikimedia.org/r/195527
[09:55:03] (PS8) Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - https://gerrit.wikimedia.org/r/195340
[09:55:17] RECOVERY - check_mysql on db1008 is OK: Uptime: 45970 Threads: 3 Questions: 250438 Slow queries: 384 Opens: 1390 Flush tables: 2 Open tables: 64 Queries per second avg: 5.447 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:55:29] _joe_: first time I’ve done an ‘elsif’ in puppet. elsif? so there’s elseif, else if, elif, and now elsif...
[09:56:31] _joe_: so how about I do this realm branch for now, and move them all out into hiera later?
[09:56:37] and by later I mean, say, tomorrow. or day after
[09:56:41] as soon as this megapatch is done
[09:57:02] there’s a bug for getting rid of network.pp wasn’t there...
[10:22:43] operations, Engineering-Community, WMF-Legal, ECT-March-2015, WMF-NDA: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#11021 (Qgil) @MBrar.WMF edited the text, and I think everything is good to go. Thank you everybody!
[10:22:57] operations, Engineering-Community, WMF-Legal, ECT-March-2015, WMF-NDA: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#1103915 (Qgil) Open>Resolved
[10:23:35] <_joe_> YuviPanda: +1
[10:23:42] cool
[10:23:52] running puppetcompiler to check
[10:26:01] (CR) Yuvipanda: [C: 2] network: Realm branch appserver ip ranges [puppet] - https://gerrit.wikimedia.org/r/195527 (owner: Yuvipanda)
[10:28:35] (PS1) Matanya: gridengine: correct param order [puppet] - https://gerrit.wikimedia.org/r/195531
[10:34:17] Lintian: This package installs an ELF binary in the /usr/share hierarchy, which is reserved for architecture-independent files.
[10:34:21] * hashar shakes fists
[10:43:13] (PS9) Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - https://gerrit.wikimedia.org/r/195340
[10:44:20] (CR) jenkins-bot: [V: -1] deployment: Combine labs/prod deployment server roles [puppet] - https://gerrit.wikimedia.org/r/195340 (owner: Yuvipanda)
[10:45:02] (PS1) Matanya: puppet: selector outside a resource [puppet] - https://gerrit.wikimedia.org/r/195533
[10:46:46] (PS10) Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - https://gerrit.wikimedia.org/r/195340
[10:50:11] interesting error
[10:50:12] @ERROR: chroot failed
[10:50:15] from rsync
[10:50:16] * YuviPanda digs
[10:52:41] (PS1) Matanya: statsdlb: fix string containing only a variable [puppet] - https://gerrit.wikimedia.org/r/195534
[11:04:18] (PS1) Matanya: nova: lint compute.pp [puppet] - https://gerrit.wikimedia.org/r/195535
[11:08:10] (PS1) Matanya: git: fix param order [puppet] - https://gerrit.wikimedia.org/r/195536
[11:32:38] (PS1) Yuvipanda: trebuchet: Make the deployment server configurable via hiera [puppet] - https://gerrit.wikimedia.org/r/195539
[11:33:15] operations, Phabricator, Wikimedia-Bugzilla: Bugzilla HTML static version and database dump - https://phabricator.wikimedia.org/T1198#1104040 (Aklapper) Minor: Seeing "WMF Static Bugzilla" results in Google, could that be changed to "Wikimedia Deprecated Bugzilla" or something?
[11:34:28] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[11:34:59] (PS2) Yuvipanda: mysql: Cleanup mysql::param [puppet] - https://gerrit.wikimedia.org/r/195321
[11:35:48] Ops-Access-Requests, operations, Security: define in Puppet or remove user account - tfinc - https://phabricator.wikimedia.org/T90927#1104041 (mark) I'll approve Tomasz's access to bastion* and stat*.
[11:36:29] Ops-Access-Requests, operations, Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#1104042 (mark) a:mark>None Toby's access to stat1001 is approved.
[11:37:10] (CR) Matanya: [C: 1] mysql: Cleanup mysql::param [puppet] - https://gerrit.wikimedia.org/r/195321 (owner: Yuvipanda)
[11:37:54] (CR) Yuvipanda: [C: 2 V: 2] mysql: Cleanup mysql::param [puppet] - https://gerrit.wikimedia.org/r/195321 (owner: Yuvipanda)
[11:40:16] Ops-Access-Requests, operations, Security: define in Puppet or remove user account - tfinc - https://phabricator.wikimedia.org/T90927#1104052 (mark) a:mark>None
[11:42:00] (PS2) Yuvipanda: trebuchet: Make the deployment server configurable via hiera [puppet] - https://gerrit.wikimedia.org/r/195539
[11:42:18] (CR) Yuvipanda: [C: 2 V: 2] trebuchet: Make the deployment server configurable via hiera [puppet] - https://gerrit.wikimedia.org/r/195539 (owner: Yuvipanda)
[11:46:02] (PS11) Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - https://gerrit.wikimedia.org/r/195340
[11:48:34] Puppet, operations, Tool-Labs: cron's puppet-run fails silently if apt-get update fails - https://phabricator.wikimedia.org/T92239#1104073 (faidon) This is on purpose — we run apt-get update before puppet as to pick up potentially new packages (e.g. in our own apt repository) and properly enforce ensur...
[11:52:20] Puppet, operations, Tool-Labs: cron's puppet-run fails silently if apt-get update fails - https://phabricator.wikimedia.org/T92239#1104078 (yuvipanda) (on labs too, this got picked up as a puppet staleness failure)
[11:53:02] operations, HTTPS, HTTPS-by-default: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1104088 (Aklapper)
[12:01:36] (PS1) Yuvipanda: beta: Set trebuchet deployment server via hiera [puppet] - https://gerrit.wikimedia.org/r/195542
[12:02:15] (CR) Yuvipanda: [C: 2 V: 2] beta: Set trebuchet deployment server via hiera [puppet] - https://gerrit.wikimedia.org/r/195542 (owner: Yuvipanda)
[12:12:49] akosiaris: hi there, fell like a review raid ?
[12:13:16] *feel
[12:17:04] matanya: kind of busy right now. Add me as a reviewer and I will do them in my own time
[12:17:25] sure, you are reviewer on all 11. thanks!
[12:17:54] <_joe_> matanya: add me too, I can sweep a few hopefully
[12:18:08] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[12:18:10] you are a reviewer too _joe_ and thanks!
[12:18:41] nobody panic. scap on beta is broken, but I’m on it
[12:24:22] (PS1) Yuvipanda: dsh: Allow setting group source via hiera [puppet] - https://gerrit.wikimedia.org/r/195544
[12:27:11] Puppet, operations, Tool-Labs: cron's puppet-run fails silently if apt-get update fails - https://phabricator.wikimedia.org/T92239#1104143 (scfc) I know, that's why I found the error :-). But it took me quite some time to figure out that Puppet is //not// failing, but `apt-get update`. If: ``` timeo...
[12:34:31] (CR) Yuvipanda: [C: 2] dsh: Allow setting group source via hiera [puppet] - https://gerrit.wikimedia.org/r/195544 (owner: Yuvipanda)
[12:45:10] YuviPanda: https://gerrit.wikimedia.org/r/#/c/194395/ should address all your nits.
[12:46:26] (CR) Yuvipanda: [C: 1] Labs: Puppetize labstore1003 [puppet] - https://gerrit.wikimedia.org/r/194395 (owner: coren)
[12:46:30] Coren: \o/
[12:47:13] Puppet, operations, Tool-Labs: cron's puppet-run fails silently if apt-get update fails - https://phabricator.wikimedia.org/T92239#1104178 (chasemp) I think the idea is that if the puppet initiated apt update fails then puppet is failing indeed.
[12:47:24] (PS8) coren: Labs: Puppetize labstore1003 [puppet] - https://gerrit.wikimedia.org/r/194395
[12:48:08] Puppet, operations: Convert host lists in dsh/files/groups to hiera - https://phabricator.wikimedia.org/T92259#1104185 (yuvipanda) NEW
[12:48:59] Puppet, operations, Tool-Labs: cron's puppet-run fails silently if apt-get update fails - https://phabricator.wikimedia.org/T92239#1104192 (yuvipanda) If I retitle the bug to 'redirect apt-get output to puppet log', would that be an accurate assessment of what you're asking for, @scfc? That sounds usef...
[12:50:26] Puppet, operations, Tool-Labs: cron's puppet-run fails silently if apt-get update fails - https://phabricator.wikimedia.org/T92239#1104194 (faidon) Logging apt-get update's -qq output in the logfile doesn't sound like a bad idea at all! @scfc, do you want to do the honors? :)
[12:50:44] Puppet, operations: Convert host lists in dsh/files/groups to hiera - https://phabricator.wikimedia.org/T92259#1104195 (yuvipanda) Replacing with some other discovery service is also a valid solution :)
[12:51:55] (PS1) Giuseppe Lavagetto: mediawiki: handle errors via php on bits.w.org [puppet] - https://gerrit.wikimedia.org/r/195548
[12:52:22] <_joe_> MatmaRex: ^^
[12:52:35] oh, neat.
thanks
[12:54:15] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: handle errors via php on bits.w.org [puppet] - https://gerrit.wikimedia.org/r/195548 (owner: Giuseppe Lavagetto)
[12:54:43] (PS12) Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - https://gerrit.wikimedia.org/r/195340
[12:59:30] Puppet, operations, Tool-Labs: cron's puppet-run fails silently if apt-get update fails - https://phabricator.wikimedia.org/T92239#1104211 (scfc) Well, in that case perhaps the monitoring should be condensed to "Everything's okay" and "Something's broken", so that I could spend more hours debugging tha...
[13:02:43] Puppet, operations, Tool-Labs: cron's puppet-run fails silently if apt-get update fails - https://phabricator.wikimedia.org/T92239#1104230 (yuvipanda) >>! In T92239#1104211, @scfc wrote: > Well, in that case perhaps the monitoring should be condensed to "Everything's okay" and "Something's broken", so...
[13:04:21] (PS1) Yuvipanda: puppet: Log output of pre-puppet-run apt-get update [puppet] - https://gerrit.wikimedia.org/r/195550 (https://phabricator.wikimedia.org/T92239)
[13:08:36] (CR) Tim Landscheidt: [C: -1] "Probably overcautious, but I would move the "chmod" section up before the log file is now created with apt-get." [puppet] - https://gerrit.wikimedia.org/r/195550 (https://phabricator.wikimedia.org/T92239) (owner: Yuvipanda)
[13:09:40] operations, HTTPS, HTTPS-by-default: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1104239 (JanZerebecki) I personally would prefer to pin leaf keys, but the question is what the people maintaining the certificates can offer.
[13:10:25] (PS2) Yuvipanda: puppet: Log output of pre-puppet-run apt-get update [puppet] - https://gerrit.wikimedia.org/r/195550 (https://phabricator.wikimedia.org/T92239)
[13:10:30] (CR) Yuvipanda: "good catch!"
[puppet] - 10https://gerrit.wikimedia.org/r/195550 (https://phabricator.wikimedia.org/T92239) (owner: 10Yuvipanda) [13:13:13] (03CR) 10Tim Landscheidt: [C: 031] puppet: Log output of pre-puppet-run apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/195550 (https://phabricator.wikimedia.org/T92239) (owner: 10Yuvipanda) [13:14:13] (03PS3) 10Yuvipanda: puppet: Log output of pre-puppet-run apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/195550 (https://phabricator.wikimedia.org/T92239) [13:14:36] (03CR) 10Yuvipanda: [C: 032 V: 032] puppet: Log output of pre-puppet-run apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/195550 (https://phabricator.wikimedia.org/T92239) (owner: 10Yuvipanda) [13:15:58] 7Puppet, 6operations, 10Tool-Labs, 5Patch-For-Review: cron's puppet-run fails silently if apt-get update fails - https://phabricator.wikimedia.org/T92239#1104248 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Thank you for reporting :) [13:16:05] (03CR) 10coren: [C: 032] Labs: Puppetize labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/194395 (owner: 10coren) [13:25:53] Hmm. hieradata/role/common/labs/nfs/dumps.yaml isn't being picked up. [13:26:37] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: puppet fail [13:31:15] _joe_: Clarify my understanding if you have a minute? Should 'role labs::nfs::dumps' have picked up hiera data from hieradata/role/common/labs/nfs/dumps.yaml as I expected? 
[13:38:07] (03CR) 10Ottomata: "2 qs:" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: 10GWicke) [13:41:00] (03CR) 10JanZerebecki: [C: 031] Enable HSTS on racktables with max-age=7days [puppet] - 10https://gerrit.wikimedia.org/r/195444 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [13:41:32] <_joe_> Coren: yes [13:41:53] <_joe_> but I'm momentarily off for lunch :) [13:41:54] <_joe_> bbl [13:49:10] (03CR) 10Mark Bergsma: [C: 04-2] "I really don't see the need to limit our flexibility here, and RackTables isn't that sensitive at all." [puppet] - 10https://gerrit.wikimedia.org/r/195444 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [14:04:25] (03CR) 10JanZerebecki: "For what do you foresee a need to not use HTTPS on this domain and why can that not be done by picking a new domain name?" [puppet] - 10https://gerrit.wikimedia.org/r/195444 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [14:04:44] <_joe_> Coren: what is the change that is not working for you? [14:05:04] https://gerrit.wikimedia.org/r/#/c/194395/ [14:05:27] Gets me Must pass dump_servers_ips to Class[Role::Labs::Nfs::Dumps] [14:06:11] <_joe_> Coren: oh I gotcha [14:06:20] <_joe_> Coren: I'll comment on the change [14:08:15] (03PS1) 10Tim Landscheidt: Tools: Fix XML output of qstat for webservice2 [puppet] - 10https://gerrit.wikimedia.org/r/195556 (https://phabricator.wikimedia.org/T92039) [14:09:14] (03CR) 10Giuseppe Lavagetto: Labs: Puppetize labstore1003 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/194395 (owner: 10coren) [14:09:23] (03CR) 10Alexandros Kosiaris: [C: 031] "Don't forget to remove the key from the private repo" [puppet] - 10https://gerrit.wikimedia.org/r/195303 (https://phabricator.wikimedia.org/T92045) (owner: 10Dzahn) [14:09:52] Ha-ah! [14:10:20] * Coren bows in respect to Da Hiera Man! [14:10:49] Coren: _joe_ whoops, I missed that... 
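A rough sketch of the lookup being discussed, assuming the role-to-file layout is exactly what the example above suggests (role `labs::nfs::dumps` mapping to `hieradata/role/common/labs/nfs/dumps.yaml`) and that the class parameter is resolved through Puppet's automatic parameter lookup, where the key is the class name plus the parameter name. The helper names are hypothetical and only illustrate why a misnamed key would produce the "Must pass dump_servers_ips" error.

```python
# Hypothetical helpers; the path layout is inferred only from the example
# quoted in the log, not from the actual hiera backend code.

def role_hiera_path(role, prefix="hieradata/role/common"):
    """Map a role name such as 'labs::nfs::dumps' to the YAML file expected to hold its data."""
    return "{}/{}.yaml".format(prefix, role.replace("::", "/"))


def class_param_key(role, param):
    """Build the key that automatic parameter lookup would search for.

    Assumes the class is named role::<role>, as in Class[Role::Labs::Nfs::Dumps].
    """
    return "role::{}::{}".format(role, param)


if __name__ == "__main__":
    print(role_hiera_path("labs::nfs::dumps"))
    print(class_param_key("labs::nfs::dumps", "dump_servers_ips"))
```

If the YAML file defines the value under any other key name, the lookup finds nothing and Puppet reports the parameter as missing.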
[14:11:55] (03CR) 10Yuvipanda: Labs: Puppetize labstore1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/194395 (owner: 10coren) [14:23:51] (03PS1) 10Tim Landscheidt: apt: Remove invalid parameter $priority from apt::repository [puppet] - 10https://gerrit.wikimedia.org/r/195562 [14:24:28] (03CR) 10Tim Landscheidt: "Revert in https://gerrit.wikimedia.org/r/#/c/195562/." [puppet] - 10https://gerrit.wikimedia.org/r/110124 (owner: 10Andrew Bogott) [14:25:05] (03CR) 10Faidon Liambotis: [C: 032] apt: Remove invalid parameter $priority from apt::repository [puppet] - 10https://gerrit.wikimedia.org/r/195562 (owner: 10Tim Landscheidt) [14:26:30] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Eventlogging JS client should warn users when serialized event is more than "N" chars long and not sent the event - https://phabricator.wikimedia.org/T91918#1104401 (10mforns) a:5mforns>3None [14:38:28] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1104449 (10Nikerabbit) If I have questions about the document, who is the person to ask? Specifically, about the extend of //Wikimedia facilities// on the third point and... [14:50:40] marktraceur, ^demon|away, thcipriani: Who wants SWAT this morning? [14:50:49] James_F|Away, Glaisher: Ping for SWAT in about 9 minutes. [14:51:00] pong [14:52:03] (03PS1) 10Glaisher: Clean up $wgNamespacesWithSubpages array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195565 [14:52:27] * thcipriani raises hand [14:52:59] I could do SWAT given some supervision. Could use the practice. [14:53:02] added that patch above to the list as well [14:53:12] thcipriani: Ok, you have it. If you want to hangout like you did yesterday, let me know. [14:54:41] anomie: yeah, that'd be good. I still don't have +2 on some of the repos :( I'll call you in 5 if that's cool? 
[14:54:52] thcipriani: ok [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, James_F: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150310T1500). Please do the needful. [15:00:28] anomie: calling. [15:05:35] (03CR) 10Anomie: [C: 032] "SWAT (merge for thcipriani)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171219 (https://phabricator.wikimedia.org/T57737) (owner: 10Glaisher) [15:05:45] (03Merged) 10jenkins-bot: Delete vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171219 (https://phabricator.wikimedia.org/T57737) (owner: 10Glaisher) [15:06:59] 6operations: Upgrade salt to 2014.7 (investigating) - https://phabricator.wikimedia.org/T88971#1104582 (10ArielGlenn) [15:15:53] (03PS1) 10coren: Fix hiera variable name for labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/195566 [15:16:03] (03PS1) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 [15:16:58] !log Delete vewikimedia deployed via morning swat [[gerrit:171219]] [15:17:03] Logged the message, Master [15:18:04] Glaisher: everything look good to you re:delete vewikimedia? [15:18:46] hard to tell [15:18:58] (03CR) 10JanZerebecki: "This is based on https://github.com/wmde/puppet-builder/ ." [puppet] - 10https://gerrit.wikimedia.org/r/195567 (owner: 10JanZerebecki) [15:19:08] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:16] thcipriani: I think you need to run a script too [15:19:39] I remember seeing deleteWiki.php or sth somewhere [15:20:51] Finally. [15:20:53] anomie: Here now, sorry for the delay. [15:22:59] !log thcipriani Synchronized database lists: (no message) (duration: 00m 07s) [15:23:00] https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/removeDeletedWikis.php [15:23:03] Logged the message, Master [15:23:11] thcipriani: that one ^ [15:23:25] Glaisher: got it. 
Thanks [15:25:00] springle: I’d appreciate a review of https://gerrit.wikimedia.org/r/#/c/195472/ if you’re still up [15:28:04] 6operations, 5Patch-For-Review: reclaim lsearchd hosts - https://phabricator.wikimedia.org/T86149#1104665 (10Ottomata) [15:29:25] Glaisher: script run [15:29:34] ah, ok thanks [15:32:56] (03PS2) 10Anomie: Clean up $wgNamespacesWithSubpages array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195565 (owner: 10Glaisher) [15:33:07] (03CR) 10Anomie: [C: 032] "SWAT (for thcipriani)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195565 (owner: 10Glaisher) [15:33:13] (03Merged) 10jenkins-bot: Clean up $wgNamespacesWithSubpages array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195565 (owner: 10Glaisher) [15:34:26] (03PS1) 10RobH: mw2201-2209 reverse entry corrections [dns] - 10https://gerrit.wikimedia.org/r/195568 [15:34:59] (03CR) 10RobH: [C: 032] mw2201-2209 reverse entry corrections [dns] - 10https://gerrit.wikimedia.org/r/195568 (owner: 10RobH) [15:36:47] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:36:50] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: Morning swat sync of [[gerrit:195565]] (duration: 00m 06s) [15:36:54] Logged the message, Master [15:37:18] Glaisher: 195565 is synced, can you verify? [15:37:42] it isn't working [15:37:46] oh wait.. you just synced it [15:37:51] yessir [15:38:19] works! :) [15:38:24] perfect, thanks! 
[15:38:37] anomie: closed the task after it was merged at gerrit [15:38:46] (03PS6) 10Anomie: Disable 'beta' label in tab for the VE opt-in wiki (enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/112590 (https://phabricator.wikimedia.org/T60583) (owner: 10Jforrester) [15:38:48] I was testing it then [15:39:02] James_F: starting on 112590 [15:39:03] (03CR) 10Anomie: [C: 032] "SWAT (for thcipriani)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/112590 (https://phabricator.wikimedia.org/T60583) (owner: 10Jforrester) [15:39:07] thcipriani: Thanks. [15:39:09] (03Merged) 10jenkins-bot: Disable 'beta' label in tab for the VE opt-in wiki (enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/112590 (https://phabricator.wikimedia.org/T60583) (owner: 10Jforrester) [15:39:49] 10Ops-Access-Requests, 6operations, 6Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#1104706 (10RobH) a:3RobH [15:39:51] (03PS1) 10BBlack: depool cp1062+cp4006 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195572 [15:39:54] 10Ops-Access-Requests, 6operations, 6Security: define in Puppet or remove user account - tfinc - https://phabricator.wikimedia.org/T90927#1104707 (10RobH) a:3RobH [15:40:16] (03CR) 10BBlack: [C: 032 V: 032] depool cp1062+cp4006 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195572 (owner: 10BBlack) [15:41:11] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: Morning SWAT [[gerrit:112590]] (duration: 00m 06s) [15:41:16] Logged the message, Master [15:42:07] James_F: synced 112590 can you verify? [15:42:44] (03PS1) 10Dzahn: depool cp1056 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195573 (https://phabricator.wikimedia.org/T86648) [15:42:45] thcipriani: Sure. It'll take a second or so. 
[15:42:48] kk [15:43:39] (03PS1) 10Yuvipanda: beta: Stop trying to deploy to the deploy host [puppet] - 10https://gerrit.wikimedia.org/r/195574 [15:43:57] (03PS2) 10Yuvipanda: beta: Stop trying to deploy to the deploy host [puppet] - 10https://gerrit.wikimedia.org/r/195574 [15:44:12] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Delete vewikimedia and redirect it to wikimedia.org.ve - https://phabricator.wikimedia.org/T57737#1104717 (10Glaisher) 5Open>3Resolved a:3Glaisher [15:44:18] (03CR) 10Yuvipanda: [C: 032 V: 032] beta: Stop trying to deploy to the deploy host [puppet] - 10https://gerrit.wikimedia.org/r/195574 (owner: 10Yuvipanda) [15:44:30] 6operations, 10Wikimedia-General-or-Unknown: Delete vewikimedia and redirect it to wikimedia.org.ve - https://phabricator.wikimedia.org/T57737#636412 (10Glaisher) [15:45:08] thcipriani: Yup, looks good. Thanks! [15:45:17] James_F: neat. Thank You! [15:45:48] (03PS2) 10Dzahn: depool cp1056 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195573 (https://phabricator.wikimedia.org/T86648) [15:46:08] All: SWAT complete! Thank you for your patience. 
:) [15:49:16] (03CR) 10GWicke: "* What if an ipaddress is used instead of a hostname in $seeds" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: 10GWicke) [15:51:30] (03CR) 10Dzahn: [C: 032] depool cp1056 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195573 (https://phabricator.wikimedia.org/T86648) (owner: 10Dzahn) [15:51:38] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [15:51:58] PROBLEM - Host cp1062 is DOWN: PING CRITICAL - Packet loss = 100% [15:54:04] ACKNOWLEDGEMENT - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn depooled for reinstall [15:54:34] ACKNOWLEDGEMENT - Host cp1062 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn depooled for reinstall [15:54:50] what I've been doing is just downtiming the services but not the hosts. it keeps the spam down, but still gets the host down reported in here so people are aware [15:55:18] cool, sounds good , yep [15:56:01] after an ACK, if the state changes again it will still report it [15:56:58] just having hosts reported but not all the services sounds perfect to me [15:57:08] well, sometimes we get the next UP event and sometimes not, but after that it all stops [15:57:24] (because it gets cleaned from puppet, and usually that gets to neon and deletes the host before the reinstall finishes) [15:57:34] 6operations, 6Labs, 10hardware-requests: Hardware for Designate - https://phabricator.wikimedia.org/T91277#1104754 (10Andrew) [15:58:23] if I were to vote I would say, silence it all if expected? [15:58:40] getting node down alerts in this context, what is the intended value? [15:59:16] not taking down too many at onces [15:59:18] (03PS1) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/195577 (https://phabricator.wikimedia.org/T92252) [15:59:51] I don't understand, how does one relate to the other? 
[15:59:57] RECOVERY - Host cp1062 is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [16:00:25] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1104770 (10RobH) @nikerabbit, You can ask on here. I may not be able to answer it, but I can direct them to someone who (hopefully) can. The third point you reference i... [16:00:29] if you have more than 1 person working on it you know what is happening [16:00:42] and when one is done with reinstall.. i guess [16:01:23] RECOVERY - Host cp4006 is UP: PING OK - Packet loss = 0%, RTA = 80.73 ms [16:01:24] yes ok but node down alerts to teh main -ops channel is not a coordination mechanism [16:01:26] (03PS13) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [16:01:31] or at least not a good one I can wrap my mind around [16:05:09] I was actually about to ask in -sec what was up with cp1062 etc as if I see the flag raised I feel like an unknown node down is a context shifting needed and that's hard to do if it is part of a normalized work flow [16:05:20] i don't have a strong opinion, the only part i care is that if it says CRIT it's good to see either an ACK or a RECOVERY [16:05:23] (03PS2) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 [16:05:31] that is opaque to the rest of the group, anyways, my vote will always be to silence known alerts [16:06:16] (03CR) 10jenkins-bot: [V: 04-1] Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (owner: 10JanZerebecki) [16:07:19] (03PS2) 10RobH: Add tnegrin to statistics-web-users [puppet] - 10https://gerrit.wikimedia.org/r/193848 (owner: 10Rush) [16:07:48] (03PS3) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 [16:08:23] PROBLEM - Host cp1062 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:36] (03CR) 10RobH: [C: 032] Add tnegrin to 
statistics-web-users [puppet] - 10https://gerrit.wikimedia.org/r/193848 (owner: 10Rush) [16:08:43] RECOVERY - Host cp1062 is UP: PING OK - Packet loss = 0%, RTA = 1.48 ms [16:08:53] 10Ops-Access-Requests, 6operations, 6Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#1104795 (10RobH) [16:08:59] (03CR) 10jenkins-bot: [V: 04-1] Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (owner: 10JanZerebecki) [16:09:06] 6operations, 6Security: Define in Puppet or remove rogue user accounts not currently defined in admin/data.yaml - https://phabricator.wikimedia.org/T90923#1104798 (10RobH) [16:09:07] 10Ops-Access-Requests, 6operations, 6Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#1104796 (10RobH) 5Open>3Resolved I've merged live toby's corrected access to stat1001; resolving ticket. [16:09:28] (03PS4) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 [16:09:47] schedules a downtime incl. the host [16:09:52] (03PS14) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [16:10:46] 6operations, 6Labs, 10hardware-requests: Hardware for Designate - https://phabricator.wikimedia.org/T91277#1104814 (10Andrew) Quick conversation with Mark confirms that this box should have a public IP, same as silver or Horizon. We'll figure out about rabbitmq communication later on. 
[16:11:07] (03PS5) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 [16:11:18] (03CR) 10coren: [C: 032] "Trivial hiera variable name fix" [puppet] - 10https://gerrit.wikimedia.org/r/195566 (owner: 10coren) [16:11:32] i'm fine with either, we can handle them with scheduled downtime, by disabling notifications or by acknowledgments, as long as we do any of that and don't get into the "there's always a lot of CRIT anyways, so i tend to ignore it"-trap [16:13:58] (03PS2) 10GWicke: Don't include a node in its own seeds [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) [16:15:06] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1104825 (10Nikerabbit) Does the third point include the use Wikimedia Labs or the use of Gerrit/Phabricator? [16:15:22] (03PS1) 10RobH: fixing tomasz user permissions in puppet [puppet] - 10https://gerrit.wikimedia.org/r/195583 [16:16:44] 6operations, 10MediaWiki-extensions-PdfHandler, 6Multimedia, 6Wikisource: Text edit box encoding problem with PDF - https://phabricator.wikimedia.org/T36540#1104887 (10Aklapper) [16:17:02] 6operations, 10Wikimedia-DNS, 6Wikisource: Redirect mul.wikisource.org to wikisource.org - https://phabricator.wikimedia.org/T75407#1104901 (10Aklapper) [16:17:03] 6operations, 10Wikimedia-Interwiki-links, 6Wikisource: Interwiki language links to non-existent wikisources should redirect to the multilingual wikisource - https://phabricator.wikimedia.org/T38033#1104900 (10Aklapper) [16:17:11] (03CR) 10GWicke: "@Ori: On reflection, I now think that letting a single-node cluster start up automatically is actually a fine default. 
Lets keep $::ipaddr" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: 10GWicke) [16:17:34] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:57] 6operations, 10Wikimedia-Extension-setup, 6Wikisource: pdftotext should be poppler version not xpdf version on wikisource - https://phabricator.wikimedia.org/T37122#1104957 (10Aklapper) [16:18:03] RECOVERY - Host cp4006 is UP: PING OK - Packet loss = 0%, RTA = 79.82 ms [16:18:50] (03CR) 10RobH: [C: 032] fixing tomasz user permissions in puppet [puppet] - 10https://gerrit.wikimedia.org/r/195583 (owner: 10RobH) [16:19:20] (03CR) 10Ottomata: "Oo, sorry, one more, you might want to check against @fqdn instead of (or in addition to) @hostname. I think it is more likely that @fqdn" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: 10GWicke) [16:19:21] Coren: i just merged your stuff on palladium [16:19:24] since i had stuff too [16:19:29] is cp1069 a live (and fully installed) host? I'm seeing salt key mismatches on pallaium [16:19:31] palladium [16:19:38] robh: Oh, sorry, I got distracted by email. [16:19:44] (03PS15) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [16:19:46] no worries, just didnt want you to wonder later =] [16:19:46] i.e. is anyone doing things to it? [16:19:51] apergos: not i. [16:20:00] hrm [16:20:06] apergos: bblack is [16:20:09] any puppet-knowledgeable people want to help me fix this lint error? https://gerrit.wikimedia.org/r/#/c/195576/ [16:20:14] he mentioned in another channel just now [16:20:23] I'm off for a few hours; passport renew time. [16:20:37] Coren: eww, have ... fun? [16:20:38] I have a 'define' inside my role definition, only used in that role but it wants me to split that class away. Where should I put it? 
[16:21:16] (03PS3) 10GWicke: Don't include a node in its own seeds [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) [16:21:31] robh: The Canadian bureaucracy isn't that bad for passports. It's a lot of waiting in line but basically a rubberstamp for a simple renewal. [16:21:52] ah ok, I will ignore it then, thanks [16:22:17] actually never mind, I think I figured it out [16:24:08] (03PS16) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [16:24:14] (03PS4) 10GWicke: Don't include a node in its own seeds [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) [16:25:11] (03PS5) 10GWicke: Don't include a node in its own seeds [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) [16:25:13] (03PS1) 10BBlack: repool cp4006 + cp1062 [puppet] - 10https://gerrit.wikimedia.org/r/195585 [16:25:31] (03CR) 10BBlack: [C: 032 V: 032] repool cp4006 + cp1062 [puppet] - 10https://gerrit.wikimedia.org/r/195585 (owner: 10BBlack) [16:28:58] apergos: got a moment to chat about dumps? [16:29:13] sure [16:29:23] Coren! [16:29:26] gwicke: [16:29:38] robh! [16:29:46] (we're just listing names right?) [16:29:48] r! [16:29:49] so re labs vs. prod: my main worry with labs is that it's not ideal for the amount of IO and storage that dumps will need [16:29:51] Ga [16:29:53] h [16:30:06] Hello robh. How are you? 
[16:30:35] gwicke: don't worry be happy [16:30:38] all systems nominal =] [16:31:06] apergos: the restbase dumps are pretty straightforward code-wise, don't see a reason to not run them in prod [16:31:29] we're talking about prod vs labs already amongst op folks [16:31:30] for zim we'd need to have a look at what they are doing, and whether that would be a problem in prod [16:32:03] from my understanding it's very similar to what OCG is doing [16:32:30] massaging and loading of metadata / thumbnails [16:32:33] you know it would be the usual steps (assuming we decide to get them real hardware) [16:32:44] make it happy in labs. puppetize. move to prod [16:33:14] yeah [16:33:18] if you want to weigh in you.. are you on the ops list? (sorry but I forget who is and who isn't) [16:33:27] if you aren't, write me an email and I'll fwd it [16:33:30] yes, I am on the ops list [16:33:47] if you are, just send to the list directly that 'this is related to the OpenZIM labs request and etc, these tasks here' [16:34:32] ok [16:35:24] re serving restbase dumps: is there a particular reason for not serving dumps from the machine they are created on?
[16:35:56] to me it sounds like that could avoid a good amount of data movement / IO [16:37:11] well it depends how intensive your dump production process is [16:37:22] but you're going to move data around anyways to make sure you have more than one copy [16:37:43] so why not have a dedicated server that people can suck all the bandwidth and disk i/o out of [16:38:20] I guess the main reason is that the dumps are incremental [16:38:39] so a dump actually reuses most of the previous dump's content [16:39:04] yep we have the same deal [16:39:16] so you produce it locally, always keep the last good round locally [16:39:25] ship results off to the web server [16:39:38] (this is my plan for codfw btw for regular dumps, to get the heck off of nfs) [16:39:40] ok, makes sense [16:39:58] either way works for me [16:40:18] I guess we'd get a bit more use out of the dump machines if we also used them for serving [16:40:29] well if you wind up producing many dumps in parallel [16:40:37] you can get as much use out of them as you want [16:41:04] the html dumps are pretty quick, should be less than a day for all wikis [16:41:27] people will be very greedy about downloads, you probably haven't anticipated how eh... demanding the community can be ;-) [16:41:47] this is for just main ns yes? [16:41:54] what about folks who want all ns? [16:41:56] yes, only main [16:42:05] it's easy to dump the others too [16:42:14] any notions about speed for that? [16:42:23] but I guess it makes most sense to start just with main [16:42:26] oh sure [16:42:51] what's the ratio of main vs. non-main content? 2x?
[16:43:06] oh much bigger on large wikis I think [16:43:26] ok [16:43:38] we'll find out I guess ;) [16:44:05] only current main revisions across all wikipedias take up ~330G in Cassandra [16:44:22] English articles: 4,738,965 Total wiki pages: 35,295,678 [16:44:23] wn wp [16:44:32] compression helps, it's less than 1/5 of the raw input size [16:45:06] I am assuming that no one would have an interest in full history with html, I mean *why* [16:45:26] I think that eventually people will want that [16:45:34] it's quite a bit easier to work with than wikitext [16:45:39] well that will eat up the ret of your cpu then :-D [16:45:43] *rest [16:46:09] (03PS1) 10Yuvipanda: keyholder: Show error message when key can't be armed [puppet] - 10https://gerrit.wikimedia.org/r/195593 [16:46:26] apergos: it'll be a while before we get there [16:46:34] step by step [16:46:37] *nod* [16:47:09] let's plan initially for current revs of main ns and of all ns [16:47:17] with the ability to grow [16:47:39] yup [16:47:54] that's what restbase will have in storage too [16:48:01] all righty [16:48:13] you have a hw req task for this right? [16:48:20] for the restbase producer piece I mean [16:48:26] yes, https://phabricator.wikimedia.org/T91853 [16:49:08] feel like summarizing this there? then rob (if it's him) will be better informed about the actual needs [16:49:12] the only special need is really sufficiently large storage, otherwise any old misc spare should work [16:49:36] /cc robh ;) [16:49:45] :-D [16:49:57] * gwicke scans https://wikitech.wikimedia.org/wiki/Server_Spares [16:50:24] raid0 is yucky [16:50:28] why you guys want raid0?
[16:50:34] (the entire host dies when a disk dies) [16:51:00] robh: if we can get raid-1 with the same capacity then that'd be great too [16:51:02] I don't want raid0 [16:51:09] but redundancy in hosts is better than in disks [16:51:18] two hosts is not much redundancy [16:51:19] you want redundancy also in disks [16:51:33] right now you're considering dumps that take a day [16:51:38] sure, I'll happily take both [16:51:41] if we have to scramble to bring a host online when one dies, it's not really redundant. we prefer disks be raided so a disk death doesn't result in immediate onsite requirements [16:51:42] there will come the point where you are doing the full history runs [16:51:42] cool [16:51:47] (03PS17) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [16:51:49] (03PS2) 10Yuvipanda: keyholder: Show error message when key can't be armed [puppet] - 10https://gerrit.wikimedia.org/r/195593 [16:51:53] robh: "Dell PowerEdge R420, dual Intel Xeon E5-2450 v2 2.50GHz, 64GB Memory, (4) 3TB Disks" sounds interesting [16:52:02] om nom nom nom [16:52:21] do note on task if you like a particular misc system. you'll want to put in reasoning for dual cpu, etc... [16:52:28] other than 'fast is good' ;D [16:52:29] using raid1 (or 5, if you really can't afford the redundancy for the amount of storage) is pretty much a prerequisite if we're going to support something with minimal downtime/risk [16:52:34] a bit more than we need really, but seem to be the only spares with large storage currently [16:53:13] bblack: that assumes that our software sucks ;) [16:53:18] so that will be 6TB, let's see how large those all-ns dumps turn out to be [16:53:41] gwicke: no that assumes that hw breaks.
when you least desire it [16:53:48] bblack: agreed though that for most systems raid-1 is more appropriate [16:54:23] anyways one of those boxes would definitely get you started [16:54:39] yup [16:54:45] (03CR) 10Yuvipanda: [C: 032] keyholder: Show error message when key can't be armed [puppet] - 10https://gerrit.wikimedia.org/r/195593 (owner: 10Yuvipanda) [16:54:53] I love cassandra's distributed and fault-tolerant nature, but even in the cassandra world, node failures have costs, and it would probably be best to raid the storage [16:55:11] bblack: generally the advice I have seen is to have more nodes [16:55:18] especially as you involve more disks-per-node, the failure rate multiplies [16:55:48] having more nodes is good too, but doing basic due diligence to make the nodes reasonably reliable is worth it [16:57:01] it's a cost vs. reliability trade-off [16:57:04] if we were talking about something on the order of 100+ physical nodes with 1-2 disks each, I might lean the other way and say it's not worth raid [16:57:26] but if we're talking about e.g. 4-8 nodes per site and 4 disks per node, not so much [16:57:44] for a given amount of money, will I get more reliability out of having fewer nodes with raid-1 {rotating disks, ssds} or more nodes with raid-0? [16:58:09] unfortunately reliability for those tradeoffs is not a linear function. when the node counts are smaller, things come in stair-steps. [16:58:10] I think for rotating disks raid-1 is probably worth it [16:58:14] for ssds, less sure [16:58:56] SSDs fail too. the mirror also buys you a measure of protection against data error on the medium without the upper-layer software having to deal with it. [16:59:42] in the case of cassandra at least all data is checksummed anyway [17:00:04] maxsem, kaldari: Dear anthropoid, the time has come. Please deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150310T1700).
[17:00:35] with a handful of nodes, when a disk fails unredundantly and a node goes out, we're looking at perf impact while data is re-replicated at least. How many nodes do you want to tolerate being dead at once? [17:01:08] how fast does someone need to be at a datacenter getting that node back online with new hardware before the next one? [17:01:33] you can play with replication parameters [17:01:53] default is 3-way, but with more raid-0 nodes you might as well go 5-way [17:01:53] yeah, but how many nodes are we talking about here? [17:02:02] to put it in perspective, we raid everything except the mw systems [17:02:07] since we have hundreds of them. [17:02:13] <_joe_> gwicke: raid-0 is a bad idea for cassandra, IMO [17:02:31] rephrase: raid1 or greater everything [17:02:32] right, we don't raid the mass storage for varnish cache either, but there's 106 of those machines working in concert [17:02:42] yea, i forgot varnish, my bad [17:02:59] <_joe_> bblack: also, varnish is a proper cache, cassandra is a data store [17:02:59] from the docs: "Because data is stored in the memtable, generally RAID is not needed for the commit log disk, but if you need the extra redundancy, use RAID 1." [17:03:25] and "Use RAID0 if disk capacity is a bottleneck and rely on Cassandra's replication capabilities for disk failure tolerance." [17:03:44] gwicke: I totally agree with the sentiment, for a very large scale cluster [17:03:45] <_joe_> gwicke: we need it IMO. because rebalancing and replication will take a performance hit we would like to avoid as much as possible [17:03:51] but how many nodes are we talking about here?
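A back-of-the-envelope sketch of the two quantities argued about above: how many replicas can be down at once under 3-way versus 5-way replication when reads and writes use a majority quorum, and how the chance of losing a striped (raid-0) node grows with the number of disks. The 3% per-disk failure probability is a made-up illustrative figure, disks are assumed to fail independently, and the function names are hypothetical.

```python
def quorum(replication_factor):
    """Majority quorum size for a given replication factor."""
    return replication_factor // 2 + 1


def tolerated_replicas_down(replication_factor):
    """Replicas that can be unavailable while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


def stripe_loss_probability(p_disk, disks):
    """Chance that at least one disk in a stripe fails, which takes the whole node's data with it.

    Assumes independent failures with the same probability p_disk per disk.
    """
    return 1 - (1 - p_disk) ** disks


if __name__ == "__main__":
    for rf in (3, 5):
        print("rf={} quorum={} tolerated_down={}".format(
            rf, quorum(rf), tolerated_replicas_down(rf)))
    # Illustrative only: 0.03 is an assumed per-disk failure probability.
    for disks in (1, 2, 4):
        print("disks={} node_loss={:.4f}".format(
            disks, stripe_loss_probability(0.03, disks)))
```

Under these assumptions, moving from 3-way to 5-way replication raises the tolerated number of unavailable replicas from one to two, while a four-disk stripe is several times more likely to lose its data than a single disk.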
[17:03:51] it's a common setup to use raid-0 or jbod [17:04:05] <_joe_> bblack: ok, for maybe a 50-node cluster I may agree [17:04:40] <_joe_> the portion of the data to move will then be on the order of 3/50th of the total [17:04:52] I'd certainly go with raid-1 for a three-node cluster [17:05:21] <_joe_> where, given we keep 3 copies, any machine going down means replicating the whole dataset, yes [17:05:22] but I see the cut-over at maybe 6 or 10 nodes [17:05:45] _joe_: there is no data movement from a node going down [17:05:54] <_joe_> gwicke: coming back up? [17:06:09] <_joe_> or we plan to mutilate the cluster forever? [17:06:18] not directly either, unless it has been down for long (in which case you want to run a repair on it) [17:06:48] quorum reads implicitly bring nodes up to date [17:07:07] by issuing reads against all replicas & forwarding more up-to-date data to an out-of-date node [17:07:08] <_joe_> well, the data exchange between nodes is not going to be implicit [17:07:22] <_joe_> it wasn't last I tried at least [17:08:00] repair is the explicit and complete version of that, and is important as it also covers data that isn't read [17:08:02] <_joe_> maybe cassandra has evolved since 3 years ago so that data exchange when a node rejoins a cluster is not high at all - I don't exactly see how [17:09:15] there's two cases: 1) a node comes back without any storage, in which case it needs to bootstrap [17:09:34] that's not super-cheap, but the load is distributed across the entire cluster thanks to vnodes [17:09:35] <_joe_> if we're in raid0/jbod, that will most likely be the case [17:10:34] 2) it comes back with storage; if it was only down for a short time other nodes will then send it 'hints' (basically a summary of what it missed) to let it catch up quickly and efficiently [17:11:45] if it was down for longer it's recommended to run a repair [17:12:03] it comes back with sane storage, anyways.
if a node crashes, I don't think cassandra's doing barriered serialized fsyncs and all that, right? [17:12:15] which is slightly cheaper than a bootstrap, and also distributed across the cluster [17:12:33] the bootstrap scenario is what we face when a disk fails with no raid, in any case [17:13:30] cassandra afaik fsyncs the journal [17:13:59] so I guess it will know what it needs then [17:14:07] how often is configurable [17:14:42] in any case, so yeah this is a matter of picking a cluster size for the tradeoff [17:15:24] *nod* [17:15:35] I haven't seen the hardware specs you're looking at, but in general doubling the disks is cheaper than doubling the servers if this is the only tradeoff (it's not, of course: if doubling the servers is necessary for processing capacity and memory anyways, you're getting that for free) [17:15:49] it's one of those questions that could probably be settled empirically, given a large enough amount of hardware ;) [17:16:45] I'd say we can at least bound this to near your 10 number earlier. At that point we know the lost data/resiliency and increased load from outage/recovery is roughly an order of magnitude smaller than the normal runtime stuff. [17:17:05] it's not going to be the kind of major event it would be for e.g.
a 4-node cluster [17:17:15] yeah, agreed [17:17:16] personally, I'd push that number higher and not skip the raid until there are 10's of machines [17:17:52] that said, in my benchmarking the latency impact of a node running a repair has been very low [17:18:00] on a six-node cluster [17:18:36] (from an ops perspective, we like being in a position where one hardware fault is not only not an emergency, but not even a pressing concern - it's something that can be dealt with around working hours, schedules, regular vendor shipping, etc) [17:19:02] yup [17:19:24] it's just the question if more nodes or more disks gets us closer to that ;) [17:20:32] well there's probably a rough target for this in terms of necessary node count for cpu/mem. If that number is less than 10s of nodes, then IMHO putting double disks in those nodes is going to be cheaper than buying the excess nodes with fewer disks in all. [17:21:49] <_joe_> < bblack> personally, I'd push that number higher and not skip the raid until there are 10's of machines [17:21:52] <_joe_> +1 [17:22:44] I wonder if anybody has done the math on this [17:22:53] metoo (10's of machines) [17:23:05] depends a lot on machine costs too [17:23:16] it's not "more nodes or more disks", it's "more nodes AND more disks" [17:23:18] thcipriani, are you planning to deploy things in future? [17:23:26] I have many times in the past for other projects, so I'm just basing my wild proclamations on my gut feeling for the subject [17:23:36] but yes, numbers for this case today would help :) [17:23:37] Krenair: hoping to [17:23:56] <_joe_> gwicke: well my basic math tells you that in the case you need to bootstrap a node you will transfer roughly 3/N of your dataset [17:23:56] thcipriani, so you have shell access from ops, you just need the wmf-deployment group from the gerrit admins?
[17:24:05] <_joe_> where N is the number of nodes [17:24:13] (03PS1) 10Papaul: added asset tag infor for rbd2001-rbd2004 [dns] - 10https://gerrit.wikimedia.org/r/195597 [17:24:15] (03PS1) 10Papaul: added asset tag infor for rbd2001-rbd2004 [dns] - 10https://gerrit.wikimedia.org/r/195598 [17:24:17] <_joe_> but now sorry, I need to get off :) [17:24:47] and I need to get back on the task of mass cache reinstallation before I fall behind for the week. I've spent too much time talking today already! :) [17:24:57] Krenair: that's correct. [17:25:25] have to have someone around to do merges otherwise. [17:25:32] * gwicke gets back to work as well [17:26:05] bblack, _joe_: thanks for the chat though, fun topic [17:29:31] (03PS6) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 [17:30:13] (03PS7) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 [17:33:09] (03PS1) 10Matanya: librenms: Resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195599 [17:36:10] (03PS1) 10Matanya: clamav: Resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195601 [17:37:45] bblack: i'm about to disable cp1056 in pybal then. i had it disabled in role/cache earlier, (More than 25 min ago), it's one of the bits servers though [17:37:54] greg-g, can we make sure people who request deployment access get the appropriate gerrit permissions before their task gets closed? [17:38:18] i can fix that [17:38:19] (03PS8) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 [17:38:54] thcipriani: done [17:39:11] ori: neat. Thanks! [17:39:16] thcipriani: you may have to log out and log back in [17:39:36] kk, lemme check. [17:40:53] ori: yup, working. Thanks again. 
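[editor's note] The back-of-the-envelope figure _joe_ gives at 17:23:56 (a bootstrapping node transfers roughly 3/N of the dataset, with 3 copies kept and N nodes) can be sketched as below. This is only an illustration of that arithmetic, assuming data is spread evenly across nodes; the function name `bootstrap_fraction` and the sample cluster sizes are made up for the example and do not come from any Cassandra tooling.

```python
def bootstrap_fraction(nodes: int, replication_factor: int = 3) -> float:
    """Approximate share of the whole dataset one node holds.

    Assumes replicas are spread evenly, so each node stores about
    replication_factor / nodes of the data. The value is capped at 1.0:
    with no more nodes than copies, every node holds everything.
    """
    if nodes <= 0 or replication_factor <= 0:
        raise ValueError("nodes and replication_factor must be positive")
    return min(1.0, replication_factor / nodes)


if __name__ == "__main__":
    # Hypothetical cluster sizes taken from the numbers floated in the chat.
    for n in (3, 6, 10, 50):
        share = bootstrap_fraction(n)
        print(f"{n:>3} nodes: about {share:.0%} of the dataset to re-stream")
```

Under these assumptions a three-node cluster re-streams the whole dataset on a bootstrap, while a 50-node cluster moves about 3/50th of it, which matches the two cases mentioned in the discussion.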
[17:42:45] mutante: ok [17:43:09] (03PS1) 10BBlack: depool cp301[01] for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195602 [17:43:55] (03CR) 10BBlack: [C: 032 V: 032] depool cp301[01] for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195602 (owner: 10BBlack) [17:44:43] (03PS2) 10Yuvipanda: Tools: Fix XML output of qstat for webservice2 [puppet] - 10https://gerrit.wikimedia.org/r/195556 (https://phabricator.wikimedia.org/T92039) (owner: 10Tim Landscheidt) [17:45:59] (03CR) 10Yuvipanda: [C: 032] Tools: Fix XML output of qstat for webservice2 [puppet] - 10https://gerrit.wikimedia.org/r/195556 (https://phabricator.wikimedia.org/T92039) (owner: 10Tim Landscheidt) [17:49:36] godog: still around ? [17:50:09] matanya: he's on vacation until next monday [17:50:15] thanks ori [17:51:31] !log cp1056 - disabled in pybal, reboot to PXE for reinstall [17:51:36] Logged the message, Master [17:56:44] (03PS1) 10Matanya: swift_new: lint and resource quoting [puppet] - 10https://gerrit.wikimedia.org/r/195607 [17:57:14] wmf-reimage is cool [17:58:46] (03CR) 10Aude: [C: 04-1] "I realize this is mainly an import from the old puppet stuff we had on labs, but see comment..." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195567 (owner: 10JanZerebecki) [17:58:56] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [18:00:04] twentyafterfour, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150310T1800). 
[18:00:47] (03PS1) 10Matanya: system: quote strings [puppet] - 10https://gerrit.wikimedia.org/r/195611 [18:02:33] (03PS9) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 [18:02:41] (03CR) 10Dzahn: [C: 031] statsdlb: fix string containing only a variable [puppet] - 10https://gerrit.wikimedia.org/r/195534 (owner: 10Matanya) [18:03:28] (03CR) 10Dzahn: [C: 031] git: fix param order [puppet] - 10https://gerrit.wikimedia.org/r/195536 (owner: 10Matanya) [18:03:47] mutante: only +1's today? :P [18:06:01] yea, watching a reinstall of a bits server [18:06:38] it's tricky, you have to stare at it for a bit until the second reboot from the Lifecycle Controller thingy before telling it to PXE [18:06:51] that's the most annoying part of this process :/ [18:07:08] if it weren't for that, we could automate this much better and have wmf-reimage do the pxe+reboot interaction [18:07:10] yea, the "ESC Shift+2" hint was good, bblack [18:07:26] it's already so much better with wmf-reimage [18:07:35] i used to always do that manual [18:08:09] the -r flag is what tells wmf-reimage to not do pxe+reboot for you. 
if it weren't for the HT-enable step (which is the lifecycle thing in this odd case), wmf-reimage could do it all without you opening the console [18:08:29] there's some comments inside the script about needing to set up an env variable carefully first, though [18:08:50] (03PS1) 10Matanya: puppetception: Resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195613 [18:08:54] gotcha [18:09:22] saw the "server configurator" doing its thing, yeap [18:11:11] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:18:17] (03PS1) 10Matanya: limn: minor lint and Resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195616 [18:19:47] (03PS18) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [18:19:49] (03PS1) 10Yuvipanda: mediawiki: Make mwdeploy / l10nupdate pub keys hiera-able [puppet] - 10https://gerrit.wikimedia.org/r/195617 [18:21:20] (03CR) 10Dzahn: [C: 031] monitoring: selector outside a resource [puppet] - 10https://gerrit.wikimedia.org/r/195520 (owner: 10Matanya) [18:21:55] mutante: you know you can +2 as well :) [18:22:20] +2 means merge and babysit though [18:22:21] bblack: reinstall worked fine, how about the order of readding it, first pybal , then puppet? [18:22:50] I'd do puppet first then pybal personally, but it doesn't matter much either way [18:22:51] apergos: true. 
[18:23:13] YuviPanda: what apergos said, i was doing it on the side with the reinstall [18:23:20] bblack: ok [18:23:21] cool cool :) [18:23:52] (03PS1) 10Dzahn: Revert "depool cp1056 for reinstall" [puppet] - 10https://gerrit.wikimedia.org/r/195618 [18:24:16] "revert depool" = repool [18:25:20] a full revert kills the Jessie comment though [18:26:09] arr, indeed [18:27:08] !log starting the Tuesday "train" deployment [18:27:14] Logged the message, Master [18:28:07] (03PS2) 10Dzahn: repool cp1056 after jessie reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195618 [18:30:15] (03PS3) 10Dzahn: repool cp1056 after jessie reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195618 [18:40:06] (03CR) 10BBlack: [C: 031] repool cp1056 after jessie reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195618 (owner: 10Dzahn) [18:40:43] (03CR) 10Dzahn: [C: 032] repool cp1056 after jessie reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195618 (owner: 10Dzahn) [18:42:35] (03PS1) 10BBlack: repool cp3010 [puppet] - 10https://gerrit.wikimedia.org/r/195622 [18:42:54] (03CR) 10BBlack: [C: 032 V: 032] repool cp3010 [puppet] - 10https://gerrit.wikimedia.org/r/195622 (owner: 10BBlack) [18:44:41] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: puppet fail [18:45:48] ^ yea, not really [18:45:51] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:47:42] (03PS1) 10Legoktm: Manage user name blacklist (TitleBlacklist) from meta only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195623 (https://phabricator.wikimedia.org/T38939) [18:49:05] (03CR) 10Legoktm: "Needs announcements, time for people to migrate, etc." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195623 (https://phabricator.wikimedia.org/T38939) (owner: 10Legoktm) [18:49:37] (03CR) 10CSteipp: [C: 031] "No real difference in the security that I see." 
[puppet] - 10https://gerrit.wikimedia.org/r/195617 (owner: 10Yuvipanda) [18:49:50] (03CR) 10Jforrester: "> Needs announcements, time for people to migrate, etc." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195623 (https://phabricator.wikimedia.org/T38939) (owner: 10Legoktm) [18:52:16] mw devs, I would enjoy a review of https://gerrit.wikimedia.org/r/#/c/195472/ — would be nice to roll out in tonight or tomorrow’s scap [18:56:16] (03PS1) 1020after4: Group1 wikis to 1.25wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195624 [18:57:40] (03CR) 10Chmarkine: [C: 031] "@Mark Bergsma: RackTables requires authentication, so it can be considered as sensitive. Also Racktables is already HTTPS only (i.e. http " [puppet] - 10https://gerrit.wikimedia.org/r/195444 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:58:23] (03Abandoned) 10Thcipriani: Add version to mariadb package resource [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/195328 (owner: 10Thcipriani) [18:59:51] (03CR) 10RobH: [C: 04-1] "Please combine with the change in patchset 195597, the forward and reverse file entry changes should be in the same patchset." [dns] - 10https://gerrit.wikimedia.org/r/195598 (owner: 10Papaul) [19:00:07] (03CR) 10RobH: [C: 04-1] "Please combine with the change in patchset 195598, the forward and reverse file entry changes should be in the same patchset." [dns] - 10https://gerrit.wikimedia.org/r/195597 (owner: 10Papaul) [19:00:31] (03CR) 1020after4: [C: 032] Group1 wikis to 1.25wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195624 (owner: 1020after4) [19:00:35] andrewbogott: Do you know how to prep that patch for SWAT or do you need a deployer buddy to help you out? 
[19:00:36] (03Merged) 10jenkins-bot: Group1 wikis to 1.25wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195624 (owner: 1020after4) [19:00:54] (03PS1) 10Matanya: dynamicproxy: resource attributes quotin [puppet] - 10https://gerrit.wikimedia.org/r/195627 [19:02:21] bd808: Need help! [19:02:31] but tomorrow AM is a safer bet than tonight since I won’t be around for long. [19:02:33] (03CR) 10RobH: [C: 031] "also dont forget to make a revocation task for cert from vendor" [puppet] - 10https://gerrit.wikimedia.org/r/195303 (https://phabricator.wikimedia.org/T92045) (owner: 10Dzahn) [19:03:31] (03CR) 10RobH: [C: 031] added mw2149-mw2214 [puppet] - 10https://gerrit.wikimedia.org/r/195365 (owner: 10Papaul) [19:10:07] (03CR) 10Keegan: "What all needs migrating? How much duplication is out there? How much needs added to meta?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195623 (https://phabricator.wikimedia.org/T38939) (owner: 10Legoktm) [19:12:00] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:16:52] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Puppet last ran 11 days ago [19:18:01] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [19:18:18] (03CR) 10Nemo bis: [C: 031] "Did not test the syntax, but ok. See https://phabricator.wikimedia.org/T38939#1105838" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195623 (https://phabricator.wikimedia.org/T38939) (owner: 10Legoktm) [19:18:30] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [19:18:43] !log re-enabled puppet on cerium, xenon and praseodymium [19:18:49] Logged the message, Master [19:18:49] (03CR) 10RobH: [C: 032] "discussed with papaul via irc, its not worth effort to combine these now since they are small and easily reviewed. 
So merging them." [dns] - 10https://gerrit.wikimedia.org/r/195597 (owner: 10Papaul) [19:19:01] (03CR) 10RobH: [C: 032] "discussed with papaul via irc, its not worth effort to combine these now since they are small and easily reviewed. So merging them." [dns] - 10https://gerrit.wikimedia.org/r/195598 (owner: 10Papaul) [19:19:02] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Puppet last ran 11 days ago [19:19:11] (03PS1) 10Dzahn: depool cp1057 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195632 (https://phabricator.wikimedia.org/T86648) [19:20:20] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [19:22:32] (03CR) 10Hoo man: "I'd give people 2 weeks to move all reasonable rules (and only those) to meta." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195623 (https://phabricator.wikimedia.org/T38939) (owner: 10Legoktm) [19:22:37] (03CR) 10Dzahn: [C: 032] depool cp1057 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195632 (https://phabricator.wikimedia.org/T86648) (owner: 10Dzahn) [19:25:16] (03CR) 10Keegan: [C: 031] "Never mind, I just caught up on the bug comments. I'll put a line in Tech/News, at least, right now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195623 (https://phabricator.wikimedia.org/T38939) (owner: 10Legoktm) [19:28:19] (03CR) 10Dzahn: [C: 04-2] "has to wait until https://svn.wikimedia.org/ is retired or proxied. 
it's true that the cert is already expired anyways but if we delete it" [puppet] - 10https://gerrit.wikimedia.org/r/195310 (https://phabricator.wikimedia.org/T88731) (owner: 10Dzahn) [19:28:38] (03Abandoned) 10Dzahn: delete svn.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/195310 (https://phabricator.wikimedia.org/T88731) (owner: 10Dzahn) [19:29:00] (03PS1) 10BBlack: cp3011 disabled T92306 [puppet] - 10https://gerrit.wikimedia.org/r/195637 [19:29:26] (03CR) 10BBlack: [C: 032 V: 032] cp3011 disabled T92306 [puppet] - 10https://gerrit.wikimedia.org/r/195637 (owner: 10BBlack) [19:32:36] (03CR) 10Keegan: "https://meta.wikimedia.org/w/index.php?title=Tech%2FNews%2F2015%2F12&diff=11516423&oldid=11505530" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195623 (https://phabricator.wikimedia.org/T38939) (owner: 10Legoktm) [19:35:50] (03PS2) 10RobH: added mw2149-mw2214 [puppet] - 10https://gerrit.wikimedia.org/r/195365 (owner: 10Papaul) [19:37:20] (03CR) 10RobH: [C: 032] added mw2149-mw2214 [puppet] - 10https://gerrit.wikimedia.org/r/195365 (owner: 10Papaul) [19:39:41] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [19:40:31] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [19:42:39] (03CR) 10John F. Lewis: "It was not intended to. Last I checked planet.wm.o was an apache redirect to https://meta.wikimedia.org/wiki/Planet_Wikimedia though I may" [puppet] - 10https://gerrit.wikimedia.org/r/181419 (https://phabricator.wikimedia.org/T60048) (owner: 10John F. 
Lewis) [19:44:31] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group1 to 1.25wmf20 [19:44:33] (03CR) 10Legoktm: "The following wikis have local rules: https://phabricator.wikimedia.org/P384" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195623 (https://phabricator.wikimedia.org/T38939) (owner: 10Legoktm) [19:44:36] Logged the message, Master [19:44:37] bblack: so you had a pending commit on palladium [19:44:42] its now merged with mine [19:44:51] (03PS2) 10Legoktm: Manage user name blacklist (TitleBlacklist) from meta only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195623 (https://phabricator.wikimedia.org/T38939) [19:44:59] cp3011 disabled T92306 [19:45:00] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [19:45:12] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [19:45:54] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group1 to 1.25wmf20 for real this time [19:45:59] Logged the message, Master [19:46:20] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[19:48:15] (03PS10) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) [19:49:37] (03PS11) 10JanZerebecki: Wikidata builder [puppet] - 10https://gerrit.wikimedia.org/r/195567 (https://phabricator.wikimedia.org/T90567) [19:50:42] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: puppet fail [19:52:51] (03PS1) 10Ottomata: Fix dependency cycle in archiva role [puppet] - 10https://gerrit.wikimedia.org/r/195643 [19:53:10] (03PS2) 10Ottomata: Fix dependency cycle in archiva role [puppet] - 10https://gerrit.wikimedia.org/r/195643 [19:55:01] (03CR) 10Ottomata: [C: 032] Fix dependency cycle in archiva role [puppet] - 10https://gerrit.wikimedia.org/r/195643 (owner: 10Ottomata) [19:55:05] (03PS1) 10Matanya: diamond: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195644 [19:57:20] YuviPanda: feel like merging some stuff ? :) [20:00:05] (03PS1) 10John F. Lewis: varnish: catch planet.wm.o as well [puppet] - 10https://gerrit.wikimedia.org/r/195646 [20:00:41] (03CR) 10John F. Lewis: "https://gerrit.wikimedia.org/r/#/c/195646/" [puppet] - 10https://gerrit.wikimedia.org/r/181419 (https://phabricator.wikimedia.org/T60048) (owner: 10John F. Lewis) [20:10:00] matanya: I believe yuvi is gone to bed. [20:10:26] !log finished train deployment, logs look ok [20:10:27] yeah, i was just playing around with him :) [20:10:32] Logged the message, Master [20:11:21] ARGH [20:11:26] i just broke dhcp on carbon. [20:11:32] =P [20:11:51] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:11:52] robh: amazing :D [20:11:54] heh, i literally just rebooted a server to use it [20:12:06] hit F12 [20:12:07] well, my change seems ok, so debuggin gnow [20:14:51] ahh, i see [20:14:54] i lack quoting, damn it [20:16:12] mutante: want to bring back grrit-wm ? 
[20:17:05] mutante: ok, carbon is fixed now, you can redo your install pxe boot [20:17:38] robh: cool, thanks [20:17:47] matanya: i'll look [20:17:52] JohnFLewis: it changed :p [20:18:00] thanks [20:18:29] mutante: what did? :p [20:18:54] matanya: eh, it runs in toollabs [20:19:06] matanya: i'll switch to -labs [20:20:02] JohnFLewis: the title of the static BZ index.html .should be [20:20:32] matanya: i dunno about this: " [20:20:32] Grrrit-wm is managed by Bigbrother, so manual restarts should not be necessary [20:20:44] mutante: yeah see it now :D [20:21:03] mutante: bigbrother should bring him back up [20:21:12] but it looks like it fails [20:21:24] * matanya doesn't have enough rights to check logs [20:21:25] mutante: bigbrother acts like one. It acts like it there for you, but never does anything for you [20:21:50] hah [20:25:48] and now it's T92313 because somebody disabled it [20:27:33] (03PS1) 10RobH: setting rbd2001-2004 install params [puppet] - 10https://gerrit.wikimedia.org/r/195648 [20:27:41] (03CR) 10RobH: [C: 032] setting rbd2001-2004 install params [puppet] - 10https://gerrit.wikimedia.org/r/195648 (owner: 10RobH) [20:27:51] (03PS1) 10Matanya: bastionhost: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195650 [20:27:57] (03PS1) 10John F. 
Lewis: bz: change index.html title [puppet] - 10https://gerrit.wikimedia.org/r/195651 (https://phabricator.wikimedia.org/T1198) [20:28:01] (03PS1) 10Matanya: haproxy: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195652 [20:28:05] (03PS1) 10RobH: rbd2001-2004 entries lacked quotes, caused dhcp break [puppet] - 10https://gerrit.wikimedia.org/r/195653 [20:28:09] it replays its brain now [20:28:11] (03CR) 10RobH: [C: 032 V: 032] rbd2001-2004 entries lacked quotes, caused dhcp break [puppet] - 10https://gerrit.wikimedia.org/r/195653 (owner: 10RobH) [20:28:13] (03CR) 10Dzahn: [C: 032] bz: change index.html title [puppet] - 10https://gerrit.wikimedia.org/r/195651 (https://phabricator.wikimedia.org/T1198) (owner: 10John F. Lewis) [20:28:20] i didnt miss you grrit-em [20:28:21] (03PS1) 10Rush: admin module enable user cleanup [puppet] - 10https://gerrit.wikimedia.org/r/195656 [20:28:23] (03PS1) 1020after4: Fatalmonitor: Remove 'repeated N times: ' to collapse error messages [puppet] - 10https://gerrit.wikimedia.org/r/195657 [20:28:25] (03CR) 10Andrew Bogott: "This looks fine but I'd like to see a couple more responses to that email thread (or a couple of days of silence) before we implement." 
[puppet] - 10https://gerrit.wikimedia.org/r/195650 (owner: 10Matanya) [20:28:29] (03PS1) 10Matanya: reprepro: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195658 [20:31:48] !log cp1057 - disabled in pybal, reinstalling [20:31:59] Logged the message, Master [20:33:56] (03PS1) 10Matanya: puppet_compiler: resource attributes quoting and minor lints [puppet] - 10https://gerrit.wikimedia.org/r/195660 [20:35:40] (03CR) 10Rush: [C: 04-1] diamond: resource attributes quoting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195644 (owner: 10Matanya) [20:36:36] (03CR) 10Matanya: diamond: resource attributes quoting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195644 (owner: 10Matanya) [20:37:29] (03PS2) 10Matanya: diamond: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195644 [20:43:15] what is up grrit-wm? [20:53:34] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: /a 74670 MB (3% inode=99%): [21:16:30] !log cp1057 - repooled, all bits eqiad are jessie now [21:16:37] Logged the message, Master [21:25:36] mutante, if you have a moment for a quick restbase tweak: https://gerrit.wikimedia.org/r/#/c/195681/ [21:25:52] removes a dead (fictional I believe) wiki from the restbase config [21:26:03] /cc robh [21:27:37] !log erased some api-feature-usage.logs from fluorine to make breathing room; merged a patch that will purge _all_ such logs older than 90 days. [21:27:38] gwicke: dead, yeah, fictional - well, all MW config mentions it :P [21:27:43] Logged the message, Master [21:28:10] JohnFLewis: if my memory serves me right it's the fictional wiki that's used by job runners [21:28:19] it used to be a real wiki... [21:29:20] https://meta.wikimedia.org/wiki/Proposals_for_closing_projects/Closure_of_Afar_Wikipedia sounds like it existed at some point [21:30:35] in any case, it's dead for all practical purposes [21:35:40] and rather prominent in http://rest.wikimedia.org/.. 
[21:45:02] gwicke: for the rest portal, is it described somewhere how revdel and/or supressed revisions are handled ? [21:47:00] thedj: https://phabricator.wikimedia.org/T76165 [21:47:46] we basically track revision deletions / suppressions from core through the task queue & check the status on each request before returning content [21:48:41] there is a small asterisk there in that we are just enabling those updates [21:49:40] gwicke: thank u, wanted to be able to respond to any concerns contributors might have. [21:50:22] *nod* [22:05:08] what happened to grrrit-wm? [22:05:42] it relies on redis in labs and that has a problem [22:05:44] died bblack [22:06:29] * bblack arranges a wake for it. We shall not mourn the loss of notifications, but instead celebtration the many notifications it gave in life! [22:06:48] and I shall drink more coffee before I try to not misspell anything more important than that :P [22:26:33] (03PS1) 10Matanya: locales: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195665 [22:26:37] (03PS2) 10Krinkle: locales: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195665 (https://phabricator.wikimedia.org/T91908) (owner: 10Matanya) [22:27:37] (03CR) 10BryanDavis: [C: 031] "The only thing that would be better would be to add logic that can expand the N times into N lines so the counts reflect the syslog messag" [puppet] - 10https://gerrit.wikimedia.org/r/195657 (owner: 1020after4) [22:27:39] (03PS1) 10RobH: rbd2001-2004 dns entries [dns] - 10https://gerrit.wikimedia.org/r/195671 [22:27:43] (03PS1) 10Andrew Bogott: Reduce lifetime of api logs to 20 days. [puppet] - 10https://gerrit.wikimedia.org/r/195673 [22:27:49] (03CR) 10RobH: [C: 032] rbd2001-2004 dns entries [dns] - 10https://gerrit.wikimedia.org/r/195671 (owner: 10RobH) [22:27:59] (03Abandoned) 10Andrew Bogott: Reduce lifetime of api logs to 20 days. 
[puppet] - 10https://gerrit.wikimedia.org/r/195673 (owner: 10Andrew Bogott) [22:28:04] (03PS1) 10Andrew Bogott: Purge api-feature-usage logs older than 90 days. [puppet] - 10https://gerrit.wikimedia.org/r/195677 [22:28:13] (03PS2) 10Andrew Bogott: Purge api-feature-usage logs older than 90 days. [puppet] - 10https://gerrit.wikimedia.org/r/195677 [22:28:15] (03PS1) 10Dzahn: depool cp1054 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195679 [22:28:19] (03CR) 10MaxSem: [C: 031] Purge api-feature-usage logs older than 90 days. [puppet] - 10https://gerrit.wikimedia.org/r/195677 (owner: 10Andrew Bogott) [22:28:21] (03PS1) 10Matanya: scap: lint and resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195680 [22:28:23] (03CR) 10Anomie: [C: 031] Purge api-feature-usage logs older than 90 days. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195677 (owner: 10Andrew Bogott) [22:28:25] (03CR) 10Ori.livneh: [C: 031] Purge api-feature-usage logs older than 90 days. [puppet] - 10https://gerrit.wikimedia.org/r/195677 (owner: 10Andrew Bogott) [22:28:27] (03CR) 10Dzahn: [C: 032] depool cp1054 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/195679 (owner: 10Dzahn) [22:28:32] gj grrrit-wm [22:28:46] spammer [22:28:53] (03PS1) 10Matanya: extdist: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195743 [22:29:00] (03CR) 10Dzahn: "are you sure? 
aawiki is always the example for a lot of scripts, see" [puppet] - 10https://gerrit.wikimedia.org/r/195681 (owner: 10GWicke) [22:29:09] (03PS1) 10Matanya: zuul: lint + resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195769 [22:29:21] (03PS1) 10Matanya: motd: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195770 [22:29:32] (03PS3) 10BBlack: Enable HSTS on dev.wm.org max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/195338 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [22:29:34] (03CR) 10BBlack: [C: 032 V: 032] Enable HSTS on dev.wm.org max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/195338 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [22:29:46] (03CR) 10Dzahn: "here's a diff between the WP languages in this config yaml and in DNS in the "langs" helper file from which the wikipedia.org zone is crea" [puppet] - 10https://gerrit.wikimedia.org/r/195681 (owner: 10GWicke) [22:32:08] (03PS1) 10Dzahn: restbase: add missing wikipedia domains [puppet] - 10https://gerrit.wikimedia.org/r/195778 [22:34:00] (03PS1) 10Thcipriani: Ensure apt update before sql libraries install [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/195779 [22:35:09] (03PS2) 10Thcipriani: Ensure apt update before sql libraries install [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/195779 (https://phabricator.wikimedia.org/T91545) [22:37:09] (03CR) 10BBlack: [C: 031] "I tend to agree we should HSTS these kinds of things. If nothing else, there's the dogfood argument." [puppet] - 10https://gerrit.wikimedia.org/r/195444 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [23:00:04] RoanKattouw, ^d, Krenair, ebernhardson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150310T2300). [23:00:29] looks like only my patches listed in wikitech? 
i'll just start merging then
[23:00:40] * gwicke has a patch to get out
[23:00:44] you deploying them, ebernhardson?
[23:00:52] Krenair: yea
[23:00:54] ok
[23:00:58] gwicke: add it to the list and i'll do it after
[23:01:46] ebernhardson: will do
[23:01:48] thx!
[23:03:18] what the ef, it's 4pm
[23:06:15] ebernhardson: I have two extension updates, should I create gerrit patches for the branch merges?
[23:06:47] gwicke: makes it easier, yes. but either way. I think most people leave it up to the swat deployer to make the core extension bump's
[23:07:17] ebernhardson: that would be sweet, as I have a meeting in a few minutes
[23:07:23] gwicke: sure, its easy enough
[23:07:39] ebernhardson: cool, thx!
[23:07:50] it's only enabled on test.wikipedia.org, so fail-safe
[23:16:58] !log ebernhardson Synchronized php-1.25wmf20/extensions/Flow: Bump flow submodule in 1.25wmf20 for SWAT (duration: 00m 08s)
[23:17:08] Logged the message, Master
[23:19:49] !log ebernhardson Synchronized php-1.25wmf19/extensions/Flow: Bump flow submodule in 1.25wmf19 for SWAT (duration: 00m 07s)
[23:19:54] Logged the message, Master
[23:23:47] PROBLEM - puppet last run on amssq43 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:27:38] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[23:29:18] where's the gerrit bot at?
[23:29:28] ugh
[23:29:41] damn you gerrit, and damn you puppet-merge for missing strontium :P
[23:30:25] !log ebernhardson Synchronized php-1.25wmf20/extensions/RestBaseUpdateJobs: Update RestBaseUpdateJobs to master in 1.25wmf20 (duration: 00m 06s)
[23:30:30] Logged the message, Master
[23:30:58] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[23:31:41] !log ebernhardson Synchronized php-1.25wmf19/extensions/RestBaseUpdateJobs/: Update RestBaseUpdateJobs to master in 1.25wmf19 (duration: 00m 09s)
[23:31:43] gwicke: your patches are deployed
[23:31:46] Logged the message, Master
[23:32:02] bd808: it's redis' in labs that's broken
[23:32:35] mutante: ah. no feed then. got it
[23:32:37] RECOVERY - puppet last run on amssq43 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures