[00:01:37] PROBLEM - Varnish HTTP mobile-frontend on cp3012 is CRITICAL: Connection refused [00:03:17] RECOVERY - Varnish HTTP mobile-frontend on cp3012 is OK: HTTP OK: HTTP/1.1 200 OK - 285 bytes in 0.191 second response time [00:03:40] (03CR) 1020after4: "@greg: because gerrit sucks at merging? I'm not sure but rebasing wasn't enough, it'll have to be resubmitted fresh I guess (and then acte" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188388 (https://phabricator.wikimedia.org/T75905) (owner: 10Reedy) [00:04:17] (03PS1) 10BBlack: T86663 4.2: repool cp301[26] [puppet] - 10https://gerrit.wikimedia.org/r/202962 [00:04:48] (03CR) 10BBlack: [C: 032 V: 032] T86663 4.2: repool cp301[26] [puppet] - 10https://gerrit.wikimedia.org/r/202962 (owner: 10BBlack) [00:20:44] <_2_josselin> jvsvsbsjso [00:24:40] was that the trigger code to waken our sleeper agents or something? [00:25:06] * YuviPanda silently does bad things to our code. [00:27:03] (03PS1) 10BBlack: T86663 4.3: pool 3044; depool 3017,3013,amssq4[78] [puppet] - 10https://gerrit.wikimedia.org/r/202964 [00:27:57] (03CR) 10BBlack: [C: 032 V: 032] T86663 4.3: pool 3044; depool 3017,3013,amssq4[78] [puppet] - 10https://gerrit.wikimedia.org/r/202964 (owner: 10BBlack) [00:29:49] (03PS1) 10Bmansurov: Enable CirrusSearch event logging in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202965 [00:35:07] (03PS1) 10Catrope: Revert "Make VisualEditor access RESTbase directly on Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202966 [00:35:17] (03CR) 10Catrope: [C: 032] Revert "Make VisualEditor access RESTbase directly on Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202966 (owner: 10Catrope) [00:36:01] kaldari: not sure if you saw my comment related to your patch. 
just wanted to remind [00:36:30] oops, wrong channel [00:37:22] (03PS1) 10BBlack: T86663 4.3: shuffle roles, no repool yet [puppet] - 10https://gerrit.wikimedia.org/r/202967 [00:38:33] (03CR) 10BBlack: [C: 032 V: 032] T86663 4.3: shuffle roles, no repool yet [puppet] - 10https://gerrit.wikimedia.org/r/202967 (owner: 10BBlack) [00:43:43] PROBLEM - Varnish HTTP upload-frontend on cp3017 is CRITICAL: Connection refused [00:45:12] (03Merged) 10jenkins-bot: Revert "Make VisualEditor access RESTbase directly on Wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202966 (owner: 10Catrope) [00:45:24] RECOVERY - Varnish HTTP upload-frontend on cp3017 is OK: HTTP OK: HTTP/1.1 200 OK - 348 bytes in 0.183 second response time [00:45:41] (03PS3) 10Dzahn: dumps: move hiera data to new location [puppet] - 10https://gerrit.wikimedia.org/r/202637 [00:46:02] !log catrope Synchronized wmf-config/InitialiseSettings.php: Revert direct RESTbase for non-enwiki Wikipedias (duration: 00m 12s) [00:46:11] Logged the message, Master [00:47:16] (03PS1) 10BBlack: T86663 4.3: repool cp301[37] [puppet] - 10https://gerrit.wikimedia.org/r/202969 [00:47:40] (03CR) 10BBlack: [C: 032 V: 032] T86663 4.3: repool cp301[37] [puppet] - 10https://gerrit.wikimedia.org/r/202969 (owner: 10BBlack) [00:47:47] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up a generic API base path to be used by action & REST APIs - https://phabricator.wikimedia.org/T95229#1192764 (10GWicke) p:5Normal>3High [00:53:29] (03PS1) 10BBlack: T86663 4.4: pool 3045; depool 3018,3014,amssq49-50 [puppet] - 10https://gerrit.wikimedia.org/r/202970 [00:53:31] (03CR) 10Dzahn: [C: 032] "not used yet but dumps cant do role-based lookup yet" [puppet] - 10https://gerrit.wikimedia.org/r/202637 (owner: 10Dzahn) [00:53:48] (03PS2) 10BBlack: T86663 4.4: pool 3045; depool 3018,3014,amssq49-50 [puppet] - 10https://gerrit.wikimedia.org/r/202970 [00:54:05] (03CR) 10BBlack: [C: 032 V: 032] T86663 4.4: pool 3045; depool 
3018,3014,amssq49-50 [puppet] - 10https://gerrit.wikimedia.org/r/202970 (owner: 10BBlack) [00:56:31] (03PS1) 10Dzahn: Revert "Revert "dumps: ferm service for rsyncd clients using hiera"" [puppet] - 10https://gerrit.wikimedia.org/r/202971 [00:56:42] (03CR) 10jenkins-bot: [V: 04-1] Revert "Revert "dumps: ferm service for rsyncd clients using hiera"" [puppet] - 10https://gerrit.wikimedia.org/r/202971 (owner: 10Dzahn) [00:57:06] nice, -1 on a revert :) [00:57:22] :D [00:59:09] a revert revert :p [00:59:39] and path conflict, yay [00:59:44] PROBLEM - puppet last run on mw2001 is CRITICAL: CRITICAL: puppet fail [01:01:02] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:02:18] (03PS1) 10BBlack: T86663 4.4: shuffle roles, no repool yet [puppet] - 10https://gerrit.wikimedia.org/r/202972 [01:02:55] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up a generic API base path to be used by action & REST APIs - https://phabricator.wikimedia.org/T95229#1192818 (10GWicke) I think we just observed the fairly dramatic impact of using a separate domain on cold load latencies: {F110247} At ~14:10 P... 
[01:02:56] (03CR) 10BBlack: [C: 032 V: 032] T86663 4.4: shuffle roles, no repool yet [puppet] - 10https://gerrit.wikimedia.org/r/202972 (owner: 10BBlack) [01:04:35] (03PS2) 10Dzahn: logstash: lint [puppet] - 10https://gerrit.wikimedia.org/r/202651 [01:07:21] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60602 bytes in 0.265 second response time [01:09:02] (03PS1) 10BBlack: T86663 4.4: repool cp301[48] [puppet] - 10https://gerrit.wikimedia.org/r/202973 [01:09:20] ori, bblack, RoanKattouw: https://phabricator.wikimedia.org/T95229#1192818 [01:09:24] (03CR) 10BBlack: [C: 032 V: 032] T86663 4.4: repool cp301[48] [puppet] - 10https://gerrit.wikimedia.org/r/202973 (owner: 10BBlack) [01:11:38] gwicke: why not (a) add a dns-prefetch directive, (b) have a javascript snippet make a small https request if the page is idle and set a flag in sessionStorage? [01:11:51] having everything under the same domain would be a better solution [01:11:53] i don't dispute that [01:12:05] but you can't have that today, whereas you can have what i suggested today [01:12:32] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:13:07] ori: either way requires code changes, testing and deployment [01:13:27] actually, easiest thing to do would be [01:13:37] (03PS1) 10BBlack: resync cache node list hieradata [puppet] - 10https://gerrit.wikimedia.org/r/202974 [01:13:38] which one takes longer is hard to tell [01:13:58] a single VCL rule is not *that* complicated either [01:14:28] it is during a varnish migration week [01:14:32] well, need to configure a backend too [01:15:01] instead of dns prefetch, just add an <img> to the page html [01:15:15] from, say, a hook handler in WikimediaEvents [01:15:29] for logged in users on wikis where restbase is enabled (or for en for users that set the preference) [01:15:54] that all sounds more like a hack tbh [01:15:54] the image will be cached, so it won't be requested repeatedly, and it'll 
take care of both name resolution and tls handshake [01:16:20] 99.9% of those connections would not end up being used, they'd just use battery power [01:16:48] (03PS1) 10Gage: ipsec-global: fix bug in non-verbose mode, exit if not root [puppet] - 10https://gerrit.wikimedia.org/r/202975 [01:16:54] and the connection might already be closed by the time VE is activated, especially with caching [01:17:12] RECOVERY - puppet last run on mw2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:17:14] you can set the cache headers to whatever you want on that image [01:18:01] I think I'd prefer to try to do this in varnish if we can get that done ~2 weeks from now [01:18:17] *shrug* suit yourself [01:18:24] we are not in a huge rush [01:19:05] I appreciate your creativity and will to get this done *right now* [01:19:21] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60602 bytes in 0.350 second response time [01:20:15] ori: Roan rolled back the config change, so we are back to PHP API loading, which isn't great but also not totally horrible [01:20:53] (03CR) 10Yuvipanda: "(commenting only on style issues + the python)" [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren) [01:21:38] i just don't see why you're choosing instead of just doing it the way we can now and the way you want later [01:22:09] all those arguments about the restbase connection already being closed before VE is activated, would apply to using the wiki hostname as well (but I think it would still be open in either case commonly, FWIW) [01:22:33] also, either way you've also gotten a sessionid cached as well, so even if the TCP is closed, the resumption will be faster than a fresh TLS [01:22:41] when a user clicks around the wiki connection would see some activity [01:23:03] well sure but so would the img src="//restbase [01:23:15] if we disallowed client-side caching then yes, that would 
generate one request for each page load [01:24:23] it's doable, but it would create ~20-30k req/s to speed up 0.7 req/s [01:24:24] I guess figure out the typical/default SPDY connection timeout for inactivity, set the image cache headers just under that? [01:25:11] (03PS2) 10Gage: ipsec-global: fix bug in non-verbose mode, exit if not root [puppet] - 10https://gerrit.wikimedia.org/r/202975 (https://phabricator.wikimedia.org/T88536) [01:25:35] if you are concerned about battery power or whatever you can use an HTML5 prefetch link tag too, <link rel="prefetch">. the prefetch directive utilizes browser idle time, and you're not being tricky either, because "The prefetch keyword indicates that preemptively fetching and caching the specified resource is likely to be beneficial, as it is highly likely that the user will require this [01:25:35] resource." [01:26:17] power consumption is the same but you're being polite and simply indicating that there is a likelihood prefetching will be beneficial rather than forcing the request. [01:26:18] bblack: how complex do you expect setting up a path in Varnish to be, and how does it fit with your timeline re Varnish migration? [01:27:04] it's generally nice not to do too many risky varnish changes at once. right now everything's stalled on the migrations, which will probably finish early next week if no hangups. [01:27:26] but "everything that's stalled/backlogged for varnish" has become a rather lengthy list over the past month or two [01:27:44] you are in demand ;) [01:27:48] (03CR) 10Chad: "It's a binary file. Even a few bytes difference and it won't be able to merge :\" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188388 (https://phabricator.wikimedia.org/T75905) (owner: 10Reedy) [01:28:01] but I don't think the change for restbase backend based on path will be very risky, either. 
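The connection warm-up ideas in the exchange above (a dns-prefetch directive, a prefetch link) and the cost estimate of ~20-30k req/s spent to speed up 0.7 req/s can be sketched roughly as follows. This is only an illustration, not code from MediaWiki or WikimediaEvents: the host name `rest.example.org`, the probe path, and both function names are hypothetical.

```python
from html import escape
from typing import List


def warmup_hints(host: str, probe_path: str = "/favicon.ico") -> List[str]:
    """Build <link> hints asking the browser to resolve and contact `host` early.

    `host` and `probe_path` are placeholders; the log does not name the real ones.
    """
    origin = "//" + host
    return [
        # dns-prefetch only resolves the name
        '<link rel="dns-prefetch" href="%s">' % escape(origin, quote=True),
        # prefetch fetches a small resource during idle time, which also
        # performs the TCP and TLS handshakes
        '<link rel="prefetch" href="%s">' % escape(origin + probe_path, quote=True),
    ]


def amplification(warmup_req_per_s: float, useful_req_per_s: float) -> float:
    """How many warm-up requests are spent per request that actually benefits."""
    return warmup_req_per_s / useful_req_per_s


if __name__ == "__main__":
    for tag in warmup_hints("rest.example.org"):
        print(tag)
    # Lower bound of the figures quoted in the discussion: 20k req/s vs 0.7 req/s
    print(round(amplification(20000, 0.7)))
```

With the lower figure from the chat, roughly 28,500 warm-up requests would be issued for every request that benefits, which is the objection raised against disabling client-side caching of the probe.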
[01:28:25] there are security implications to sharing an origin [01:28:34] separation is conceptually nicer imo [01:28:43] well yeah, there is that [01:28:46] better to share an ip but different hostname [01:28:52] right now the content is loaded from the same origin [01:29:05] through the PHP API [01:29:20] is this sharing a domain for multiple services? [01:29:34] we can do "diff hostname, same IP" type stuff in cases like these and pick up SPDY connection reuse, probably. [01:29:53] but, in the future, not now [01:30:00] we need certificate changes for that, at the very least. [01:30:09] csteipp: it's about proxying a path to restbase [01:30:28] csteipp: https://phabricator.wikimedia.org/T95229 [01:30:43] e.g. https://en.wikipedia.org/_/rest/v1/.... -> restbase backend rather than wikiappserver backend, in varnish [01:30:58] csteipp: just noticed that I hadn't cc'ed you yet, now fixed [01:31:39] (the cert changes is so the same TLS session can be valid for *.wikipedia.org + *.wikimedia.org at the same time) [01:32:51] initially we'd drop cookies from the backend request [01:34:09] gwicke: I can appreciate there's a challenge going to another domain name, but you're throwing out a lot of security benefit and adding a lot of risk. Services were supposed to do the opposite. [01:35:19] so far we don't have any of that security benefit [01:35:44] what is the sec issue here anyways? are not restbase APIs + the wiki both our control and not user content? [01:36:20] bblack: They are, it just means that xss there has full access to the projects. [01:36:24] they are user content, but we control the sanitizer [01:36:41] And sanitizing is hard :) [01:37:07] what are we sanitizing? I guess I don't really get the issue here. XSS via wiki edit -> restbase access? 
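The path-based routing described above (requests under `/_/rest/` go to the restbase backend instead of the app servers, with cookies dropped from the backend request at first) might look like this in outline. It is written in Python purely to show the decision logic; the real rule would live in Varnish VCL, and the backend labels and function name here are assumptions.

```python
from typing import Dict, Tuple

# Prefix taken from the example URL in the discussion
# (https://en.wikipedia.org/_/rest/v1/....)
REST_PREFIX = "/_/rest/"


def route(path: str, headers: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Pick a backend for `path` and return the headers to forward to it.

    "restbase" and "appserver" are illustrative labels, not real backend names.
    """
    if path.startswith(REST_PREFIX):
        # Initially, cookies are not passed on to the REST backend.
        forwarded = {k: v for k, v in headers.items() if k.lower() != "cookie"}
        return "restbase", forwarded
    return "appserver", dict(headers)


if __name__ == "__main__":
    print(route("/_/rest/v1/page", {"Cookie": "session=abc", "Accept": "text/html"}))
    print(route("/wiki/Main_Page", {"Cookie": "session=abc"}))
```

Dropping the cookie keeps the shared-origin path unauthenticated, which matches the later point that private wikis and a save endpoint would need a different arrangement.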
[01:37:41] No, restbase itself having a way to convince a browser to run javascript [01:37:41] (03PS2) 10Dzahn: dumps: ferm service for rsyncd clients using hiera [puppet] - 10https://gerrit.wikimedia.org/r/202971 [01:37:44] bblack: we are sanitizing the typical HTML XSS issues that people could try to sneak in by writing dangerous wikitext [01:37:49] seems like if we're not sanitizing properly, we have issues on the main wiki access regardless. [01:37:51] (03CR) 10jenkins-bot: [V: 04-1] dumps: ferm service for rsyncd clients using hiera [puppet] - 10https://gerrit.wikimedia.org/r/202971 (owner: 10Dzahn) [01:38:08] yeah, also in VE [01:38:59] csteipp: which issue are you mainly worried about? [01:39:39] so, anyways: if you can convince csteipp on security, you could do rest through the main wiki hostname. if you want separate hostname but same IP for SPDY sharing, that's stalled on future certificate changes on ops end, and no timeline yet. [01:39:50] if you want to use ori's suggested hack, there's nothing stopping you now. [01:39:55] (03PS1) 10Yuvipanda: tools: Add role / class for tools manifest services [puppet] - 10https://gerrit.wikimedia.org/r/202978 (https://phabricator.wikimedia.org/T95210) [01:40:09] bblack: yup, thanks [01:40:54] (well the first option is also + wait for ops sometime next week) [01:41:00] I appreciate your creativity and will to get this done *right now* [01:41:07] if you want to do this the right way [01:41:08] gwicke: I'm not sure what you mean by "which"? I'm worried about restbase presenting something that can be used for xss. [01:41:19] wouldn't same IP but separate hostname + dns prefetch directive be the best solution? [01:41:55] I'm not sure which case we are trying to address [01:42:13] there's no benefit to "same IP" until certificate issues are resolved, so no point doing that at this time. 
[01:42:18] VE loads HTML as a string, then parses it [01:42:54] I believe that basically renders domain-based XSS protections moot [01:43:20] gwicke: Right, so VE needs to protect against dom-based xss in how it loads the strings it gets. [01:43:22] apart from that, we'll set CSP & other security headers [01:44:43] So if you're setting CSP: default-src 'none', that will be the same, for modern browsers. Older ones / IE aren't going to get protected. [01:44:50] csteipp: currently it gets the HTML wrapped in JSON, from the main wiki domain [01:45:32] for private wikis, we'll need cookies for authentication [01:45:44] similar for a possible save API end point [01:46:12] (03CR) 10Yuvipanda: [C: 032] tools: Add role / class for tools manifest services [puppet] - 10https://gerrit.wikimedia.org/r/202978 (https://phabricator.wikimedia.org/T95210) (owner: 10Yuvipanda) [01:47:11] (03Abandoned) 10Dzahn: dumps: ferm service for rsyncd clients using hiera [puppet] - 10https://gerrit.wikimedia.org/r/202971 (owner: 10Dzahn) [01:47:12] csteipp: ultimately though we are working towards using the same HTML for page views [01:47:29] which we probably don't want to load from a different domain [01:49:14] viewing for readonly would load HTML from restbase? [01:50:10] a front-end service would [01:50:12] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [01:50:51] ^ was me, accidentally did ‘puppet agent -tv’ instead of ‘puppet merge' [01:50:57] and looks like icinga-wm counts the interrupt as failure [01:50:59] which is fair enough [01:51:03] * YuviPanda runs it again for good measure [01:51:52] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [01:51:54] bblack: the point is more that this HTML will continue to be loaded in the security context of the main wiki domain [01:51:56] gwicke: why not? 
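For the "CSP & other security headers" mentioned above, a minimal sketch of attaching them to a response could look like this. Only `Content-Security-Policy: default-src 'none'` comes from the discussion; the `X-Content-Type-Options` header and the function name are assumptions added for illustration, not something restbase is known to send.

```python
from typing import Dict


def with_security_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `headers` with restrictive security headers added."""
    secured = dict(headers)
    # From the discussion: forbid loading or running anything from this response
    secured["Content-Security-Policy"] = "default-src 'none'"
    # Assumption: also stop browsers from sniffing JSON as HTML
    secured["X-Content-Type-Options"] = "nosniff"
    return secured


if __name__ == "__main__":
    print(with_security_headers({"Content-Type": "application/json"}))
```

As noted in the chat, a policy like this only helps in browsers that enforce CSP, so it does not replace careful handling of the HTML string on the client.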
[01:52:12] (use a different domain) [01:52:32] (03PS1) 10Dzahn: dumps: ferm service for rsyncd clients using hiera [puppet] - 10https://gerrit.wikimedia.org/r/202980 [01:52:48] I believe I have seen some data that supports that most of our user visits see one page only [01:52:57] well, the considerations for restbase performance vs domainname get way bigger if we're using it for readonly views, too. [01:52:58] which means that the load time for that first page is very important [01:53:19] but if it's some "frontend service" that does that part, maybe not the same endpoint as restbase anyways [01:53:27] (03CR) 10jenkins-bot: [V: 04-1] dumps: ferm service for rsyncd clients using hiera [puppet] - 10https://gerrit.wikimedia.org/r/202980 (owner: 10Dzahn) [01:53:40] gwicke: I need to spend some more time looking into how restbase works, but I think I agree with ori, the right answer sounds like having a different domain on the same IP [01:53:52] gwicke: citation needed on that [01:54:24] different domain, period. the same IP part is orthogonal until some random point in the future. [01:54:55] (03PS2) 10Dzahn: dumps: ferm service for rsyncd clients using hiera [puppet] - 10https://gerrit.wikimedia.org/r/202980 [01:55:11] gwicke: So if first page load time is important, I think you want to be combining the html you pull from restbase into the initial call. Not make a separate call at all, right? [01:55:27] yes, that's what we do right now [01:55:36] that's how I would picture it. PHP gets HTML from restbase. [01:55:36] you get one HTML page that has both navigation and content [01:56:11] the single-page-app stuff typically loses on first page load times, but then wins on subsequent browsing [01:56:41] but, you can also combine the two [01:56:44] Right, so running restbase on a different domain isn't a problem. And the PHP code is responsible for making sure the call to restbase and the response are properly formatted. 
[01:57:15] the point is that the content that you are worried about would be in the wiki domain [01:57:31] if you want to speed up first page load time for desktop users, make a minimal variant that doesn't load 10,000 gadgets + centrallogin, etc. More like the simple mobile form. If the user tries to log in or clicks some "give me advanced wikipedia" button, upgrade. [01:57:51] (03CR) 10Dzahn: [C: 032] logstash: lint [puppet] - 10https://gerrit.wikimedia.org/r/202651 (owner: 10Dzahn) [01:58:24] bblack: that too [01:58:27] (which most users will never ever do) [01:58:46] gwicke: I'm worried about how restbase presents user content yes, but there's a huge attack surface beyond that in restbase itself. Every error message, etc, has the potential for xss. [01:59:21] it's all JSON [01:59:41] so depends on how the client chooses to process that [02:00:45] (03PS2) 10Dzahn: labsdns: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/202646 (https://phabricator.wikimedia.org/T93645) [02:00:53] of course, the current API response is JSON too [02:01:04] and we have a huge PHP API on the same domain [02:01:24] (03CR) 10Dzahn: [C: 032] labsdns: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/202646 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [02:02:22] the one you argued should be split up and redone as a separate service? [02:02:29] * gwicke is looking for avg # of page views per session [02:02:31] * ori forgets [02:03:30] ori: I think you know that you are conflating a few things there, but I'm happy to discuss that with you if you'd like to ;) [02:03:37] gwicke: It probably is safe how you're doing it now, but from a risk perspective, you're adding lots of liability by doing it this way. 
[02:03:41] (03PS1) 10Yuvipanda: tools: Make services host a submit host [puppet] - 10https://gerrit.wikimedia.org/r/202981 (https://phabricator.wikimedia.org/T95210) [02:03:55] gwicke, you're wonderful, really, but sometimes i get the feeling that one of your criteria for a good solution is that you must be the one to have come up with it [02:04:06] (03PS2) 10Yuvipanda: tools: Make services host a submit host [puppet] - 10https://gerrit.wikimedia.org/r/202981 (https://phabricator.wikimedia.org/T95210) [02:04:14] but like i said, up to you. good night and god bless, as Fiona says. [02:04:51] csteipp: the options are a) maintain the status quo, and b) trade latency for a separate domain [02:05:25] (03CR) 10Yuvipanda: [C: 032] tools: Make services host a submit host [puppet] - 10https://gerrit.wikimedia.org/r/202981 (https://phabricator.wikimedia.org/T95210) (owner: 10Yuvipanda) [02:06:24] (03CR) 10Dzahn: [C: 032] logging: lint [puppet] - 10https://gerrit.wikimedia.org/r/202652 (owner: 10Dzahn) [02:06:37] and/or additional requests [02:06:49] (03PS2) 10Dzahn: logging: lint [puppet] - 10https://gerrit.wikimedia.org/r/202652 [02:07:45] gwicke: Right now yes, but that seems like something we can work with ops on to get prioritized [02:08:46] csteipp: how will authenticated requests work with a separate domain? [02:09:10] (03PS2) 10Dzahn: gerrit: lint fixes in role class [puppet] - 10https://gerrit.wikimedia.org/r/202647 (https://phabricator.wikimedia.org/T93645) [02:11:05] gwicke: That's something that will have to be worked through. You know the options as well as I do :) [02:11:54] csteipp: I mean, do you see an option that lets us have both authentication & XSS protection from separate domain? 
[02:12:11] *a separate domain [02:18:01] (03CR) 10Dzahn: [C: 032] gerrit: lint fixes in role class [puppet] - 10https://gerrit.wikimedia.org/r/202647 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [02:18:16] gwicke: There are certainly ways to do it, and different ways have their own security trade off. Let's talk through the options tomorrow. [02:18:51] ok [02:19:22] * csteipp goes to take care of sick family [02:19:35] the best I have found about session lengths are some graphs on commons [02:19:55] will ask analytics tomorrow [02:21:54] (03PS2) 10Dzahn: dataset: lint [puppet] - 10https://gerrit.wikimedia.org/r/202654 (https://phabricator.wikimedia.org/T93645) [02:22:23] csteipp: good luck with the family, and ttyl! [02:23:24] (03CR) 10Dzahn: [C: 032] dataset: lint [puppet] - 10https://gerrit.wikimedia.org/r/202654 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [02:26:35] (03CR) 10Dzahn: "well, but " we really want to move things into modules sooner than later." is a reason to _keep_ the check not remove it. 
it's there to te" [puppet] - 10https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [02:32:35] !log l10nupdate Synchronized php-1.25wmf24/cache/l10n: (no message) (duration: 06m 19s) [02:32:43] Logged the message, Master [02:37:24] !log LocalisationUpdate completed (1.25wmf24) at 2015-04-09 02:36:20+00:00 [02:37:28] Logged the message, Master [02:48:28] (03PS1) 10Yuvipanda: tools: Make toollabs::services inherit from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/202982 (https://phabricator.wikimedia.org/T95210) [02:48:39] (03CR) 10jenkins-bot: [V: 04-1] tools: Make toollabs::services inherit from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/202982 (https://phabricator.wikimedia.org/T95210) (owner: 10Yuvipanda) [02:48:41] (03PS2) 10Yuvipanda: tools: Make toollabs::services inherit from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/202982 (https://phabricator.wikimedia.org/T95210) [02:49:52] (03CR) 10Yuvipanda: [C: 032] tools: Make toollabs::services inherit from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/202982 (https://phabricator.wikimedia.org/T95210) (owner: 10Yuvipanda) [02:50:47] (03PS3) 10Andrew Bogott: For cert names, use the fqdn instead of the ec2id if use_dnsmasq is lowered. [puppet] - 10https://gerrit.wikimedia.org/r/202924 [02:55:41] !log l10nupdate Synchronized php-1.26wmf1/cache/l10n: (no message) (duration: 04m 40s) [02:55:50] Logged the message, Master [02:56:54] (03CR) 10Andrew Bogott: "Yuvi, I'd appreciate a close read on this since it's a bit risky." 
[puppet] - 10https://gerrit.wikimedia.org/r/202924 (owner: 10Andrew Bogott) [02:59:22] !log LocalisationUpdate completed (1.26wmf1) at 2015-04-09 02:58:19+00:00 [02:59:26] Logged the message, Master [03:15:57] !log changed email address for metawiki:JulieC per request and account verification to allow for merger to global account [03:15:57] anomie: that repo should be operations/software/nova_ldap_sink or something [03:15:59] err [03:16:03] sorry anomie, I meant andrewbogott [03:16:04] Logged the message, Master [03:16:21] Yeah, I’m following a bad pattern established years ago. [03:16:28] I’m not sure I know how to move things :( [03:16:30] jamesofur: wanted to tell you, did you know IP editing is enabled on mobile now? :) [03:16:39] I had heard! [03:16:47] no explosions yet? [03:16:49] andrewbogott: on the gerrit side of things? 1. create new project, 2. enable push, 3. push, 4. tada! [03:17:00] jamesofur: spike in edits, drop in account creations :) [03:17:05] YuviPanda: to rename? [03:17:07] which is expected, I guess [03:17:15] andrewbogott: yeah, I don’t think you can ‘rename’ as such. just delete old one and create new one [03:17:43] YuviPanda: how big of a spike? Any analysis on how many were reverted (I imagine not yet) [03:18:02] jamesofur: http://mobile-reportcard.wmflabs.org/#edits_daily-graphs-tab [03:18:12] jamesofur: not that big a spike, I guess. 6k average to about 9-10k average [03:19:39] interesting [03:19:44] do we know what the error edits are? [03:19:53] that spike is almost as high as the edit spike [03:20:15] jamesofur: yeah, I bet it’s edit protected + abusefilter [03:20:20] jamesofur: we can find out, yeah - it’s all EventLogged [03:20:30] jamesofur: halfak is running queries about their quality / revertedness, I think. 
[03:20:36] jamesofur: at least will be once there’s more data [03:20:43] great, will be interesting to see how that turns out [03:21:06] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1192995 (10GWicke) [03:21:14] jamesofur: totally. [03:21:22] jamesofur: I think there was lots of ‘emoticon vandalism’ for instance [03:21:26] that abusefilter didn’t catch [03:21:31] (poop emoticon instead of ‘poop’) [03:21:49] ahhh, interesting, wouldn't have even thought about that. [03:21:56] I wonder if they came from Verizon ......... [03:21:59] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1183654 (10GWicke) [03:22:15] jamesofur: ? :) [03:22:19] I wouldn't have even thought about that, though makes sense. [03:22:21] jamesofur: we can answer most questions with eventlogging, though. [03:22:23] heh yeah [03:23:21] There is a vandal who does mostly poop vandalism, mostly from his phone, with lots of socks. (at one point he was even calling the office multiple times bitching that his sock got blocked because he was the king of the poop vandals). [03:23:26] hence why I wondered ;) [03:23:48] difference between a "long term vandal using a new attack vector" and "new vandal messing around" [03:24:00] I'm praying this person was under 15 [03:24:01] but that's a harder analysis to do, will be interesting to see how it turns out. [03:24:03] hahahaha :D [03:24:06] but I guess probably not [03:24:07] yeah [03:24:21] I think kaldari requested for an abusefilter disallowing emoticon unicode ranges for anons? 
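The idea raised above, a filter that disallows emoticon Unicode ranges for anonymous edits, comes down to a code-point range check. The sketch below shows that check in Python; it is not AbuseFilter rule syntax, the function name is hypothetical, and the choice of which Unicode blocks to cover is an assumption (a production filter would likely need a broader or differently tuned list).

```python
# Unicode blocks commonly used for emoji; the selection is an assumption.
EMOJI_RANGES = [
    (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs (includes U+1F4A9)
    (0x1F600, 0x1F64F),  # Emoticons
    (0x1F680, 0x1F6FF),  # Transport and Map Symbols
    (0x2600, 0x26FF),    # Miscellaneous Symbols
    (0x2700, 0x27BF),    # Dingbats
]


def contains_emoji(text: str) -> bool:
    """Return True if any character in `text` falls inside a listed range."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in EMOJI_RANGES)


if __name__ == "__main__":
    print(contains_emoji("added text \U0001F4A9"))  # pile-of-poo emoji
    print(contains_emoji("poop"))                   # plain word, no emoji
```

A word-based filter would catch the second input but miss the first, which is the gap described in the chat.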
[03:24:31] That would make sense [03:24:37] yeah [03:29:50] 6operations: Requesting stat1003 access for mholloway - https://phabricator.wikimedia.org/T95506#1193000 (10Tfinc) Approved [03:30:01] 10Ops-Access-Requests, 6operations: Requesting stat1003 access for mholloway - https://phabricator.wikimedia.org/T95506#1193001 (10Andrew) p:5Triage>3High [03:31:19] 10Ops-Access-Requests, 6operations: Requesting stat1003 access for mholloway - https://phabricator.wikimedia.org/T95506#1193007 (10yuvipanda) @tfinc thanks! @mholloway can you upload a ssh key either to your officewiki user page (ideally?) or to phabricator.wikimedia.org/paste? [03:31:51] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Puppet has 1 failures [03:45:32] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [03:48:36] 10Ops-Access-Requests, 6operations: Requesting stat1003 access for mholloway - https://phabricator.wikimedia.org/T95506#1193023 (10Mholloway) My SSH key is on my Wikitech user page (https://wikitech.wikimedia.org/wiki/User:Mholloway); alternatively, see here: https://phabricator.wikimedia.org/P494 [03:49:02] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: puppet fail [04:04:30] (03CR) 10Yuvipanda: "(like service.manifest)" [puppet] - 10https://gerrit.wikimedia.org/r/201991 (https://phabricator.wikimedia.org/T91954) (owner: 10Tim Landscheidt) [04:06:12] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [04:19:12] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: puppet fail [04:27:18] !log xtrabackup clone db2035 to db2041 [04:27:23] Logged the message, Master [04:31:23] !log dbstore1001 s2 delayed replication paused, T95426 [04:31:26] Logged the message, Master [04:34:41] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [05:43:41] !log LocalisationUpdate 
ResourceLoader cache refresh completed at Thu Apr 9 05:42:38 UTC 2015 (duration 42m 37s) [05:43:50] Logged the message, Master [06:30:51] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [06:33:12] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:42] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:02] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:31] PROBLEM - puppet last run on mw2123 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:22] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:24] (03PS5) 10Giuseppe Lavagetto: standard: include admin wherever needed [puppet] - 10https://gerrit.wikimedia.org/r/202407 (https://phabricator.wikimedia.org/T86774) [06:40:13] 7Puppet, 10Tool-Labs: Develop and publish a gridengine provider for Puppet - https://phabricator.wikimedia.org/T95525#1193114 (10scfc) 3NEW [06:40:27] <_joe_> mmmh [06:40:36] <_joe_> are you guys sure that's a good idea? 
[06:40:39] <_joe_> ^^ [06:47:01] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:47:51] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:02] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:12] RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:12] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:11] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:54:43] 6operations, 10Wikimedia-Site-requests: Run "refreshLinks.php --dfn-only" on all wikis periodically - https://phabricator.wikimedia.org/T18112#1193141 (10Nemo_bis) [06:55:43] 6operations, 10Wikimedia-Site-requests: Run "refreshLinks.php --dfn-only" on all wikis periodically - https://phabricator.wikimedia.org/T18112#200641 (10Nemo_bis) >>! In T18112#946037, @JohnLewis wrote: > Seems to be done. But not for en.wiki/s1. [07:00:22] (03CR) 10Giuseppe Lavagetto: "I verified this patch with the catalog compiler, the only hosts with changes are:" [puppet] - 10https://gerrit.wikimedia.org/r/202407 (https://phabricator.wikimedia.org/T86774) (owner: 10Giuseppe Lavagetto) [07:04:20] (03CR) 10Muehlenhoff: [C: 04-1] "The selection of ciphers is fine for trusty and jessie. (It also matches the current upstream defaults compared to OpenSSH 6.7 and 6.8)" [puppet] - 10https://gerrit.wikimedia.org/r/185329 (owner: 10Dzahn) [07:06:45] (03CR) 10Muehlenhoff: "(My comment from 9:04 was for a different change, please disregard it. 
Here's the status for the MAC change:)" [puppet] - 10https://gerrit.wikimedia.org/r/185329 (owner: 10Dzahn) [07:08:12] (03CR) 10Muehlenhoff: [C: 04-1] "The selection of ciphers is fine for trusty and jessie. (It also matches the current upstream defaults compared to OpenSSH 6.7 and 6.8)" [puppet] - 10https://gerrit.wikimedia.org/r/185325 (owner: 10Dzahn) [07:10:27] (03CR) 10Muehlenhoff: [C: 04-1] "Likewise to the other two changes precise doesn't support curve25519-sha256@libssh.org, which makes sshd fail to start:" [puppet] - 10https://gerrit.wikimedia.org/r/185321 (owner: 10Dzahn) [07:16:45] 6operations, 10Datasets-Archiving, 10Datasets-General-or-Unknown: Image tarball dumps are not being generated - https://phabricator.wikimedia.org/T53001#1193180 (10Nemo_bis) 5Open>3stalled [07:23:13] (03PS3) 10Giuseppe Lavagetto: Add hiera lookup tool [puppet] - 10https://gerrit.wikimedia.org/r/175153 (owner: 10Ori.livneh) [07:33:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The tool would be very useful, a few caveats before it can get through:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/175153 (owner: 10Ori.livneh) [07:34:15] (03Abandoned) 10Giuseppe Lavagetto: proxies: allow filtering by datacenter [tools/scap] - 10https://gerrit.wikimedia.org/r/200130 (owner: 10Giuseppe Lavagetto) [07:39:02] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Puppet has 1 failures [07:51:27] (03CR) 10Nemo bis: "Do we know how big this table is?" 
[dumps] (ariel) - 10https://gerrit.wikimedia.org/r/200313 (owner: 10Kelson) [07:54:32] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:55:32] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Puppet has 1 failures [08:12:41] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:15:56] 6operations, 10MediaWiki-extensions-Graph, 6Services, 10service-template-node, 7service-runner: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1193371 (10mobrovac) [08:22:59] (03CR) 10Nemo bis: "> Would need geo_tags table to be created everywhere before deploying" [dumps] - 10https://gerrit.wikimedia.org/r/155080 (https://bugzilla.wikimedia.org/51225) (owner: 10MaxSem) [08:32:38] (03CR) 10MaxSem: "Yes, this has already happened." [dumps] - 10https://gerrit.wikimedia.org/r/155080 (https://bugzilla.wikimedia.org/51225) (owner: 10MaxSem) [08:48:54] akosiaris: hello [08:49:22] akosiaris: remember when i asked about deploying new services to sca and you said "not before april" ? [08:49:32] akosiaris: knock knock, it's april :) [08:49:50] mobrovac: lol, quite true [08:50:21] so, how many are in the queue ? [08:50:30] 2 [08:50:47] lemme find the tickets [08:51:22] akosiaris: https://phabricator.wikimedia.org/T92627 and https://phabricator.wikimedia.org/T90487 [08:52:35] let's work out a plan for that [08:56:45] <_joe_> can we start working on unifying the puppet code for different services? [08:57:40] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) 
- https://phabricator.wikimedia.org/T91347#1193451 (10Tgr) [08:57:50] _joe_: that'd be my idea too [08:57:57] at least for the SCA case [08:58:16] i was thinking of having a sca::service module which would set up most things [08:58:50] and maybe if we're smart enough, we can generalise it beyond sca [08:59:08] (that should be a bold maybe though) [08:59:32] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1080600 (10Tgr) What is the status of this bug? Is any further work planned, and if so, is there a rough schedule... [09:01:36] sounds like a plan to me! [09:02:41] kk, i'll create a ticket for that and hopefully start working on it [09:02:42] so, seems like that service-mobileapp-node (is it me or is it badly named ?) could be first and used as template for the puppet code unification [09:03:16] what's the state of graphoid ? The ticket is kind of vague [09:03:20] akosiaris: it's mw/services/mobileapps now [09:03:51] graphoid is near prod-ready, code && security review passed ok [09:04:08] i'm tweaking some minor prod log issues right now [09:04:41] citoid graphoid and mobileapps could all benefit from the unified puppet module [09:04:46] they all use the service template [09:04:51] yup [09:05:12] caveats: graphoid needs extra pkgs, citoid needs zotero (grrR) [09:05:27] but that's great because we plan for these cases ahead of time [09:06:32] I am a bit unclear on what restbase has to do with mediawiki-services-mobileapps [09:06:46] this will be phase 2 [09:07:17] so what that service does is that it gathers data from various sources and puts together an html suitable for android/ios [09:07:42] thus the idea is that eventually it could sit behind restbase so that rb could store these htmls [09:08:29] why behind ? and not in front ? [09:08:52] restbase would still be storing the htmls, wouldn't it ? 
[09:09:21] yup [09:09:22] so [09:09:31] client -> restbase -> mobileapps [09:10:01] scenario: the client asks for rev X, restbase proxies that to mobileapps which compiles the html [09:10:11] restbase stores it and gets it back to the client [09:10:25] when client #2 comes asking for the same stuff, restbase serves it from storage [09:11:02] (since data manipulation in mobileapps is deterministic) [09:12:07] so a cache. In general we got varnish for that. I 'd rather we got a client -> mobileapps -> restbase ? fetch/send : compile/store/send scenario [09:12:20] note the tertiary operator used there :P [09:12:52] resembles the memcached model more and decouples mobileapps from restbase [09:13:03] so if restbase suffers, mobileapps can still be functional [09:13:47] and you avoid having to support mobileapps specifically in restbase [09:14:13] right [09:14:53] hm, maybe use varnish in front for now? it's load should be reduced now that most reqs go to restbase which uses uncached parsoid routes [09:15:49] the parsoid varnishes ? hmm I think you are right, the caching part of it should be unused these days [09:16:22] they are already reused for all the other services btw, just not in a caching model [09:17:18] we could enable caching for mediawiki-services-mobileapps (badly named, I am way more certain now :P) [09:17:50] :) [09:18:03] (03CR) 10Krinkle: "Note to future self: This doesn't work yet and needs the package to be installed manually from the .deb file for the puppet-run to continu" [puppet] - 10https://gerrit.wikimedia.org/r/202714 (https://phabricator.wikimedia.org/T48552) (owner: 10Hashar) [09:18:07] there are a couple of users btw of parsoid-lb [09:18:19] cxserver/iegreview [09:18:30] yep, we know, that's why we haven't killed MW update jobs to it yet [09:18:56] I suppose there exists a plan to migrate them to restbase ? [09:19:59] yes, the uncertainty is when it is going to be enacted [09:21:21] no schedule then. 
OK [09:21:49] let's just keep in mind that it might complicate caching for mobileapps [09:24:09] * akosiaris sigh hates how keystone tokens expire before wikitech cookies [09:27:27] I don't really follow most of the above, and I'm not sticking around to discuss, but: it would be best to avoid new code/APIs using or putting anything behind the parsoid varnish layer. Really, we shouldn't have stuffed those other newer things behind it either. [09:27:41] the sooner parsoid-varnish can die, the better [09:27:51] PROBLEM - puppet last run on mc1002 is CRITICAL: CRITICAL: Puppet has 1 failures [09:28:08] agreed [09:28:12] but what do you want to do instead? [09:28:31] anything remotely sane! [09:28:43] bblack: i think that once there are no more users of cached parsoid routes, we can do that [09:28:44] we need to integrate varnish with a service discovery system this quarter [09:29:04] +1 for SD [09:29:08] ^ I assume that means restbase? [09:29:26] mark: is that a goal ? [09:29:34] that's very much part of _the_ goal yes [09:29:38] as well as several other things [09:30:17] anyways, like I said, I'm not here. I just stopped by to check on graphs and noticed :) [09:30:36] bye ;) [09:31:26] <_joe_> mark: should we start filing phab task for _the_ goal? And if so, does it have a project in phabricator? 
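
A note on the two request flows discussed between 09:09 and 09:13 above: `client -> restbase -> mobileapps` puts the store in the request path, while `client -> mobileapps -> restbase ? fetch/send : compile/store/send` has the service consult the store and fall back to compiling. The minimal sketch below only illustrates that difference under stated assumptions. `Store`, `StoreUnavailable`, `compile_html`, `read_through` and `cache_aside` are hypothetical names made up for this example, not RESTBase or mobileapps APIs, and the in-memory dict merely stands in for real storage.

```python
class StoreUnavailable(Exception):
    """Raised by the hypothetical store when it cannot serve requests."""


class Store:
    """In-memory stand-in for the storage layer (RESTBase in the discussion)."""

    def __init__(self):
        self._data = {}
        self.available = True  # flip to False to simulate an outage

    def get(self, key):
        if not self.available:
            raise StoreUnavailable(key)
        return self._data.get(key)

    def put(self, key, value):
        if not self.available:
            raise StoreUnavailable(key)
        self._data[key] = value


def compile_html(title, rev):
    """Stand-in for the deterministic HTML compilation step (assumption)."""
    return f"<html>{title}@{rev}</html>"


def read_through(store, title, rev):
    """client -> store -> service: every request depends on the store."""
    key = (title, rev)
    html = store.get(key)  # raises StoreUnavailable during an outage
    if html is None:
        html = compile_html(title, rev)
        store.put(key, html)
    return html


def cache_aside(store, title, rev):
    """client -> service -> store ? fetch/send : compile/store/send."""
    key = (title, rev)
    try:
        html = store.get(key)
    except StoreUnavailable:
        # Store is down: still answer, just without caching the result.
        return compile_html(title, rev)
    if html is None:
        html = compile_html(title, rev)
        try:
            store.put(key, html)
        except StoreUnavailable:
            pass  # storing is best-effort in this sketch
    return html
```

In this sketch, `cache_aside` keeps answering when `store.available` is `False`, which is the decoupling argument made at 09:13:03. `read_through` fails in that case, but the service needs no knowledge of the store.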
[09:31:34] yes please [09:31:38] * mobrovac senses a plan is about to form here [09:32:02] s/form/crystallise/ [09:32:07] andrewbogott_afk: ping me when you are around [09:36:19] I 'll just start merging https://gerrit.wikimedia.org/r/#/c/202731/ and the prereqs [09:36:31] catalogcompiler through the fleet said OK [09:36:45] and we even got jessie honoring our AuthorizedKeysFile as a bonus [09:37:42] (03CR) 10Giuseppe Lavagetto: [C: 031] ssh::userkey: Allow a prefix to be specified for a key [puppet] - 10https://gerrit.wikimedia.org/r/202731 (owner: 10Alexandros Kosiaris) [09:37:56] 6operations, 6Services, 10service-template-node: Unify SCA Service Puppet Modules / Roles - https://phabricator.wikimedia.org/T95533#1193549 (10mobrovac) 3NEW a:3mobrovac [09:38:12] there we go ^^ [09:38:49] 6operations, 6Services, 10service-template-node: Unify SCA Service Puppet Modules / Roles - https://phabricator.wikimedia.org/T95533#1193559 (10mobrovac) [09:38:53] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1193558 (10mobrovac) [09:39:57] (03PS3) 10Alexandros Kosiaris: ssh::userkey: Allow a prefix to be specified for a key [puppet] - 10https://gerrit.wikimedia.org/r/202731 [09:39:59] (03PS3) 10Alexandros Kosiaris: Specify ssh userkey policy for ganeti clusters [puppet] - 10https://gerrit.wikimedia.org/r/202730 [09:40:01] (03PS5) 10Alexandros Kosiaris: ssh: remove lucid special casing for authorized_keys_file [puppet] - 10https://gerrit.wikimedia.org/r/202394 [09:40:03] (03PS5) 10Alexandros Kosiaris: ssh: allow parameterization of authorized_keys [puppet] - 10https://gerrit.wikimedia.org/r/202392 [09:40:05] (03PS5) 10Alexandros Kosiaris: sodium: specify the position of authorized_keys_file [puppet] - 10https://gerrit.wikimedia.org/r/202393 [09:41:16] 6operations, 6Services, 10service-template-node: Unify SCA Service Puppet Modules / Roles - 
https://phabricator.wikimedia.org/T95533#1193562 (10mobrovac) [09:41:18] 6operations, 10MediaWiki-extensions-Graph, 6Services, 10service-template-node, 7service-runner: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1193561 (10mobrovac) [09:41:35] 6operations, 6Services, 10service-template-node: Unify SCA Service Puppet Modules / Roles - https://phabricator.wikimedia.org/T95533#1193549 (10mobrovac) [09:41:38] 6operations, 6Mobile-Apps, 6Services: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1193563 (10mobrovac) [09:45:11] RECOVERY - puppet last run on mc1002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [09:59:05] 6operations, 6Labs: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1193580 (10Krenair) 3NEW [09:59:39] 6operations, 6Labs: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1193587 (10Krenair) Maybe a cron? [10:05:00] (03PS2) 10Alexandros Kosiaris: Allow hiera role_backend to be debuggable via hiera CLI [puppet] - 10https://gerrit.wikimedia.org/r/202756 [10:06:49] (03PS3) 10Tim Landscheidt: Tools: Make list of proxies for portgrabber configurable [puppet] - 10https://gerrit.wikimedia.org/r/201991 (https://phabricator.wikimedia.org/T91954) [10:08:26] godog: statsd switch is about to happen? 
[10:10:35] mobrovac: it is, getting ready now [10:10:58] k, don't forget to +2 https://gerrit.wikimedia.org/r/#/c/199952/ [10:11:09] mobrovac: yep, I'll give you an heads up too [10:11:20] cheers [10:12:26] (03CR) 10Hashar: "Self fixing my comments on PS18" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/194471 (owner: 10Alexandros Kosiaris) [10:12:42] (03PS20) 10Hashar: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 (owner: 10Alexandros Kosiaris) [10:13:54] (03CR) 10Hashar: [C: 031] "PS20:" [puppet] - 10https://gerrit.wikimedia.org/r/194471 (owner: 10Alexandros Kosiaris) [10:17:28] (03PS5) 10Filippo Giunchedi: statsite: new module [puppet] - 10https://gerrit.wikimedia.org/r/199599 (https://phabricator.wikimedia.org/T90111) [10:17:30] (03PS3) 10Filippo Giunchedi: statsite: replace ::txstatsd class and role calls [puppet] - 10https://gerrit.wikimedia.org/r/202701 (https://phabricator.wikimedia.org/T90111) [10:17:32] (03PS4) 10Filippo Giunchedi: statsdlb: replace txstatsd with statsite [puppet] - 10https://gerrit.wikimedia.org/r/199600 (https://phabricator.wikimedia.org/T90111) [10:17:47] (03PS21) 10Hashar: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 (owner: 10Alexandros Kosiaris) [10:18:01] (03PS2) 10Hashar: role::package::builder uses package_builder module [puppet] - 10https://gerrit.wikimedia.org/r/200525 (owner: 10Alexandros Kosiaris) [10:18:45] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsite: new module [puppet] - 10https://gerrit.wikimedia.org/r/199599 (https://phabricator.wikimedia.org/T90111) (owner: 10Filippo Giunchedi) [10:19:59] (03CR) 10Tim Landscheidt: "Tested on Toolsbeta with one and two proxies." 
[puppet] - 10https://gerrit.wikimedia.org/r/201991 (https://phabricator.wikimedia.org/T91954) (owner: 10Tim Landscheidt) [10:21:54] !log begin replacing txstatsd with statsite, stop graphite to rename metrics [10:22:00] Logged the message, Master [10:22:03] (03CR) 10Hashar: [C: 04-1] "Lovely!" [puppet] - 10https://gerrit.wikimedia.org/r/201882 (owner: 10Dzahn) [10:22:07] mobrovac: I'll merge that change later btw [10:22:24] I'm sure there will be alerts coming up [10:23:02] ok, no pb [10:23:27] (03CR) 10Hashar: [C: 031] beta: lint [puppet] - 10https://gerrit.wikimedia.org/r/202655 (owner: 10Dzahn) [10:27:30] (03Abandoned) 10Hashar: contint: keep 180 min of puppet reports [puppet] - 10https://gerrit.wikimedia.org/r/193825 (https://phabricator.wikimedia.org/T87484) (owner: 10Hashar) [10:28:03] (03PS2) 10Hashar: contint: Jessie does not have openjdk-6-jdk [puppet] - 10https://gerrit.wikimedia.org/r/201701 (https://phabricator.wikimedia.org/T94999) [10:28:12] (03PS2) 10Hashar: contint: update browsers package names for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/201711 (https://phabricator.wikimedia.org/T95000) [10:29:01] PROBLEM - Graphite Carbon on graphite1001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [10:29:32] PROBLEM - txstatsd backend instances on graphite1001 is CRITICAL: CRITICAL: Not all configured txstatsd instances are running. 
[10:29:42] PROBLEM - statsdlb process on graphite1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name statsdlb [10:39:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsdlb: replace txstatsd with statsite [puppet] - 10https://gerrit.wikimedia.org/r/199600 (https://phabricator.wikimedia.org/T90111) (owner: 10Filippo Giunchedi) [10:41:32] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: puppet fail [10:42:38] (03PS1) 10Filippo Giunchedi: statsdlb: s/decomission/decommission/ [puppet] - 10https://gerrit.wikimedia.org/r/203023 [10:42:48] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsdlb: s/decomission/decommission/ [puppet] - 10https://gerrit.wikimedia.org/r/203023 (owner: 10Filippo Giunchedi) [10:43:42] RECOVERY - statsdlb process on graphite1001 is OK: PROCS OK: 1 process with command name statsdlb [10:44:42] RECOVERY - Graphite Carbon on graphite1001 is OK: OK: All defined Carbon jobs are runnning. [10:45:01] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [10:46:35] !log txstatsd replaced on graphite1001, replacing other clients [10:46:40] Logged the message, Master [10:47:06] (03PS3) 10Giuseppe Lavagetto: contint: Jessie does not have openjdk-6-jdk [puppet] - 10https://gerrit.wikimedia.org/r/201701 (https://phabricator.wikimedia.org/T94999) (owner: 10Hashar) [10:49:01] (03CR) 10Giuseppe Lavagetto: [C: 032] contint: Jessie does not have openjdk-6-jdk [puppet] - 10https://gerrit.wikimedia.org/r/201701 (https://phabricator.wikimedia.org/T94999) (owner: 10Hashar) [10:49:22] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 2.50% of data above the critical threshold [1000.0] [10:49:33] ^ expected [10:53:44] (03PS1) 10Filippo Giunchedi: statsite: s/8127/8128/ [puppet] - 10https://gerrit.wikimedia.org/r/203024 [10:54:00] (03PS2) 10Filippo Giunchedi: statsite: s/8127/8128/ [puppet] - 10https://gerrit.wikimedia.org/r/203024 [10:54:02] 
PROBLEM - statsdlb process on graphite1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name statsdlb [10:54:05] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsite: s/8127/8128/ [puppet] - 10https://gerrit.wikimedia.org/r/203024 (owner: 10Filippo Giunchedi) [10:55:42] RECOVERY - statsdlb process on graphite1001 is OK: PROCS OK: 1 process with command name statsdlb [11:00:24] (03CR) 10Hashar: [C: 031] role::package::builder uses package_builder module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/200525 (owner: 10Alexandros Kosiaris) [11:00:41] (03PS4) 10Filippo Giunchedi: statsite: replace ::txstatsd class and role calls [puppet] - 10https://gerrit.wikimedia.org/r/202701 (https://phabricator.wikimedia.org/T90111) [11:01:33] (03PS5) 10Filippo Giunchedi: statsite: replace ::txstatsd class and role calls [puppet] - 10https://gerrit.wikimedia.org/r/202701 (https://phabricator.wikimedia.org/T90111) [11:01:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsite: replace ::txstatsd class and role calls [puppet] - 10https://gerrit.wikimedia.org/r/202701 (https://phabricator.wikimedia.org/T90111) (owner: 10Filippo Giunchedi) [11:02:27] puppet failures likely ahead [11:02:38] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "please see the comment in the source" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/201711 (https://phabricator.wikimedia.org/T95000) (owner: 10Hashar) [11:03:36] PROBLEM - statsdlb process on graphite1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name statsdlb [11:06:24] (03CR) 10Hashar: [C: 04-1] "Can be made nicer. Will verify whether latest is actually required." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/201711 (https://phabricator.wikimedia.org/T95000) (owner: 10Hashar) [11:06:46] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures [11:07:03] sigh, that's me, checking [11:07:15] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: Puppet has 2 failures [11:07:26] PROBLEM - puppet last run on amssq51 is CRITICAL: CRITICAL: Puppet has 1 failures [11:07:35] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: Puppet has 2 failures [11:07:35] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: Puppet has 2 failures [11:07:46] PROBLEM - puppet last run on amssq53 is CRITICAL: CRITICAL: Puppet has 1 failures [11:08:35] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: Puppet has 2 failures [11:08:45] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Puppet has 1 failures [11:08:45] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures [11:09:05] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Puppet has 2 failures [11:09:15] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: Puppet has 2 failures [11:09:25] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 1 failures [11:09:26] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [11:09:26] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 1 failures [11:09:26] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Puppet has 1 failures [11:09:26] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: Puppet has 1 failures [11:09:48] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [11:09:50] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 1 failures [11:09:51] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 2 failures [11:09:51] PROBLEM - puppet last run on amssq35 is CRITICAL: 
CRITICAL: Puppet has 1 failures [11:10:04] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [11:10:06] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:10:07] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Puppet has 1 failures [11:10:15] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: Puppet has 2 failures [11:10:25] PROBLEM - puppet last run on amssq46 is CRITICAL: CRITICAL: Puppet has 1 failures [11:10:26] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [11:10:36] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: Puppet has 1 failures [11:10:36] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: Puppet has 1 failures [11:10:36] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [11:11:15] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [11:11:16] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 2 failures [11:11:16] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [11:11:56] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Puppet has 1 failures [11:11:56] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: Puppet has 2 failures [11:11:56] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 1 failures [11:12:06] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 1 failures [11:12:15] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Puppet has 1 failures [11:12:16] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: Puppet has 1 failures [11:12:17] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [11:12:17] PROBLEM - puppet last run on amssq40 is CRITICAL: CRITICAL: Puppet has 1 failures [11:12:17] PROBLEM - puppet last run on amssq34 is CRITICAL: CRITICAL: Puppet has 1 failures [11:12:26] 
PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures [11:12:26] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: Puppet has 2 failures [11:12:27] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 1 failures [11:12:45] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 1 failures [11:12:46] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [11:12:55] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:12:56] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: Puppet has 1 failures [11:12:56] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures [11:13:06] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Puppet has 2 failures [11:13:07] the cache boxes is statsite assuming upstart, sadly [11:13:13] looking now [11:13:26] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 1 failures [11:13:27] PROBLEM - puppet last run on amssq62 is CRITICAL: CRITICAL: Puppet has 1 failures [11:13:27] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [11:13:37] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 2 failures [11:13:45] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet has 1 failures [11:13:47] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Puppet has 1 failures [11:13:55] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CRITICAL: Puppet has 2 failures [11:13:55] PROBLEM - puppet last run on ms-be1009 is CRITICAL: CRITICAL: Puppet has 2 failures [11:13:55] (03PS1) 10KartikMistry: Beta: Add 'simple' in source and target [puppet] - 10https://gerrit.wikimedia.org/r/203027 (https://phabricator.wikimedia.org/T95538) [11:14:06] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 1 failures [11:14:06] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet has 
1 failures [11:14:16] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: Puppet has 2 failures [11:14:17] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 2 failures [11:14:27] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: Puppet has 2 failures [11:14:27] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Puppet has 2 failures [11:14:35] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures [11:14:57] PROBLEM - puppet last run on amssq41 is CRITICAL: CRITICAL: Puppet has 1 failures [11:15:05] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: Puppet has 2 failures [11:15:16] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Puppet has 1 failures [11:15:16] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Puppet has 2 failures [11:15:16] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 2 failures [11:15:37] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Puppet has 1 failures [11:15:46] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Puppet has 1 failures [11:15:46] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: Puppet has 1 failures [11:16:05] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Puppet has 2 failures [11:16:26] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 1 failures [11:16:26] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Puppet has 2 failures [11:17:16] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: Puppet has 1 failures [11:17:35] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Puppet has 1 failures [11:17:35] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 1 failures [11:17:46] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 1 failures [11:17:55] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures [11:17:55] PROBLEM - puppet 
last run on cp1052 is CRITICAL: CRITICAL: Puppet has 1 failures [11:18:05] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Puppet has 1 failures [11:18:05] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: Puppet has 2 failures [11:18:06] PROBLEM - puppet last run on amssq58 is CRITICAL: CRITICAL: Puppet has 1 failures [11:18:16] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures [11:18:26] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [11:18:36] PROBLEM - puppet last run on ms-be1011 is CRITICAL: CRITICAL: Puppet has 2 failures [11:18:56] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:06] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:06] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:06] PROBLEM - puppet last run on amssq31 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:25] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:26] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:35] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:36] PROBLEM - puppet last run on amssq37 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:45] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:46] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:56] PROBLEM - puppet last run on amssq39 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:56] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Puppet has 2 failures [11:19:56] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:56] PROBLEM - puppet last run on amssq57 is CRITICAL: CRITICAL: Puppet has 1 failures [11:19:56] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: 
Puppet has 2 failures [11:20:05] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Puppet has 1 failures [11:20:48] what's going on? [11:21:52] that's me because of statsite failing on debian/cache boxes, fixing it now [11:22:12] the irony :) [11:22:22] also ms-fe/ms-be though? [11:22:30] (also the irony :) [11:22:46] yeah lots of irony involved, ms-fe/ms-be because of precise [11:26:25] (03PS1) 10Filippo Giunchedi: statsite: support debian systems [puppet] - 10https://gerrit.wikimedia.org/r/203028 [11:26:30] (you can mute a user with +b ~q:icinga-wm , so you don' need to kick out completly :-P ....) [11:26:43] review of shame ^ [11:27:13] are we running statsite on every single box now? [11:27:18] (03PS4) 10Giuseppe Lavagetto: Add hiera lookup tool [puppet] - 10https://gerrit.wikimedia.org/r/175153 (owner: 10Ori.livneh) [11:27:48] paravoid: no we are not, txstatsd was running on cache boxes tho [11:27:55] why? [11:28:35] because at the time if all cache boxes were sending traffic to the central txstatsd it'd die [11:28:56] which traffic? [11:29:03] statsd traffic [11:29:08] do we even use statsd for anything varnish-related? [11:29:16] we do, varnishkafka does [11:29:21] ah [11:29:31] it does? [11:29:41] it didn't use to, we had some log-parsing hacks [11:30:20] yeah it does, happened around december IIRC [11:30:25] (03PS4) 10Alexandros Kosiaris: ssh::userkey: Allow a prefix to be specified for a key [puppet] - 10https://gerrit.wikimedia.org/r/202731 [11:30:27] (03PS4) 10Alexandros Kosiaris: Specify ssh userkey policy for ganeti clusters [puppet] - 10https://gerrit.wikimedia.org/r/202730 [11:30:29] # Test using logster to send varnishkafka stats to statsd -> graphite. 
[11:30:29] (03PS6) 10Alexandros Kosiaris: ssh: remove lucid special casing for authorized_keys_file [puppet] - 10https://gerrit.wikimedia.org/r/202394 [11:30:31] (03PS6) 10Alexandros Kosiaris: ssh: allow parameterization of authorized_keys [puppet] - 10https://gerrit.wikimedia.org/r/202392 [11:30:32] yeah [11:30:33] (03PS6) 10Alexandros Kosiaris: sodium: specify the position of authorized_keys_file [puppet] - 10https://gerrit.wikimedia.org/r/202393 [11:31:31] paravoid: does that above look reasonable? an hack now of course, I don't think we'd need to run statsite now on the cache boxes so there will be cleanup [11:31:46] (03CR) 10Giuseppe Lavagetto: "I added a new patchset that does what follows:" [puppet] - 10https://gerrit.wikimedia.org/r/175153 (owner: 10Ori.livneh) [11:31:53] akosiaris: hold that, I have some a comment or two [11:32:14] godog: I'm not very familiar with the rest [11:32:26] no need for provider => systemd, though [11:32:27] that's wrong [11:33:03] it'll autodetect and dtrt ? 
[11:33:11] yes [11:33:23] also statsite::instance won't be able to be defined multiple times on debian as I see it [11:33:37] File['/etc/statsite.ini'] will clash [11:33:41] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "While the patch is formally correct, I prefer the approach ori has taken here:" [puppet] - 10https://gerrit.wikimedia.org/r/202756 (owner: 10Alexandros Kosiaris) [11:33:49] ah that's true [11:34:14] ok not a problem ATM because it is applied only once except for the machine hosting graphite [11:35:06] (03PS2) 10Filippo Giunchedi: statsite: support debian systems [puppet] - 10https://gerrit.wikimedia.org/r/203028 [11:36:12] (03PS6) 10Giuseppe Lavagetto: standard: include admin wherever needed [puppet] - 10https://gerrit.wikimedia.org/r/202407 (https://phabricator.wikimedia.org/T86774) [11:40:06] <_joe_> godog: I'll take a look at the other failures [11:41:54] _joe_: ms- machines in codfw should be fixed now, ms- machines in eqiad are failing because of precise [11:42:18] <_joe_> ms-machines in codfw have a problem with txstatsd decommissioning [11:42:23] <_joe_> I'm fixing that [11:42:51] yeah I give them a kick, should be recovering [11:43:41] we're running blind, so please fix soon [11:43:45] (03PS1) 10Giuseppe Lavagetto: txstatsd: stop service before trying to remove the user [puppet] - 10https://gerrit.wikimedia.org/r/203029 [11:44:20] paravoid: reqstats are working btw [11:44:38] <_joe_> this should fix the failures for ms-* in codfw, I'll look at ms-* in eqiad if needed [11:45:31] _joe_: I have this patch of shame too https://gerrit.wikimedia.org/r/#/c/203028/2 [11:46:01] <_joe_> godog: uhm on precise the problem is with the debian package? [11:46:13] it is yeah [11:46:46] paravoid: you were saying something about comments ? 
[11:46:53] give me a sec [11:47:04] <_joe_> godog: as long as we have a better fix later, we can go on with this patch [11:47:37] _joe_: yeah we do [11:47:53] <_joe_> godog: uhm, you didn't fix the statsitectl tool though to support systemd [11:47:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsite: support debian systems [puppet] - 10https://gerrit.wikimedia.org/r/203028 (owner: 10Filippo Giunchedi) [11:47:59] <_joe_> right? [11:48:04] <_joe_> so this won't work either [11:48:20] <_joe_> oh no right [11:48:21] <_joe_> it's ok [11:49:56] ok cache boxes should be recovering [11:52:39] (03PS1) 10Filippo Giunchedi: statsite: fix hostname variable [puppet] - 10https://gerrit.wikimedia.org/r/203032 [11:52:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsite: fix hostname variable [puppet] - 10https://gerrit.wikimedia.org/r/203032 (owner: 10Filippo Giunchedi) [11:56:34] (03PS2) 10KartikMistry: CX: Swedish in target, simple in source and sv-da in MT [puppet] - 10https://gerrit.wikimedia.org/r/202341 (https://phabricator.wikimedia.org/T95108) [12:06:44] (03CR) 10Hashar: "Unattended upgrade does not match either package so firefox would need latest as well :)" [puppet] - 10https://gerrit.wikimedia.org/r/201711 (https://phabricator.wikimedia.org/T95000) (owner: 10Hashar) [12:09:45] (03PS4) 10Alexandros Kosiaris: Provision the ssh key added in 3c8c524 [puppet] - 10https://gerrit.wikimedia.org/r/201462 [12:09:47] (03PS1) 10Alexandros Kosiaris: ganeti: Reference correctly the ganeti cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/203035 [12:10:41] (03CR) 10jenkins-bot: [V: 04-1] Provision the ssh key added in 3c8c524 [puppet] - 10https://gerrit.wikimedia.org/r/201462 (owner: 10Alexandros Kosiaris) [12:10:53] (03CR) 10jenkins-bot: [V: 04-1] ganeti: Reference correctly the ganeti cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/203035 (owner: 10Alexandros Kosiaris) [12:14:11] (03PS2) 10Alexandros Kosiaris: ganeti: Reference correctly the ganeti cluster 
nodes [puppet] - 10https://gerrit.wikimedia.org/r/203035 [12:14:13] (03PS5) 10Alexandros Kosiaris: Provision the ssh key added in 3c8c524 [puppet] - 10https://gerrit.wikimedia.org/r/201462 [12:14:31] (03PS3) 10Hashar: contint: update browsers package names for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/201711 (https://phabricator.wikimedia.org/T95000) [12:15:48] (03CR) 10Hashar: contint: update browsers package names for Jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/201711 (https://phabricator.wikimedia.org/T95000) (owner: 10Hashar) [12:17:02] mhh how do I ask icinga-wm to join again? [12:18:59] (03PS22) 10Alexandros Kosiaris: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 [12:19:53] (03CR) 10Alex Monk: [C: 031] "Let's deploy this after 1.26wmf1 goes to wikipedias?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201940 (https://phabricator.wikimedia.org/T94368) (owner: 10Glaisher) [12:20:55] (03CR) 10Alexandros Kosiaris: [C: 032] "Thanks Antoine, you are great. I reworded just a bit one sentence in the README and I am merging now" [puppet] - 10https://gerrit.wikimedia.org/r/194471 (owner: 10Alexandros Kosiaris) [12:21:45] (03CR) 10Glaisher: "Fine by me. I don't have the time to get the i18n swatted." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201940 (https://phabricator.wikimedia.org/T94368) (owner: 10Glaisher) [12:22:02] (03PS3) 10Alexandros Kosiaris: role::package::builder uses package_builder module [puppet] - 10https://gerrit.wikimedia.org/r/200525 [12:22:07] godog, try restarting it? [12:22:15] (03CR) 10Alexandros Kosiaris: [C: 032] role::package::builder uses package_builder module [puppet] - 10https://gerrit.wikimedia.org/r/200525 (owner: 10Alexandros Kosiaris) [12:22:32] I don't really know why paravoid decided to kickban rather than quiet it. 
[12:23:36] !log bounce icinga-wm [12:23:39] (03CR) 10Muehlenhoff: "It doesn't matter in practice since jessie is the first Debian version in use, but for complete correctness this should depend on >= wheez" [puppet] - 10https://gerrit.wikimedia.org/r/201711 (https://phabricator.wikimedia.org/T95000) (owner: 10Hashar) [12:23:43] Logged the message, Master [12:23:59] Krenair: yay it is back, thanks [12:24:05] RECOVERY - puppet last run on ms-be1006 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:24:05] RECOVERY - puppet last run on ms-be1003 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [12:24:21] godog: I restarted the ircecho service on neon [12:24:56] paravoid: ah whoops, me too [12:25:10] (03PS1) 10Filippo Giunchedi: gdash: update dashboards after statsite migration [puppet] - 10https://gerrit.wikimedia.org/r/203038 [12:25:12] (03PS1) 10Filippo Giunchedi: icinga: update checks after statsite migration [puppet] - 10https://gerrit.wikimedia.org/r/203039 [12:25:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] icinga: update checks after statsite migration [puppet] - 10https://gerrit.wikimedia.org/r/203039 (owner: 10Filippo Giunchedi) [12:26:04] (03PS2) 10Filippo Giunchedi: gdash: update dashboards after statsite migration [puppet] - 10https://gerrit.wikimedia.org/r/203038 [12:26:06] RECOVERY - puppet last run on ms-fe1004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:26:13] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: update dashboards after statsite migration [puppet] - 10https://gerrit.wikimedia.org/r/203038 (owner: 10Filippo Giunchedi) [12:26:29] (03PS2) 10Filippo Giunchedi: icinga: update checks after statsite migration [puppet] - 10https://gerrit.wikimedia.org/r/203039 [12:26:35] (03CR) 10Filippo Giunchedi: [V: 032] icinga: update checks after statsite migration [puppet] - 10https://gerrit.wikimedia.org/r/203039 (owner: 10Filippo Giunchedi) [12:27:29] (03PS1) 
10Alexandros Kosiaris: Kill the old unused package-builder manifests [puppet] - 10https://gerrit.wikimedia.org/r/203040 [12:30:05] RECOVERY - puppet last run on ms-be1008 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [12:30:06] RECOVERY - puppet last run on ms-be1012 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [12:30:56] RECOVERY - puppet last run on ms-be1007 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [12:31:16] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [12:32:05] RECOVERY - puppet last run on ms-be1009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:35:06] RECOVERY - puppet last run on ms-be1011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:35:16] (03PS5) 10Alexandros Kosiaris: ssh::userkey: Allow a prefix to be specified for a key [puppet] - 10https://gerrit.wikimedia.org/r/202731 [12:35:18] (03PS5) 10Alexandros Kosiaris: Specify ssh userkey policy for ganeti clusters [puppet] - 10https://gerrit.wikimedia.org/r/202730 [12:35:20] (03PS7) 10Alexandros Kosiaris: ssh: remove lucid special casing for authorized_keys_file [puppet] - 10https://gerrit.wikimedia.org/r/202394 [12:35:22] (03PS7) 10Alexandros Kosiaris: ssh: allow parameterization of authorized_keys [puppet] - 10https://gerrit.wikimedia.org/r/202392 [12:35:24] (03PS7) 10Alexandros Kosiaris: sodium: specify the position of authorized_keys_file [puppet] - 10https://gerrit.wikimedia.org/r/202393 [12:42:34] (03PS1) 10Filippo Giunchedi: statsite: fix stream_cmd invocation [puppet] - 10https://gerrit.wikimedia.org/r/203042 [12:42:48] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsite: fix stream_cmd invocation [puppet] - 10https://gerrit.wikimedia.org/r/203042 (owner: 10Filippo Giunchedi) [12:42:52] WARNING: The following packages cannot be authenticated! 
[12:42:52] syslog-ng-core [12:42:55] that never stops [12:42:59] (on beta) [12:48:04] (03PS1) 10Alexandros Kosiaris: Reimage copper as jessie with role::package::builder [puppet] - 10https://gerrit.wikimedia.org/r/203043 [12:48:42] (03PS2) 10Filippo Giunchedi: Configure OCG to report counters rather than meters [puppet] - 10https://gerrit.wikimedia.org/r/199952 (owner: 10Ori.livneh) [12:48:48] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Configure OCG to report counters rather than meters [puppet] - 10https://gerrit.wikimedia.org/r/199952 (owner: 10Ori.livneh) [12:50:00] 6operations, 10Beta-Cluster, 6Labs: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1193865 (10hashar) 3NEW [13:01:54] RECOVERY - statsdlb process on graphite1001 is OK: PROCS OK: 1 process with command name statsdlb [13:03:08] (03PS2) 10Hashar: zuul: switch install to a Debian package [puppet] - 10https://gerrit.wikimedia.org/r/202714 (https://phabricator.wikimedia.org/T48552) [13:05:49] (03PS3) 10Hashar: zuul: switch install to a Debian package [puppet] - 10https://gerrit.wikimedia.org/r/202714 (https://phabricator.wikimedia.org/T48552) [13:09:14] (03PS4) 10Hashar: zuul: switch install to a Debian package [puppet] - 10https://gerrit.wikimedia.org/r/202714 (https://phabricator.wikimedia.org/T48552) [13:13:04] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 2.17% of data above the critical threshold [1000.0] [13:20:46] (still me) [13:21:43] RECOVERY - Disk space on graphite1001 is OK: DISK OK [13:22:29] (03PS4) 10Giuseppe Lavagetto: contint: update browsers package names for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/201711 (https://phabricator.wikimedia.org/T95000) (owner: 10Hashar) [13:23:16] (03CR) 10Giuseppe Lavagetto: [C: 032] contint: update browsers package names for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/201711 (https://phabricator.wikimedia.org/T95000) (owner: 
10Hashar) [13:25:34] <_joe_> it would be fantastic if I gave a V+2 on your change because jenkins is too slow [13:25:48] <_joe_> finally [13:26:34] (03CR) 10Giuseppe Lavagetto: [C: 031] Reimage copper as jessie with role::package::builder [puppet] - 10https://gerrit.wikimedia.org/r/203043 (owner: 10Alexandros Kosiaris) [13:27:19] (03CR) 10Giuseppe Lavagetto: "I'll look out for other people wanting to get things from their home directories there, though" [puppet] - 10https://gerrit.wikimedia.org/r/203043 (owner: 10Alexandros Kosiaris) [13:28:24] godog: is it normal that i'm getting retrieval errors for all movingMedian()s in grafana (from graphite) ? [13:28:34] RECOVERY - puppet last run on ms-fe3002 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [13:28:54] RECOVERY - puppet last run on ms-be3001 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [13:29:04] RECOVERY - puppet last run on ms-be3002 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:29:38] mobrovac: possible, I didn't get to rename the grafana dashboards yet, which dashboard? 
[13:29:57] restbase [13:30:11] godog: the weird thing is that non-movingMedian() graphs are being displayed [13:30:53] (03PS1) 10Hashar: labs: enhance output of a couple notifications [puppet] - 10https://gerrit.wikimedia.org/r/203062 [13:31:44] PROBLEM - puppet last run on mw2095 is CRITICAL Puppet has 1 failures [13:33:23] jouncebot, next [13:33:23] In 1 hour(s) and 26 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150409T1500) [13:33:29] (03CR) 10Hashar: "Example of annoying output:" [puppet] - 10https://gerrit.wikimedia.org/r/203062 (owner: 10Hashar) [13:33:36] mobrovac: mh perhaps it is the one with percentile that got renamed [13:33:44] RECOVERY - puppet last run on ms-fe3001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:34:12] (03CR) 10jenkins-bot: [V: 04-1] labs: enhance output of a couple notifications [puppet] - 10https://gerrit.wikimedia.org/r/203062 (owner: 10Hashar) [13:34:53] godog: yup, that seems to be it, thnx! [13:36:03] (03CR) 10Alex Monk: "Apparently dewiki already set up local i18n for it, let's do this in the next swat deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201940 (https://phabricator.wikimedia.org/T94368) (owner: 10Glaisher) [13:36:45] (03PS2) 10Hashar: labs: enhance output of a couple notifications [puppet] - 10https://gerrit.wikimedia.org/r/203062 [13:39:04] RECOVERY - puppet last run on ms-be3004 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [13:44:14] mobrovac: cool, did renaming fix it? [13:44:50] godog: for restbase, it did, but citoid now seems to be getting only some stats [13:44:51] hm hm [13:47:34] RECOVERY - puppet last run on mw2095 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [13:48:09] mobrovac: could be related to a similar change to this? 
https://gerrit.wikimedia.org/r/199952 [13:48:38] (03PS7) 10Giuseppe Lavagetto: standard: include admin wherever needed [puppet] - 10https://gerrit.wikimedia.org/r/202407 (https://phabricator.wikimedia.org/T86774) [13:49:44] godog: nope, i deployed a similar fix for citoid [13:50:13] godog: the weird thing is that some stats are displayed, some are not [13:50:19] and no error from graphite [13:50:44] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures [13:52:08] mobrovac: mhh try renaming max to upper in the heap stats [13:52:49] godog: heh, the heap stats are actually being displayed, all the others are not [13:53:45] (03CR) 10Hashar: [C: 031] "I have tested the puppet part on labs and that works fine." [puppet] - 10https://gerrit.wikimedia.org/r/202714 (https://phabricator.wikimedia.org/T48552) (owner: 10Hashar) [13:54:44] godog: ah ok, nevermind, .count && .rate do not exist any more :) [13:55:04] mobrovac: ah yeah, I think because they are sent as statsd meters, but should be counters I'd say [13:57:10] yeah, obviously will need to change some stuff in the code [13:59:15] mobrovac: heh sorry the transition isn't as smooth as I wanted [13:59:29] https://github.com/etsy/statsd/wiki/Protocol with this being the protocol spec, people got creative [13:59:38] eh :) [14:00:47] will get sth to eat [14:02:10] Is anybody here aware of what I should expect from Limn? [14:02:37] hi renoirb_: come to #wikimedia-analytics, I maintain limn (but it's kind of end of life) [14:02:42] We want to get more usage details and it seems that Limn is the current tool [14:02:45] will do! [14:06:33] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1194117 (10Nuria) No further work is planned. The limitation on the length of events is on varnish.Thus far, lengt... 
[14:08:05] (03PS1) 10BBlack: add_ip6_mapped: middle approach [puppet] - 10https://gerrit.wikimedia.org/r/203069 [14:08:13] (03CR) 10Santhosh: [C: 031] Beta: Add 'simple' in source and target [puppet] - 10https://gerrit.wikimedia.org/r/203027 (https://phabricator.wikimedia.org/T95538) (owner: 10KartikMistry) [14:08:24] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:18] (03CR) 10Se4598: "But meta uses the i18n messages for global group rights management and isn't on 1.26wmf1 yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201940 (https://phabricator.wikimedia.org/T94368) (owner: 10Glaisher) [14:20:28] (03CR) 10Alex Monk: "Well, the stewards would still be able to configure it, so..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201940 (https://phabricator.wikimedia.org/T94368) (owner: 10Glaisher) [14:21:11] 6operations, 10Continuous-Integration: Build Debian package jenkins-debian-glue for Jessie - https://phabricator.wikimedia.org/T95006#1194170 (10hashar) [14:21:57] (03PS1) 10Hashar: contint: make Jessie slaves package builders [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) [14:23:08] (03CR) 10Hashar: "I have filled T95545 to get this new package builder module applied on the CI Jessie slaves :)" [puppet] - 10https://gerrit.wikimedia.org/r/194471 (owner: 10Alexandros Kosiaris) [14:23:56] (03CR) 10Giuseppe Lavagetto: [C: 032] standard: include admin wherever needed [puppet] - 10https://gerrit.wikimedia.org/r/202407 (https://phabricator.wikimedia.org/T86774) (owner: 10Giuseppe Lavagetto) [14:24:03] <_joe_> oook [14:29:19] (03CR) 10Hashar: "I have cherry picked it on the integration puppetmaster. 
Have to recreated the Jessie instance though because the extended disk was too sm" [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [14:30:06] do you guys have a Phabricator tag for Debian packaging related tasks ? [14:30:16] such as rebuilding a package / backporting / uploading to apt.wm.o ? [14:32:31] akosiaris: I’m on a bus but have a few minutes. What’s up? [14:33:29] andrewbogott: I have a VM in the maps project in labs that is SHUTOFF [14:33:36] and in a building state in the same time [14:33:45] akosiaris: an old one or one you just created? [14:33:51] old one [14:33:55] not mine btw [14:34:03] toolserver volunteers [14:34:06] maps-tiles2 [14:34:21] I-00000364.eqiad.wmflabs [14:34:22] ok, I’ll reset it if I can get a login on virt1000 :) [14:35:48] andrewbogott: I can do it... I was not sure what to do though [14:35:52] nova list reset-state ? [14:36:04] I am unsure why the VM is in that state [14:36:39] maybe virt dns server is borked again ? [14:36:42] ave-jessie-1001 login: 2015-04-09T14:29:29.300044+00:00 integration-slave-jessie-1001 puppet-agent[506]: Could not request certificate: getaddrinfo: Name or service not known [14:36:49] that is on an instance I created a second or so ago [14:37:27] 6operations, 7HTTPS, 3HTTPS-by-default, 5Patch-For-Review: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#1194249 (10BBlack) Based on the overnight esams traffic graphs up through present, the Step 4 moves last night were probably slightly more aggressive than id... [14:38:57] akosiaris: we were having cpu-usage problems a week ago and I halted some instances to get the servers unstuck. I might have halted that instance, although I would’ve written in the SAL if I did... 
[14:39:13] jouncebot, net [14:39:15] jouncebot, next [14:39:15] In 0 hour(s) and 20 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150409T1500) [14:39:17] Anyway — you need the nova ID, which is on the instance page. In this case it’s 0e82f3c8-af65-433a-89dc-0f3425e7f585 [14:39:34] 6operations, 10Wikimedia-Site-requests: Run "refreshLinks.php --dfn-only" on all wikis periodically - https://phabricator.wikimedia.org/T18112#1194270 (10MarkAHershberger) ----- Original Message ----- > Nemo_bis added a subscriber: JohnLewis. > Nemo_bis added a comment. > In https://phabricator.wikimedia.org... [14:39:34] akosiaris: well… do you want me to tell you how to do this for future reference or shall I just? I have a login now. [14:40:12] andrewbogott: OS_TENANT_NAME="maps" nova reset-state 0e82f3c8-af65-433a-89dc-0f3425e7f585 ? [14:40:24] I assume this, is it correct ? [14:40:34] akosiaris: try just ‘nova start’ first [14:40:38] ok [14:40:59] let's see [14:41:09] ok, it says active [14:41:17] Also, reset-state resets to ‘error’ state which nova then refuses to do anything further with things in error state. So I usually do reset-state —state running (or something to that effect, can’t remember the syntax) [14:41:25] yeah, that’s probably all it takes. [14:41:42] --active [14:41:52] hashar: I’m not sure about your new instance, but something is broken in the labs puppet config as of a few minutes ago. [14:42:02] akosiaris: yeah, sounds right [14:42:09] andrewbogott: yeah I can imagine. Not a big deal for me that can wait [14:42:16] andrewbogott: thanks! [14:45:03] 7Puppet, 6operations, 5Patch-For-Review: restructure site.pp to use roles, hiera. - https://phabricator.wikimedia.org/T86774#1194277 (10Joe) 5Open>3Resolved [14:48:57] Krenair: Hmm, I'm actually here today. Do you still think we should deploy it today? [14:49:11] sure, why not? 
[14:49:27] ok, I'll add it then [14:50:09] oh, you already did :p [14:50:31] * anomie assumes Krenair will SWAT this morning [14:50:40] ok [14:53:05] (03PS2) 10BBlack: resync cache node list hieradata [puppet] - 10https://gerrit.wikimedia.org/r/202974 [14:53:19] (03CR) 10BBlack: [C: 032 V: 032] resync cache node list hieradata [puppet] - 10https://gerrit.wikimedia.org/r/202974 (owner: 10BBlack) [14:55:19] (03PS1) 10Ottomata: Ensure useful (mostly python) packages are on analytics worker and client nodes [puppet] - 10https://gerrit.wikimedia.org/r/203080 [14:55:43] (03PS2) 10Ottomata: Ensure useful (mostly python) packages are on analytics worker and client nodes [puppet] - 10https://gerrit.wikimedia.org/r/203080 [14:56:04] Glaisher, so... https://gerrit.wikimedia.org/r/#/c/201897/1 [14:56:15] I'm not sure about https://ast.wikipedia.org/w/index.php?title=Uiquipedia:Votaciones&section=13#Cambiu_de_nome_de_los_proyectos_wiki_n.27asturianu [14:57:22] do you speak Asturian? [14:57:30] (03CR) 10Ottomata: [C: 032] Ensure useful (mostly python) packages are on analytics worker and client nodes [puppet] - 10https://gerrit.wikimedia.org/r/203080 (owner: 10Ottomata) [14:57:33] _joe_: sorry, my connection is off and on so I don’t know if my last question made it through. Wondering if this puppet error is due to your recent patches: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Sudo::Group[ops] is already declared in file /etc/puppet/manifests/role/labs.pp:12; cannot redeclare at /etc/puppet/modules/admin/manifests/group.pp:39 on node i-000005b6.eqi [14:57:39] Krenair: they closed it like that [14:57:42] (happening on lots of labs boxes just now) [14:58:57] Glaisher, I get the impression that Wikipedia tried to rename Wiktionary? [14:59:00] which is absolutely not okay [14:59:30] ah, yeah, ‘admin’ is now included on labs boxes. 
not so good [14:59:37] Krenair: I asked them to file a separate ticket for Wiktionary [14:59:37] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [14:59:43] and that patch is for Wikipedia only [14:59:56] Glaisher, but the discussion was for both [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150409T1500). [15:00:18] kart_, hey [15:00:35] hey [15:00:38] :) [15:00:45] (03PS2) 10Alex Monk: CX: Enable 'newarticle' campaign by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202021 (https://phabricator.wikimedia.org/T95147) (owner: 10KartikMistry) [15:00:58] Krenair: so we should ask them to start another discussion for Wikipedia only? [15:01:04] I think so Glaisher [15:01:08] (03CR) 10Alex Monk: [C: 032] CX: Enable 'newarticle' campaign by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202021 (https://phabricator.wikimedia.org/T95147) (owner: 10KartikMistry) [15:01:13] (03Merged) 10jenkins-bot: CX: Enable 'newarticle' campaign by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202021 (https://phabricator.wikimedia.org/T95147) (owner: 10KartikMistry) [15:01:20] alright, I'll do that [15:01:45] (03PS1) 10Filippo Giunchedi: gdash: adjust metric name for statsite [puppet] - 10https://gerrit.wikimedia.org/r/203081 [15:01:48] (03PS1) 10Filippo Giunchedi: kafka: adjust graphite checks for statsite names [puppet] - 10https://gerrit.wikimedia.org/r/203082 [15:01:54] ottomata: ^ for your eyes [15:02:03] will look, one sec [15:02:07] PROBLEM - puppet last run on analytics1031 is CRITICAL puppet fail [15:02:11] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: adjust metric name for statsite [puppet] - 10https://gerrit.wikimedia.org/r/203081 (owner: 10Filippo Giunchedi) [15:02:18] PROBLEM - puppet last run on 
analytics1029 is CRITICAL puppet fail [15:02:20] i know! [15:02:22] my fault. [15:02:33] (03PS1) 10Ottomata: Install common compute packages on Hadoop cluster nodes in a better way than the last commit [puppet] - 10https://gerrit.wikimedia.org/r/203083 [15:02:38] PROBLEM - puppet last run on analytics1019 is CRITICAL puppet fail [15:02:44] (03CR) 10jenkins-bot: [V: 04-1] Install common compute packages on Hadoop cluster nodes in a better way than the last commit [puppet] - 10https://gerrit.wikimedia.org/r/203083 (owner: 10Ottomata) [15:02:55] (03PS2) 10Ottomata: Install common compute packages on Hadoop cluster nodes in a better way than the last commit [puppet] - 10https://gerrit.wikimedia.org/r/203083 [15:03:23] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/202021/ (duration: 00m 11s) [15:03:25] kart_, ^ [15:03:29] Logged the message, Master [15:03:34] Krenair: testing.. [15:03:58] PROBLEM - puppet last run on analytics1034 is CRITICAL puppet fail [15:04:08] PROBLEM - puppet last run on analytics1036 is CRITICAL puppet fail [15:04:16] (03CR) 10Ottomata: [C: 032] Install common compute packages on Hadoop cluster nodes in a better way than the last commit [puppet] - 10https://gerrit.wikimedia.org/r/203083 (owner: 10Ottomata) [15:04:18] PROBLEM - puppet last run on analytics1039 is CRITICAL puppet fail [15:04:57] Krenair: working. Thanks! 
[15:04:57] PROBLEM - puppet last run on analytics1015 is CRITICAL puppet fail [15:04:59] (03PS3) 10Alex Monk: Add 'editeditorprotected' protection level on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201940 (https://phabricator.wikimedia.org/T94368) (owner: 10Glaisher) [15:05:18] PROBLEM - puppet last run on analytics1033 is CRITICAL puppet fail [15:05:38] (03CR) 10Alex Monk: [C: 032] Add 'editeditorprotected' protection level on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201940 (https://phabricator.wikimedia.org/T94368) (owner: 10Glaisher) [15:05:43] (03Merged) 10jenkins-bot: Add 'editeditorprotected' protection level on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201940 (https://phabricator.wikimedia.org/T94368) (owner: 10Glaisher) [15:05:45] (03CR) 10Glaisher: [C: 04-1] "see phab" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201897 (https://phabricator.wikimedia.org/T94341) (owner: 10Glaisher) [15:06:23] Glaisher [15:06:31] !log krenair Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/201940/ (duration: 00m 11s) [15:06:35] Logged the message, Master [15:06:38] ookay [15:07:07] https://de.wikipedia.org/wiki/Spezial:Gruppenrechte [15:07:08] PROBLEM - puppet last run on analytics1035 is CRITICAL puppet fail [15:07:11] geschützt (nur Sichter) (editeditorprotected) [15:07:18] PROBLEM - puppet last run on analytics1017 is CRITICAL puppet fail [15:07:18] RECOVERY - puppet last run on analytics1031 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:07:38] PROBLEM - puppet last run on analytics1040 is CRITICAL puppet fail [15:07:39] huh hm. uhhhhhhhh godog, i think you will not know the answer to this, so this is probably rhetorical. [15:07:50] why do I have monitoring::graphite_anomaly { 'kafka-broker-MessagesIn-anomaly': defined in two places? [15:07:53] graphite.pp and kafka.pp???? 
[15:07:57] "levels":["","autoconfirmed","editeditorprotected","sysop","superprotect" [15:07:59] Krenair: ^ [15:08:01] it should probably just be in graphite.pp.... [15:08:05] yeah looks good [15:08:08] PROBLEM - puppet last run on analytics1041 is CRITICAL puppet fail [15:10:31] ottomata: you guessed right, I don't know the answer :) maybe leftover? [15:10:37] RECOVERY - puppet last run on analytics1035 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:10:37] RECOVERY - puppet last run on analytics1033 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:10:57] RECOVERY - puppet last run on analytics1034 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:10:58] RECOVERY - puppet last run on analytics1029 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:11:07] RECOVERY - puppet last run on analytics1036 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:11:07] RECOVERY - puppet last run on analytics1040 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:11:18] RECOVERY - puppet last run on analytics1039 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:11:18] RECOVERY - puppet last run on analytics1019 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:11:38] RECOVERY - puppet last run on analytics1041 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:11:57] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:12:12] Glaisher, am looking through the queue to see if there's any other simple config changes to make [15:12:27] (03PS1) 10Giuseppe Lavagetto: admin: fix inclusion in labs [puppet] - 10https://gerrit.wikimedia.org/r/203084 [15:12:28] RECOVERY - puppet last run on analytics1017 is OK Puppet is currently enabled, last run 1 minute ago with 0 
failures [15:12:57] ah, nice [15:13:12] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] admin: fix inclusion in labs [puppet] - 10https://gerrit.wikimedia.org/r/203084 (owner: 10Giuseppe Lavagetto) [15:13:20] I don't think I've an open config patch which needs to be deployed [15:13:57] PROBLEM - puppet last run on stat1002 is CRITICAL puppet fail [15:15:24] kart_, what about https://gerrit.wikimedia.org/r/#/c/202689/1 ? [15:15:27] <_joe_> ottomata: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Package[python-numpy] is already declared in file /etc/puppet/modules/statistics/manifests/compute.pp:79; [15:15:33] <_joe_> I suppose you know it [15:15:58] I think I need to update https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Add_a_wiki :/ [15:17:04] hey gwicke, shouldn't there be something on ^ and https://wikitech.wikimedia.org/wiki/Add_a_wiki about restbase? [15:17:12] I updated the other Add a wiki documentation January [15:17:16] last* [15:17:28] still needs to be checked [15:19:00] (03PS2) 10Alex Monk: New WP and VP namespaces aliases on lv.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202755 (https://phabricator.wikimedia.org/T95106) (owner: 10Dereckson) [15:19:14] pssh _joe_ will fix. 
need to use ensure_packages everywhere [15:19:16] thanks [15:19:53] Krenair: I guess so, although it might make sense to wait until we cover all wikis [15:20:07] (03CR) 10Alex Monk: [C: 032] New WP and VP namespaces aliases on lv.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202755 (https://phabricator.wikimedia.org/T95106) (owner: 10Dereckson) [15:20:16] 6operations, 10Wikimedia-General-or-Unknown, 7Documentation: Add a wiki on wikitech is out of date, incomplete - https://phabricator.wikimedia.org/T87588#1194497 (10Glaisher) https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Add_a_wiki too [15:20:28] otherwise we'd have to explain the details [15:20:46] ottomata: I'll go ahead btw with the .value rename, we can find out about the duplication later [15:21:14] gwicke, don't all wikis in beta rely on restbase being there for ve? [15:21:18] Krenair: we should set up some pub/sub thing for wiki creations ;) [15:21:34] we have a mailing list that gets notified when we make a new wiki [15:21:49] ah, should probably subscribe [15:21:51] name? [15:22:08] newprojects: https://lists.wikimedia.org/mailman/listinfo/newprojects [15:22:30] thx, subscribed [15:22:55] I don't think it shows beta wiki creations anymore [15:23:33] gwicke, https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings-labs.php#L396 - yeah [15:23:45] we need restbase for all new beta wikis now [15:24:06] is there a lot of churn on beta? [15:24:23] the RB config doesn't have that many wikis listed for beta [15:24:48] PROBLEM - puppet last run on graphite1001 is CRITICAL Puppet last ran 4 hours ago [15:25:20] churn? 
[15:25:22] it's got only the domains parsoid has in its interwiki map in localsettings.js for beta [15:25:31] Krenair: we could also consider supporting a regexp rule [15:25:47] godog +1 [15:26:00] (03PS2) 10Filippo Giunchedi: kafka: adjust graphite checks for statsite names [puppet] - 10https://gerrit.wikimedia.org/r/203082 [15:26:07] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] kafka: adjust graphite checks for statsite names [puppet] - 10https://gerrit.wikimedia.org/r/203082 (owner: 10Filippo Giunchedi) [15:26:28] RECOVERY - puppet last run on graphite1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:27:03] mobrovac, okay, and the instructions at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Add_a_wiki say to change parsoid-localsettings-beta.js [15:27:16] what extra needs to be done for restbase? [15:28:06] with a regexp based rule we could probably eliminate the need to do anything [15:28:08] RECOVERY - RAID on ms-be1005 is OK optimal, 13 logical, 13 physical [15:28:26] Krenair: this is gonna be fun - https://github.com/wikimedia/operations-puppet/blob/production/modules/restbase/templates/config.labs.yaml.erb#L141 [15:28:28] (03Merged) 10jenkins-bot: New WP and VP namespaces aliases on lv.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202755 (https://phabricator.wikimedia.org/T95106) (owner: 10Dereckson) [15:29:11] !log bounce uwsgi on graphite1001 [15:29:14] Logged the message, Master [15:31:19] mobrovac, is it as simple as adding an extra line there? [15:31:28] Krenair: yes [15:32:16] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/202755/ (duration: 00m 12s) [15:32:19] Logged the message, Master [15:33:12] gwicke, what's so complicated about that? 
:/ [15:34:17] Krenair: it's not too complicated, just creates busywork [15:34:30] alternative in https://phabricator.wikimedia.org/T95563 [15:34:31] yeah, the whole page is about the busywork involved [15:35:11] 6operations, 10ops-eqiad: ms-be1005.eqiad.wmnet: slot=5 dev=sdf failed - https://phabricator.wikimedia.org/T95268#1194555 (10Cmjohnson) We have spare disks on-site so I replaced the disk in slot 5 ran megacli -DiscardPreservedCache -L5 -a0 and then megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0 [15:38:06] Hi guys, checking Wikimedia page https://wikitech.wikimedia.org/wiki/BGP/old_setup [15:38:23] do you have "current" setup page somewhere ? [15:40:09] (03CR) 10Alex Monk: [C: 032] Added media.padil.gov.au to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202726 (https://phabricator.wikimedia.org/T95328) (owner: 10Dereckson) [15:40:45] (03PS1) 10Ottomata: Use ensure_packages rather than package resource for many analytics and statistics roles [puppet] - 10https://gerrit.wikimedia.org/r/203085 [15:41:56] (03CR) 10Alex Monk: "Prateek? Please respond. This has been sitting in the config queue since November." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175406 (https://phabricator.wikimedia.org/T73477) (owner: 10Glaisher) [15:43:29] (03PS1) 10BBlack: remove pointless (I think!) 
esams/ulsfo $ganglia_aggregator and cache_upload def for esams [puppet] - 10https://gerrit.wikimedia.org/r/203087 [15:44:09] (03Abandoned) 10Alex Monk: Generalize comment for "templateeditor" since it's on other wikis now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/153306 (owner: 10Legoktm) [15:44:18] (03Merged) 10jenkins-bot: Added media.padil.gov.au to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202726 (https://phabricator.wikimedia.org/T95328) (owner: 10Dereckson) [15:45:33] (03CR) 10Alex Monk: [C: 04-1] Switch some usages of 'wiki' to 'wikipedia' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194086 (https://phabricator.wikimedia.org/T91340) (owner: 10MaxSem) [15:45:50] (03CR) 10Alex Monk: [C: 04-1] Don't set up the job queue for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190406 (owner: 10Ori.livneh) [15:46:02] (03PS2) 10Ottomata: Use ensure_packages rather than package resource for many analytics and statistics roles [puppet] - 10https://gerrit.wikimedia.org/r/203085 [15:46:09] (03CR) 10Ottomata: [C: 032 V: 032] Use ensure_packages rather than package resource for many analytics and statistics roles [puppet] - 10https://gerrit.wikimedia.org/r/203085 (owner: 10Ottomata) [15:46:43] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/202726 (duration: 00m 12s) [15:46:48] Logged the message, Master [15:47:28] 6operations, 10ops-eqiad, 10Analytics-Cluster: analytics1020 hardware failure - https://phabricator.wikimedia.org/T95263#1194617 (10Ottomata) Thanks Chris, how's it looking? 
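Editorial note on the "Duplicate declaration: Package[python-numpy]" failure and the ensure_packages fix discussed above: the following is a minimal Python sketch of the idea, not Puppet code and not Puppet's API. The `Catalog` class, its method names, and the file names in the usage lines are all hypothetical; the only behaviour assumed is that a plain package resource may be declared once per catalog, while an ensure_packages-style helper declares a package only if it is not already present.

```python
class DuplicateDeclarationError(Exception):
    """Raised when the same resource is declared twice in one catalog."""


class Catalog:
    """Toy stand-in for a compiled catalog (hypothetical, not Puppet's API)."""

    def __init__(self):
        # Maps package name -> where it was first declared.
        self.resources = {}

    def declare_package(self, name, source):
        # Models a plain package resource: a second declaration is an error.
        if name in self.resources:
            raise DuplicateDeclarationError(
                f"Package[{name}] is already declared in {self.resources[name]}"
            )
        self.resources[name] = source

    def ensure_packages(self, names, source):
        # Models the ensure_packages idea: declare only what is missing.
        for name in names:
            if name not in self.resources:
                self.resources[name] = source


catalog = Catalog()
catalog.ensure_packages(["python-numpy"], "compute.pp")
catalog.ensure_packages(["python-numpy"], "other_role.pp")  # no error, no change
```

In this model a plain `declare_package` call made after `ensure_packages` still collides, which is one way to read "need to use ensure_packages everywhere": the helper only protects the call sites that actually use it.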
[15:47:55] (03CR) 10Alex Monk: [C: 04-1] "Un-addressed comments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190203 (owner: 10Se4598) [15:48:29] (03CR) 10Alex Monk: [C: 04-1] "Unmerged dependency" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197038 (owner: 10Werdna) [15:48:55] Krinkle: I'll migrate labmon1001 to statsite in 15min btw [15:48:56] (03PS2) 10BBlack: remove pointless (I think!) esams $ganglia_aggregator and cache_upload def for esams [puppet] - 10https://gerrit.wikimedia.org/r/203087 [15:49:14] (03PS2) 10RobH: adding wtp systems to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/200734 (https://phabricator.wikimedia.org/T90271) [15:49:16] (03PS2) 10Alex Monk: Make Hovercards default for Chinese, Catalan and Greek WP. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197038 (https://phabricator.wikimedia.org/T88164) (owner: 10Werdna) [15:49:28] (03CR) 10jenkins-bot: [V: 04-1] Make Hovercards default for Chinese, Catalan and Greek WP. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197038 (https://phabricator.wikimedia.org/T88164) (owner: 10Werdna) [15:49:38] 6operations, 10ops-eqiad, 10Analytics-Cluster: analytics1020 hardware failure - https://phabricator.wikimedia.org/T95263#1194633 (10Cmjohnson) The basic things that would typically fix this did not work. I will have to try a firmware upgrade next but the only firmware update is for server2008 and in .exe fi... [15:50:06] ottomata: I have to deal with tech support for a non-windows firmware update [15:50:13] 6operations, 10Analytics-EventLogging: Decommission vanadium - https://phabricator.wikimedia.org/T95566#1194634 (10Ottomata) 3NEW a:3Ottomata [15:50:29] kart_? 
[15:50:30] (03PS3) 10RobH: adding wtp systems to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/200734 (https://phabricator.wikimedia.org/T90271) [15:50:45] RECOVERY - puppet last run on stat1002 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:51:05] (03Abandoned) 10RobH: adding wtp systems to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/200734 (https://phabricator.wikimedia.org/T90271) (owner: 10RobH) [15:51:19] ottomata: ok [15:51:20] oops [15:51:21] ha [15:51:25] cmjohnson1: ok [15:51:35] godog: I didn't quite get in what way this would impact users [15:51:42] cmjohnson1: how long do you think that will take? [15:51:57] Are the end point metric properties still the same? [15:52:00] And graphite functions? [15:52:02] robh: we don't ahve dsh groups anymore, right? can I remove this from the reclaim or decomission instructions " • Remove server entry from DSH node groups. [15:52:02] • These files are maintained in operations/puppet:/files/dsh/group. [15:52:02] "? [15:52:04] on wikitech? [15:52:08] ottomata: i know it's important ...i have several things spinning ATM [15:52:22] ottomata: dont we still have one or two? are they entirely gone? [15:52:38] i thought the deployment still used it but i may be wrong. [15:52:39] IDK ...cuz it depends on if the tech suppport person I get will know what to do [15:52:41] Krinkle: graphite is the same, some statsd metrics will change name tho [15:52:53] ottomata: if even one dsh group still exists, best to leave it as a step. 
[15:52:58] ottomata regarding time ^^ [15:53:00] cmjohnson1: ok, it isn't super critical, hadoop keeps on trucking [15:53:00] but if its all gone, sure, remove, but be sure [15:53:10] (03CR) 10Alex Monk: [C: 032] Collection: Remove deprecated $wgCollectionHierarchyDelimiter configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200037 (owner: 10Cscott) [15:53:13] but would be nice to have back soon, s'ok if there other priorities [15:53:16] i will poke about it though :) [15:53:17] thanks! [15:53:19] (03Merged) 10jenkins-bot: Collection: Remove deprecated $wgCollectionHierarchyDelimiter configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200037 (owner: 10Cscott) [15:53:39] godog: is it possible that rate metrics are all multiplied by 1000? [15:53:55] ok, robh, i still see dsh stuff, but it is now in module [15:53:58] will just update doc with proper path [15:54:03] cool, thx dude [15:54:09] !log krenair Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/200037/1 - should be a no-op (duration: 00m 11s) [15:54:16] Logged the message, Master [15:55:02] gwicke: perhaps related to the flush interval? which metrics btw? [15:55:21] (03CR) 10Alex Monk: [C: 04-1] "Please answer my questions." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/199576 (owner: 10KartikMistry) [15:55:26] godog: all rate metrics in http://grafana.wikimedia.org/#/dashboard/db/restbase and http://grafana.wikimedia.org/#/dashboard/db/visualeditor-load-save [15:56:02] most likely all, those are the two dashboards I checked / fixed so far [15:56:06] PROBLEM - puppet last run on analytics1027 is CRITICAL puppet fail [15:56:19] (03PS1) 10Ottomata: Remove references to vanadium in order to decomission [puppet] - 10https://gerrit.wikimedia.org/r/203091 (https://phabricator.wikimedia.org/T95566) [15:58:01] (03CR) 10Ottomata: [C: 032] Remove references to vanadium in order to decomission [puppet] - 10https://gerrit.wikimedia.org/r/203091 (https://phabricator.wikimedia.org/T95566) (owner: 10Ottomata) [15:59:15] (03CR) 10Alex Monk: [C: 032] Convert all Bugzilla numbers to Phabricator ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201910 (owner: 10Alex Monk) [15:59:19] (03CR) 10jenkins-bot: [V: 04-1] Convert all Bugzilla numbers to Phabricator ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201910 (owner: 10Alex Monk) [16:00:04] kart_: Respected human, time to deploy Content Translation deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150409T1600). Please do the needful. [16:00:44] oh well [16:00:47] gwicke: things that depended on the flush interval yet would have changed, but not for timers afaik, looking, how is that metric being sent to statsd btw? [16:01:06] godog: I think I saw a list of renamed properties at some point. Is there a list? [16:01:08] Yes jouncebot Sir [16:01:16] analytics1027 gah [16:01:17] godog: it's a normal timer [16:01:32] what is yo prob... [16:01:37] (i am talking to servers now) [16:01:43] godog: on the wire, it would use the 'ms' type [16:02:06] the rate is derived from it [16:02:54] Krinkle: yeah we can get one, I'll let you know [16:03:37] gwicke: is there sampling involved too? 
[16:03:45] (03PS1) 10Ottomata: Use ensure_packages for java on hadoop clients [puppet] - 10https://gerrit.wikimedia.org/r/203093 [16:05:06] !log Updated cxserver to 640bcdf [16:05:09] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1194717 (10Andrew) I've confirmed the change with digicert. Shopify says: "We'll add in store.wikimedia.org, but you'll need to confirm it one more tim... [16:05:11] Logged the message, Master [16:05:19] PROBLEM - puppet last run on palladium is CRITICAL puppet fail [16:05:24] ok. CX deployment time then. [16:05:31] (03PS3) 10Alex Monk: Convert all Bugzilla numbers to Phabricator ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201910 [16:05:37] (03CR) 10Ottomata: [C: 032] Use ensure_packages for java on hadoop clients [puppet] - 10https://gerrit.wikimedia.org/r/203093 (owner: 10Ottomata) [16:06:05] cmjohnson1: another q for you [16:06:20] sure [16:06:23] i am decomissioning vanadium [16:06:29] it has a bad disk, but i don't need the node anymore [16:06:45] https://phabricator.wikimedia.org/T94926 [16:06:47] is there an op who could deploy an apache config patch? [16:06:51] akosiaris: godog can you merge, https://gerrit.wikimedia.org/r/203027 and https://gerrit.wikimedia.org/r/202341 please? [16:06:52] what should I do? [16:06:59] just power off? [16:07:04] kart_: sorry I can't ATM [16:07:20] godog: graphite stuff? 
:) [16:07:30] RECOVERY - puppet last run on analytics1027 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:07:48] ye [16:08:09] ottomata: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_or_Decommission [16:08:14] yes, i am doing all of that [16:08:25] but i wasn't sure if i should do anything special because of the bad disk [16:08:31] okay..once done send me a ticket [16:08:34] like, if you wanted to make sure you got to it before or something [16:08:44] i will do the rest of the decom and replace the disk [16:08:50] ok, i'll link the bad disk ticket to the decom one and note it [16:08:51] k danke [16:08:58] well add to server spares --make that the subject [16:09:03] ok [16:09:49] !log decomissioning vanadium, powering it off [16:09:55] Logged the message, Master [16:10:47] robh, do we use decomissioning.pp anymore? [16:11:13] no, you manually remove from icinga [16:11:22] its just more reliable [16:11:25] lifecycle wrong again? [16:11:51] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_or_Decommission eww [16:11:56] so yea there are RT references and such [16:12:05] ottomata: by chance is this machine coming back to spares? [16:12:10] !log kartik Started scap: Update ContentTranslation [16:12:12] or is it going into a different use? 
[16:12:13] Logged the message, Master [16:12:34] (03PS1) 10Ottomata: Remove vanadium entries, leave mgmt ones [dns] - 10https://gerrit.wikimedia.org/r/203097 (https://phabricator.wikimedia.org/T95566) [16:12:35] robh, yes [16:12:49] i did the remove from icinga [16:12:53] i will remove the decmossion.pp part [16:13:34] (03CR) 10Ottomata: [C: 032] Remove vanadium entries, leave mgmt ones [dns] - 10https://gerrit.wikimedia.org/r/203097 (https://phabricator.wikimedia.org/T95566) (owner: 10Ottomata) [16:14:02] ottomata: so this is a reclaim, dont touch the mgmt dns entries, but the rest can go, but also make sure you put in a ticket in network for someone to disable the switch port [16:14:22] ottomata: unless its already shut down, then i'll just do it now and save you the task =] [16:14:23] lemme know [16:14:34] robh: we should work on that spares list and start identifying everything out of warranty [16:14:49] robh, i powered it off [16:14:50] !log migrate labmon1001 to statsite [16:14:56] Logged the message, Master [16:14:57] its pretty easy to tell on spares due to the specs for me ;D [16:14:58] but yea [16:14:59] left mgmt entires in place [16:15:21] ottomata: cool, i'll disable the switch port now, you dont need to add a ticket [16:15:23] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1194744 (10Ottomata) p:5Triage>3Normal a:5Ottomata>3Cmjohnson [16:15:39] just to confirm, this is vanadium [16:15:47] (i see the gerrit patchsets, but no task, so im paranoid) [16:15:58] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1194757 (10Ottomata) I have decomissioned vanadium. It is powered off and read to be reclaimed. Please note that it currently has a failed disk: https://phabricator.wikimed... 
[16:16:11] 6operations, 10ops-eqiad, 10Analytics-EventLogging: vanadium failed disk /dev/sda - https://phabricator.wikimedia.org/T94926#1194760 (10Ottomata) [16:16:14] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1194759 (10Ottomata) [16:16:16] ottomata: ge-4/0/11 up up vanadium [16:16:19] robh, yes vanadium [16:16:20] you sure its powered odnw? [16:16:23] uh [16:16:25] its port is still up [16:17:00] RECOVERY - puppet last run on palladium is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:17:12] robh, I did [16:17:17] shutdown -P now [16:17:19] how do you power off usually? [16:17:21] via mgmt? [16:17:25] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1194766 (10RobH) [16:17:32] ottomata: ssh in and shut it down is easier. [16:17:41] but if you didnt, then yea mgmt can do it [16:17:47] i did it in ssh [16:17:48] shutdown -h now [16:17:49] while i was there [16:17:51] -h? [16:17:51] hmmm [16:17:53] i did -P [16:17:54] halt [16:18:29] iunnooo, anyway, i can't ssh in at all, so uhhh [16:18:35] whatevs, it shoudl shut down, so yea we will do mgmt [16:18:40] i added some entries to the decom ticket [16:18:42] k cool [16:18:51] we should make a habit of listing the steps off for decom, just as we do for installs [16:18:56] so nothing is skipped [16:19:12] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1194771 (10Ottomata) [16:19:15] ok [16:19:20] PROBLEM - Graphite Carbon on labmon1001 is CRITICAL Not all configured Carbon instances are running. [16:19:21] i have already removed from production dns [16:19:22] so why is it assigned to chris? (just asking) [16:19:24] not hte mgmt entires though [16:19:33] uh, i assumed he had to do it [16:19:36] no? 
[16:19:43] the only thing he has to do is a wipe [16:19:46] and thats a sub task [16:19:48] not the main task [16:19:49] adding to spares sounds like it needs to be throne on top of a big pile somewhere [16:19:52] thrown* [16:19:53] robh: i asked for it [16:19:54] by me [16:20:00] i control spares list ;D [16:20:22] yeah but i've had my fair share of entries ...thought it just be easier [16:20:25] ok you two work it out :) [16:20:40] I am dusting my hands off :) [16:20:41] thank you! [16:20:44] I just want us to follow the exact same procedure, I don't care who does it. However, that procedure hasn't been documented for phabricator. [16:20:55] so, i'll take some time to document how it should flow [16:21:10] RECOVERY - Graphite Carbon on labmon1001 is OK All defined Carbon jobs are runnning. [16:21:18] cmjohnson1: the ideal is a flow like installs, with sub-tasks for onsite work versus software, etc... [16:22:31] well, luckily, chris and i both know all the decom/reclai steps [16:22:38] but still doesnt mean i shouldnt document it, heh [16:23:24] 6operations, 10Beta-Cluster, 6Labs: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1194794 (10Dzahn) http://ubuntuforums.org/showthread.php?t=802156 tldr: bad proxies sudo aptitude -o Acquire::http::No-Cache=True -o Acqui... 
[16:23:38] (03PS2) 10Dr0ptp4kt: Enable CirrusSearch event logging in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202965 (owner: 10Bmansurov) [16:25:08] (03PS1) 10Filippo Giunchedi: labs: update graphite checks with new metric names [puppet] - 10https://gerrit.wikimedia.org/r/203102 [16:25:10] (03PS1) 10Filippo Giunchedi: swift: use keepLastValue where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/203103 [16:25:25] (03CR) 10Dr0ptp4kt: [C: 032] Enable CirrusSearch event logging in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202965 (owner: 10Bmansurov) [16:25:39] (03CR) 10Dr0ptp4kt: [V: 032] "Okay, let's try this on the beta cluster." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202965 (owner: 10Bmansurov) [16:25:43] (03PS2) 10Filippo Giunchedi: labs: update graphite checks with new metric names [puppet] - 10https://gerrit.wikimedia.org/r/203102 [16:26:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] labs: update graphite checks with new metric names [puppet] - 10https://gerrit.wikimedia.org/r/203102 (owner: 10Filippo Giunchedi) [16:26:15] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: use keepLastValue where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/203103 (owner: 10Filippo Giunchedi) [16:26:44] (03PS2) 10Filippo Giunchedi: swift: use keepLastValue where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/203103 [16:26:51] (03CR) 10Filippo Giunchedi: [V: 032] swift: use keepLastValue where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/203103 (owner: 10Filippo Giunchedi) [16:27:40] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.67% of data above the critical threshold [1000.0] [16:29:15] godog: \o/ [16:29:27] godog: sorry I didn’t end up making that patch, thanks for making it [16:29:34] YuviPanda: np, fixing other things too :( [16:29:52] oh :( [16:29:58] let me know if there’s anything I can do to help [16:30:04] has it moved already? 
[16:31:04] YuviPanda: should be now yeah, mistakenly redirected to prod for a minute [16:31:12] hah :) [16:31:28] 6operations, 10ops-eqiad: dysprosium memory failure - https://phabricator.wikimedia.org/T95423#1194829 (10Cmjohnson) Moved the DIMM from A4 to A3 and DIMM error followed MEMBIST Memory Test failure DIMM A3 [16:34:18] !log kartik Finished scap: Update ContentTranslation (duration: 22m 07s) [16:34:21] Logged the message, Master [16:39:14] (03PS1) 10Legoktm: Add SandboxLink to extension-list, deploy on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203105 (https://phabricator.wikimedia.org/T72499) [16:42:56] YuviPanda: want free merge? [16:43:09] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [16:43:27] kart_: ? [16:43:41] YuviPanda: https://gerrit.wikimedia.org/r/203027 and https://gerrit.wikimedia.org/r/202341 :) [16:43:56] as akosiaris seems not around today :) [16:45:07] (03PS1) 10Filippo Giunchedi: labs: point statsite at labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/203106 [16:45:35] (03PS2) 10Yuvipanda: Beta: Add 'simple' in source and target [puppet] - 10https://gerrit.wikimedia.org/r/203027 (https://phabricator.wikimedia.org/T95538) (owner: 10KartikMistry) [16:45:45] (03CR) 10Yuvipanda: [C: 032 V: 032] Beta: Add 'simple' in source and target [puppet] - 10https://gerrit.wikimedia.org/r/203027 (https://phabricator.wikimedia.org/T95538) (owner: 10KartikMistry) [16:46:10] YuviPanda: https://gerrit.wikimedia.org/r/203106 [16:46:32] (03PS2) 10Yuvipanda: labs: point statsite at labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/203106 (owner: 10Filippo Giunchedi) [16:46:41] (03CR) 10Yuvipanda: [C: 031] labs: point statsite at labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/203106 (owner: 10Filippo Giunchedi) [16:46:55] (03PS3) 10Yuvipanda: CX: Swedish in target, simple in source and sv-da in MT [puppet] - 
10https://gerrit.wikimedia.org/r/202341 (https://phabricator.wikimedia.org/T95108) (owner: 10KartikMistry) [16:47:12] godog: +1, although I’m going to guess that since statsite should also run on the same host, ‘localhost’ is good enough? [16:47:21] (03CR) 10Yuvipanda: [C: 032 V: 032] CX: Swedish in target, simple in source and sv-da in MT [puppet] - 10https://gerrit.wikimedia.org/r/202341 (https://phabricator.wikimedia.org/T95108) (owner: 10KartikMistry) [16:47:37] YuviPanda: yeah but if someone uses statsite in labs it'll push its metrics to prod [16:47:41] aaaah [16:47:42] I see [16:47:43] makes sense [16:47:52] godog: sorry, was in a meeting; there is no sampling involved [16:47:53] to set a default like that yeah [16:47:58] kart_: all done [16:48:09] YuviPanda: cool. Thanks a lot! [16:48:38] is your deployment done now kart_? [16:49:12] Krenair: yes. [16:50:02] (03PS4) 10Alex Monk: Convert all Bugzilla numbers to Phabricator ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201910 [16:50:11] (03CR) 10Alex Monk: [C: 032] Convert all Bugzilla numbers to Phabricator ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201910 (owner: 10Alex Monk) [16:50:23] (03PS3) 10Filippo Giunchedi: labs: point statsite at labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/203106 [16:50:38] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] labs: point statsite at labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/203106 (owner: 10Filippo Giunchedi) [16:51:53] (03PS1) 10Legoktm: Enable SandboxLink on English projects where it is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203109 (https://phabricator.wikimedia.org/T72499) [16:52:00] PROBLEM - puppet last run on labmon1001 is CRITICAL Puppet has 1 failures [16:53:39] (03Merged) 10jenkins-bot: Convert all Bugzilla numbers to Phabricator ticket numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201910 (owner: 10Alex Monk) [16:54:48] um [16:55:16] dr0ptp4kt_cold, you 
merged https://gerrit.wikimedia.org/r/#/c/202965/ but didn't sync it? :/ [16:55:21] oh well, labs only [16:55:32] putting it in with a bunch of other no-op changes I'm doing [16:56:10] (03PS1) 10Yuvipanda: tools: Add switchover / hot standby capability to tools services [puppet] - 10https://gerrit.wikimedia.org/r/203110 (https://phabricator.wikimedia.org/T95521) [16:56:22] (03CR) 10jenkins-bot: [V: 04-1] tools: Add switchover / hot standby capability to tools services [puppet] - 10https://gerrit.wikimedia.org/r/203110 (https://phabricator.wikimedia.org/T95521) (owner: 10Yuvipanda) [16:56:36] (03PS2) 10Yuvipanda: tools: Add switchover / hot standby capability to tools services [puppet] - 10https://gerrit.wikimedia.org/r/203110 (https://phabricator.wikimedia.org/T95521) [16:56:44] Krenair: oh, i thought those were auto deployed? [16:56:48] !log krenair Synchronized wmf-config: no-ops: https://gerrit.wikimedia.org/r/#/c/201910/ and https://gerrit.wikimedia.org/r/#/c/202965/ (duration: 00m 13s) [16:56:50] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [16:56:53] Logged the message, Master [16:56:54] dr0ptp4kt_cold, only to beta, lol :) [16:56:56] Krenair: mind sync'ing while you're at it? [16:57:01] yeah I just did [16:57:13] Krenair: thanks! /me slaps self [16:57:48] we sync everything in that repo [16:57:55] even if it's labs-only [16:58:34] Krenair: thank you for the reminder. and the assist. [17:00:04] Legoktm, MatmaRex: Dear anthropoid, the time has come. Please deploy mw:Extension:SandboxLink deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150409T1700). 
[17:02:33] (03PS3) 10Yuvipanda: tools: Add switchover / hot standby capability to tools services [puppet] - 10https://gerrit.wikimedia.org/r/203110 (https://phabricator.wikimedia.org/T95521) [17:02:44] (03CR) 10jenkins-bot: [V: 04-1] tools: Add switchover / hot standby capability to tools services [puppet] - 10https://gerrit.wikimedia.org/r/203110 (https://phabricator.wikimedia.org/T95521) (owner: 10Yuvipanda) [17:03:03] (03PS4) 10Yuvipanda: tools: Add switchover / hot standby capability to tools services [puppet] - 10https://gerrit.wikimedia.org/r/203110 (https://phabricator.wikimedia.org/T95521) [17:07:41] gwicke: ok so re: timers yeah .rate is bound to the flush interval, so you'd need scaleToSeconds() to bring it back to per-second [17:15:34] godog: could you apply that to all rates? [17:15:48] right now they are basically all useless [17:16:02] 6operations, 7HHVM: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#1194999 (10Jdforrester-WMF) 5Resolved>3Open This should stay open until it's actually fixed… [17:16:49] 6operations, 7HHVM: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#1195013 (10hoo) [17:16:51] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1195014 (10hoo) [17:17:01] (03CR) 10coren: "This should certainly work, but I wonder if - conceptually - we shouldn't rather make certain that everything -services- does is idempoten" [puppet] - 10https://gerrit.wikimedia.org/r/203110 (https://phabricator.wikimedia.org/T95521) (owner: 10Yuvipanda) [17:17:21] PROBLEM - Host mw2129 is DOWN: PING CRITICAL - Packet loss = 100% [17:18:06] (03CR) 10Yuvipanda: "That would require distributed locking for starting the services and what not, and also for things like emailing users when things are res" [puppet] - 10https://gerrit.wikimedia.org/r/203110 
(https://phabricator.wikimedia.org/T95521) (owner: 10Yuvipanda) [17:18:09] Coren: ^ responded [17:23:20] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [17:23:36] (03CR) 10coren: [C: 031] "Fair enough." [puppet] - 10https://gerrit.wikimedia.org/r/203110 (https://phabricator.wikimedia.org/T95521) (owner: 10Yuvipanda) [17:23:54] Coren: ^ is that from your script? [17:24:49] YuviPanda: Yeah, although 'high load' remains as useless a metric as ever. iowait is still reasonably low. [17:25:34] fair enough [17:26:19] Also, load under 10 on a 8 core box imo hardly qualified as 'high load' [17:26:43] Wait, why does it say 24? [17:27:08] The actual load is between 9.5 and 10.5 over all three intervals. [17:27:16] 17:26:48 up 7 days, 13:18, 3 users, load average: 10.19, 9.47, 9.18 [17:27:59] * Coren attempts to divine which orifice that number was pulled out of. [17:28:33] Coren: https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1428600495.459&target=servers.labstore1001.loadavg.05 [17:28:39] Coren: certainly not graphite... [17:29:00] not sure what happened there? [17:29:12] loadavg is meaningless without cpu-count context really. even then, it's kinda dubious :) [17:29:58] you kinda know when it's really bad, but maybe as an automated check is hard to ever define a line in the sand that's sane and useful. [17:30:26] I usually think, if it's double cpu count then ask why, if it's triple look for associative failures [17:30:45] but it's really like sideways indicative of no particular thing [17:30:49] outcome wise [17:30:53] really, cpu% would be a better metric for what people want out of loadavg [17:31:04] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1195073 (10RobH) [17:31:09] I find low idle% much more predictive of actual issues. 
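Editorial note on the load average rule of thumb above ("if it's double cpu count then ask why, if it's triple look for associative failures"): a small Python sketch of that heuristic, under the assumption that the thresholds are applied to load divided by core count. The function names and the returned labels are hypothetical, and this is not the actual Icinga check.

```python
def load_per_core(load_avg, cpu_count):
    """Return the load average normalised by the number of cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg / cpu_count


def classify_load(load_avg, cpu_count):
    """Apply the 2x / 3x rule of thumb from the discussion."""
    ratio = load_per_core(load_avg, cpu_count)
    if ratio >= 3.0:
        return "look for failures"
    if ratio >= 2.0:
        return "ask why"
    return "ok"


# The labstore1001 figures quoted above: load around 10 on an 8 core box.
status = classify_load(10.19, 8)
```

With the quoted numbers the ratio is about 1.27, so the sketch reports "ok". As noted in the channel, even a normalised load average is only a rough signal, and CPU idle percentage may be the more useful thing to alert on.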
[17:31:16] same thing, inverted [17:31:20] we should / could change it :) [17:31:23] although [17:31:24] in this case [17:31:29] I’m not sure why icinga is complaining [17:31:36] load is always the bridesmaid and never the bride :) [17:31:45] since the graph says it isn’t actually over 24 [17:31:51] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [17:31:59] godog: ^ is this due to the switchover? [17:32:01] and if you have high cpu idle% and also clearly-bad high loadavg/cpu, then you've got some horrible software design problem that you probably can't address anyways. [17:32:02] * YuviPanda isn’t sure [17:32:03] YuviPanda: That may be related to the changes in graphites. [17:32:06] yeah [17:32:07] probably [17:32:12] there’s some missing data there [17:36:35] (03CR) 10Yuvipanda: [C: 032] tools: Add switchover / hot standby capability to tools services [puppet] - 10https://gerrit.wikimedia.org/r/203110 (https://phabricator.wikimedia.org/T95521) (owner: 10Yuvipanda) [17:39:52] (03CR) 10Legoktm: [C: 032 V: 032] Add SandboxLink to extension-list, deploy on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203105 (https://phabricator.wikimedia.org/T72499) (owner: 10Legoktm) [17:40:23] !log legoktm Started scap: SandboxLink deployment [17:40:29] Logged the message, Master [17:41:32] 7Puppet, 10Beta-Cluster, 6Labs: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195176 (10greg) p:5Triage>3Unbreak! [17:43:16] YuviPanda: yeah I think so [17:43:33] godog: I meant the anamolous alert rather than the icinga alerts [17:43:40] (it was supposed to alert >24, was alerting for 10) [17:47:03] gwicke: afaik there's no way to apply a graphite function transparently on retrieval [17:48:06] cmjohnson1: do you know how to enter the raid config on these new HPs? The docs I wrote say ‘ctrl-s’ but that seems to get me something else. 
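Editorial note on the timer rate discussion above (".rate is bound to the flush interval, so you'd need scaleToSeconds() to bring it back to per-second"): a minimal Python sketch of that arithmetic, assuming each datapoint counts events per flush interval. The function name `scale_to_seconds` and the 60 second interval in the usage line are illustrative assumptions, not the real statsite or Graphite configuration.

```python
def scale_to_seconds(values, step_seconds, seconds=1.0):
    """Rescale per-interval values to a 'per `seconds`' rate.

    values:       datapoints, each covering one interval (None = missing data)
    step_seconds: length of that interval in seconds (assumed flush interval)
    seconds:      target window, 1.0 for a per-second rate
    """
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    # Missing datapoints are passed through unchanged.
    return [None if v is None else v * seconds / step_seconds for v in values]


# Hypothetical example: 600 events counted in a 60 second interval.
per_second = scale_to_seconds([600, None, 1200], step_seconds=60)
```

Under these assumptions 600 events per 60 second interval become 10 per second. Since, per the channel, no function can be applied transparently on retrieval, a rescaling like this would have to be added to each dashboard query explicitly.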
[17:48:29] 6operations, 6Labs: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195235 (10Dzahn) The error message comes from mediawiki / extensions/EventLogging includes /EventLoggingHooks.php line 33 " wfDebugLog( 'EventLoggi... [17:49:30] godog: is there a way to configure statsite to preserve req/s semantics in the rates it calculates? [17:49:36] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 5Patch-For-Review, and 2 others: Create a static HTML version of Bugzilla - https://phabricator.wikimedia.org/T85140#1195244 (10gpaumier) [17:49:43] 6operations, 6Labs: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195246 (10Dzahn) So i guess either there should be EventLogging and it needs config, or there shouldn't then the extensions should be removed. [17:51:49] 6operations, 6Labs: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195268 (10Krenair) I thought MediaWiki was not supposed to be running there anymore? [17:52:58] cmjohnson1: hm, maybe found it [17:55:27] andrewbogott, am I correct in assuming that virt1000 no longer runs wikitech? [17:55:45] it's entirely served by silver now right? [17:56:12] Krenair: that’s right [17:59:31] andrewbogott, so it should not have mediawiki installed at all? [18:00:25] 6operations, 6Labs: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195332 (10Dzahn) So the receiving end of it it is configured to expect stuff from virt1000 but because mw is gone now these show up on fluorine?? 
[18:01:30] andrewbogott, I have no idea what mutante is on about there [18:01:31] (03PS2) 10Legoktm: Enable SandboxLink on all projects where it is a default gadget [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203109 (https://phabricator.wikimedia.org/T72499) [18:02:04] i dont know either, all i am saying is the error message you see in logs comes from EventLogging extension [18:02:06] andrewbogott, it appears to still be sending MediaWiki logs (i.e., running MediaWiki) even though it no longer should be as far as I know [18:02:24] mutante, yes, okay, so something on virt1000 is still running MediaWiki every minute [18:02:28] and triggering eventlogging [18:02:40] which sends off these logs to fluorine via our logging system [18:02:42] Krenair: it’s probably still draining the jobqueue, or trying to [18:02:49] I’ll have a look later on — I see the phab ticket [18:02:52] ok [18:04:12] andrewbogott: did you get mac addresses? I have them but didn't finish yet [18:04:33] finish the the dhcp cfg [18:04:36] cmjohnson1: For dhcp you men? [18:04:41] yep [18:04:54] I haven’t done anything with dhcp/dns. 
The first server labvirt1001 seems to work though [18:05:07] oh, well, maybe, I guess I haven’t installed yet [18:05:25] okay..i will fnish that now and they should install [18:05:30] cool [18:06:10] !log 18:05:48 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', 'mw1010.eqiad.wmnet', 'mw1033.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1097.eqiad.wmnet', 'mw1216.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet', 'mw2001.codfw.wmnet', 'mw2041.codfw.wmnet', 'mw2080.codfw.wmnet', 'mw2119.codfw.wmnet', 'mw2187.codfw.wmnet'] on mw2129.codfw.wmnet returned [255]: ssh: connect to host mw2129.codfw.wmnet port 22: Connection [18:06:10] timed out [18:06:16] Logged the message, Master [18:06:41] 6operations, 10ops-codfw: mw2128 not rebooting after network driver crash, blank console - https://phabricator.wikimedia.org/T95264#1195349 (10Papaul) The Dell tech came with a replacement board, replaced the old one and now the server wouldn't power on. the decision they will come with another system board,... [18:06:44] bd808: ^ should I be worried about that? [18:06:51] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0] [18:06:51] PROBLEM - Debian mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/debian is over 18 hours old. [18:09:42] legoktm: naw. That's just a failure on one of the codfw hosts. !log it and move on [18:10:09] ok [18:10:19] if their names are in the 2xxx range [18:10:21] RECOVERY - Debian mirror in sync with upstream on carbon is OK: /srv/mirrors/debian is over 0 hours old. [18:12:16] Krenair: is that better? [18:14:10] andrewbogott, yeah hasn't been anything for a few minutes now. what'd you do? 
[18:14:20] I wiped out the apache user’s crontab [18:14:33] it hasn’t been puppetized in ages but was still running latent jobs [18:16:12] 6operations, 6Labs: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195409 (10Andrew) I wipe out apache's crontab on virt1000. Did that fix everything? [18:17:04] 6operations, 7Graphite: Sitestat timer rates not stored as req/s, multiplied by 1000 - https://phabricator.wikimedia.org/T95596#1195418 (10GWicke) [18:17:10] YuviPanda: It's not quite done yet, but it's not looking so bad. Couple more projects, with generally just one outlier. [18:17:29] (03CR) 10Dzahn: "how do i add the "prod vs. labs" part (127.0.0.1 when in labs) the right way using hiera?" [puppet] - 10https://gerrit.wikimedia.org/r/201882 (owner: 10Dzahn) [18:18:14] (03CR) 10Dzahn: "i mean, should that be "contint::zuul_merger_hosts::labs" or how do we do it?" [puppet] - 10https://gerrit.wikimedia.org/r/201882 (owner: 10Dzahn) [18:19:20] Coren: nice :) [18:19:54] (03PS1) 10Cmjohnson: Adding mac address to dhcp file for labvirt1001-6 [puppet] - 10https://gerrit.wikimedia.org/r/203185 [18:19:59] 6operations, 7Graphite, 5Patch-For-Review: replace txstatsd - https://phabricator.wikimedia.org/T90111#1195444 (10GWicke) Seems to generally work well. Thank you, Filippo! One issue I noticed is {T95596}. [18:20:01] mutante: there’s hieradata/labs hierarchy [18:20:13] mutante: so you can add it to hieradata/labs/integration.yaml I think [18:20:16] 6operations, 7Graphite: Statsite timer rates not stored as req/s, multiplied by 1000 - https://phabricator.wikimedia.org/T95596#1195447 (10GWicke) [18:20:19] !log legoktm Finished scap: SandboxLink deployment (duration: 39m 55s) [18:20:23] Logged the message, Master [18:20:38] 6operations, 7Graphite: Statsite timer rates not stored as req/s, multiplied by 1000 - https://phabricator.wikimedia.org/T95596#1195410 (10GWicke) [18:21:25] YuviPanda: thanks! 
looking [18:21:44] (03CR) 10Legoktm: [C: 032] Enable SandboxLink on all projects where it is a default gadget [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203109 (https://phabricator.wikimedia.org/T72499) (owner: 10Legoktm) [18:21:55] mutante: err, hieradata/labs/integration/common.yaml [18:21:58] I meant [18:22:38] 6operations, 7Graphite: Statsite timer rates not stored as req/s, multiplied by 1000 - https://phabricator.wikimedia.org/T95596#1195466 (10GWicke) [18:22:51] while prod is in hieradata/role/common/contint.yaml ? hmmm [18:23:04] andrewbogott: still want precise? [18:23:18] mutante: yes, the contint project is called integration in labs. Not sure why :) [18:23:43] confusing [18:24:27] and role-based lookup vs. not role-based? [18:25:18] 6operations, 6Labs: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195481 (10Krenair) 5Open>3Resolved a:3Krenair Looks like it. Thanks. [18:25:32] 6operations, 6Labs: Investigate virt1000 messages implying it's still running some old wikitech stuff - https://phabricator.wikimedia.org/T95535#1195484 (10Krenair) a:5Krenair>3Andrew [18:25:47] mutante: labs has no role based lookup at all [18:25:50] because labs has no roles [18:25:52] well [18:25:53] role keywords [18:25:55] (03Merged) 10jenkins-bot: Enable SandboxLink on all projects where it is a default gadget [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203109 (https://phabricator.wikimedia.org/T72499) (owner: 10Legoktm) [18:26:27] and then it's always "but mutante, labs is different anyways":) [18:26:34] which makes me sigh a little bit [18:26:50] well, feel free to implement it [18:26:52] :P [18:27:44] MatmaRex: it's live https://en.wikipedia.org/wiki/Special:Version ! 
[18:28:03] !log legoktm Synchronized wmf-config/InitialiseSettings.php: Enable SandboxLink on all projects where it is a default gadget https://gerrit.wikimedia.org/r/203109 (duration: 01m 06s) [18:28:06] Logged the message, Master [18:28:22] (03CR) 10Cmjohnson: [C: 032] Adding mac address to dhcp file for labvirt1001-6 [puppet] - 10https://gerrit.wikimedia.org/r/203185 (owner: 10Cmjohnson) [18:28:34] !log mw2129.codfw.wmnet still timing out [18:28:37] Logged the message, Master [18:28:40] 7Puppet, 10Beta-Cluster, 6Labs: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195122 (10hashar) The puppet failure where due to the hostname of the puppetmaster changing. That causes puppetmaster self to no more recognize the master as being the master... [18:29:23] legoktm: yayyyy. running the cleanup [18:34:38] 6operations, 10ops-eqiad: dysprosium memory failure - https://phabricator.wikimedia.org/T95423#1195520 (10Cmjohnson) Order with Dell placed. [18:37:17] (03PS2) 10Aaron Schulz: Add pool counter config for Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199263 (https://phabricator.wikimedia.org/T54728) (owner: 10Nikerabbit) [18:39:27] MatmaRex: close the bug whenever you're done? :) [18:40:37] legoktm: done [18:40:57] \o/ [18:41:27] whee [18:41:58] (03CR) 10Aaron Schulz: "Values tweaked" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199263 (https://phabricator.wikimedia.org/T54728) (owner: 10Nikerabbit) [18:43:02] (03PS1) 10Mattflaschen: Enable VE for all Flow boards on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203195 [18:59:43] Nikerabbit: can you +1 https://gerrit.wikimedia.org/r/199263 if ok? 
[19:00:43] (03PS2) 10Hashar: contint: make Jessie slaves package builders [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) [19:00:45] (03PS1) 10Hashar: package_builder: use ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/203198 [19:00:59] (03PS1) 10Andrew Bogott: Assign partman config to labvirt100x [puppet] - 10https://gerrit.wikimedia.org/r/203199 [19:01:31] (03PS2) 10Dzahn: WIP - contint, move zuul_merger_hosts to hiera [puppet] - 10https://gerrit.wikimedia.org/r/201882 (https://phabricator.wikimedia.org/T87519) [19:01:51] (03CR) 10Nikerabbit: [C: 031] Add pool counter config for Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199263 (https://phabricator.wikimedia.org/T54728) (owner: 10Nikerabbit) [19:01:56] (03CR) 10Hashar: "I guess I am going to do the same on contint / zuul modules as well. Thanks for your patch was just in time to help me fix another issue " [puppet] - 10https://gerrit.wikimedia.org/r/203085 (owner: 10Ottomata) [19:01:58] (03CR) 10Andrew Bogott: [C: 032] Assign partman config to labvirt100x [puppet] - 10https://gerrit.wikimedia.org/r/203199 (owner: 10Andrew Bogott) [19:02:24] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-Android-App, and 4 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1195594 (10DarTar) @dr0ptp4kt, @deskana: are there related plans for appending an oldid parameter and reviewing the caching impl... [19:02:59] (03CR) 10Hashar: "Similar to https://gerrit.wikimedia.org/r/#/c/203085/2" [puppet] - 10https://gerrit.wikimedia.org/r/203198 (owner: 10Hashar) [19:04:02] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-Android-App, and 4 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1195610 (10Deskana) >>! In T90606#1195594, @DarTar wrote: > @dr0ptp4kt, @deskana: are there related plans for appending an oldid... 
[19:05:54] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-Android-App, and 4 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1195618 (10DarTar) @Deskana got it, I'd like to run past you some ideas about the backlink that I discussed with Moiz and a few... [19:06:26] (03CR) 10Dzahn: "i like this! i wanted to mention though while ensure_packages is puppet stdlib, there is also "require_package" by Ori in use in some plac" [puppet] - 10https://gerrit.wikimedia.org/r/203085 (owner: 10Ottomata) [19:07:00] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-Android-App, and 4 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1195619 (10Deskana) @DarTar That sounds great. Putting it in an email or a Phab task would be ideal, as I'm really overloaded wi... [19:08:11] (03PS3) 10Dzahn: contint, move zuul_merger_hosts to hiera, use in ferm [puppet] - 10https://gerrit.wikimedia.org/r/201882 (https://phabricator.wikimedia.org/T87519) [19:08:24] (03PS4) 10Dzahn: contint: move zuul_merger_hosts to hiera, use in ferm [puppet] - 10https://gerrit.wikimedia.org/r/201882 (https://phabricator.wikimedia.org/T87519) [19:08:27] ottomata: are you on top of https://phabricator.wikimedia.org/T95494 or should I treat it like a normal access request? It seems somehow different. 
[19:10:50] (03PS1) 10Andrew Bogott: Install Trusty on the new labvirt boxes [puppet] - 10https://gerrit.wikimedia.org/r/203201 [19:11:10] (03CR) 10Aaron Schulz: [C: 032] Add pool counter config for Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199263 (https://phabricator.wikimedia.org/T54728) (owner: 10Nikerabbit) [19:11:19] (03Merged) 10jenkins-bot: Add pool counter config for Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199263 (https://phabricator.wikimedia.org/T54728) (owner: 10Nikerabbit) [19:12:24] (03CR) 10Andrew Bogott: [C: 032] Install Trusty on the new labvirt boxes [puppet] - 10https://gerrit.wikimedia.org/r/203201 (owner: 10Andrew Bogott) [19:12:39] (03CR) 10Dzahn: "thanks to YuviPanda for pointing me to hieradata/labs/integration/common.yaml for the labs part. it's a bit confusing that it's "contint" " [puppet] - 10https://gerrit.wikimedia.org/r/201882 (https://phabricator.wikimedia.org/T87519) (owner: 10Dzahn) [19:13:07] (03PS1) 10Hashar: contint: 'zip' package via ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/203203 [19:14:11] (03CR) 10Dzahn: "@YuviPanda for it to find hieradata/role/common/contint.yaml we would have to edit gallium in site.pp including the roles, amirite" [puppet] - 10https://gerrit.wikimedia.org/r/201882 (https://phabricator.wikimedia.org/T87519) (owner: 10Dzahn) [19:14:13] !log aaron Synchronized wmf-config/PoolCounterSettings-common.php: Add pool counter config for Translate (duration: 01m 11s) [19:14:16] Logged the message, Master [19:15:20] (03PS3) 10Hashar: contint: make Jessie slaves package builders [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) [19:15:22] (03PS2) 10Hashar: contint: 'zip' package via ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/203203 [19:17:24] (03PS5) 10Dzahn: contint: move zuul_merger_hosts to hiera, use in ferm [puppet] - 10https://gerrit.wikimedia.org/r/201882 
(https://phabricator.wikimedia.org/T87519) [19:18:14] (03CR) 10Alex Monk: "Looks like ilowiki was the one you modified yourself?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201915 (https://phabricator.wikimedia.org/T37337) (owner: 10Mxn) [19:20:06] (03CR) 10Alex Monk: "Also, I ran `mwgrep min-device-pixel-ratio | grep Common.css` and removed the entries already addressed here:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201915 (https://phabricator.wikimedia.org/T37337) (owner: 10Mxn) [19:23:35] (03PS1) 10Ori.livneh: rrd-navtiming: simplify [puppet] - 10https://gerrit.wikimedia.org/r/203206 [19:23:46] (03CR) 10jenkins-bot: [V: 04-1] rrd-navtiming: simplify [puppet] - 10https://gerrit.wikimedia.org/r/203206 (owner: 10Ori.livneh) [19:23:51] (03PS2) 10Ori.livneh: rrd-navtiming: simplify [puppet] - 10https://gerrit.wikimedia.org/r/203206 [19:24:19] (03CR) 10Ori.livneh: [C: 032 V: 032] rrd-navtiming: simplify [puppet] - 10https://gerrit.wikimedia.org/r/203206 (owner: 10Ori.livneh) [19:24:49] 7Puppet, 10Beta-Cluster, 6Labs: Puppet failing with certificate errors on deployment-prep - https://phabricator.wikimedia.org/T95586#1195675 (10hashar) 5Open>3Resolved a:3hashar Ok solved! That was the exact same issue as on integration and staging project. Changing the hostname cause the puppetmaster... [19:26:09] (03CR) 10Hashar: [C: 031 V: 032] "Cherry picked on integration puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/203198 (owner: 10Hashar) [19:26:18] (03CR) 10Hashar: "Cherry picked on integration puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/203203 (owner: 10Hashar) [19:28:58] !log aaron Synchronized php-1.25wmf24/extensions/AbuseFilter: 4b03cec4574aaece27879e408d545ce7ea0fa2ce (duration: 01m 06s) [19:29:05] Logged the message, Master [19:32:07] andrewbogott: i'm on top of that [19:32:10] sent an email out [19:32:11] to jody [19:32:35] ottomata: thanks [19:32:46] (03CR) 10Hashar: "Cherry picked on integration puppetmaster. 
It is too late to verify what happens but stuff is being created on" [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [19:33:53] (03CR) 10Dzahn: "consider this part of the "kill network.pp" task, 10 lines at a time" [puppet] - 10https://gerrit.wikimedia.org/r/201882 (https://phabricator.wikimedia.org/T87519) (owner: 10Dzahn) [19:37:59] (03PS4) 10Hashar: contint: make Jessie slaves package builders [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) [19:39:04] chasemp, YuviPanda, yet another hiera question. [19:39:12] I just set up labvirt1001 and ran puppet for the first time... [19:39:14] and it installed nova. [19:39:18] Why? It has no entry in site.pp [19:39:25] So, I should think, no roles. [19:40:08] ah, nm, I see why [19:40:11] regexp mistake [19:41:34] (03PS1) 10Andrew Bogott: Better qualify the virt10xx node definitions. [puppet] - 10https://gerrit.wikimedia.org/r/203209 [19:41:46] andrewbogott: :) [19:43:32] (03CR) 10Andrew Bogott: [C: 032] Better qualify the virt10xx node definitions. [puppet] - 10https://gerrit.wikimedia.org/r/203209 (owner: 10Andrew Bogott) [19:45:41] PROBLEM - configured eth on labvirt1001 is CRITICAL: eth1 reporting no carrier. [19:46:02] PROBLEM - nova-compute process on labvirt1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [19:46:21] PROBLEM - puppet last run on labvirt1001 is CRITICAL Puppet has 2 failures [19:47:23] ACKNOWLEDGEMENT - configured eth on labvirt1001 is CRITICAL: eth1 reporting no carrier. 
andrew bogott This server isnt ready for prime time yet [19:47:23] ACKNOWLEDGEMENT - nova-compute process on labvirt1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute andrew bogott This server isnt ready for prime time yet [19:47:23] ACKNOWLEDGEMENT - puppet last run on labvirt1001 is CRITICAL Puppet has 2 failures andrew bogott This server isnt ready for prime time yet [19:52:33] (03PS1) 10GWicke: Switch restbase from txstatsd to statsd backend [puppet] - 10https://gerrit.wikimedia.org/r/203210 [19:56:20] ^demon|away, manybubbles: we had a bunch of pool-queuefull errors in hhvm.log a few minutes ago [19:57:45] Krenair: also looks like a poolcounter server timedout [19:58:12] RECOVERY - puppet last run on labvirt1001 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [19:58:32] ah, no that is old [20:00:46] hhmmm - not sure what to say - we used to see similar things with lsearchd and I was hoping they were gone but no. [20:01:12] should file a bug and look at them but unless its stuck in spew mode all the time we kind of expect a _few_ if we get hammered quickly [20:02:35] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1195808 (10GWicke) [20:03:00] manybubbles, I wonder if icinga should be saying something when such limits are reached [20:10:25] 6operations, 10Beta-Cluster, 6Labs: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1195824 (10Dzahn) 5Open>3Resolved a:3Dzahn fixed with method 2: ``` # apt-get clean # cd /var/lib/apt # mv lists lists.old # mkdir -p... [20:12:40] Krenair: dunno - I don't think its a huge issue - its not causing people to complain and its not new. 
So I've filed T95610 and will let it be [20:12:53] ok, thanks [20:16:08] (03PS5) 10Hashar: contint: make Jessie slaves package builders [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) [20:16:19] 6operations, 10Beta-Cluster, 6Labs: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1195841 (10Dzahn) root@deployment-bastion:~# apt-key list | grep -B1 ftpmaster pub 1024D/437D05B5 2004-09-12 uid Ubuntu Ar... [20:18:38] 6operations, 7Graphite: Statsite timer rates not stored as req/s, multiplied by 1000 - https://phabricator.wikimedia.org/T95596#1195843 (10GWicke) It turns out that statsite [changes the semantics of the `.rate` metric to be about the value and //not// the number of samples per time unit](https://github.com/ar... [20:19:00] ori: https://phabricator.wikimedia.org/T95596#1195843 [20:19:22] (03CR) 10Hashar: "So I have sneaked a before: to ensure that I have the proper symlink:" [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [20:21:12] 6operations, 10Beta-Cluster, 6Labs: GPG error: http://nova.clouds.archive.ubuntu.com precise Release BADSIG 40976EAF437D05B5 - https://phabricator.wikimedia.org/T95541#1195853 (10hashar) Thanks a ton @dzahn for the fix, the reference and the detailed step by step instructions! 
[20:22:24] 6operations, 7Graphite: Urgent: Statsite timer changes semantics of timer rate metrics, need metric rename - https://phabricator.wikimedia.org/T95596#1195858 (10GWicke) [20:22:43] 6operations, 7Graphite: Urgent: Statsite changes semantics of timer rate metrics, need metric rename - https://phabricator.wikimedia.org/T95596#1195410 (10GWicke) [20:28:23] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [20:29:50] (03PS1) 10Dzahn: remove rbf hosts from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/203216 (https://phabricator.wikimedia.org/T95153) [20:32:22] (03PS1) 10Dzahn: delete rbf hosts from hiera [puppet] - 10https://gerrit.wikimedia.org/r/203217 (https://phabricator.wikimedia.org/T95153) [20:34:35] (03PS2) 10Dzahn: delete rbf hosts from DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/203216 (https://phabricator.wikimedia.org/T95153) [20:36:51] (03CR) 10Tim Landscheidt: "AFAICS, this does not /ensure/ that only one service daemon is running which might happen when tools-services-01 is rebooted with $active_" [puppet] - 10https://gerrit.wikimedia.org/r/203110 (https://phabricator.wikimedia.org/T95521) (owner: 10Yuvipanda) [20:39:26] 6operations, 10Continuous-Integration: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1195915 (10hashar) Thanks @faidon for the preliminary investigation. Should I fill subtasks for the 5 points you mentioned? It seems that each will reach out to di... [20:39:36] (03CR) 10Yuvipanda: "Instances are cheap when they aren't actually doing much work, so I think it is ok to have a separate host. 
We could re image them to be s" [puppet] - 10https://gerrit.wikimedia.org/r/203110 (https://phabricator.wikimedia.org/T95521) (owner: 10Yuvipanda) [20:41:17] 6operations, 10Continuous-Integration: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1195919 (10hashar) [20:41:52] 6operations, 10Continuous-Integration: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1177707 (10hashar) [20:47:19] 6operations, 7Graphite: Urgent: Statsite changes semantics of timer rate metrics, need metric rename - https://phabricator.wikimedia.org/T95596#1195944 (10GWicke) https://phabricator.wikimedia.org/T90111#1153830 has the rename script for the original migration. Maybe something like this could work? ``` # tes... [20:58:32] 6operations, 10ops-fundraising: setup/install/deploy beryllium as frack authentication server - https://phabricator.wikimedia.org/T95617#1195980 (10RobH) 3NEW [20:59:14] 6operations, 10ops-fundraising: setup/install/deploy betelgeuse as frack authentication server - https://phabricator.wikimedia.org/T95618#1195990 (10RobH) 3NEW [20:59:27] 6operations, 10ops-fundraising: setup/install/deploy betelgeuse as frack authentication server - https://phabricator.wikimedia.org/T95618#1195990 (10RobH) [20:59:41] 6operations, 10ops-fundraising: setup/install/deploy beryllium as frack authentication server - https://phabricator.wikimedia.org/T95617#1195999 (10RobH) [21:00:25] 6operations, 10ops-fundraising: setup/install/deploy beryllium as frack authentication server - https://phabricator.wikimedia.org/T95617#1195980 (10RobH) [21:00:28] 6operations, 10ops-fundraising: setup/install/deploy betelgeuse as frack authentication server - https://phabricator.wikimedia.org/T95618#1195990 (10RobH) [21:00:53] 6operations, 10ops-fundraising: setup/install/deploy beryllium as frack authentication server - https://phabricator.wikimedia.org/T95617#1195980 (10RobH) 
[21:00:56] 6operations, 10ops-fundraising: setup/install/deploy betelgeuse as frack authentication server - https://phabricator.wikimedia.org/T95618#1195990 (10RobH) [21:01:08] i hate you bot [21:01:10] i hate you so much. [21:01:28] each of those is a goddamn link, and you have to echo them all one at a time... [21:01:29] your falut for having the same phab nick as irc nick [21:01:38] i just think its spammy to channel [21:01:42] :) [21:01:52] no one cares i just made a giant dependency chain of tasks (well, im 25% done) [21:01:53] heh [21:02:17] each one of those is goign to get another 2 sub-tasks =P [21:09:08] (03PS1) 10Hashar: package_builder: fix dependency order for hooks [puppet] - 10https://gerrit.wikimedia.org/r/203228 [21:10:46] (03CR) 10Hashar: [C: 031 V: 032] "Applied on integration puppet master. That fixed the hook installation on integration-slave-jessie-1001.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/203228 (owner: 10Hashar) [21:12:21] (03CR) 10Tim Landscheidt: [C: 04-1] "This doesn't work for me:" [puppet] - 10https://gerrit.wikimedia.org/r/203062 (owner: 10Hashar) [21:13:59] (03CR) 10Andrew Bogott: "The current behavior is used to set puppet status on wikitech. I'm not sure whether or not this change will break things... probably not," [puppet] - 10https://gerrit.wikimedia.org/r/203062 (owner: 10Hashar) [21:24:15] greg-g: Sorry for doing this late in the day but, I've added myself to the 4pm slot for MF and Gather. [21:25:55] 10Ops-Access-Requests, 6operations: Requesting stat1003 access for mholloway - https://phabricator.wikimedia.org/T95506#1196091 (10RobH) @Mholloway: Please note that you must read/acknowledge/sign the document here: https://phabricator.wikimedia.org/L3 Please advise on this task once you have done so, thanks! 
[21:27:51] 6operations, 10ops-eqiad, 6Labs: labvirt1004 has a failed disk 1789-Slot 0 Drive Array Disk Drive(s) Not Responding Check cables or replace the following drive(s): Port 1I: Box 1: Bay 1 - https://phabricator.wikimedia.org/T95622#1196095 (10Andrew) 3NEW a:3Cmjohnson [21:28:17] andrewbogott: ugh! [21:28:46] cmjohnson1: it’s possible I power-cycled at an inopportune time and it’s a false alarm. But I can’t delete the existing partitions... [21:28:59] that could be a bug in the serial console or it could be because of a bad drive [21:29:07] well you can blow out the raid cfg and start over [21:29:19] if it can wait till tomorrow it's easier for me to do on-site [21:30:36] cmjohnson1: I can’t blow out the raid config [21:30:40] rmoen: no worries, many people add themselves right before it :) [21:30:48] When I try to delete a raid, it says to press f3 to confirm but ignores my keypress [21:30:59] cmjohnson1: it can definitely wait until tomorrow :) [21:32:31] cool [21:37:34] PROBLEM - Host labvirt1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:54] PROBLEM - Host labvirt1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:54] PROBLEM - Host labvirt1006 is DOWN: PING CRITICAL - Packet loss = 100% [21:38:55] PROBLEM - Host labvirt1005 is DOWN: PING CRITICAL - Packet loss = 100% [21:39:03] PROBLEM - Host labvirt1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:39:12] ^ just me rebooting things [21:40:09] …things that take aaaaages to boot [21:41:54] RECOVERY - Host labvirt1002 is UPING OK - Packet loss = 0%, RTA = 2.04 ms [21:41:54] RECOVERY - Host labvirt1003 is UPING OK - Packet loss = 0%, RTA = 1.40 ms [21:42:05] RECOVERY - Host labvirt1005 is UPING OK - Packet loss = 0%, RTA = 2.26 ms [21:42:14] RECOVERY - Host labvirt1006 is UPING OK - Packet loss = 0%, RTA = 1.32 ms [21:42:15] (03CR) 10Tim Landscheidt: "I tested:" [puppet] - 10https://gerrit.wikimedia.org/r/203062 (owner: 10Hashar) [21:43:24] RECOVERY - Host labvirt1001 is UPING OK - Packet 
loss = 0%, RTA = 4.35 ms [21:46:54] PROBLEM - Disk space on labvirt1001 is CRITICAL: Connection refused by host [21:47:03] (03CR) 10BBlack: "Aside from inlines: this is a good start, but I'd really like to see this monitor also correlate the output of "ip xfrm policy" to confirm" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199787 (owner: 10Gage) [21:47:03] PROBLEM - puppet last run on labvirt1001 is CRITICAL: Connection refused by host [21:47:14] PROBLEM - DPKG on labvirt1001 is CRITICAL: Connection refused by host [21:47:33] PROBLEM - salt-minion processes on labvirt1001 is CRITICAL: Connection refused by host [21:47:43] PROBLEM - RAID on labvirt1001 is CRITICAL: Connection refused by host [21:47:44] PROBLEM - SSH on labvirt1001 is CRITICAL: Connection refused [21:48:04] PROBLEM - dhclient process on labvirt1001 is CRITICAL: Connection refused by host [21:50:53] !log esams cache role migrations starting up soon for the evening... [21:50:57] Logged the message, Master [21:51:04] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1196148 (10RobH) So, the system is powered off (confirmed in drac) but the port shows up. This could be simply the NIC has enough power to register, or as complicated as some... [21:53:39] (03CR) 10Yuvipanda: "I'm re-creating tools-services-02 as a small instance, and will recreate -01 too. 
The switchover will allow me to test our switchover proc" [puppet] - 10https://gerrit.wikimedia.org/r/203110 (https://phabricator.wikimedia.org/T95521) (owner: 10Yuvipanda) [21:54:29] (03PS1) 10BBlack: T86663 5.1: pool 3036; depool 3003,amssq5[12] [puppet] - 10https://gerrit.wikimedia.org/r/203231 [21:55:04] (03CR) 10BBlack: [C: 032 V: 032] T86663 5.1: pool 3036; depool 3003,amssq5[12] [puppet] - 10https://gerrit.wikimedia.org/r/203231 (owner: 10BBlack) [22:00:40] 6operations, 7Graphite, 5Patch-For-Review: Counts in Cassandra metrics no longer updated - https://phabricator.wikimedia.org/T95627#1196185 (10GWicke) 3NEW a:3fgiunchedi [22:01:15] 6operations, 7Graphite: Counts in Cassandra metrics no longer updated - https://phabricator.wikimedia.org/T95627#1196185 (10GWicke) [22:04:37] (03PS1) 10BBlack: T86663 5.1: switch cp3003 role [puppet] - 10https://gerrit.wikimedia.org/r/203233 [22:04:43] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0] [22:06:56] (03PS2) 10BBlack: T86663 5.1: switch cp3003 role [puppet] - 10https://gerrit.wikimedia.org/r/203233 [22:07:43] (03CR) 10BBlack: [C: 032 V: 032] T86663 5.1: switch cp3003 role [puppet] - 10https://gerrit.wikimedia.org/r/203233 (owner: 10BBlack) [22:11:44] PROBLEM - puppet last run on db2038 is CRITICAL puppet fail [22:12:15] PROBLEM - Host labvirt1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:12:44] PROBLEM - Varnish HTTP upload-frontend on cp3003 is CRITICAL: Connection refused [22:13:13] PROBLEM - Varnishkafka log producer on cp3003 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [22:13:37] (03PS1) 10BBlack: T86663 5.1: repool cp3003 [puppet] - 10https://gerrit.wikimedia.org/r/203235 [22:13:52] (03CR) 10BBlack: [C: 032 V: 032] T86663 5.1: repool cp3003 [puppet] - 10https://gerrit.wikimedia.org/r/203235 (owner: 10BBlack) [22:14:33] RECOVERY - Varnish HTTP upload-frontend on cp3003 is OK: HTTP OK: HTTP/1.1 200 OK - 284 bytes in 0.181 
second response time
[22:14:54] RECOVERY - Varnishkafka log producer on cp3003 is OK: PROCS OK: 1 process with command name varnishkafka
[22:18:09] (PS1) BBlack: T86663 5.1: pool 3037; depool 3004,amssq5[34] [puppet] - https://gerrit.wikimedia.org/r/203238
[22:18:41] (CR) BBlack: [C: 2 V: 2] T86663 5.1: pool 3037; depool 3004,amssq5[34] [puppet] - https://gerrit.wikimedia.org/r/203238 (owner: BBlack)
[22:20:34] sloppy errors all over today
[22:20:37] * bblack needs more coffee
[22:23:30] (Abandoned) Se4598: Set $wgTranslationNotificationsAlwaysHttpsInEmail to true [mediawiki-config] - https://gerrit.wikimedia.org/r/190203 (owner: Se4598)
[22:23:58] (PS1) BBlack: T86663 5.2: switch cp3004 role [puppet] - https://gerrit.wikimedia.org/r/203239
[22:24:26] (CR) BBlack: [C: 2 V: 2] T86663 5.2: switch cp3004 role [puppet] - https://gerrit.wikimedia.org/r/203239 (owner: BBlack)
[22:28:06] RECOVERY - puppet last run on db2038 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[22:28:07] (PS1) BBlack: T86663 5.2: repool cp3004 [puppet] - https://gerrit.wikimedia.org/r/203240
[22:28:23] operations, Labs, hardware-requests: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1196290 (RobH) So, I think that one of the old lsearch servers would work for this: wmf3152 Dell PowerEdge R410, Dual Intel Xeon X5650 (2.66 GHz), 48GB Memory, (2) 150GB Dis...
[22:28:34] (CR) BBlack: [C: 2 V: 2] T86663 5.2: repool cp3004 [puppet] - https://gerrit.wikimedia.org/r/203240 (owner: BBlack)
[22:28:41] operations, Labs, hardware-requests: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1196293 (RobH) a:RobH>Cmjohnson Advise, and assign back to me, thanks!
[22:36:17] operations, Graphite: Urgent: Statsite changes semantics of timer rate metrics, need metric rename - https://phabricator.wikimedia.org/T95596#1196329 (Gage) Discussed in IRC; ran this at approximately UTC 22:32: ``` sudo stop carbon/relay sudo stop carbon/cache NAME=a sudo stop carbon/cache NAME=b sudo sto...
[22:36:36] ACKNOWLEDGEMENT - DPKG on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott re-imaging
[22:36:36] ACKNOWLEDGEMENT - Disk space on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott re-imaging
[22:36:36] ACKNOWLEDGEMENT - NTP on labvirt1001 is CRITICAL: NTP CRITICAL: No response from NTP server andrew bogott re-imaging
[22:36:36] ACKNOWLEDGEMENT - RAID on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott re-imaging
[22:36:36] ACKNOWLEDGEMENT - SSH on labvirt1001 is CRITICAL: Connection timed out andrew bogott re-imaging
[22:36:37] ACKNOWLEDGEMENT - configured eth on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott re-imaging
[22:36:37] ACKNOWLEDGEMENT - dhclient process on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott re-imaging
[22:36:38] ACKNOWLEDGEMENT - puppet last run on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott re-imaging
[22:36:38] ACKNOWLEDGEMENT - salt-minion processes on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott re-imaging
[22:40:58] jouncebot, next
[22:40:58] In 0 hour(s) and 19 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150409T2300)
[22:41:02] rmoen, umm
[22:41:07] you list some commits
[22:41:10] but don't actually link to them
[22:42:10] Krenair: Just about the cut the branches now. I'm planning on doing the deployment
[22:42:15] I can't find anything under "project:mediawiki/core branch:wmf/1.25wmf24 status:open"
[22:42:23] the swat deployment?
[22:42:28] Krenair: yes
[22:42:39] you're... not on the swat team
[22:42:44] but if greg-g says it's okay, I guess...
[22:43:00] Krenair: not officially but I'm planning on being
[22:43:03] Why are you updating those extensions to master rather than backporting the relevant commits?
[22:43:41] Krenair: It was a team decision.
[22:43:54] go on...
[22:44:12] Krenair: due to the amount of cherry picks needed
[22:45:26] needed to achieve what?
[22:46:07] Krenair: Gather is heavily dependent on MF
[22:46:27] There are a lot of dependencies. Some perhaps hard to identify
[22:46:29] oh so some gather changes can rely on new MF features?
[22:46:32] urgh
[22:46:36] yep
[22:46:45] that's only temporary.
[22:48:21] * jdlrobson checks if we are depending on any MobileFrontend changes
[22:50:39] I assumed rmoen already did that?
[22:51:23] well we are..but i'm wondering if it's easier to cherry pick them rather than deploy master.
[22:52:18] deploying master is certainly not encouraged if you can avoid it
[22:52:28] jdlrobson, Krenair: I know of a few, but rather than miss one thought it might be safer to update to master.
[22:53:17] Krenair: rmoen yeh so from a quick glance https://gerrit.wikimedia.org/r/#/c/202052/ might be the only dependency in MobileFrontend that if not deployed will break the feature but there may be subtle issues we miss
[22:53:21] RECOVERY - SSH on labvirt1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[22:53:31] RECOVERY - Host labvirt1001 is UPING OK - Packet loss = 0%, RTA = 1.50 ms
[22:57:31] PROBLEM - Host labvirt1001 is DOWN: PING CRITICAL - Packet loss = 100%
[23:00:05] RoanKattouw, ^d, Krenair, rmoen, superm401: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150409T2300). Please do the needful.
[23:00:27] superm401, ping?
[23:00:40] Krenair: thats me today
[23:00:51] matt's going to introduce our new developer to a few people around the office
[23:00:57] superm401 is listed... ok
[23:01:06] rmoen is running this today, apparently. I have no idea whether greg has signed off on this or not.
[23:01:22] Krenair: greg-g has i believe
[23:01:32] Krenair: i could edit the page if it makes you happy :)
[23:01:44] we just decided this a minute ago
[23:01:45] nah, anyone from the Flow team is fine
[23:02:07] Krenair: rmoen can join the SWAT group :)
[23:02:21] RECOVERY - Host labvirt1001 is UPING OK - Packet loss = 0%, RTA = 2.18 ms
[23:02:22] * Krenair is just mildly irritated by last-minute-oops-not-on-the-page-but-lets-do-it-anyway
[23:03:31] operations, Graphite: Counts with underscore in name no longer updated since move to statsite (cassandra metrics) - https://phabricator.wikimedia.org/T95627#1196380 (GWicke)
[23:04:19] I have a private patch to deploy after rmoen is done.
[23:06:21] PROBLEM - Host labvirt1001 is DOWN: PING CRITICAL - Packet loss = 100%
[23:06:51] Connection resets always strike at the worst times :/
[23:09:20] RECOVERY - Host labvirt1001 is UPING OK - Packet loss = 0%, RTA = 1.55 ms
[23:13:10] Everything okay, rmoen?
[23:13:30] Krenair: Yes walking through this with kaldari
[23:13:38] okay :)
[23:13:40] creating cherry pick for MF
[23:31:11] RECOVERY - Disk space on labvirt1001 is OK: DISK OK
[23:31:55] (PS1) Bmansurov: Enable CirrusSearch event logging in production [mediawiki-config] - https://gerrit.wikimedia.org/r/203250
[23:34:10] PROBLEM - Host labvirt1001 is DOWN: PING CRITICAL - Packet loss = 100%
[23:35:08] (CR) Dr0ptp4kt: [C: 1] Enable CirrusSearch event logging in production [mediawiki-config] - https://gerrit.wikimedia.org/r/203250 (owner: Bmansurov)
[23:36:12] ACKNOWLEDGEMENT - DPKG on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott This server is terrible
[23:36:12] ACKNOWLEDGEMENT - Disk space on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott This server is terrible
[23:36:12] ACKNOWLEDGEMENT - NTP on labvirt1001 is CRITICAL: NTP CRITICAL: No response from NTP server andrew bogott This server is terrible
[23:36:13] ACKNOWLEDGEMENT - RAID on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott This server is terrible
[23:36:13] ACKNOWLEDGEMENT - SSH on labvirt1001 is CRITICAL: Connection timed out andrew bogott This server is terrible
[23:36:13] ACKNOWLEDGEMENT - configured eth on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott This server is terrible
[23:36:13] ACKNOWLEDGEMENT - dhclient process on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott This server is terrible
[23:36:14] ACKNOWLEDGEMENT - puppet last run on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott This server is terrible
[23:36:14] ACKNOWLEDGEMENT - salt-minion processes on labvirt1001 is CRITICAL: Timeout while attempting connection andrew bogott This server is terrible
[23:36:30] RECOVERY - Host labvirt1001 is UPING OK - Packet loss = 0%, RTA = 1.89 ms
[23:37:00] RECOVERY - configured eth on labvirt1001 is OK - interfaces up
[23:37:11] RECOVERY - RAID on labvirt1001 is OK no RAID installed
[23:37:32] RECOVERY - dhclient process on labvirt1001 is OK: PROCS OK: 0 processes with command name dhclient
[23:37:50] RECOVERY - puppet last run on labvirt1001 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[23:37:51] RECOVERY - DPKG on labvirt1001 is OK: All packages OK
[23:47:35] You sure everything is okay rmoen? :/
[23:47:43] The Flow patch was listed first...
[23:49:28] Ops-Access-Requests, operations: Requesting stat1003 access for mholloway - https://phabricator.wikimedia.org/T95506#1196470 (Mholloway) @RobH: Read, acknowledged, and signed. Thank you!
[23:49:39] Krenair: go ahead. we can let you know before syncing things
[23:50:28] (PS2) Alex Monk: Enable VE for all Flow boards on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/203195 (owner: Mattflaschen)
[23:50:35] (CR) Alex Monk: [C: 2] Enable VE for all Flow boards on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/203195 (owner: Mattflaschen)
[23:50:40] (Merged) jenkins-bot: Enable VE for all Flow boards on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/203195 (owner: Mattflaschen)
[23:51:21] RECOVERY - salt-minion processes on labvirt1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:52:02] ebernhardson, please check
[23:52:11] 1 host is stuck, probably codfw misbehaving again
[23:52:14] but otherwise it's synced
[23:52:55] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/203195/ (duration: 01m 11s)
[23:53:01] Logged the message, Master
[23:53:10] !log ssh to mw2129.codfw.wmnet still timing out
[23:53:13] Logged the message, Master
[23:53:42] Krenair: thanks testing
[23:56:03] Krenair: excellent, works great. thanks1