[00:00:05] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160505T0000). Please do the needful. [00:00:06] can I get anyone to merge a puppet patch? https://gerrit.wikimedia.org/r/#/c/287021/ simple phabricator reconfig [00:00:25] aude: Lemme know when you're done so I can sync a no-op config patch [00:02:08] how do i update wikiversions.php ? [00:02:17] and then i do sync-wikiversions ? [00:02:57] or does sync-wikiversions do that? [00:04:05] * aude looks to see what sync-wikiversions does [00:04:42] !log taking phabricator offline for maintenance [00:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:05:19] twentyafterfour: do you know? [00:05:37] Rebuild and sync wikiversions.php to the cluster [00:06:05] got it [00:06:08] !log aude@tin rebuilt wikiversions.php and synchronized wikiversions files: Put wikidata on wmf/1.27.0-wmf.23 [00:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:07:07] RoanKattouw: done [00:07:12] wikidata looks good now [00:07:54] :) [00:08:31] Cool, I'll do my no-ops now [00:08:36] (03PS4) 10Dzahn: acme-setup: only accept ASCII letters as unique cert ID [puppet] - 10https://gerrit.wikimedia.org/r/287032 (https://phabricator.wikimedia.org/T134447) [00:09:31] (03CR) 10BBlack: [C: 04-1] "Excellent start! But yes, we'll probably need it to accept letters, digits, and some basic separators that are ok inside filenames, like [" [puppet] - 10https://gerrit.wikimedia.org/r/287032 (https://phabricator.wikimedia.org/T134447) (owner: 10Dzahn) [00:09:44] phabricator is down? [00:09:54] (03CR) 10Catrope: [C: 032] Remove useless Echo footer notice overrides in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287033 (owner: 10Catrope) [00:10:07] (03CR) 10Dzahn: "just amended to accept digits too" [puppet] - 10https://gerrit.wikimedia.org/r/287032 (https://phabricator.wikimedia.org/T134447) (owner: 10Dzahn) [00:10:09] (03CR) 10Catrope: [C: 032] Enable cross-wiki notifications by default in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287034 (https://phabricator.wikimedia.org/T130655) (owner: 10Catrope) [00:10:15] mutante: whoops I was still typing that when PS4 hit :) [00:10:17] aude: maintenance by twentyafterfour [00:10:37] but either way, it needs to accept ids like "apt_wikimedia_org" or foocert-3 or whatever, too [00:10:39] mutante: ok [00:10:42] (03CR) 10jenkins-bot: [V: 04-1] Enable cross-wiki notifications by default in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287034 (https://phabricator.wikimedia.org/T130655) (owner: 10Catrope) [00:10:58] (03PS2) 10Catrope: Remove useless Echo footer notice overrides in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287033 [00:11:00] bblack: ah, right! 
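Aside on the wikiversions exchange above: wikiversions.json on the deployment host maps each wiki dbname to a MediaWiki branch, and scap's sync-wikiversions rebuilds the compiled wikiversions.php from it and syncs the result to the cluster, which is what aude's !log entry records. A minimal Python sketch of just the rebuild step; the paths and output format are illustrative stand-ins, not scap's actual implementation.

#!/usr/bin/env python
"""Sketch: compile a wikiversions.php-style map from wikiversions.json.
Illustrative only; the real rebuild and sync are done by scap's
sync-wikiversions on the deployment host."""
import json


def rebuild(json_path="wikiversions.json", php_path="wikiversions.php"):
    # wikiversions.json maps a dbname to a branch directory,
    # e.g. "wikidatawiki": "php-1.27.0-wmf.23"
    with open(json_path) as f:
        versions = json.load(f)

    lines = ["<?php", "return array("]
    for dbname, branch in sorted(versions.items()):
        lines.append("\t'%s' => '%s'," % (dbname, branch))
    lines.append(");")

    with open(php_path, "w") as f:
        f.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    rebuild()
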
ok [00:11:06] (03CR) 10Catrope: [C: 032] Remove useless Echo footer notice overrides in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287033 (owner: 10Catrope) [00:11:19] (03PS2) 10Catrope: Enable cross-wiki notifications by default in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287034 (https://phabricator.wikimedia.org/T130655) [00:11:27] (03CR) 10Catrope: [C: 032] Enable cross-wiki notifications by default in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287034 (https://phabricator.wikimedia.org/T130655) (owner: 10Catrope) [00:12:56] (03Merged) 10jenkins-bot: Remove useless Echo footer notice overrides in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287033 (owner: 10Catrope) [00:13:03] (03Merged) 10jenkins-bot: Enable cross-wiki notifications by default in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287034 (https://phabricator.wikimedia.org/T130655) (owner: 10Catrope) [00:16:47] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Plumbing for wmgEchoCrossWikiByDefault (unused for now) (duration: 00m 26s) [00:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:18:11] (03CR) 10CSteipp: [C: 031] Amend imagemagick policy to also include the URL decoder [puppet] - 10https://gerrit.wikimedia.org/r/286790 (owner: 10Muehlenhoff) [00:21:28] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Plumbing for wmgEchoCrossWikiByDefault (unused for now) (duration: 00m 24s) [00:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:23:25] RECOVERY - puppet last run on db2067 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [00:23:45] Request from 90.180.83.194 via cp1045 cp1045, Varnish XID 2053957651 [00:23:49] Error: 503, Service Unavailable at Thu, 05 May 2016 00:23:27 GMT [00:25:08] ? [00:25:18] trying to load phabricator [00:25:28] yup [00:25:45] 17:12 < twentyafterfour> Taking phabricator offline for maintenance [00:25:50] from -devtools [00:25:56] Danny_B: [00:26:11] ah [00:26:23] a log in here would be nice too heh [00:26:31] yup [00:26:39] I guess it must be downtimed, no icinga spam [00:26:59] bblack Danny_B !log taking phabricator offline for maintenance [00:27:10] and we're _again_ hitting the necessity of having the "this site is having maintenance now" message instead of error [00:27:23] !log phabricator upgrade complete [00:27:25] it is, for 1 hour [00:27:25] From 01:04:38. [00:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:27:54] changing /topic would be also handy... [00:27:59] oh up at 00:04 [00:28:01] sorry! [00:28:11] true, but also can't have both, no icinga spam and notifications [00:28:42] twentyafterfour: You can now view https://phabricator.wikimedia.org/diffusion/ENOR/manage/basics/ [00:28:47] can I get someone to merge https://gerrit.wikimedia.org/r/#/c/287021/ so that I can turn puppet and icinga back on? [00:28:51] Users carn't edit. [00:28:55] But we can now view. [00:29:00] Danny_B: what do you mean about the message/error? [00:29:08] The link is still hiden. [00:29:42] (03CR) 10BBlack: [C: 032] librarize phab extensions repo [puppet] - 10https://gerrit.wikimedia.org/r/287021 (https://phabricator.wikimedia.org/T128797) (owner: 1020after4) [00:30:07] twentyafterfour: it's merged [00:30:17] bblack: thank you! [00:30:23] bblack: we have discussed it already at least once. 
that if some site is taken off intentionally, there should be "maintenance running" message when trying to load such site instead of error 503 or 404 or whatever... [00:30:45] Danny_B: that was patched in recently and used for stat1001 maintenance already [00:31:08] ??? [00:31:12] * Danny_B is confused [00:31:14] just people aren't aware of the mechanism, which is new, and not the best (it's a puppet commit) [00:31:17] phab returned 503 [00:31:24] It shows Include of '/srv/phab/libext/misc/src/__phutil_library_init__.php' failed! [00:31:31] ^^ twentyafterfour [00:31:37] PhutilBootloaderException [00:31:40] although intended downtime [00:32:04] :( [00:32:36] paladox: thanks [00:32:42] Danny_B: https://gerrit.wikimedia.org/r/#/c/285976/ shows the new functionality. ignore the whitespace/arrow-align whitenoise. if you add a 'maintenance' key there with a message, cache_misc will block all access to that service with a 503 and a custom message included at the bottom. [00:32:47] it's a start [00:32:48] Your welcome. [00:32:56] it's new though, it hasn't been documented or widely talked about yet [00:34:11] bblack: if you take down phabricator, why would you link to phabricator? [00:34:45] that message was for taking down stat1001, which backs metrics, stats, and datasets .wikimedia.org [00:34:47] Danny_B: the new feature is that you _can_ set a custom error page [00:34:51] (03PS1) 1020after4: More phabricator config changes refs T128797 [puppet] - 10https://gerrit.wikimedia.org/r/287037 (https://phabricator.wikimedia.org/T128797) [00:34:53] ah! [00:34:55] obviously, your custom message for a phab outage wouldn't have a phab link :) [00:35:06] one more phabricator config change please? https://gerrit.wikimedia.org/r/#/c/287037/ [00:35:12] (03CR) 10jenkins-bot: [V: 04-1] More phabricator config changes refs T128797 [puppet] - 10https://gerrit.wikimedia.org/r/287037 (https://phabricator.wikimedia.org/T128797) (owner: 1020after4) [00:35:38] great! please educate relevant people who can take services down to use it... ;-) [00:35:42] (03PS2) 10BBlack: More phabricator config changes refs T128797 [puppet] - 10https://gerrit.wikimedia.org/r/287037 (https://phabricator.wikimedia.org/T128797) (owner: 1020after4) [00:36:04] it's not ideal yet anyways, the message ends up at the bottom of the page [00:36:14] but it's a start in data terms anyways [00:36:24] more-ideal would be confctl integration too [00:36:30] i remember what we were discussed [00:36:40] (03CR) 1020after4: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/287037 (https://phabricator.wikimedia.org/T128797) (owner: 1020after4) [00:36:56] use some principle like in .htaccess rewriterule .* /maintenance.html [00:37:00] (03CR) 10Dzahn: [C: 032] More phabricator config changes refs T128797 [puppet] - 10https://gerrit.wikimedia.org/r/287037 (https://phabricator.wikimedia.org/T128797) (owner: 1020after4) [00:37:08] or so... [00:37:38] well we want it in data, outside of the service itself [00:37:53] the ideal would be for confctl (etcd) to control backend service pooling/maint [00:38:09] how does such info page look like actually? [00:38:36] e.g.: confctl select appserver=stat1001.eqiad.wmnet set/maint='This service is down, see https://phabricator.wikimedia.org/T128797' [00:39:09] also on the nearly-done todo list for confctl itself is auto-logging confctl actions to IRC [00:39:52] twentyafterfour: will the mail notifications from the service time be delivered or no? 
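To make bblack's description above concrete: the 'maintenance' key lives in puppet hieradata and is read by the cache_misc VCL, which then answers every request for that service with a 503 whose error page carries the custom message at the bottom. The toy model below only mirrors that described behaviour; the service names, backends and error template are invented for illustration.

"""Toy model of the cache_misc 'maintenance' key described above.
The real mechanism is puppet hieradata plus VCL; every name and the
error template here are invented stand-ins."""

ERROR_TEMPLATE = (
    "<html><body><h1>Wikimedia Error</h1>"
    "<p>Our servers are currently experiencing a technical problem.</p>"
    "<p>%(message)s</p></body></html>"
)

SERVICES = {
    # one entry per misc web service, roughly as the hieradata has
    "stats": {
        "backend": "stats-backend.example.wmnet",
        "maintenance": "This service is down for planned maintenance.",
    },
    "phab": {
        "backend": "phab-backend.example.wmnet",
        # no 'maintenance' key -> traffic is passed to the backend
    },
}


def handle(service):
    svc = SERVICES[service]
    if "maintenance" in svc:
        # All access is blocked with a 503; the custom message appears
        # where "Error: 503, Service Unavailable" normally would.
        return 503, ERROR_TEMPLATE % {"message": svc["maintenance"]}
    return 200, "forwarded to %s" % svc["backend"]


print(handle("stats")[0])   # 503 with the maintenance message
print(handle("phab")[0])    # 200, normal traffic
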
[00:40:00] Danny_B: it looks basically like the 503 page you saw for phab earlier, but the custom message is included at the bottom, presently [00:40:24] in place of 'Error: 503, Service Unavailable' at the bottom [00:40:30] maybe you could just put HTML in that ? [00:40:54] we've talked about moving the error text up to the big text at the top, but needs html design input to make sure it still looks pretty and looks like our MediaWiki error pages, etc [00:41:00] would it be possibe to load it from other file? [00:41:03] (and more-complex templating) [00:41:10] Danny_B: not really [00:41:28] we have a consistent error template we load from a file, and we want that to stay consistent regardless of service [00:41:49] the custom bit is just the error message string, and the design question is how to alter the existing template to show it more-prominently while still looking nice [00:41:52] btw - speaking of error message - are we supposed to have the only one atm? the one with logo? [00:42:13] (because couple days ago i've see the old bsod) [00:42:14] there's two, but they look very similar [00:42:16] no, there is some epic ticket about making all the error pages look more alike across the entire WMF [00:42:20] one from MW, one from Varnish [00:42:29] they're already fairly-well aligned, but not perfectly [00:43:32] could you please link me the repo(s) where those messages are stored? ( i mean their php/html template ) [00:43:39] we could play a sound that literally tells you it's maintenance :) /me hides [00:44:16] mutante: embed yt video with construction guy? [00:45:13] Danny_B: i was more thinking an .ogg like the pronunciation examples on wikipedia and wiktionary :p [00:47:15] Danny_B: https://phabricator.wikimedia.org/T113114 [00:47:35] that is the bug and it has all the links you wanted [00:47:44] source column [00:48:35] Danny_B: the current varnish one is https://github.com/wikimedia/operations-puppet/blob/production/files/varnish/errorpage.html + multi-varnish-version templating at https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/errorpage.inc.vcl.erb [00:48:55] also https://phabricator.wikimedia.org/T76560 [00:49:15] some work was done it a while back (probably in those tickets mutante links) to align it with MW/apache error pages somewhat [00:49:22] mutante: if one of us was a girl, i'd send a :-*, but bad luck... :-P [00:50:39] T113114 is another example of why #consistency tag would be handy (cc: andre__ ) [00:50:39] T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114 [00:51:13] nice bot, sit ;-) [00:52:45] bblack: a while ago i've participated on those error messages and already then i was calling for unification... if it is possible now, i'll be happy to work on it... [00:52:45] Make phabricator tags #consistent [00:52:57] ;-) [00:53:53] unicornification, you wanna include labs [00:57:02] Danny_B: no mail should have been lost [00:57:38] 06Operations: status.wikimedia.org should use some Wikimedia icon if possible - https://phabricator.wikimedia.org/T134458#2266248 (10Danny_B) [00:57:50] 06Operations: status.wikimedia.org should use some Wikimedia icon if possible - https://phabricator.wikimedia.org/T134458#2266260 (10Danny_B) p:05Triage>03Low [00:58:24] 06Operations: status.wikimedia.org should use some Wikimedia favicon if possible - https://phabricator.wikimedia.org/T134458#2266248 (10Danny_B) [00:59:14] twentyafterfour: when about they should be sent? 
[01:05:52] (03PS1) 10BBlack: tlsproxy: no AE:gzip forcing for HTTP/2 [puppet] - 10https://gerrit.wikimedia.org/r/287038 [01:06:17] (03CR) 10BBlack: "Should probably read up more on this before merging, just in case." [puppet] - 10https://gerrit.wikimedia.org/r/287038 (owner: 10BBlack) [01:07:27] status.wm.o has bigger problems heh [01:09:32] the thing is that it's not WMF, the favicon would not be entirely correct [01:10:08] Danny_B: we would like to replace that status page anyways [01:12:28] Danny_B: re: the mail question.. the answer is ..ehm [01:12:30] 186 * * F,2h,15m; G,16h,1h,1.5; F,4d,6h [01:13:55] i know it's not wmf's, but (assuming) amazon. that's why i put the conditions in the description... ;-) [01:14:40] speaking of it not being wmf service, there should be a disclaimer at the bottom, that it is third party service, which does not have to follow wmf privacy policy etc...) [01:15:05] it's formerly-Nimsoft, now CA App Synthetic Monitor or something [01:15:23] Danny_B: yep, just like on policy.wikimedia.org [01:15:31] ironically [01:15:56] so i think the mail retry should be 2h [01:16:03] the full answer is in that line above and http://www.exim.org/exim-html-current/doc/html/spec_html/ch-retry_configuration.html [01:16:17] from templates/exim/exim4.conf.phab.erb [01:17:35] no wait, every 15 minutes for the first 2 hours [01:17:37] 06Operations, 10Traffic, 07HTTPS: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#2266282 (10BBlack) In the settings, we can see that http://status.wikimedia.org/ is also available at http://status.asm.ca.com/8777 . There don't appear to be any TLS-related settings :( W... [01:17:39] omg, we have labs infrastructure, why don't we run such services / microsites there? [01:17:54] because the point of the status page is to be up when all else is down [01:18:07] and measure uptime [01:18:12] mutante: i was rather pointing to policy atm [01:18:16] until we had catchpoint at least [01:18:30] Danny_B: oh, yea, that you should ask legal [01:18:34] i know that service monitors have to be elsewhere [01:18:58] but also labs aint for production services [01:19:11] which still does not imply we can't have our own monitoring hw / vm somewhere [01:19:37] mutante: yup, i meant they are different cluster than production [01:19:47] yea, that is true about monitoring and why we have both, internal and external [01:19:54] = monitoring is elsewhere than production [01:21:40] Danny_B: "Why did we move it away from our cluster" _> https://phabricator.wikimedia.org/T110203 "Why dont we move it back" -> https://phabricator.wikimedia.org/T132104 "Why does it load 3rd party analytics" -> https://phabricator.wikimedia.org/T132103 [01:21:50] have fun [01:21:54] hey, mutante , i see you on the policy site banner photo! ;-) [01:22:40] :p yea [01:23:00] that one i'm actually not wearing the "dont take photos" lanyard _in the photo_ [01:23:20] 132103 is restricted :-/ [01:23:53] has a custom policy, i dont know why [01:24:14] on Monday April 18 i asked: [01:24:16] How about the policy of this bug itself? Can we open it and make it public? [01:24:39] ah, i see why, security asked to wait [01:25:26] T132103 [01:25:52] oh stashbot, where art thou? [01:25:56] ;-) [01:26:22] i think .. because it cant read the ticket [01:27:57] btw, Watchmouse/CA App Synthetic Monitor/Status.wm.org ... [01:28:17] it told us exactly how Phab went down and came back.. 
by emial [01:28:18] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2188533 (10Danny_B) Can't we simply for the very beginning create the static dump of the current site and present it until something new/better will be used? How oft... [01:28:47] gotta run, cu later [01:28:54] cu [01:52:21] 06Operations, 10Traffic, 06WMF-Design, 10Wikimedia-General-or-Unknown, 07Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#2266306 (10Danny_B) Neither "Requirements" nor "Things we can't do" mentions JavaScript... Any statement about it? [02:07:33] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [02:38:13] PROBLEM - DPKG on tungsten is CRITICAL: DPKG CRITICAL dpkg reports broken packages [02:38:14] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:23] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.113, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(104, Connection reset by peer))) [02:38:33] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:53] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:53] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:03] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:05] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.178, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., BadStatusLine(,))) [02:39:13] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures [02:39:14] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:14] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:14] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:15] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:41:44] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:43:05] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [02:47:23] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [02:47:32] hm? [02:49:34] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.22) (duration: 18m 23s) [02:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:50:23] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
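For reference, the exim retry rule mutante quoted above from exim4.conf.phab.erb ("* * F,2h,15m; G,16h,1h,1.5; F,4d,6h"; the leading 186 is just a line number) is the stock Debian exim retry schedule: fixed 15-minute retries for the first two hours after a failure, then geometrically growing intervals (1h, scaled by 1.5 each time) out to 16 hours, then every 6 hours up to 4 days before the message bounces. So mail deferred during the roughly one-hour Phabricator outage should have been retried within about 15 minutes of the service coming back. A small sketch that just expands that rule into an approximate timeline:

"""Expand the exim retry rule quoted above into an approximate schedule.
F = fixed-interval retries, G = geometrically increasing intervals,
per the exim spec linked in the log; this is a reading aid, not
anything that runs in production."""

HOUR = 60  # minutes


def schedule():
    times, t = [], 0
    # F,2h,15m: every 15 minutes until 2h after the first failure
    while t < 2 * HOUR:
        t += 15
        times.append(t)
    # G,16h,1h,1.5: intervals 1h, 1.5h, 2.25h, ... until 16h
    interval = 1.0 * HOUR
    while t < 16 * HOUR:
        t += interval
        times.append(t)
        interval *= 1.5
    # F,4d,6h: every 6 hours until 4 days, then the message bounces
    while t < 4 * 24 * HOUR:
        t += 6 * HOUR
        times.append(t)
    return times


if __name__ == "__main__":
    for t in schedule():
        print("retry ~%.1f h after first failure" % (t / 60.0))
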
[02:50:37] hmmm that doesn't look good [02:53:04] loadavg is skyrocketing, not sure about the nrpe thing [02:53:34] RECOVERY - DPKG on tungsten is OK: All packages OK [02:53:37] so is network [02:54:06] hmmm only on 789 [02:54:41] 789 higher load and bytes output on network, others higher bytes in [02:54:53] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:04] PROBLEM - Check size of conntrack table on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:22] around 02:33 for a start time, 5 mins before the icinga spam [02:55:26] VE isn't loading, restbase times out [02:55:43] I'm mostly flying blind here, I don't have a lot of experience with RB hosts [02:55:43] PROBLEM - dhclient process on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:44] PROBLEM - puppet last run on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:44] PROBLEM - RAID on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:44] PROBLEM - DPKG on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:44] PROBLEM - Check size of conntrack table on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:52] is wtp related? [02:55:53] PROBLEM - DPKG on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:53] PROBLEM - SSH on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:55:55] PROBLEM - salt-minion processes on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:56:09] yeah, restbase hits parsoid for revisions it does not already have in storage [02:56:13] PROBLEM - Disk space on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:56:14] PROBLEM - configured eth on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:56:29] possibly just fallout, but PURGE volume on caches dropped off notably around the same time [02:56:36] could be changeprop-related? [02:56:55] PROBLEM - SSH on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:04] PROBLEM - configured eth on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:57:04] PROBLEM - puppet last run on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:57:04] PROBLEM - RAID on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:57:13] yeah didn't something similar happen in beta earlier? [02:57:14] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [02:57:28] PROBLEM - parsoid disk space on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:57:28] PROBLEM - Parsoid on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:33] PROBLEM - DPKG on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:57:36] our overall req volume isn't spiking, I don't think this is DoS [02:58:13] RECOVERY - configured eth on wtp1012 is OK: OK - interfaces up [02:58:15] the parsoid / restbase errors in logstash all reference a particular revision on frwiki, frwiki/?oldid=106801025 [02:58:24] PROBLEM - SSH on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:58:33] PROBLEM - RAID on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:58:34] PROBLEM - Check size of conntrack table on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:58:37] wtp1006 disk space just paged [02:58:39] PROBLEM - parsoid disk space on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:58:43] PROBLEM - puppet last run on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:58:43] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:58:44] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 31 minutes ago with 0 failures [02:58:49] which is https://fr.wikipedia.org/w/index.php?oldid=106801025 [02:58:54] * andrewbogott is here and looking at those [02:58:55] PROBLEM - salt-minion processes on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:58:55] PROBLEM - RAID on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:58:55] PROBLEM - Parsoid on wtp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:58:56] from 2014, nothing unusual about it [02:59:05] PROBLEM - salt-minion processes on wtp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:05] PROBLEM - configured eth on wtp1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:05] PROBLEM - dhclient process on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:05] PROBLEM - SSH on wtp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:59:05] PROBLEM - salt-minion processes on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:05] PROBLEM - Disk space on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:05] PROBLEM - configured eth on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:06] PROBLEM - DPKG on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:06] PROBLEM - salt-minion processes on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:07] PROBLEM - salt-minion processes on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:07] PROBLEM - configured eth on wtp1024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:08] PROBLEM - dhclient process on wtp1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:18] PROBLEM - parsoid disk space on wtp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:18] PROBLEM - dhclient process on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:18] PROBLEM - DPKG on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:18] PROBLEM - Check size of conntrack table on wtp1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:18] PROBLEM - puppet last run on wtp1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:18] PROBLEM - RAID on wtp1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:18] PROBLEM - RAID on wtp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:19] PROBLEM - DPKG on wtp1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:19] PROBLEM - configured eth on wtp1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:20] PROBLEM - RAID on wtp1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:20] PROBLEM - puppet last run on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:21] PROBLEM - puppet last run on wtp1024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:24] andrewbogott: could you page subbu? 
[02:59:28] Although I would love it if someone said "I know what those are…" [02:59:29] sure [02:59:33] thanks [02:59:34] PROBLEM - dhclient process on wtp1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:34] PROBLEM - DPKG on wtp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:34] PROBLEM - Disk space on wtp1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:34] PROBLEM - RAID on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:34] PROBLEM - RAID on wtp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:39] PROBLEM - parsoid disk space on wtp1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:39] PROBLEM - puppet last run on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:39] PROBLEM - configured eth on wtp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:43] PROBLEM - DPKG on wtp1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:44] PROBLEM - SSH on wtp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:59:56] PROBLEM - Disk space on wtp1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:56] PROBLEM - Disk space on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:56] PROBLEM - dhclient process on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:56] PROBLEM - dhclient process on wtp1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:01] PROBLEM - parsoid disk space on wtp1024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:03] PROBLEM - configured eth on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:03] PROBLEM - Check size of conntrack table on wtp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:03] PROBLEM - Check size of conntrack table on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:03] PROBLEM - configured eth on wtp1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:03] PROBLEM - RAID on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:03] PROBLEM - puppet last run on wtp1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:03] PROBLEM - configured eth on wtp1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:04] PROBLEM - DPKG on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:04] PROBLEM - salt-minion processes on wtp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:05] PROBLEM - salt-minion processes on wtp1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:05] PROBLEM - Disk space on wtp1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:06] PROBLEM - dhclient process on wtp1024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:15] https://grafana-admin.wikimedia.org/dashboard/db/restbase has lots of interesting graph anomalies... [03:00:23] PROBLEM - dhclient process on wtp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:23] PROBLEM - salt-minion processes on wtp1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:23] PROBLEM - RAID on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:00:24] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [03:00:33] PROBLEM - SSH on wtp1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:00:36] PROBLEM - parsoid disk space on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:37] PROBLEM - SSH on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:00:37] PROBLEM - Check size of conntrack table on wtp1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:37] PROBLEM - salt-minion processes on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:37] PROBLEM - salt-minion processes on wtp1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:37] PROBLEM - salt-minion processes on wtp1024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:37] PROBLEM - DPKG on wtp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:38] multi-minute request latencies, RB reqrate dropoff, etc [03:00:38] PROBLEM - Check size of conntrack table on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:38] PROBLEM - Check size of conntrack table on wtp1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:40] PROBLEM - Check size of conntrack table on wtp1024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:40] PROBLEM - SSH on wtp1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:00:45] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:45] PROBLEM - RAID on wtp1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:45] PROBLEM - Disk space on wtp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:45] PROBLEM - Check size of conntrack table on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:45] PROBLEM - Check size of conntrack table on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:45] PROBLEM - configured eth on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:45] PROBLEM - DPKG on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:54] PROBLEM - SSH on wtp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:00:55] PROBLEM - Disk space on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:00:55] PROBLEM - puppet last run on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:03] PROBLEM - RAID on wtp1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:03] PROBLEM - puppet last run on wtp1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:04] PROBLEM - DPKG on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:04] PROBLEM - DPKG on wtp1024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:04] PROBLEM - RAID on wtp1024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:04] PROBLEM - configured eth on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:04] PROBLEM - DPKG on wtp1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:04] PROBLEM - SSH on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:04] PROBLEM - puppet last run on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:05] PROBLEM - dhclient process on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:01:05] PROBLEM - Disk space on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:06] PROBLEM - Check size of conntrack table on wtp1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:06] someone tried to edit a revision from 2014, restbase didn't have it in storage, so it had to pass on the request to parsoid, which is having a hard time parsing it [03:01:18] PROBLEM - parsoid disk space on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:26] PROBLEM - parsoid disk space on wtp1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:27] RECOVERY - salt-minion processes on wtp1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:01:27] PROBLEM - SSH on wtp1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:27] RECOVERY - SSH on wtp1022 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:01:28] PROBLEM - Disk space on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:28] so is RB spamming that one thing to all the wtp servers and locking them all up? [03:01:33] RECOVERY - SSH on wtp1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:01:33] PROBLEM - SSH on wtp1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:33] PROBLEM - SSH on wtp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:34] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:37] can we kill this one bad request/parse? [03:01:37] RECOVERY - parsoid disk space on wtp1022 is OK: DISK OK [03:01:38] RECOVERY - RAID on wtp1022 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:01:38] RECOVERY - configured eth on wtp1006 is OK: OK - interfaces up [03:01:38] yes, it looks like it [03:01:38] PROBLEM - DPKG on wtp1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:42] PROBLEM - parsoid disk space on wtp1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:43] RECOVERY - RAID on wtp1006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:01:43] PROBLEM - dhclient process on wtp1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:43] PROBLEM - salt-minion processes on wtp1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:43] PROBLEM - dhclient process on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:43] PROBLEM - Check size of conntrack table on wtp1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:43] PROBLEM - configured eth on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:44] PROBLEM - salt-minion processes on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:44] PROBLEM - dhclient process on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:45] PROBLEM - Check size of conntrack table on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:45] PROBLEM - dhclient process on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:46] dunno how [03:01:46] PROBLEM - RAID on wtp1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:46] PROBLEM - RAID on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:01:52] me either! 
[03:01:54] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is currently enabled, last run 31 minutes ago with 0 failures [03:02:17] RECOVERY - parsoid disk space on wtp1006 is OK: DISK OK [03:02:17] PROBLEM - SSH on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:21] PROBLEM - parsoid disk space on wtp1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:02:35] RECOVERY - dhclient process on wtp1022 is OK: PROCS OK: 0 processes with command name dhclient [03:02:35] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 1.898 second response time [03:02:36] RECOVERY - salt-minion processes on wtp1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:02:43] RECOVERY - Disk space on wtp1004 is OK: DISK OK [03:02:43] RECOVERY - dhclient process on wtp1004 is OK: PROCS OK: 0 processes with command name dhclient [03:02:43] RECOVERY - salt-minion processes on wtp1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:02:43] RECOVERY - configured eth on wtp1005 is OK: OK - interfaces up [03:02:45] PROBLEM - salt-minion processes on wtp1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:02:45] RECOVERY - Check size of conntrack table on wtp1014 is OK: OK: nf_conntrack is 3 % full [03:02:53] RECOVERY - dhclient process on wtp1005 is OK: PROCS OK: 0 processes with command name dhclient [03:02:53] RECOVERY - RAID on wtp1014 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:02:53] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 35 minutes ago with 0 failures [03:02:54] PROBLEM - configured eth on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:02:54] RECOVERY - SSH on wtp1014 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:02:55] PROBLEM - Check size of conntrack table on wtp1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:02:55] PROBLEM - DPKG on wtp1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:03:07] RECOVERY - parsoid disk space on wtp1017 is OK: DISK OK [03:03:07] PROBLEM - DPKG on wtp1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:03:07] PROBLEM - RAID on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:03:08] RECOVERY - SSH on wtp1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:03:13] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures [03:03:14] RECOVERY - configured eth on wtp1004 is OK: OK - interfaces up [03:03:14] RECOVERY - DPKG on wtp1004 is OK: All packages OK [03:03:23] RECOVERY - dhclient process on wtp1009 is OK: PROCS OK: 0 processes with command name dhclient [03:03:28] RECOVERY - parsoid disk space on wtp1023 is OK: DISK OK [03:03:28] RECOVERY - Disk space on wtp1023 is OK: DISK OK [03:03:32] RECOVERY - parsoid disk space on wtp1004 is OK: DISK OK [03:03:35] RECOVERY - parsoid disk space on wtp1009 is OK: DISK OK [03:03:35] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 6.949 second response time [03:03:36] PROBLEM - Disk space on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:03:40] RECOVERY - parsoid disk space on wtp1016 is OK: DISK OK [03:03:40] PROBLEM - SSH on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:03:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [03:03:40] RECOVERY - salt-minion processes on wtp1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:03:40] PROBLEM - puppet last run on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:04:00] RECOVERY - dhclient process on wtp1019 is OK: PROCS OK: 0 processes with command name dhclient [03:04:01] RECOVERY - salt-minion processes on wtp1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:04:14] PROBLEM - parsoid disk space on wtp1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:04:14] RECOVERY - SSH on wtp1017 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:04:20] PROBLEM - Check size of conntrack table on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:04:20] PROBLEM - RAID on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:04:20] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.808 second response time [03:04:24] RECOVERY - parsoid disk space on wtp1013 is OK: DISK OK [03:04:24] PROBLEM - puppet last run on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:04:24] PROBLEM - salt-minion processes on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:04:24] PROBLEM - configured eth on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:04:30] ... [03:04:30] RECOVERY - dhclient process on wtp1017 is OK: PROCS OK: 0 processes with command name dhclient [03:04:30] RECOVERY - dhclient process on wtp1013 is OK: PROCS OK: 0 processes with command name dhclient [03:04:30] RECOVERY - salt-minion processes on wtp1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:04:30] RECOVERY - configured eth on wtp1024 is OK: OK - interfaces up [03:04:30] PROBLEM - salt-minion processes on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:04:40] RECOVERY - dhclient process on wtp1010 is OK: PROCS OK: 0 processes with command name dhclient [03:04:40] RECOVERY - Check size of conntrack table on wtp1017 is OK: OK: nf_conntrack is 4 % full [03:04:40] RECOVERY - DPKG on wtp1013 is OK: All packages OK [03:04:41] RECOVERY - RAID on wtp1017 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:04:41] RECOVERY - configured eth on wtp1017 is OK: OK - interfaces up [03:04:41] RECOVERY - RAID on wtp1024 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:04:41] PROBLEM - salt-minion processes on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:04:41] RECOVERY - RAID on wtp1013 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:04:41] PROBLEM - DPKG on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:04:50] RECOVERY - dhclient process on wtp1023 is OK: PROCS OK: 0 processes with command name dhclient [03:04:50] RECOVERY - SSH on wtp1024 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:04:51] PROBLEM - SSH on wtp1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:04:51] PROBLEM - Check size of conntrack table on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:04:51] PROBLEM - Check size of conntrack table on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:04:51] RECOVERY - DPKG on wtp1023 is OK: All packages OK [03:04:51] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - parsoid_8000 - Could not depool server wtp1019.eqiad.wmnet because of too many down! [03:04:51] RECOVERY - puppet last run on wtp1024 is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures [03:05:00] PROBLEM - DPKG on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:05:00] PROBLEM - configured eth on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:05:00] RECOVERY - SSH on wtp1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:05:10] PROBLEM - RAID on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:05:10] RECOVERY - DPKG on wtp1017 is OK: All packages OK [03:05:10] RECOVERY - configured eth on wtp1013 is OK: OK - interfaces up [03:05:11] RECOVERY - Disk space on wtp1020 is OK: DISK OK [03:05:11] RECOVERY - puppet last run on wtp1023 is OK: OK: Puppet is currently enabled, last run 34 minutes ago with 0 failures [03:05:11] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:05:11] RECOVERY - puppet last run on wtp1017 is OK: OK: Puppet is currently enabled, last run 33 minutes ago with 0 failures [03:05:11] RECOVERY - dhclient process on wtp1024 is OK: PROCS OK: 0 processes with command name dhclient [03:05:25] PROBLEM - parsoid disk space on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:05:26] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 wow .. what is going on there. [03:05:29] RECOVERY - parsoid disk space on wtp1024 is OK: DISK OK [03:05:29] RECOVERY - Check size of conntrack table on wtp1013 is OK: OK: nf_conntrack is 2 % full [03:05:32] PROBLEM - parsoid disk space on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:05:33] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [03:05:42] RECOVERY - Disk space on wtp1017 is OK: DISK OK [03:05:42] RECOVERY - salt-minion processes on wtp1023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:05:42] RECOVERY - puppet last run on wtp1013 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures [03:05:52] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [03:05:52] PROBLEM - configured eth on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:05:52] PROBLEM - SSH on wtp1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:05:52] PROBLEM - Parsoid on wtp1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:05:52] RECOVERY - Check size of conntrack table on wtp1024 is OK: OK: nf_conntrack is 2 % full [03:05:53] RECOVERY - DPKG on wtp1024 is OK: All packages OK [03:05:57] subbu, someone tried to edit a revision from 2014, restbase didn't have it in storage, so it had to pass on the request to parsoid, which is having a hard time parsing it [03:06:06] PROBLEM - parsoid disk space on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:06:09] PROBLEM - parsoid disk space on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:06:10] RECOVERY - Disk space on wtp1013 is OK: DISK OK [03:06:10] RECOVERY - RAID on wtp1023 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:06:10] PROBLEM - puppet last run on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:06:14] PROBLEM - parsoid disk space on wtp1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:06:20] disk space is critical? [03:06:25] RECOVERY - SSH on wtp1023 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:06:27] RECOVERY - Disk space on wtp1009 is OK: DISK OK [03:06:35] PROBLEM - Check size of conntrack table on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:06:36] RECOVERY - configured eth on wtp1023 is OK: OK - interfaces up [03:06:36] PROBLEM - SSH on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:36] PROBLEM - salt-minion processes on wtp1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:06:37] subbu: it's not a disk space problem. monitoring is failing because the hosts are in general trouble [03:06:48] PROBLEM - parsoid disk space on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:06:52] PROBLEM - parsoid disk space on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:06:53] PROBLEM - Disk space on wtp1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:06:53] PROBLEM - dhclient process on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:06:54] RECOVERY - Check size of conntrack table on wtp1023 is OK: OK: nf_conntrack is 1 % full [03:06:55] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [03:06:55] RECOVERY - Check size of conntrack table on wtp1016 is OK: OK: nf_conntrack is 2 % full [03:06:55] PROBLEM - configured eth on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:06:55] PROBLEM - RAID on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:06:55] PROBLEM - Disk space on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:06] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [03:07:06] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [03:07:07] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [03:07:09] subbu: repeating from a less-spammed channel: [03:07:11] 03:03 < bblack> someone ask restbase about an old frwiki article from 2014, it's not in cache, it asked parsoid to parse it. parsoid is having trouble with it (hangs on trying to parse it?), and so RB eventually times out and asks again or something, and is spamming that through all the parsoid servers killing them [03:07:15] PROBLEM - configured eth on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:16] 03:04 < ori> https://fr.wikipedia.org/api/rest_v1/page/html/%EA%9E%80/106801025?redirect=false is the restbase request for that revision [03:07:18] PROBLEM - parsoid disk space on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:19] 03:05 < bblack> and some wtp are recovering intermittently, probably because when they go crazy LVS/pybal depools them and they get to eventually recover and re-enter the pool [03:07:19] PROBLEM - dhclient process on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:07:19] RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures [03:07:22] i cannot get onto the servers. [03:07:23] 03:05 < andrewbogott> why would that throw puppet alerts? [03:07:25] PROBLEM - puppet last run on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:25] PROBLEM - puppet last run on wtp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:25] PROBLEM - dhclient process on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:25] 03:05 < bblack> and then get hit again [03:07:28] 03:06 < bblack> because the RB and parsoid nodes involved are hitting very high load and failing NRPE checks with socket timeouts, etc [03:07:31] 03:06 < ori> OK, I'm going to try and find some crude way to reject requests for that particular revision in parsoid [03:07:34] 03:06 < bblack> wtp is runni [03:07:36] RECOVERY - dhclient process on wtp1016 is OK: PROCS OK: 0 processes with command name dhclient [03:07:37] RECOVERY - Disk space on wtp1001 is OK: DISK OK [03:07:43] yeah logging into wtp is going to be hard, it's conntrack tables are already overflowed [03:07:45] PROBLEM - salt-minion processes on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:46] PROBLEM - RAID on wtp1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:49] PROBLEM - parsoid disk space on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:49] PROBLEM - DPKG on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:50] subbu: can you move to #wikimedia_security where it's quieter? [03:07:50] PROBLEM - puppet last run on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:54] PROBLEM - parsoid disk space on wtp1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:55] PROBLEM - salt-minion processes on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:55] PROBLEM - Disk space on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:08:06] moved there. [03:08:06] andrewbogott, that channel doesn't exist [03:08:17] PROBLEM - Disk space on wtp1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:08:17] PROBLEM - dhclient process on wtp1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:08:17] um… mediawiki_security [03:08:19] dammit [03:08:26] PROBLEM - salt-minion processes on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:08:33] i cannot join .. it is invite only [03:08:36] RECOVERY - puppet last run on wtp1019 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures [03:08:38] oh, ok [03:08:39] sorry [03:08:45] RECOVERY - SSH on wtp1019 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:08:45] PROBLEM - RAID on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:08:45] RECOVERY - DPKG on wtp1019 is OK: All packages OK [03:08:49] RECOVERY - parsoid disk space on wtp1019 is OK: DISK OK [03:08:50] PROBLEM - salt-minion processes on wtp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:08:50] PROBLEM - dhclient process on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
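The failure mode bblack recaps above is a classic retry amplification: RESTBase asks Parsoid for an uncached old revision, the parse never finishes within RESTBase's timeout, and each timed-out attempt is re-dispatched to another wtp backend while the previous one keeps grinding, until the whole pool is wedged on the same input. A toy model of that pattern follows; pool size, timeouts and names are made up, and this is not RESTBase or Parsoid code.

"""Toy model of the retry amplification recapped above: one pathological
request plus timeout-and-retry at the caller occupies the whole backend
pool.  Entirely illustrative; not RESTBase or Parsoid code."""

BACKENDS = ["wtp1001", "wtp1002", "wtp1003", "wtp1004", "wtp1005"]
CALLER_TIMEOUT = 60          # seconds the caller waits before retrying
PARSE_TIME = float("inf")    # the bad revision effectively never finishes


def fetch_html(revision, max_attempts=5):
    stuck = []
    for attempt in range(max_attempts):
        backend = BACKENDS[attempt % len(BACKENDS)]
        stuck.append(backend)            # this backend starts parsing ...
        if PARSE_TIME <= CALLER_TIMEOUT:
            return "html of %s from %s" % (revision, backend)
        # ... the caller times out and retries elsewhere, but the backend
        # keeps working on the doomed parse and stays loaded
    raise RuntimeError("gave up on %s; still busy: %s" % (revision, stuck))


try:
    fetch_html("fr.wikipedia.org oldid=106801025")
except RuntimeError as exc:
    print(exc)
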
[03:08:50] RECOVERY - salt-minion processes on wtp1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:08:54] can use -traffic, it's public/logged and unspammed [03:08:55] RECOVERY - salt-minion processes on wtp1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:08:55] PROBLEM - SSH on wtp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:08:55] PROBLEM - dhclient process on wtp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:08:55] PROBLEM - salt-minion processes on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:08:55] PROBLEM - configured eth on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:08:55] PROBLEM - salt-minion processes on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:08:55] PROBLEM - SSH on wtp1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:09:19] PROBLEM - parsoid disk space on wtp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:19] PROBLEM - RAID on wtp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:19] PROBLEM - Check size of conntrack table on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:19] PROBLEM - dhclient process on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:19] PROBLEM - RAID on wtp1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:20] PROBLEM - puppet last run on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:20] PROBLEM - Disk space on wtp1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:25] RECOVERY - configured eth on wtp1009 is OK: OK - interfaces up [03:09:25] PROBLEM - SSH on wtp1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:09:25] PROBLEM - SSH on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:09:36] PROBLEM - puppet last run on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:45] PROBLEM - Check size of conntrack table on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:45] PROBLEM - Check size of conntrack table on wtp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:56] PROBLEM - configured eth on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:56] RECOVERY - Disk space on wtp1011 is OK: DISK OK [03:09:57] PROBLEM - DPKG on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:10:09] PROBLEM - parsoid disk space on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:10:10] PROBLEM - Parsoid on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:15] PROBLEM - Disk space on wtp1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:10:15] PROBLEM - Disk space on wtp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:10:16] RECOVERY - Check size of conntrack table on wtp1009 is OK: OK: nf_conntrack is 1 % full [03:10:25] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.019 second response time [03:10:25] RECOVERY - DPKG on wtp1009 is OK: All packages OK [03:10:26] RECOVERY - SSH on wtp1009 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:10:27] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures [03:10:36] RECOVERY - salt-minion processes on wtp1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:10:46] RECOVERY - Disk space on wtp1010 is OK: DISK OK [03:10:56] RECOVERY - dhclient process on wtp1001 is OK: PROCS OK: 0 processes with command name dhclient [03:11:06] RECOVERY - dhclient process on wtp1020 is OK: PROCS OK: 0 processes with command name dhclient [03:11:07] RECOVERY - salt-minion processes on wtp1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:11:07] RECOVERY - configured eth on wtp1020 is OK: OK - interfaces up [03:11:07] RECOVERY - RAID on wtp1009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:11:07] RECOVERY - DPKG on wtp1001 is OK: All packages OK [03:11:15] PROBLEM - Disk space on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:11:15] RECOVERY - RAID on wtp1020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:11:19] RECOVERY - parsoid disk space on wtp1001 is OK: DISK OK [03:11:19] RECOVERY - Check size of conntrack table on wtp1020 is OK: OK: nf_conntrack is 1 % full [03:11:19] RECOVERY - DPKG on wtp1010 is OK: All packages OK [03:11:23] RECOVERY - parsoid disk space on wtp1011 is OK: DISK OK [03:11:23] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:11:23] RECOVERY - Check size of conntrack table on wtp1001 is OK: OK: nf_conntrack is 0 % full [03:11:24] PROBLEM - RAID on wtp1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:11:36] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 26 minutes ago with 0 failures [03:11:36] RECOVERY - RAID on wtp1010 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:11:36] RECOVERY - DPKG on wtp1020 is OK: All packages OK [03:11:46] PROBLEM - SSH on wtp1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:11:47] RECOVERY - SSH on wtp1010 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:11:47] RECOVERY - SSH on wtp1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:12:08] RECOVERY - RAID on wtp1001 is OK: OK: no disks configured for RAID [03:12:09] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.208 second response time [03:12:09] RECOVERY - salt-minion processes on wtp1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:12:15] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 5.054 second response time [03:12:15] RECOVERY - configured eth on wtp1010 is OK: OK - interfaces up [03:12:15] PROBLEM - puppet last run on wtp1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:12:20] RECOVERY - parsoid disk space on wtp1020 is OK: DISK OK [03:12:24] RECOVERY - parsoid disk space on wtp1010 is OK: DISK OK [03:12:28] PROBLEM - parsoid disk space on wtp1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:12:39] RECOVERY - puppet last run on wtp1010 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [03:12:39] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [03:13:09] RECOVERY - Check size of conntrack table on wtp1010 is OK: OK: nf_conntrack is 3 % full [03:13:29] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:13:29] PROBLEM - Check size of conntrack table on wtp1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:13:33] what's going on? [03:13:38] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [03:13:39] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:13:39] PROBLEM - Disk space on wtp1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:13:39] PROBLEM - Parsoid on wtp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:13:39] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:13:49] PROBLEM - puppet last run on wtp1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:14:03] RECOVERY - parsoid disk space on wtp1003 is OK: DISK OK [03:14:03] RECOVERY - SSH on wtp1021 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:14:04] gwicke: come to #wikimedia-traffic [03:14:08] PROBLEM - dhclient process on wtp1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:14:13] it's spammy in here [03:14:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [03:14:48] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [03:14:48] RECOVERY - dhclient process on wtp1021 is OK: PROCS OK: 0 processes with command name dhclient [03:14:50] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [03:14:53] RECOVERY - parsoid disk space on wtp1021 is OK: DISK OK [03:14:58] PROBLEM - dhclient process on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:14:59] !log injected an early return to v1Wt2html in routes.js when oldid = 106801025 [03:14:59] PROBLEM - puppet last run on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:15:09] RECOVERY - Check size of conntrack table on wtp1021 is OK: OK: nf_conntrack is 2 % full [03:15:09] RECOVERY - DPKG on wtp1021 is OK: All packages OK [03:15:10] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 25 minutes ago with 0 failures [03:15:11] !log and restarted parsoid on wtp1* [03:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:15:18] PROBLEM - SSH on wtp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:15:18] PROBLEM - DPKG on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:15:23] PROBLEM - parsoid disk space on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:15:23] PROBLEM - Disk space on wtp1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:15:23] PROBLEM - salt-minion processes on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:15:28] RECOVERY - configured eth on wtp1021 is OK: OK - interfaces up [03:15:28] RECOVERY - RAID on wtp1021 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:15:28] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [03:15:28] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [03:15:29] RECOVERY - Disk space on wtp1021 is OK: DISK OK [03:15:29] RECOVERY - RAID on wtp1008 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:15:39] RECOVERY - salt-minion processes on wtp1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:40] PROBLEM - dhclient process on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:15:48] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [03:15:48] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [03:15:49] RECOVERY - SSH on wtp1008 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:16:18] RECOVERY - salt-minion processes on wtp1011 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [03:16:18] RECOVERY - SSH on wtp1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:16:19] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures [03:16:20] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [03:16:28] RECOVERY - Disk space on wtp1003 is OK: DISK OK [03:16:28] PROBLEM - Parsoid on wtp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:16:50] PROBLEM - Parsoid on wtp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:13] RECOVERY - parsoid disk space on wtp1019 is OK: DISK OK [03:17:18] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [03:17:18] RECOVERY - salt-minion processes on wtp1003 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [03:17:18] RECOVERY - SSH on wtp1003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:17:18] PROBLEM - Parsoid on wtp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:18] PROBLEM - Parsoid on wtp1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:18] RECOVERY - dhclient process on wtp1011 is OK: PROCS OK: 0 processes with command name dhclient [03:17:19] PROBLEM - dhclient process on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:17:33] RECOVERY - parsoid disk space on wtp1022 is OK: DISK OK [03:17:38] PROBLEM - DPKG on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:17:38] PROBLEM - salt-minion processes on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:17:38] PROBLEM - dhclient process on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:17:38] PROBLEM - configured eth on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:17:38] PROBLEM - RAID on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:17:42] PROBLEM - parsoid disk space on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:17:42] PROBLEM - Check size of conntrack table on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
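(Aside: the !log above records that parsoid was restarted on wtp1* but not the exact command. A sketch of how such a fleet-wide restart is typically done, modelled on the salt invocation logged later for the scb hosts — the target glob and the service name "parsoid" are assumptions, not taken from this log:

    sudo salt 'wtp1*' cmd.run 'service parsoid restart'
    # confirm the workers came back
    sudo salt 'wtp1*' cmd.run 'service parsoid status'

)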
[03:17:46] PROBLEM - parsoid disk space on wtp1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:17:46] PROBLEM - DPKG on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:17:48] PROBLEM - Check size of conntrack table on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:17:51] PROBLEM - parsoid disk space on wtp1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:18:10] PROBLEM - DPKG on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:18:10] PROBLEM - puppet last run on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:18:10] PROBLEM - RAID on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:18:19] RECOVERY - RAID on wtp1015 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:18:28] PROBLEM - SSH on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:18:29] PROBLEM - Disk space on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:18:29] PROBLEM - SSH on wtp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:18:29] PROBLEM - Disk space on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:18:38] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 5.314 second response time [03:18:39] RECOVERY - SSH on wtp1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:18:39] PROBLEM - RAID on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:18:40] RECOVERY - SSH on wtp1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:18:48] RECOVERY - dhclient process on wtp1015 is OK: PROCS OK: 0 processes with command name dhclient [03:18:49] PROBLEM - Parsoid on wtp1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:18:49] PROBLEM - salt-minion processes on wtp1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:18:49] PROBLEM - configured eth on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:19:02] RECOVERY - parsoid disk space on wtp1015 is OK: DISK OK [03:19:06] PROBLEM - parsoid disk space on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:19:09] PROBLEM - parsoid disk space on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:19:10] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 4.966 second response time [03:19:11] PROBLEM - puppet last run on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:19:11] PROBLEM - puppet last run on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:19:30] RECOVERY - configured eth on wtp1015 is OK: OK - interfaces up [03:19:30] PROBLEM - Disk space on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:19:31] RECOVERY - salt-minion processes on wtp1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:19:32] RECOVERY - salt-minion processes on wtp1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:19:32] RECOVERY - dhclient process on wtp1022 is OK: PROCS OK: 0 processes with command name dhclient [03:19:40] RECOVERY - Disk space on wtp1015 is OK: DISK OK [03:19:40] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 6.978 second response time [03:19:40] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:19:41] PROBLEM - salt-minion processes on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:19:50] PROBLEM - Check size of conntrack table on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:19:50] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [03:19:50] RECOVERY - Check size of conntrack table on wtp1015 is OK: OK: nf_conntrack is 3 % full [03:19:50] RECOVERY - DPKG on wtp1015 is OK: All packages OK [03:19:51] PROBLEM - dhclient process on wtp1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:19:51] RECOVERY - Disk space on wtp1019 is OK: DISK OK [03:20:00] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.016 second response time [03:20:50] RECOVERY - salt-minion processes on wtp1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:20:52] RECOVERY - Disk space on wtp1007 is OK: DISK OK [03:20:52] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 9.600 second response time [03:20:53] !log ran: sudo salt 'scb*' cmd.run 'puppet agent --disable ; service changeprop stop' [03:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:21:11] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:21:22] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.672 second response time [03:21:40] RECOVERY - salt-minion processes on wtp1019 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [03:21:41] RECOVERY - configured eth on wtp1011 is OK: OK - interfaces up [03:21:50] RECOVERY - Check size of conntrack table on wtp1011 is OK: OK: nf_conntrack is 1 % full [03:21:50] RECOVERY - RAID on wtp1011 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:21:54] RECOVERY - parsoid disk space on wtp1011 is OK: DISK OK [03:21:55] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:21:55] RECOVERY - DPKG on wtp1011 is OK: All packages OK [03:22:00] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:22:01] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:22:11] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [03:22:35] PROBLEM - parsoid disk space on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:22:51] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:23:12] PROBLEM - Parsoid on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:30] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:23:44] RECOVERY - parsoid disk space on wtp1007 is OK: DISK OK [03:23:44] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:23:45] PROBLEM - salt-minion processes on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:23:45] PROBLEM - SSH on wtp1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:50] RECOVERY - SSH on wtp1007 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:24:01] RECOVERY - dhclient process on wtp1007 is OK: PROCS OK: 0 processes with command name dhclient [03:24:05] PROBLEM - parsoid disk space on wtp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:24:06] RECOVERY - Disk space on wtp1005 is OK: DISK OK [03:24:13] RECOVERY - Check size of conntrack table on wtp1007 is OK: OK: nf_conntrack is 1 % full [03:24:32] RECOVERY - DPKG on wtp1007 is OK: All packages OK [03:24:40] RECOVERY - puppet last run on wtp1007 is OK: OK: Puppet is currently enabled, last run 41 minutes ago with 0 failures [03:24:40] RECOVERY - configured eth on wtp1014 is OK: OK - interfaces up [03:24:45] RECOVERY - parsoid disk space on wtp1006 is OK: DISK OK [03:24:50] RECOVERY - dhclient process on wtp1014 is OK: PROCS OK: 0 processes with command name dhclient [03:25:01] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [03:25:01] PROBLEM - Disk space on wtp1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:25:01] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [03:25:10] RECOVERY - DPKG on wtp1014 is OK: All packages OK [03:25:10] PROBLEM - SSH on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:25:14] RECOVERY - parsoid disk space on wtp1010 is OK: DISK OK [03:25:14] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [03:25:14] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.260 second response time [03:25:15] RECOVERY - Disk space on wtp1014 is OK: DISK OK [03:25:25] RECOVERY - parsoid disk space on wtp1014 is OK: DISK OK [03:25:40] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [03:25:41] PROBLEM - changeprop endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [03:25:50] RECOVERY - Check size of conntrack table on wtp1005 is OK: OK: nf_conntrack is 1 % full [03:25:50] RECOVERY - RAID on wtp1007 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:25:50] RECOVERY - salt-minion processes on wtp1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:25:50] RECOVERY - configured eth on wtp1007 is OK: OK - interfaces up [03:25:51] RECOVERY - salt-minion processes on wtp1003 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [03:25:51] RECOVERY - configured eth on wtp1005 is OK: OK - interfaces up [03:26:01] RECOVERY - Disk space on wtp1002 is OK: DISK OK [03:26:01] RECOVERY - dhclient process on wtp1005 is OK: PROCS OK: 0 processes with command name dhclient [03:26:02] RECOVERY - Check size of conntrack table on wtp1014 is OK: OK: nf_conntrack is 2 % full [03:26:02] PROBLEM - changeprop endpoints health on scb2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.132, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [03:26:10] RECOVERY - RAID on wtp1014 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:26:11] RECOVERY - DPKG on wtp1003 is OK: All packages OK [03:26:11] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 58 minutes ago with 0 failures [03:26:11] RECOVERY - SSH on wtp1014 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:26:12] PROBLEM - Disk space on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:26:20] RECOVERY - configured eth on wtp1003 is OK: OK - interfaces up [03:26:20] RECOVERY - Disk space on wtp1016 is OK: DISK OK [03:26:21] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [03:26:22] RECOVERY - dhclient process on wtp1003 is OK: PROCS OK: 0 processes with command name dhclient [03:26:31] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:26:45] RECOVERY - parsoid disk space on wtp1003 is OK: DISK OK [03:26:54] RECOVERY - parsoid disk space on wtp1005 is OK: DISK OK [03:27:00] RECOVERY - Disk space on wtp1003 is OK: DISK OK [03:27:11] RECOVERY - SSH on wtp1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:27:11] PROBLEM - Parsoid on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:27:20] RECOVERY - Disk space on wtp1022 is OK: DISK OK [03:27:30] RECOVERY - salt-minion processes on wtp1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:27:31] RECOVERY - RAID on wtp1005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:27:31] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [03:27:41] RECOVERY - RAID on wtp1003 is OK: OK: no disks configured for RAID [03:27:51] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [03:27:52] RECOVERY - Check size of conntrack table on wtp1003 is OK: OK: nf_conntrack is 1 % full [03:27:53] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 55 minutes ago with 0 failures [03:27:53] RECOVERY - SSH on wtp1003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:27:53] RECOVERY - DPKG on wtp1005 is OK: All packages OK [03:27:54] !log restarted parsoid on all wtp* hosts to kill any requests from changepropagation service that may have been in flight [03:28:00] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [03:28:05] PROBLEM - parsoid disk space on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:28:05] PROBLEM - salt-minion processes on wtp1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:28:11] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [03:28:20] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [03:28:25] RECOVERY - parsoid disk space on wtp1002 is OK: DISK OK [03:28:25] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [03:28:25] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [03:28:25] RECOVERY - dhclient process on wtp1002 is OK: PROCS OK: 0 processes with command name dhclient [03:28:41] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [03:29:00] RECOVERY - DPKG on wtp1022 is OK: All packages OK [03:29:31] RECOVERY - RAID on wtp1002 is OK: OK: optimal, 1 logical, 2 physical [03:29:40] RECOVERY - configured eth on wtp1022 is OK: OK - interfaces up [03:29:50] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [03:30:11] RECOVERY - SSH on wtp1022 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:30:11] RECOVERY - configured eth on wtp1002 is OK: OK - interfaces up [03:30:11] RECOVERY - DPKG on wtp1002 is OK: All packages OK [03:30:11] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:30:12] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 4.881 second response time [03:30:20] RECOVERY - salt-minion processes on wtp1016 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [03:30:21] RECOVERY - salt-minion processes on wtp1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:30:25] RECOVERY - parsoid disk space on wtp1022 is OK: DISK OK [03:30:25] RECOVERY - RAID on wtp1022 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:30:44] RECOVERY - parsoid disk space on wtp1001 is OK: DISK OK [03:30:52] RECOVERY - Check size of conntrack table on wtp1022 is OK: OK: nf_conntrack is 1 % full [03:30:52] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [03:31:25] RECOVERY - Check size of conntrack table on wtp1016 is OK: OK: nf_conntrack is 0 % full [03:31:42] RECOVERY - DPKG on wtp1016 is OK: All packages OK [03:31:43] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:31:54] RECOVERY - RAID on wtp1016 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:32:03] RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures [03:32:03] RECOVERY - SSH on wtp1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:32:22] RECOVERY - SSH on wtp1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:32:22] RECOVERY - configured eth on wtp1016 is OK: OK - interfaces up [03:32:24] RECOVERY - dhclient process on wtp1016 is OK: PROCS OK: 0 processes with command name dhclient [03:32:32] RECOVERY - salt-minion processes on wtp1010 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [03:32:32] RECOVERY - dhclient process on wtp1010 is OK: PROCS OK: 0 processes with command name dhclient [03:32:46] RECOVERY - parsoid disk space on wtp1016 is OK: DISK OK [03:32:46] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:32:46] 
RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.169 second response time [03:33:02] RECOVERY - configured eth on wtp1010 is OK: OK - interfaces up [03:33:03] RECOVERY - DPKG on wtp1010 is OK: All packages OK [03:33:33] RECOVERY - RAID on wtp1010 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:33:34] RECOVERY - puppet last run on wtp1010 is OK: OK: Puppet is currently enabled, last run 39 minutes ago with 0 failures [03:33:43] RECOVERY - Check size of conntrack table on wtp1010 is OK: OK: nf_conntrack is 0 % full [03:34:03] RECOVERY - Disk space on wtp1010 is OK: DISK OK [03:34:12] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [03:34:34] RECOVERY - Check size of conntrack table on wtp1002 is OK: OK: nf_conntrack is 1 % full [03:34:42] RECOVERY - dhclient process on wtp1012 is OK: PROCS OK: 0 processes with command name dhclient [03:34:43] RECOVERY - SSH on wtp1010 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:34:54] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:34:54] RECOVERY - SSH on wtp1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:35:02] RECOVERY - salt-minion processes on wtp1006 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [03:35:02] RECOVERY - DPKG on wtp1006 is OK: All packages OK [03:35:03] RECOVERY - Disk space on wtp1006 is OK: DISK OK [03:35:03] RECOVERY - configured eth on wtp1012 is OK: OK - interfaces up [03:35:12] RECOVERY - dhclient process on wtp1006 is OK: PROCS OK: 0 processes with command name dhclient [03:35:17] RECOVERY - parsoid disk space on wtp1012 is OK: DISK OK [03:35:33] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [03:35:43] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [03:35:54] RECOVERY - Check size of conntrack table on wtp1006 is OK: OK: nf_conntrack is 1 % full [03:36:02] RECOVERY - DPKG on wtp1012 is OK: All packages OK [03:36:12] RECOVERY - Disk space on wtp1001 is OK: DISK OK [03:36:22] RECOVERY - Disk space on wtp1012 is OK: DISK OK [03:36:22] RECOVERY - salt-minion processes on wtp1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:36:23] RECOVERY - configured eth on wtp1006 is OK: OK - interfaces up [03:36:23] RECOVERY - Check size of conntrack table on wtp1012 is OK: OK: nf_conntrack is 1 % full [03:36:44] RECOVERY - RAID on wtp1006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:37:12] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:37:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:41:32] RECOVERY - SSH on wtp1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:41:43] RECOVERY - configured eth on wtp1001 is OK: OK - interfaces up [03:42:02] RECOVERY - dhclient process on wtp1001 is OK: PROCS OK: 0 processes with command name dhclient [03:42:13] RECOVERY - salt-minion processes on wtp1001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [03:42:22] RECOVERY - DPKG on wtp1001 is OK: All packages OK [03:42:53] RECOVERY - Check size of conntrack table on wtp1001 is OK: OK: nf_conntrack is 0 % full [03:43:02] RECOVERY - RAID on wtp1001 is OK: OK: no disks configured for RAID [03:43:22] 
RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 58 minutes ago with 0 failures [03:44:13] RECOVERY - salt-minion processes on wtp1020 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [03:44:14] RECOVERY - configured eth on wtp1020 is OK: OK - interfaces up [03:44:23] RECOVERY - dhclient process on wtp1020 is OK: PROCS OK: 0 processes with command name dhclient [03:44:43] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:44:43] RECOVERY - RAID on wtp1020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:44:52] RECOVERY - Check size of conntrack table on wtp1020 is OK: OK: nf_conntrack is 0 % full [03:45:06] RECOVERY - parsoid disk space on wtp1020 is OK: DISK OK [03:45:15] RECOVERY - parsoid disk space on wtp1019 is OK: DISK OK [03:45:22] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 51 minutes ago with 0 failures [03:45:32] RECOVERY - DPKG on wtp1020 is OK: All packages OK [03:45:34] RECOVERY - puppet last run on wtp1019 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [03:45:42] RECOVERY - DPKG on wtp1019 is OK: All packages OK [03:45:53] RECOVERY - salt-minion processes on wtp1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:46:03] RECOVERY - Disk space on wtp1020 is OK: DISK OK [03:46:14] RECOVERY - RAID on wtp1019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:46:33] RECOVERY - SSH on wtp1019 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:46:43] RECOVERY - configured eth on wtp1019 is OK: OK - interfaces up [03:46:44] RECOVERY - Disk space on wtp1019 is OK: DISK OK [03:46:52] RECOVERY - dhclient process on wtp1019 is OK: PROCS OK: 0 processes with command name dhclient [03:47:22] RECOVERY - Check size of conntrack table on wtp1019 is OK: OK: nf_conntrack is 1 % full [03:57:32] RECOVERY - Disk space on wtp1004 is OK: DISK OK [03:58:14] RECOVERY - configured eth on wtp1004 is OK: OK - interfaces up [03:58:33] RECOVERY - dhclient process on wtp1004 is OK: PROCS OK: 0 processes with command name dhclient [03:58:33] RECOVERY - RAID on wtp1004 is OK: OK: no disks configured for RAID [03:58:33] RECOVERY - Check size of conntrack table on wtp1004 is OK: OK: nf_conntrack is 0 % full [03:58:43] RECOVERY - SSH on wtp1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:58:55] RECOVERY - parsoid disk space on wtp1004 is OK: DISK OK [03:58:57] RECOVERY - DPKG on wtp1004 is OK: All packages OK [03:59:12] RECOVERY - salt-minion processes on wtp1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:59:12] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [03:59:53] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.006 second response time [03:59:55] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2266425 (10Papaul) [04:01:12] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures [04:11:18] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 3 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2266427 (10BBlack) [04:26:23] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [04:39:10] (03PS1) 10Tim 
Landscheidt: Tools: Install xml2 on execution nodes [puppet] - 10https://gerrit.wikimedia.org/r/287045 (https://phabricator.wikimedia.org/T134146) [04:54:17] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2266486 (10Dzahn) Yea, agree. That is an option to solve this and then T132103, and then see again for a longer term solution. [04:55:26] !log restarting elasticsearch server elastic1007.eqiad.wmnet (T110236) [04:55:27] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [04:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:56:51] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2266489 (10Slaporte) It won't work to move from WordPress unless there is a system that will allow us to create/edit pages without editing HTML. I'm not necessarily a... [05:08:52] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]BR [05:11:11] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Transit: NTT (service ID 253066) {#11376} [10Gbps]BR [05:14:57] 06Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2266493 (10Dzahn) ping, we need reviews/merge for https://gerrit.wikimedia.org/r/#/c/280644/ [05:28:22] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [05:31:16] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2259160 (10Dzahn) Yes, familiar with this issue. Next time you see it and Icinga reports a bunch of services as down, and they have in common they are all on ganeti VMs and one of them is alsafi, just do "ssh alsafi"... [05:32:57] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2266522 (10Dzahn) 0
  • 09:10 moritzm: powercycled alsafi (stuck in KVM)
  • 02:21 mutante: ssh alsafi
  • 23:56 mutante: ssh alsafi
  • 17:57 mutante: ssh alsafi fixes ganeti VM timeouts once again
  • ... [05:36:39] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [05:42:39] gehel: ^^^ [05:43:59] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2266526 (10elukey) [05:44:35] (03PS1) 10Ori.livneh: Add a parameter to service::node for enabling or disabling the service unit [puppet] - 10https://gerrit.wikimedia.org/r/287050 [05:46:22] paravoid: thanks, I'll have a look [05:50:38] (03PS1) 10Ori.livneh: Disable changeprop service [puppet] - 10https://gerrit.wikimedia.org/r/287051 [05:52:30] ACKNOWLEDGEMENT - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] Gehel seems to be a transient issue, already starting to recover [05:52:49] !log restarting elasticsearch server elastic1008.eqiad.wmnet (T110236) [05:52:49] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [05:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:53:08] (03CR) 10Faidon Liambotis: [C: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/287050 (owner: 10Ori.livneh) [05:53:26] (03CR) 10Faidon Liambotis: [C: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/287051 (owner: 10Ori.livneh) [05:53:59] I think it will still ensure=>running ^ [05:55:22] argh. enabled determines whether init should start it at boot or not, not whether puppet should ensure it is running or not, right? [05:55:35] right [05:55:39] correct, but why would it ensure => running? [05:55:49] we want ensure => stopped at the Service level [05:56:00] right [05:56:01] ah, service_unit idiosyncracy [05:56:18] service_params will override the one in service_unit [05:56:24] "in other words 'present' will ensure => running and conversely 'absent' will ensure => stopped." [05:56:41] yeah [05:56:50] but you can override in service_params with ensure at that level [05:57:21] should i overload service::node's enabled param so that it controls both the ensure and the enabled params for the underlying service_unit? [05:57:25] or would that be confusing? [05:57:41] it sounds fine to me, not confusing at all [05:57:48] it's actually what I'd expect it to do [05:57:57] if you override ensure at the service_unit level, it will actually remove the unit file right? [05:58:16] bblack: different layer of abstraction! :) [05:59:02] (I think) ori means thatservice::node { enable => false } should pass both ensure => stopped and enable => false to service_unit's service_params [05:59:07] I'd make the new 'enable' of service_node just also control ensure=>running|stopped directly in service_params [05:59:19] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2266532 (10jayvdb) How many pages are on the site, and how frequently are they updated? [05:59:20] yeah that [05:59:30] I think that's what he said -- or at least I interpreted it that way [05:59:43] ok, I didn't, but we're obviously on the same page :) [06:00:31] yep [06:00:34] patch in a sec [06:00:40] can just do a followup and puppet-merge it all together [06:01:04] hey while at it [06:01:12] can we also remove the monitoring::check? 
[06:01:29] I can do a separate commit for that too [06:02:41] monitoring::service/nrpe::monitor_service both have an "ensure" [06:03:39] not sure what you're talking about exactly (haven't looked) so I'll leave it to you [06:06:09] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2266539 (10jcrespo) @Krenair You should ping labs admins. [06:06:29] my connection is so slow [06:07:21] (03PS1) 10Faidon Liambotis: service::node: remove monitoring on $enable=false [puppet] - 10https://gerrit.wikimedia.org/r/287053 [06:07:24] that's the monitoring part [06:07:42] want me to do the ensure running part too? [06:08:42] (03PS2) 10Faidon Liambotis: service::node: remove monitoring on $enable=false [puppet] - 10https://gerrit.wikimedia.org/r/287053 [06:08:44] (03PS1) 10Faidon Liambotis: service::node: ensure => stopped on $enable=false [puppet] - 10https://gerrit.wikimedia.org/r/287054 [06:08:48] there [06:10:02] (03PS1) 10Ori.livneh: Make service::node's "enable" param control the service_unit's ensure value [puppet] - 10https://gerrit.wikimedia.org/r/287055 [06:10:21] yours is better, I forgot service_ensure() handles boolean values [06:10:25] I'll abandon [06:10:32] that' [06:10:36] that's not the only difference [06:11:05] mhm [06:11:17] (03CR) 10jenkins-bot: [V: 04-1] Make service::node's "enable" param control the service_unit's ensure value [puppet] - 10https://gerrit.wikimedia.org/r/287055 (owner: 10Ori.livneh) [06:11:20] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [06:11:21] I guess both approaches are valid, mine keeps the service unit in place [06:11:43] I think it's better that way, more in line with the advertised description of the new enable param [06:12:01] I guess so [06:12:13] (03Abandoned) 10Ori.livneh: Make service::node's "enable" param control the service_unit's ensure value [puppet] - 10https://gerrit.wikimedia.org/r/287055 (owner: 10Ori.livneh) [06:12:19] (03CR) 10BBlack: [C: 031] service::node: ensure => stopped on $enable=false [puppet] - 10https://gerrit.wikimedia.org/r/287054 (owner: 10Faidon Liambotis) [06:12:31] PROBLEM - RAID on dbstore1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [06:12:39] (03CR) 10Ori.livneh: [C: 031] service::node: ensure => stopped on $enable=false [puppet] - 10https://gerrit.wikimedia.org/r/287054 (owner: 10Faidon Liambotis) [06:12:42] (03CR) 10BBlack: [C: 031] service::node: remove monitoring on $enable=false [puppet] - 10https://gerrit.wikimedia.org/r/287053 (owner: 10Faidon Liambotis) [06:12:48] (03CR) 10Faidon Liambotis: [C: 032] service::node: ensure => stopped on $enable=false [puppet] - 10https://gerrit.wikimedia.org/r/287054 (owner: 10Faidon Liambotis) [06:12:55] <_joe_> ouch [06:13:01] <_joe_> paravoid: I'm not sure that will work [06:13:06] what won't? [06:13:09] it seems like it should [06:13:11] <_joe_> see docs for base::service_unit [06:13:33] $params = merge($base_params, $service_params) [06:13:33] yeah but this is in service_params, which is merged()'d on after your implicit ensure=>running [06:13:58] (03CR) 10Faidon Liambotis: [C: 032] service::node: remove monitoring on $enable=false [puppet] - 10https://gerrit.wikimedia.org/r/287053 (owner: 10Faidon Liambotis) [06:14:09] <_joe_> paravoid: puppet merge() does left-merge or right-merge? 
[06:14:11] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 4 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [06:14:14] right wins [06:14:21] what bblack said [06:14:26] <_joe_> ok :) [06:14:31] <_joe_> just woke up :P [06:14:33] only way to find out :) [06:14:42] <_joe_> paravoid: the compiler would :P [06:14:46] except when the future parser is enabled, in which case it depends on the day of the week [06:14:47] <_joe_> (show you) [06:14:53] <_joe_> ori: ahahahah [06:14:59] haha [06:15:03] better than phase of the moon I guess [06:15:04] ori: I can reenable puppet on scb* [06:15:08] <_joe_> ori: the future parser is actually way better [06:15:12] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [06:15:18] paravoid: thanks, please do [06:15:30] <_joe_> what happened btw? [06:15:31] doing so [06:15:45] _joe_: https://etherpad.wikimedia.org/p/ve-2016-05-05 are the running notes [06:15:48] <_joe_> also, why are the 3 of you up at this time? [06:15:56] see the bottom of that pad, under ROOT CAUSE [06:16:03] postmortem will follow when the people involved get a night's sleep :) [06:16:11] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [06:16:15] yeah I'm off [06:16:21] o/ [06:16:21] me too shortly [06:16:38] I'm up because I slept early, hadn't put my phone in silent and haven't fixed my gmail filters to move icinga alerts out of inbox [06:16:39] <_joe_> oh wow [06:16:48] and I have configured my phone to beep on every email to inbox [06:17:12] from the observational point of view, this started with some restbase icinga alerts, noticing load + crazy graphs on RB.... and then suddenly a deluge of overloaded parsoid nodes failing all icinga checks with socket timeouts [06:17:15] "change propagation" lived up to its name [06:17:30] bblack: yeah I added a section on alerting/monitoring [06:17:38] this was a big fail from that regard [06:17:44] the only pages were for parsoid "disk space" [06:17:49] yeah [06:18:01] no useful pages, tons of useless alerts (and some of them pages) [06:18:02] but more to the point, all the alerts were about nodes [06:18:10] where's the alerts or pages about the actual service endpoints? [06:18:16] yup [06:19:38] by that I mean (I'm sure paravoid follows, but to be clear for all): [06:19:41] 02:39 < icinga-wm> PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:19:50] ^ that's a service check on a specific node [06:19:55] I added that as another bullet to that section [06:20:42] but we need an alert (which pages!) when restbase.svc.(eqiad|codfw).wmnet dies, in spite of LVS failing out whatever individual backend nodes [06:22:55] <_joe_> bblack: out abstraction about monitoring LVS endpoints is totally simplistic [06:23:14] <_joe_> and it is one of the myriad of things bugging me in the background [06:23:39] 06Operations, 10Analytics-Cluster, 10EventBus, 06Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. 
- https://phabricator.wikimedia.org/T123954#2266566 (10elukey) [06:23:43] 06Operations, 10EventBus, 10MediaWiki-Cache, 06Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#2266567 (10elukey) [06:23:47] 06Operations, 10ops-codfw, 06Analytics-Kanban, 06DC-Ops, and 5 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#2266565 (10elukey) 05Open>03Resolved [06:24:29] ok I'm out, good morning EU :) [06:24:39] <_joe_> bblack: good night :) [06:24:40] bye bblack :) [06:25:18] (03CR) 10Faidon Liambotis: [C: 04-1] ircserver: move ircd.conf to public repo (WIP) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) (owner: 10Dzahn) [06:26:02] (03CR) 10Faidon Liambotis: [C: 031] "Yup!" [puppet] - 10https://gerrit.wikimedia.org/r/286785 (https://bugzilla.wikimedia.org/134271) (owner: 10Dzahn) [06:31:30] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:56] (03CR) 10Faidon Liambotis: [C: 032] "LGTM. More ideas for this:" [puppet] - 10https://gerrit.wikimedia.org/r/286683 (owner: 10Alex Monk) [06:31:58] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:01] (03PS3) 10Faidon Liambotis: Make udpmxircecho conform to pep8 [puppet] - 10https://gerrit.wikimedia.org/r/286683 (owner: 10Alex Monk) [06:32:38] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:39] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: puppet fail [06:33:48] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:19] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:38] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:39] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [06:35:11] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2266575 (10Peachey88) What about [[ http://docs.getpelican.com/en/3.6.3/ | Pelican ]]? You can a variety of formats, Markdown being one example. I wonder if we could... [06:35:24] (03PS1) 10Elukey: Configure mc1009 with the latest memcached version as performance test. [puppet] - 10https://gerrit.wikimedia.org/r/287058 (https://phabricator.wikimedia.org/T129963) [06:36:59] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:40:18] PROBLEM - BGP status on cr1-eqord is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active [06:41:19] (this is planned maintenance fwiw) [06:42:38] I was about to ask :) [06:43:50] it's on the calendar [06:43:54] :) [06:44:54] (03PS2) 10Elukey: Configure mc1009 with the latest memcached version as performance test. [puppet] - 10https://gerrit.wikimedia.org/r/287058 (https://phabricator.wikimedia.org/T129963) [06:45:12] elukey: what's with kafka2001 puppet disabled? 
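(Aside, picking up bblack's earlier point that an alert should page when restbase.svc.(eqiad|codfw).wmnet itself dies rather than only when individual backends fail NRPE: a minimal manual probe of the service address might look like the following. The port 7231 and the bare-root path are assumptions about RESTBase's listener, not taken from this log:

    # Expect a 200 within an NRPE-style 10s budget; anything else should page
    curl -sS -o /dev/null -m 10 -w '%{http_code}\n' http://restbase.svc.eqiad.wmnet:7231/

Checking the LVS service name rather than a node captures exactly the failure mode discussed here: every backend overloaded at once, with nothing left for LVS to fail over to.)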
[06:46:43] paravoid: shouldn't be disabled, I think that it was a leftover from yesterday issue with eventbus [06:46:46] checking [06:47:40] ah yeah disabled by ottomata, re-enabling it [06:48:08] RECOVERY - BGP status on cr1-eqord is OK: BGP OK - up: 43, down: 0, shutdown: 0 [06:48:18] he was also doing load tests (it is the new event bus stack in codfw) [06:48:29] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet last ran 9 hours ago [06:48:59] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1621.70 Read Requests/Sec=1427.30 Write Requests/Sec=0.50 KBytes Read/Sec=33598.00 KBytes_Written/Sec=19.20 [06:49:13] that fermium thing is bacula backing it up, I checked [06:50:30] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:55:29] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:55:51] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:48] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:57:39] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:39] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:57:49] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:58:18] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:18] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:58] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=14.70 Read Requests/Sec=0.00 Write Requests/Sec=2.90 KBytes Read/Sec=0.00 KBytes_Written/Sec=71.20 [06:59:28] 06Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics, 10MediaWiki-extensions-ContentTranslation: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#2266608 (10Amire80) [07:15:49] 06Operations, 10ops-eqiad: dbstore1001 degraded RAID - https://phabricator.wikimedia.org/T134471#2266613 (10jcrespo) [07:16:06] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/2674/" [puppet] - 10https://gerrit.wikimedia.org/r/287058 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [07:16:36] ACKNOWLEDGEMENT - RAID on dbstore1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo T134471 [07:18:44] jynus: :) [07:18:58] jynus: there's an unacknowledged db1038 disk space alert too [07:19:10] db1038 ? [07:19:16] yes [07:19:35] yes, to be docomissioned [07:19:52] but holds still master data [07:20:30] problem is I cannot ack it because it flops due to binary logs [07:22:16] let me see if I can solve it with dark magic [07:23:11] I did, black magic worked [07:28:17] (03PS2) 10Giuseppe Lavagetto: Amend imagemagick policy to also include the URL decoder [puppet] - 10https://gerrit.wikimedia.org/r/286790 (owner: 10Muehlenhoff) [07:28:24] black magic being? [07:29:59] deleted unused files [07:30:28] haha [07:30:32] ok that works I guess :) [07:30:43] well, you must now which files to delete [07:30:50] (do you know the joke?) 
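(Aside: the "black magic" above boiled down to finding and deleting unused files so the disk-space check on the old master would stop flapping. A generic triage sketch for that kind of cleanup, assuming the data lives under /srv and that you already know which files are safe to remove:

    # Largest files and directories on the data filesystem, biggest first
    du -ahx /srv 2>/dev/null | sort -rh | head -n 20
    df -h /srv

)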
[07:31:10] no [07:31:27] <_joe_> jynus: actually, one interview question I ask is based on someone not knowing which files to delete :P [07:33:13] http://www.i18nguy.com/engineers.html #4 [07:37:15] you know some colleague used to create a 1GB rubish file in order to delete it if he run out of db space? [07:37:17] (03PS4) 10Alexandros Kosiaris: hhvm: allow passing service parameters [puppet] - 10https://gerrit.wikimedia.org/r/269946 (https://phabricator.wikimedia.org/T126594) (owner: 10Hashar) [07:37:25] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] hhvm: allow passing service parameters [puppet] - 10https://gerrit.wikimedia.org/r/269946 (https://phabricator.wikimedia.org/T126594) (owner: 10Hashar) [07:38:32] (03PS4) 10Alexandros Kosiaris: contint: disable HHVM background service [puppet] - 10https://gerrit.wikimedia.org/r/269947 (https://phabricator.wikimedia.org/T126594) (owner: 10Hashar) [07:38:39] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] contint: disable HHVM background service [puppet] - 10https://gerrit.wikimedia.org/r/269947 (https://phabricator.wikimedia.org/T126594) (owner: 10Hashar) [07:40:04] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0] [07:44:19] <_joe_> jynus: I would use reserved blocks [07:44:31] <_joe_> and use tune2fs to change it when needed :P [07:44:34] sorry, what? [07:44:57] <_joe_> if you are using an ext4 filesystem, you can set a % of block to be reserved to root [07:45:09] well, we kind of used to do the same with LVM [07:45:30] <_joe_> if you're using lvm, of course, you're ok with it [07:45:33] although technically we used that for snapshots, not for resizes [07:45:39] not anymore [07:46:07] no snapshots- xtrabackups is usually a better option for backups [07:51:29] 06Operations, 07HHVM, 07user-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#961386 (10Joe) [07:58:23] gehel: so, what's up with wdqs? [07:59:40] paravoid: Actually, I don't really know. We're investigating with Stas... We have some random freezes... There is a minor file descriptor leak, but it does not really explain the issue [08:03:11] paravoid: my feeling at this point is that the issue in inside of the JVM (blazegraph / jetty / ...) but we don't have much in term of metrics. I need to add some visibility there [08:05:50] (03PS1) 10Giuseppe Lavagetto: memcached: install a specific version on mc2009 [puppet] - 10https://gerrit.wikimedia.org/r/287060 [08:08:18] <_joe_> elukey: ^^ merge after you've uploaded the package to reprepro [08:09:37] _joe_ ack thanks [08:12:52] !log Restarted blazegraph on wdqs1002 (unresponsive) [08:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:13:58] 06Operations, 10DBA, 07Performance, 07RfC, 05codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523#2266721 (10jcrespo) I would not stick to Cassandra (reliability issues), but yes, outside of MySQL. [08:13:59] <_joe_> gehel, hoo: it seems to me that instabilities arose when we've depooled one of the two wdqs machines [08:19:45] (03CR) 10Merlijn van Deen: [C: 031] Tools: Install xml2 on execution nodes [puppet] - 10https://gerrit.wikimedia.org/r/287045 (https://phabricator.wikimedia.org/T134146) (owner: 10Tim Landscheidt) [08:47:20] _joe_: On localhost the service seems responsive [08:47:21] hm [08:54:23] _joe_: gehel: Here? 
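(Aside: _joe_'s alternative to the sacrificial-1GB-file trick is the ext4 reserved-blocks pool — keep a percentage of blocks reserved for root and release it with tune2fs when the filesystem fills up. A sketch, with the device name as a placeholder:

    # Inspect the current reservation, then drop it from the default 5% to 1%
    tune2fs -l /dev/sdXN | grep -i 'reserved block count'
    tune2fs -m 1 /dev/sdXN

Lowering the reservation frees space immediately without touching any data, which is the whole point of the trick.)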
[08:56:37] Would be nice if anyone could kick nginx there [08:57:17] !log Restarted blazegraph on wdqs1002 (unresponsive) [08:58:48] <_joe_> hoo: what are the symptoms? [08:58:51] <_joe_> let me see [08:59:18] <_joe_> hoo: do you think nginx is in a bad state? [08:59:25] <_joe_> and why? [09:00:55] <_joe_> nginx is perfectly responsive AFAICS [09:02:39] _joe_: Yeah, nginx itself is [09:02:44] it's responsive from localhost [09:02:50] but apparently not from the varnishes? [09:03:01] Kicking the service itself helps [09:03:07] but that's not nice [09:03:33] I would have looked into nginx, but I don't have the access on that box [09:04:42] <_joe_> hoo: I know nothing specific about the varnish setup for wdqs [09:04:44] 06Operations, 10DBA: Decomission old coredb machines (>=db1050) - https://phabricator.wikimedia.org/T134476#2266762 (10jcrespo) [09:04:58] <_joe_> but if this happens again, tell me before kicking the service [09:05:21] <_joe_> gehel should probably take a look too [09:05:23] 06Operations, 10DBA: Decomission old coredb machines (>=db1050) - https://phabricator.wikimedia.org/T134476#2266774 (10jcrespo) [09:05:27] _joe_: It will happen again [09:05:28] The symptoms are that it becomes unresponsive when talked to via varnish [09:05:36] 06Operations, 10DBA: Decomission old coredb machines (>=db1050) - https://phabricator.wikimedia.org/T134476#2266762 (10jcrespo) [09:05:49] <_joe_> hoo: can you provide me an url that becomes unresponsive? [09:06:03] (03PS1) 10Dereckson: Add tasnimnews.com & khamenei.ir to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287063 (https://phabricator.wikimedia.org/T134472) [09:06:23] <_joe_> hoo: I think this might have to do with the backend becoming unresponsive shortly for a specific url [09:06:40] <_joe_> did you try different urls when you find it to be unresponsive? [09:07:49] _joe_: Maybe it reaches some connection limit for varnish [09:07:50] or rather a busy worker limit [09:07:51] thus becomes unresponsive for that [09:08:03] I tried [09:08:11] I'm on a high latency connection right now [09:08:19] so expect out of order replies [09:08:42] curl 'localhost/bigdata/namespace/wdq/sparql?query=%23Catss%20-%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%0A%7B%0A%09%3Fitem%20wdt%3AP31%20wd%3AQ146%20.%20%0A%09SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D' -H 'Host: query.wikidata.org' [09:08:43] that works [09:09:07] <_joe_> hoo: ok just ping me when this happens again [09:09:08] but not via Varnihs [09:09:19] <_joe_> I'm writing perf reviews again this morning :( [09:09:20] I will [09:09:32] Well, I'll try [09:09:38] I have a talk about the query service in a bit [09:09:43] and then a workshop later on [09:09:50] but that seems increasingly unlikely :/ [09:10:17] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2445 [09:10:59] <_joe_> hoo: :/ [09:11:17] <_joe_> gehel: you here? 
[09:13:07] <_joe_> hoo: I think something is wrong at the varnish level indeed [09:13:17] <_joe_> I'm digging [09:13:28] _joe_: Thanks [09:13:32] TV wants to talk to me [09:13:34] so away for now [09:13:38] <_joe_> eheh ok [09:13:45] <_joe_> go be a superstar :P [09:17:14] <_joe_> interesting, cp1058 is the only machine where this happens in eqiad [09:19:33] <_joe_> x-cache:cp1058 hit(1), heh [09:21:50] (03PS1) 10Jcrespo: Depool db1023 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287066 (https://phabricator.wikimedia.org/T125028) [09:22:12] !log restbase deploy start of 2a3972a on canary restbase1008 [09:22:17] (03CR) 10Jcrespo: [C: 032] Depool db1023 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287066 (https://phabricator.wikimedia.org/T125028) (owner: 10Jcrespo) [09:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:22:49] (03Merged) 10jenkins-bot: Depool db1023 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287066 (https://phabricator.wikimedia.org/T125028) (owner: 10Jcrespo) [09:22:56] <_joe_> !log reloaded the backend varnsih config on cp1058 [09:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:23:09] <_joe_> hoo: seems to work now AFAICS [09:25:17] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 74989 Threads: 2 Questions: 2300300 Slow queries: 384 Opens: 448 Flush tables: 2 Open tables: 456 Queries per second avg: 30.675 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:26:22] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1023 for maintenance (duration: 02m 44s) [09:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:28:17] (03PS2) 10Yuvipanda: Tools: Install xml2 on execution nodes [puppet] - 10https://gerrit.wikimedia.org/r/287045 (https://phabricator.wikimedia.org/T134146) (owner: 10Tim Landscheidt) [09:28:36] (03CR) 10Yuvipanda: [C: 032 V: 032] "Thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/287045 (https://phabricator.wikimedia.org/T134146) (owner: 10Tim Landscheidt) [09:30:09] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Really depool db1023 for maintenance (duration: 02m 27s) [09:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:32:39] (03CR) 10Mobrovac: [C: 031] cxserver: scap3 migration [puppet] - 10https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104) (owner: 10KartikMistry) [09:33:12] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2266797 (10elukey) ``` elukey@copper:~/memcached-1.4.25$ debdiff ~/memcached_1.4.25-2.dsc /var/cache/pbuilder/result/jessie-amd64/memcached_1.4.25-2~wmf1.dsc dp... [09:33:36] _joe_: Thansk [09:35:03] <_joe_> hoo: still not sure what happened there, might happen again [09:38:05] maybe we exhausted some open connection limit? [09:38:07] checking mw2027 [09:38:16] either varnish side or nginx side [09:38:29] thus making it wait for stuck ones? 
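(The pattern _joe_ and hoo are chasing above, the same SPARQL request answering fast against the backend but hanging through the cache layer, is easiest to pin down by issuing one request both ways and keeping the X-Cache header to see which cp host answered. A sketch only; the trivial test query below is illustrative and not one of the real test queries.)

  QUERY='query=SELECT%20*%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%201'

  # on the wdqs host itself, against the local backend (as in hoo's curl above):
  curl -s -o /dev/null -w 'local: %{http_code} in %{time_total}s\n' \
    "http://localhost/bigdata/namespace/wdq/sparql?${QUERY}" -H 'Host: query.wikidata.org'

  # the same request through the edge caches, keeping only the status line and the cache header:
  curl -s -D - -o /dev/null "https://query.wikidata.org/bigdata/namespace/wdq/sparql?${QUERY}" \
    | grep -iE '^(HTTP|x-cache)'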
[09:40:25] console says: ������� [09:42:26] !log powercycling mw2027 [09:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:42:51] (03PS2) 10Elukey: memcached: install a specific version on mc2009 [puppet] - 10https://gerrit.wikimedia.org/r/287060 (owner: 10Giuseppe Lavagetto) [09:45:04] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 36.33 ms [09:49:36] !log restbase deploy end of 2a3972a [09:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:49:55] (03CR) 10Elukey: [C: 032] memcached: install a specific version on mc2009 [puppet] - 10https://gerrit.wikimedia.org/r/287060 (owner: 10Giuseppe Lavagetto) [09:54:30] !log installed memcached 1.4.25-2~wmf1 manually on mc2009 as part of T129963 [09:54:30] T129963: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963 [09:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:56:47] !log SET GLOBAL thread_pool_max_threads = 2000; on db1057 [09:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:59:12] !log SET GLOBAL thread_pool_stall_limit = 10; on db1057 [09:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:00:19] _joe_, hoo: sorry, I'm mostly not here... public holiday in Switzerland... [10:00:59] _joe_: what makes you think something is wrong at Varnish level? [10:03:53] <_joe_> gehel: oh sorry, didn't know [10:04:25] <_joe_> gehel: what hoo said: same query hanging on on varnish, and not when done directly to wdqs1002 [10:04:35] <_joe_> I think this might be related to caching [10:04:45] <_joe_> are we caching those queries on varnish, right? [10:05:03] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2266824 (10elukey) Installed the newly built version of memcached on mc2009, everything looks good. Puppet has been patched accordingly https://gerrit.wikimedia... [10:08:27] (03CR) 10Alexandros Kosiaris: "I am not in love with the confluent:: module name as from what I understand it's a vendor/platform and not a functional piece of software " (0314 comments) [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [10:08:42] <_joe_> gehel: so my working theory is that when a specific url gets requested and is cacheable, all subsequent requests for that url get stuck in a queue, and if our backend timeout for wdqs is large, this might explain it [10:08:50] 06Operations, 10DBA: Decomission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2266826 (10jcrespo) [10:09:00] <_joe_> but I'll do some digging when I'm done with other chores [10:10:13] _joe_: yes, we are caching on Varnish [10:10:51] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 've commented on PS8 for some reason. I did a diff between PS8 and PS11 however and my comments in PS8 still stand." [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [10:11:37] _joe_: we have a set of test queries (https://github.com/wikimedia/wikidata-query-rdf/tree/master/queries), some of them slow, some of them fast. 
When service is getting stuck, even the fast ones timeout [10:16:21] (03PS1) 10Alexandros Kosiaris: chromium: Remove url_downloader role [puppet] - 10https://gerrit.wikimedia.org/r/287070 [10:17:13] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] chromium: Remove url_downloader role [puppet] - 10https://gerrit.wikimedia.org/r/287070 (owner: 10Alexandros Kosiaris) [10:20:55] (03PS1) 10Alexandros Kosiaris: chromium: Remove url_downloader hiera and role [puppet] - 10https://gerrit.wikimedia.org/r/287071 [10:21:20] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] chromium: Remove url_downloader hiera and role [puppet] - 10https://gerrit.wikimedia.org/r/287071 (owner: 10Alexandros Kosiaris) [10:23:11] !log stopping mysql & backuping db1023 in preparation for reimage [10:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:24:50] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1023.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1023.eqiad.wmnet (111 Connection refused) [10:27:12] jynus: I've set downtime ^^^ [10:28:10] PROBLEM - url_downloader on hydrogen is CRITICAL: Connection refused [10:28:22] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [10:28:30] PROBLEM - url_downloader on chromium is CRITICAL: Connection refused [10:28:31] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:28:31] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:28:32] PROBLEM - url_downloader on alsafi is CRITICAL: Connection refused [10:28:51] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:28:51] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:28:55] taking a look at url_downloader [10:29:10] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:29:11] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:29:11] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:29:11] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:29:20] the citoid/restbase failures are related [10:29:21] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:29:40] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [10:29:41] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:29:52] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:29:52] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:30:00] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:30:02] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:31:04] (03PS1) 10Alexandros Kosiaris: Revert "chromium: Remove url_downloader hiera and role" [puppet] - 10https://gerrit.wikimedia.org/r/287072 [10:31:22] it's me [10:31:32] godog: I am the problem here [10:31:32] fixing [10:31:48] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "chromium: Remove url_downloader hiera and role" [puppet] - 10https://gerrit.wikimedia.org/r/287072 (owner: 10Alexandros Kosiaris) [10:32:03] akosiaris: ah! thanks, I missed your merge shortly before [10:32:30] big PEBKAC on my part [10:32:34] sorry :-( [10:33:17] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [10:33:17] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [10:33:18] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [10:33:19] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [10:33:19] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [10:33:25] of course restbase dying because urldownloader misbehaves is funny ... [10:33:28] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [10:33:38] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [10:33:48] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy [10:33:51] or at least the checks firing, even if restbase does not have problems [10:34:18] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy [10:34:19] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [10:34:28] RECOVERY - url_downloader on alsafi is OK: TCP OK - 0.021 second response time on port 8080 [10:34:28] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [10:34:28] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [10:34:29] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [10:34:38] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [10:34:57] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [10:35:28] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [10:36:15] indeed, looks like it makes nrpe time out, though the check timeout is 5s and nrpe timeout is 10s [10:40:32] !log fix problems with url_downloader created by PEBKAC [10:40:39] !log rebooted labsdb1004 [10:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:00:21] (03CR) 10Filippo Giunchedi: [C: 04-1] "note that current $skey usage also needs to be explicitly added to the ssh server configuration, e.g. ganeti" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [11:10:39] (03PS1) 10Elukey: Update memcached version on mc1009 as part of a performance test. 
[puppet] - 10https://gerrit.wikimedia.org/r/287073 (https://phabricator.wikimedia.org/T129963) [11:10:55] (03PS6) 10Volans: MariaDB: Set additional salt grains for core DBs [puppet] - 10https://gerrit.wikimedia.org/r/286303 (https://phabricator.wikimedia.org/T133337) [11:13:04] (03CR) 10Jcrespo: [C: 031] MariaDB: Set additional salt grains for core DBs [puppet] - 10https://gerrit.wikimedia.org/r/286303 (https://phabricator.wikimedia.org/T133337) (owner: 10Volans) [11:13:08] (03CR) 10Volans: [C: 032] MariaDB: Set additional salt grains for core DBs [puppet] - 10https://gerrit.wikimedia.org/r/286303 (https://phabricator.wikimedia.org/T133337) (owner: 10Volans) [11:15:13] akosiaris: what happened to labsdb1004? [11:18:56] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:18:56] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:20:45] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [11:20:45] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [11:22:19] --^ pageview api, cassandra timeouts [11:24:30] (03CR) 10Filippo Giunchedi: [C: 04-1] Automate the generation deployment keys (keyholder-managed ssh keys) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [11:26:15] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:27:57] gehel: _joe_: SMalyshev: Anyone looking at WDQS? [11:28:05] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [11:30:27] hoo: not looking closely at the moment, thanks for your ping! [11:30:51] gehel: Would really appreciate it to work for the next few hours [11:30:56] having a workshop about it right now [11:31:04] _joe_: you wanted to check a few things before I restart the service? [11:32:13] !log restarting blazegraph on wdqs1001 [11:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:33:29] Works again, thanks gehel [11:33:31] _joe_: it seems that blazegraph was stuck since 7:51am on wdqs1001. I restarted to restore service (that's the only thing I know how to do at the moment) [11:33:52] wdqs1002 is still stuck [11:33:53] gehel: Is 1001 pooled right now? [11:34:27] yes, both are pooled. We were suspecting that the instability was due to a single server not being able to handle the load, but it does not seem to help much [11:34:59] It still works blazingly fast locally [11:35:15] but talking to it via varnish is slow/ times out/ connection reset [11:37:22] strange, strange, strange... [11:37:49] Indeed :/ [11:38:07] I already posted my suggestions, but can't look into that [11:38:18] don't have enough access on the box to see what nginx is doing [11:40:26] (03PS4) 10ArielGlenn: base connection limit for dumps server on ip + user agent [puppet] - 10https://gerrit.wikimedia.org/r/285682 (https://phabricator.wikimedia.org/T133790) [11:40:48] hoo: did you post your suggestions on T134238 ? I can't find them... [11:40:49] T134238: Query service fails with "Too many open files" - https://phabricator.wikimedia.org/T134238 [11:40:53] gehel: No [11:40:55] Can do [11:41:13] hoo: That would be great! I can have a quick look right now...
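(T134238 above is about blazegraph running out of file descriptors, while nginx itself turns out to hold only a couple of established connections, as gehel notes a little further down. Counting descriptors and per-process sockets separates the two suspects quickly. A minimal sketch, assuming "blazegraph" appears on the java command line and that it is run with enough privileges to inspect other users' processes.)

  BLAZE_PID=$(pgrep -f blazegraph | head -1)
  ls /proc/"$BLAZE_PID"/fd | wc -l                  # descriptors currently held by blazegraph
  grep 'Max open files' /proc/"$BLAZE_PID"/limits   # the limit it will eventually hit

  ss -tnp state established | grep -c nginx         # established sockets owned by nginx
  ss -tnp state established | grep -c java          # established sockets owned by the JVM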
[11:41:46] (03PS5) 10ArielGlenn: base connection limit for dumps server on ip + user agent [puppet] - 10https://gerrit.wikimedia.org/r/285682 (https://phabricator.wikimedia.org/T133790) [11:42:55] (03CR) 10ArielGlenn: [C: 032] base connection limit for dumps server on ip + user agent [puppet] - 10https://gerrit.wikimedia.org/r/285682 (https://phabricator.wikimedia.org/T133790) (owner: 10ArielGlenn) [11:53:02] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2266901 (10faidon) >>! In T111654#2249183, @Volans wrote: > @faidon, given that: > - MySQL is (and should remain IMHO) an internal service only > - hence we are in physical control of all t... [11:53:33] YuviPanda: kernel upgrade. and mysql replication did not recover :-( [11:54:13] akosiaris: ouch :| [11:54:15] akosiaris: ok [11:56:41] gehel: Posted very quickly [11:56:58] Thanks for your help, keep me/ us updated [11:58:35] hoo: thanks for the comments. I'm going to have a look. Sadly I'm not going to be able to spend too much time on that today (officially on holiday) [11:58:46] gehel: Oh, didn't know that [11:58:52] We have a public holiday today as well [11:58:58] but am at a conference [11:59:10] conferences are almost holidays... [11:59:49] Depends on how many people want to talk to you [12:02:05] I doubt this is an issue of overflowing the number of nginx workers, at the moment, only 2 established TCP connections for nginx... [12:12:51] hoo, _joe_, SMalyshev: I don't have any great idea for wdqs. I added a few notes to T134238, sorry, but I have to go... [12:12:51] T134238: Query service fails with "Too many open files" - https://phabricator.wikimedia.org/T134238 [12:23:25] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=211.60 Read Requests/Sec=1954.10 Write Requests/Sec=110.10 KBytes Read/Sec=19330.40 KBytes_Written/Sec=10604.00 [12:28:58] (03CR) 10Alexandros Kosiaris: [C: 04-2] "skey (that is supplemental key) was never meant to be used that way. Changing the semantics of it changes the way we do ssh public key aut" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [12:33:35] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=12.40 Read Requests/Sec=0.10 Write Requests/Sec=1.10 KBytes Read/Sec=0.40 KBytes_Written/Sec=32.00 [12:45:57] (03PS3) 10Alexandros Kosiaris: uwsgi: Add python3 support [puppet] - 10https://gerrit.wikimedia.org/r/283492 [12:49:03] (03CR) 10Alexandros Kosiaris: "@Yuvipanda. done. Thanks for catching that.
Only one mention of uwsgi-plugin-python3 in the entire repo now and that's toollabs which seem" [puppet] - 10https://gerrit.wikimedia.org/r/283492 (owner: 10Alexandros Kosiaris) [12:49:09] (03PS4) 10Alexandros Kosiaris: uwsgi: Add python3 support [puppet] - 10https://gerrit.wikimedia.org/r/283492 [12:49:18] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] uwsgi: Add python3 support [puppet] - 10https://gerrit.wikimedia.org/r/283492 (owner: 10Alexandros Kosiaris) [12:49:26] akosiaris: cool :) [12:54:44] (03PS1) 10Alexandros Kosiaris: planet: Use the per site url-downloader instance [puppet] - 10https://gerrit.wikimedia.org/r/287077 [13:00:54] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: puppet fail [13:01:14] (03PS2) 10Alexandros Kosiaris: planet: Use the per site url-downloader instance [puppet] - 10https://gerrit.wikimedia.org/r/287077 [13:01:38] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "puppet compiler shows up the expected outcome" [puppet] - 10https://gerrit.wikimedia.org/r/287077 (owner: 10Alexandros Kosiaris) [13:05:53] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2267001 (10akosiaris) [13:13:38] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2267009 (10akosiaris) Unfortunately disk_aio=native did not solve the problem. It is not however possible yet to reproduce it reliably. Perhaps a qemu upgrade to a newer version (jessie-backports has 2.5, we are on 2... [13:14:56] !log set ganeti2006 as drained (excluded from allocation operations) [13:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:15:23] (03PS1) 10Mobrovac: Change Prop: Tell RESTBase not to respond with redirects [puppet] - 10https://gerrit.wikimedia.org/r/287080 (https://phabricator.wikimedia.org/T134483) [13:16:49] (03PS3) 10Mobrovac: Change prop: Add the rule for MobileApps re-renders [puppet] - 10https://gerrit.wikimedia.org/r/286847 [13:18:07] 06Operations, 06Performance-Team, 10Thumbor: Backport Thumbor dependencies - https://phabricator.wikimedia.org/T134485#2267027 (10Gilles) [13:20:03] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 664 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5763063 keys - replication_delay is 664 [13:20:06] 06Operations, 06Performance-Team, 10Thumbor: Backport Thumbor dependencies - https://phabricator.wikimedia.org/T134485#2267058 (10Gilles) [13:25:18] 06Operations, 06Performance-Team, 10Thumbor: Backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2267084 (10Gilles) [13:25:38] (03PS1) 10Alexandros Kosiaris: phab::vcs: Add a proxy parameter [puppet] - 10https://gerrit.wikimedia.org/r/287081 [13:29:25] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:33:08] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5753760 keys - replication_delay is 0 [13:46:37] hey akosiaris morning, yt? [13:46:46] good (my) morning* :) [13:46:53] I was about to comment on that [13:46:57] yeah I am around [13:47:29] most comments I can just do, got 2/3 qs it'd be easier to talk to you real quick about [13:47:30] so [13:47:43] 2 are about preference that I keep going back and forth on [13:48:00] so, for the server.properties file, re. 
the if vs ternary blocks [13:48:12] they actually do different things, but i'm not sure which is better [13:48:20] the if blocks don't render anything out of the var is not set [13:48:22] (03PS8) 10Andrew Bogott: Split centralized labs pdns database into two different local DBs. [puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) [13:48:23] which makes the file shorter [13:48:48] the ternary leaves the property in the file, but just commented out [13:49:09] which might be better in some cases, because it is easier for a cursory reader to see possible config settings [13:49:32] yup. I am fine with both. Both have their pros and cons. Apart from being consistent, I have no special preference on that one [13:50:06] yeah, the reason i wasn't was because it made those sections funky with no values...e.g. there's this Log Retention Policy section that if I did the if block, would have nothing in it [13:50:18] but, maybe I can just wrap the whole section in a conditional :) [13:50:51] hm, yeah maybe i'll just do that :) [13:51:00] I'm about to start a labs maintenance task which will disable Jenkins for a while. So get your merges in now if you need 'em. [13:51:03] ok, typing out loud seems to help me, i'mma keep going! [13:51:05] ok other one [13:51:13] was re Class -> style dependencies vs require class [13:51:24] i did the Class -> Class on purpose [13:51:51] because I wanted puppet to fail if e.g. the ::broker class was not included elsewhere [13:52:13] and, you def can't do require if the class had parameters without default values (this one does not, so it would work) [13:52:36] i don't mind doing the require ::client class, because the default values there will almost always be fine [13:52:43] or if the values are not looked up via hiera [13:52:48] aye true. [13:52:49] hm [13:53:06] well, in labs it gets funky, but still the same [13:53:09] so, lemme get this straight. The class should fail if the ::broker class is not included, right ? [13:53:16] well, that was my intention [13:53:19] but i suppose it doesn't have to be that way [13:53:37] hmm [13:53:39] yeah, it probably should [13:53:40] because [13:53:40] so, why not avoid that case in the first place and just include the class ? [13:53:48] I am trying to understand what the use case is [13:53:49] for these classes (::jmxtrans, ::alerts) [13:54:18] its possible that someone might see them and think: Oh, i'll just include those over on this node over here on some non kafka broker node [13:54:26] in that case, it would be better to get a failure [13:54:40] rather than have puppet install kafka and for the person to have to clean that up [13:54:48] PROBLEM - NTP on pybal-test2001 is CRITICAL: NTP CRITICAL: Offset -25.96152508 secs [13:54:56] i dunno, maybe that's a contrived problem, not sure [13:55:13] I am thinking you are solving a problem it does not really exist tbh [13:55:19] yeah maybe so [13:55:22] trying to protect someone from himself [13:55:26] hehe [13:55:45] !log downtimed labservices1001, holmium, labcontrol1001 for one hour, disabled puppet as per https://phabricator.wikimedia.org/T128737 [13:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:56:02] many of the cdh classes work this way, and its def better there because there are many without default values for some parameters. but i think in this case it doesn't matter so much [13:56:10] (03CR) 10Andrew Bogott: [C: 032] Split centralized labs pdns database into two different local DBs. 
[puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) (owner: 10Andrew Bogott) [13:56:11] so, if you prefer a require, I am fine with it [13:56:13] will do... [13:56:13] :) [13:56:16] ok [13:56:16] ok the last thing [13:56:18] confluent name [13:56:38] i named it this way because: the configs here are fairly confluent package specific, and, we will probably use more confluent packages in the future [13:57:15] it has been very useful to have all of the cdh packaged classes in a single module, because then it is safer to refer to variables from one class in another in the same module. [13:58:00] yeah I don't feel really strong on this one. Not like I have a better alternative anyway. it's just that confluent seems like an umbrella for software that is written by others [13:58:35] they package it and that is admirable but still.. it feels this way to me. [13:58:46] I dunno, it's probably just me. I 'd say ignore me on this one [13:58:50] oook [13:58:57] thanks akosiaris, will submit another patch in a few [13:59:01] ok [13:59:04] thanks as well [13:59:10] chasemp, jynus, ready for our dns db changes? [13:59:36] yes, I have a meeting at 10 so if I drop off don't be worried I just want to keep pace and help out so I'm in the loop [14:00:04] andrewbogott: Dear anthropoid, the time has come. Please deploy Labs DNS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160505T1400). [14:00:25] 10 my time is in an hour :) [14:00:25] ok, I'm downtiming and disabling puppet everywhere... [14:01:08] I am ready, technically, I do not have to do anything, only support if anything goes wrong [14:01:20] let me check he latest backups just in case [14:02:00] jynus: there's a step down the road which is 'disable replication and make things r/w/' — I'm hoping you'll do that bit. But it'll be a few minutes. [14:02:40] !log stopping nova-api on labnet1002 [14:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:00] true, andrewbogott [14:07:46] chasemp: ok, I'm at this step: "dig tests @ labs-ns1.wikimedia.org and labs-recursor1.wikimedia.org" [14:07:52] want to check my work? Looks fine to me [14:07:54] yep [14:08:16] (03PS2) 10Elukey: Update memcached version on mc1009 as part of a performance test. [puppet] - 10https://gerrit.wikimedia.org/r/287073 (https://phabricator.wikimedia.org/T129963) [14:09:04] andrewbogott: seems to be functioning [14:09:14] icinga checks are ok too [14:09:29] and the config says gmysql-host=localhost [14:09:44] * andrewbogott restarts pdns just to be sure we know what we're getting... [14:10:03] yep, seems fine [14:10:21] so, now labservices [14:10:41] do I disconnect holmium? [14:10:47] jynus: not yet [14:10:50] ok [14:11:02] checklist is here: https://phabricator.wikimedia.org/T128737 [14:11:10] the backups are fine, BTW [14:11:14] hey akosiaris, for ensure_service on base::service_unit i got [14:11:19] oh, yeah, good :) [14:11:20] $ensure must be "present" or "absent" (got: "running"). 
at /etc/puppet/modules/base/manifests/service_unit.pp:79 [14:11:23] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2267138 (10Krenair) @andrew, @yuvipanda, @chasemp: ^ [14:11:41] i think that must just work for the service resource [14:11:43] not base::service_unit [14:12:30] ottomata: hmm lemme check it once more [14:12:41] (03CR) 10Elukey: "Verified that changes will be applied only to mc1009:" [puppet] - 10https://gerrit.wikimedia.org/r/287073 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [14:12:44] k [14:13:59] chasemp: ok, labs-ns0 and labs-ns1 are both backed by local r/o databases. Looks fine, right? [14:14:06] looking [14:14:44] databases are in reality rw, but that doesn't matter much [14:15:11] ok — yeah, I think it doesn't matter until I restart designate services [14:15:12] ottomata: ah indeed. The stanza I mentioned work in the service_params hash which is a parameter of base::service_unit [14:15:28] RECOVERY - NTP on pybal-test2001 is OK: NTP OK: Offset -0.00665140152 secs [14:15:43] the idea is they are still replicationg now from m5 [14:15:47] so I see a connection between mysql instances 'tcp ESTAB 0 0 208.80.154.12:45867 10.64.0.13:mysql users:(("mysqld",15595,46))' which is ok I think as we severe that next? [14:15:48] sigh.. too many ensures [14:15:49] ah yes :) [14:16:27] andrewbogott: all seems well [14:16:36] akosiaris: haha, yeah [14:16:48] ok, jynus, next step is "disable db replication, make pdns dbs read/write on both holmium and labservices1001" — so, over to you :) [14:16:50] i guess i'll just leave it the way I had it then? [14:16:50] akosiaris: around? [14:16:59] andrewbogott, doing now [14:17:00] (03CR) 10Alexandros Kosiaris: Add new confluent module and puppetization for using confluent Kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [14:17:05] ottomata: yup [14:17:10] kart_: yup [14:17:27] (03PS12) 10Ottomata: Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [14:17:29] ok, tested in labs, new patch ^ [14:17:46] akosiaris: re: Apertium->Jessie. You mentioned some script for mass-building? (or did I misunderstood). [14:18:09] akosiaris: I suggested, we rebuild all Apertium packages, fix issues and migrate. [14:18:11] andrewbogott, done for labservices1001.wikimedia.org and holmium [14:18:20] jynus: ok! [14:18:21] kart_: no, not really. there is not such thing [14:18:24] I can continue work on official backport. [14:18:35] I'll restart designate services and we'll see how they do [14:18:35] (ive stopped the replication, although not completelly reseted just in case we need to restart it) [14:18:37] akosiaris: so, manual upload? [14:19:07] kart_: more or less. If you got a set of gerrit repos we can probably automate the building a bit though [14:19:12] there is package_builder [14:19:35] kart_: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/package_builder/README.md [14:19:59] which makes our lives easier by automating a lot of stuff, but it is not automating mass building per se [14:20:28] but wrapping around calls to says pdebuild if you got a list of gerrit repos should be doable [14:20:48] s/says/,say/ [14:21:29] kart_: using .dsc directly may work too. 
But that's just one step less than pdebuild, not much else. [14:21:44] and assumes we got the packages already built for at least another distro [14:21:50] which we do, right ? [14:22:02] so it might make sense to do that [14:22:42] chasemp: I just created newdbtest.testlabs.wmflabs.org [14:22:51] it looks to me like it's working properly on ns0 but not on ns1 [14:22:51] via horizon? [14:22:55] yeah [14:22:57] kk [14:23:28] agreed labs-ns1 can't resolve it [14:23:44] * andrewbogott sighs [14:23:45] interestingly returns [14:23:47] ;; AUTHORITY SECTION: [14:23:47] testlabs.wmflabs.org. 120 IN SOA labs-ns0.wikimedia.org. root.wmflabs.org. 1459362189 3600 600 86400 3600 [14:23:55] so something about our designate config isn't what we want [14:24:04] I guess if anything was going to go wrong it would be this [14:24:37] do you know much about how axfr works? Does it use a different port that we need to open? [14:25:37] (03CR) 10Ottomata: Add new confluent module and puppetization for using confluent Kafka (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [14:26:38] andrewbogott: I think there is another port but I'm looking [14:26:49] holmium (aka ns1) is getting the notification [14:26:51] but not updating [14:26:52] but I also thought it was already setup from before [14:26:53] ah [14:27:03] how did you verify? [14:27:08] I can see it in the logs [14:27:15] holmium says: [14:27:18] https://www.irccloud.com/pastebin/ir5Icldo/ [14:27:23] whereas labservices1001 says: [14:27:35] https://www.irccloud.com/pastebin/GKjoDmQy/ [14:27:54] well, wait, '0 queued' is weird [14:27:58] ottomata: https://gerrit.wikimedia.org/r/#/c/284349/8..12/modules/confluent/templates/kafka/server.properties.erb,unified line 1 [14:28:13] that hm there at the beginning of line one is probably not what we want, right ? :P [14:28:38] andrewbogott: I'm not convinced this part was working before https://phabricator.wikimedia.org/T124680#2252726 [14:28:44] which isn't good still but is interesting [14:28:48] !log memcached 1.4.25-2~wmf1 manually installed on mc1009 as part of a performance test (T129963) [14:28:49] T129963: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963 [14:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:29:05] tldr holmium had issues w/ axfr already I think and was maybe surviving on labservices1001 table scrap updates [14:29:26] chasemp: I can't tell what you're linking to, but, yes — I suspect that holmium wasn't doing xfr properly and was just working due to labservices updating the DB. If that's what you're saying :) [14:29:55] in taht comment "Does Holmium still have ongoing issues connectivity wise?" and below [14:29:59] (I know its' a lot of stuff) [14:30:04] (03PS1) 10Volans: MariaDB: set mysql_role to standalone for es1 [puppet] - 10https://gerrit.wikimedia.org/r/287088 (https://phabricator.wikimedia.org/T133337) [14:30:12] but either way just an indicator it might not be designate config, it could be fw not sure yet [14:30:26] it would at least make sense as carryover from previous setup [14:30:31] yeah [14:30:35] andrewbogott: do we have a way to trigger an update? [14:30:39] looks like everything is on port 53 though... 
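(For the Apertium backports akosiaris and kart_ discussed a little above, the .dsc-based flow being described is roughly: unpack the existing source package and rebuild it per target distro in a clean chroot via package_builder/pdebuild. A sketch only; the package name and version are placeholders, and the exact distro selection should be checked against the package_builder README linked earlier.)

  dpkg-source -x apertium-foo_1.0-1.dsc    # placeholder .dsc, already built for another distro
  cd apertium-foo-1.0
  DIST=jessie pdebuild                      # rebuild for jessie in a clean chroot (see the README for exact usage)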
[14:30:47] chasemp: they're goingin in periodically as far as I can tell [14:30:53] but certainly creating a record is an easy way [14:31:30] pm'd an idea that may be sensitive [14:32:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add new confluent module and puppetization for using confluent Kafka (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [14:33:55] PROBLEM - Host mw1062 is DOWN: PING CRITICAL - Packet loss = 100% [14:33:55] PROBLEM - Host mw1063 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:12] ? [14:34:21] akosiaris: it works [14:34:41] it does look weird but it works [14:34:42] :) [14:34:46] ottomata: ok then. that leaves the leading "hm" then [14:34:54] op haha [14:34:56] wha? [14:34:57] weird [14:35:02] (03CR) 10Elukey: [C: 032] Update memcached version on mc1009 as part of a performance test. [puppet] - 10https://gerrit.wikimedia.org/r/287073 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [14:35:18] (03PS13) 10Ottomata: Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [14:35:20] sorry bout that ^ [14:36:06] PROBLEM - Host mw1061 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:22] (03CR) 10Alexandros Kosiaris: [C: 031] Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [14:36:39] thanks akosiaris we are going to schedule upgrade for analytics kafka cluster next week [14:37:21] jynus: we have a fix, I'm writing a patch [14:38:12] so, should I reset the slaves (drop replication completelly) or should I let it be like this for now? [14:38:23] ottomata: you 're welcome [14:38:28] let's say replication is now "paused" [14:38:51] it would restart on restart [14:39:00] !log memcached on mc1009 is running now with new parameters only available for 1.4.25-2~wmf1 - part of a performance test (T129963) [14:39:01] T129963: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963 [14:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:17] (03PS2) 10BBlack: tlsproxy: no AE:gzip forcing for HTTP/2 [puppet] - 10https://gerrit.wikimedia.org/r/287038 [14:39:28] also replication on labtestservices2001 is still running [14:39:39] (03CR) 10BBlack: [C: 032 V: 032] "Confirmed this is the correct behavior." 
[puppet] - 10https://gerrit.wikimedia.org/r/287038 (owner: 10BBlack) [14:41:12] (03PS2) 10Cmjohnson: dhcp: remove entries for decommissioned appservers [puppet] - 10https://gerrit.wikimedia.org/r/285605 (https://phabricator.wikimedia.org/T126242) (owner: 10Giuseppe Lavagetto) [14:42:10] (03CR) 10Cmjohnson: [C: 032] dhcp: remove entries for decommissioned appservers [puppet] - 10https://gerrit.wikimedia.org/r/285605 (https://phabricator.wikimedia.org/T126242) (owner: 10Giuseppe Lavagetto) [14:42:37] !log restarted memcached on mc1008 as part of a performance test (T129963) [14:42:37] T129963: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963 [14:42:43] (03PS1) 10Luke081515: Grant groups from flaggedrevs the patrol rights at test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287090 (https://phabricator.wikimedia.org/T134491) [14:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:43:05] 06Operations, 10ops-eqiad: mw1070-89 and mw1121-30 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T133770#2267199 (10Cmjohnson) [14:44:48] 06Operations, 10ops-eqiad: ship single dell 500GB sata to ulsfo - https://phabricator.wikimedia.org/T133699#2267200 (10Cmjohnson) a:05Cmjohnson>03RobH Disk has been shipped to the office. The tracking number is 1ZA19A020393157877 Assigning to @robh to receive [14:48:20] ori: memcached upgraded on mc1009 and mc1008 restarted [14:48:53] 06Operations, 13Patch-For-Review, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2267216 (10Cmjohnson) [14:48:55] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review: mw1026-69 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T129060#2267214 (10Cmjohnson) 05Open>03Resolved These servers have all been wiped and removed from the racks. [14:49:22] (03CR) 10Luke081515: [C: 031] Commons: Restrict changetags userright [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286522 (https://phabricator.wikimedia.org/T134196) (owner: 10Rillke) [14:49:45] PROBLEM - Host 208.80.155.118 is DOWN: PING CRITICAL - Packet loss = 100% [14:50:06] labs-recursor1.wikimedia.org. [14:50:15] andrewbogott, ^ [14:50:31] yeah, that's me [14:50:58] why does that show up as an IP instead of hostname? [14:51:13] (03CR) 10Volans: "Compiler results here, all looks good to me:" [puppet] - 10https://gerrit.wikimedia.org/r/287088 (https://phabricator.wikimedia.org/T133337) (owner: 10Volans) [14:54:59] anomie: Ok for you if we start with my patchs at SWAT? 
I have to small config changes for the master, so it won't take much time ;) [14:55:16] Luke081515: sure [14:55:25] ok, thx :) [14:55:52] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2267245 (10Cmjohnson) [14:58:44] (03CR) 10Jcrespo: [C: 031] MariaDB: set mysql_role to standalone for es1 [puppet] - 10https://gerrit.wikimedia.org/r/287088 (https://phabricator.wikimedia.org/T133337) (owner: 10Volans) [14:58:46] (03PS1) 10Andrew Bogott: Designate: Open up firewall to udp on 5354 for axfr syncing [puppet] - 10https://gerrit.wikimedia.org/r/287091 [14:59:34] (03CR) 10Andrew Bogott: [C: 032 V: 032] Designate: Open up firewall to udp on 5354 for axfr syncing [puppet] - 10https://gerrit.wikimedia.org/r/287091 (owner: 10Andrew Bogott) [15:00:02] (03PS14) 10Ottomata: Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [15:00:05] anomie ostriches thcipriani marktraceur aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160505T1500). [15:00:05] anomie Luke081515: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:06] 06Operations, 10ops-eqiad: db1058 does not come up after restart - https://phabricator.wikimedia.org/T134360#2267246 (10Cmjohnson) db1058 is most likely cooked. The server was almost too hot to touch. One of the power supplies has failed. I attempted to drain flea power but the server will not power on. I a... [15:00:10] (03PS1) 10Jcrespo: Prepare db1023 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/287092 (https://phabricator.wikimedia.org/T125028) [15:00:17] * Luke081515 is here [15:00:35] I suppose I'll do the SWATting today. [15:00:48] ok :) [15:01:25] (03CR) 10jenkins-bot: [V: 04-1] Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [15:01:34] (03CR) 10jenkins-bot: [V: 04-1] Prepare db1023 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/287092 (https://phabricator.wikimedia.org/T125028) (owner: 10Jcrespo) [15:01:38] (03CR) 10Anomie: [C: 032] Grant groups from flaggedrevs the patrol rights at test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287090 (https://phabricator.wikimedia.org/T134491) (owner: 10Luke081515) [15:02:15] (03Merged) 10jenkins-bot: Grant groups from flaggedrevs the patrol rights at test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287090 (https://phabricator.wikimedia.org/T134491) (owner: 10Luke081515) [15:02:27] (03PS1) 10Andrew Bogott: Allow axfr on udp, take two [puppet] - 10https://gerrit.wikimedia.org/r/287093 [15:02:29] !log stopping db1065 for hardware maintenance [15:02:29] ?? what's up with that openstack thing? 
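(A rough way to verify the AXFR path that the udp/5354 firewall change above is meant to unblock is to request the zone transfer from designate's mini-DNS by hand, then compare SOA serials on the two authoritative servers. This is only a sketch: the host carrying designate-mdns is a placeholder here, and 5354 is simply the port the patch opens.)

  DESIGNATE_HOST=labservices1001.wikimedia.org   # placeholder: wherever designate-mdns is listening
  ZONE=testlabs.wmflabs.org                      # the zone the newdbtest record was created in

  dig +tcp -p 5354 @$DESIGNATE_HOST $ZONE AXFR | head   # can the secondary actually pull the zone?

  dig +short @labs-ns0.wikimedia.org $ZONE SOA          # compare serials: ns1 should catch up to ns0
  dig +short @labs-ns1.wikimedia.org $ZONE SOA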
[15:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:39] puppet failling on patches due to [15:02:39] ./modules/role/manifests/labs/openstack/designate.pp:44 ERROR duplicate parameter found in resource (duplicate_params) [15:02:44] jenkins failing* [15:03:42] (03CR) 10Andrew Bogott: [C: 032 V: 032] Allow axfr on udp, take two [puppet] - 10https://gerrit.wikimedia.org/r/287093 (owner: 10Andrew Bogott) [15:03:44] !log anomie@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Grant groups from flaggedrevs the patrol rights at test2wiki [[gerrit:287090]] (duration: 00m 47s) [15:03:45] Luke081515: ^ Test please [15:04:06] anomie: checked, works :) [15:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:22] 06Operations, 10ops-eqiad: dbstore1001 degraded RAID - https://phabricator.wikimedia.org/T134471#2266613 (10Cmjohnson) New disk has been ordered and should arrive 5/5 Congratulations: Work Order SR929246049 was successfully submitted. [15:04:25] (03PS2) 10Anomie: Commons: Restrict changetags userright [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286522 (https://phabricator.wikimedia.org/T134196) (owner: 10Rillke) [15:04:34] (03CR) 10Anomie: [C: 032] Commons: Restrict changetags userright [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286522 (https://phabricator.wikimedia.org/T134196) (owner: 10Rillke) [15:04:51] (03CR) 10Ottomata: [C: 032 V: 032] Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [15:04:59] (03PS15) 10Ottomata: Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [15:05:03] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2267254 (10JAllemandou) @ottomata: 6 instances with 4 disks each in RAID 0 works for me. As you said, 1 lost over 6 is acceptable, and having 6.5Tb per instance seems fine about empty spac... [15:05:05] (03Merged) 10jenkins-bot: Commons: Restrict changetags userright [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286522 (https://phabricator.wikimedia.org/T134196) (owner: 10Rillke) [15:05:09] (03CR) 10Ottomata: [V: 032] Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [15:05:13] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: puppet fail [15:05:47] !log anomie@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Commons: Restrict changetags userright [[gerrit:286522]] (duration: 00m 29s) [15:05:50] Luke081515: ^ Test please [15:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:10] anomie: Checked, works. 
Thanks for SWAT :) [15:06:52] RECOVERY - Host 208.80.155.118 is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [15:07:22] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:09:02] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 704 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5766494 keys - replication_delay is 704 [15:11:17] 06Operations, 10RESTBase-Cassandra, 06Services: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2267279 (10faidon) >>! In T132771#2210653, @Eevans wrote: >> We have around ~2 million (2.017.651) Cassandra-related metrics on Graphite. This accounts for 62% of all metrics that we... [15:11:18] !log anomie@tin Synchronized php-1.27.0-wmf.23/extensions/CentralAuth/: SWAT: Use master CentralAuthUser instances when writing [[gerrit:287079]] (duration: 00m 32s) [15:11:18] anomie: ^ Test please [15:11:25] (03PS1) 10Alexandros Kosiaris: wgCopyUploadProxy: Vary per datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287095 [15:11:28] anomie: Works! [15:11:33] anomie: No backport to wmf.22? [15:11:39] lol [15:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:50] anomie: No, since group 2 goes to wmf.23 in a few hours, it doesn't seem worth the bother. [15:11:52] anomie: ok. [15:12:05] * anomie is done with SWAT [15:12:24] akosiaris: Thanks! [15:12:49] 06Operations, 10ops-eqiad, 13Patch-For-Review: eqiad: Failed DIMM db1065 - https://phabricator.wikimedia.org/T133250#2267299 (10Cmjohnson) DIMM swapped RMA Return Tracking #'s USPS 9202 3946 5301 2431 8184 70 FEDEX 9611918 2393026 53310155 [15:13:29] kart_: you 're welcome. when you have assembled the list of packages (or even better the packages themselves) ping me [15:14:35] 06Operations, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#2267314 (10akosiaris) >>! In T122134#2248679, @faidon wrote: > I still see a "url-downloader.wikimedia.org" reference in mediawiki-config, this should prob... [15:17:48] 06Operations, 10ops-eqiad, 13Patch-For-Review: eqiad: Failed DIMM db1065 - https://phabricator.wikimedia.org/T133250#2267329 (10Cmjohnson) 05Open>03Resolved Replaced the DIIMM. Cleared log and resolving [15:18:15] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2267331 (10Ottomata) Ok then, unless there are objections, let's go with that. Since they are mounting cassandra stuff under `/srv` elsewhere, let's do that here too. |mount|disks|raid l... [15:19:11] 06Operations, 10vm-requests: EQIAD: (1) VM request for url-downloader - https://phabricator.wikimedia.org/T134496#2267333 (10akosiaris) [15:19:21] 06Operations, 10vm-requests: EQIAD: (1) VM request for url-downloader - https://phabricator.wikimedia.org/T134496#2267345 (10akosiaris) p:05Triage>03Normal [15:19:26] 06Operations, 10vm-requests: EQIAD: (1) VM request for url-downloader - https://phabricator.wikimedia.org/T134496#2267333 (10akosiaris) a:03akosiaris [15:20:58] (03PS1) 10Cmjohnson: Decommission test server being used for raid card testing. [puppet] - 10https://gerrit.wikimedia.org/r/287097 [15:21:55] (03CR) 10Faidon Liambotis: [C: 04-2] "From a quick look, this looks like it runs ssh-keygen on the puppetmaster. 
This is definitely not something that we should do (and it woul" [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [15:22:13] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [15:22:28] (03PS4) 10Ottomata: Alter role::kafka::analytics::broker to be able to use confluent module during upgrade [puppet] - 10https://gerrit.wikimedia.org/r/286660 (https://phabricator.wikimedia.org/T121562) [15:22:43] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2267353 (10elukey) Something worth to notice: {F3969854} {F3969858} mc1009 is way faster in allocating memory and recovering from a restart (at least this se... [15:23:03] ottomata: ^^^ strontium [15:23:30] volans: sorry thanks [15:23:38] noop at the moment, forgot to puppet merge [15:23:39] done [15:24:12] I just saw the strontium one, I though was the usual missed sync [15:24:13] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:24:36] (03CR) 10Ottomata: [C: 032] Alter role::kafka::analytics::broker to be able to use confluent module during upgrade [puppet] - 10https://gerrit.wikimedia.org/r/286660 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [15:24:48] (03PS2) 10Cmjohnson: Decommission test server being used for raid card testing. [puppet] - 10https://gerrit.wikimedia.org/r/287097 [15:25:48] (03PS1) 10Cmjohnson: removing wmf4727-test from dns -- raid card test server [dns] - 10https://gerrit.wikimedia.org/r/287099 [15:26:06] (03CR) 10Cmjohnson: [C: 032] Decommission test server being used for raid card testing. [puppet] - 10https://gerrit.wikimedia.org/r/287097 (owner: 10Cmjohnson) [15:26:15] (03PS2) 10Rush: phab::vcs: Add a proxy parameter [puppet] - 10https://gerrit.wikimedia.org/r/287081 (owner: 10Alexandros Kosiaris) [15:26:47] (03CR) 10Rush: [C: 031] "seems reasonable to me, this was all done by the releng folks I believe but this does seem better" [puppet] - 10https://gerrit.wikimedia.org/r/287081 (owner: 10Alexandros Kosiaris) [15:26:49] (03PS2) 10Cmjohnson: Adding mgmt and prodcution dns for druid1001-1003 [dns] - 10https://gerrit.wikimedia.org/r/286694 [15:27:06] (03CR) 10Cmjohnson: [C: 032] removing wmf4727-test from dns -- raid card test server [dns] - 10https://gerrit.wikimedia.org/r/287099 (owner: 10Cmjohnson) [15:28:00] (03PS3) 10Cmjohnson: Adding mgmt and prodcution dns for druid1001-1003 [dns] - 10https://gerrit.wikimedia.org/r/286694 [15:28:33] (03CR) 10Cmjohnson: [C: 032] Adding mgmt and prodcution dns for druid1001-1003 [dns] - 10https://gerrit.wikimedia.org/r/286694 (owner: 10Cmjohnson) [15:30:54] 06Operations, 10ops-eqiad: db1058 does not come up after restart - https://phabricator.wikimedia.org/T134360#2267397 (10jcrespo) Thank you. This should be ones of the replaced ones from the new batch. Feel free to unrack it if you need the space. I will keep this ticket open for decommission purposes. 
[15:31:58] 06Operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2267401 (10Cmjohnson) [15:32:01] 06Operations, 10ops-eqiad, 06DC-Ops: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2267398 (10Cmjohnson) 05Open>03Resolved Removed all production dns entires, Removed dhcpd and netboot.cfg Releases in Google doc as a spares...resolving task [15:32:32] (03PS2) 10Jcrespo: Prepare db1023 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/287092 (https://phabricator.wikimedia.org/T125028) [15:33:39] (03CR) 10Jcrespo: [C: 032] Prepare db1023 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/287092 (https://phabricator.wikimedia.org/T125028) (owner: 10Jcrespo) [15:35:25] 06Operations, 10ops-eqiad, 06DC-Ops: dbstore1001 management interface has saturated the number of available ssh connections - https://phabricator.wikimedia.org/T126227#2267405 (10Cmjohnson) @jcrespo we still need to do this but let's wait until after the new disk arrives [15:35:54] +1 [15:37:42] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5760252 keys - replication_delay is 0 [15:42:06] (03PS2) 10ArielGlenn: add dumps for flow pages for those wikis which have Flow enabled [dumps] - 10https://gerrit.wikimedia.org/r/282883 (https://phabricator.wikimedia.org/T89398) [15:44:11] jynus: It vaguely looks like pdns is unable to write to the 'domains' table. Do you see any evidence of that? [15:44:13] (03CR) 10ArielGlenn: [C: 032] add dumps for flow pages for those wikis which have Flow enabled [dumps] - 10https://gerrit.wikimedia.org/r/282883 (https://phabricator.wikimedia.org/T89398) (owner: 10ArielGlenn) [15:44:33] 06Operations, 10ops-eqiad: ms-be1002 has a faulty disk - https://phabricator.wikimedia.org/T134234#2267415 (10Cmjohnson) Added disk back. megacli -GetPreservedCacheList -a0 Adapter #0 Virtual Drive(Target ID 05): Missing. megacli -DiscardPreservedCache -L5 -a0 Adapter #0 Virtual Drive(Target ID 05): Preserv... [15:44:52] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [15:45:27] 06Operations, 10ops-eqiad: ms-be1002 has a faulty disk - https://phabricator.wikimedia.org/T134234#2267417 (10Cmjohnson) 05Open>03Resolved [15:46:14] jynus: disregard, I think I see the issue [15:47:14] ? [15:47:23] 06Operations, 10ops-eqiad, 10Analytics-Cluster: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2267422 (10Cmjohnson) @elukey: re-applying thermal paste is needed. There has been several servers that have required lately and it appears to have fixed the issue. [15:48:27] 07Blocked-on-Operations, 10Datasets-Archiving, 10Dumps-Generation, 10Flow, 03Collab-Team-2016-Apr-Jun-Q4: Publish recurring Flow dumps at http://dumps.wikimedia.org/ - https://phabricator.wikimedia.org/T119511#2267424 (10ArielGlenn) [15:49:09] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#2267425 (10ArielGlenn) [15:49:42] jynus: it's not a permissions issue, it's a mistake in the designate config [15:49:53] ? [15:49:58] something I can do about it? 
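Aside: the "is pdns actually writing to the domains table" question above can be checked straight from the database, even though the root cause here turned out to be a designate config mistake rather than permissions. A rough sketch of such a check, assuming the stock PowerDNS generic-MySQL schema (domains and records tables); the host, user and credentials below are placeholders, not real ones:

```python
# Quick look at pdns write activity on the designate-backed database.
# Connection details and the read-only user are illustrative assumptions;
# the domains/records tables are the standard pdns gmysql schema.
import time
import pymysql

conn = pymysql.connect(host="m5-master.example", user="pdns_ro",
                       password="REDACTED", database="pdns")
try:
    with conn.cursor() as cur:
        # Newest zones designate has pushed into pdns.
        cur.execute("SELECT id, name, type, notified_serial FROM domains "
                    "ORDER BY id DESC LIMIT 10")
        for row in cur.fetchall():
            print(row)

        # Records touched in the last hour (change_date is a unix timestamp
        # in the stock schema, NULL for rows never updated).
        cur.execute("SELECT COUNT(*) FROM records WHERE change_date > %s",
                    (int(time.time()) - 3600,))
        print("records changed in the last hour:", cur.fetchone()[0])
finally:
    conn.close()
```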
[15:50:32] or it should just be fixed easily [15:50:34] nope, I can fix it [15:51:20] 06Operations, 10ops-eqiad: audit/remove two cross-connection patch cables - https://phabricator.wikimedia.org/T132945#2267442 (10Cmjohnson) a:05Cmjohnson>03RobH @robh: Dmarc ports 15/16 is empty. disconnected fiber #2013 from cr2-eqiad 5/3/1 Resolve as necessary [15:51:40] (03CR) 10BearND: "@Mobrovac I398ac38e091e865023e532af4469f51dd666b985 is merged now" [puppet] - 10https://gerrit.wikimedia.org/r/286695 (https://phabricator.wikimedia.org/T129147) (owner: 10BearND) [15:52:31] (03PS1) 10BBlack: Text VCL: RB ?redirect=false optimization [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) [15:52:52] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:53:32] (03CR) 10Mobrovac: [C: 031] "Yup. Remember to get it on tin before issuing the deploy command for the first time :)" [puppet] - 10https://gerrit.wikimedia.org/r/286695 (https://phabricator.wikimedia.org/T129147) (owner: 10BearND) [15:56:34] (03PS1) 10Andrew Bogott: Designate: Specify IP for pool target master rather than fqdn [puppet] - 10https://gerrit.wikimedia.org/r/287105 [15:56:48] (03PS1) 10Ottomata: Set analytics kafka broker info for labs deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287106 (https://phabricator.wikimedia.org/T121562) [15:57:11] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review: eqiad: Rack and setup new labstore - https://phabricator.wikimedia.org/T133397#2267451 (10Cmjohnson) a:05Cmjohnson>03chasemp All my work is finished.....all you with the install and RAID. Thanks!!! [15:57:13] RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:51] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2267453 (10Cmjohnson) [15:58:21] (03PS2) 10Andrew Bogott: Designate: Specify IP for pool target master rather than fqdn [puppet] - 10https://gerrit.wikimedia.org/r/287105 [15:58:23] (03PS2) 10BBlack: Text VCL: RB ?redirect=false optimization [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) [15:58:55] (03CR) 10DCausse: [C: 031] Set analytics kafka broker info for labs deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287106 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [16:00:04] godog moritzm _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160505T1600). Please do the needful. [16:00:04] Krenair urandom bearND: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:19] * urandom is available [16:00:20] hi [16:00:33] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for Dstrine - https://phabricator.wikimedia.org/T133953#2267470 (10DStrine) thanks @Dzahn here it is: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC3N4d9IDAb//Aq4NG4I2L5EXPpJDHnTg/+O1tOsuNWJUV9NvaNjzB5aZZ3hnZOS5OoTU6Lf0d8FWV5z... 
[16:00:38] o/ (for scap3 moral support) [16:01:41] (03CR) 10Andrew Bogott: [C: 032] Designate: Specify IP for pool target master rather than fqdn [puppet] - 10https://gerrit.wikimedia.org/r/287105 (owner: 10Andrew Bogott) [16:01:49] !log restarting db1023 for reimage to jessie [16:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:03:06] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2267481 (10Cmjohnson) resolving this task. Please re-open if the heat issues return. [16:03:14] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team: iridium (Phabricator host) went down, Possible cpu heat issue - https://phabricator.wikimedia.org/T131742#2267483 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson [16:04:23] 06Operations, 10ops-eqiad, 06DC-Ops: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#2267485 (10Cmjohnson) @jcrespo we need to fix this as well...both dbstore1001 and 1002 will required a hard reset...power down remove po... [16:06:07] are there any opsen around for puppetswat...? [16:09:01] yeah I can help with puppet swat [16:09:49] I'm going in reverse order, bearND first [16:10:32] hey, bearND stepped away but i'm here and ready to test it [16:11:00] (all 3 of us are on a couch in Colorado currently :)) [16:11:19] haha ok, merging [16:11:35] i'm back [16:12:14] (03PS3) 10Filippo Giunchedi: Deploy mobileapps using scap3 [puppet] - 10https://gerrit.wikimedia.org/r/286695 (https://phabricator.wikimedia.org/T129147) (owner: 10BearND) [16:12:24] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Deploy mobileapps using scap3 [puppet] - 10https://gerrit.wikimedia.org/r/286695 (https://phabricator.wikimedia.org/T129147) (owner: 10BearND) [16:12:40] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 654 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5763861 keys - replication_delay is 654 [16:13:15] so we'll need to run puppet on the targets then on tin. the targets will fail initially, but then we'll run a deploy and it should work [16:13:29] fails on tin [16:13:42] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: can't convert nil into Hash at /etc/puppet/modules/scap/manifests/server.pp:68 on node tin.eqiad.wmnet [16:13:43] the targets are scb[12]00[12] [16:14:30] mobrovac: ^ have you seen this with the others? [16:14:40] (03PS1) 10Andrew Bogott: Open the firewall so that designate can talk to mysql on dns hosts [puppet] - 10https://gerrit.wikimedia.org/r/287108 [16:14:42] (03PS1) 10BBlack: VCL: cap all TTLs at 14d (or less in existing cases) [puppet] - 10https://gerrit.wikimedia.org/r/287109 (https://phabricator.wikimedia.org/T124954) [16:15:21] (03CR) 10Rush: [C: 031] Open the firewall so that designate can talk to mysql on dns hosts [puppet] - 10https://gerrit.wikimedia.org/r/287108 (owner: 10Andrew Bogott) [16:15:46] 06Operations, 10ops-eqiad, 06DC-Ops: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#2267509 (10jcrespo) I know, however, dbstore1002 cannot be easily restarted (unlike dbstore1001) without downtime scheduling with #analy... 
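That "can't convert nil into Hash" 400 is what Puppet throws when a manifest expects a hash from hiera and gets nothing back; the diagnosis that follows shortly pins it on a '~' (YAML's null) in the hiera file. A tiny illustration of the YAML side with PyYAML, using a made-up key name rather than the scap module's real one:

```python
# '~' in YAML is null, so a hiera value written as '~' reaches the manifest
# as nil instead of the empty hash the Puppet code iterates over.
import yaml

ok = yaml.safe_load("some_module::sources:\n  mobileapps/deploy: {}\n")
broken = yaml.safe_load("some_module::sources: ~\n")

print(type(ok["some_module::sources"]))   # <class 'dict'>: usable as a hash
print(broken["some_module::sources"])     # None: "can't convert nil into Hash"
```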
[16:16:08] bearND: godog: hmmmm, that's a new one [16:16:15] (03CR) 10Andrew Bogott: [C: 032] Open the firewall so that designate can talk to mysql on dns hosts [puppet] - 10https://gerrit.wikimedia.org/r/287108 (owner: 10Andrew Bogott) [16:16:30] yeah I'm trying to understand where it comes from [16:16:31] oh! I bet I know what happened: the '~' in the hiera file OK, we'll have to revert for now :( [16:16:40] ok, reverting [16:16:51] damn tilde! [16:17:16] did it go through the compiler? [16:17:29] 06Operations, 10Traffic, 13Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2267511 (10BBlack) We're overdue to circle back to this, but there's also a lot of investigating and thinking left to do, and IMHO the varnish4 transition as well as the Surrogate-Con... [16:17:42] godog: sorry, forgot to do that [16:18:00] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [16:18:09] (03PS1) 10Filippo Giunchedi: Revert "Deploy mobileapps using scap3" [puppet] - 10https://gerrit.wikimedia.org/r/287110 [16:18:16] np, it just became a new admission rule of puppet swat! [16:18:19] godog: mainly because i was compiling a similar change for cxserver and realised the output for tin cannot even load in my browser :/ [16:18:31] godog: yup, good rule! [16:18:34] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "Deploy mobileapps using scap3" [puppet] - 10https://gerrit.wikimedia.org/r/287110 (owner: 10Filippo Giunchedi) [16:19:32] heh, so yeah just the harder to debug failures will be left! [16:21:12] bearND mdholloway I ran puppet again on scb1001 and tin after the rollback, can you verify that deploying mobileapps there still works? [16:21:23] (03PS2) 10BBlack: VCL: cap all TTLs at 14d (or less in existing cases) [puppet] - 10https://gerrit.wikimedia.org/r/287109 (https://phabricator.wikimedia.org/T124954) [16:21:44] (03CR) 10BBlack: [C: 032 V: 032] VCL: cap all TTLs at 14d (or less in existing cases) [puppet] - 10https://gerrit.wikimedia.org/r/287109 (https://phabricator.wikimedia.org/T124954) (owner: 10BBlack) [16:21:45] godog: sure, will check [16:22:10] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:22:39] !log mobileapps starting no-op deploy on scb1001 [16:22:41] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: puppet fail [16:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:06] sanitarium:s3 stuck again [16:23:31] urandom: I'll just merge https://gerrit.wikimedia.org/r/#/c/286865/ since it is a noop [16:23:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Update collector version (both branches) [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/286865 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [16:23:58] godog: yup! [16:24:13] godog: if you wanted, you could have a look at https://gerrit.wikimedia.org/r/#/c/286784/ too [16:24:36] it should also be a no-op in the sense that without the jars, it'll ignore the extra values in the yaml file [16:24:53] (03PS1) 10Cmjohnson: Removing dns entries for decom'd app servers mw1070-1089, mw1121-1130 and cleaning up missed 1060-69. 
[dns] - 10https://gerrit.wikimedia.org/r/287111 [16:25:06] i probably should have added it to the swat on the reasoning [16:25:15] fixed [16:25:34] thx [16:25:57] urandom: oh ok, yeah that was going to be my next question if it is going to barf on whitelist being there [16:26:21] nope, the old code will just ignore it [16:26:21] godog: looking fine [16:26:44] godog: deploying on scb1001, that is [16:26:55] mdholloway: sweet, thanks! [16:26:58] godog: there'll be one more puppet change required (after a deploy of the jars), that would make cmcd use the whitelist [16:27:12] (03PS2) 10Cmjohnson: Removing dns entries for decom'd app servers mw1070-1089, mw1121-1130 and cleaning up missed 1060-69. [dns] - 10https://gerrit.wikimedia.org/r/287111 [16:27:15] to rejigger the link to the jar [16:27:26] (03PS2) 10Filippo Giunchedi: Blacklist some meta-table Cassandra metrics [puppet] - 10https://gerrit.wikimedia.org/r/286784 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [16:27:39] jynus: I'm now done with all the dns things that I was planning to do. In theory the pdns db on m5-master should be very quiet now, does that look right to you? [16:27:47] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "compiler https://puppet-compiler.wmflabs.org/2685/" [puppet] - 10https://gerrit.wikimedia.org/r/286784 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [16:27:54] godog: thank you sir! [16:28:45] (03PS3) 10Cmjohnson: Removing dns entries for decom'd app servers mw1070-1089, mw1121-1130 and cleaning up missed 1060-69. [dns] - 10https://gerrit.wikimedia.org/r/287111 [16:29:45] urandom: np, what's the timeline to switch to the new version btw? [16:29:46] (03CR) 10Mobrovac: "One question and a nit in-lined." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [16:30:02] (03PS1) 10BearND: Deploy mobileapps using scap3, 2nd try [puppet] - 10https://gerrit.wikimedia.org/r/287112 (https://phabricator.wikimedia.org/T129147) [16:30:05] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for decom'd app servers mw1070-1089, mw1121-1130 and cleaning up missed 1060-69. [dns] - 10https://gerrit.wikimedia.org/r/287111 (owner: 10Cmjohnson) [16:31:02] (03CR) 10Mobrovac: [C: 031] Deploy mobileapps using scap3, 2nd try [puppet] - 10https://gerrit.wikimedia.org/r/287112 (https://phabricator.wikimedia.org/T129147) (owner: 10BearND) [16:31:32] !log mobileapps finished no-op test deploy [16:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:59] godog: it could be done as soon as a trebuchet deploy is complete [16:33:28] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5762530 keys - replication_delay is 0 [16:33:45] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Separate Cache-Control header for proxy and client - https://phabricator.wikimedia.org/T50835#516941 (10BBlack) Note also that both CC and SC have grace-mode information as well. In CC: our max-age is `s-maxage` with fallback to `maxage`, and our grac... [16:34:47] (03CR) 10Filippo Giunchedi: [C: 04-1] "the collector seems a bit specific, want to tackle a generic 'nagios check collector' ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [16:35:25] (03CR) 10Alex Monk: "I have no idea how that would work" [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [16:35:25] urandom: ack, let me know if you want to do that now [16:38:06] (03PS2) 10Filippo Giunchedi: restbase: Add beta nlwiki [puppet] - 10https://gerrit.wikimedia.org/r/286476 (https://phabricator.wikimedia.org/T118005) (owner: 10Alex Monk) [16:38:14] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: Add beta nlwiki [puppet] - 10https://gerrit.wikimedia.org/r/286476 (https://phabricator.wikimedia.org/T118005) (owner: 10Alex Monk) [16:41:09] (03CR) 10Filippo Giunchedi: "essentially a diamond "meta-collector" for nagios checks to be configured with:" [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [16:43:33] (03PS1) 10Eevans: Update cassandra-metrics-collector version [puppet] - 10https://gerrit.wikimedia.org/r/287113 (https://phabricator.wikimedia.org/T134016) [16:43:40] godog: ^^^ [16:43:48] godog: cassandra-metrics-collector-2.1.0-20160504.150640-1-jar-with-dependencies.jar is deployed everywhere [16:44:17] 2.1.0 will enable whitelisting, and so those whitelist entries will take effect [16:45:41] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for Dstrine - https://phabricator.wikimedia.org/T133953#2267585 (10Dzahn) @Dstrine thanks, that looks good. btw, the "your_email@" part at the end is just like a label for it, we can change that to your actual email addre... [16:46:14] (03CR) 10Alex Monk: "I think I understand what you mean but I don't know how you'd get the configuration for that generated in puppet." [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [16:48:13] urandom: sounds good, I'll merge that [16:48:23] (03PS2) 10Filippo Giunchedi: Update cassandra-metrics-collector version [puppet] - 10https://gerrit.wikimedia.org/r/287113 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [16:48:24] godog, you'd need all parts of puppet code to be able to do something like this: diamond::nagioscollector { 'keyholder_status': command => [ '/usr/bin/sudo', '/usr/lib/nagios/plugins/check_keyholder' ] } [16:48:30] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Update cassandra-metrics-collector version [puppet] - 10https://gerrit.wikimedia.org/r/287113 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [16:48:33] godog: thanks! [16:49:16] and then after all puppet code using diamond::nagioscollector has been run, spit out a config file to be used by the meta collector [16:50:11] or maybe we could have it set up as a template, so each nagioscollector would create an actual new collector? [16:50:20] with the appropriate command etc. filled in [16:50:28] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:52:13] Krenair: yeah diamond isn't great for that use case, I started doing sth like that with diamond::collector::servicestats but never finished it heh [16:52:31] godog, something like... which one? [16:52:49] _joe_: are you still in working hours? are you game to try to deploy the ocg1003 decommission patch? [16:53:44] (03CR) 10BBlack: [C: 04-1] "Should no longer be necessary if you're past the transition and upgraded. 
do_spdy:false is now the effective default and the option is re" [puppet] - 10https://gerrit.wikimedia.org/r/286821 (https://phabricator.wikimedia.org/T134362) (owner: 10Hashar) [16:54:19] <_joe_> cscott: let's! [16:54:45] <_joe_> (no, not still working hours but not as tired as yesterday) [16:54:54] godog: puppet isn't going to restart cmcd for us, is it? [16:55:02] Krenair: a diamond collector that would support multiple "things" where these "things" are actually config snippets dropped by puppet [16:55:43] urandom: no it isn't going to [16:55:48] godog: kk [16:55:50] (03PS5) 10Giuseppe Lavagetto: Decommission ocg1003. [puppet] - 10https://gerrit.wikimedia.org/r/286070 (https://phabricator.wikimedia.org/T84723) (owner: 10Cscott) [16:55:56] !log restart cassandra-metrics-collector on restbase2001 to test white/black listing [16:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:17] godog: beat me to it! [16:56:19] godog, what about a single collector reading configs out of a directory, and each puppet diamond::nagioscollector just generating an extra file in that directory? [16:56:37] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for Dstrine - https://phabricator.wikimedia.org/T133953#2267594 (10Dzahn) @Dstrine thanks, i found it. you already have one. I asked that because we like to use the same UID as in labs to prevent duplicates. uidNumber... [16:57:02] Krenair: precisely, sth like I did in diamond::collector::servicestats [16:57:20] <_joe_> cscott: wait for me to verify it a second [16:58:55] godog, so that just leaves how we're going to get the single collector on all hosts that use it but not the hosts that don't [16:59:38] (03CR) 10Giuseppe Lavagetto: [C: 032] Decommission ocg1003. [puppet] - 10https://gerrit.wikimedia.org/r/286070 (https://phabricator.wikimedia.org/T84723) (owner: 10Cscott) [17:00:04] yurik gwicke cscott arlolra subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160505T1700). Please do the needful. [17:00:43] Krenair: possibly, but that was already the case with the collector you proposed (?) [17:00:52] no parsoid deploy [17:01:50] _joe_: hey look, it's the start of our OCG deploy window. ;) [17:02:01] <_joe_> applying puppet :P [17:02:06] <_joe_> on ocg1003 [17:02:11] <_joe_> then on the other ones [17:02:12] fyi, we're deploying a config change to OCG during this window. [17:02:36] according to puppet compiler, the mw-ocg-service.js config file shouldn't change on ocg1001 and ocg1002. [17:03:31] _joe_: after applying puppet, you (or i) will need to do a `service ocg restart` to have ocg pick up the change. [17:03:46] (03PS1) 10Dzahn: admin: add user account for David Strine [puppet] - 10https://gerrit.wikimedia.org/r/287114 (https://phabricator.wikimedia.org/T133953) [17:03:48] <_joe_> AFAIK, puppet does that automatically [17:04:01] <_joe_> it did, in fact [17:05:11] <_joe_> cscott: let me depool this server from the balancer too [17:05:30] <_joe_> I confirm a noop on ocg1002 as expected [17:06:03] _joe_: i thought it was already depooled from the balancer? but in any case, after decommissioning it should still answer front end requests, so you can depool from the balancer in any order. 
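The "single collector reading configs out of a directory" idea above (each diamond::nagioscollector resource in puppet just drops a small snippet, and one generic collector runs them all) is easy to sketch. This is a deliberately framework-free outline, not the eventual patch: a real Diamond collector would subclass diamond.collector.Collector and call self.publish(), and the directory, file format and metric names here are assumptions:

```python
# Sketch of a "meta-collector": run every nagios-style check described by a
# config snippet in one directory and report its exit code as a metric.
# Nagios convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
import glob
import json
import subprocess

CONF_DIR = "/etc/diamond/nagios-checks"   # puppet would drop one JSON file per check

def collect():
    metrics = {}
    for path in sorted(glob.glob(CONF_DIR + "/*.json")):
        with open(path) as f:
            # e.g. {"name": "keyholder_status",
            #       "command": ["/usr/bin/sudo",
            #                   "/usr/lib/nagios/plugins/check_keyholder"]}
            check = json.load(f)
        try:
            proc = subprocess.run(check["command"], stdout=subprocess.PIPE,
                                  stderr=subprocess.PIPE, timeout=30)
            code = proc.returncode
        except subprocess.TimeoutExpired:
            code = 3                      # treat a hung check as UNKNOWN
        metrics["nagios." + check["name"] + ".exit_code"] = code
    return metrics

if __name__ == "__main__":
    for name, value in collect().items():
        print(name, value)
```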
[17:06:12] !log oblivian@palladium conftool action : set/pooled=no; selector: name=ocg1003.* [17:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:06:26] <_joe_> just removed ^^ [17:06:36] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for Dstrine - https://phabricator.wikimedia.org/T133953#2267604 (10DStrine) @Dzahn I have a wikitech account but I have never used Gerrit. Will I need to? As you can tell I'm not an engineer. This... [17:07:29] <_joe_> cscott: let's run the cache wipe script there? [17:08:25] godog: i think it's working [17:08:31] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for Dstrine - https://phabricator.wikimedia.org/T133953#2267606 (10Dzahn) @Dstrine It's great that you have a wikitech account, i would have asked you to create one otherwise. But that was only to f... [17:08:33] godog: https://graphite.wikimedia.org/render?target=cassandra.restbase2001-a.org.apache.cassandra.metrics.ColumnFamily.local_group_wikipedia_T_parsoid_html.meta.CompressionRatio.value&format=json [17:08:56] <_joe_> cscott: I am running the script now [17:09:06] godog: if you scroll to the bottom... all of the nulls [17:09:09] <_joe_> !log clearing ocg cache entries for ocg1003 [17:09:12] _joe_: can we verify that it's not accepting front end requests or getting backend requests? [17:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:09:24] _joe_: don't forget to clear the cache entries for ocg1003.eqiad.wmnet as well. [17:09:40] <_joe_> cscott: that's what I am going to do now, see my log [17:10:19] <_joe_> and no, we were only getting direct hits from the appservers [17:11:08] _joe_: cool [17:11:10] urandom: yup I was looking at sth similar, nice! [17:11:41] godog: https://graphite.wikimedia.org/render?target=cassandra.restbase2001-a.org.apache.cassandra.metrics.ColumnFamily.local_group_wikipedia_T_parsoid_html.meta.ReadLatency.count&format=json [17:11:50] that is ReadLatency, which is whitelisted [17:11:54] those are still there [17:12:13] <_joe_> cscott: https://ganglia.wikimedia.org/latest/?c=PDF%20servers%20eqiad&h=ocg1003.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 seems promising [17:12:40] network in still seems high? [17:12:42] <_joe_> cscott: [2016-05-05T17:12:22.371Z] ERROR: mw-ocg-service/46865 on ocg1003: (node) warning: Recursive process.nextTick detected. This will break in the next version of node. Please use setImmediate for recursive deferral. [17:12:50] <_joe_> cscott: that's the script talking to redis [17:13:04] are you using nodejs-ocg ? [17:13:06] godog: i'll wait for puppet to sync that last change everywhere and do some rolling restarts [17:13:12] <_joe_> cscott: yes [17:13:50] _joe_: that's actually a buglet in that particular version of node. we should probably (eventually) migrate to node 4.x, like the rest of services are doing. [17:14:02] <_joe_> we should indeed [17:14:11] it's not an actual problem, the "recursive nexttick" was never actually deprecated or broken. [17:14:25] 06Operations, 10ops-eqiad, 13Patch-For-Review: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2267627 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson server has been wiped and decommissioned. 
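The two graphite render URLs above are the whole verification step: a blacklisted series keeps its name in graphite but stops receiving datapoints (a tail of nulls), while a whitelisted one such as ReadLatency keeps updating. A small sketch of doing that check with requests instead of scrolling the JSON by eye; the series name is one quoted in the log, the time window is an arbitrary choice:

```python
# Check whether a graphite series has gone quiet after a collector
# filtering change: all-null recent datapoints means nothing is reporting.
import requests

RENDER = "https://graphite.wikimedia.org/render"
target = ("cassandra.restbase2001-a.org.apache.cassandra.metrics.ColumnFamily."
          "local_group_wikipedia_T_parsoid_html.meta.CompressionRatio.value")

resp = requests.get(RENDER, params={"target": target, "from": "-30min",
                                    "format": "json"})
resp.raise_for_status()
for series in resp.json():
    values = [value for value, timestamp in series["datapoints"]]
    if all(v is None for v in values):
        print(series["target"], "-> no recent datapoints (filtered out)")
    else:
        print(series["target"], "-> still reporting")
```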
[17:14:48] (if i recall correctly, the "recursive nexttick" was never actually present, it was just a bug in how they were trying to detect it) [17:15:22] <_joe_> cscott: it seems like the script is not doing anything anymore [17:15:30] 06Operations, 10ops-eqiad: mw1070-89 and mw1121-30 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T133770#2267645 (10Cmjohnson) [17:15:53] 06Operations, 10ops-eqiad: mw1070-89 and mw1121-30 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T133770#2242410 (10Cmjohnson) DNS removed, wiping now, once completed the servers will be removed from the rack. [17:16:04] <_joe_> but I'm still getting requests for the cached objects [17:16:22] _joe_: it should say, "removed %d pending jobs" at the end? [17:16:34] <_joe_> why pending jobs? [17:16:36] (03PS2) 10Dzahn: admin: add user account for David Strine [puppet] - 10https://gerrit.wikimedia.org/r/287114 (https://phabricator.wikimedia.org/T133953) [17:16:38] <_joe_> I am clearling the cache [17:16:41] <_joe_> not the queue [17:16:56] _joe_: sorry, you're right, i'm reading the wrong line in the script. 'Cleared %d (of %d total) entries from cache in %s seconds', [17:17:10] <_joe_> yeah that didn't happen [17:17:24] <_joe_> let me strace it [17:17:36] (03CR) 10Dzahn: [C: 032] "just creating the user, no group memberships, they need to be confirmed and separately jenkins will not like you if you try to put a new u" [puppet] - 10https://gerrit.wikimedia.org/r/287114 (https://phabricator.wikimedia.org/T133953) (owner: 10Dzahn) [17:18:26] <_joe_> cscott: it's stuck at 0% cpu use [17:18:31] <_joe_> and strace shows nothing [17:18:55] could be waiting for an hscan response from redis? [17:19:02] does lsof say anything? [17:19:07] !log Rolling restart of cassandra-metrics-collector in RESTBase cluster : T134016 [17:19:07] T134016: RESTBase Cassandra cluster: Increase instance count from 2 to 3 - https://phabricator.wikimedia.org/T134016 [17:19:09] <_joe_> I am still looking [17:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:19:41] urandom: thanks! 
sounds good, that'd be more urgently restbase production I'd say, the rest don't have as many CFs anyway [17:19:42] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for Dstrine - https://phabricator.wikimedia.org/T133953#2267655 (10Dzahn) @Muehlenhoff i created the user, could you confirm and add him to the right groups for the actual access [17:20:47] <_joe_> cscott: yes it's connected to redis but I have no idea what the hell it's doing [17:20:55] godog: yup; though i assume you'll want to do some cleanup, and it might make it easier to do it all at once [17:21:10] _joe_: i could add some debugging so we can see progress (or lack thereof) [17:21:21] godog: though, the hieradata change was only for RESTBase, could have been applied to AQS but wasn't [17:21:44] <_joe_> cscott: that would be welcome, but let me debug this a bit more [17:22:25] if it's "doing stuff" but mostly talking to redis and waiting for redis to respond, the cpu could well stay near 0% [17:23:09] !log Rolling restart of cassandra-metrics-collector in RESTBase test cluster : T134016 [17:23:10] T134016: RESTBase Cassandra cluster: Increase instance count from 2 to 3 - https://phabricator.wikimedia.org/T134016 [17:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:23:24] <_joe_> cscott: it is safe to stop and restart, I understand [17:23:30] _joe_: yes. [17:23:40] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2267669 (10Dzahn) If we could actually use MediaWiki that would be lovely too. As you say, usually the skinning is the issue. Maybe we could pay a consultant to conve... [17:23:52] urandom: oh ok, I missed it was only for restbase, that works too [17:24:49] <_joe_> cscott: it's talking to redis [17:24:57] godog: ...and everything is restarted, so we should be good [17:25:05] paravoid: ping me when you have a minute to discuss the keyholder key generation stuff. _joe_, your input would also be helpful I think [17:25:22] _joe_: i just remembered that I don't have root on ocg1003, but i can su as `ocg`, which is sufficient to run the scripts myself. [17:25:28] <_joe_> cscott: it's doing hlen on ocg_job_status [17:25:31] godog: by good i mean, whenever you want to cleanup the filtered metrics, you can [17:25:45] <_joe_> I am looking at tcpdump of communications [17:25:45] _joe_: and redis is taking its time responding about that? [17:25:48] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 610 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5769146 keys - replication_delay is 610 [17:26:29] <_joe_> cscott: a bit, not so much [17:26:37] <_joe_> but that's requested like 10 times in a row [17:26:55] let me see if/where we do an hlen in the code [17:27:17] _joe_: oh, that's the health check i think. [17:27:21] <_joe_> then it fetches one object every now and then [17:27:30] <_joe_> the health check, right [17:27:31] ocinga is triggering it [17:27:34] <_joe_> sigh [17:27:51] <_joe_> so probably none of this is due to the script [17:27:56] <_joe_> since we're debugging this [17:28:23] <_joe_> let me get a better filter [17:28:27] what version of redis is it talking to? we upgraded from redis 2.6 (which didn't support hscan), right? [17:29:07] <_joe_> 2.8 IIRC [17:29:09] <_joe_> let me check [17:29:33] urandom: ok thanks! 
I'm about to go, will look tomorrow too at what has stopped updating [17:29:34] <_joe_> 2.8.17 [17:29:46] godog: have a good one! [17:30:05] _joe_: i'm expecting to see hscans with hdel commands interspersed. [17:30:10] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review: mw2212 had several downtimes recently - test before repool - https://phabricator.wikimedia.org/T129196#2267688 (10Papaul) a:05Papaul>03fgiunchedi @fgiunchedi Everything looks find with me. checked all led on the server no problem the last problem... [17:30:43] <_joe_> cscott: I am looking at that specific connection and nothing flows either way [17:31:22] <_joe_> cscott: let me restart that script [17:31:23] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2267690 (10BBlack) Did some further testing on an isolated test machine, using our current varnish3 package. * Got 2833-byte test file from uncorrupted (--compressed) o... [17:32:44] <_joe_> ok now I see tons of traffic indeed [17:32:54] (03CR) 10Dzahn: "aha:) cool! and that i use url-downloader at all and not "webproxy" is also right, right?" [puppet] - 10https://gerrit.wikimedia.org/r/287077 (owner: 10Alexandros Kosiaris) [17:33:13] <_joe_> cscott: and it's using hscan [17:33:16] _joe_: maybe there's an O(N^2) slowdown in redis' implementation of hscan (which would be quite unfortunate if true) [17:33:27] <_joe_> cscott: nope [17:33:31] as the hscan goes further through the queue, does redis respond slower and slower? [17:33:32] <_joe_> hscans are very fast [17:34:05] well, it seems to be getting stuck at some point, let's see what happens just around that time. [17:34:32] <_joe_> ok, what I see is that after a short while those error messages are spawned and all talking with redis stops [17:34:42] ORLY [17:35:33] <_joe_> let me do something more [17:36:22] _joe_: what's the last cursor returned by HSCAN before talking with redis stops? [17:36:40] <_joe_> !log temporarily stopping ocg on ocg1003 to better debug cleanup cache [17:36:44] <_joe_> cscott: let's see [17:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:28] <_joe_> cscott: I am doing a full packet capture [17:37:36] <_joe_> but I'm not going to analyze it now [17:37:40] (03PS13) 10Alex Monk: deployment-prep: keyholder shinken monitoring [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) [17:37:42] (03PS1) 10Alex Monk: [WIP] Diamond collector for nagios plugin return codes [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) [17:38:06] 06Operations, 10Wikimedia-Planet: install planet2001 - https://phabricator.wikimedia.org/T134507#2267732 (10Dzahn) [17:38:07] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5763052 keys - replication_delay is 0 [17:38:19] 06Operations, 10Wikimedia-Planet: install planet2001 - https://phabricator.wikimedia.org/T134507#2267745 (10Dzahn) p:05Triage>03Low [17:38:25] disappointed that icinga has not mentioned anything about ocg1003 being down yet. 
[17:39:03] <_joe_> cscott: it probably takes a couple of minutes [17:39:10] <_joe_> and that's ok, honestly :) [17:39:11] (03CR) 10jenkins-bot: [V: 04-1] deployment-prep: keyholder shinken monitoring [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [17:39:14] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Diamond collector for nagios plugin return codes [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [17:40:18] <_joe_> !log restarted ocg after packet capture [17:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:41:14] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2267752 (10elukey) mc_evictions seems a bit higher at the moment on mc1009: https://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&c=Memcached+eqiad&h=&tab=m&vn=&... [17:43:34] (03PS2) 10Alex Monk: [WIP] Diamond collector for nagios plugin return codes [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) [17:44:31] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Diamond collector for nagios plugin return codes [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [17:45:13] <_joe_> cscott: so, a brief peek at that packet capture; last thing it does is hscan ocg_job_status 3276 [17:45:33] <_joe_> which returns instantly when executed on the redis server [17:45:40] <_joe_> so it's definitely not waiting for redis [17:46:14] _joe_: I added my best guesses about places to make service_checker alert in https://etherpad.wikimedia.org/p/ve-2016-05-05; your input on that would be very welcome [17:46:29] <_joe_> gwicke: ack, but maybe tomorrow :) [17:46:47] <_joe_> cscott: not a single HDEL was done in this run of the script [17:46:49] (03PS3) 10Alex Monk: [WIP] Diamond collector for nagios plugin return codes [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) [17:46:50] yupyup, thanks! [17:47:15] <_joe_> and now, I am going to dinner, sorry [17:47:25] _joe_: there are about 400k status objects, so it's not getting far through. makes sense that it's cleaned out all job entries in the first 3276 status objects, though, so it's not doing any more hdels. [17:47:49] _joe_: no worries. i did some digging on the nexttick thing, i might be able to work around it for node 0.10 by upgrading some dependencies. [17:47:52] <_joe_> cscott: I guess that script needs to be worked on; nonetheless I think writing a simple cache wiping script should be doable [17:48:08] <_joe_> even on my side, in python, if needed [17:48:10] _joe_: so i'll do that, and add some better feedback to the script so we can tell on console when it's working vs stuck. 
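Since a simple cache-wiping replacement "in python" was just floated, this is roughly the shape of that loop with redis-py: walk the ocg_job_status hash with HSCAN, HDEL the entries that point at the host being decommissioned, and print progress so a stall is obvious. The hash name comes from the log; the connection details and the way a host is encoded inside each status entry are guesses. Being a plain loop, it also sidesteps the recursive process.nextTick limit that bites the node 0.10 script above:

```python
# Sketch: clear cache/status entries for one decommissioned host from the
# ocg_job_status hash. Entry format (JSON with a "host" field) is assumed.
import json
import redis

r = redis.StrictRedis(host="localhost", port=6379)   # placeholder connection
TARGET_HOST = "ocg1003"
cursor, scanned, cleared = 0, 0, 0

while True:
    cursor, entries = r.hscan("ocg_job_status", cursor, count=1000)
    scanned += len(entries)
    stale = []
    for field, blob in entries.items():
        try:
            if TARGET_HOST in json.loads(blob).get("host", ""):
                stale.append(field)
        except (ValueError, AttributeError):
            continue                     # skip entries that aren't JSON objects
    if stale:
        r.hdel("ocg_job_status", *stale)
        cleared += len(stale)
    print("cursor=%s scanned=%d cleared=%d" % (cursor, scanned, cleared))
    if cursor == 0:                      # HSCAN signals completion with cursor 0
        break
```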
[17:48:16] <_joe_> ok cool [17:48:17] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [17:48:28] <_joe_> drop me an email or update the ticket on the cache cleanup script [17:48:37] <_joe_> I have no time for it now, sorry [17:48:39] (03PS4) 10Alex Monk: [WIP] Diamond collector for nagios plugin return codes [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) [17:48:41] (03PS14) 10Alex Monk: deployment-prep: keyholder shinken monitoring [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) [17:48:41] _joe_: probably deploy on monday? although by that time most of the entries should have expired from the cache naturally. [17:48:52] <_joe_> cscott: yeah let's see then [17:48:53] _joe_: so early next week we should be able to actually take down ocg1003 regardless. [17:49:00] <_joe_> ok :) [17:49:18] <_joe_> ttyl [17:49:35] it seems like there's actually a hardcoded limit in node 0.10, so the warning isn't actually harmless, it's dropping some tasks on the floor once we hit the limit. [17:49:40] that's why the job is getting stuck. [17:49:56] solution: upgrade node (hard) or upgrade some dependencies (probably easier) [17:50:16] the node 0.10.x series was not well-loved. [17:51:08] (looking at https://graphite.wikimedia.org/ for the ocg.pdf.status_objects key, you can see that the cache script did successfully knock out a few of the status entries from redis, just not nearly enough of them) [17:55:37] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:56:19] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:57:27] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [17:58:17] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [17:58:48] PROBLEM - Disk space on elastic1008 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80341 MB (15% inode=99%) [17:58:54] 06Operations, 10OCG-General, 06Scrum-of-Scrums, 06Services, 07Technical-Debt: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#2267842 (10cscott) @joe and @cscott ran the script today, after deploying the T120077 change to decommission ocg1003. The result wa... [18:00:52] 06Operations: Increase size of root partition on ocg* servers - https://phabricator.wikimedia.org/T130591#2267849 (10cscott) [18:00:54] 06Operations, 06Services, 13Patch-For-Review: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2267850 (10cscott) [18:00:56] 06Operations, 10OCG-General, 06Services, 13Patch-For-Review: Implement flag to tell an OCG machine not to take new tasks from the redis task queue - https://phabricator.wikimedia.org/T120077#2267844 (10cscott) 05Open>03Resolved a:03cscott Ok, we deployed the config change and tested everything and co... [18:01:10] 06Operations, 10ops-ulsfo: power loss in ulsfo cabinet 1.23 - https://phabricator.wikimedia.org/T134330#2267851 (10RobH) > This is Jon Waters UL electrician, > It looks like 2 of your machines have a failed power supplies they are > Bastion 4001 and CB 4016. > When I arrived on Tuesday the breaker in the ele... [18:02:32] ^ "the other power supply failed because of the failure itself." 
[18:02:35] how poetic [18:05:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:06:38] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:08:29] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:08:53] (03PS1) 10Papaul: DNS:Adding mgmt DNS entries for maps200[1-4] Bug:T134406 [dns] - 10https://gerrit.wikimedia.org/r/287127 (https://phabricator.wikimedia.org/T134406) [18:12:34] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2267879 (10Papaul) [18:13:28] (03PS2) 10Dzahn: DNS:Adding mgmt DNS entries for maps200[1-4] Bug:T134406 [dns] - 10https://gerrit.wikimedia.org/r/287127 (https://phabricator.wikimedia.org/T134406) (owner: 10Papaul) [18:13:39] (03CR) 10Dzahn: [C: 032] DNS:Adding mgmt DNS entries for maps200[1-4] Bug:T134406 [dns] - 10https://gerrit.wikimedia.org/r/287127 (https://phabricator.wikimedia.org/T134406) (owner: 10Papaul) [18:14:45] (03PS2) 10Volans: MariaDB: set mysql_role to standalone for es1 [puppet] - 10https://gerrit.wikimedia.org/r/287088 (https://phabricator.wikimedia.org/T133337) [18:17:18] (03CR) 10Volans: [C: 032] MariaDB: set mysql_role to standalone for es1 [puppet] - 10https://gerrit.wikimedia.org/r/287088 (https://phabricator.wikimedia.org/T133337) (owner: 10Volans) [18:17:54] !log deployed revert/new patches for core & extension for T129506 [18:17:55] papaul: done on ns0 [18:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:36] mutante,: thanks [18:18:56] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2267929 (10mmodell) From @faidon's code review on Gerrit > From a quick look, this looks like it... 
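On the keyholder thread in that task (where the deploy keypair should be generated, given the objection to running ssh-keygen on the puppetmaster), the mechanics themselves are small wherever they end up running. A hedged sketch of producing one keypair and the pieces a keyholder-style setup consumes; the /etc/keyholder.d and /etc/keyholder-auth.d destinations follow the wikitech documentation as I read it, so treat the exact layout, and the passphrase-less key, as assumptions:

```python
# Generate a deploy keypair in a scratch directory and show where the halves
# would go in a keyholder-style setup. Nothing outside the temp dir is touched.
import os
import subprocess
import tempfile

name = "mobileapps_deploy"                     # hypothetical key name
workdir = tempfile.mkdtemp()
keyfile = os.path.join(workdir, name)

subprocess.check_call(["ssh-keygen", "-t", "ed25519", "-N", "", "-C", name,
                       "-f", keyfile])
fingerprint = subprocess.check_output(["ssh-keygen", "-lf", keyfile + ".pub"])

print("private key  -> /etc/keyholder.d/%s.key on the deployment server" % name)
print("public key   -> authorized_keys of the deploy user on the targets")
print("fingerprint  -> the group mapping under /etc/keyholder-auth.d/:")
print(" ", fingerprint.decode().strip())
```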
[18:18:58] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2264720 (10Dzahn) merged DNS change that created mgmt entries, ran authdns-update on ns0 [18:19:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:20:11] 06Operations, 05codfw-rollout: test2wiki has no recent changes before the 20 april - https://phabricator.wikimedia.org/T133225#2267932 (10Luke081515) Still no recent changes before the 20th april: https://test2.wikipedia.org/w/index.php?title=Special:RecentChanges&hidepatrolled=0&days=999&limit=500 [18:21:48] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 708 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5767763 keys - replication_delay is 708 [18:23:18] !log deployed patch for T130947 [18:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:29:26] (03PS5) 10Alex Monk: Diamond collector for nagios plugin return codes [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) [18:29:33] (03PS6) 10Alex Monk: Diamond collector for nagios plugin return codes [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) [18:29:45] (03PS15) 10Alex Monk: deployment-prep: keyholder shinken monitoring [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) [18:33:50] hi mutante, do you think you could take a look at https://gerrit.wikimedia.org/r/#/c/285932/ ? it's the last step for UrlShortener :) [18:34:11] 06Operations, 10ops-ulsfo: power loss in ulsfo cabinet 1.23 - https://phabricator.wikimedia.org/T134330#2267985 (10RobH) So I am on-site, and after some testing have discovered the following: cr1-ulsfo, asw1-ulsfo, and bast4001 power supplies are fine. The port ports they are plugged into are not providing p... [18:34:22] RECOVERY - Disk space on elastic1008 is OK: DISK OK [18:36:32] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5764407 keys - replication_delay is 0 [18:38:46] legoktm: looking..and putting on the list yea, just not deploying (right now), added myself to reviewers [18:39:02] thanks :) [18:41:21] RECOVERY - Juniper alarms on asw-ulsfo.mgmt.ulsfo.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [18:43:01] !log mr1-ulsfo going to flap as i reroute the power cable [18:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:02] !log mr1-ulsfo power cable reroute complete [18:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:52] PROBLEM - Host mr1-ulsfo is DOWN: CRITICAL - Network Unreachable (198.35.26.194) [18:45:05] 06Operations, 10ops-ulsfo: power loss in ulsfo cabinet 1.23 - https://phabricator.wikimedia.org/T134330#2268069 (10RobH) Ok, Dan from UnitedLayer Facilities came over, and he knew exactly what it was. When the other UL tech flipped over all the breakers, hemissed a single one, which kept that bank of ports of... [18:45:55] 06Operations, 10Traffic, 05codfw-rollout: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404#2268070 (10BBlack) So, this the current state of affairs and the complex bits, using text as an example cluster: * hieradata `cache::route_table` for `role::cache:text` de... 
[18:46:12] PROBLEM - Host asw-ulsfo.mgmt.ulsfo.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:46:32] PROBLEM - Host mr1-ulsfo IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ffff::6 [18:47:44] !log deployed patch for T133507 [18:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:48:23] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [18:48:48] 06Operations, 10ops-ulsfo: power loss in ulsfo cabinet 1.23 - https://phabricator.wikimedia.org/T134330#2268095 (10RobH) [18:50:22] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.68 ms [18:51:12] RECOVERY - Host asw-ulsfo.mgmt.ulsfo.wmnet is UP: PING OK - Packet loss = 0%, RTA = 75.01 ms [18:52:33] RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 78.28 ms [18:54:31] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 73.50 ms [18:55:32] (03PS3) 10Dzahn: ircserver: move ircd.conf to public repo (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) [18:55:55] 06Operations, 10ops-ulsfo: cp4016: bad power supply - https://phabricator.wikimedia.org/T134526#2268135 (10RobH) [18:56:26] 06Operations, 10ops-ulsfo: cp4016: bad power supply - https://phabricator.wikimedia.org/T134526#2268148 (10RobH) a:03Cmjohnson Since @cmjohnson can self-dispatch the power supply for this to ULSFO directly, I'm assigning it to him. [18:56:49] (03CR) 10Dzahn: ircserver: move ircd.conf to public repo (WIP) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) (owner: 10Dzahn) [18:57:06] (03PS1) 10Smalyshev: Don't start Jolokia each time, let Updater start it [puppet] - 10https://gerrit.wikimedia.org/r/287131 (https://phabricator.wikimedia.org/T134523) [19:00:03] (03CR) 10BBlack: Text VCL: RB ?redirect=false optimization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [19:00:04] ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160505T1900). [19:02:28] (03PS1) 10Chad: Moving remaining wikis to 1.27.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287132 [19:03:09] 06Operations, 10ops-ulsfo: cp4016: bad power supply - https://phabricator.wikimedia.org/T134526#2268175 (10Cmjohnson) Work Order submitted. Once approved it will ship with next day delivery. Congratulations: Work Order SR929257617 was successfully submitted. 
[19:06:15] (03CR) 10Chad: [C: 032] Moving remaining wikis to 1.27.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287132 (owner: 10Chad) [19:06:41] (03Merged) 10jenkins-bot: Moving remaining wikis to 1.27.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287132 (owner: 10Chad) [19:07:21] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: wikipedias to wmf.23 [19:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:09:34] (03PS4) 10Dzahn: ircserver: move ircd.conf to public repo (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) [19:09:43] (03PS5) 10Dzahn: ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) [19:11:02] PROBLEM - HHVM rendering on mw2086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:41] 07Blocked-on-Operations, 06Operations, 06Services, 06WMDE-Analytics-Engineering, and 3 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#2268199 (10Danny_B) [19:12:52] RECOVERY - HHVM rendering on mw2086 is OK: HTTP OK: HTTP/1.1 200 OK - 67788 bytes in 0.280 second response time [19:21:35] 06Operations, 06Labs, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2268224 (10Danny_B) [19:21:38] 06Operations, 10Traffic, 07HTTPS, 07Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2268225 (10Danny_B) [19:24:53] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:25:15] 06Operations, 07Tracking: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#2268231 (10Danny_B) [19:29:34] (03PS1) 10Dzahn: mw_rc_irc: add "secret" files without real secret data [labs/private] - 10https://gerrit.wikimedia.org/r/287136 [19:30:40] !log deployed patch for T132874 [19:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:01] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
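After a wikiversions sync like the one just logged, the quickest end-to-end check that a wiki is really serving the new branch is the "generator" field from the siteinfo API; the wiki below is an arbitrary choice:

```python
# Confirm which MediaWiki branch a wiki is serving after the train moves it.
import requests

resp = requests.get("https://en.wikipedia.org/w/api.php",
                    params={"action": "query", "meta": "siteinfo",
                            "siprop": "general", "format": "json"},
                    headers={"User-Agent": "version-check-sketch/0.1"})
resp.raise_for_status()
general = resp.json()["query"]["general"]
print(general["sitename"], "is serving", general["generator"])
# expected after the switch: something like "MediaWiki 1.27.0-wmf.23"
```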
[19:34:11] (03PS2) 10Dzahn: mw_rc_irc: add "secret" files without real secret data [labs/private] - 10https://gerrit.wikimedia.org/r/287136 [19:35:11] (03CR) 10Dzahn: [C: 032 V: 032] mw_rc_irc: add "secret" files without real secret data [labs/private] - 10https://gerrit.wikimedia.org/r/287136 (owner: 10Dzahn) [19:36:20] (03PS6) 10Dzahn: ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) [19:38:14] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#2268264 (10Danny_B) [19:39:50] (03PS7) 10Dzahn: ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) [19:43:36] (03PS8) 10Dzahn: ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) [19:44:22] 07Puppet, 06Labs, 10Tool-Labs, 07Tracking: Fully puppetize Grid Engine (Tracking) - https://phabricator.wikimedia.org/T88711#2268273 (10Danny_B) [19:47:10] (03PS9) 10Dzahn: ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) [19:48:33] 06Operations, 07Tracking: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063#2268291 (10Danny_B) [19:52:48] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 678 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5773554 keys - replication_delay is 678 [20:08:10] PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 85 failures [20:12:13] forgot to set that this morning since today is a one-off [20:15:23] (03PS10) 10Dzahn: ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) [20:16:59] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: Puppet has 1 failures [20:17:11] last time I swear [20:18:29] 06Operations, 07Tracking: Elasticsearch rollout - tracking ticket - https://phabricator.wikimedia.org/T83497#2268370 (10Danny_B) [20:19:22] (03CR) 10Dzahn: "should be ready to go now. just the question if a service restart can be avoided, or .. i would test it first on the old server argon. i a" [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) (owner: 10Dzahn) [20:23:26] (03PS3) 10Dzahn: ircserver: don't use TS6 protocol, no other servers [puppet] - 10https://gerrit.wikimedia.org/r/286785 (https://bugzilla.wikimedia.org/134271) [20:23:27] (03PS11) 10Dzahn: ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) [20:23:46] dang, what..dependency foo [20:24:38] wants to go back to PS10 [20:32:07] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2268389 (10mobrovac) >>! In T133211#2267929, @mmodell wrote: > 1. Just let all of the service de... 
[20:32:39] 06Operations, 10Wikimedia-General-or-Unknown, 07Tracking: Backup systems (tracking) - https://phabricator.wikimedia.org/T20255#2268401 (10Danny_B) [20:32:54] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 06Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#2268404 (10GWicke) @akosiaris, I added the tag to reflect that several aspects of these requirements (especially config managemen... [20:33:28] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5767809 keys - replication_delay is 0 [20:34:09] PROBLEM - puppet last run on mw2181 is CRITICAL: CRITICAL: puppet fail [20:34:19] PROBLEM - HHVM rendering on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:34:59] PROBLEM - Apache HTTP on mw1148 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50407 bytes in 0.124 second response time [20:35:25] (03CR) 10Dzahn: [C: 04-1] "PS10 was the good one.. PS11 messed up by accident" [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) (owner: 10Dzahn) [20:41:52] (03PS12) 10Dzahn: ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) [20:43:38] 06Operations, 10Librarization, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Split GeoIP into a new component - https://phabricator.wikimedia.org/T102848#2268416 (10chasemp) p:05Triage>03Normal [20:43:49] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [20:43:57] 06Operations, 06Discovery, 10Maps, 10Tilerator, and 2 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#2268417 (10chasemp) p:05Triage>03High [20:44:19] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Performance-Team, 10Traffic: Spike: CentralNotice: Verify that our Special:HideBanners cookie storm works as efficiently as possible - https://phabricator.wikimedia.org/T117435#2268418 (10chasemp) p:05Triage>03Normal [20:44:32] 06Operations, 10MediaWiki-API, 10Traffic: Evaluate the feasibility of cache invalidation for the action API - https://phabricator.wikimedia.org/T122867#2268419 (10chasemp) p:05Triage>03Normal [20:44:59] 06Operations, 10MediaWiki-Vagrant, 10Traffic: Make Varnish port configurable using hiera - https://phabricator.wikimedia.org/T124378#2268421 (10chasemp) p:05Triage>03Low [20:45:12] 06Operations, 06Performance-Team, 10Traffic: Understand and improve streaming behaviour from Varnish - https://phabricator.wikimedia.org/T126015#2268422 (10chasemp) p:05Triage>03Normal [20:45:33] 06Operations, 10DBA, 07Performance, 07RfC, 05codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523#2268423 (10chasemp) p:05Triage>03Normal [20:45:47] 06Operations, 10DBA: Decomission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2268424 (10chasemp) p:05Triage>03Normal [20:46:29] 06Operations, 10ops-eqiad: dbstore1001 degraded RAID - https://phabricator.wikimedia.org/T134471#2268426 (10chasemp) p:05Triage>03High a:03Cmjohnson [20:46:46] 06Operations, 10Traffic, 05codfw-rollout: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404#2268428 (10chasemp) p:05Triage>03Normal [20:47:10] 06Operations, 10MediaWiki-ResourceLoader, 10Traffic: 
[20:47:10] 06Operations, 10MediaWiki-ResourceLoader, 10Traffic: commons.wikimedia.org home page has 404s loaded from JS (RL?) - https://phabricator.wikimedia.org/T134368#2268429 (10chasemp) p:05Triage>03High
[20:47:20] 06Operations, 10Education-Program-Dashboard, 10Traffic, 03Programs-and-Events-Dashboard-Sprint 2: Cache education dashboard pages - https://phabricator.wikimedia.org/T120509#2268430 (10chasemp) p:05Triage>03Normal
[20:47:26] 06Operations, 10ops-eqiad: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2268431 (10Cmjohnson)
[20:47:35] 06Operations, 10Analytics, 10Traffic, 07Privacy: Connect Hadoop records of the same request coming via different channels - https://phabricator.wikimedia.org/T113817#2268432 (10chasemp) p:05Triage>03Normal
[20:51:51] 06Operations, 10ops-eqiad: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2268445 (10Cmjohnson) [] Confirm out of cluster/service group [] Remove from puppet stored configuration files. [] Remove from site.pp (puppet:///manifests/site.pp) [] Remove from netboot.cfg [] Remove from DHCPD...
[21:01:09] RECOVERY - puppet last run on mw2181 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[21:04:28] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#1414241 (10Yurik) Please add me as I manage several projects (graphs, maps, tabular data, ...). Thanks!
[21:05:05] (03PS4) 10Dzahn: ircserver: don't use TS6 protocol, no other servers [puppet] - 10https://gerrit.wikimedia.org/r/286785 (https://bugzilla.wikimedia.org/134271)
[21:05:07] (03PS13) 10Dzahn: ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271)
[21:05:09] (03PS1) 10Volans: MariaDB: set $master true for codfw masters [puppet] - 10https://gerrit.wikimedia.org/r/287144 (https://phabricator.wikimedia.org/T134481)
[21:05:15] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:12:33] (03PS2) 10Volans: MariaDB: set $master true for codfw masters [puppet] - 10https://gerrit.wikimedia.org/r/287144 (https://phabricator.wikimedia.org/T134481)
[21:13:10] (03PS1) 10Southparkfan: Remove DNS entries of db1058 [dns] - 10https://gerrit.wikimedia.org/r/287145 (https://phabricator.wikimedia.org/T134360)
[21:14:52] (03PS3) 10Volans: MariaDB: set $master true for codfw masters [puppet] - 10https://gerrit.wikimedia.org/r/287144 (https://phabricator.wikimedia.org/T134481)
[21:15:57] (03PS2) 10Southparkfan: Remove DNS entries of db1058 [dns] - 10https://gerrit.wikimedia.org/r/287145 (https://phabricator.wikimedia.org/T134360)
[21:16:14] (03PS14) 10Dzahn: ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271)
[21:17:03] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2268523 (10Southparkfan) Just to learn how the process works, I've submitted a patch for the DNS adjustments. I noticed db1058 is referenced in the dhcpd and manifests/role/coredb.pp files in...
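Southparkfan's comment above notes that db1058 is still referenced in dhcpd and manifests/role/coredb.pp, which is what the decommission checklist boils down to: find and remove every lingering mention of the host. A small, hypothetical helper for that first "find" step; the repo path is an assumption, and doing this in Python rather than a plain grep is purely for illustration.

```python
# Hypothetical helper: list files in a local checkout that still mention a
# host being decommissioned. Equivalent in spirit to `grep -rl db1058 .`;
# the repo path below is an assumption.
import os

def files_mentioning(hostname, repo_root):
    hits = []
    for dirpath, _dirs, filenames in os.walk(repo_root):
        if ".git" in dirpath.split(os.sep):
            continue  # skip git internals
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="ignore") as fh:
                    if hostname in fh.read():
                        hits.append(path)
            except OSError:
                pass  # unreadable entries (sockets, broken symlinks) are skipped
    return hits

for path in files_mentioning("db1058", "operations-puppet"):
    print(path)
```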
[21:19:14] (03PS15) 10Dzahn: ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271)
[21:21:12] (03PS16) 10Dzahn: ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271)
[21:30:12] 06Operations, 10MediaWiki-Vagrant, 10Traffic: Make Varnish port configurable using hiera - https://phabricator.wikimedia.org/T124378#1954287 (10BBlack) We don't actually use port 6081 in production puppetization anyways, so I'm not sure what this is about? Related is T119396 which is a bit stale and dated.
[21:31:30] 06Operations, 10Traffic, 13Patch-For-Review: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#1825233 (10BBlack)
[21:31:54] 06Operations, 10Traffic, 13Patch-For-Review: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#1825233 (10BBlack)
[21:33:21] 06Operations, 10Education-Program-Dashboard, 10Traffic, 03Programs-and-Events-Dashboard-Sprint 2: Cache education dashboard pages - https://phabricator.wikimedia.org/T120509#1855172 (10BBlack) Can you give me some idea what URLs we're talking about for "education dashboard pages"? I'm lost.
[21:37:18] !log restarting elasticsearch server elastic1009.eqiad.wmnet (T110236)
[21:37:19] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236
[21:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:43:24] gehel: go on vacation!
[21:43:36] :P
[21:43:56] Almost there... just a quick check before bed time...
[21:50:14] lol
[21:52:36] 06Operations, 10Analytics, 10Traffic, 07Privacy: Connect Hadoop records of the same request coming via different channels - https://phabricator.wikimedia.org/T113817#2268606 (10BBlack) IMHO, we should define better what we need. There's a lot of grey area and disagreement in the discussion so far. Genera...
[22:19:37] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 07Elasticsearch: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844#2247220 (10Deskana) @Gehel How do you recommend prioritising this?
[22:23:33] 06Operations, 10Education-Program-Dashboard, 10Traffic, 03Programs-and-Events-Dashboard-Sprint 2: Cache education dashboard pages - https://phabricator.wikimedia.org/T120509#2268675 (10awight) @BBlack Oops--these are currently hosted on labs, at http://outreachdashboard.wmflabs.org/ and https://wikiedu-das...
[22:24:01] (03PS4) 10Volans: MariaDB: set $master true for codfw masters [puppet] - 10https://gerrit.wikimedia.org/r/287144 (https://phabricator.wikimedia.org/T134481)
[22:37:32] (03PS1) 10GWicke: Set conservative retry limits & delays [puppet] - 10https://gerrit.wikimedia.org/r/287148 (https://phabricator.wikimedia.org/T134456)
[22:56:31] (03PS14) 10Eevans: [WIP]: Cassandra 2.2.6 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629)
[22:57:07] (03CR) 10Volans: "Puppet compiler results here:" [puppet] - 10https://gerrit.wikimedia.org/r/287144 (https://phabricator.wikimedia.org/T134481) (owner: 10Volans)
[23:00:04] RoanKattouw ostriches Krenair awight Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160505T2300).
[23:00:46] Hi.
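The "Set conservative retry limits & delays" change above is a puppet configuration patch; as a language-neutral sketch of the pattern it names (a small retry budget with growing delays between attempts, so a struggling backend is not hammered), here is a generic illustration. The function name and the specific limits are made up for the example and are unrelated to the actual values in that patch.

```python
# Generic "conservative retries" pattern: few attempts, exponentially
# growing delay. Limits are illustrative, not the values from the patch.
import time

def call_with_retries(fn, max_retries=2, base_delay=1.0, backoff=2.0):
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise  # retry budget exhausted, surface the error
            time.sleep(base_delay * (backoff ** attempt))
            attempt += 1

# Example: at most 3 total attempts (1 initial + 2 retries), sleeping
# 1s and then 2s between them.
# call_with_retries(lambda: fetch_something(), max_retries=2)
```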
[23:01:30] It seems there is nothing to SWAT this evening.
[23:01:40] Yeah it's empty
[23:06:49] (03PS1) 10BryanDavis: New role: role::labs::redirector [puppet] - 10https://gerrit.wikimedia.org/r/287149 (https://phabricator.wikimedia.org/T134508)
[23:14:49] (03CR) 10BryanDavis: [C: 031] "Tested via cherry-pick on redirects-nginx01.redirects.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/287149 (https://phabricator.wikimedia.org/T134508) (owner: 10BryanDavis)
[23:22:10] 06Operations, 06Discovery, 10Maps, 10Tilerator, and 2 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#2268732 (10Yurik) Some stats to consider: the db update script reports 5-6 million z16 affected tiles per day. That's about 7million (5m/4 + 5m/16 + 5m/64...) with...
[23:34:07] (03Abandoned) 10Smalyshev: [WIP] Add configs for kafka-watcher tool [puppet] - 10https://gerrit.wikimedia.org/r/286588 (https://phabricator.wikimedia.org/T97562) (owner: 10Smalyshev)
[23:54:47] (03CR) 10Dzahn: [C: 032] New role: role::labs::redirector [puppet] - 10https://gerrit.wikimedia.org/r/287149 (https://phabricator.wikimedia.org/T134508) (owner: 10BryanDavis)
[23:55:05] thanks mutante
[23:55:49] PROBLEM - HHVM rendering on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:56:40] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:56:40] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2268779 (10GWicke) a:03GWicke
[23:57:49] PROBLEM - nutcracker port on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:57:59] PROBLEM - HHVM processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:58:29] PROBLEM - configured eth on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:58:30] PROBLEM - RAID on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:58:49] PROBLEM - dhclient process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:59:09] PROBLEM - Check size of conntrack table on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:59:20] PROBLEM - puppet last run on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:59:20] PROBLEM - DPKG on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:59:34] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2268853 (10ssastry) Worthwhile replacing all github urls with a version tied to a commit. This prevents the link pointers from pointing to the wr...
[23:59:38] PROBLEM - SSH on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:59:38] PROBLEM - salt-minion processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
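Yurik's tile-count estimate at 23:22:10 above is a geometric series: every zoom level below z16 holds a quarter as many tiles as the one above it, so n affected z16 tiles imply roughly n * 4/3 purges in total. A quick check of the arithmetic; the 5-6 million figure comes from the comment itself, the rest is just the sum.

```python
# Sanity-check of the Tilerator purge estimate: n z16 tiles plus
# n/4 + n/16 + ... at lower zooms converges to n * 4/3.
def total_purges(z16_tiles, zoom_levels=16):
    return sum(z16_tiles / 4 ** i for i in range(zoom_levels + 1))

for n in (5e6, 6e6):
    print("%.0fM z16 tiles -> ~%.1fM purges/day" % (n / 1e6, total_purges(n) / 1e6))
# 5M -> ~6.7M, 6M -> ~8.0M, roughly consistent with "about 7 million" per day.
```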