[00:00:04] ori: with which account?
[00:00:04] RoanKattouw, ^d, marktraceur, MaxSem, kaldari: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141125T0000). Please do the needful.
[00:00:05] MaxSem: it could use TemplateData and be generalized for all wikis
[00:00:34] MaxSem: basically like VE is doing
[00:00:58] paravoid: personal, linked to my github
[00:00:59] mutante: do you remember who would have that account?
[00:01:25] I'll SWAT
[00:01:30] ori: can you make it just fetch a URL from the web and give me, say, the HTML output?
[00:01:42] the URL would be https://www.ssllabs.com/ssltest/viewMyClient.html :)
[00:01:43] paravoid: https://office.wikimedia.org/wiki/Browser_testing_and_design_tools#BrowserStack
[00:01:50] oh slapped with the docs
[00:01:52] thanks mutante
[00:01:57] " (contact Sarah Rodlund for a personalised login/account)"
[00:02:02] yw paravoid
[00:02:31] it doesn't actually say the password on wiki though
[00:02:57] there's that other thing below it http://browsershots.org/
[00:03:15] MaxSem: I just put another patch in for SWAT -- https://gerrit.wikimedia.org/r/#/c/175604/ -- beta only logging config change that I tried earlier today. Now with the right class names!
[00:03:29] mutante: doesn't have IE :(
[00:03:53] bd808, are the deps already live?
[00:04:07] paravoid: i see, then browserstack.com it is
[00:04:17] MaxSem: Yes.
[00:05:15] legoktm: awesome, re that patch
[00:05:23] (03CR) 10MaxSem: [C: 032] Use Monolog provider for beta logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175604 (owner: 10BryanDavis)
[00:05:32] (03Merged) 10jenkins-bot: Use Monolog provider for beta logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175604 (owner: 10BryanDavis)
[00:06:12] bd808, also, my eyes! +715, -0 :P
[00:06:25] MaxSem: heh. yeah
[00:07:53] trying crossbrowsertesting.com now
[00:07:56] !log maxsem Synchronized wmf-config/logging-labs.php: https://gerrit.wikimedia.org/r/#/c/175604/ labs only (duration: 00m 05s)
[00:08:00] Logged the message, Master
[00:08:05] (03PS3) 10BBlack: r::c::ssl::misc: switch to r::c::localssl like prod SNI [puppet] - 10https://gerrit.wikimedia.org/r/175455
[00:08:06] bd808, ^^^
[00:08:26] (03CR) 10BBlack: [C: 032 V: 032] r::c::ssl::misc: switch to r::c::localssl like prod SNI [puppet] - 10https://gerrit.wikimedia.org/r/175455 (owner: 10BBlack)
[00:08:53] MaxSem: thx. Waiting on jenkins to deploy in beta. It ended up in line behind a scap with l10n updates. :/
[00:14:53] crossbrowsertesting.com is awesome
[00:15:06] it's just awesome
[00:16:43] +1!
[00:17:04] even has a live test, where you get a live screen in your browser
[00:17:23] how cool is that!
[00:17:41] in their screenshot app, you get a list of screenshots for all the OS/browser combinations you picked
[00:17:44] either the window or the full page
[00:17:56] then each of these has a "live test" button
[00:18:06] besides the "retake" button and enlarging it
[00:18:25] also apparently it can compare the layout of the page between different browsers, but I didn't test that
[00:18:26] cool!
[00:18:41] bblack: ^
[00:18:44] useful for SSL stuff
[00:19:58] !log puppet disabled on prod text/mobile/bits/upload varnishes for careful SSL changes
[00:20:01] Logged the message, Master
[00:20:27] (03PS3) 10BBlack: Turn on r::c::ssl::sni locally for varnishes [puppet] - 10https://gerrit.wikimedia.org/r/175464
[00:21:42] apparently rfc 5077 tickets are IE >= 10
[00:21:57] and not supported by safari or opera :/
[00:22:17] !log maxsem Synchronized php-1.25wmf9/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/175613 (duration: 00m 05s)
[00:22:18] doh that sucks
[00:22:20] Logged the message, Master
[00:22:30] yes
[00:22:31] kaldari|2, ^^^
[00:22:35] we can still work around by rotating the local keys here right?
[00:23:04] to work around it we need to have server-kept sessions
[00:23:17] this means either a distributed cache, or source ip hashing load balancing
[00:23:25] source ip hashing is what we have now
[00:23:29] yes
[00:23:41] I was wondering what's the status of rfc5077 post-SSLv3-death
[00:23:41] distributed cache is not too hard. I've never done it at our scale though
[00:23:51] apparently not good
[00:23:56] bd808: distributed ssl/tls session cache
[00:23:57] (fwiw I'm still a fan of keeping it vs any distributed thing. seems simpler and less failure-prone and syncy and whatnot)
[00:24:19] bblack: ipvs "sh" is pretty buggy though
[00:24:26] paravoid: Yeah. Let me see if I can find the server we used for it at Kount
[00:24:33] well we can unbugify that if need be
[00:24:37] depooling is not instantaneous; weight is ignored
[00:24:46] it's still the right approach, imho
[00:25:12] (03CR) 10BBlack: [C: 032] Turn on r::c::ssl::sni locally for varnishes [puppet] - 10https://gerrit.wikimedia.org/r/175464 (owner: 10BBlack)
[00:25:13] !log maxsem Synchronized php-1.25wmf8/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/175611 (duration: 00m 05s)
[00:25:17] Logged the message, Master
[00:25:27] kaldari|2, ^
[00:25:32] the failure mode of not having a cache synced is just renegotiating with an extra RTT penalty
[00:25:56] MaxSem: thanks, will test...
[00:25:59] and the session is ~120 bytes
[00:26:08] the failure mode of a depool/missing server should be the same, anything else is just a bug that can be fixed
[00:26:15] (in sh-land)
[00:26:24] true :)
[00:26:35] ori: isn't importScript asynchronous?
[00:27:25] apache has a memcache backend, which is appealing
[00:27:46] this is similar in nature to my dislike of shared storage. in general I don't like solutions that require network data synchronization to work right. there's usually a more-scalable way to think about the problem that leavings the scaling bits independent of each other.
[00:27:58] s/leavings/leaves/
[00:28:28] MaxSem: yay, editing works again
[00:30:08] paravoid: http://distcache.sourceforge.net/ is what we used at Kount. There was a memory leak in it I fixed too at some point.
[00:30:24] with apache I assume?
[00:30:31] apache has moved off it
[00:30:37] they have a native memcached backend nowadays
[00:30:46] which works better than distcache AIUI
[00:31:04] (03PS1) 10Dzahn: bugzilla: remove SSL Apache config for old-bz [puppet] - 10https://gerrit.wikimedia.org/r/175615
[00:31:11] That might be nicer. We wanted to use memcached but at the time that backend was alpha
[00:31:36] distcache was basically abandonware when we started using it
[00:33:07] I assume in that scenario, we'd place a memcached on each ssl terminator and distribute the keys between them, etc? so we don't lose too many.
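For reference, the Apache memcached backend mentioned above is mod_socache_memcache in Apache 2.4. A minimal sketch of the relevant mod_ssl configuration, with placeholder memcached addresses standing in for per-terminator instances:

    LoadModule socache_memcache_module modules/mod_socache_memcache.so
    # Keep TLS sessions in a shared memcached pool instead of the default
    # per-server shmcb cache, so any terminator can resume any session.
    SSLSessionCache "memcache:mc1.example.net:11211,mc2.example.net:11211"
    SSLSessionCacheTimeout 600

With this layout, losing one memcached node only costs a full handshake (one extra RTT) for the sessions it held, matching the benign failure mode discussed above.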
[00:33:15] still seems like a PITA compared to "do sh right"
[00:33:34] or just use our memcache cluster :)
[00:33:59] you need to invalidate all of them (switch the master key) every ~24h or so anyway
[00:34:36] I thought master key switches were smooth, though? there's an overlap window on time for existing sessions to drain, etc?
[00:34:57] with which software? :)
[00:35:10] nginx has something like this, yes, and it's unique across servers
[00:36:55] The default SSL session id creation for mod_ssl is horrible as I recall. It rolls up a random number and then traverses a linked list to see if it is used. If it is GOTO 1.
[00:37:58] what a mess :)
[00:38:10] I patched Kount's mod_ssl to use time+host based GUIDs instead and got much better performance (and user tracking :/)
[00:38:36] nice
[00:39:00] ssl session ids are a really sneaky way to cookie someone
[00:39:56] MaxSem, ori: the current [[MediaWiki:Gadget-refToolbar.js]] is an improved version compared to what we had before this: https://en.wikipedia.org/w/index.php?title=MediaWiki_talk:Common.js#Edit_request
[00:39:58] oh and on top of this (and bblack will love this), openssl's callback for fetching a session can't be non-blocking
[00:40:10] lol
[00:40:25] so e.g. nginx gave up on having a cross-server shared cache
[00:40:40] maybe libressl will fix it :)
[00:40:51] and that's why stud went for their own replication thing instead of a memcached backend
[00:41:05] (which someone tried and they threw out)
[00:41:41] doing a distributed cache right, that's not prone to all kinds of failures and splits and scaling issues, is hard
[00:42:23] well
[00:42:28] I can't disagree with that :)
[00:42:38] but the way stud did it (again, AIUI) is
[00:42:44] since the session state is tiny
[00:42:52] they're just replicating everything across servers
[00:43:05] so basically the server just emits the session data via udp multicast
[00:43:12] and all other servers listen to that and inject it into their cache
[00:43:34] so it's not "distributed" in the sense that you have to find the right server to request the session from
[00:43:38] it's just replicated
[00:43:58] and no handshakes essentially
[00:44:04] so it sounds significantly easier
[00:48:20] (03PS1) 10BryanDavis: beta: Fix logging redis address [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175619
[00:49:36] paravoid: will HTTP 2 make TLS session caching less important?
[00:49:41] no
[00:50:07] it's one connection only, so we could keep that open for longer
[00:50:16] doesn't matter
[00:50:45] because we couldn't keep it open long enough?
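The stud replication scheme paravoid describes above (emit each new session over UDP multicast, have every peer inject it into its local cache) is simple enough to sketch. A toy Python illustration of the shape of it; the multicast group, port, and wire format here are invented for the example:

    import socket
    import struct

    GROUP, PORT = "239.255.50.77", 5077   # made-up multicast group/port
    session_cache = {}                    # session id -> session state (~120 bytes)

    def announce(session_id, state):
        # Emit a freshly negotiated TLS session to all peers.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(struct.pack("!B", len(session_id)) + session_id + state,
                    (GROUP, PORT))

    def listen():
        # Receive peers' announcements and inject them into the local cache;
        # no handshakes, no lookup of the "right" server -- pure replication.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", PORT))
        mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        while True:
            datagram, _ = sock.recvfrom(2048)
            id_len = datagram[0]
            session_cache[datagram[1:1 + id_len]] = datagram[1 + id_len:]

A lost datagram just means one client renegotiates with the extra RTT penalty, which is exactly the benign failure mode described earlier in the discussion.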
[00:50:49] plus, by the time HTTP 2 will have significant adoption, RFC 5077 tickets will do too
[00:50:57] and then we won't need server-side session caching :)
[00:50:59] (03PS3) 10BBlack: Switch LVS to use localssl at all sites [puppet] - 10https://gerrit.wikimedia.org/r/175465
[00:51:56] {{caveats, but:}} http://caniuse.com/#feat=spdy
[00:52:06] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 326 seconds
[00:52:15] (03CR) 10Dzahn: [C: 032] bugzilla: remove SSL Apache config for old-bz [puppet] - 10https://gerrit.wikimedia.org/r/175615 (owner: 10Dzahn)
[00:52:45] looks a lot like the set of rfc 5077 supporting browsers
[00:52:58] I didn't check with safari 8 nor opera 25+ though
[00:53:06] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds
[00:53:15] paravoid: if you want to work around DNS geoip stuff to peek, all the sites should be using SNI certs + unified fallback at ulsfo (but not at eqiad/esams yet)
[00:53:33] paravoid: kk
[00:53:37] (03CR) 10Dzahn: [C: 032] old-bugzilla: switch over to misc-web [dns] - 10https://gerrit.wikimedia.org/r/175601 (owner: 10Dzahn)
[00:53:56] so yeah, I guess our glorious http 2 future won't have this problem :)
[00:54:02] but... legacy :/
[00:54:23] I'm gonna toy with it a little while waiting for the other caches to finish running normal puppet intervals, before turning it on in LVS at the other sites
[00:54:25] i see a pending DNS change on ns0
[00:54:31] should i merge it too?
[00:54:40] +scs-c1-codfw 1H IN A 10.193.0.20
[00:54:59] mutante: yes
[00:55:01] ok
[00:55:02] paravoid: a small perf degradation for legacy clients is less dramatic IMHO
[00:55:19] it'll still work
[00:55:50] where legacy is IE <= 10 right now :)
[00:55:50] fwiw: this is new: warning: The global option 'udp_threads' was reduced from the configured value of 4 to 1 for lack of SO_REUSEPORT support
[00:56:15] don't worry about it :)
[00:56:18] fortunately, that's a small market share
[00:56:19] 'k:)
[00:56:29] that message will go away when those boxes are updated to trusty for a better kernel
[00:56:37] * mutante nods
[00:56:37] and shrinking
[00:56:46] I just didn't feel like conditionalizing it in the template since it's just a warning
[00:58:23] (03CR) 10BryanDavis: [C: 032] "beta only change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175619 (owner: 10BryanDavis)
[00:58:38] (03Merged) 10jenkins-bot: beta: Fix logging redis address [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175619 (owner: 10BryanDavis)
[00:58:40] gwicke: > 6% by a quick count
[00:58:59] paravoid: https://gerrit.wikimedia.org/r/#/c/172780/ will enable the cassandra role on the test boxes
[00:59:17] !log bd808 Synchronized wmf-config/logging-labs.php: Update labs logging config (duration: 00m 05s)
[00:59:21] Logged the message, Master
[01:00:04] awight, AndyRussG, ejegg: Respected human, time to deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141125T0100). Please do the needful.
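On the "smooth master key switch" question above: nginx (1.5.7+) accepts multiple ssl_session_ticket_key directives, where the first file encrypts new RFC 5077 tickets and the remaining ones are only used to decrypt older tickets, which provides exactly the overlap/drain window discussed earlier. A hedged sketch, with made-up paths and an assumed external job that rotates the key files roughly daily:

    ssl_session_tickets on;
    # Newest key first: encrypts all new session tickets.
    ssl_session_ticket_key /etc/nginx/tickets/current.key;
    # Previous key: still accepted for decryption while old tickets drain.
    ssl_session_ticket_key /etc/nginx/tickets/previous.key;

Since the key files are just 48-byte random blobs, the rotation job can generate them with something like "openssl rand 48 > current.key" and ship the same files to every terminator so tickets stay valid across servers.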
[01:01:20] (03CR) 10Faidon Liambotis: [C: 04-1] Give parsoid-admins access to ruthenium; split cassandra test hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/172780 (owner: 10Cscott)
[01:03:09] !log puppet back to normal on caches
[01:03:17] Logged the message, Master
[01:03:19] PROBLEM - HTTPS_wikimediafoundation.org on cp1064 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:19] PROBLEM - HTTPS_zero.wikipedia.org on cp1067 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:20] PROBLEM - HTTPS_wikiquote.org on cp1052 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:20] PROBLEM - HTTPS_wikipedia.org on cp1054 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:20] PROBLEM - HTTPS_zero.wikipedia.org on amssq45 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:20] PROBLEM - HTTPS_m.wikipedia.org on cp1052 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:20] PROBLEM - HTTPS_m.wikimediafoundation.org on cp1053 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:21] PROBLEM - HTTPS_m.wikimedia.org on cp1064 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:21] PROBLEM - HTTPS_wikinews.org on cp1053 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:22] PROBLEM - HTTPS_wikimediafoundation.org on cp1040 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:22] PROBLEM - HTTPS_wikibooks.org on cp1066 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:23] PROBLEM - HTTPS_m.mediawiki.org on cp1066 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:23] PROBLEM - HTTPS_mediawiki.org on cp1037 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:24] PROBLEM - HTTPS_m.wikinews.org on cp1054 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:24] PROBLEM - HTTPS_m.wiktionary.org on cp1067 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:25] PROBLEM - HTTPS_m.wikivoyage.org on cp1068 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:25] PROBLEM - HTTPS_wikibooks.org on cp3017 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:26] PROBLEM - HTTPS_m.wikimedia.org on cp1040 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:03:26] PROBLEM - HTTPS_wiktionary.org on cp1068 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:06:13] bblack: everything ok?
[01:08:33] paravoid: I think so, yet
[01:08:39] *yes
[01:08:45] ok
[01:08:59] aww man. I checked out mediawiki-core origin/master, into /srv/mediawiki-staging. Trying to mop up my footprints, now.
[01:09:01] the change only has real effect so far at ulsfo, and those seem to be functionally fine
[01:09:02] I'm toast, so I think I'll hit my bed
[01:09:08] nite :)
[01:09:20] (03CR) 10GWicke: Give parsoid-admins access to ruthenium; split cassandra test hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/172780 (owner: 10Cscott)
[01:09:23] page me if you need more eyes etc. :)
[01:09:28] ok
[01:09:48] paravoid: thanks for the review, and enjoy your rest!
[01:10:12] don't even ask me what "SSL_CERT CRITICAL: Error: verify depth is 6 " means yet heh
[01:10:18] gwicke: yes, split it into a different commit please :)
[01:10:39] but it's that x24 certs x80 machines or whatever for all the non-ulsfo caches, but LVS isn't using them for HTTPS yet regardless, and it's probably just a monitoring problem.
[01:10:57] at least icinga-wm had the decency to flood itself off the channel!
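That "verify depth is 6" flood is easier to interpret with a manual probe against one of the alerting hosts; something along these lines (the internal .eqiad.wmnet suffix is assumed here, and this is an illustrative hand check, not what the icinga plugin itself runs):

    # Probe the SNI cert on one of the complaining caches; if nothing is
    # listening on :443 yet, the check plugin's error text can be misleading.
    echo | openssl s_client -connect cp1064.eqiad.wmnet:443 \
        -servername wikimediafoundation.org 2>&1 | head -n 5

As it turns out just below, the cryptic message simply meant "Connection refused".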
[01:11:14] I need a hand, here. How can I checkout /srv/mediawiki-staging/php-1.25wmf8 to look like it's supposed to?
[01:11:39] paravoid: will do
[01:11:41] I imagine it's something like "git checkout -f wmf/1.25wmf8", then maybe apply the security patches
[01:12:21] Reedy, ^^^
[01:12:32] awight: You should be able to fetch and rebase
[01:12:45] gwicke: let's do it now
[01:12:56] oh, you know, it could be that I didn't bother actually starting nginx on all those hosts yet!
[01:12:57] bd808: I damaged the directory, by "git rebase origin/master" like a fool
[01:13:00] so you can have something to play with while I'm asleep :)
[01:13:06] will you or should I?
[01:13:11] bd808: then -C'd that
[01:13:13] bblack: I was about to ask...
[01:13:22] but I thought it was too much of a stupid question
[01:13:51] awight: reflog looks like you could checkout 752538e
[01:14:18] RECOVERY - HTTPS_wikipedia.org on cp1037 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:06:02 2015 GMT (expires in 363 days)
[01:14:18] RECOVERY - HTTPS_mediawiki.org on cp1054 is OK: SSL_CERT OK - X.509 certificate for *.mediawiki.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:51:02 2015 GMT (expires in 363 days)
[01:14:18] RECOVERY - HTTPS_m.wiktionary.org on cp1053 is OK: SSL_CERT OK - X.509 certificate for *.m.wiktionary.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:46:07 2015 GMT (expires in 363 days)
[01:14:18] RECOVERY - HTTPS_wiktionary.org on cp1040 is OK: SSL_CERT OK - X.509 certificate for *.wiktionary.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:46:05 2015 GMT (expires in 363 days)
[01:14:18] RECOVERY - HTTPS_m.wikivoyage.org on cp1064 is OK: SSL_CERT OK - X.509 certificate for *.m.wikivoyage.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:46:02 2015 GMT (expires in 363 days)
[01:14:23] heh
[01:14:27] yay, killed icinga-wm again!
[01:14:32] score!
[01:15:02] ^demon|headache, Reedy: either of you about, to help awight with a bit of a git problem on tin?
[01:15:14] (03PS6) 10GWicke: Give parsoid-admins access to ruthenium; split cassandra test hosts [puppet] - 10https://gerrit.wikimedia.org/r/172780 (owner: 10Cscott)
[01:15:48] note to future self: "Error: verify depth is 6" == "Connection refused"
[01:15:49] gwicke: I'll fix the cassandra commit
[01:15:59] it's wrong anyway
[01:17:10] icinga-wm: welcome back!
[01:17:45] awight: It looks like you've got it fixed?
[01:19:20] (03PS1) 10Faidon Liambotis: Switch Cassandra test hosts to the new role class [puppet] - 10https://gerrit.wikimedia.org/r/175629
[01:19:29] bd808: Reedy: MaxSem is helping me, but we're wondering about autoload.php. It was created during one of my stupid maneuvers. How can I reset it?
[01:19:52] (03PS4) 10BBlack: Switch LVS to use localssl at all sites [puppet] - 10https://gerrit.wikimedia.org/r/175465
[01:20:01] (03PS2) 10Faidon Liambotis: Switch Cassandra test hosts to the new role class [puppet] - 10https://gerrit.wikimedia.org/r/175629
[01:20:08] i'm getting a redirect loop on bz-old but i dunno where from .. hrmm
[01:20:27] awight: Just rm it? It's not tracked
[01:20:36] bd808: does it get regenerated?
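The recovery bd808 points at works because git's reflog still records where the checkout was before the accidental rebase; a sketch of the sequence, where the commit hash is the one quoted in the log and the rest is generic git:

    cd /srv/mediawiki-staging/php-1.25wmf8
    git reflog                    # find the last known-good state, e.g. 752538e
    git reset --hard 752538e      # move the branch back, discarding the rebase
    git status --ignored          # untracked leftovers will still be listed

Untracked files such as the stray autoload.php are not touched by the reset, which is why bd808's "just rm it" above is the remaining cleanup step.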
[01:20:38] which I think means it shouldn't exist
[01:21:28] includes/AutoLoader.php is the real autoloader script
[01:21:41] !log disabling puppet on lvs[13]00[1-6] for SSL-related changes
[01:21:46] Logged the message, Master
[01:22:16] (03CR) 10BBlack: [C: 032] Switch LVS to use localssl at all sites [puppet] - 10https://gerrit.wikimedia.org/r/175465 (owner: 10BBlack)
[01:23:14] (03PS7) 10GWicke: Give parsoid-admins access to ruthenium; split cassandra test hosts [puppet] - 10https://gerrit.wikimedia.org/r/172780 (owner: 10Cscott)
[01:23:16] (03PS1) 10GWicke: Enable cassandra & restbase test hosts [puppet] - 10https://gerrit.wikimedia.org/r/175630
[01:23:19] bd808: /srv/mediawiki/php-1.25wmf9/autoload.php
[01:23:34] gwicke: too late dude :)
[01:24:19] (03CR) 10GWicke: Give parsoid-admins access to ruthenium; split cassandra test hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172780 (owner: 10Cscott)
[01:25:09] awight: yeah. That looks like complete junk to me. There is one in 1.25wmf9 but it is a tracked file there.
[01:25:14] paravoid: hey, go to bed ;)
[01:25:26] oh wait you are asking about wmf8 or wmf9?
[01:25:29] PROBLEM - HTTPS_m.wikisource.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:25:29] PROBLEM - HTTPS_wikivoyage.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:25:30] gwicke: I pinged you like 3-4 times here :)
[01:25:48] yeah, sorry
[01:25:51] PROBLEM - HTTPS_wiktionary.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:25:51] PROBLEM - HTTPS_m.wikiversity.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:25:59] I don't have pings & was busy doing my patches
[01:26:10] PROBLEM - HTTPS_m.wikivoyage.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:26:10] PROBLEM - HTTPS_zero.wikipedia.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:26:23] (03PS3) 10Faidon Liambotis: Switch Cassandra test hosts to the new role class [puppet] - 10https://gerrit.wikimedia.org/r/175629
[01:26:25] (03PS8) 10Faidon Liambotis: Give parsoid-admins access to ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/172780 (owner: 10Cscott)
[01:26:33] gwicke: is there an RT for "ruthenium for RTT testing"?
[01:26:43] PROBLEM - HTTPS_m.wiktionary.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:27:10] PROBLEM - HTTPS_mediawiki.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:27:23] paravoid: yeah, let me look in my mail
[01:27:27] #6980 ?
[01:27:39] PROBLEM - HTTPS_unified on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:27:51] PROBLEM - HTTPS_wikibooks.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:27:59] PROBLEM - HTTPS_wikidata.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:27:59] PROBLEM - HTTPS_m.mediawiki.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:28:00] possibly, thunderbird is impossibly slow at searching recently
[01:28:06] https://rt.wikimedia.org/Ticket/Display.html?id=6980
[01:28:12] that mentions *two* systems though
[01:28:15] PROBLEM - HTTPS_wikimedia.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:28:18] PROBLEM - HTTPS_m.wikibooks.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:28:25] paravoid: 7192 possibly
[01:28:33] PROBLEM - HTTPS_m.wikidata.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:28:35] PROBLEM - HTTPS_wikimediafoundation.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:28:42] and 6980, yeah
[01:28:48] PROBLEM - HTTPS_wikinews.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:28:51] PROBLEM - HTTPS_m.wikimedia.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:28:53] you know that procurement request for 2 systems was never fulfilled
[01:28:59] that's just one system as far as I can see?
[01:29:01] PROBLEM - HTTPS_m.wikimediafoundation.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:29:01] PROBLEM - HTTPS_wikipedia.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:29:05] paravoid: we determined that we can do with one test box for now
[01:29:22] PROBLEM - HTTPS_m.wikinews.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:29:22] PROBLEM - HTTPS_wikiquote.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:29:28] the bottleneck is the database with the results these days
[01:29:39] PROBLEM - HTTPS_m.wikipedia.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:29:39] PROBLEM - HTTPS_wikisource.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:29:42] until that's fixed more workers won't help
[01:29:46] (03PS9) 10Faidon Liambotis: Give parsoid-admins access to ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/172780 (owner: 10Cscott)
[01:30:09] PROBLEM - HTTPS_m.wikiquote.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:30:09] PROBLEM - HTTPS_wikiversity.org on amssq42 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[01:30:18] (03CR) 10Faidon Liambotis: [C: 032] Switch Cassandra test hosts to the new role class [puppet] - 10https://gerrit.wikimedia.org/r/175629 (owner: 10Faidon Liambotis)
[01:30:36] (03CR) 10Faidon Liambotis: [C: 032] Give parsoid-admins access to ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/172780 (owner: 10Cscott)
[01:30:44] sorry, mine are better :)
[01:30:53] okay ;)
[01:30:57] (system::role doesn't actually include a role, it just sets up motd)
[01:30:57] thank you!
[01:31:32] I'll split out the hiera data update then?
[01:31:42] that's a restbase thing isn't it?
[01:31:49] both restbase & cassandra
[01:32:00] https://gerrit.wikimedia.org/r/#/c/175630/1/hieradata/eqiad.yaml
[01:32:07] oh damn
[01:32:19] I got fooled by the RESTBase comment
[01:32:37] it's no problem
[01:32:43] we can add that in a follow-up
[01:32:44] role::restbase is not included anyhow
[01:32:51] *nod*
[01:32:53] oh geez, I forgot that amssq42 was already upgraded to trusty as a test heh (and doesn't have our custom nginx with udplog stuff)
[01:33:12] do you also need restbase right now?
[01:33:18] bblack: sshhhhhh
[01:33:19] paravoid: let's do that tomorrow
[01:33:24] bblack: kill it silently :P
[01:33:44] yeah I'm gonna downtime it and take it out of frontends and deal with that later :)
[01:33:54] no, kill ssl udp2log
[01:33:58] paravoid: I would like to test the combo, but it's fine if we finish this tomorrow
[01:34:17] we don't use it?
[01:34:41] or don't care?
[01:34:47] bblack: we might? :) I think there's been a discussion with analytics to kill it for about... two and a half years now
[01:35:05] ok
[01:35:16] gwicke: I don't really know much about hiera anyway though
[01:35:35] I really hate this flat eqiad.yaml, and I think there is an alternative, but I'm not sure if that's in prod yet
[01:35:53] (I need that hate puppet flag by my monitor too)
[01:35:54] paravoid: don't worry, I'll add a follow-up & then let giuseppe figure it out
[01:36:31] let's split everything from manifests into neat modules under a directory hierarchy!
[01:36:34] let's move to hiera!
[01:36:40] RECOVERY - HTTPS_m.wikimedia.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.m.wikimedia.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:26:07 2015 GMT (expires in 363 days)
[01:36:41] RECOVERY - HTTPS_wikinews.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.wikinews.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:31:09 2015 GMT (expires in 363 days)
[01:36:41] go to step 0
[01:36:49] I would like to avoid the duplication in the cluster configuration too
[01:36:50] RECOVERY - HTTPS_m.wikimediafoundation.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.m.wikimediafoundation.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:31:07 2015 GMT (expires in 363 days)
[01:36:50] RECOVERY - HTTPS_wikipedia.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:06:02 2015 GMT (expires in 363 days)
[01:37:00] RECOVERY - HTTPS_m.wiktionary.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.m.wiktionary.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:46:07 2015 GMT (expires in 363 days)
[01:37:07] but not super high priority
[01:37:09] RECOVERY - HTTPS_wikiquote.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.wikiquote.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:36:05 2015 GMT (expires in 363 days)
[01:37:09] RECOVERY - HTTPS_m.wikinews.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.m.wikinews.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:36:02 2015 GMT (expires in 363 days)
[01:37:18] RECOVERY - HTTPS_m.wikipedia.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.m.wikipedia.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:16:02 2015 GMT (expires in 363 days)
[01:37:19] RECOVERY - HTTPS_wikisource.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.wikisource.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:36:09 2015 GMT (expires in 363 days)
[01:37:28] RECOVERY - HTTPS_mediawiki.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.mediawiki.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:51:02 2015 GMT (expires in 363 days)
[01:37:39] I /hope/ there's a better way than a single file that has "openstack::version" then "cassandra::seeds" right underneath it
[01:37:49] I hope so too ;)
[01:37:50] RECOVERY - HTTPS_wikiversity.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.wikiversity.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:41:04 2015 GMT (expires in 363 days)
[01:37:50] RECOVERY - HTTPS_m.wikiquote.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.m.wikiquote.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:36:07 2015 GMT (expires in 363 days)
[01:37:58] or else I don't really see the point, role classes were fine for that
[01:38:03] RECOVERY - HTTPS_unified on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:06:02 2015 GMT (expires in 363 days)
[01:38:09] RECOVERY - HTTPS_wikibooks.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.wikibooks.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:21:03 2015 GMT (expires in 363 days)
[01:38:18] RECOVERY - HTTPS_wikidata.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.wikidata.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:26:02 2015 GMT (expires in 363 days)
[01:38:18] RECOVERY - HTTPS_m.mediawiki.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.m.mediawiki.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:51:04 2015 GMT (expires in 363 days)
[01:38:18] but anyway, let's not block this on that discussion :)
[01:38:22] RECOVERY - HTTPS_wikivoyage.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.wikivoyage.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:41:09 2015 GMT (expires in 363 days)
[01:38:22] RECOVERY - HTTPS_m.wikisource.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.m.wikisource.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:41:02 2015 GMT (expires in 363 days)
[01:38:38] RECOVERY - HTTPS_m.wikibooks.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.m.wikibooks.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:21:05 2015 GMT (expires in 363 days)
[01:38:38] RECOVERY - HTTPS_wikimedia.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.wikimedia.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 4 21:06:06 2015 GMT (expires in 345 days)
[01:38:51] RECOVERY - HTTPS_wikimediafoundation.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.wikimediafoundation.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:31:02 2015 GMT (expires in 363 days)
[01:38:51] RECOVERY - HTTPS_wiktionary.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.wiktionary.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:46:05 2015 GMT (expires in 363 days)
[01:38:51] RECOVERY - HTTPS_m.wikidata.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.m.wikidata.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:26:04 2015 GMT (expires in 363 days)
[01:38:51] RECOVERY - HTTPS_m.wikiversity.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.m.wikiversity.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:41:07 2015 GMT (expires in 363 days)
[01:39:04] ok, I'm gonna go...
[01:39:13] paravoid: thanks for your help!
[01:39:18] & good night.
[01:39:26] RECOVERY - HTTPS_zero.wikipedia.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.zero.wikipedia.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:16:05 2015 GMT (expires in 363 days)
[01:39:26] RECOVERY - HTTPS_m.wikivoyage.org on amssq42 is OK: SSL_CERT OK - X.509 certificate for *.m.wikivoyage.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Nov 22 18:46:02 2015 GMT (expires in 363 days)
[01:39:48] others can merge a simple "include role::restbase" + hieradata too if that's all that's needed and you feel like finishing this up today :)
[01:40:03] *nod*
[01:40:39] opsens, feel free to +2 on my behalf if that's all that's needed :)
[01:40:42] bbye
[01:40:50] bye!
[01:40:57] nite paravoid!
[01:42:29] !log awight Synchronized php-1.25wmf8/extensions/CentralNotice: push CentralNotice updates (duration: 00m 06s)
[01:42:41] Logged the message, Master
[01:43:21] !log awight Synchronized php-1.25wmf9/extensions/CentralNotice: push CentralNotice updates (duration: 00m 05s)
[01:43:23] Logged the message, Master
[01:45:21] What's the mailing list for reporting stupid things you did during deployment?
[01:46:01] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Puppet has 1 failures
[01:46:49] (03PS1) 10GWicke: Include and configure the restbase role on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/175633
[01:47:55] omg, now old-bz works
[01:48:09] there was an ssl_redirect setting in BZ itself
[01:48:19] had to find and disable that too, or redirect loop
[01:48:31] I like.
[01:48:35] (03Abandoned) 10GWicke: Enable cassandra & restbase test hosts [puppet] - 10https://gerrit.wikimedia.org/r/175630 (owner: 10GWicke)
[01:48:56] andre__: :) also see how http becomes https .. as it should
[01:49:07] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Puppet has 1 failures
[01:49:59] awight: engineering@lists.wikimedia.org is appropriate
[01:50:06] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Puppet has 1 failures
[01:50:40] awight: And it's only a stupid mistake if you don't learn anything from having made it.
[01:50:42] :)
[01:51:20] !log esams+eqiad backup LVS converted to new ssl config (lvs100[45] + lvs300[34])
[01:51:26] Logged the message, Master
[01:52:13] bd808: thanks :) I'm not sure if it counts as learning if I have to learn multiple times, though...
[01:52:43] well, that's not fair. What I mean to say is, I'd like to stop learning this lesson ;)
[01:52:57] heh
[01:54:39] !log switching off pybal on primary LVS in eqiad for HTTPS check
[01:54:41] Logged the message, Master
[01:55:59] !log switching off pybal on primary LVS in esams for HTTPS check
[01:56:03] Logged the message, Master
[01:56:48] if anyone is idly around here and cares to triple-check work: you could make sure that HTTPS works fine for our sites from wherever you're at with whatever device :)
[01:58:52] (03PS1) 10BryanDavis: beta: Use password to connect to logstash redis server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175640
[01:59:16] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:59:49] bblack: seems fine from office with iceweasel
[01:59:53] (03CR) 10BryanDavis: beta: Use password to connect to logstash redis server (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175640 (owner: 10BryanDavis)
[02:03:35] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:04:53] (03CR) 10Manybubbles: [C: 031] beta: Use password to connect to logstash redis server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175640 (owner: 10BryanDavis)
[02:05:00] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:07:41] !log all LVS back to normal runtime state w/ new SSL config
[02:07:45] Logged the message, Master
[02:10:15] (03PS3) 10BBlack: remove legacy protoproxy config [puppet] - 10https://gerrit.wikimedia.org/r/175466
[02:10:39] ^ leaving that one for tomorrow. if something does happen to go amiss with all of this, it will be easier to revert if that's not done yet.
[02:15:21] !log old-bugzilla now behind varnish too, cert issue should be gone
[02:15:24] Logged the message, Master
[02:16:59] (03PS1) 10Dzahn: bugzilla: old-bz, keep enforcing https [puppet] - 10https://gerrit.wikimedia.org/r/175646
[02:18:03] (03CR) 10Dzahn: [C: 032] bugzilla: old-bz, keep enforcing https [puppet] - 10https://gerrit.wikimedia.org/r/175646 (owner: 10Dzahn)
[02:18:49] !log l10nupdate Synchronized php-1.25wmf8/cache/l10n: (no message) (duration: 00m 01s)
[02:18:53] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-25 02:18:52+00:00
[02:18:53] Logged the message, Master
[02:18:55] Logged the message, Master
[02:31:44] !log l10nupdate Synchronized php-1.25wmf9/cache/l10n: (no message) (duration: 00m 01s)
[02:31:48] !log LocalisationUpdate completed (1.25wmf9) at 2014-11-25 02:31:48+00:00
[02:31:49] Logged the message, Master
[02:31:51] Logged the message, Master
[02:59:21] greg-g: can I haz a deploy... now?
[03:07:31] (03PS1) 10Springle: MariaDB config tweakes for m1,m2,m3 [puppet] - 10https://gerrit.wikimedia.org/r/175657
[03:07:40] greg-g: ok I'm self-serving :)
[03:08:17] PROBLEM - mysqld processes on db1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[03:08:37] oh.. me^
[03:09:22] (03PS1) 10Awight: Disabling CentralNotice client-side banner choice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175659
[03:09:30] (03CR) 10Springle: [C: 032] MariaDB config tweakes for m1,m2,m3 [puppet] - 10https://gerrit.wikimedia.org/r/175657 (owner: 10Springle)
[03:10:01] (03CR) 10Awight: [C: 032] Disabling CentralNotice client-side banner choice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175659 (owner: 10Awight)
[03:10:13] (03Merged) 10jenkins-bot: Disabling CentralNotice client-side banner choice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175659 (owner: 10Awight)
[03:11:52] !log awight Synchronized wmf-config: Disabling CentralNotice client banner choice due to T75812 (duration: 00m 05s)
[03:12:14] Logged the message, Master
[03:13:57] RECOVERY - mysqld processes on db1020 is OK: PROCS OK: 1 process with command name mysqld
[03:16:25] !log m2 db1020 rebuilt, but blocked from dbproxy1002 until replag=0
[03:16:27] Logged the message, Master
[03:53:45] !log Jenkins is unable to create new user sessions. Suspect LDAP is having issues.
[03:53:51] Logged the message, Master
[03:54:11] Can't verify debug log or change config since that's behind the same GUI
[03:59:39] akosiaris: mutante: greg-g: paravoid: Reedy: ^ - if anyone could check that out (see also # _security), that'd be cool. Try at https://integration.wikimedia.org/ci/login with ldap login.
[03:59:42] o/ zzz
[04:09:45] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:20:44] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[04:22:50] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Nov 25 04:22:50 UTC 2014 (duration 22m 49s)
[04:22:53] Logged the message, Master
[04:31:35] !log restarted jenkins
[04:31:37] Logged the message, Master
[04:33:01] Please wait while Jenkins is getting ready to work.
[04:33:33] either it takes forever to start, or it's got a problem
[04:36:46] according to email from timo on oct 28 it does indeed take a very long time
[04:41:01] it's back but i still can't auth
[04:41:10] neither of these seem relevant because the fix requires you to be auth'd: https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues
[04:41:33] on the other hand, that uptime problem is now solved ;)
[04:41:47] nah i didn't reboot the box just restarted the service
[04:41:51] didn't want to change more variables
[04:42:06] because if i'm gonna reboot i'd want to upgrade
[04:43:18] i reviewed antoine's "leaving irc" thread but didn't find the answer to this problem
[04:43:55] andrew merged a change earlier removing the ldap role from virt1000 and updating keystone's configuration to point to ldap-eqiad instead
[04:44:15] is jenkins's configuration referencing virt1000, perhaps?
[04:44:22] hmm let's see..
[04:44:33] i have other work i really need to be doing if anyone else is lurking :(
[04:44:52] i'm not, really
[04:47:07] the only jenkins config i see is /etc/default/jenkins which doesn't say anything about auth
[04:48:01] whee looks like it's configged via gui
[04:53:57] blah opendj logs to /var/opendj/instance/logs/ on neptunium
[04:54:04] thank you lsof
[05:05:03] i'm not making any progress, and i need to afk. ori's theory that jenkins has virt1000 hardcoded when it should say ldap-eqiad.wikimedia.org seems likely but i'm not sure how to resolve it. seems like there should be some non-ldap admin login to allow us to update jenkins' ldap config but i don't see it in our passwords repository.
[05:54:17] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail
[06:00:06] yes, jenkins uses virt1000
[06:00:36] the config is in /var/lib/jenkins/config.xml
[06:01:07] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:06:17] !log restarted gitblit
[06:06:18] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 58823 bytes in 0.089 second response time
[06:06:23] Logged the message, Master
[06:08:00] heh, again
[06:08:02] * YuviPanda waves at mutante
[06:08:06] no sleep?
[06:08:10] !log in response to jenkins login issue reported by krinkle: /var/lib/jenkins/xml.config on gallium had "virt1000" value for LDAP, earlier Andrew made a switch from there to ldap-eqiad. fixed config, restarted jenkins
[06:08:14] Logged the message, Master
[06:08:43] YuviPanda: hey
[06:08:51] i.. changed the type of error
[06:09:40] now it's fixed:)
[06:09:48] Krinkle|detached: ^ fixed
[06:10:27] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:10:47] mutante: can you add me to security team in phab?
[06:10:52] already have access to security@, etc
[06:11:05] YuviPanda: i don't know
[06:11:08] oh
[06:11:10] ok
[06:11:19] nbd, I'll just ask around when chris is around later
[06:12:25] yes, please, i really don't know if i can
[06:12:39] yeah, thanks :)
[06:21:41] yay thanks mutante. perfect place for a config file :P
[06:26:07] jgage: true, yea and i wonder if a change in web ui writes that XML, but i guess it has to, it's not puppet
[06:26:18] laters though /me waves
[06:31:56] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: puppet fail
[06:32:50] mutante: I can investigate
[06:32:53] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:55] mutante: if you want to go get some sleep :)
[06:33:34] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:40] YuviPanda: it's odd that it was gallium, because that's the jenkins box, but it must be coincidence
[06:33:41] mutante: puppetmaster failure
[06:33:44] yeah
[06:33:44] ooh
[06:33:44] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:47] co-incidence
[06:33:53] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:54] oh, can you restart apache
[06:34:01] yeah, on the way now
[06:34:01] it will be a mod_passenger issue
[06:34:04] cool
[06:34:20] (fwiw, puppet runs fine on gallium it claims)
[06:34:45] mutante: yeah, transient failure
[06:34:46] looks already fixed. good
[06:34:49] thx
[06:34:59] been happening more frequently of late
[06:35:06] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:07] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:35:32] we should have a graph or something
[06:35:36] uptime of master
[06:35:50] heh
[06:36:19] ok:) cya later then
[06:36:31] cya
[06:36:43] (log it)
[06:37:01] !log restarted apache on strontium, was seeing transient puppetmaster fails
[06:37:01] so we see if it really is more often .. /away
[06:37:03] Logged the message, Master
[06:41:32] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:42:31] PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:42:54] seems like we've had a lot of transient puppet failures lately
[06:43:54] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: puppet fail
[06:45:03] jgage: yeah.
[06:45:13] (03PS1) 10Yuvipanda: puppetmaster: Add script to track upstream changes [puppet] - 10https://gerrit.wikimedia.org/r/175663
[06:46:14] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:46:14] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:46:28] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:47:06] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:47:16] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:41] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:48:28] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:48:58] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:50:36] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:52:48] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:52:48] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:54:30] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:11] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:29] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:59] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:57:27] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:58:07] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:13] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:58:16] PROBLEM - puppet last run on virt1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:58:48] (03PS2) 10Yuvipanda: puppetmaster: Add script to track upstream changes [puppet] - 10https://gerrit.wikimedia.org/r/175663
[06:58:54] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:58:54] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[07:00:43] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:04:00] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:05:36] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:05:37] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:06:15] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[07:09:10] RECOVERY - puppet last run on virt1003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[07:11:29] RECOVERY - puppet last run on ms-be1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:12:02] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[07:16:26] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[07:37:03] (03PS3) 10Yuvipanda: puppetmaster: Add script to track upstream changes [puppet] - 10https://gerrit.wikimedia.org/r/175663
[09:10:15] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: puppet fail
[09:27:00] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:53:37] PROBLEM - Host mw1255 is DOWN: PING CRITICAL - Packet loss = 100%
[09:53:59] <_joe_> mh, strange
[09:54:34] <_joe_> ouch that was me
[09:54:35] <_joe_> :/
[09:55:12] RECOVERY - Host mw1255 is UP: PING WARNING - Packet loss = 73%, RTA = 3.91 ms
[10:25:59] PROBLEM - Disk space on mw1224 is CRITICAL: Connection refused by host
[10:25:59] PROBLEM - HHVM processes on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:27:39] PROBLEM - RAID on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:28:05] PROBLEM - mediawiki-installation DSH group on mw1221 is CRITICAL: Host mw1221 is not in mediawiki-installation dsh group
[10:28:21] PROBLEM - check configured eth on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:28:26] PROBLEM - check if dhclient is running on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:28:26] PROBLEM - mediawiki-installation DSH group on mw1222 is CRITICAL: Host mw1222 is not in mediawiki-installation dsh group
[10:28:45] PROBLEM - check if salt-minion is running on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:28:46] RECOVERY - Disk space on mw1224 is OK: DISK OK
[10:29:05] PROBLEM - mediawiki-installation DSH group on mw1223 is CRITICAL: Host mw1223 is not in mediawiki-installation dsh group
[10:29:16] PROBLEM - nutcracker port on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:29:16] PROBLEM - DPKG on mw1222 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:29:40] PROBLEM - mediawiki-installation DSH group on mw1224 is CRITICAL: Host mw1224 is not in mediawiki-installation dsh group
[10:29:41] PROBLEM - nutcracker process on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:29:46] PROBLEM - puppet last run on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:29:46] PROBLEM - DPKG on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:29:55] PROBLEM - Disk space on mw1223 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:31:17] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Puppet has 112 failures
[10:34:48] RECOVERY - DPKG on mw1222 is OK: All packages OK
[10:34:48] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 8 failures
[10:35:35] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 112 failures
[10:36:19] RECOVERY - check if dhclient is running on mw1223 is OK: PROCS OK: 0 processes with command name dhclient
[10:36:49] RECOVERY - check if salt-minion is running on mw1223 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:36:50] RECOVERY - HHVM processes on mw1223 is OK: PROCS OK: 1 process with command name hhvm
[10:37:19] RECOVERY - nutcracker port on mw1223 is OK: TCP OK - 0.000 second response time on port 11212
[10:37:41] RECOVERY - nutcracker process on mw1223 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[10:38:09] RECOVERY - Disk space on mw1223 is OK: DISK OK
[10:38:30] RECOVERY - RAID on mw1223 is OK: OK: no RAID installed
[10:38:59] RECOVERY - check configured eth on mw1223 is OK: NRPE: Unable to read output
[10:41:12] PROBLEM - HHVM rendering on mw1221 is CRITICAL: Connection timed out
[10:42:41] PROBLEM - HHVM rendering on mw1222 is CRITICAL: Connection timed out
[10:45:15] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[10:45:16] RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 69879 bytes in 9.632 second response time
[10:45:45] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[10:46:08] PROBLEM - HHVM rendering on mw1223 is CRITICAL: Connection refused
[10:46:59] RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 69878 bytes in 0.250 second response time
[10:48:55] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 69879 bytes in 3.357 second response time
[10:49:06] RECOVERY - DPKG on mw1223 is OK: All packages OK
[10:49:06] RECOVERY - puppet last run on mw1223 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[10:49:40] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:51:50] <_joe_> sorry for the spam
[11:01:54] (03PS1) 10Giuseppe Lavagetto: dsh: add new mediawiki appservers [puppet] - 10https://gerrit.wikimedia.org/r/175672
[11:05:00] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Application%2520servers%2520eqiad&tab=m&vn=&hide-hf=false
[11:05:04] There's quite a few idle apaches...
[11:05:17] new ones probably :)
[11:05:45] Yeah, I just saw joe's commit
[11:05:52] <_joe_> Reedy: what paravoid said
[11:06:02] <_joe_> Reedy: they'll be live this evening
[11:06:02] yay :)
[11:06:06] PROBLEM - DPKG on mw1225 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:06:16] PROBLEM - Host mw1226 is DOWN: PING CRITICAL - Packet loss = 100%
[11:06:56] RECOVERY - Host mw1226 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[11:07:07] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Puppet has 2 failures
[11:07:07] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Puppet has 112 failures
[11:08:05] _joe_: thoughts on proposed per-hostish solution in last comment of https://phabricator.wikimedia.org/T1356? if you're ok with that I'll whip up the necessary wikitech patches
[11:08:26] <_joe_> YuviPanda: I'll take a look in a few
[11:08:30] _joe_: thanks
[11:08:41] <_joe_> I'm in the middle of reimaging right now
[11:09:10] Reedy: can you explain to me some maintenance job issue?
[11:09:21] matanya: Dunno... What is it?
[11:09:22] <_joe_> I just realized that, given I reimage 10 servers a day, it will still take me 4 weeks to be done with reimaging :/
[11:09:39] 14.04-ing?
[11:09:54] <_joe_> Reedy: and HHVM-ing
[11:09:54] Reedy: https://he.wikipedia.org/wiki/%D7%9E%D7%99%D7%95%D7%97%D7%93:%D7%93%D7%A4%D7%99%D7%9D_%D7%A9%D7%9C%D7%90_%D7%9E%D7%A7%D7%95%D7%A9%D7%A8%D7%99%D7%9D_%D7%9C%D7%A4%D7%A8%D7%99%D7%98%D7%99%D7%9D?page=&submit=%D7%94%D7%A8%D7%A6%D7%94&iwdata=only show pages without a link to wikidata
[11:09:57] <_joe_> yes
[11:10:02] Awesome
[11:10:07] PROBLEM - mediawiki-installation DSH group on mw1225 is CRITICAL: Host mw1225 is not in mediawiki-installation dsh group
[11:10:11] https://he.wikipedia.org/wiki/%D7%90%D7%95%D7%A4%D7%A8%D7%A0%D7%95%D7%A8 is on the list
[11:10:26] but it is linked to wikidata: https://www.wikidata.org/wiki/Q949770
[11:10:38] so why is it on the list anyway ?
[11:10:44] PROBLEM - mediawiki-installation DSH group on mw1243 is CRITICAL: Host mw1243 is not in mediawiki-installation dsh group
[11:10:44] PROBLEM - mediawiki-installation DSH group on mw1226 is CRITICAL: Host mw1226 is not in mediawiki-installation dsh group
[11:10:49] a delayed cron? a bug ?
[11:11:26] RECOVERY - DPKG on mw1225 is OK: All packages OK
[11:11:26] PROBLEM - puppet last run on mw1225 is CRITICAL: CRITICAL: Puppet has 111 failures
[11:11:31] _joe_: I could offer to help with the re-imaging if you'd like, I just haven't done it before / hope there are docs :)
[11:11:37] matanya: That's not really an op problem
[11:11:40] but a Wikidata thing
[11:11:53] I was about to say that :)
[11:11:55] ty
[11:12:25] oh, ok. thanks, I thought it was related to the maint crons
[11:12:38] no, that's a job
[11:12:45] so, hoo: open a bug on wikidata?
[11:13:08] PROBLEM - Host mw1243 is DOWN: PING CRITICAL - Packet loss = 100%
[11:13:19] matanya: I'll try to find the cause
[11:13:25] we had this in h
[11:13:26] thanks
[11:13:27] * the past
[11:13:37] was mostly related to hhvm hiccups
[11:13:40] RECOVERY - Host mw1243 is UP: PING OK - Packet loss = 0%, RTA = 1.89 ms
[11:13:57] PROBLEM - HHVM rendering on mw1226 is CRITICAL: Connection refused
[11:14:37] PROBLEM - Apache HTTP on mw1226 is CRITICAL: Connection refused
[11:15:18] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[11:16:46] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 69879 bytes in 4.499 second response time
[11:16:58] RECOVERY - puppet last run on mw1225 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:17:20] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.065 second response time
[11:17:56] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:18:45] (03PS1) 10Matanya: wikitech: lint [puppet] - 10https://gerrit.wikimedia.org/r/175673
[11:20:59] Reedy: What do we need to do for https://phabricator.wikimedia.org/T1254 in mediawiki-config?
[11:21:14] (for Beta) [11:24:34] kart__: You need to confirm your code would work in that way [11:24:55] (03CR) 10Giuseppe Lavagetto: [C: 032] dsh: add new mediawiki appservers [puppet] - 10https://gerrit.wikimedia.org/r/175672 (owner: 10Giuseppe Lavagetto) [11:25:11] matanya: That's probably a left over from these problems [11:25:20] or did that page only appear on that list very recently [11:25:29] "this" = hhvm ? [11:25:44] these [11:26:51] Reedy: ie using shared DB? [11:27:11] matanya: Well, early hhvm job runners were troublesome [11:27:20] but I don't see any recent problems in the error logs [11:27:40] can you recreate this report ? [11:27:40] We never cleaned up after that mess, as it's not usually a big problem [11:27:55] (03CR) 10Alexandros Kosiaris: [C: 032] wikitech: lint [puppet] - 10https://gerrit.wikimedia.org/r/175673 (owner: 10Matanya) [11:31:43] (03PS1) 10Matanya: vm: lint [puppet] - 10https://gerrit.wikimedia.org/r/175676 [11:32:35] PROBLEM - Host ms-be2014 is DOWN: PING CRITICAL - Packet loss = 100% [11:37:23] (03CR) 10Alexandros Kosiaris: [C: 032] vm: lint [puppet] - 10https://gerrit.wikimedia.org/r/175676 (owner: 10Matanya) [11:39:10] (03PS1) 10Yuvipanda: icinga: Remove checks for betalabs and toollabs [puppet] - 10https://gerrit.wikimedia.org/r/175678 [11:39:22] (03PS2) 10Yuvipanda: icinga: Remove checks for betalabs and toollabs [puppet] - 10https://gerrit.wikimedia.org/r/175678 [11:40:25] (03CR) 10Yuvipanda: [C: 032] icinga: Remove checks for betalabs and toollabs [puppet] - 10https://gerrit.wikimedia.org/r/175678 (owner: 10Yuvipanda) [11:40:56] akosiaris: ok with puppet merge on the patch you just merged? [11:42:58] icinga is going to complain soon [11:43:27] ms-be2014 is me [11:43:43] akosiaris: alright, am going to merge. linting change only, looks ok. will force on virt to make sure as well. 
[11:46:16] (03PS1) 10Yuvipanda: icinga: Remove contactgroups for toollabs & betalabs [puppet] - 10https://gerrit.wikimedia.org/r/175679 [11:46:46] (03PS2) 10Yuvipanda: icinga: Remove contactgroups for toollabs & betalabs [puppet] - 10https://gerrit.wikimedia.org/r/175679 [11:47:03] RECOVERY - Host ms-be2014 is UP: PING WARNING - Packet loss = 86%, RTA = 42.91 ms [11:48:25] (03CR) 10Yuvipanda: [C: 032] icinga: Remove contactgroups for toollabs & betalabs [puppet] - 10https://gerrit.wikimedia.org/r/175679 (owner: 10Yuvipanda) [11:49:18] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "One error to fix for sure" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/174694 (owner: 10Giuseppe Lavagetto) [11:52:44] PROBLEM - swift-account-replicator on ms-be2014 is CRITICAL: Connection refused by host [11:53:04] PROBLEM - DPKG on ms-be2014 is CRITICAL: Connection refused by host [11:53:04] PROBLEM - swift-object-replicator on ms-be2014 is CRITICAL: Connection refused by host [11:53:12] PROBLEM - Disk space on ms-be2014 is CRITICAL: Connection refused by host [11:53:23] PROBLEM - swift-account-server on ms-be2014 is CRITICAL: Connection refused by host [11:53:33] PROBLEM - swift-object-updater on ms-be2014 is CRITICAL: Connection refused by host [11:53:53] PROBLEM - RAID on ms-be2014 is CRITICAL: Connection refused by host [11:53:54] PROBLEM - very high load average likely xfs on ms-be2014 is CRITICAL: Connection refused by host [11:53:54] PROBLEM - swift-container-auditor on ms-be2014 is CRITICAL: Connection refused by host [11:54:14] PROBLEM - check configured eth on ms-be2014 is CRITICAL: Connection refused by host [11:54:23] PROBLEM - check if dhclient is running on ms-be2014 is CRITICAL: Connection refused by host [11:54:45] PROBLEM - check if salt-minion is running on ms-be2014 is CRITICAL: Connection refused by host [11:55:06] (03PS1) 10Faidon Liambotis: ocg: simplify module hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/175681 [11:55:08] PROBLEM - puppet last run on ms-be2014 is CRITICAL: Connection refused by host [11:55:22] PROBLEM - swift-object-server on ms-be2014 is CRITICAL: Connection refused by host [11:55:24] PROBLEM - swift-account-auditor on ms-be2014 is CRITICAL: Connection refused by host [11:55:32] PROBLEM - swift-account-reaper on ms-be2014 is CRITICAL: Connection refused by host [11:56:06] PROBLEM - swift-container-server on ms-be2014 is CRITICAL: Connection refused by host [11:56:24] PROBLEM - swift-container-updater on ms-be2014 is CRITICAL: Connection refused by host [11:56:34] PROBLEM - swift-object-auditor on ms-be2014 is CRITICAL: Connection refused by host [11:57:01] PROBLEM - SSH on ms-be2014 is CRITICAL: Connection refused [11:57:53] (03CR) 10Giuseppe Lavagetto: [C: 031] ocg: simplify module hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/175681 (owner: 10Faidon Liambotis) [11:58:36] (03CR) 10Faidon Liambotis: [C: 032] ocg: simplify module hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/175681 (owner: 10Faidon Liambotis) [11:58:46] PROBLEM - swift-container-replicator on ms-be2014 is CRITICAL: Timeout while attempting connection [12:01:31] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "A couple of comments." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/175633 (owner: 10GWicke) [12:02:41] (03PS2) 10Giuseppe Lavagetto: mediawiki: adjust hhvm max threads to number of cpus as well [puppet] - 10https://gerrit.wikimedia.org/r/175424 [12:03:03] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: adjust hhvm max threads to number of cpus as well [puppet] - 10https://gerrit.wikimedia.org/r/175424 (owner: 10Giuseppe Lavagetto) [12:03:12] (03PS1) 10Yuvipanda: contint: Move monitoring into shinken rather than icinga [puppet] - 10https://gerrit.wikimedia.org/r/175682 [12:03:22] (03PS2) 10Yuvipanda: contint: Move monitoring into shinken rather than icinga [puppet] - 10https://gerrit.wikimedia.org/r/175682 [12:04:16] (03PS1) 10Matanya: toollabs: lint [puppet] - 10https://gerrit.wikimedia.org/r/175683 [12:06:07] (03PS2) 10Yuvipanda: toollabs: lint [puppet] - 10https://gerrit.wikimedia.org/r/175683 (owner: 10Matanya) [12:06:13] matanya: :) [12:06:24] be my guest [12:07:17] (03CR) 10Yuvipanda: [C: 032] toollabs: lint [puppet] - 10https://gerrit.wikimedia.org/r/175683 (owner: 10Matanya) [12:07:40] _joe_: ok if I puppet merge hhvm max threads patch? [12:07:46] <_joe_> yes [12:07:51] <_joe_> sorry, I was about to ask the same [12:07:56] done [12:07:59] heh [12:08:36] should perhaps have a lock for 'puppet merge is running already, yo' [12:11:51] RECOVERY - mediawiki-installation DSH group on mw1225 is OK: OK [12:12:29] RECOVERY - mediawiki-installation DSH group on mw1243 is OK: OK [12:12:30] RECOVERY - mediawiki-installation DSH group on mw1226 is OK: OK [12:13:57] YuviPanda: should probably fix the auto-load issue for toollabs [12:14:04] but don't have time now :) [12:14:06] :) [12:14:08] should. [12:14:14] I haven't looked at the toollabs code in a while either [12:14:44] it is basically moving two classes lower in the hierarchy [12:15:08] <_joe_> !log pooling mw1221-mw1226 in the API pool [12:15:15] Logged the message, Master [12:15:32] <_joe_> now more than 25% of api calls are answered by hhvm [12:16:47] woohoo [12:17:20] nice! [12:17:23] http://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [12:17:40] mw1191, mw1192, mw1195, mw1196 are weird [12:17:55] <_joe_> paravoid: they're probably overloaded, I'll take a look [12:18:32] what weight are these hhvm servers? double? [12:18:40] they're just loaded, but I wonder why they have a more stable pattern than the others [12:18:40] <_joe_> no [12:19:08] <_joe_> in the api pool they have the same weight as larger appservers [12:19:11] ok [12:19:46] (03PS3) 10Yuvipanda: contint: Move monitoring into shinken rather than icinga [puppet] - 10https://gerrit.wikimedia.org/r/175682 [12:20:20] <_joe_> so hhvm has precisely 27.6% of the total weight right now [12:20:36] (03CR) 10Yuvipanda: [C: 032] contint: Move monitoring into shinken rather than icinga [puppet] - 10https://gerrit.wikimedia.org/r/175682 (owner: 10Yuvipanda) [12:20:39] YuviPanda: \o/ [12:20:54] <_joe_> :)) [12:21:27] paravoid: there's two more left, waiting for the integration folks (Krinkle) to show up before I remove them. CPU checks done in a weird way, don't think we need them atm. [12:23:37] now I need to patch wikitech before being able to add per-host things. [12:24:56] <_joe_> and guess what? I can now take a look [12:25:25] _joe_: :) this is different though - hostgroup definitions based on roles applied via LDAP.
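A minimal sketch of the lock suggested above at 12:08:36, using flock(1); the lock path and the surrounding script are hypothetical, not the actual puppet-merge tooling:

    #!/bin/bash
    # Take an exclusive, non-blocking lock so a second concurrent puppet
    # merge fails fast instead of racing the first one.
    lockfile=/var/lock/puppet-merge.lock
    exec 9>"$lockfile"
    if ! flock -n 9; then
        echo "puppet merge is running already, yo" >&2
        exit 1
    fi
    # ... fetch, show the diff, confirm, and merge here; the kernel releases
    # the lock automatically when the script exits and fd 9 closes ...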
[12:25:30] but yes, do take a look there too [12:25:57] <_joe_> oh that's a common problem we need to solve [12:26:02] <_joe_> in prod and in labs [12:26:21] <_joe_> that is - inferring globals from the roles we've declared in a node [12:26:32] so 15 new servers in the api pool? [12:26:38] <_joe_> mark: yes [12:27:14] <_joe_> because the API pool was quite in need of some steam power [12:27:28] it was [12:28:00] are these servers still configured identically? [12:28:01] PROBLEM - NTP on ms-be2014 is CRITICAL: NTP CRITICAL: No response from NTP server [12:28:03] other than their label [12:28:10] so can we move between the pools easily in emergencies? [12:28:22] <_joe_> yes, we need one puppet commit though [12:28:26] <_joe_> for the lvs ip [12:28:41] <_joe_> I was thinking of actually configuring both IPs on all servers [12:28:43] why not just add it [12:28:44] yeah [12:28:52] it doesn't harm at all [12:29:15] assuming they don't talk to the lvs ips i guess [12:29:17] <_joe_> I agree, I don't think we have any other difference besides the system::role [12:29:21] otherwise they'll talk to themselves [12:29:38] <_joe_> right [12:29:43] <_joe_> and I guess they do [12:29:48] yes that's kind of dangerous [12:29:53] (03PS1) 10Yuvipanda: shinken: Move alert notification to ml from icinga [puppet] - 10https://gerrit.wikimedia.org/r/175685 [12:29:57] <_joe_> meh [12:30:01] (03CR) 10jenkins-bot: [V: 04-1] shinken: Move alert notification to ml from icinga [puppet] - 10https://gerrit.wikimedia.org/r/175685 (owner: 10Yuvipanda) [12:30:07] appservers requesting something from the API perhaps [12:30:15] (03PS2) 10Yuvipanda: shinken: Move alert notification to ml from icinga [puppet] - 10https://gerrit.wikimedia.org/r/175685 [12:30:20] <_joe_> that was actually desirable with the hhvm pool [12:30:42] we could do policy routing [12:30:44] * paravoid ducks [12:31:05] hmm, theoretically they should just use the 'fake' API bits instead of hitting the HTTP endpoints (internal code calling API) [12:31:19] (03CR) 10Yuvipanda: [C: 032] shinken: Move alert notification to ml from icinga [puppet] - 10https://gerrit.wikimedia.org/r/175685 (owner: 10Yuvipanda) [12:31:27] RECOVERY - mediawiki-installation DSH group on mw1221 is OK: OK [12:31:54] RECOVERY - mediawiki-installation DSH group on mw1222 is OK: OK [12:31:55] <_joe_> YuviPanda: I'm sure our dba is really happy about that [12:32:05] heh [12:32:25] RECOVERY - mediawiki-installation DSH group on mw1223 is OK: OK [12:33:04] RECOVERY - mediawiki-installation DSH group on mw1224 is OK: OK [12:36:56] _joe_: it's not used *that* much, though. Only usage I remember was for doing something edit-related (EditPage.php was... not well factored, at that time, at least) [12:42:46] (03PS4) 10Giuseppe Lavagetto: deployment: make scap proxies configured in one place [puppet] - 10https://gerrit.wikimedia.org/r/174664 [13:05:43] when is the hhvm merge _joe_? tonight? [13:17:39] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.001 second response time [13:23:30] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.017 second response time [13:39:15] <_joe_> mark: this evening, when or.i wakes up [13:39:21] ok [13:45:18] hoo: aude Just noticed today's deploy is still in the late window...
[13:45:52] which should be fine [13:45:58] I'm going to be in the office then [13:46:09] also it should always be there [13:46:11] IMO [13:46:11] (03PS1) 10Reedy: Non Wikipedias to 1.25wmf9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175692 [13:46:17] always? [13:46:21] * hoo in a meeting atm [13:46:22] We moved it for you guys [13:46:24] yep, always [13:46:24] lol [13:46:32] well, that was stupid [13:46:41] see my email and greg's response [13:54:22] PROBLEM - Host ms-be2014 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:33] RECOVERY - Host ms-be2014 is UP: PING OK - Packet loss = 0%, RTA = 44.07 ms [14:02:28] I don't understand why that went out, the host is in scheduled downtime, https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ms-be1014 [14:02:44] ... 1014 [14:03:13] http://i.imgur.com/eezCO.gif [14:06:32] (03CR) 10Hashar: "+1 thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/175682 (owner: 10Yuvipanda) [14:08:15] (03CR) 10Hashar: [C: 031] puppetmaster: Add script to track upstream changes [puppet] - 10https://gerrit.wikimedia.org/r/175663 (owner: 10Yuvipanda) [14:18:44] RECOVERY - SSH on ms-be2014 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [14:44:08] <_joe_> godog: what was that manifest where using require_packages we had to remove one dependency? [14:44:41] <_joe_> I *may* have thought of a way to circumvent that [14:46:15] _joe_: txstatsd is missing a dependency on graphite-carbon IIRC, however that's also included in graphite manifest so the former works only together with the latter [14:46:33] <_joe_> ok [14:46:43] <_joe_> but well I will give it a shot [14:56:12] RECOVERY - DPKG on ms-be2014 is OK: All packages OK [14:56:18] RECOVERY - swift-object-auditor on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [14:56:19] RECOVERY - swift-container-updater on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [14:56:19] RECOVERY - swift-object-updater on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [14:56:21] RECOVERY - swift-object-replicator on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:56:21] RECOVERY - swift-container-auditor on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:56:53] RECOVERY - very high load average likely xfs on ms-be2014 is OK: OK - load average: 0.59, 0.98, 0.99 [14:57:00] RECOVERY - swift-account-auditor on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [14:57:00] RECOVERY - Disk space on ms-be2014 is OK: DISK OK [14:57:12] RECOVERY - swift-account-reaper on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [14:57:32] RECOVERY - swift-account-server on ms-be2014 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [14:57:38] RECOVERY - swift-container-replicator on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [14:57:52] RECOVERY - RAID on ms-be2014 is OK: OK: optimal, 14 logical, 14 physical [14:58:10] RECOVERY - swift-object-server on ms-be2014 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:58:12] RECOVERY - swift-account-replicator on ms-be2014 is OK: PROCS OK: 1 process 
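A sketch of making the dependency godog describes explicit in Puppet, so txstatsd stops relying on the graphite manifest happening to install graphite-carbon first; the resource declarations are assumed, not copied from the real manifests:

    # Declare the missing dependency directly on the txstatsd package.
    package { 'graphite-carbon':
        ensure => present,
    }
    package { 'txstatsd':
        ensure  => present,
        require => Package['graphite-carbon'],
    }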
with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [14:58:21] RECOVERY - check configured eth on ms-be2014 is OK: NRPE: Unable to read output [14:58:33] RECOVERY - check if dhclient is running on ms-be2014 is OK: PROCS OK: 0 processes with command name dhclient [14:58:33] RECOVERY - swift-container-server on ms-be2014 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [14:59:02] RECOVERY - puppet last run on ms-be2014 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:00:34] RECOVERY - NTP on ms-be2014 is OK: NTP OK: Offset -0.01997005939 secs [15:03:03] !log upload bcache-tools 1.0.7-1 to carbon [15:03:06] Logged the message, Master [15:07:27] (03PS2) 10Giuseppe Lavagetto: hiera: a few tweaks [puppet] - 10https://gerrit.wikimedia.org/r/174694 [15:22:27] (03CR) 10Giuseppe Lavagetto: [C: 032] hiera: a few tweaks [puppet] - 10https://gerrit.wikimedia.org/r/174694 (owner: 10Giuseppe Lavagetto) [15:23:20] <_joe_> I am going to merge this, a few puppet failures may happen while puppetmasters get puppet applied themselves [15:32:57] (03PS5) 10Giuseppe Lavagetto: deployment: make scap proxies configured in one place [puppet] - 10https://gerrit.wikimedia.org/r/174664 [15:35:12] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=325 [critical =325] [15:35:13] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=325 [critical =325] [15:37:30] anyone from ops wanna +1 / CR https://gerrit.wikimedia.org/r/175663? I'm slightly curious about the naming / placement [15:37:50] <_joe_> what is this db alert? [15:37:53] (03Abandoned) 10Yuvipanda: icinga: Scream in -operations too when betacluster has issues [puppet] - 10https://gerrit.wikimedia.org/r/174430 (owner: 10Yuvipanda) [15:37:58] (03CR) 10Giuseppe Lavagetto: [C: 032] "http://puppet-compiler.wmflabs.org/532/change/174664/html/" [puppet] - 10https://gerrit.wikimedia.org/r/174664 (owner: 10Giuseppe Lavagetto) [15:38:43] <_joe_> YuviPanda: what does this script do? [15:39:07] _joe_: updates local puppetmasters, rebasing when required and stashing/popping when required [15:39:16] is currently in beta, just generalizing [15:39:26] <_joe_> ok seems like a good idea [15:39:39] <_joe_> I will take a look soon-ish [15:39:43] ty [15:39:52] <_joe_> but first, let me merge my change and brew coffee! [15:40:13] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=327 [critical =325] [15:40:14] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=327 [critical =325] [15:40:45] heh [15:41:25] <_joe_> anyone knows what those ^^ are about? [15:42:06] (03CR) 10BryanDavis: "A few comments inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/175663 (owner: 10Yuvipanda) [15:44:05] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: puppet fail [15:45:06] bd808: hmm, I'm wondering how to stagger it from puppet without writing a wraparound script. 
[15:45:16] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=329 [critical =325] [15:45:18] I could put the stagger in the script itself, but that seems a bit odd/wrong [15:45:20] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=329 [critical =325] [15:45:46] YuviPanda: There is a puppet function that computes a number based on a hash of the host name. [15:45:53] I don't remember what it's called though [15:46:31] YuviPanda: https://docs.puppetlabs.com/references/latest/function.html#fqdnrand [15:46:41] oh, nice! [15:48:33] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: puppet fail [15:48:56] (03PS4) 10Yuvipanda: puppetmaster: Add script to track upstream changes [puppet] - 10https://gerrit.wikimedia.org/r/175663 [15:49:23] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: puppet fail [15:49:33] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail [15:49:54] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: puppet fail [15:50:04] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: puppet fail [15:50:13] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=330 [critical =325] [15:50:14] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=330 [critical =325] [15:50:19] bd808: ^^ [15:50:37] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: puppet fail [15:50:52] _joe_: ^ related? [15:51:07] YuviPanda: You moved it into the client block rather than the server block :) [15:51:15] ... [15:51:22] I should eat food and walk outside for a minute. [15:51:29] manybubbles: want to handle itamar's mail ? [15:51:33] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail [15:52:09] matanya: I can reply, yeah [15:52:13] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: puppet fail [15:52:17] (03PS5) 10Yuvipanda: puppetmaster: Add script to track upstream changes [puppet] - 10https://gerrit.wikimedia.org/r/175663 [15:52:53] bd808: but updated ^ :) [15:53:09] bd808: should also remove the role from deployment-salt and enable the variable, I guess. I'll take care of that after merging [15:53:41] YuviPanda: test it in beta with a cherry-pick! [15:54:03] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: puppet fail [15:54:14] bd808: already tested in shinken with a cherry-pick :) but there the client and server were the same so didn't catch that. 
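With the fqdn_rand() function bd808 links to, the staggered cron looks roughly like this; the resource and script names are illustrative:

    # Each host gets a stable pseudo-random minute in 0-29 derived from its
    # FQDN, so the sync does not fire everywhere at the same moment.
    cron { 'puppet-git-sync':
        ensure  => present,
        user    => 'root',
        command => '/usr/local/bin/git-sync-upstream',
        minute  => fqdn_rand(30),
    }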
[15:54:39] (03CR) 10BryanDavis: [C: 031] puppetmaster: Add script to track upstream changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/175663 (owner: 10Yuvipanda) [15:55:05] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CRITICAL: puppet fail [15:55:13] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=332 [critical =325] [15:55:14] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=332 [critical =325] [15:55:23] (03PS6) 10Yuvipanda: puppetmaster: Add script to track upstream changes [puppet] - 10https://gerrit.wikimedia.org/r/175663 [15:56:04] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: puppet fail [15:56:49] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: puppet fail [15:57:33] (03CR) 10Yuvipanda: [C: 032] puppetmaster: Add script to track upstream changes [puppet] - 10https://gerrit.wikimedia.org/r/175663 (owner: 10Yuvipanda) [15:58:33] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: puppet fail [15:58:53] <_joe_> godog: I'll take a look [15:59:37] <_joe_> but yeah I screwed up codfw :/ [15:59:40] <_joe_> damn [15:59:45] <_joe_> ok, repairing that [16:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141125T1600). Please do the needful. [16:00:15] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=333 [critical =325] [16:00:16] <_joe_> godog: it's failing at least, not anything worse [16:00:18] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=333 [critical =325] [16:00:38] bd808: hmm, was there a role::beta::puppetmaster? I can't find that in puppet, but is selected in ldap [16:02:41] YuviPanda: I'm sure there was at one time. I'm not seeing it now either. [16:02:56] bd808: hmm, git log -S doesn't find one either [16:03:02] anyway, unpicking. 
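For reference, the pickaxe search YuviPanda ran looks like this; -S finds commits that ever added or removed the given string:

    git log -S 'role::beta::puppetmaster' --oneline --all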
[16:05:13] _joe_: no worries, I wasn't concerned :) [16:05:14] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=335 [critical =325] [16:05:24] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=335 [critical =325] [16:05:46] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: puppet fail [16:09:14] (03PS1) 10Giuseppe Lavagetto: swift_new: reorganize hieradata [puppet] - 10https://gerrit.wikimedia.org/r/175714 [16:09:47] <_joe_> godog: ^^ this should fix it [16:09:58] (03PS2) 10Giuseppe Lavagetto: swift_new: reorganize hieradata [puppet] - 10https://gerrit.wikimedia.org/r/175714 [16:10:08] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=336 [critical =325] [16:10:08] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=336 [critical =325] [16:10:13] (03CR) 10Giuseppe Lavagetto: [C: 032] swift_new: reorganize hieradata [puppet] - 10https://gerrit.wikimedia.org/r/175714 (owner: 10Giuseppe Lavagetto) [16:10:39] (03CR) 10Giuseppe Lavagetto: [V: 032] swift_new: reorganize hieradata [puppet] - 10https://gerrit.wikimedia.org/r/175714 (owner: 10Giuseppe Lavagetto) [16:13:03] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:13:35] <_joe_> they call me wolf, I solve problems I've caused [16:14:01] <_joe_> ok, so. Alea iacta est. [16:14:09] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:14:10] homo homini lupus [16:14:23] <_joe_> !log pooling mw1237-1258 in the appserver pool [16:14:30] Logged the message, Master [16:15:09] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=338 [critical =325] [16:15:16] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=338 [critical =325] [16:15:38] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:16:10] (03PS1) 10Yuvipanda: shinken: Cleanup to be better compatible with autoload layout [puppet] - 10https://gerrit.wikimedia.org/r/175717 [16:16:39] (03PS2) 10Yuvipanda: shinken: Cleanup to be better compatible with autoload layout [puppet] - 10https://gerrit.wikimedia.org/r/175717 [16:17:11] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: puppet fail [16:17:36] (03PS3) 10Yuvipanda: shinken: Cleanup to be better compatible with autoload layout [puppet] - 10https://gerrit.wikimedia.org/r/175717 [16:17:38] <_joe_> ori-: that means something like "the dice is to be rolled", that's something like "now we need to play it along". It was supposedly said by Julius Caesar before crossing the Rubicon [16:18:09] i know :) [16:18:23] <_joe_> oh ok [16:19:09] <_joe_> I kinda expected it :) [16:19:48] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:20:08] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=340 [critical =325] [16:20:09] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=340 [critical =325] [16:20:10] <_joe_> ori-: ok all new servers are pooled. Let's remove the beta feature, or merge the pools first? 
[16:20:31] beta feature, i think [16:20:47] (03CR) 10Faidon Liambotis: [C: 032] setting scs-c8-codfw [dns] - 10https://gerrit.wikimedia.org/r/175549 (owner: 10RobH) [16:20:51] <_joe_> +1 [16:20:54] is there any way we will be able to tell whether the thing hit an hhvm or zend backend? [16:20:59] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:21:00] <_joe_> I'll leave that to you [16:21:01] Or just by hostname [16:21:08] Anybody doing swat this morning? [16:21:13] <_joe_> hoo: yes, X-analytics, and X-Powered-By headers [16:21:22] _joe_: Ok, that sounds ok [16:21:31] just need that sometimes :) [16:21:36] <_joe_> hoo: X-analytics gets php=zend [16:21:41] <_joe_> or php=hhvm [16:21:44] <_joe_> yes of course [16:21:57] <_joe_> hoo: but in a couple of weeks (say 3) [16:22:03] <_joe_> that would probably become redundant [16:22:14] will we be all hhvm by then? [16:22:17] (03PS4) 10Yuvipanda: shinken: Cleanup to be better compatible with autoload layout [puppet] - 10https://gerrit.wikimedia.org/r/175717 [16:22:21] * hoo has lost track of the timeline [16:22:37] (03PS1) 10Mark Bergsma: Allocate frack management subnets [dns] - 10https://gerrit.wikimedia.org/r/175718 [16:22:39] <_joe_> hoo: the timeline from now on is "convert as fast as reasonable" [16:22:51] awesome [16:22:53] <_joe_> hoo: we are removing the separate pools today [16:23:09] yeah, saw that briefly in gerrit [16:23:16] Krinkle|detached: for what it's worth, I can log in to jenkins now (only a "few" hours later :) ) [16:23:34] (03PS1) 10Ori.livneh: Remove HHVM beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175719 [16:23:48] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:24:00] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove HHVM beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175719 (owner: 10Ori.livneh) [16:24:08] (03CR) 10Ori.livneh: [C: 032] Remove HHVM beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175719 (owner: 10Ori.livneh) [16:24:15] (03Merged) 10jenkins-bot: Remove HHVM beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175719 (owner: 10Ori.livneh) [16:24:20] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:24:38] (03PS2) 10Giuseppe Lavagetto: varnish: remove redirection to the hhvm pool [puppet] - 10https://gerrit.wikimedia.org/r/175432 [16:24:57] !log ori Synchronized wmf-config/InitialiseSettings.php: Ibd888465: Remove HHVM beta feature (duration: 00m 05s) [16:24:59] Logged the message, Master [16:25:09] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=341 [critical =325] [16:25:10] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=341 [critical =325] [16:25:13] <_joe_> ok, going on with merging the traffic [16:25:28] +1 [16:25:41] +2 [16:26:06] <_joe_> I am going the slow and secure path as always with varnish changes [16:26:16] (03CR) 10Giuseppe Lavagetto: [C: 032] varnish: remove redirection to the hhvm pool [puppet] - 10https://gerrit.wikimedia.org/r/175432 (owner: 10Giuseppe Lavagetto) [16:26:19] how unexciting [16:26:25] <_joe_> eheh [16:26:37] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:26:55] RECOVERY - puppet last
run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:00] :) [16:27:10] (03PS1) 10Cmjohnson: AAdding user west1 and add to analytics-privatedata-users group per RT8896 [puppet] - 10https://gerrit.wikimedia.org/r/175721 [16:27:17] (03CR) 10Yuvipanda: [C: 032] shinken: Cleanup to be better compatible with autoload layout [puppet] - 10https://gerrit.wikimedia.org/r/175717 (owner: 10Yuvipanda) [16:27:41] _joe_: ok to merge yours? [16:27:54] (03CR) 10jenkins-bot: [V: 04-1] AAdding user west1 and add to analytics-privatedata-users group per RT8896 [puppet] - 10https://gerrit.wikimedia.org/r/175721 (owner: 10Cmjohnson) [16:27:54] <_joe_> YuviPanda: no man wait 1 minute [16:28:00] ok [16:28:08] _joe_: feel free to just merge mine as well whenever you merge [16:28:31] (03PS2) 10Cmjohnson: AAdding user west1 and add to analytics-privatedata-users group per RT8896 [puppet] - 10https://gerrit.wikimedia.org/r/175721 [16:28:41] <_joe_> YuviPanda: in 1 min [16:28:42] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:29:21] (03CR) 10jenkins-bot: [V: 04-1] AAdding user west1 and add to analytics-privatedata-users group per RT8896 [puppet] - 10https://gerrit.wikimedia.org/r/175721 (owner: 10Cmjohnson) [16:29:21] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:40] <_joe_> YuviPanda: done [16:29:47] cool, thanks _joe_ [16:30:10] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=343 [critical =325] [16:30:15] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=343 [critical =325] [16:30:20] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:31:02] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:31:17] <_joe_> ori-: we may want to remove or set to zero the hhvm percent thing as well [16:31:28] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:31:52] RECOVERY - puppet last run on ms-be2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:32:01] _joe_: yes, afterwards, though [16:32:28] <_joe_> bits is ok, now moving to mobile [16:33:18] Will I step on anyone's toes if I sync a no-op config change for beta? [16:33:39] marktraceur, ^d: Did anyone SWAT? [16:33:49] anomie: nope [16:33:51] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:33:54] * anomie grumbles [16:34:01] bd808: Go ahead. Then I'll SWAT what's left. [16:34:06] Uhh...no [16:34:21] Well, if csteipp shows up. 
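Following _joe_'s note above about the X-Analytics and X-Powered-By headers, a quick way to check which backend served a request is to inspect the response headers; the output shown is illustrative, not a captured response:

    $ curl -s -o /dev/null -D - https://en.wikipedia.org/wiki/Special:BlankPage \
        | grep -iE 'x-analytics|x-powered-by'
    X-Powered-By: HHVM/3.3.0     (illustrative)
    X-Analytics: php=hhvm        (illustrative; php=zend on Zend-served requests)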
[16:34:26] <^d> #distracted [16:34:30] anomie: cool beans [16:34:36] (03PS2) 10BryanDavis: beta: Use password to connect to logstash redis server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175640 [16:34:45] (03CR) 10BryanDavis: [C: 032] beta: Use password to connect to logstash redis server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175640 (owner: 10BryanDavis) [16:35:15] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=344 [critical =325] [16:35:16] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=344 [critical =325] [16:35:39] (03Merged) 10jenkins-bot: beta: Use password to connect to logstash redis server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175640 (owner: 10BryanDavis) [16:36:03] (03Abandoned) 10Cmjohnson: AAdding user west1 and add to analytics-privatedata-users group per RT8896 [puppet] - 10https://gerrit.wikimedia.org/r/175721 (owner: 10Cmjohnson) [16:36:36] !log bd808 Synchronized wmf-config/logging-labs.php: Update labs logging config (Iaab0047) (duration: 00m 06s) [16:36:40] Logged the message, Master [16:36:55] anomie: {{done}}. All yours [16:37:04] bd808: Ok [16:38:46] (03PS1) 1001tonythomas: Fix incorrect beta MX hostname [puppet] - 10https://gerrit.wikimedia.org/r/175728 [16:39:34] <_joe_> ori-: in ~ 20 minutes we should see all traffic drained away from the hhvm pool [16:39:51] * ori- nods [16:39:51] <_joe_> I'd start move servers afterwards [16:40:10] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=346 [critical =325] [16:40:11] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=346 [critical =325] [16:40:23] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago [16:40:32] <_joe_> mh [16:40:35] (03CR) 10Jgreen: [C: 032 V: 031] Fix incorrect beta MX hostname [puppet] - 10https://gerrit.wikimedia.org/r/175728 (owner: 1001tonythomas) [16:41:24] (03PS1) 10BryanDavis: beta: Fix typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175729 [16:43:02] anomie: Can I have a do over? 
[16:43:09] bd808: Go ahead [16:43:10] (03PS1) 10Filippo Giunchedi: codfw-prod: add ms-be2014 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/175730 [16:43:12] (03PS1) 10Filippo Giunchedi: first generate the rings, then the dumps [software/swift-ring] - 10https://gerrit.wikimedia.org/r/175731 [16:43:35] (03CR) 10BryanDavis: [C: 032] beta: Fix typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175729 (owner: 10BryanDavis) [16:43:42] (03Merged) 10jenkins-bot: beta: Fix typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175729 (owner: 10BryanDavis) [16:44:18] !log bd808 Synchronized wmf-config/logging-labs.php: Update labs logging config (Ib8d8f8e) (duration: 00m 06s) [16:44:21] Logged the message, Master [16:44:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] codfw-prod: add ms-be2014 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/175730 (owner: 10Filippo Giunchedi) [16:44:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] first generate the rings, then the dumps [software/swift-ring] - 10https://gerrit.wikimedia.org/r/175731 (owner: 10Filippo Giunchedi) [16:45:10] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=348 [critical =325] [16:45:14] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=348 [critical =325] [16:45:22] anomie: {{done}} again. Hopefully actually {{done}} this time [16:45:50] Well, csteipp still isn't here. But I can check his patches, so let's do them anyway. [16:50:15] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=349 [critical =325] [16:50:23] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=349 [critical =325] [16:51:44] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:52:00] (03PS1) 10BryanDavis: beta: Pass global into closure for redis connection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175733 [16:52:49] anomie: Let me know when you're done. I have yet another stupid bug in my monolog config for beta. [16:53:04] * bd808 does not like this kind of test cycle [16:53:11] !log anomie Synchronized php-1.25wmf9/includes: SWAT: Make calling wfMangleFlashPolicy configurable [[gerrit:175598]] (duration: 00m 09s) [16:53:13] anomie: ^ Test please [16:53:14] Logged the message, Master [16:53:20] anomie: Looks good [16:54:51] bd808: Should be about 4-5 minutes [16:55:08] (03PS1) 10Ori.livneh: wgPercentHHVM => 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175735 [16:55:08] *nod* [16:55:10] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=351 [critical =325] [16:55:11] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=351 [critical =325] [16:55:12] hmm, beta::puppetmaster::sync was applied to deployment-mx?! [16:55:13] whyyy [16:55:15] * YuviPanda goes to remove [16:55:29] (03PS2) 10Ori.livneh: wgPercentHHVM => 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175735 [16:55:38] YuviPanda: Because it has (had?) its own puppet server [16:55:44] oh? [16:55:48] <_joe_> ori-: it's not drained completely btw [16:55:55] YuviPanda: It was for tonythomas while he was hacking stuff [16:55:59] aaah [16:56:06] yeah, I see it is now.
[16:56:26] I think I have other hosts in other projects with the role applied too [16:56:39] well, nothing else has errored so far :) [16:56:47] * YuviPanda got shinken alerts [16:57:05] _joe_: when you have a moment, I'd like your input on https://gerrit.wikimedia.org/r/#/c/175633/ [16:57:47] <_joe_> gwicke: you have it already :) [16:57:51] YuviPanda: bd808-vagrant and logstash-deploy are both using it. [16:58:07] bd808-vagrant in deployment-prep? [16:58:09] tch tch :) [16:58:16] I'll fix both [16:58:30] YuviPanda: Nope in wikimania-support and logstash respectively [16:58:39] oh? [16:58:56] hmm, I could add myself and fix it. [16:58:59] I can fix them later today. no worries [16:59:04] cool [16:59:15] ty bd808 [16:59:27] YuviPanda: ty for making things better ;) [16:59:32] bd808: :) [16:59:53] csteipp: You're late for SWAT. But I decided to take over testing your patches for you. [17:00:16] anomie: Thank you! I just realized what time it was... really sorry about that [17:00:18] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=352 [critical =325] [17:00:21] !log anomie Synchronized php-1.25wmf9/includes/api: SWAT: API: Work around wfMangleFlashPolicy() [[gerrit:175596]] (duration: 00m 06s) [17:00:21] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=352 [critical =325] [17:00:22] anomie: ^ Test please [17:00:23] Logged the message, Master [17:00:42] anomie: Looks good [17:00:47] bd808: You're good to go [17:00:57] anomie: cool. thanks [17:00:59] (03CR) 10Giuseppe Lavagetto: [C: 031] wgPercentHHVM => 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175735 (owner: 10Ori.livneh) [17:01:42] anomie: Before I sync yet again to find out I did yet another dumb thing, would you look at -- https://gerrit.wikimedia.org/r/#/c/175733/1/wmf-config/logging-labs.php,unified [17:01:50] * anomie looks [17:01:51] _joe_: sorry, hadn't looked carefully [17:02:32] <_joe_> gwicke: np [17:03:02] bd808: I might just do "global $wmgLogstashPassword" inside the function, but that looks sane. [17:03:54] anomie: Cool. I think I like passing rather than reaching out for the global, but it probably really doesn't matter. [17:03:55] (03CR) 10Ori.livneh: [C: 032] wgPercentHHVM => 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175735 (owner: 10Ori.livneh) [17:03:57] (03CR) 10GWicke: Include and configure the restbase role on the test cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/175633 (owner: 10GWicke) [17:04:03] (03Merged) 10jenkins-bot: wgPercentHHVM => 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175735 (owner: 10Ori.livneh) [17:04:14] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=354 [critical =325] [17:04:16] (03PS1) 10Yuvipanda: icinga: Remove CPU alerts for contint from icinga [puppet] - 10https://gerrit.wikimedia.org/r/175737 [17:04:17] bd808: It only matters if something includes that file inside of a function while the global is actually global [17:04:27] 0? [17:04:32] what's going on? [17:04:39] logstash password? for kibana i guess? [17:04:48] !log ori Synchronized wmf-config/CommonSettings.php: $wgPercentHHVM = 0 (duration: 00m 05s) [17:04:50] Logged the message, Master [17:04:54] anomie: True.
Our crazy config is crazy [17:05:13] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=354 [critical =325] [17:05:14] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=354 [critical =325] [17:05:17] <_joe_> ok so, last stage of today's migration. [17:05:23] (03CR) 10Yuvipanda: [C: 032] icinga: Remove CPU alerts for contint from icinga [puppet] - 10https://gerrit.wikimedia.org/r/175737 (owner: 10Yuvipanda) [17:05:34] jgage: Yeah. It's the password for the redis instances that are attached to the logstash cluster [17:05:38] paravoid: app server pools merging [17:06:02] jgage: I'm setting up the beta servers to log to logstash via a redis list [17:06:03] <_joe_> oh my old patch has a problem though [17:06:03] bd808 ah, redis. gotcha. [17:06:09] cool [17:06:34] (03PS2) 10BryanDavis: beta: Pass global into closure for redis connection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175733 [17:06:41] (03CR) 10BryanDavis: [C: 032] beta: Pass global into closure for redis connection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175733 (owner: 10BryanDavis) [17:06:50] (03Merged) 10jenkins-bot: beta: Pass global into closure for redis connection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175733 (owner: 10BryanDavis) [17:07:27] !log bd808 Synchronized wmf-config/logging-labs.php: Update labs logging config (I1843dfd) (duration: 00m 06s) [17:07:30] Logged the message, Master [17:09:00] (03CR) 10GWicke: "@Giuseppe: Do you want me to change anything about this patch?" [puppet] - 10https://gerrit.wikimedia.org/r/175633 (owner: 10GWicke) [17:09:05] (03PS1) 10Yuvipanda: shinken: I heard you like monitoring [puppet] - 10https://gerrit.wikimedia.org/r/175739 [17:10:14] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=355 [critical =325] [17:10:15] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=355 [critical =325] [17:10:32] _joe_: ? 
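The two options anomie and bd808 weigh above look roughly like this; RedisHandler stands in for whatever handler the beta logging closure actually constructs, so treat this as a sketch rather than the merged code:

    <?php
    // Option 1: reach out for the global from inside the closure.
    $factory = function () {
        global $wmgLogstashPassword;
        return new RedisHandler( $wmgLogstashPassword ); // hypothetical handler
    };

    // Option 2 (what the patch does): bind the value in at definition time.
    // Per anomie's caveat, this requires $wmgLogstashPassword to be in scope
    // here, which breaks if the file is ever included from inside a function.
    $factory = function () use ( $wmgLogstashPassword ) {
        return new RedisHandler( $wmgLogstashPassword ); // hypothetical handler
    };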
[17:11:01] <_joe_> ori-: I meant "the next step as I coded it in puppet" [17:11:26] ah [17:12:22] (03PS1) 10Giuseppe Lavagetto: mediawiki: move most servers from the hhvm to the standard pool [puppet] - 10https://gerrit.wikimedia.org/r/175745 [17:13:12] (03CR) 10Ori.livneh: [C: 031] mediawiki: move most servers from the hhvm to the standard pool [puppet] - 10https://gerrit.wikimedia.org/r/175745 (owner: 10Giuseppe Lavagetto) [17:13:14] <_joe_> !log depooled mw1019-1032 from the hhvm pool [17:13:18] Logged the message, Master [17:14:07] (03CR) 10Yuvipanda: [C: 032] shinken: I heard you like monitoring [puppet] - 10https://gerrit.wikimedia.org/r/175739 (owner: 10Yuvipanda) [17:14:29] (03PS2) 10Giuseppe Lavagetto: mediawiki: move most servers from the hhvm to the standard pool [puppet] - 10https://gerrit.wikimedia.org/r/175745 [17:15:11] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=357 [critical =325] [17:15:12] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=357 [critical =325] [17:16:08] (03PS3) 10Giuseppe Lavagetto: mediawiki: move most servers from the hhvm to the standard pool [puppet] - 10https://gerrit.wikimedia.org/r/175745 [17:16:27] (03PS4) 10Giuseppe Lavagetto: mediawiki: move most servers from the hhvm to the standard pool [puppet] - 10https://gerrit.wikimedia.org/r/175745 [17:16:41] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: move most servers from the hhvm to the standard pool [puppet] - 10https://gerrit.wikimedia.org/r/175745 (owner: 10Giuseppe Lavagetto) [17:20:14] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=358 [critical =325] [17:20:16] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=358 [critical =325] [17:25:14] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=358 [critical =325] [17:25:15] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=358 [critical =325] [17:30:00] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: Puppet has 105 failures [17:30:09] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=358 [critical =325] [17:30:10] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=358 [critical =325] [17:30:48] PROBLEM - swift-account-replicator on ms-be2013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:35:13] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=359 [critical =325] [17:35:15] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=359 [critical =325] [17:36:48] (03PS1) 10Reedy: Add interwiki-labs.cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175755 [17:38:04] <^d> Reedy: I hate that we check cdb files into git :\ [17:38:18] Mmm :/ [17:39:29] ^d: Open to ideas [17:39:42] <^d> Not checking cdb files into git? 
[17:40:15] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=361 [critical =325] [17:40:16] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=361 [critical =325] [17:40:51] <^d> Reedy: We could just build it on tin & sync the file like we do with the wikiversions.cdb? [17:40:51] ^d: Leave them just on disk? .gitignore'd or something? [17:40:55] <^d> Yeah [17:41:39] Guess there's probably no reason we couldn't [17:42:20] <_joe_> !log repooled mw1019-1032,mw1053 in the appservers pool [17:42:24] Logged the message, Master [17:43:53] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:44:32] RECOVERY - swift-account-replicator on ms-be2013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:45:16] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=363 [critical =325] [17:45:19] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=363 [critical =325] [17:45:30] ^d: https://phabricator.wikimedia.org/T75905 [17:46:16] <_joe_> 32% of all appserver traffic is now on HHVM [17:46:48] <^d> Reedy: awesome thx [17:46:50] _joe_: :) :) [17:46:55] <_joe_> ori: :)) [17:47:06] _joe_: you rock! [17:47:13] have dinner! :P [17:47:50] <_joe_> you too, at some moments I honestly didn't think we could get to this point [17:48:10] <_joe_> and yes, I am going to take a break now [17:48:23] <_joe_> the lvs cleaning change needs some more love and will be merged tomorrow [17:48:36] <_joe_> no harm in having 3 servers in the empty pool for tonight [17:48:42] definitely not [17:49:19] good job guys :-) [17:49:41] not only is 32% of all appserver traffic now on HHVM [17:49:49] it also looks like all current HHVM servers /could/ run the site ;) [17:49:57] <_joe_> ori: http://bit.ly/1kMcqsQ [17:49:59] \o/ [17:50:09] _joe_: heh [17:50:13] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=364 [critical =325] [17:50:23] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=364 [critical =325] [17:50:54] <_joe_> mark: well, yes it looks like that could be possible [17:55:14] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=366 [critical =325] [17:55:21] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=366 [critical =325] [18:00:04] maxsem, kaldari: Respected human, time to deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141125T1800). Please do the needful.
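A sketch of the build-on-tin idea from the cdb discussion above, using the wikimedia/cdb library's writer; the target path and the $interwikiData input are assumptions:

    <?php
    // Build the cdb at deploy time instead of committing the binary to git,
    // then sync the generated file out the way wikiversions.cdb is synced.
    use Cdb\Writer;

    $writer = Writer::open( '/srv/mediawiki/interwiki-labs.cdb' );
    foreach ( $interwikiData as $key => $value ) { // assumed prefix => target map
        $writer->set( $key, $value );
    }
    $writer->close();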
[18:00:14] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [18:00:33] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [18:00:55] nothing to deploy that I know of [18:05:04] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=369 [critical =325] [18:05:05] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=369 [critical =325] [18:05:14] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet has 3 failures [18:10:13] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=370 [critical =325] [18:10:14] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=370 [critical =325] [18:10:16] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: Puppet has 3 failures [18:10:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet has 3 failures [18:10:22] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: Puppet has 3 failures [18:11:02] Hi Reedy, greg-g :) I have a quick, and possibly silly, question about today's 1.25wmf9 train. So I see that version is already deployed to group 0. If we add a new patch to 1.25wmf9 before the train leaves, what happens? do we still get on and get deployed to group 1? [18:11:09] thanks in advance! :) [18:11:50] AndyRussG: If you don't deploy it, nothing happens [18:12:10] If you ask nicely, I can make sure I deploy said patch for you [18:12:39] Reedy: Ahhhh hmmm :) OK got it, and thank you, and please (in reverse order) [18:12:50] Which patch? [18:13:27] Reedy: it's actually still in CR, but has a +1 and we'd love to get it out to non-WP sites to test it a bit... https://gerrit.wikimedia.org/r/#/c/175732/ [18:14:23] Just cherry picked it to your wmf_deploy branch... [18:14:24] https://gerrit.wikimedia.org/r/#/c/175768/ [18:14:36] Do you want it updating in wmf8 too? [18:15:21] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=372 [critical =325] [18:15:24] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=372 [critical =325] [18:15:24] RECOVERY - check_puppetrun on payments1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:15:25] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: Puppet has 3 failures [18:15:26] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: Puppet has 3 failures [18:16:43] Reedy: maaaybe :) lemme consult w/ Elliot... [18:18:34] (03PS1) 10Legoktm: Enable REL1_24 for ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175771 [18:19:58] PROBLEM - check if dhclient is running on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:20:13] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=374 [critical =325] [18:20:21] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=374 [critical =325] [18:20:22] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: Puppet has 3 failures [18:20:23] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: Puppet has 3 failures [18:20:24] PROBLEM - check if salt-minion is running on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:22:30] (03CR) 10Giuseppe Lavagetto: [C: 031] "Fair enough, fine by me for now. Mine was mostly one question and a suggestion." [puppet] - 10https://gerrit.wikimedia.org/r/175633 (owner: 10GWicke) [18:25:11] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [18:25:25] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=375 [critical =325] [18:25:25] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=375 [critical =325] [18:25:26] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: Puppet has 3 failures [18:25:28] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: Puppet has 3 failures [18:28:41] !log deployed patches for T74222 and T72901 [18:28:44] Logged the message, Master [18:28:59] RECOVERY - check if dhclient is running on rhenium is OK: PROCS OK: 0 processes with command name dhclient [18:29:11] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "One merge today broke this slightly" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/175633 (owner: 10GWicke) [18:29:14] RECOVERY - check if salt-minion is running on rhenium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:30:12] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [18:30:13] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=377 [critical =325] [18:30:19] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=377 [critical =325] [18:30:20] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: Puppet has 3 failures [18:30:21] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: Puppet has 3 failures [18:35:15] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [18:35:17] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=377 [critical =325] [18:35:18] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=377 [critical =325] [18:35:20] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:35:24] RECOVERY - check_puppetrun on payments1002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [18:39:39] What is going on with Gerrit? 
It's super slow [18:39:55] I can load websites in Europe much faster than I can load anything on gerrit.wm.o [18:40:18] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [18:40:20] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [18:40:22] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [18:45:16] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 114 seconds ago with 0 failures [18:45:26] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [18:45:27] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [18:50:10] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [18:50:14] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [18:55:12] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [18:55:13] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [19:00:05] Reedy, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141125T1900). Please do the needful. [19:00:19] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [19:00:20] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [19:02:26] (03PS1) 10Cmjohnson: Adding bob west to data.yaml and to west1 to analytics-privatedata-users: [puppet] - 10https://gerrit.wikimedia.org/r/175780 [19:05:16] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [19:05:17] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [19:05:58] Reedy: When you do the train, can oyu also bump wmf8 Wikidata? [19:06:29] hoo: is there a commit for it? [19:06:44] no submodule update, yet [19:09:13] robh would you review for me and +1 it's an access thing https://gerrit.wikimedia.org/r/#/c/175780/1 [19:09:48] (03CR) 10GWicke: "I think I'd like to get this merged for now. My reading of https://gerrit.wikimedia.org/r/#/c/174694/ suggests that it'll still work, so l" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/175633 (owner: 10GWicke) [19:10:07] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [19:10:08] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [19:10:58] cmjohnson: uhh, does that group include bastion access? 
[19:12:22] cmjohnson: you have to also add him to bastion group [19:12:43] (i think) [19:12:56] i'm not sure [19:13:07] I'd say yes [19:13:12] Reedy: https://gerrit.wikimedia.org/r/175784 there you go [19:13:26] (03PS2) 10GWicke: Include and configure the restbase role on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/175633 [19:13:50] i think otherwise folks are in bastion in that group, or in another group that somehow gives bastion [19:13:51] (03CR) 10GWicke: "Rebased on top of current production." [puppet] - 10https://gerrit.wikimedia.org/r/175633 (owner: 10GWicke) [19:14:22] hoo: thanks [19:14:43] (03CR) 10RobH: [C: 04-1] "if i recall correctly, you also need to include the user in the bastion group. Most of the users in the group you've added him to are eit" [puppet] - 10https://gerrit.wikimedia.org/r/175780 (owner: 10Cmjohnson) [19:15:17] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [19:15:19] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=367 [critical =325] [19:16:21] ori, robh, paravoid, mutante: https://gerrit.wikimedia.org/r/#/c/175633/ [19:17:10] you ping all us and not the ops contact person for the week ;] [19:17:16] !log reedy Synchronized php-1.25wmf8/extensions/CentralNotice: Ib4d23f2a588f58ef3abcbd8b0b500ad8534723cd (duration: 00m 07s) [19:17:24] Logged the message, Master [19:18:04] robh: fair point, /cc cmjohnson [19:18:08] !log reedy Synchronized php-1.25wmf9/extensions/CentralNotice: Ib4d23f2a588f58ef3abcbd8b0b500ad8534723cd (duration: 00m 06s) [19:18:11] Logged the message, Master [19:18:38] (03PS2) 10Cmjohnson: Adding bob west to data.yaml and to west1 to restricted bastion (terbium) and analytics-privatedata-users RT8896 [puppet] - 10https://gerrit.wikimedia.org/r/175780 [19:18:49] !log reedy Synchronized php-1.25wmf8/extensions/Wikidata: I08946aac3 (duration: 00m 12s) [19:18:52] Logged the message, Master [19:20:01] (03CR) 10Cmjohnson: [C: 032] Include and configure the restbase role on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/175633 (owner: 10GWicke) [19:20:30] gwicke: hope this works for you [19:20:55] cmjohnson: thank you! 
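The review back-and-forth above (the patch gets retitled across patch sets to include a bastion group) comes down to a property of the admin data in puppet: access to a restricted host only works if the user is also in a group that grants bastion access. A sketch of the membership check the reviewers are doing by hand, assuming a simplified data.yaml shape rather than the real admin-module schema:

```python
# Sketch of the membership check discussed above: does the user end up in
# a group that grants bastion access? The data.yaml structure here is a
# simplified assumption, not the real admin-module schema. Requires PyYAML.
import yaml

data = yaml.safe_load("""
groups:
  bastiononly:
    members: [west1]
  analytics-privatedata-users:
    members: [west1]
""")

BASTION_GROUPS = {"bastiononly"}  # assumed group name

def groups_for(user):
    return {name for name, g in data["groups"].items()
            if user in g.get("members", [])}

member_of = groups_for("west1")
print("groups:", sorted(member_of))
# The point of the amended patch sets: the restricted group alone is not
# enough; a bastion-granting group must also list the user.
print("has bastion:", bool(member_of & BASTION_GROUPS))
```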
[19:21:00] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=353 [critical =325] [19:21:01] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=353 [critical =325] [19:21:06] merged [19:21:17] (03PS2) 10Reedy: Non Wikipedias to 1.25wmf9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175692 [19:21:22] (03CR) 10Reedy: [C: 032] Non Wikipedias to 1.25wmf9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175692 (owner: 10Reedy) [19:21:50] (03Merged) 10jenkins-bot: Non Wikipedias to 1.25wmf9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175692 (owner: 10Reedy) [19:25:16] PROBLEM - check_recurring_gc_failures_missed on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=335 [critical =325] [19:25:18] PROBLEM - check_recurring_gc_jobs_required on db1025 is CRITICAL: CRITICAL recurring_gc_failures_missed=335 [critical =325] [19:26:58] (03PS3) 10Cmjohnson: Adding bob west to data.yaml and to west1 to bastion only and analytics-privatedata-users RT8896 [puppet] - 10https://gerrit.wikimedia.org/r/175780 [19:27:46] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non Wikipedias to 1.25wmf9 [19:27:50] Logged the message, Master [19:28:07] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non Wikipedias to 1.25wmf9 [19:28:15] 19:27:43 sudo -u mwdeploy -n -- /usr/bin/rsync -l tin.eqiad.wmnet::common/wikiversions*.{json,cdb} /srv/mediawiki on mw1062 returned [255]: Error reading response length from authentication socket. [19:28:17] worked second time [19:28:20] stupid thing [19:29:07] (03PS2) 10Reedy: Enable REL1_24 for ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175771 (owner: 10Legoktm) [19:29:13] (03CR) 10Reedy: [C: 032] Enable REL1_24 for ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175771 (owner: 10Legoktm) [19:29:24] (03Merged) 10jenkins-bot: Enable REL1_24 for ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175771 (owner: 10Legoktm) [19:30:32] ori: Does https://gerrit.wikimedia.org/r/#/c/174885/ want merging? [19:32:21] (03PS2) 10Reedy: Adding *.commonists.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175426 (owner: 10Steinsplitter) [19:32:23] (03CR) 10Reedy: [C: 032] Adding *.commonists.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175426 (owner: 10Steinsplitter) [19:32:24] (03Merged) 10jenkins-bot: Adding *.commonists.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175426 (owner: 10Steinsplitter) [19:32:34] !log Created wikilove tables on zhwikivoyage [19:32:48] (03PS2) 10Reedy: Enable WikiLove extension on zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175407 (owner: 10Glaisher) [19:32:53] (03CR) 10Reedy: [C: 032] Enable WikiLove extension on zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175407 (owner: 10Glaisher) [19:32:55] Logged the message, Master [19:33:02] (03Merged) 10jenkins-bot: Enable WikiLove extension on zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175407 (owner: 10Glaisher) [19:36:09] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. 
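The "Error reading response length from authentication socket" above is a one-off: the identical rsync succeeded on the second run. The usual fix for that class of flake is a small retry wrapper around the command; a sketch, with a harmless placeholder command standing in for the real scap rsync:

```python
# Sketch of a retry wrapper for a transiently failing command like the
# rsync above. The placeholder command always succeeds; a real caller
# would pass the actual rsync invocation.
import subprocess
import time

def run_with_retries(cmd, attempts=3, delay=2.0):
    for i in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        print("attempt %d failed [%d]: %s"
              % (i, result.returncode, result.stderr.strip()))
        if i < attempts:
            time.sleep(delay)
    raise RuntimeError("failed after %d attempts: %r" % (attempts, cmd))

run_with_retries(["true"])  # placeholder command
```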
[19:36:11] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [19:38:53] !log reedy Synchronized wmf-config/: Config updates (duration: 00m 06s) [19:38:58] Logged the message, Master [19:39:22] Reedy: oops, I didn't want to deploy that yet. [19:39:49] but it's fine, people will be confused for a day [19:40:10] lol [19:41:19] Reedy: https://gerrit.wikimedia.org/r/#/c/175769/ should go out tomorrow as well...if it gets merged now, will it be picked up by l10nupdate? [19:41:56] I don't think new messages are [19:42:05] they need to exist in the deployment branch [19:42:13] (03CR) 10Cmjohnson: "I'm not sure but I believe this would affect the cameras in eqiad. Robh may be able to offer better input." [puppet] - 10https://gerrit.wikimedia.org/r/173996 (owner: 10Dzahn) [19:42:23] I'm updating the 1.24 message [19:42:35] it already exists in the deploy branch... [19:42:41] yeah [19:42:44] of course logstash decided to have a problem while i was demoing it in a meeting :\ [19:42:45] I saw 2- and 3+ [19:42:58] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 13 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 10, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 110, uinitializing_shards: 3, unumber_of_data_nodes: 3} [19:43:10] manybubbles: do you remember the dates we have been testing cirrus? i need to pull some logs [19:43:31] matanya: no [19:43:35] hmmm - why yellow [19:43:46] hmm what's with the 'u' prefix in that icinga message [19:44:05] manybubbles, maybe i upset it with too many queries [19:44:06] weird [19:44:15] jgage: it seems happy [19:44:17] though i avoid sorting on large fields [19:44:18] jgage: It's python unicode string mess u'unicode here' [19:44:25] ah [19:46:53] jgage: logstash1002 OOM'd at 2014-11-25 19:30:06 [19:47:56] nice [19:48:22] i was doing a kibana query showing Severity:INFO, Severity:WARN, Severity:ERROR [19:48:34] It seems still a bit sick.
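On jgage's question about the 'u' prefixes: bd808's answer is the whole story. The check parses the _cluster/health JSON under Python 2, where every decoded string is a unicode object, and then interpolates the whole dict into the message, so each key and value is rendered with repr()'s u prefix (icinga apparently also stripped the quote characters, turning u'status' into ustatus). A quick demonstration:

```python
# Why the icinga message was full of u-prefixes: under Python 2,
# json.loads() returns unicode strings, and dropping the whole dict into
# a message formats it with repr().
import json

health = json.loads('{"status": "yellow", "number_of_nodes": 3}')
print("elasticsearch %s" % health)
# Python 2 prints: elasticsearch {u'status': u'yellow', u'number_of_nodes': 3}
# Python 3 prints: elasticsearch {'status': 'yellow', 'number_of_nodes': 3}
# Formatting the fields explicitly avoids the noise on either version:
print("elasticsearch status %s, nodes %d"
      % (health["status"], health["number_of_nodes"]))
```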
I'll restart elastic there [19:48:42] thanks [19:49:33] !log restarted elasticsearch on logstash1002 after OOM [19:49:36] Logged the message, Master [19:50:07] RECOVERY - check_recurring_gc_failures_missed on db1025 is OK: OK recurring_gc_failures_missed=289 [19:50:09] RECOVERY - check_recurring_gc_jobs_required on db1025 is OK: OK recurring_gc_failures_missed=289 [20:02:33] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 10, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 2, number_of_data_nodes: 3 [20:02:36] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 10, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 2, number_of_data_nodes: 3 [20:03:34] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 10, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 2, number_of_data_nodes: 3 [20:04:01] Reedy: did you push the submodule update, yet? [20:04:20] yeah [20:04:50] ah crap [20:07:40] (03PS1) 10Legoktm: Bring in cdb library via composer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175796 [20:08:14] Reedy: You sure you pushed it? [20:08:30] yup [20:08:35] I did wonder if you meant to do wmf9, not wmf8 [20:08:43] Reedy: both actually [20:08:50] but I forgot wmf9 [20:09:55] !log reedy Synchronized php-1.25wmf8/extensions/Wikidata: Ensure my sanity (duration: 00m 13s) [20:09:57] Logged the message, Master [20:10:16] (03PS1) 10Dr0ptp4kt: Vary mdot webroot on Accept-Language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175797 [20:11:11] ^ bblack, yurikR your review requested, please. yurikR, as change notes, see also my latest update on https://gerrit.wikimedia.org/r/#/c/169210/ (thanks for updates to that patch). [20:11:22] bblack: yurikR gotta get some food and call my dad. ttyl [20:12:19] Reedy: Trying to push the other submodule update atm [20:12:25] but gerrit is not accepting it# [20:12:30] well, ssh is waiting [20:12:58] ESTABLISHED [20:13:00] weird# [20:13:24] Reedy: https://gerrit.wikimedia.org/r/#/c/175798/ [20:15:44] Also I don't like this keyboard much# [20:15:47] :P [20:16:01] !log reedy Synchronized php-1.25wmf9/extensions/Wikidata: Ic070ce0beb142e100490940fddaa0bd36b8a50be (duration: 00m 14s) [20:16:06] Logged the message, Master [20:48:21] (03CR) 10BryanDavis: [C: 031] "Not the world's most elegant solution but better than copypasta to duplicate the library." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175796 (owner: 10Legoktm) [20:53:38] (03PS1) 10AndyRussG: Enable CentralNotice client banner choice a few places [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175849 [20:58:30] (03PS2) 10AndyRussG: Enable CentralNotice client banner choice a few places [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175849 [21:00:05] AndyRussG, ejegg: Dear anthropoid, the time has come.
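The RECOVERY lines above still report status yellow with ten unassigned and two initializing shards, which is the expected picture right after a node restart: all primaries are assigned, replicas are still being placed. A sketch of the poll behind these checks, assuming a cluster reachable on localhost:9200 (the production checks hit each node's 9200 port directly):

```python
# Sketch of the poll behind the "ElasticSearch health check for shards"
# alerts: GET _cluster/health and judge the shard counts. The URL is an
# assumption for the sketch.
import json
from urllib.request import urlopen

def cluster_health(base="http://localhost:9200"):
    with urlopen(base + "/_cluster/health", timeout=10) as resp:
        return json.load(resp)

h = cluster_health()
total = (h["active_shards"] + h["unassigned_shards"]
         + h["initializing_shards"])
inactive_pct = 100.0 * (total - h["active_shards"]) / total
print("status=%s active=%d unassigned=%d initializing=%d inactive=%.1f%%"
      % (h["status"], h["active_shards"], h["unassigned_shards"],
         h["initializing_shards"], inactive_pct))
# Yellow with shrinking unassigned/initializing counts after a restart
# means replicas are being re-placed; red, or counts that keep growing,
# is real trouble.
```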
Please deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141125T2100). [21:08:43] (03PS1) 10Jgreen: add mgmt.frack.codfw.wmnet hosts [dns] - 10https://gerrit.wikimedia.org/r/175856 [21:15:08] !log ejegg Synchronized php-1.25wmf9/extensions/CentralNotice/: One more CentralNotice fix to get out ahead of the winter rush (duration: 00m 05s) [21:15:13] Logged the message, Master [21:15:16] (03CR) 10Dzahn: [C: 032] ganglia: remove pmtpa varnish stanza [puppet] - 10https://gerrit.wikimedia.org/r/174205 (owner: 10Dzahn) [21:17:57] (03CR) 10Ejegg: [C: 032] Enable CentralNotice client banner choice a few places [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175849 (owner: 10AndyRussG) [21:18:10] (03Merged) 10jenkins-bot: Enable CentralNotice client banner choice a few places [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175849 (owner: 10AndyRussG) [21:20:51] (03PS4) 10Dzahn: realm.pp - remove pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/173476 [21:22:06] (03CR) 10Dzahn: realm.pp - remove pmtpa (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/173476 (owner: 10Dzahn) [21:22:33] !log ejegg Synchronized wmf-config/CommonSettings.php: Turn CN client-side banner choice back on for selected wmf9 wikis (duration: 00m 05s) [21:22:37] Logged the message, Master [21:24:16] (03CR) 10Jgreen: [C: 032 V: 031] add mgmt.frack.codfw.wmnet hosts [dns] - 10https://gerrit.wikimedia.org/r/175856 (owner: 10Jgreen) [21:27:50] (03CR) 10Dzahn: "looks to me like the class is not applied on a node currently" [puppet] - 10https://gerrit.wikimedia.org/r/173996 (owner: 10Dzahn) [21:28:53] (03CR) 10Dzahn: "so are we ok with installing the key like this? the related ticket had some comments/discussion but does it mean this should wait?" [puppet] - 10https://gerrit.wikimedia.org/r/174161 (owner: 10Dzahn) [21:29:59] YuviPanda: re: your comment on moving the PDU monitoring classes [21:30:09] " the code itself should be in a module " [21:30:25] are you suggesting it be a separate module just for PDU monitoring? [21:30:34] even though we already have so many monitoring modules? [21:31:09] https://gerrit.wikimedia.org/r/#/c/173999/1/modules/nagios_common/manifests/pdu_monitoring.pp [21:32:32] (03Abandoned) 10Dzahn: gerrit role: add ssh::server listening on other IP [puppet] - 10https://gerrit.wikimedia.org/r/174015 (https://bugzilla.wikimedia.org/35611) (owner: 10Dzahn) [21:33:18] (03CR) 10Dzahn: "coren, andrew, so killing autofs class should be fine, right?" [puppet] - 10https://gerrit.wikimedia.org/r/173991 (owner: 10Dzahn) [21:35:16] (03CR) 10Dzahn: "mha::node is included from role::coredb::common. how does this look, springle?" [puppet] - 10https://gerrit.wikimedia.org/r/173464 (owner: 10Dzahn) [21:36:11] (03PS4) 10Dzahn: LDAP: rm pmtpa, +codfw, gluster/NFS server undef [puppet] - 10https://gerrit.wikimedia.org/r/173470 [21:37:22] (03CR) 10Dzahn: "Coren, looks good?" [puppet] - 10https://gerrit.wikimedia.org/r/173470 (owner: 10Dzahn) [21:39:05] !log ejegg Synchronized php-1.25wmf8/extensions/CentralNotice/: One more CentralNotice fix to get out ahead of the winter rush - wmf8 (duration: 00m 07s) [21:39:09] Logged the message, Master [21:39:32] (03CR) 10coren: [C: 031] "Should be okay, though I expect some more can be axed (see inline)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/173470 (owner: 10Dzahn) [21:40:40] (03CR) 10coren: [C: 031] "Kill it with fire and hope it never comes back."
[puppet] - 10https://gerrit.wikimedia.org/r/173991 (owner: 10Dzahn) [21:44:06] cmjohnson: ping [21:44:16] (03CR) 10Dzahn: "i talked about the "enabling MobileFrontend per wiki" to Reedy before making this and he had pointed how it is:" [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (owner: 10Dzahn) [21:44:18] (03CR) 10RobH: [C: 031] "can go away" [puppet] - 10https://gerrit.wikimedia.org/r/173996 (owner: 10Dzahn) [21:44:45] gwicke: pong [21:46:18] hey, the hiera changes you merged don't seem to have taken effect yet the way I thought they would; just as a sanity check: is there anything special that needs to be done apart from the puppet merge to have hiera changes take effect? [21:46:49] I *think* the answer is no, and the issue is elsewhere; just want to rule this out [21:48:08] odd, not merged on the puppet master [21:49:17] (03PS1) 10EBernhardson: Flow whitelist for pages converted from LQT on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175861 [21:49:44] (03CR) 10Cmjohnson: [C: 032] Adding bob west to data.yaml and to west1 to bastion only and analytics-privatedata-users RT8896 [puppet] - 10https://gerrit.wikimedia.org/r/175780 (owner: 10Cmjohnson) [21:50:10] cmjohnson: okay, I guess it might be related to the recent hiera setup changes then [21:51:24] (03PS1) 10AndyRussG: Enable CentralNotice client banner choice everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175863 [21:52:43] (03CR) 10Ejegg: [C: 032] Enable CentralNotice client banner choice everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175863 (owner: 10AndyRussG) [21:55:13] !log ejegg Synchronized wmf-config/CommonSettings.php: Turn CN client-side banner choice back on everywhere (duration: 00m 05s) [21:55:17] Logged the message, Master [22:00:05] spagewmf, ebernhardson: Dear anthropoid, the time has come. Please deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141125T2200). [22:01:20] (03PS3) 10Cmjohnson: admin: grant qchris tin access (through deployers) [puppet] - 10https://gerrit.wikimedia.org/r/175315 (owner: 10John F. Lewis) [22:03:44] (03CR) 10Cmjohnson: [C: 032] "It has been 3 days and req's met." [puppet] - 10https://gerrit.wikimedia.org/r/175315 (owner: 10John F. Lewis) [22:07:40] (03CR) 10Mattflaschen: [C: 031] "This is all of them, except User:SPage/TestZero" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175861 (owner: 10EBernhardson) [22:09:24] YuviPanda, just tried to log into Quarry - 502 Bad Gateway upon returning back to it [22:12:40] any documentation on changes to deployment process?
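On gwicke's sanity check above: normally no extra step is needed; once a change is puppet-merged on the puppetmaster, hiera reads the YAML from that checkout on the next agent run. The other classic way a merged hiera change "does not take effect" is the hierarchy itself: lookup is first-match-wins, so a higher-priority level can silently shadow the file that was edited. A sketch of that lookup order, with an invented two-level hierarchy, key, and values:

```python
# Sketch of hiera's first-match-wins lookup, showing how a merged change
# can still appear to have no effect: a higher-priority level already
# defines the key. Hierarchy, filenames, key, and values are all invented.
# Requires PyYAML.
import yaml

hierarchy = ["hosts/xenon.yaml", "common.yaml"]  # highest priority first
files = {
    "hosts/xenon.yaml": yaml.safe_load("restbase::seeds: [old-seed]"),
    "common.yaml": yaml.safe_load("restbase::seeds: [new-seed]"),
}

def hiera_lookup(key):
    for level in hierarchy:
        data = files.get(level) or {}
        if key in data:
            return level, data[key]
    return None, None

level, value = hiera_lookup("restbase::seeds")
# The edit landed in common.yaml, but the host-level file wins:
print("resolved from %s: %r" % (level, value))
```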
getting permission denied on tin trying to fetch php-1.25wmf9 [22:13:00] (03CR) 10Dzahn: [C: 032] delete class facilities::dc-cam-transcoder [puppet] - 10https://gerrit.wikimedia.org/r/173996 (owner: 10Dzahn) [22:13:47] (03PS2) 10Dzahn: kill facilities.pp, move to nagios_common [puppet] - 10https://gerrit.wikimedia.org/r/173999 [22:15:04] (03CR) 10Dzahn: kill facilities.pp, move to nagios_common (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/173999 (owner: 10Dzahn) [22:16:46] (03PS3) 10Dzahn: delete class ldap::client::autofs [puppet] - 10https://gerrit.wikimedia.org/r/173991 [22:17:22] (03CR) 10Dzahn: [C: 032] delete class ldap::client::autofs [puppet] - 10https://gerrit.wikimedia.org/r/173991 (owner: 10Dzahn) [22:20:02] !log ebernhardson Synchronized php-1.25wmf9/extensions/Echo/: Bump Echo in 1.25wmf9 (duration: 00m 08s) [22:20:07] Logged the message, Master [22:20:24] !log ebernhardson Started scap: Bump Echo and Flow in 1.25wmf9 for officewiki deployment [22:20:53] Logged the message, Master [22:21:10] I'm getting 503 error trying to log in via OAuth on phabricator. [22:22:05] worked now. [22:25:11] (03PS5) 10Dzahn: LDAP: rm pmtpa, +codfw, gluster/NFS server undef [puppet] - 10https://gerrit.wikimedia.org/r/173470 [22:30:44] (03PS1) 10Legoktm: Remove enwiki's OTRS-member group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175872 [22:30:53] (03CR) 10Andrew Bogott: LDAP: rm pmtpa, +codfw, gluster/NFS server undef (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/173470 (owner: 10Dzahn) [22:33:58] (03PS2) 10EBernhardson: Flow whitelist for pages converted from LQT on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175861 [22:37:34] (03CR) 10Dzahn: [C: 04-2] "rebased, change is smaller now, but that cert name change needs to be checked closer, leaving to Andrew for the moment per comments above" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/173470 (owner: 10Dzahn) [22:37:39] (03CR) 10Mattflaschen: [C: 031] "That's everything including the test page." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175861 (owner: 10EBernhardson) [22:38:24] (03CR) 10Mattflaschen: "We'll start this conversion when we're done scapping." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175861 (owner: 10EBernhardson) [22:39:59] (03CR) 10Dzahn: "now depends on: "solve this with hiera instead" comment on https://gerrit.wikimedia.org/r/#/c/174015/3" [puppet] - 10https://gerrit.wikimedia.org/r/172313 (https://bugzilla.wikimedia.org/35611) (owner: 10Dereckson) [22:40:59] (03CR) 10Dzahn: [C: 04-2] "technical downvote - stalled but should still be done once we have the requirements" [puppet] - 10https://gerrit.wikimedia.org/r/172313 (https://bugzilla.wikimedia.org/35611) (owner: 10Dereckson) [22:42:12] (03CR) 10Dzahn: "time to do now that phab migration has happened?" 
[puppet] - 10https://gerrit.wikimedia.org/r/174583 (owner: 10Aklapper) [22:46:30] (03PS3) 1020after4: Phab: Change user visible strings "Execute Query" and "Real Name" [puppet] - 10https://gerrit.wikimedia.org/r/174583 (owner: 10Aklapper) [22:50:16] (03CR) 10Dzahn: [C: 032] Phab: Change user visible strings "Execute Query" and "Real Name" [puppet] - 10https://gerrit.wikimedia.org/r/174583 (owner: 10Aklapper) [22:50:42] !log ebernhardson Finished scap: Bump Echo and Flow in 1.25wmf9 for officewiki deployment (duration: 30m 17s) [22:50:44] Logged the message, Master [23:02:41] (03PS1) 10Dzahn: apachesync - delete sync-apache script [puppet] - 10https://gerrit.wikimedia.org/r/175884 [23:02:51] (03CR) 10Mattflaschen: [C: 04-1] "Postponing this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175861 (owner: 10EBernhardson) [23:04:48] (03CR) 10Dzahn: "i'd do this first: https://gerrit.wikimedia.org/r/#/c/175884/1 and then talk about moving the remaining scripts, and then kill this in mu" [puppet] - 10https://gerrit.wikimedia.org/r/164508 (owner: 10Ori.livneh) [23:06:05] !log on osmium: removing stale static pcre and zip libraries in /usr/local , installed by hhvm [23:06:10] Logged the message, Master [23:09:55] (03PS1) 10Dzahn: add virt1000 to scap dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/175889 [23:11:37] (03CR) 10Dzahn: [C: 031] "afaict "mediawiki-installation" is the relevant one that is used by scap" [puppet] - 10https://gerrit.wikimedia.org/r/175889 (owner: 10Dzahn) [23:13:20] mutante: You can just do Bug: T12345 [23:13:22] That works [23:19:43] Reedy: notification bot? [23:19:57] oh, i'll try [23:20:05] You mean putting a commit into gerrit ending up with a comment on phab that it was made? [23:20:06] Yeah [23:22:26] legoktm: You going to deploy the magic for wikibugs? [23:22:36] James_F: Already deployed [23:22:41] I !log'd it in -labs [23:23:09] Reedy: yes, and when it gets merged [23:24:21] legoktm: Aha. My stalk didn't ping for some reason. [23:48:04] (03PS1) 10Aaron Schulz: Moved sampling to the profiler config itself [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175891 [23:48:18] (03CR) 10Aaron Schulz: [C: 04-2] Moved sampling to the profiler config itself [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175891 (owner: 10Aaron Schulz) [23:57:20] PROBLEM - Disk space on db2011 is CRITICAL: DISK CRITICAL - free space: /srv 57537 MB (3% inode=99%): [23:57:45] ya'll got a big backlog: https://phabricator.wikimedia.org/project/board/29/ ;)
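Two threads in this stretch of the log meet in the same mechanism: the train entries that "rebuilt wikiversions.cdb and synchronized wikiversions files", and legoktm's change pulling a cdb library into mediawiki-config via composer. wikiversions maps each wiki's dbname to the MediaWiki directory that serves it; the JSON copy is for humans and tooling, and the compiled CDB copy gives every web request a constant-time lookup with no parsing. A sketch of the idea, with a plain dict standing in for the CDB file and an illustrative subset of entries:

```python
# Sketch of the wikiversions mechanism behind "Non Wikipedias to
# 1.25wmf9": a dbname -> version map, edited as JSON and compiled to CDB
# for fast per-request lookups. The entries are illustrative, not the
# full list.
import json

wikiversions = json.loads("""
{
  "enwiki": "php-1.25wmf8",
  "enwikivoyage": "php-1.25wmf9",
  "mediawikiwiki": "php-1.25wmf9"
}
""")

def version_for(dbname):
    # In production this is a CDB read (hence the composer change above):
    # one constant-time disk lookup, no JSON parsing on every request.
    return wikiversions[dbname]

print(version_for("enwikivoyage"))  # -> php-1.25wmf9 after this train
```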