[00:00:01] is anyone experiencing images disappearing
[00:00:02] ?
[00:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180906T0000).
[00:00:05] i am on https://commons.wikimedia.org/wiki/File:Missing_avatar.svg
[00:00:32] i see https://phabricator.wikimedia.org/F25671230
[00:00:54] Operations, ops-codfw: Degraded RAID on db2053 - https://phabricator.wikimedia.org/T203623 (ops-monitoring-bot)
[00:01:02] paladox: i see the actual file called "missing avatar". interesting name for this issue
[00:01:31] shows missing images for the copyright tag too
[00:01:58] https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Wiki_Loves_Monuments_Logo_notext.svg/55px-Wiki_Loves_Monuments_Logo_notext.svg.png
[00:02:44] seems to work now after pressing the refresh button constantly
[00:03:19] i can't confirm the issue on my side
[00:03:24] ok
[00:03:44] i was about to ask if other .svg files work
[00:03:55] but then you linked to the .png there too
[00:37:22] (PS1) Dzahn: icinga: make the apache server name configurable [puppet] - https://gerrit.wikimedia.org/r/458336 (https://phabricator.wikimedia.org/T202782)
[00:42:20] (PS2) Dzahn: icinga: make the apache virtual host name configurable [puppet] - https://gerrit.wikimedia.org/r/458336 (https://phabricator.wikimedia.org/T202782)
[00:46:27] (PS3) Dzahn: icinga: make the apache virtual host name configurable [puppet] - https://gerrit.wikimedia.org/r/458336 (https://phabricator.wikimedia.org/T202782)
[00:51:28] (PS4) Dzahn: icinga: make the apache virtual host name configurable [puppet] - https://gerrit.wikimedia.org/r/458336 (https://phabricator.wikimedia.org/T202782)
[00:56:42] (PS5) Dzahn: icinga: make the apache virtual host name configurable [puppet] - https://gerrit.wikimedia.org/r/458336 (https://phabricator.wikimedia.org/T202782)
[01:00:04]
legoktm and CindyCicaleseWMF: My dear minions, it's time we take the moon! Just kidding. Time for fixcopyright.wikimedia.org updates deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180906T0100).
[01:01:05] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/12374/" [puppet] - https://gerrit.wikimedia.org/r/458336 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:03:49] (CR) Dzahn: [C: 1] "looks good to me now. see compiler. (basically) no change on einsteinium and changes on icinga1001" [puppet] - https://gerrit.wikimedia.org/r/458336 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:05:39] (PS1) Dzahn: wikistats (vps): remove scope.lookupvar from erb template [puppet] - https://gerrit.wikimedia.org/r/458338
[01:15:09] (PS1) Dzahn: tor: make it possible to config service running/stopped in Hiera [puppet] - https://gerrit.wikimedia.org/r/458339 (https://phabricator.wikimedia.org/T196701)
[01:16:37] Operations, Patch-For-Review, Tor: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (Dzahn) >>! In T196701#4537442, @MoritzMuehlenhoff wrote: > Plan looks good, two things to consider: > - ..On stretch the thirdparty/tor component needs to be explicitly added...
[01:17:40] o/
[01:20:24] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/12375/radium.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/458339 (https://phabricator.wikimedia.org/T196701) (owner: Dzahn)
[01:35:04] (PS6) Dzahn: piwik: add support for stretch/PHP7 [puppet] - https://gerrit.wikimedia.org/r/453553
[01:36:06] !log legoktm@deploy1001 Started scap: EUCopyrightCampaign updates
[01:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:35] (CR) Dzahn: piwik: add support for stretch/PHP7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/453553 (owner: Dzahn)
[01:39:09] Krinkle: hit mwdebug1002 / fixcopyrightwiki
[01:39:13] er, try it out
[01:40:54] km@km-pt ~> curl -I -H "X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet" "https://fixcopyright.wikimedia.org/wiki/" | grep location
[01:40:54] location: https://fixcopyright.wikimedia.org/
[01:51:14] legoktm: seems to work, '/' responds 200 OK with the main page, sidebar on other pages links to '/', and redirects to main page redirect to '/' (e.g. /w/, /w/index.php, /wiki)
[01:51:40] (PS1) Legoktm: Have canonical Main Page URL be the domain root for fixcopyrightwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/458345
[01:51:50] meanwhile /wiki/Fix_copyright does not normalise, which is fine I guess. That's not a redirect for the same reason MW doesn't redirect ?title= or non-canonical url encodings
[01:51:53] it used to, but not anymore.
[01:51:55] Fine right?
[01:52:17] I think that makes sense
[01:52:36] and the language selector will add an extra ?title=Fix_copyright but that's OK IMO
[01:52:48] (CR) Legoktm: "This works :). Only concern I have is that it's unclear what license legoktm: oh, interesting.
[01:54:50] I hadn't noticed the ULS behaviour.
[01:54:53] I wonder why it does that
[01:55:14] it's not actually ULS fwiw
[01:55:39] Krinkle: see https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/skins/EUCopyrightCampaignSkin/+/master/resources/main.js
[01:56:34] legoktm: oh, it overrides the whole query string
[01:56:45] hm.. right, and the title is there for compatibility with index.php urls
[01:57:03] mw.Uri, extend({uselang: .. }).toString()
[01:57:05] :)
[01:57:07] anyway
[02:00:01] (CR) Legoktm: "> Only concern I have is that it's unclear what license indeed
[02:06:12] Krinkle: also, if you're not too busy, a review on https://gerrit.wikimedia.org/r/458323 would be appreciated so we can have the Accept-Language autodetection working
[02:12:05] legoktm: also https://phabricator.wikimedia.org/T120085
[02:13:27] Krinkle: ty, is the task about changing it for everything blocked on anything specific?
[02:13:53] we'd probably want to implement this feature in MW itself rather than relying on two hooks
[02:14:00] legoktm: Not really, other than that we don't really know why these hacks work, whether there is any gotcha, and whether it is officially supported/supportable.
[02:14:08] (CR) Krinkle: [C: 1] Have canonical Main Page URL be the domain root for fixcopyrightwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/458345 (owner: Legoktm)
[02:15:04] also, it may need product perspective and/or research evaluation to decide whether we actually want it (separate from the how)
[02:15:47] I wonder how many automated checks depend on the fact that en.wikipedia.org is a redirect
[02:21:10] Hehe
[02:21:16] Yeah, probably a few
[02:21:44] It also means the checks will actually fail if page views fail
[02:21:48] which is probably a good thing
[02:22:16] legoktm: btw, when you have a minute, could use review on https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/455229/
[02:22:22] would like to simplify some dashboards and queries
[02:24:38] Krinkle: why is E_WARNING mapped to ERROR?
[02:24:54] Why not?
[02:25:05] In python all of these things would be exceptions
[02:25:06] I would expect that stuff like wfWarn is...a warning
[02:25:21] Right, but file_get_contents() should be error if file is missing
[02:25:28] I wasn't thinking of wfWarn
[02:25:41] Does that go through here?
[02:25:53] Should probably have its own legacy channel
[02:27:42] sigh
[02:27:44] wfWarn is E_USER_NOTICE
[02:28:08] Yeah, it goes to wfDebugLog('warning') so it's channel=warning I guess
[02:28:46] well it's also trigger_error (E_USER_NOTICE, ...)
[02:28:48] wfWarn => MWdebug:warning => sendMessage(..,'warning') => wfDebugLog('warning', ..) => getLogger('warning')->info()
[02:29:12] Right
[02:29:20] Oh, and that gets handled by php error/mw error
[02:29:22] I see
[02:29:40] OK. I can separate *user* notice to remain as warning
[02:29:46] as opposed to E_NOTICE
[02:30:59] Nice catch
[02:31:29] Krinkle: I +2'd it as an improvement over the status quo
[02:31:44] I sure wish we were in Python where everything would be an exception!
[02:31:56] oh
[02:31:58] you pushed PS3
[02:32:15] And 4
[02:32:21] check again
[02:32:23] Thanks :)
[02:33:20] legoktm: yeah, it's tempting to just add a set_error_handler to MediaWiki that looks for built-in E_NOTICE and just turn them into hard exceptions
[02:33:47] like PHPUnit does :)
[02:34:34] Hmm I wouldn't be against it
[02:37:37] The only reason I wouldn't propose it is because 1) error handling is already ugly and confusing enough as it is, and 2) as much as I like Python, changing low-level php behaviour would be bad for the learning curve and interoperability.
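
The severity mapping debated in this exchange (E_WARNING logged as ERROR, wfWarn's E_USER_NOTICE kept as a warning) can be sketched as a small lookup table. This is a minimal illustration in Python, not the code under review in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/455229/. The E_* numbers are PHP's documented constant values; the E_NOTICE entry, the ERROR default, the strict flag, and the names SEVERITY_BY_PHP_LEVEL, log_level_for and handle are assumptions made for the example.

```python
import logging

# PHP's built-in error constants (documented values).
E_WARNING = 2
E_NOTICE = 8
E_USER_NOTICE = 1024

# Assumed mapping, for illustration only; the real patch may differ.
SEVERITY_BY_PHP_LEVEL = {
    E_WARNING: logging.ERROR,         # e.g. file_get_contents() on a missing file
    E_NOTICE: logging.ERROR,          # assumption: built-in notices treated like warnings
    E_USER_NOTICE: logging.WARNING,   # wfWarn() ends in trigger_error(E_USER_NOTICE)
}


def log_level_for(php_level: int) -> int:
    """Return the logging severity for a PHP error level (hypothetical helper)."""
    # Assumption: anything not listed is treated as an error.
    return SEVERITY_BY_PHP_LEVEL.get(php_level, logging.ERROR)


def handle(php_level: int, message: str, strict: bool = False) -> int:
    """Log a PHP-style error; in strict mode, raise on built-in E_NOTICE instead."""
    if strict and php_level == E_NOTICE:
        # The "turn them into hard exceptions" idea floated in the chat.
        raise RuntimeError(message)
    level = log_level_for(php_level)
    logging.getLogger("php-error").log(level, message)
    return level
```

The strict flag stays off by default here, in line with the reservation voiced in the chat about changing low-level behaviour for everyone.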
[02:38:49] !log legoktm@deploy1001 Finished scap: EUCopyrightCampaign updates (duration: 62m 43s)
[02:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:14] Operations, Release-Engineering-Team, Scap: mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild - https://phabricator.wikimedia.org/T203625 (Legoktm)
[02:47:23] Krinkle: how long does the RL message cache take to pick up updated values?
[02:47:31] and is there a way I can manually purge it?
[02:47:36] !log legoktm@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/UniversalLanguageSelector/: ULS fixes for Accept-Language stuff (duration: 01m 44s)
[02:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:48:12] legoktm: localisation cache change requires scap
[02:48:17] which I assume was done?
[02:48:21] yeah, I just finished
[02:48:38] specifically 'eucc-email-newsletter-label' on fixcopyrightwiki in German
[02:48:51] https://fixcopyright.wikimedia.org/wiki/MediaWiki:Eucc-email-newsletter-label/de is correct
[02:49:31] mw.messages.get('eucc-email-newsletter-label'); is in English, even with uselang=de (and most other messages are in German)
[02:49:36] LC::recache() calls MessageBlobStore::clear already
[02:49:55] so it should just ride along startup cycle the same as a change to a JS file would
[02:50:24] if you use mwdebug1002 and disable cache in browser and hard refresh (in particular so that modules=startup gets a fresh response), does it work?
[02:50:52] that should compute a version hash that's different and then naturally fetch the module with a different hash etc.
[02:51:26] still English
[02:51:56] oh page edits for NS_MW don't need scap
[02:52:01] those purge the keys directly
[02:52:09] on edit hook within core
[02:52:18] for messagecache (!= localisationcache)
[02:52:27] ohhhhh
[02:52:28] wait
[02:52:32] this isn't like a content message right?
[02:52:42] err
[02:52:50] it's not a NS_MW page
[02:52:53] it's just a normal message
[02:53:01] nothing exists on-wiki
[02:53:14] I just linked the /de subpage to show that the localisation cache had successfully updated
[02:53:21] right
[02:54:22] https://fixcopyright.wikimedia.org/w/load.php?debug=false&lang=de&modules=ext.3d%2CeuCopyrightCampaign&skin=eucopyrightcampaign&version=bogus! has the wrong message
[02:54:47] check wfMessage() on eval?
[02:55:10] >>> wfMessage('eucc-email-newsletter-label')->inLanguage('de')->text();
[02:55:10] => "Ich möchte Updates zu Möglichkeiten zur Unterstützung von Wikimedia per E-Mail erhalten. (Sie können sich jederzeit abmelden. Diese E-Mail-Liste wird von $1 betrieben.)"
[02:55:32] the only way that can be wrong (assuming bogus isn't a varnish hit) is if localisation cache itself is outdated ... ^.. or if the recache didn't purge things when it should've
[02:55:34] OK
[02:55:58] run $blobStore = new MessageBlobStore();
[02:55:58] $blobStore->clear();
[02:56:02] for copyrightwiki
[02:56:22] done
[02:56:23] >>> $blobStore->clear();
[02:56:23] => null
[02:56:24] those are the last two lines of LC::recache(), so scap should've triggered that
[02:57:02] still german for me
[02:57:04] https://fixcopyright.wikimedia.org/w/load.php?debug=false&lang=de&modules=ext.3d%2CeuCopyrightCampaign&skin=eucopyrightcampaign&version=bogus!sdkfhdjkfh2394734 is still wrong
[02:57:06] I mean English
[02:57:11] Yeah, so not RL related I guess.
[02:57:32] wtf
[02:57:38] I don't understand either.
[02:57:52] MessageBlobStore just calls wfMessage->inLanguage->plain() after a purge
[02:58:29] Memcached error for key "/*/mw-wan/WANCache:t:fixcopyrightwiki:MessageBlobStore" on server "127.0.0.1:11213": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
[02:58:33] well, that might be a problem
[02:58:41] uhhhh
[02:58:48] Memcached error for key "WANCache:v:global:lag-times:1:db1075:0-1-2-3" on server "127.0.0.1:11213": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
[02:58:53] it can't access anything on memc
[02:59:09] fantastic
[02:59:19] but wouldn't that also mean it can't read out of cache?
[02:59:28] Maybe.
[02:59:42] It has a stash for stale values in some cases
[02:59:52] like when stuff is exploding
[03:00:21] it's also possible that...
[03:00:22] wait
[03:00:26] I think I understand.
[03:00:34] it might be working from web servers
[03:00:40] but broken from deploy1001
[03:00:43] so it couldn't purge
[03:00:52] it looks like mcrouter isn't set up on deploy1001 according to these errors
[03:01:25] from where did you clear()?
[03:01:41] try on mwmaint or mwdebug
[03:01:53] deploy1001
[03:03:38] I guess the old > $wgDebugLogFile='php://stdout'; hack doesn't work anymore
[03:03:42] I just cleared from mwmaint
[03:03:55] german now
[03:04:00] yay
[03:04:21] So, memcached from mw stuff running on deploy1001 is borked
[03:04:44] someone needs to install mcrouter there
[03:05:45] Operations: deploy1001 can't talk to memcached, breaking invalidation of RL localization cache - https://phabricator.wikimedia.org/T203626 (Legoktm) p:Triage>Unbreak!
[03:05:58] Krinkle: I just copy/pasted from IRC into ^
[03:06:21] ok, now we're just waiting for startup module expiry?
[03:06:22] thx
[03:06:27] yeah
[03:06:39] pipe to purge.php if you're impatient :)
[03:06:57] or just if you want to get it done (impatient sounds negative)
[03:07:06] you have no idea lol
[03:07:10] https://fixcopyright.wikimedia.org/w/load.php?debug=false&lang=de&modules=startup&only=scripts&skin=eucopyrightcampaign right?
[03:07:37] that worked
[03:08:27] k now let's do the cool fancy URL
[03:11:43] test
[03:12:10] CindyCicaleseWMF: got it :)
[03:12:23] cool - that was weird
[03:13:08] lol - and I thought you both saw my comments as I followed along and told you I was still seeing English instead of German ;-)
[03:13:31] :|
[03:13:36] it was weird
[03:13:44] ok, but do the german messages look good to you now?
[03:14:21] (CR) Legoktm: [C: 2] Have canonical Main Page URL be the domain root for fixcopyrightwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/458345 (owner: Legoktm)
[03:15:41] (Merged) jenkins-bot: Have canonical Main Page URL be the domain root for fixcopyrightwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/458345 (owner: Legoktm)
[03:17:34] !log legoktm@deploy1001 Synchronized wmf-config/CommonSettings.php: Have canonical Main Page URL be the domain root for fixcopyrightwiki (duration: 00m 57s)
[03:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:20:13] I think that's itttttt
[03:21:15] * legoktm hugs Krinkle
[03:21:24] Krinkle: ty for all the help
[03:21:50] yw!
[03:25:11] (CR) jenkins-bot: Have canonical Main Page URL be the domain root for fixcopyrightwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/458345 (owner: Legoktm)
[03:39:52] Operations, Traffic, MW-1.32-release-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (Legoktm) After those ULS patches, the current status is that MW is setting Vary: Accept-Language un...
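
The failure mode diagnosed in this session (clear() run from deploy1001 returned normally but changed nothing, while the same call from mwmaint fixed the German message) can be sketched with a toy model. This is only an illustration in Python under stated assumptions: SharedCache, Host, purge and get_blob are hypothetical names, the dictionary stands in for memcached, and nothing here reproduces MessageBlobStore or mcrouter themselves.

```python
class SharedCache:
    """In-memory stand-in for the shared cache (not a real memcached client)."""

    def __init__(self):
        self.data = {}


class Host:
    """A host that may or may not be able to reach the shared cache."""

    def __init__(self, name, cache, can_reach_cache):
        self.name = name
        self._cache = cache
        self._reachable = can_reach_cache

    def purge(self, key):
        """Try to delete a key; return True only if the delete reached the cache."""
        if not self._reachable:
            # Assumed to mirror the logged "SERVER HAS FAILED" case:
            # the error is swallowed and the caller carries on.
            return False
        self._cache.data.pop(key, None)
        return True

    def get_blob(self, key, rebuild):
        """Serve the cached value if present, otherwise rebuild and store it."""
        if self._reachable and key in self._cache.data:
            return self._cache.data[key]
        value = rebuild()
        if self._reachable:
            self._cache.data[key] = value
        return value


if __name__ == "__main__":
    cache = SharedCache()
    cache.data["blob:de"] = "English (stale)"
    deploy = Host("deploy", cache, can_reach_cache=False)
    web = Host("web", cache, can_reach_cache=True)
    maint = Host("maint", cache, can_reach_cache=True)

    deploy.purge("blob:de")                           # silently does nothing
    print(web.get_blob("blob:de", lambda: "Deutsch"))  # still the stale value
    maint.purge("blob:de")                            # reaches the cache
    print(web.get_blob("blob:de", lambda: "Deutsch"))  # rebuilt value
```

In this model the purge reports whether it reached the cache, which is one way such a silent failure could be made visible to whoever triggers it.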
[04:06:24] !log tstarling@deploy1001 Synchronized php-1.32.0-wmf.19/extensions/Quiz/Question.php: (no justification provided) (duration: 00m 57s)
[04:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:08:14] !log tstarling@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/Quiz/Question.php: T203628 (duration: 00m 56s)
[04:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:08:20] T203628: Infinite loop in quiz shuffleAnswers - https://phabricator.wikimedia.org/T203628
[04:38:27] !log on mwdebug1001 restarting hhvm after probably breaking it by trying to attach with gdb
[04:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:47:10] (CR) Legoktm: [C: 1] "My comments are all nitpicks." (4 comments) [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[04:58:36] (CR) Nemo bis: "Yes, GPL is fine. Just yesterday we were discussing about the need to add GPL in some more places." [mediawiki-config] - https://gerrit.wikimedia.org/r/458345 (owner: Legoktm)
[05:05:34] Operations, Continuous-Integration-Infrastructure, Mail, Release-Engineering-Team, and 2 others: Ensure Jenkins mail configuration supports outbound smtp server failover - https://phabricator.wikimedia.org/T203607 (Legoktm)
[05:10:18] Operations, Continuous-Integration-Infrastructure, Mail, Release-Engineering-Team, and 2 others: Ensure Jenkins mail configuration supports outbound smtp server failover - https://phabricator.wikimedia.org/T203607 (Legoktm) > What does the current outbound smtp config look like in Jenkins? I bel...
[05:15:55] Operations, ops-codfw, DBA: Degraded RAID on db2053 - https://phabricator.wikimedia.org/T203623 (Marostegui) p:Triage>Normal a:Papaul Can we get a new disk here @papaul? Thanks!
[05:25:03] (CR) Legoktm: [C: 1] "Should we schedule this in a puppet swat?"
[puppet] - https://gerrit.wikimedia.org/r/439483 (https://phabricator.wikimedia.org/T196835) (owner: Paladox)
[05:47:09] (PS1) Marostegui: site.pp: Clarify db2093 current usage status [puppet] - https://gerrit.wikimedia.org/r/458353
[05:47:52] (CR) Jcrespo: "Luca, Andrew O.?" [puppet] - https://gerrit.wikimedia.org/r/454291 (https://phabricator.wikimedia.org/T134476) (owner: Jcrespo)
[05:48:17] (CR) Marostegui: [C: 2] site.pp: Clarify db2093 current usage status [puppet] - https://gerrit.wikimedia.org/r/458353 (owner: Marostegui)
[05:50:00] (CR) Jcrespo: [C: 1] quarry::database: Use mariadb instead of mysql module [puppet] - https://gerrit.wikimedia.org/r/454481 (https://phabricator.wikimedia.org/T181205) (owner: Zhuyifei1999)
[05:56:43] (PS5) Jcrespo: mariadb: Fix DB configuration in preparation for dc switchover [mediawiki-config] - https://gerrit.wikimedia.org/r/457847 (https://phabricator.wikimedia.org/T189107)
[05:57:54] (CR) jerkins-bot: [V: -1] mariadb: Fix DB configuration in preparation for dc switchover [mediawiki-config] - https://gerrit.wikimedia.org/r/457847 (https://phabricator.wikimedia.org/T189107) (owner: Jcrespo)
[06:00:56] (PS6) Jcrespo: mariadb: Fix DB configuration in preparation for dc switchover [mediawiki-config] - https://gerrit.wikimedia.org/r/457847 (https://phabricator.wikimedia.org/T189107)
[06:02:00] (PS3) Giuseppe Lavagetto: role::deployment_server: add mcrouter [puppet] - https://gerrit.wikimedia.org/r/436760
[06:04:25] (PS4) Giuseppe Lavagetto: role::deployment_server: add mcrouter [puppet] - https://gerrit.wikimedia.org/r/436760 (https://phabricator.wikimedia.org/T203626)
[06:04:53] Operations, Patch-For-Review: deploy1001 can't talk to memcached, breaking invalidation of RL localization cache - https://phabricator.wikimedia.org/T203626 (Joe) a:Joe
[06:11:12] (CR) Giuseppe Lavagetto: [C: 2] role::deployment_server: add mcrouter [puppet] -
https://gerrit.wikimedia.org/r/436760 (https://phabricator.wikimedia.org/T203626) (owner: Giuseppe Lavagetto)
[06:11:41] (CR) Jcrespo: [C: 1] m5 grants: replace designate password hash with a private lookup [puppet] - https://gerrit.wikimedia.org/r/458284 (owner: Andrew Bogott)
[06:20:10] (CR) Jcrespo: [C: 1] sre.switchdc.mediawiki: phase 2 add sleep [cookbooks] - https://gerrit.wikimedia.org/r/458221 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[06:21:05] (CR) Volans: [C: 1] "Although this is duplicate code I agree with the patch as it's temporary in the sense that those scripts will be migrated to cookbooks and" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/458325 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite)
[06:21:54] Operations, Patch-For-Review: deploy1001 can't talk to memcached, breaking invalidation of RL localization cache - https://phabricator.wikimedia.org/T203626 (Joe) Clearly I just forgot to merge a change at the time of the mcrouter rollout, sorry about that.
[06:22:16] (CR) Jcrespo: "I support strongly this idea, but I am not the right person to review it." [puppet] - https://gerrit.wikimedia.org/r/413745 (https://phabricator.wikimedia.org/T157133) (owner: Andrew Bogott)
[06:26:47] (PS7) Elukey: piwik: add support for stretch/PHP7 [puppet] - https://gerrit.wikimedia.org/r/453553 (owner: Dzahn)
[06:28:48] (CR) Elukey: [C: 2] "Looks good!
https://puppet-compiler.wmflabs.org/compiler1002/12376/bohrium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/453553 (owner: Dzahn)
[06:35:57] (CR) Volans: [C: -1] "reply inline" (1 comment) [software/keyholder] - https://gerrit.wikimedia.org/r/458223 (owner: Faidon Liambotis)
[06:38:56] (CR) Volans: [C: 2] sre.switchdc.mediawiki: phase 2 add sleep [cookbooks] - https://gerrit.wikimedia.org/r/458221 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[06:39:49] (Merged) jenkins-bot: sre.switchdc.mediawiki: phase 2 add sleep [cookbooks] - https://gerrit.wikimedia.org/r/458221 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[06:47:36] (PS1) Giuseppe Lavagetto: monitoring: add fact to the spec [puppet] - https://gerrit.wikimedia.org/r/458356
[06:49:07] (CR) Giuseppe Lavagetto: [C: 2] monitoring: add fact to the spec [puppet] - https://gerrit.wikimedia.org/r/458356 (owner: Giuseppe Lavagetto)
[06:49:14] (PS2) Muehlenhoff: Decommission mw2213 [puppet] - https://gerrit.wikimedia.org/r/458139 (https://phabricator.wikimedia.org/T203434)
[06:50:52] (PS1) Volans: Tests: improve naming for SSH key file [software/cumin] - https://gerrit.wikimedia.org/r/458357
[06:50:54] (PS1) Volans: Documentation: fix typo [software/cumin] - https://gerrit.wikimedia.org/r/458358
[06:52:17] (PS5) Giuseppe Lavagetto: conftool: add class for writing to state to file [puppet] - https://gerrit.wikimedia.org/r/457490
[06:55:57] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/12377/ seems to DTRT. merging."
[puppet] - https://gerrit.wikimedia.org/r/457490 (owner: Giuseppe Lavagetto)
[06:57:04] (PS3) Muehlenhoff: Decommission mw2213 [puppet] - https://gerrit.wikimedia.org/r/458139 (https://phabricator.wikimedia.org/T203434)
[06:58:47] (CR) Elukey: [C: 1] "Checked bohrium (piwik), analytics1003 (various hadoop services), db110[78] and dbstore1002 (even though I am 99.9% sure that those were a" [puppet] - https://gerrit.wikimedia.org/r/454291 (https://phabricator.wikimedia.org/T134476) (owner: Jcrespo)
[07:01:12] (CR) Jcrespo: "I will deploy this soon, as I and other people confirmed no mysql installation has non-system mysql users. Last chance to block the deploy" [puppet] - https://gerrit.wikimedia.org/r/454291 (https://phabricator.wikimedia.org/T134476) (owner: Jcrespo)
[07:01:42] (PS4) Muehlenhoff: Decommission mw2213 [puppet] - https://gerrit.wikimedia.org/r/458139 (https://phabricator.wikimedia.org/T203434)
[07:02:36] (CR) Volans: conftool: add class for writing to state to file (2 comments) [puppet] - https://gerrit.wikimedia.org/r/457490 (owner: Giuseppe Lavagetto)
[07:02:38] (CR) Muehlenhoff: [C: 2] Decommission mw2213 [puppet] - https://gerrit.wikimedia.org/r/458139 (https://phabricator.wikimedia.org/T203434) (owner: Muehlenhoff)
[07:03:02] (CR) Jcrespo: [C: 1] mysql user: Remove exception for mysql user being removed [puppet] - https://gerrit.wikimedia.org/r/454291 (https://phabricator.wikimedia.org/T134476) (owner: Jcrespo)
[07:05:28] (PS1) Alexandros Kosiaris: Introduce orespoolcounter[12]00[12] [puppet] - https://gerrit.wikimedia.org/r/458454 (https://phabricator.wikimedia.org/T203465)
[07:06:02] (CR) jerkins-bot: [V: -1] Introduce orespoolcounter[12]00[12] [puppet] - https://gerrit.wikimedia.org/r/458454 (https://phabricator.wikimedia.org/T203465) (owner: Alexandros Kosiaris)
[07:07:50] (PS1) Giuseppe Lavagetto: profile::conftool::state: fix template [puppet] -
https://gerrit.wikimedia.org/r/458455
[07:08:44] (CR) Giuseppe Lavagetto: [C: 2] profile::conftool::state: fix template [puppet] - https://gerrit.wikimedia.org/r/458455 (owner: Giuseppe Lavagetto)
[07:10:30] !log run decomission_appserver on mw2213 (T203434)
[07:10:30] (PS2) Alexandros Kosiaris: Introduce orespoolcounter[12]00[12] [puppet] - https://gerrit.wikimedia.org/r/458454 (https://phabricator.wikimedia.org/T203465)
[07:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:36] T203434: Decom mw2213 - https://phabricator.wikimedia.org/T203434
[07:11:18] (CR) jerkins-bot: [V: -1] Introduce orespoolcounter[12]00[12] [puppet] - https://gerrit.wikimedia.org/r/458454 (https://phabricator.wikimedia.org/T203465) (owner: Alexandros Kosiaris)
[07:14:45] (PS3) Alexandros Kosiaris: Introduce orespoolcounter[12]00[12] [puppet] - https://gerrit.wikimedia.org/r/458454 (https://phabricator.wikimedia.org/T203465)
[07:14:58] (PS3) Giuseppe Lavagetto: realm.pp: drop mw_primary [puppet] - https://gerrit.wikimedia.org/r/457491
[07:15:48] (PS3) Giuseppe Lavagetto: profile::mediawiki::maintenance: depend on mediawiki config, not hiera [puppet] - https://gerrit.wikimedia.org/r/457492
[07:17:26] Operations, RESTBase, Availability, Performance, and 5 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204 (tstarling)
[07:17:29] Operations, MediaWiki-API, Parsoid, RESTBase, and 6 others: HHVM request timeouts not working; support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192 (tstarling) Resolved>Open It's not fixed, or has regressed. I noticed this today due to T203628 an...
[07:17:35] (CR) Gehel: [C: 1] "LGTM (see minor proposition inline)" (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/458357 (owner: Volans)
[07:17:43] (CR) Alexandros Kosiaris: [C: 2] Introduce orespoolcounter[12]00[12] [puppet] - https://gerrit.wikimedia.org/r/458454 (https://phabricator.wikimedia.org/T203465) (owner: Alexandros Kosiaris)
[07:18:08] (CR) Gehel: [C: 2] "trivial enough" [software/cumin] - https://gerrit.wikimedia.org/r/458358 (owner: Volans)
[07:18:56] Operations, ops-codfw, decommission, Patch-For-Review: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (MoritzMuehlenhoff)
[07:19:17] Operations, ops-codfw, decommission, Patch-For-Review: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (MoritzMuehlenhoff) p:Triage>Normal a:MoritzMuehlenhoff>None
[07:24:48] <_joe_> !log rolling restart of eqiad HHVM appservers
[07:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:08] Operations, Traffic, MW-1.32-release-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (Nikerabbit) > The language selectors are generating URLs with ?uselang=XX Why are you not using (j...
[07:44:08] Operations, SRE-Access-Requests, Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (MoritzMuehlenhoff)
[07:44:20] Operations, SRE-Access-Requests, Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (MoritzMuehlenhoff) Added to pwstore.
[07:44:25] (CR) Marostegui: [C: 1] mariadb: Fix DB configuration in preparation for dc switchover [mediawiki-config] - https://gerrit.wikimedia.org/r/457847 (https://phabricator.wikimedia.org/T189107) (owner: Jcrespo)
[07:44:36] Operations, Patch-For-Review: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (MoritzMuehlenhoff) Added to pwstore.
[07:44:47] Operations, Patch-For-Review: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (MoritzMuehlenhoff)
[07:45:20] Operations, cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (MoritzMuehlenhoff) Added to pwstore.
[07:45:32] Operations, cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (MoritzMuehlenhoff)
[07:47:58] (PS1) Giuseppe Lavagetto: puppet_compiler: fix typo in manifest [puppet] - https://gerrit.wikimedia.org/r/458456
[07:49:38] (CR) Giuseppe Lavagetto: [C: 2] puppet_compiler: fix typo in manifest [puppet] - https://gerrit.wikimedia.org/r/458456 (owner: Giuseppe Lavagetto)
[07:52:21] Operations, ops-eqiad, Discovery, Wikidata, and 2 others: add SSDs to wdqs1003 - https://phabricator.wikimedia.org/T202780 (Gehel) New SSD in place, server reimaged and data reimported. We're all good!
[07:52:57] Operations, Discovery, Wikidata, Wikidata-Query-Service: WDQS diskspace is low - https://phabricator.wikimedia.org/T196485 (Gehel) Open>Resolved a:Gehel New SSD in place, server reimaged and data reimported. We're all good!
[07:53:15] Operations, ops-codfw, Discovery, Wikidata, and 2 others: add ssds to wdqs2003 - https://phabricator.wikimedia.org/T202778 (Gehel) New SSD in place, server reimaged and data reimported. We're all good!
[07:53:37] Operations, ops-codfw, Discovery, Wikidata, and 2 others: add SSDs to wdqs200[12] - https://phabricator.wikimedia.org/T202777 (Gehel) New SSD in place, server reimaged and data reimported. We're all good!
[07:53:50] Operations, ops-eqiad, Discovery, Wikidata, and 2 others: add SSDs to wdqs100[45] - https://phabricator.wikimedia.org/T202779 (Gehel) New SSD in place, server reimaged and data reimported. We're all good!
[07:54:15] !log Upgraded packages on contint1001 and contint2001
[07:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:44] (CR) Jcrespo: [C: 2] mariadb: Fix DB configuration in preparation for dc switchover [mediawiki-config] - https://gerrit.wikimedia.org/r/457847 (https://phabricator.wikimedia.org/T189107) (owner: Jcrespo)
[07:56:02] (Merged) jenkins-bot: mariadb: Fix DB configuration in preparation for dc switchover [mediawiki-config] - https://gerrit.wikimedia.org/r/457847 (https://phabricator.wikimedia.org/T189107) (owner: Jcrespo)
[07:57:11] (Abandoned) Giuseppe Lavagetto: mediawiki: add mediawiki_active_dc function [puppet] - https://gerrit.wikimedia.org/r/345531 (owner: Giuseppe Lavagetto)
[07:58:22] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Fix DB configuration in preparation for dc switchover (duration: 00m 57s)
[07:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:21] !log rebooting contint1001 for kernel security update
[08:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:28] so long wikibugs
[08:06:16] !log Enable replication codfw -> eqiad on s5,s6,s2 - T189107
[08:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:21] T189107: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107
[08:06:54] !log repair sde1 on ms-be2042 - T199198
[08:06:59] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:00] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [08:07:09] ugh, icinga isn't here, known? [08:08:00] I'll take a look at ircecho [08:09:01] Sep 04 19:59:26 einsteinium ircecho[41929]: Starting notifier loop [08:09:01] Sep 04 23:34:05 einsteinium ircecho[41929]: Error writing: Not connected.Dropping this message: "PROBLEM - Disk space on notebook1003 is CRITICAL: Return code of 255 is out of bounds" [08:09:04] !log rebooting contint2001 for kernel security update [08:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:13] and then a series of those "error writing" [08:09:30] !log bounce ircecho on einsteinium, stuck and not on irc [08:09:31] (03PS1) 10Giuseppe Lavagetto: Fix condition for using nutcracker instead of mcrouter on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458457 [08:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:36] <_joe_> jynus: ^^ [08:10:05] _joe_: one sec, in the middle of other db maintenance [08:10:20] <_joe_> yeah take your time [08:10:25] <_joe_> not urgent at all :) [08:11:33] (03CR) 10Hashar: "We have restarted the CI stack and the patch has been dropped from the CI queue. 
You can drop your Code-Review+2 and vote again to get th" [software/cumin] - 10https://gerrit.wikimedia.org/r/458358 (owner: 10Volans) [08:11:49] (03CR) 10jerkins-bot: [V: 04-1] Fix condition for using nutcracker instead of mcrouter on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458457 (owner: 10Giuseppe Lavagetto) [08:11:59] gehel: you can CR+2 again the cumin patch https://gerrit.wikimedia.org/r/c/operations/software/cumin/+/458358 CI got restarted and the event has been lost [08:12:37] hashar: it is actually waiting for the parent to be ready [08:12:40] hashar: ack, thanks, no hurry at all (it's chained to another one) [08:14:47] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10Fsalutari) [08:14:50] (03CR) 10Vgutierrez: Add make_account CLI script (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/457933 (owner: 10Alex Monk) [08:14:52] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Flavia Salutari - https://phabricator.wikimedia.org/T201199 (10Fsalutari) 05declined>03Open [08:17:07] !log Enable replication codfw -> eqiad on s1,s3,s4 - T189107 [08:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:13] T189107: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 [08:17:38] (03CR) 10Vgutierrez: "> good, though I'm providing a config.example.yaml in the package" [software/certcentral] - 10https://gerrit.wikimedia.org/r/457485 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [08:18:03] _joe_: did you check why the -1? 
[08:18:26] also add the bug number: [08:19:01] T203479 (forgot the please) [08:19:02] T203479: labtestweb2001: Memcached error for key on server "127.0.0.1:11213": SERVER HAS FAILED - https://phabricator.wikimedia.org/T203479 [08:19:09] <_joe_> jynus: thanks, yes I'll add it [08:19:33] <_joe_> jynus: and no I didn't check the -1, I was in a query conversation [08:20:13] 10Operations, 10Patch-For-Review: deploy1001 can't talk to memcached, breaking invalidation of RL localization cache - https://phabricator.wikimedia.org/T203626 (10Peachey88) [08:22:50] (03PS1) 10Filippo Giunchedi: ircecho: don't SASL if not provided with a password [puppet] - 10https://gerrit.wikimedia.org/r/458459 [08:23:15] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10Rossi.dario.g) [08:23:23] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: analytics-privatedata-users access for Dario Rossi (username drossi) - https://phabricator.wikimedia.org/T201196 (10Rossi.dario.g) 05declined>03Open [08:24:29] someone available to rubberstamp https://gerrit.wikimedia.org/r/c/operations/puppet/+/458459 ? 
fixes ircecho on einsteinium [08:24:36] looking [08:24:51] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 73390 bytes in 0.821 second response time [08:25:07] !log Enable replication codfw -> eqiad on s7,s8,x1 - T189107 [08:25:10] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.065 second response time [08:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:13] T189107: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 [08:25:26] thanks [08:25:29] (03CR) 10Muehlenhoff: [C: 031] ircecho: don't SASL if not provided with a password [puppet] - 10https://gerrit.wikimedia.org/r/458459 (owner: 10Filippo Giunchedi) [08:25:41] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.071 second response time [08:26:20] 10Operations, 10SRE-Access-Requests: NDA access for Telecom Paristech Research Team - https://phabricator.wikimedia.org/T200800 (10Rossi.dario.g) Hi all, back from holidays, was finally able to create my wikitech account. I guess everything is ready now so I re-opened task T201196 -- thanks in advance for... [08:26:48] (03CR) 10Filippo Giunchedi: [C: 032] ircecho: don't SASL if not provided with a password [puppet] - 10https://gerrit.wikimedia.org/r/458459 (owner: 10Filippo Giunchedi) [08:30:04] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Flavia Salutari - https://phabricator.wikimedia.org/T201199 (10Fsalutari) Dear all, Sorry for the delay. I've signed the L3 agreement document too, and these are my user information: wikitech username = "Fsalutari" preferred shell use... [08:31:23] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12379/ looks good, merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/457492 (owner: 10Giuseppe Lavagetto) [08:31:30] (03PS4) 10Giuseppe Lavagetto: profile::mediawiki::maintenance: depend on mediawiki config, not hiera [puppet] - 10https://gerrit.wikimedia.org/r/457492 [08:31:59] icinga-wm: welcome back [08:32:05] !log Enable replication codfw -> eqiad on es2,es3 - T189107 [08:32:08] (03PS3) 10Vgutierrez: Rename certcentral_api to just api [software/certcentral] - 10https://gerrit.wikimedia.org/r/457378 (https://phabricator.wikimedia.org/T199711) [08:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:13] T189107: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 [08:32:30] thanks godog ! [08:35:24] PROBLEM - MariaDB Slave SQL: x1 on db1120 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table heartbeat.heartbeat: Duplicate entry 180355159 for key PRIMARY, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1069-bin.000138, end_log_pos 1001544944 [08:35:38] ^ checking [08:36:10] elukey: no problem [08:36:37] let's depool db1120 [08:36:43] yep [08:36:49] (03PS3) 10Vgutierrez: README: provide configuration file examples [software/certcentral] - 10https://gerrit.wikimedia.org/r/457485 (https://phabricator.wikimedia.org/T199711) [08:36:58] you doing that? 
[08:37:05] (03CR) 10Filippo Giunchedi: mtail: add exim tls ciphersuite metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458289 (https://phabricator.wikimedia.org/T203260) (owner: 10Herron) [08:37:07] yes [08:37:11] ok [08:37:14] check x1-codfw-master is in ro [08:37:41] yeah [08:37:43] i checked it [08:37:43] it is [08:38:10] (03PS1) 10Jcrespo: mariadb: Depool db1120 from x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458460 [08:38:11] I think I know why this happened, I will explain once it is depooled [08:38:23] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1120 from x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458460 (owner: 10Jcrespo) [08:38:32] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Depool db1120 from x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458460 (owner: 10Jcrespo) [08:39:50] (03CR) 10Vgutierrez: Add make_account CLI script (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/457933 (owner: 10Alex Monk) [08:40:05] I have fixed the issue [08:40:15] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1120 (duration: 00m 57s) [08:40:19] <_joe_> marostegui: ok so, what happened? 
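A minimal sketch of the failure mode behind the 08:35:24 alert, which the channel goes on to explain. It rests on loud assumptions: Python's sqlite3 stands in for MariaDB, and the heartbeat table layout and the timestamp values are hypothetical; only the key value 180355159 comes from the alert text. It illustrates why a REPLACE re-run as a statement tolerates a row that already exists, while the same change shipped as a row-level write event (modelled here as a plain INSERT) hits a duplicate-key error when the replica already holds that primary key.

```python
import sqlite3


def make_replica(preexisting_row: bool) -> sqlite3.Connection:
    """Build an in-memory stand-in for a replica's heartbeat table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE heartbeat (server_id INTEGER PRIMARY KEY, ts TEXT)")
    if preexisting_row:
        # Row left over from the host's earlier life under another master.
        conn.execute("INSERT INTO heartbeat VALUES (?, ?)", (180355159, "old"))
    return conn


def apply_statement(conn: sqlite3.Connection, server_id: int, ts: str) -> str:
    """Statement-style apply: the replica re-runs the REPLACE itself."""
    conn.execute("REPLACE INTO heartbeat VALUES (?, ?)", (server_id, ts))
    return "ok"


def apply_row_event(conn: sqlite3.Connection, server_id: int, ts: str) -> str:
    """Row-style apply: the source had no such row, so the change arrives as an insert."""
    try:
        conn.execute("INSERT INTO heartbeat VALUES (?, ?)", (server_id, ts))
        return "ok"
    except sqlite3.IntegrityError:
        # Analogous to the 1062 duplicate-entry error reported in the alert.
        return "duplicate key"


if __name__ == "__main__":
    print("statement, row present:", apply_statement(make_replica(True), 180355159, "new"))
    print("row event, row absent: ", apply_row_event(make_replica(False), 180355159, "new"))
    print("row event, row present:", apply_row_event(make_replica(True), 180355159, "new"))
```

Only the third case fails, which is consistent with the "no data issue, just heartbeat" reading given in the channel: the stale row matters only because the row event carries an insert rather than the original REPLACE.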
[08:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:40] * _joe_ curious [08:40:55] RECOVERY - MariaDB Slave SQL: x1 on db1120 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:42:00] there were 300 exceptions or so [08:42:15] So the issue was with the heartbeat table, which had a duplicate entry for the server_id from codfw (db2034). It already had that row because db1120 was restored from backups (codfw), started replicating as a codfw slave and was later moved to eqiad, so the row already existed there [08:42:32] <_joe_> oh I see [08:42:34] oh, so no data issue [08:42:35] <_joe_> wow :P [08:42:42] only the way replication is checked [08:42:56] <_joe_> well, we had a duplicate entry [08:42:57] no, no data issue, just heartbeat [08:43:10] <_joe_> but it was really a duplicated entry :) [08:43:11] yeah, but we don't really consider heartbeat as data :p [08:43:19] but heartbeat is supposed to REPLACE [08:43:20] <_joe_> eheh fair enough [08:43:28] but because it was a replica [08:43:30] on row [08:43:35] <_joe_> I see [08:43:37] the replace gets converted to an insert [08:43:41] <_joe_> right [08:43:48] which leads to a duplicate key issue [08:44:02] so a quite rare case of problems [08:44:06] <_joe_> whoever told us, back in the day, that ROW based replication was better than STATEMENT was clearly lying [08:44:17] I told you that [08:44:41] <_joe_> jynus: oh I know the reasons why row-based replication is better [08:44:50] it makes heartbeat fail in the rare case [08:44:59] but it makes the data not fail :-) [08:45:23] (03CR) 10Mathew.onipe: "> Patch Set 20:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [08:46:03] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1120 from x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458462 [08:47:41] RECOVERY - Filesystem available is greater than filesystem size on ms-be2042 is OK: 
All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2042&var-datasource=codfw%2520prometheus%252Fops [08:48:18] (03CR) 10Marostegui: [C: 031] Revert "mariadb: Depool db1120 from x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458462 (owner: 10Jcrespo) [08:48:42] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) Replication codfw -> eqiad has been enabled on s1-s8,x1,es2,es3 [08:49:06] (03PS2) 10Giuseppe Lavagetto: Fix condition for using nutcracker instead of mcrouter on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458457 (https://phabricator.wikimedia.org/T203479) [08:51:24] (03PS1) 10Muehlenhoff: Disable fetching the netboot image via HTTP for cloudvirt1023 [puppet] - 10https://gerrit.wikimedia.org/r/458463 (https://phabricator.wikimedia.org/T199125) [08:52:00] (03CR) 10jerkins-bot: [V: 04-1] Disable fetching the netboot image via HTTP for cloudvirt1023 [puppet] - 10https://gerrit.wikimedia.org/r/458463 (https://phabricator.wikimedia.org/T199125) (owner: 10Muehlenhoff) [08:52:31] (03CR) 10jenkins-bot: mariadb: Depool db1120 from x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458460 (owner: 10Jcrespo) [08:54:57] (03CR) 10Marostegui: [C: 032] Revert "mariadb: Depool db1120 from x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458462 (owner: 10Jcrespo) [08:56:01] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1120 from x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458462 (owner: 10Jcrespo) [08:56:18] (03PS2) 10Muehlenhoff: Disable fetching the netboot image via HTTP for cloudvirt1023 [puppet] - 10https://gerrit.wikimedia.org/r/458463 (https://phabricator.wikimedia.org/T199125) [08:56:54] (03CR) 10jerkins-bot: [V: 04-1] Disable fetching the netboot image via HTTP for cloudvirt1023 [puppet] - 10https://gerrit.wikimedia.org/r/458463 
(https://phabricator.wikimedia.org/T199125) (owner: 10Muehlenhoff) [08:57:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1120 (duration: 00m 58s) [08:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:15] (03CR) 10Jcrespo: [C: 031] "Let's deploy now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458457 (https://phabricator.wikimedia.org/T203479) (owner: 10Giuseppe Lavagetto) [08:59:17] (03PS36) 10Gehel: Convert elasticsearch to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [08:59:19] (03PS1) 10Gehel: elasticsearch: disable the default elasticsearch unit [puppet] - 10https://gerrit.wikimedia.org/r/458464 (https://phabricator.wikimedia.org/T198351) [09:00:40] (03PS3) 10Muehlenhoff: Disable fetching the netboot image via HTTP for cloudvirt1023 [puppet] - 10https://gerrit.wikimedia.org/r/458463 (https://phabricator.wikimedia.org/T199125) [09:01:17] (03CR) 10jerkins-bot: [V: 04-1] Disable fetching the netboot image via HTTP for cloudvirt1023 [puppet] - 10https://gerrit.wikimedia.org/r/458463 (https://phabricator.wikimedia.org/T199125) (owner: 10Muehlenhoff) [09:07:10] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix condition for using nutcracker instead of mcrouter on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458457 (https://phabricator.wikimedia.org/T203479) (owner: 10Giuseppe Lavagetto) [09:07:31] (03PS2) 10Ema: ATS: ship service file as a systemd override [puppet] - 10https://gerrit.wikimedia.org/r/458201 (https://phabricator.wikimedia.org/T200178) [09:08:12] (03Merged) 10jenkins-bot: Fix condition for using nutcracker instead of mcrouter on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458457 (https://phabricator.wikimedia.org/T203479) (owner: 10Giuseppe Lavagetto) [09:08:16] (03CR) 10Ema: [C: 032] ATS: ship service file as a systemd override [puppet] - 
10https://gerrit.wikimedia.org/r/458201 (https://phabricator.wikimedia.org/T200178) (owner: 10Ema) [09:08:17] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1120 from x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458462 (owner: 10Jcrespo) [09:08:24] (03CR) 10jenkins-bot: Fix condition for using nutcracker instead of mcrouter on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458457 (https://phabricator.wikimedia.org/T203479) (owner: 10Giuseppe Lavagetto) [09:09:04] (03CR) 10ArielGlenn: "> Should those be *XML* dumps configs or just dumps configs, which" [puppet] - 10https://gerrit.wikimedia.org/r/456439 (owner: 10Smalyshev) [09:10:23] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10ArielGlenn) May we close this now, or is there something not listed that is yet to be done? [09:11:15] !log oblivian@deploy1001 Synchronized wmf-config/mc.php: Fixing memcached configuration for labstestwiki T203479 (duration: 00m 56s) [09:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:21] T203479: labtestweb2001: Memcached error for key on server "127.0.0.1:11213": SERVER HAS FAILED - https://phabricator.wikimedia.org/T203479 [09:12:56] I am checking the logs to search for that error [09:13:15] it is not on fatalmonitor [09:13:20] maybe on other channels [09:15:59] <_joe_> it was on other channels [09:16:06] https://logstash.wikimedia.org/goto/51365182a025c5701785d3644220f744 [09:16:11] <_joe_> by i verified from cli, seleccting labtestwiki [09:16:11] ^I guess that's it [09:17:22] should we go for the mw_primary one? [09:17:33] <_joe_> in a few, sorry [09:17:34] or are there issues? 
[09:17:35] ok [09:17:45] <_joe_> if you can build a list of hosts to run the compiler on [09:17:57] <_joe_> that would help my confidence in what I did [09:18:43] a single one is enough - all core hosts share the same code, and the others don't use it [09:18:53] db1067 is the enwiki master [09:18:59] <_joe_> ok [09:19:02] if it works there, it works everywhere [09:19:03] <_joe_> one in codfw too? [09:19:06] ok [09:19:08] db2070 [09:19:16] codfw replica [09:19:34] .eqiad and .codfw BTW, not wikimedia [09:19:39] volans: spicerack should now use the "backports" debian-glue job :) [09:20:01] hashar: yeah noticed you've merged the patch, thanks a lot [09:22:56] <_joe_> jynus: https://puppet-compiler.wmflabs.org/compiler1002/12380/ is kinda beautiful [09:23:17] <_joe_> we have to be careful to check what happens on icinga though [09:23:33] _joe_: hi, the spec for the puppet module "install_server" also fails with "Could not find the daemon directory (tested [/etc/sv,/var/lib/service])" [09:23:38] <_joe_> jynus: my plan for deployment would be - merge, run puppet on one core host [09:23:42] guess I will have to dig into it and find a proper fix [09:23:54] <_joe_> hashar: but it doesn't fail in our docker [09:23:56] <_joe_> go figure [09:24:06] <_joe_> hashar: in another week, I'd help you [09:24:07] it failed on https://gerrit.wikimedia.org/r/c/operations/puppet/+/458463 :] [09:24:25] (03CR) 10Gehel: [C: 031] Convert elasticsearch to systemd unit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [09:24:33] <_joe_> jynus: then verify what happens on einsteinium; then run on all core hosts, then verify einsteinium again [09:24:38] <_joe_> seems sensible? [09:24:58] ? 
[09:25:06] sorry, I may not understand you [09:25:14] <_joe_> when I merge the puppet change [09:25:18] einsteinium should not change at all [09:25:24] <_joe_> it shouldn't [09:25:30] it should only change the local check at the dbs [09:25:34] <_joe_> let's make sure we don't stop pages for eqiad dbs [09:25:43] well, change with noop [09:25:53] but you get what I mean, it changes the npr config [09:25:58] nrpe [09:25:59] <_joe_> yes, but exported resources are thorny [09:26:04] ok [09:26:13] <_joe_> anyways, I'll merge [09:26:19] +1 [09:26:32] tell me when done so I run puppet on a less important db [09:27:19] 10Operations, 10Discovery-Search, 10Elasticsearch: Alert when elasticsearch has shards larger than a maximum size - https://phabricator.wikimedia.org/T203546 (10Gehel) For reference, https://github.com/wikimedia/puppet/blob/production/modules/elasticsearch/files/nagios/check_elasticsearch.py is a similar che... [09:27:56] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12380/ seems to say the patch is correct." [puppet] - 10https://gerrit.wikimedia.org/r/457491 (owner: 10Giuseppe Lavagetto) [09:28:08] (03PS4) 10Giuseppe Lavagetto: realm.pp: drop mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/457491 [09:28:42] <_joe_> jynus: I disabled puppet everywhere on role::mariadb::core, so you will need to run puppet agent --enable first [09:28:50] oh, thanks [09:28:57] <_joe_> with a specific reason [09:29:00] I don't think it was necessary, but I don't disagree [09:29:07] <_joe_> so I won't override any disabling of yours [09:29:41] <_joe_> the change is now merged, you might proceed whenever you want [09:29:50] ok, testing [09:30:20] noop on db1082 [09:30:44] <_joe_> let me do a run on einsteinium [09:30:57] noop on db1113 [09:31:21] <_joe_> check one in codfw just to be sure? 
[09:31:28] that's next :-) [09:31:49] 10Operations, 10Continuous-Integration-Config: rspec-puppet fails with Could not find the daemon directory (tested [/etc/sv,/var/lib/service]) - https://phabricator.wikimedia.org/T203645 (10hashar) [09:32:37] noop on db2070 [09:32:51] I think we are good [09:33:04] !log disabling puppet on install1002 for some d-i tests [09:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:22] 10Operations, 10Continuous-Integration-Config: rspec-puppet fails with Could not find the daemon directory (tested [/etc/sv,/var/lib/service]) - https://phabricator.wikimedia.org/T203645 (10hashar) [09:33:27] <_joe_> jynus: I think so too [09:33:36] <_joe_> I'll check einsteinium's puppet logs later [09:33:48] <_joe_> but for now I'll reenable puppet everywhere [09:35:00] <_joe_> jynus: done [09:35:09] <_joe_> we got rid of mw_primary [09:38:18] great job, _joe_ [09:38:41] you not only made the switchover a lot faster [09:38:48] you unblocked a lot of stuff [09:41:39] 10Operations, 10Puppet, 10DBA: Remove all usages of $::mw_primary on puppet - https://phabricator.wikimedia.org/T199124 (10jcrespo) a:05jcrespo>03Joe [09:41:49] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10MoritzMuehlenhoff) I also tried to disable an HTTP-based PXE boot via https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/45846... [09:45:11] _joe_: did you enabled puppet again, or should I do that? 
[09:46:33] (03Abandoned) 10Jcrespo: Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/449742 (https://phabricator.wikimedia.org/T156924) (owner: 10Jcrespo) [09:46:46] <_joe_> jynus: I did [09:46:55] 10Operations, 10Icinga: register a nickserv account for icinga-wm - https://phabricator.wikimedia.org/T22771 (10Krenair) [09:46:57] 10Operations, 10IRCecho, 10Patch-For-Review: ircecho should support nickserv registration - https://phabricator.wikimedia.org/T48254 (10Krenair) 05Open>03Resolved a:03Krenair [09:46:59] (03CR) 10Alex Monk: [C: 031] "dammit, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/458459 (owner: 10Filippo Giunchedi) [09:47:05] (03Abandoned) 10Jcrespo: Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) (owner: 10Jcrespo) [09:52:29] (03PS2) 10Ema: trafficserver (7.1.3+ds-4wm3) stretch-wikimedia; urgency=medium [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/458195 (https://phabricator.wikimedia.org/T199720) [09:59:21] 10Operations, 10Discovery-Search, 10Elasticsearch: Alert when elasticsearch has shards larger than a maximum size - https://phabricator.wikimedia.org/T203546 (10Mathew.onipe) a:03Mathew.onipe [10:10:38] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Alert when elasticsearch has shards larger than a maximum size - https://phabricator.wikimedia.org/T203546 (10Mathew.onipe) [10:15:31] (03PS21) 10Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [10:16:04] (03CR) 10Alex Monk: Add make_account CLI script (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/457933 (owner: 10Alex Monk) [10:16:42] (03CR) 10jerkins-bot: [V: 04-1] Elasticsearch module is coming up. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [10:19:02] (03PS1) 10Volans: WIP mysql: refactor sync check to avoid GTID [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) [10:19:55] (03CR) 10Volans: "Tests are on their way, preview of the code for now, please give early feedback, we're really short on time." [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:20:04] (03CR) 10jerkins-bot: [V: 04-1] WIP mysql: refactor sync check to avoid GTID [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:21:11] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10jijiki) 05Open>03Resolved Boarded! [10:22:09] (03PS22) 10Mathew.onipe: Elasticsearch module is coming up. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [10:22:35] !log rebooting mc2* hosts for kernel security update [10:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:59] (03CR) 10Alex Monk: Prepare for packaging stuff and readme (034 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [10:27:35] (03PS39) 10Alex Monk: Prepare for packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 [10:28:19] (03CR) 10jerkins-bot: [V: 04-1] Prepare for packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [10:36:43] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: analytics-privatedata-users access for Dario Rossi (username drossi) - https://phabricator.wikimedia.org/T201196 (10ArielGlenn) This task has been re-opened by the user in accordance with T200800#4562028 [10:38:27] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10ArielGlenn) It's a week later; do we want to talk about this at the SRE meeting or can we come to some sort of agreement here on the task? 
[10:39:00] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:40:40] moritzm: ^^^ [10:41:36] (03PS2) 10Volans: mysql: refactor sync check to avoid GTID [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) [10:42:00] (03CR) 10Volans: "patch completed, please review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:44:36] volans: caused by the reboots, should recover after a while [10:45:01] moritzm: ack, wasn't sure was expected, thx [10:45:31] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:45:42] (03PS1) 10Giuseppe Lavagetto: parsoid: connect to MediaWiki via https everywhere [puppet] - 10https://gerrit.wikimedia.org/r/458475 [10:45:44] (03PS1) 10Giuseppe Lavagetto: service::node::config::scap3: get rid of confd-controlled configs [puppet] - 10https://gerrit.wikimedia.org/r/458476 [10:46:41] (03CR) 10jerkins-bot: [V: 04-1] service::node::config::scap3: get rid of confd-controlled configs [puppet] - 10https://gerrit.wikimedia.org/r/458476 (owner: 10Giuseppe Lavagetto) [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180906T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. 
[11:01:35] I'm around, but I'll need to go soon; on the other hand hasharAway is not around but he should return soo [11:01:37] soon :D [11:01:44] and there are no patches anyway... [11:01:54] zeljkof: goood :] [11:10:33] 10Operations, 10Continuous-Integration-Infrastructure, 10Math: quibble-vendor-mysql-hhvm-docker no space left on device, write - https://phabricator.wikimedia.org/T203649 (10Physikerwelt) [11:22:41] (03CR) 10Alexandros Kosiaris: [C: 031] mysql: refactor sync check to avoid GTID [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:27:04] (03CR) 10Paladox: [C: 031] icinga: make the apache virtual host name configurable [puppet] - 10https://gerrit.wikimedia.org/r/458336 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [11:44:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I might be wrong, but I think there is an error in the check for the heartbeat to be in sync." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:54:11] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 320 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:55:41] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:59:11] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 320 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:00:01] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:00:04] Deploy window Pre MediaWiki train sanity break 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180906T1200) [12:04:23] 10Operations, 10SRE-Access-Requests: analytics-privatedata-users access for Flavia Salutari - https://phabricator.wikimedia.org/T201199 (10Gilles) a:05Gilles>03None [12:06:06] (03CR) 10Gehel: "I have no idea about the mysql side of this CR, but a few comments on the style. None of those comments are blocker, I know the deadlines," (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:23:42] (03CR) 10Ema: [C: 032] trafficserver (7.1.3+ds-4wm3) stretch-wikimedia; urgency=medium [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/458195 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [12:24:17] (03CR) 10Jcrespo: "Answering first one of the comments." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:29:26] (03CR) 10Jcrespo: mysql: refactor sync check to avoid GTID (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:30:26] (03CR) 10Jcrespo: ">" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:31:31] (03CR) 10Jcrespo: "My pragmatic recommendation is, let's deploy, and let's test with and without lag, and later fix the issues mentioned which are mostly ref" [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:31:36] (03CR) 10Jcrespo: [C: 031] mysql: refactor sync check to avoid GTID [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:34:40] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical 
threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:36:51] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:39:17] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: analytics-privatedata-users access for Dario Rossi (username drossi) - https://phabricator.wikimedia.org/T201196 (10Rossi.dario.g) Dear all, I re-opened the task, created the wikitech account and commented here and via email please let me know if y... [12:45:51] (03CR) 10DCausse: Convert elasticsearch to systemd unit (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [12:47:21] (03PS1) 10Giuseppe Lavagetto: service: fix spec for debian 9+ [puppet] - 10https://gerrit.wikimedia.org/r/458495 [12:47:25] <_joe_> hashar: ^^ [12:48:31] <_joe_> hashar: I'm going to add it to all spec helpers I guess :/ [12:49:23] (03CR) 10DCausse: "I don't get where" [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [12:49:54] (03PS2) 10Giuseppe Lavagetto: service: fix spec for debian 9+ [puppet] - 10https://gerrit.wikimedia.org/r/458495 [12:49:56] (03PS2) 10Giuseppe Lavagetto: service::node::config::scap3: get rid of confd-controlled configs [puppet] - 10https://gerrit.wikimedia.org/r/458476 [12:50:27] _joe_: yup that would do. It is a bit terrible to have it on each spec though :^\ [12:50:41] (03CR) 10Gehel: "Mostly minor, but still a few more cleanups!" (0310 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [12:50:58] <_joe_> hashar: short of creating a gem ourselves to do that... 
[12:51:04] <_joe_> and I'd really like not to [12:51:32] <_joe_> if you can suggest a more gentle comment, It would be welcome [12:51:42] <_joe_> I can't find better words than "stupid" [12:52:04] "outdated" would work :] [12:52:28] or we fork puppet [12:53:07] <_joe_> nah, monkey-patching is ok imho [12:55:01] PROBLEM - DPKG on cp1074 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:55:01] PROBLEM - DPKG on cp2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:55:11] PROBLEM - DPKG on cp1072 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:55:11] PROBLEM - DPKG on cp2009 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:55:25] <_joe_> uhm [12:55:30] PROBLEM - DPKG on cp2015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:55:31] PROBLEM - DPKG on cp1071 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:55:31] PROBLEM - DPKG on cp1073 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:55:31] <_joe_> I guess some upgrade? [12:55:38] not me, looking [12:55:38] (03CR) 10Gehel: [C: 031] Convert elasticsearch to systemd unit (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [12:56:00] (03PS37) 10Gehel: Convert elasticsearch to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [12:56:01] PROBLEM - DPKG on cp2021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:56:11] RECOVERY - DPKG on cp2003 is OK: All packages OK [12:56:43] probably ema, these are all ATS hosts [12:57:15] yup, he just uploaded a new ATS release [12:57:20] RECOVERY - DPKG on cp1074 is OK: All packages OK [12:57:21] RECOVERY - DPKG on cp1072 is OK: All packages OK [12:57:25] yep that's me, sorry for the noise [12:57:30] RECOVERY - DPKG on cp2009 is OK: All packages OK [12:57:32] <_joe_> vgutierrez: but the upgrade seems to include more than just ats? 
[12:57:41] RECOVERY - DPKG on cp2015 is OK: All packages OK [12:57:41] RECOVERY - DPKG on cp1071 is OK: All packages OK [12:57:41] RECOVERY - DPKG on cp1073 is OK: All packages OK [12:58:00] probably conffile fun during upgrade [12:58:11] RECOVERY - DPKG on cp2021 is OK: All packages OK [12:58:18] (03CR) 10Gehel: mysql: refactor sync check to avoid GTID (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:59:31] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[openssh-server],Exec[set debconf flag seen for wireshark-common/install-setuid] [13:00:05] hashar: Dear deployers, time to do the MediaWiki train - European version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180906T1300). [13:00:28] o/ [13:01:44] (03PS1) 10Hashar: all wikis to 1.32.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458497 [13:01:46] (03CR) 10Hashar: [C: 032] all wikis to 1.32.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458497 (owner: 10Hashar) [13:01:57] deploying the train ... 
wish me luck [13:02:33] <_joe_> https://upload.wikimedia.org/wikipedia/commons/1/19/Train_wreck_at_Montparnasse_1895.jpg [13:02:54] we have https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#/media/File:TGVA_n%C2%B0341_au_PN_401_bis_%C3%A0_La_Baule_par_Cramos.JPG now :] [13:02:58] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458497 (owner: 10Hashar) [13:03:20] taken at Baule-Escoublac rail station which is a few kilometers away from my city \o/ [13:03:25] <_joe_> eheh [13:03:42] !log all wikis to 1.32.0-wmf.20 | T191066 [13:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:48] T191066: 1.32.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T191066 [13:04:40] (03CR) 10DCausse: Convert elasticsearch to systemd unit (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [13:04:40] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:05:01] canaries pending [13:06:25] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: analytics-privatedata-users access for Dario Rossi (username drossi) - https://phabricator.wikimedia.org/T201196 (10Aklapper) @Rossi.dario.g: See 3rd and 4th bullet point in T201196#4520314 - #3 seems to be https://wikitech.wikimedia.org/wiki/User:Dar... [13:08:33] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.20 [13:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:25] (03CR) 10Ottomata: [C: 031] "Dunno much about this sounds fine to meeeee!" 
[puppet] - 10https://gerrit.wikimedia.org/r/454291 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [13:11:40] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:11:50] :( [13:12:03] (03CR) 10Gehel: Convert elasticsearch to systemd unit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [13:13:13] those memcached errors, that apparently has been going on for a while [13:13:25] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen&from=1536219424589&to=1536239597512 [13:13:45] since 10:20 UTC apparently [13:13:50] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:14:19] 10:22 rebooting mc2* hosts for kernel security update [13:14:33] yeah, that's expected [13:14:52] I'm almost done, 2033 is currently booting and three more to go [13:15:49] (03PS38) 10Gehel: Convert elasticsearch to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [13:16:48] (03PS2) 10Gehel: elasticsearch: disable the default elasticsearch unit [puppet] - 10https://gerrit.wikimedia.org/r/458464 (https://phabricator.wikimedia.org/T198351) [13:16:52] (03PS2) 10Andrew Bogott: m5 grants: replace designate password hash with a private lookup [puppet] - 10https://gerrit.wikimedia.org/r/458284 [13:16:57] moritzm: great :] [13:17:18] 1log reboot kafka-jumbo1001 for openjdk-8 + kernel security upgrades [13:17:35] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458497 (owner: 10Hashar) 
[13:17:39] (03PS3) 10Bstorm: labstore: Change tcp buffer settings [puppet] - 10https://gerrit.wikimedia.org/r/458291 (https://phabricator.wikimedia.org/T203254) [13:18:01] (03CR) 10Andrew Bogott: [C: 032] m5 grants: replace designate password hash with a private lookup [puppet] - 10https://gerrit.wikimedia.org/r/458284 (owner: 10Andrew Bogott) [13:18:36] (03PS4) 10Bstorm: labstore: Change tcp buffer settings [puppet] - 10https://gerrit.wikimedia.org/r/458291 (https://phabricator.wikimedia.org/T203254) [13:19:17] (03CR) 10Bstorm: [C: 032] labstore: Change tcp buffer settings [puppet] - 10https://gerrit.wikimedia.org/r/458291 (https://phabricator.wikimedia.org/T203254) (owner: 10Bstorm) [13:24:41] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:24:56] uff [13:24:57] !log reboot kafka-jumbo1001 for openjdk-8 + kernel security upgrades [13:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:53] ah this one is for codfw [13:25:56] (03CR) 10DCausse: Convert elasticsearch to systemd unit (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [13:25:57] Cc: moritzm [13:26:05] so I think 1.32.0-wmf.20 is fine [13:26:20] still rebooting mc2* ?? 
[13:26:43] ah sorry, people already asked you [13:26:46] my bda [13:26:47] *bad [13:26:48] elukey: two more to go [13:26:53] * elukey returns to his analytics corner [13:27:14] I thought it was the mc1035 problem [13:28:18] (03PS1) 10Bstorm: labstore: fix priority on sysctl file [puppet] - 10https://gerrit.wikimedia.org/r/458504 (https://phabricator.wikimedia.org/T203254) [13:29:18] (03CR) 10Bstorm: [C: 032] labstore: fix priority on sysctl file [puppet] - 10https://gerrit.wikimedia.org/r/458504 (https://phabricator.wikimedia.org/T203254) (owner: 10Bstorm) [13:37:31] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:44:53] (03CR) 10Gehel: Convert elasticsearch to systemd unit (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [13:45:43] !log reboot kafka100[2-6] for kernel + openjdk-8 upgrades [13:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:24] !log reboots of mc hosts in codfw completed [13:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:40] (03PS1) 10Giuseppe Lavagetto: sre.switchdc.services: minor bugfixes [cookbooks] - 10https://gerrit.wikimedia.org/r/458507 [13:59:34] (03PS2) 10Giuseppe Lavagetto: sre.switchdc.services: minor bugfixes [cookbooks] - 10https://gerrit.wikimedia.org/r/458507 [14:03:19] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10Andrew) [14:03:49] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10Andrew) [14:04:12] (03CR) 10Giuseppe Lavagetto: [C: 031] mysql: refactor sync check to avoid GTID [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 
(https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:07:06] (03PS3) 10Volans: mysql: refactor sync check to avoid GTID [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) [14:07:21] (03CR) 10Volans: "replies inline" (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:09:21] (03CR) 10Volans: [C: 031] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/458507 (owner: 10Giuseppe Lavagetto) [14:10:43] (03CR) 10Thcipriani: [C: 032] "Nice improvement!" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398462 (owner: 10Hashar) [14:11:30] (03Merged) 10jenkins-bot: Generate documentation with Sphinx [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398462 (owner: 10Hashar) [14:15:11] (03CR) 10DCausse: Convert elasticsearch to systemd unit (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [14:18:02] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: add SSDs to wdqs1003 - https://phabricator.wikimedia.org/T202780 (10Cmjohnson) 05Open>03Resolved [14:20:20] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/458339 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [14:20:56] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is CRITICAL: 56 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [14:20:57] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 77 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [14:21:07] PROBLEM - Kafka Broker 
Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 80 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [14:21:14] (03PS1) 10Mark Bergsma: s/mainteance/maintenance/ [cookbooks] - 10https://gerrit.wikimedia.org/r/458511 [14:21:27] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 107 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [14:21:56] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 14 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [14:22:26] (03PS1) 10Vgutierrez: Allow specifying a list of dns servers for dns-01 validation purposes [software/certcentral] - 10https://gerrit.wikimedia.org/r/458512 (https://phabricator.wikimedia.org/T203396) [14:22:35] elukey, ottomata: ^^^ [14:22:51] there's reboots for kafka-jumbo ongoing [14:22:56] probably controlled fallout [14:23:37] ah yes it is kafka-jumbo1003 taking a bit too much [14:23:50] for some definition of "controlled" :P [14:24:00] it is recovering, sorry for the spam [14:24:19] thanks for the trust volans :D [14:24:46] :-P [14:24:55] I trust you, not the JVM [14:25:16] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [14:25:27] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [14:25:44] (03CR) 10Giuseppe Lavagetto: [C: 032] sre.switchdc.services: minor bugfixes [cookbooks] - 10https://gerrit.wikimedia.org/r/458507 (owner: 10Giuseppe Lavagetto) [14:26:16] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [14:26:26] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [14:26:57] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [14:27:33] 10Operations, 10Puppet: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10Banyek) [14:33:25] (03PS1) 10Bstorm: labstore: load monitoring should be based on number of processors [puppet] - 10https://gerrit.wikimedia.org/r/458514 (https://phabricator.wikimedia.org/T203254) [14:34:47] (03PS2) 10Ema: cache_canary: switch mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/457850 (https://phabricator.wikimedia.org/T199079) [14:36:27] (03CR) 10Ema: [C: 032] cache_canary: switch mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/457850 (https://phabricator.wikimedia.org/T199079) (owner: 10Ema) [14:37:56] 10Operations, 10Puppet: Debian package or files managed my puppet for pt-kill-wmf - 
https://phabricator.wikimedia.org/T203674 (10MoritzMuehlenhoff) +1 for creating a deb. I can give you an introduction on how to do that if you want. [14:38:15] !log reboot kafka2001 (eventbus codfw host) for kernel + openjdk-8 upgrades [14:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:39] 10Operations, 10Puppet: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10jcrespo) So my initial suggestion was to create a debian package for the following reasons: * Source control patches on a separate repo so upgrades are easy * Complexity can be hidd... [14:41:58] 10Operations, 10Puppet: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10Marostegui) +1 to create the package too, I like the idea of having the "patched" versions of pt-XXX in our own package. [14:42:46] (03Abandoned) 10Ladsgroup: Enable poolcounter for orespoolcounter[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/458056 (https://phabricator.wikimedia.org/T201824) (owner: 10Ladsgroup) [14:46:32] !log START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (switchdc/oblivian@neodymium) [14:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:22] (03CR) 10Thcipriani: "It looks like gerrit-theme.html will work for us!" 
[software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439503 (https://phabricator.wikimedia.org/T196835) (owner: 10Paladox) [14:47:50] (03PS1) 10Volans: Revert "cache_canary: switch mediawiki to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/458515 [14:48:20] (03CR) 10Ema: [C: 031] Revert "cache_canary: switch mediawiki to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/458515 (owner: 10Volans) [14:48:46] (03CR) 10Paladox: [C: 031] "@Thcipriani i created this patch https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/439504/ to symnlink it" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439503 (https://phabricator.wikimedia.org/T196835) (owner: 10Paladox) [14:48:52] (03CR) 10Volans: [C: 032] Revert "cache_canary: switch mediawiki to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/458515 (owner: 10Volans) [14:49:14] akosiaris: thanks for the puppet patches, but it seems ores nodes can't talk to orespoolcounter (at least right now) [14:49:17] ladsgroup@ores1001:~$ echo 'STATS FULL' | nc -w1 orespoolcounter1001.eqiad.wmnet 7531 [14:49:17] orespoolcounter1001.eqiad.wmnet [10.64.0.87] 7531 (?) 
: Connection refused [14:51:36] !log END (PASS) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=0) (switchdc/oblivian@neodymium) [14:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:57] !log START - Cookbook sre.switchdc.services.01-switch-dc (switchdc/oblivian@neodymium) [14:51:57] !log Switching services pdfrender: eqiad => codfw (switchdc/oblivian@neodymium) [14:51:58] !log END (PASS) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=0) (switchdc/oblivian@neodymium) [14:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:19] <_joe_> oh is this thing loud [14:52:28] !log START - Cookbook sre.switchdc.services.02-restore-ttl (switchdc/oblivian@neodymium) [14:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:31] !log END (PASS) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=0) (switchdc/oblivian@neodymium) [14:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:37] lol [14:53:00] (03PS1) 10Ema: cache_canary: temporarily get rid of debug directors [puppet] - 10https://gerrit.wikimedia.org/r/458516 [14:55:13] I like the logging a lot [14:55:15] very nice work [14:55:38] thx :) [14:55:47] you'll probably be the only one :D [14:55:47] I suppose that those are all goodies that everybody can use via spicerack, right? [14:56:11] yep, more or less batteries included! [14:56:12] <_joe_> volans: I found another issue, sigh, how did we not notice it [14:56:29] what now? 
[14:56:43] <_joe_> I inverted dc_from and dc_to apparently [14:57:04] <_joe_> and no, I did not, so this baffles me [14:57:09] discovery.pool(args.dc_to) [14:57:10] discovery.depool(args.dc_from) [14:57:12] seems correct to me [14:57:35] <_joe_> yeah but on the cli options I mean [14:58:00] wanna change to more explicit --dc-from and --dc-to? [14:58:14] I was also thinking of adding a check that they are "correct" [14:58:17] but it is tricky [14:58:25] <_joe_> no sorry, I got confused [14:58:29] based on the phase you are in [14:58:34] <_joe_> I did eqiad => codfw on purpose [14:58:40] <_joe_> to do a real-null test [14:58:41] <_joe_> lol [14:58:48] <_joe_> sorry [14:58:51] lol [14:58:52] no prob [14:58:55] cough [14:58:58] stage 0: coffee.pour(person.joe) [14:59:09] <_joe_> mark: I just had one [14:59:10] (sorry, have a bad case of coughing) [14:59:13] <_joe_> and clearly wasn't enough [14:59:25] not yet in the blood stream [14:59:45] !log START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (switchdc/oblivian@neodymium) [14:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:15] damn, found a bug [15:00:24] switchdc/oblivian, that relic from switchdc [15:00:36] should be spicerack or something else [15:00:50] probably can just be dropped [15:00:56] thoughts? [15:01:14] I'd vote for dropping the 'switchdc' [15:01:21] leave just user@host [15:01:27] ok to me [15:03:46] Amir1: it's not fully done yet, that's expected. 
I 'll resolve the ticket when they are done [15:03:49] (03PS1) 10Elukey: Remove meitnerium (old archiva host) from puppet [puppet] - 10https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) [15:04:02] oh okay, thanks [15:04:06] (03PS1) 10Volans: log: remove relic from switchdc [software/spicerack] - 10https://gerrit.wikimedia.org/r/458520 (https://phabricator.wikimedia.org/T199079) [15:04:26] (03CR) 10jerkins-bot: [V: 04-1] Remove meitnerium (old archiva host) from puppet [puppet] - 10https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) (owner: 10Elukey) [15:04:50] !log END (PASS) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=0) (switchdc/oblivian@neodymium) [15:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:55] win 15 [15:05:57] (03CR) 10Thcipriani: "> @Thcipriani i created this patch https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/439504/" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439503 (https://phabricator.wikimedia.org/T196835) (owner: 10Paladox) [15:07:03] (03CR) 10Elukey: "For some reason I get -1 from jenkins, but not if I run my tests locally." [puppet] - 10https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) (owner: 10Elukey) [15:07:20] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Dzahn) What about the email address? Are we still waiting for that? 
[15:07:49] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [15:08:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:09:16] (03CR) 10Volans: [C: 031] "Damn autocomplete! Good catch, thanks for the fix. Nitpick in the commit message." (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/458511 (owner: 10Mark Bergsma) [15:09:44] (03CR) 10Jcrespo: [C: 031] mysql: refactor sync check to avoid GTID [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:14:24] 10Operations, 10Traffic: Make configurable the cmd executed to perform a DNS zone update - https://phabricator.wikimedia.org/T203678 (10Vgutierrez) p:05Triage>03Normal [15:17:38] (03CR) 10Gehel: [C: 031] "LGTM" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:18:36] gehel: yeah I added it to the docstring and then refactored, sorry forgot to update the comment in the CR [15:18:47] 10Operations: Support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T202255 (10MoritzMuehlenhoff) Status update: I've created a stretch backport of a 4.14 kernel which should support both QLogic 41xx and the new HP Perc megaraid controller properly. To allow to use thi... 
[15:18:55] 10Operations, 10Puppet, 10Cloud-VPS, 10Release-Engineering-Team, and 3 others: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438 (10herron) Stretch compilers `compiler100[12].puppet-diffs.eqiad.wmflabs` are now live in the operations-puppet-catalog-compiler Jenkins project, an... [15:19:51] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10MoritzMuehlenhoff) See https://phabricator.wikimedia.org/T202255#4563157 for the 4.14 kernel. [15:20:41] volans: no, looks good like it is [15:21:08] (03PS3) 10Cwhite: profile/cumin: update python scripts to detect command file [puppet] - 10https://gerrit.wikimedia.org/r/458325 (https://phabricator.wikimedia.org/T202782) [15:21:14] yes, I meant I forgot to update the CR comment in which I was telling I added it [15:21:17] ;) [15:21:32] (03PS4) 10Cwhite: profile/cumin: update python scripts to detect command file [puppet] - 10https://gerrit.wikimedia.org/r/458325 (https://phabricator.wikimedia.org/T202782) [15:22:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] mysql: refactor sync check to avoid GTID (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:23:01] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (10thcipriani) >>! In T191921#4558120, @Krinkle wrote: > [...] > Also remember that enabling JIT will speed things up... 
[15:25:15] (03CR) 10Volans: "@akosiaris, reply inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:28:05] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): Scap should use Eval.Jit=1 when calling rebuildLocalisationCache.php via HHVM - https://phabricator.wikimedia.org/T203680 (10thcipriani) p:05Triage>03High [15:29:54] 10Operations, 10ops-eqdfw: unrack/decom cr1-eqdfw - https://phabricator.wikimedia.org/T202700 (10Papaul) [15:30:21] !log START - Cookbook sre.switchdc.services.01-switch-dc (switchdc/oblivian@neodymium) [15:30:21] !log Switching services pdfrender: codfw => eqiad (switchdc/oblivian@neodymium) [15:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:26] !log END (PASS) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=0) (switchdc/oblivian@neodymium) [15:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:36] !log START - Cookbook sre.switchdc.services.02-restore-ttl (switchdc/oblivian@neodymium) [15:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:40] !log END (PASS) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=0) (switchdc/oblivian@neodymium) [15:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:58] <_joe_> pdfrender is now eqiad-only [15:31:18] <_joe_> and now back again [15:31:38] !log START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (switchdc/oblivian@neodymium) [15:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:02] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:36:42] !log END (PASS) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=0) (switchdc/oblivian@neodymium) [15:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:29] !log START - Cookbook sre.switchdc.services.01-switch-dc (switchdc/oblivian@neodymium) [15:37:29] !log Switching services pdfrender: eqiad => codfw (switchdc/oblivian@neodymium) [15:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:33] !log END (PASS) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=0) (switchdc/oblivian@neodymium) [15:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:51] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [15:41:12] !log START - Cookbook sre.switchdc.services.02-restore-ttl (switchdc/oblivian@neodymium) [15:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:16] !log END (PASS) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=0) (switchdc/oblivian@neodymium) [15:41:16] (03PS3) 10Gehel: elasticsearch: disable the default elasticsearch unit [puppet] - 10https://gerrit.wikimedia.org/r/458464 (https://phabricator.wikimedia.org/T198351) [15:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:58] (03CR) 10Alexandros Kosiaris: [C: 031] parsoid: connect to MediaWiki via https everywhere [puppet] - 10https://gerrit.wikimedia.org/r/458475 (owner: 10Giuseppe Lavagetto) [15:42:46] 
10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Kalliope Tsouroupidou - https://phabricator.wikimedia.org/T202486 (10Kalliope) Unfortunately when I did the set up I opted for "no password" but then the system would... [15:44:02] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [15:44:53] (03PS1) 10Paladox: Replace "wikimedia-polygerrit-style" plugin with gerrit-theme [puppet] - 10https://gerrit.wikimedia.org/r/458523 (https://phabricator.wikimedia.org/T196835) [15:47:01] (03PS2) 10Paladox: Replace "wikimedia-polygerrit-style" plugin with gerrit-theme [puppet] - 10https://gerrit.wikimedia.org/r/458523 (https://phabricator.wikimedia.org/T196835) [15:47:31] (03CR) 10Paladox: [C: 031] "@Thcipriani moved to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458523/" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439503 (https://phabricator.wikimedia.org/T196835) (owner: 10Paladox) [15:48:20] (03PS1) 10Paladox: Remove wikimedia-polygerrit-style.html plugin [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/458524 [15:48:27] (03PS2) 10Paladox: Remove wikimedia-polygerrit-style.html plugin [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/458524 [15:48:32] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 24 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [15:48:55] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2053 - https://phabricator.wikimedia.org/T203623 (10Marostegui) Talked to Papaul - this disk will be replaced on Monday, as he is on a different DC! Thanks! 
[15:49:02] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [15:49:12] (03Abandoned) 10Paladox: Add gerrit-theme.html and also add footer links [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439503 (https://phabricator.wikimedia.org/T196835) (owner: 10Paladox) [15:49:19] (03Abandoned) 10Paladox: Link to gerrit-theme.html in scap repo [puppet] - 10https://gerrit.wikimedia.org/r/439504 (https://phabricator.wikimedia.org/T196835) (owner: 10Paladox) [15:49:33] (03PS2) 10Bstorm: labstore: load monitoring should be based on number of processors [puppet] - 10https://gerrit.wikimedia.org/r/458514 (https://phabricator.wikimedia.org/T203254) [15:50:05] (03CR) 10Paladox: "This needs to be merged at the same time as https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458523/ to prevent it duplicating the p" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/458524 (owner: 10Paladox) [15:51:31] (03PS39) 10Gehel: Convert elasticsearch to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [15:51:34] (03PS4) 10Gehel: elasticsearch: disable the default elasticsearch unit [puppet] - 10https://gerrit.wikimedia.org/r/458464 (https://phabricator.wikimedia.org/T198351) [15:52:11] (03CR) 10Thcipriani: [C: 031] "Tried the gerrit-theme.html on my testing instance. Working for new UI." 
[puppet] - 10https://gerrit.wikimedia.org/r/458523 (https://phabricator.wikimedia.org/T196835) (owner: 10Paladox) [15:52:44] (03CR) 10Gehel: Convert elasticsearch to systemd unit (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [15:53:42] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 15 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [15:58:45] (03CR) 10Alexandros Kosiaris: [C: 031] mysql: refactor sync check to avoid GTID (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:59:17] (03CR) 10Volans: [C: 032] mysql: refactor sync check to avoid GTID [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:00:04] godog and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180906T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:00:36] (03Merged) 10jenkins-bot: mysql: refactor sync check to avoid GTID [software/spicerack] - 10https://gerrit.wikimedia.org/r/458470 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:00:51] volans: I was about to leave- could we do a quick check of that just merged code? 
[16:00:57] or maybe you are busy [16:01:05] (03PS2) 10Volans: sre.switchdc.mediawiki: fix typo s/mainteance/maintenance/ [cookbooks] - 10https://gerrit.wikimedia.org/r/458511 (https://phabricator.wikimedia.org/T199079) (owner: 10Mark Bergsma) [16:01:34] jynus: I need a few minutes to do the release + deb package, not sure how much of a hurry you're in [16:01:40] oh [16:01:53] ok, then maybe I can check tomorrow [16:02:04] ack or I'll send you the result of the test [16:02:18] yeah, no problem with that [16:02:23] but I wanted to create lag [16:02:30] and that may not be simple for you [16:02:33] or maybe it is? [16:02:39] (03PS3) 10Volans: sre.switchdc.mediawiki: fix typo s/mainteance/maintenance/ [cookbooks] - 10https://gerrit.wikimedia.org/r/458511 (https://phabricator.wikimedia.org/T199079) (owner: 10Mark Bergsma) [16:02:42] downtime a full section etc [16:02:55] stop replication [16:02:57] we can test the lag together [16:02:59] see it fail [16:03:14] I thought you weren't available tomorrow [16:03:17] if not, next week [16:03:30] the working test should be simple [16:03:37] I shouldn't in theory [16:03:49] monday? [16:04:35] sure [16:04:43] ok, see you next week [16:04:47] don't work too much [16:04:48] ack [16:04:51] bye [16:10:06] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) >>! In T199125#4560182, @MoritzMuehlenhoff wrote: > I tried an installation from cloudvirt1023, but the PXELINUX version on... 
[16:11:28] (03CR) 10Gehel: [C: 031] "puppet compiler looks reasonable: https://puppet-compiler.wmflabs.org/compiler1002/12387/" [puppet] - 10https://gerrit.wikimedia.org/r/454722 (https://phabricator.wikimedia.org/T200740) (owner: 10EBernhardson) [16:12:40] (03CR) 10Gehel: [C: 031] "LGTM, trivial enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/458520 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:12:49] (03CR) 10Andrew Bogott: [C: 031] labstore: load monitoring should be based on number of processors [puppet] - 10https://gerrit.wikimedia.org/r/458514 (https://phabricator.wikimedia.org/T203254) (owner: 10Bstorm) [16:13:55] (03PS3) 10Bstorm: labstore: load monitoring should be based on number of processors [puppet] - 10https://gerrit.wikimedia.org/r/458514 (https://phabricator.wikimedia.org/T203254) [16:15:07] (03CR) 10Bstorm: [C: 032] labstore: load monitoring should be based on number of processors [puppet] - 10https://gerrit.wikimedia.org/r/458514 (https://phabricator.wikimedia.org/T203254) (owner: 10Bstorm) [16:15:54] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: fix typo s/mainteance/maintenance/ [cookbooks] - 10https://gerrit.wikimedia.org/r/458511 (https://phabricator.wikimedia.org/T199079) (owner: 10Mark Bergsma) [16:15:58] (03PS5) 10Gehel: Deploy msearch daemon to cirrus servers [puppet] - 10https://gerrit.wikimedia.org/r/454722 (https://phabricator.wikimedia.org/T200740) (owner: 10EBernhardson) [16:16:45] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: fix typo s/mainteance/maintenance/ [cookbooks] - 10https://gerrit.wikimedia.org/r/458511 (https://phabricator.wikimedia.org/T199079) (owner: 10Mark Bergsma) [16:16:58] (03PS2) 10Volans: log: remove relic from switchdc [software/spicerack] - 10https://gerrit.wikimedia.org/r/458520 (https://phabricator.wikimedia.org/T199079) [16:17:13] (03CR) 10Gehel: [C: 032] Deploy msearch daemon to cirrus servers [puppet] - 10https://gerrit.wikimedia.org/r/454722 
(https://phabricator.wikimedia.org/T200740) (owner: 10EBernhardson) [16:18:45] (03CR) 10Volans: [C: 032] log: remove relic from switchdc [software/spicerack] - 10https://gerrit.wikimedia.org/r/458520 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:19:53] (03Merged) 10jenkins-bot: log: remove relic from switchdc [software/spicerack] - 10https://gerrit.wikimedia.org/r/458520 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:20:31] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 51653 MB (10% inode=99%) [16:20:46] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/458325 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [16:29:59] (03PS1) 10EBernhardson: Mjolnir msearch: Use correct cli args [puppet] - 10https://gerrit.wikimedia.org/r/458530 [16:30:51] (03PS1) 10Volans: Upstream release v0.0.6 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/458531 (https://phabricator.wikimedia.org/T199079) [16:30:55] (03PS5) 10Cwhite: profile/cumin: update python scripts to detect command file [puppet] - 10https://gerrit.wikimedia.org/r/458325 (https://phabricator.wikimedia.org/T202782) [16:31:39] (03CR) 10Gehel: [C: 032] Mjolnir msearch: Use correct cli args [puppet] - 10https://gerrit.wikimedia.org/r/458530 (owner: 10EBernhardson) [16:31:47] (03CR) 10Cwhite: [C: 032] profile/cumin: update python scripts to detect command file [puppet] - 10https://gerrit.wikimedia.org/r/458325 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [16:32:02] ema: FYI Cole change ^^^ [16:32:06] (03PS6) 10Cwhite: profile/cumin: update python scripts to detect command file [puppet] - 10https://gerrit.wikimedia.org/r/458325 (https://phabricator.wikimedia.org/T202782) [16:33:43] (03CR) 10Volans: [C: 032] Upstream release v0.0.6 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/458531 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:34:12] 
10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) raid and bios updated, there are no broadcom updates available. So this pxe boots in the currently deployed pxe/installer,... [16:34:51] (03CR) 10Cwhite: [C: 031] icinga: make the apache virtual host name configurable [puppet] - 10https://gerrit.wikimedia.org/r/458336 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [16:35:00] (03Merged) 10jenkins-bot: Upstream release v0.0.6 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/458531 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [16:35:07] (03PS6) 10Cwhite: icinga: make the apache virtual host name configurable [puppet] - 10https://gerrit.wikimedia.org/r/458336 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [16:37:05] (03CR) 10Hashar: "And now we have:" [puppet] - 10https://gerrit.wikimedia.org/r/441397 (owner: 10Paladox) [16:38:36] !log uploaded spicerack_0.0.6-1{,+deb9u1} to apt.wikimedia.org {jessie,stretch}-wikimedia - T199079 [16:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:41] T199079: Refactor the switchdc script - https://phabricator.wikimedia.org/T199079 [16:40:00] !log upgraded spicerack to version 0.0.6 on sarin/neodymium - T199079 [16:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:12] (03CR) 10Hashar: "New data:" [puppet] - 10https://gerrit.wikimedia.org/r/441391 (owner: 10Paladox) [16:43:58] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@735235b]: new cli flags for msearch daemon, bump kafka-python dep to 1.4.x [16:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:05] !log ebernhardson@deploy1001 deploy aborted: new cli flags for msearch daemon, bump kafka-python dep to 1.4.x (duration: 00m 07s) [16:44:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:19] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@735235b]: new cli flags for msearch daemon, bump kafka-python dep to 1.4.x [16:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:00] (03Abandoned) 10Ema: cache_canary: temporarily get rid of debug directors [puppet] - 10https://gerrit.wikimedia.org/r/458516 (owner: 10Ema) [16:45:02] (03CR) 10Alex Monk: [C: 04-1] Add make_account CLI script [software/certcentral] - 10https://gerrit.wikimedia.org/r/457933 (owner: 10Alex Monk) [16:45:49] (03CR) 10Alex Monk: [C: 032] README: provide configuration file examples [software/certcentral] - 10https://gerrit.wikimedia.org/r/457485 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [16:48:56] (03CR) 10Alex Monk: [C: 032] Allow specifying a list of dns servers for dns-01 validation purposes [software/certcentral] - 10https://gerrit.wikimedia.org/r/458512 (https://phabricator.wikimedia.org/T203396) (owner: 10Vgutierrez) [16:49:44] (03PS1) 10Alexandros Kosiaris: Fix orespoolcounter2001 MAC address [puppet] - 10https://gerrit.wikimedia.org/r/458535 [16:50:35] (03CR) 10jerkins-bot: [V: 04-1] Fix orespoolcounter2001 MAC address [puppet] - 10https://gerrit.wikimedia.org/r/458535 (owner: 10Alexandros Kosiaris) [16:51:14] (03PS1) 10Ema: ATS: specify mapping rules for all text/upload backends [puppet] - 10https://gerrit.wikimedia.org/r/458536 (https://phabricator.wikimedia.org/T199720) [16:52:42] (03PS40) 10Alex Monk: Prepare for packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 [16:53:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix orespoolcounter2001 MAC address [puppet] - 10https://gerrit.wikimedia.org/r/458535 (owner: 10Alexandros Kosiaris) [16:53:58] (03CR) 10jerkins-bot: [V: 04-1] Prepare for packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex 
Monk) [16:54:03] RECOVERY - Disk space on elastic1017 is OK: DISK OK [16:54:22] 10Operations, 10Continuous-Integration-Infrastructure, 10Math: quibble-vendor-mysql-hhvm-docker no space left on device, write - https://phabricator.wikimedia.org/T203649 (10Umherirrender) Sounds like handled in T202457 [16:54:49] 10Operations, 10Continuous-Integration-Infrastructure, 10Math: quibble-vendor-mysql-hhvm-docker no space left on device, write - https://phabricator.wikimedia.org/T203649 (10Krinkle) [16:55:02] 10Operations, 10Continuous-Integration-Infrastructure, 10Math: quibble-vendor-mysql-hhvm-docker no space left on device, write - https://phabricator.wikimedia.org/T203649 (10Krinkle) [16:55:51] (03PS41) 10Alex Monk: Prepare for packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 [16:57:35] (03PS42) 10Alex Monk: Prepare for packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 [16:59:08] (03PS2) 10Ema: ATS: specify mapping rules for all text/upload backends [puppet] - 10https://gerrit.wikimedia.org/r/458536 (https://phabricator.wikimedia.org/T199720) [16:59:40] (03CR) 10Alex Monk: [C: 032] "Fixed up the minor comments on PS38 and tests pass again, let's get this in." [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: Time to snap out of that daydream and deploy Services – Graphoid / Parsoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180906T1700). 
[17:00:38] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@735235b]: new cli flags for msearch daemon, bump kafka-python dep to 1.4.x (duration: 16m 19s) [17:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:52] (03Merged) 10jenkins-bot: Prepare for packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [17:02:25] (03PS4) 10Alex Monk: Rename certcentral_api to just api [software/certcentral] - 10https://gerrit.wikimedia.org/r/457378 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [17:02:35] (03CR) 10Alex Monk: [C: 032] Rename certcentral_api to just api [software/certcentral] - 10https://gerrit.wikimedia.org/r/457378 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [17:02:50] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@735235b]: re-try bump to master [17:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:58] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:04:41] (03CR) 10jenkins-bot: Prepare for packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [17:04:52] (03Merged) 10jenkins-bot: Rename certcentral_api to just api [software/certcentral] - 10https://gerrit.wikimedia.org/r/457378 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [17:05:41] (03PS4) 10Alex Monk: README: provide configuration file examples [software/certcentral] - 10https://gerrit.wikimedia.org/r/457485 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [17:06:26] (03CR) 10Alex Monk: README: provide configuration file examples [software/certcentral] - 10https://gerrit.wikimedia.org/r/457485 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [17:06:31] (03CR) 
10Alex Monk: [C: 032] README: provide configuration file examples [software/certcentral] - 10https://gerrit.wikimedia.org/r/457485 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [17:06:38] (03CR) 10jenkins-bot: Rename certcentral_api to just api [software/certcentral] - 10https://gerrit.wikimedia.org/r/457378 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [17:06:47] (03PS2) 10Alex Monk: Allow specifying a list of dns servers for dns-01 validation purposes [software/certcentral] - 10https://gerrit.wikimedia.org/r/458512 (https://phabricator.wikimedia.org/T203396) (owner: 10Vgutierrez) [17:06:53] (03CR) 10Alex Monk: Allow specifying a list of dns servers for dns-01 validation purposes [software/certcentral] - 10https://gerrit.wikimedia.org/r/458512 (https://phabricator.wikimedia.org/T203396) (owner: 10Vgutierrez) [17:08:04] (03Merged) 10jenkins-bot: README: provide configuration file examples [software/certcentral] - 10https://gerrit.wikimedia.org/r/457485 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [17:08:12] (03CR) 10Alex Monk: [C: 032] Allow specifying a list of dns servers for dns-01 validation purposes [software/certcentral] - 10https://gerrit.wikimedia.org/r/458512 (https://phabricator.wikimedia.org/T203396) (owner: 10Vgutierrez) [17:08:41] will sort out the debian branch later [17:08:59] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:09:37] (03CR) 10jenkins-bot: README: provide configuration file examples [software/certcentral] - 10https://gerrit.wikimedia.org/r/457485 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [17:09:45] (03Merged) 10jenkins-bot: Allow specifying a list of dns servers for dns-01 validation purposes [software/certcentral] - 10https://gerrit.wikimedia.org/r/458512 (https://phabricator.wikimedia.org/T203396) (owner: 10Vgutierrez) [17:11:22] (03CR) 
10jenkins-bot: Allow specifying a list of dns servers for dns-01 validation purposes [software/certcentral] - 10https://gerrit.wikimedia.org/r/458512 (https://phabricator.wikimedia.org/T203396) (owner: 10Vgutierrez) [17:11:28] 10Operations, 10ORES, 10Scoring-platform-team, 10vm-requests, 10Patch-For-Review: Site: 4 VM request for ORES poolcounter - https://phabricator.wikimedia.org/T203465 (10akosiaris) 05Open>03Resolved @Ladsgroup Hosts in both DCs up and running! [17:11:59] (03PS1) 10Volans: dnsdisc: fix dry-run in check_if_depoolable [software/spicerack] - 10https://gerrit.wikimedia.org/r/458539 (https://phabricator.wikimedia.org/T199079) [17:12:27] checking dbproxy1003, low on disk space [17:13:28] (03CR) 10Alexandros Kosiaris: [C: 031] dnsdisc: fix dry-run in check_if_depoolable [software/spicerack] - 10https://gerrit.wikimedia.org/r/458539 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [17:15:36] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@735235b]: re-try bump to master (duration: 12m 46s) [17:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:36] 10Operations, 10DBA, 10monitoring, 10Patch-For-Review: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (10jcrespo) 05Resolved>03Open p:05Normal>03High There is now too much logging, or it is not rotated fast enough: logs are consuming 70% of available disk:... 
[17:19:55] (03CR) 10Giuseppe Lavagetto: [C: 031] "I had written basically the same patch, with the change in dnsdisc being the same bit-by-bit, so of course it LGTM :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/458539 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [17:20:12] !log dropping old logs from dbproxy1003 T201021 [17:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:17] T201021: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 [17:20:32] (03CR) 10Volans: [C: 032] dnsdisc: fix dry-run in check_if_depoolable [software/spicerack] - 10https://gerrit.wikimedia.org/r/458539 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [17:21:40] (03Merged) 10jenkins-bot: dnsdisc: fix dry-run in check_if_depoolable [software/spicerack] - 10https://gerrit.wikimedia.org/r/458539 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [17:26:33] (03PS1) 10Volans: Upstream release v0.0.7 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/458541 (https://phabricator.wikimedia.org/T199079) [17:27:48] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Spin up a new poolcounter node for ores - https://phabricator.wikimedia.org/T201824 (10Ladsgroup) Thank you @akosiaris [17:28:55] (03CR) 10Volans: [C: 032] Upstream release v0.0.7 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/458541 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [17:29:57] (03Merged) 10jenkins-bot: Upstream release v0.0.7 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/458541 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [17:30:18] 10Operations, 10DBA, 10monitoring, 10Patch-For-Review: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (10Marostegui) I will play with the different log levels tomorrow to see which is the minimum we can do to still get 
the requests logged, or at least the failures [17:34:10] !log uploaded spicerack_0.0.7-1{,+deb9u1} to apt.wikimedia.org {jessie,stretch}-wikimedia - T199079 [17:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:16] T199079: Refactor the switchdc script - https://phabricator.wikimedia.org/T199079 [17:39:37] 10Operations, 10DBA, 10monitoring, 10Patch-For-Review: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (10Marostegui) I have purged logs from other dbproxies too just to make sure they are ok. [17:45:28] !log thcipriani@deploy1001 scap failed: average error rate on 6/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [17:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:01] uck [17:47:16] serialize(): "" returned as member variable from __sleep() but does not exist [17:47:33] I'm guessing a variable got added to serialized jobs? [17:48:34] should I just --force through that? and deal with fallout? [17:48:58] Krinkle: any ideas about ^ [17:49:43] Adjusting the logstash query to not exclude INFO, and exclude hhvm, to find the stack trace for that one [17:49:45] comes from SqlBagOStuff->serialize(ParserOutput) [17:49:57] not job queue [17:50:23] might actually be exposing another bug from the MCR/RevisionRendering refactor. [17:50:39] Objects can't have empty string keys in PHP afaik, not in HHVM/PHP7.0 [17:51:06] also, we're not supposed to have breaking changes in that class. because of cache. [17:51:08] ugh [17:51:26] it's just more common now during a rollback but we'll probably find this in the logs from yesterday as well [17:52:10] so which would have less user-impact at this point? wmf.19 or wmf.20? 
[17:52:58] it's a PHP Notice, which means php returns null instead of the array the code asked for, which is like swallowing an exception without there being a catch statement in the code. It usually breaks stuff as null isn't expected, but it also means it isn't fatal and either happens to work, or happens to do weird stuff afterwards. [17:53:20] Will need to check with Daniel whether the real key that "" was supposed to represent still exists in the new format as well. [17:53:24] For the old code to consume. [17:53:26] Don't know myself. [17:54:10] * Krinkle asks [17:54:22] oh, he left. [17:54:33] I'm going to put all the canary machines back to wmf.20 in the interim [17:54:41] yeah [17:56:43] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: Canary machines back to wmf.20 [17:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:23] We should rush out https://gerrit.wikimedia.org/r/c/mediawiki/core/+/458545 as an UBN fix. [17:59:35] (CC anomie) [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180906T1800). [18:00:04] Ebe123: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:15] * Ebe123 is ready [18:00:24] Ebe123: I think we may have to delay. [18:00:41] For when? [18:01:06] a few minutes [18:01:19] That's fine [18:03:09] yes, please wait [18:03:13] rolling out fix for wmf.20 now [18:03:21] well, as soon as Jenkins is done [18:06:23] * Krinkle staging on mwdebug1002 [18:08:39] Krinkle: LGTM. [18:09:02] Hmm. 
"Warning: Destructor threw an object exception: exception 'Wikimedia\Rdbms\DBTransactionError' with message 'Transaction round stage must be 'cursory' (not 'within-rollback')' in /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/lbfactory/LBFactory.php:706". That's a new one to me. (Only one instance in the live log.) [18:09:14] I guess EditStash doesn't know about it. I just had a strange one where the preview was good, but the saved one still wrong. [18:09:20] but on second try on another page it was fine [18:09:31] https://test2.wikipedia.org/w/index.php?title=Why_is_there_an_article%3F&diff=prev&oldid=380679 [18:09:38] Was the EditStash from before switching to mwdebug? [18:09:51] (03PS3) 10Hashar: service: fix spec for debian 9+ [puppet] - 10https://gerrit.wikimedia.org/r/458495 (https://phabricator.wikimedia.org/T203645) (owner: 10Giuseppe Lavagetto) [18:09:53] Maybe , yeah [18:09:56] I an't reproduce it now [18:10:02] https://en.wikipedia.org/wiki/User:Jdforrester_(WMF)/sandbox223 worked as expected for me (in the 2010 wikitext editor; it was working fine in 2017WTE anyway). [18:10:09] I'll deploy to local cherry-pick and sync tin after with actual branch [18:10:17] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review: rspec-puppet fails with Could not find the daemon directory (tested [/etc/sv,/var/lib/service]) - https://phabricator.wikimedia.org/T203645 (10hashar) I went with a monkey patch in rspec-puppet https://github.com/rodjek/rspec-puppet/pull/720 [18:12:05] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.20/includes/: I31a97d0168 - T203583 (duration: 01m 13s) [18:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:27] T203583: {{subst:REVISIONUSER}} no longer substitutes into the current user name, but the username of the last revision - https://phabricator.wikimedia.org/T203583 [18:13:06] confirmed again outside mwdebug, lgtm [18:18:15] OK. 
tin is now clean [18:18:35] One more for Flow backup dumps and then I'm done [18:20:22] krinkle thanks for the merge for the flow dumps [18:20:29] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/Flow/: Ia0112ae62e6b - T203647 (duration: 01m 02s) [18:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:34] T203647: Clean up flow dump job problems for Sept 1 2018 dumps run - https://phabricator.wikimedia.org/T203647 [18:23:29] James_F: Hm.. assuming not in phab already, wanna report it with trace? [18:23:37] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@d3e2c23]: repair msearch daemon cli args [18:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:46] Krinkle: Sure, if I had any idea how to get that. :-) [18:23:56] James_F: Where did you spot it? [18:24:09] Krinkle: `fatalmonitor` on mwlog1001. [18:24:15] oh [18:24:24] I have it permanently open off on a sidescreen. [18:24:34] Try Fatal-Monitor on logstash, then click the (+) next to the trending message, and pick one from the feed below to get the context+trace [18:25:09] or search on mediawiki-errors dashboard if it's not trending [18:25:10] That involves HTTP auth copy-pasting uselessness. [18:25:16] * James_F grumbles. [18:26:00] The dashboard's been getting better. and with https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/455229/ we can remove most of the filter hacks as well [18:27:33] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@d3e2c23]: repair msearch daemon cli args (duration: 03m 56s) [18:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:46] (03PS7) 10Dzahn: icinga: make the apache virtual host name configurable [puppet] - 10https://gerrit.wikimedia.org/r/458336 (https://phabricator.wikimedia.org/T202782) [18:29:10] MatmaRex: ready now. deploy is done. 
just waiting for the gerrit commit to land so I can clean the staging area, but start +2'ing etc :) [18:29:29] (03CR) 10Dzahn: [C: 032] icinga: make the apache virtual host name configurable [puppet] - 10https://gerrit.wikimedia.org/r/458336 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [18:29:43] James_F: Ebe123: ^ [18:29:45] Krinkle: Filed as T203700 but it's not a very useful trace. [18:29:46] T203700: Fix "Transaction round stage must be 'cursory' (not 'within-rollback')" from LBFactory/DeferredUpdates - https://phabricator.wikimedia.org/T203700 [18:30:14] James_F: Hm.. indeed, was it a POST? I guess so, given only 1 parameter. [18:30:22] Krinkle: Yeah. [18:30:26] Ready [18:30:43] Krinkle: On `/w/api.php?format=xml` [18:34:07] PROBLEM - puppet last run on debmonitor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:35:07] 10Operations, 10ops-eqiad: decommission thulium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T203520 (10Jgreen) [18:36:20] (03PS1) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 [18:36:37] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:36:37] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:36:47] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.6136 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [18:37:06] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.6515 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [18:37:17] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.6294 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [18:37:47] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:37:48] 10Operations, 10Traffic: certcentral: Make configurable the cmd executed to perform a DNS zone update - https://phabricator.wikimedia.org/T203678 (10Krenair) [18:37:50] (03PS1) 10RobH: set cloudvirt1023 to install stretch via tftp not http [puppet] - 10https://gerrit.wikimedia.org/r/458556 (https://phabricator.wikimedia.org/T199125) [18:37:57] PROBLEM - puppet last run on mx1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:38:27] (03CR) 10jerkins-bot: [V: 04-1] set cloudvirt1023 to install stretch via tftp not http [puppet] - 10https://gerrit.wikimedia.org/r/458556 (https://phabricator.wikimedia.org/T199125) (owner: 10RobH) [18:38:46] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:39:25] OK, tin is clean now [18:39:26] Ebe123: It's not going to happen this SWAT window, sorry. [18:39:28] I mean deploy1001 [18:39:36] PROBLEM - puppet last run on db1123 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:40:00] now what is up with memcached alerts? [18:40:01] Ebe123: The master patch won't merge for ~5 minutes. I'm not going to try to create a cherry-pick, merge, pull, test, and deploy in the remaining 20 minutes. [18:40:02] errr, did we rollback? [18:40:12] That's fine; another time [18:40:44] legoktm: Something broken? [18:40:58] I saw backscroll about rolling back the train? [18:41:05] but it looks like everything on wmf.20? [18:41:16] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [18:41:18] legoktm: Krinkle back-ported the UBN fix. 
[18:41:27] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [18:41:28] legoktm: T203583 specifically. [18:41:28] T203583: {{subst:REVISIONUSER}} no longer substitutes into the current user name, but the username of the last revision - https://phabricator.wikimedia.org/T203583 [18:41:29] ok [18:41:37] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:41:46] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [18:43:10] thanks [18:43:22] (03CR) 10RobH: [V: 032 C: 032] set cloudvirt1023 to install stretch via tftp not http [puppet] - 10https://gerrit.wikimedia.org/r/458556 (https://phabricator.wikimedia.org/T199125) (owner: 10RobH) [18:44:17] (03CR) 10Vgutierrez: Debian packaging (032 comments) [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (owner: 10Alex Monk) [18:44:39] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Root for Giovanni Tirloni - https://phabricator.wikimedia.org/T203494 (10GTirloni) [18:46:24] 04Critical Alert for device cr2-ulsfo.wikimedia.org - Primary outbound port utilisation over 80% [18:46:47] RECOVERY - puppet last run on labpuppetmaster1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:48:09] 08Warning Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Juniper environment status [18:49:46] (03PS2) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 [18:50:27] (03CR) 10Alex Monk: Debian packaging (031 comment) [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (owner: 10Alex Monk) [18:51:52] (03CR) 10Vgutierrez: [C: 031] Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 
(owner: 10Alex Monk) [18:52:24] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-ulsfo.wikimedia.org recovered from Primary outbound port utilisation over 80% [18:53:27] (03PS1) 10Cwhite: add icinga-stretch subdomain cnamed for icinga1001 [dns] - 10https://gerrit.wikimedia.org/r/458560 (https://phabricator.wikimedia.org/T202782) [18:54:35] (03CR) 10Dzahn: [C: 031] add icinga-stretch subdomain cnamed for icinga1001 [dns] - 10https://gerrit.wikimedia.org/r/458560 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:55:42] (03CR) 10Cwhite: [C: 032] add icinga-stretch subdomain cnamed for icinga1001 [dns] - 10https://gerrit.wikimedia.org/r/458560 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:59:36] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:00:05] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180906T1900) [19:01:22] 10Operations, 10cloud-services-team, 10netops: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 (10RobH) Ok, so we've been discussing this in IRC. when trying to use cloudvirt1023 in the labs-hosts1-b-eqiad vlan, if it has NO specific entry for the kerne... [19:02:09] (03CR) 10Legoktm: [C: 04-1] "See inline comments. 
Also you probably want to add a gbp.conf, mostly to set the proper debian-branch (ex https://gerrit.wikimedia.org/r/p" (035 comments) [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (owner: 10Alex Monk) [19:03:16] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:03:17] RECOVERY - puppet last run on mx1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [19:04:27] RECOVERY - puppet last run on debmonitor1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:06:11] (03PS1) 10RobH: adding cloudvirt1024b dns entry [dns] - 10https://gerrit.wikimedia.org/r/458563 (https://phabricator.wikimedia.org/T190424) [19:07:17] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:07:17] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:08:34] (03CR) 10Alex Monk: Debian packaging (031 comment) [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 (owner: 10Alex Monk) [19:09:47] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.6639 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [19:09:57] RECOVERY - puppet last run on db1123 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:10:06] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.7259 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [19:10:17] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:10:37] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.6573 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [19:10:57] 
RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [19:11:16] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [19:11:47] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [19:12:36] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:13:09] 08Warning Alert for device cr2-eqiad.wikimedia.org - Inbound interface errors [19:15:59] (03CR) 10Dzahn: [C: 032] "noop on tegmen and einsteinium.. changed as wanted on icinga1001" [puppet] - 10https://gerrit.wikimedia.org/r/458336 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:24:09] (03PS2) 10RobH: adding cloudvirt1024b dns entry [dns] - 10https://gerrit.wikimedia.org/r/458563 (https://phabricator.wikimedia.org/T190424) [19:24:27] (03CR) 10RobH: [C: 032] adding cloudvirt1024b dns entry [dns] - 10https://gerrit.wikimedia.org/r/458563 (https://phabricator.wikimedia.org/T190424) (owner: 10RobH) [19:24:43] XioNoX: Inbound interface errors? sounds unusual and potentially bad [19:25:19] https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound%2Foutbound_interface_errors [19:25:24] will have a look [19:25:42] oh, did not expect that link, and thanks! [19:27:34] 10Operations, 10cloud-services-team, 10netops, 10Patch-For-Review: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 (10RobH) Ok, I'm going to outline all the troubleshooting steps below that I've done to demonstrate that the issue is inherently one with... 
[19:31:06] 08Warning Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Juniper environment status [19:35:12] (03PS1) 10Dzahn: icinga: enable acme (letsencrypt) on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/458570 (https://phabricator.wikimedia.org/T202782) [19:35:59] (03PS2) 10Gehel: Enable kafka poller on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/449109 (https://phabricator.wikimedia.org/T189458) (owner: 10Smalyshev) [19:36:14] (03PS3) 10Gehel: Enable kafka poller on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/449109 (https://phabricator.wikimedia.org/T189458) (owner: 10Smalyshev) [19:37:11] (03CR) 10Gehel: [C: 032] Enable kafka poller on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/449109 (https://phabricator.wikimedia.org/T189458) (owner: 10Smalyshev) [19:37:13] (03CR) 10Dzahn: [C: 032] icinga: enable acme (letsencrypt) on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/458570 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:37:23] (03PS2) 10Dzahn: icinga: enable acme (letsencrypt) on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/458570 (https://phabricator.wikimedia.org/T202782) [19:37:54] puppetmaster race :) [19:38:44] I win! [19:38:59] mutante: sorry for that :/ [19:39:06] hitting rebase is hard [19:40:04] 10Operations, 10cloud-services-team, 10netops, 10Patch-For-Review: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 (10RobH) **cloudvirt1024.eqiad.wmnet is in the labs-hosts1-b-eqiad vlan/subnet with the IP address of 10.64.20.43. loading stretch over h... [19:40:07] gehel: hehe! 
no worries :) [19:51:01] (03PS3) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 [19:53:13] (03PS4) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 [19:55:24] (03PS1) 10Ladsgroup: labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458579 (https://phabricator.wikimedia.org/T196671) [19:58:26] (03CR) 10Ladsgroup: [C: 032] "labs-only change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458579 (https://phabricator.wikimedia.org/T196671) (owner: 10Ladsgroup) [20:00:10] (03Merged) 10jenkins-bot: labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458579 (https://phabricator.wikimedia.org/T196671) (owner: 10Ladsgroup) [20:00:55] 10Operations, 10cloud-services-team, 10netops, 10Patch-For-Review: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 (10RobH) **cloudvirt1024b.eqiad.wmnet is in the private1-b-eqiad vlan/subnet with the IP address of 10.64.16.27. loading stretch over ht... [20:01:21] (03CR) 10jenkins-bot: labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458579 (https://phabricator.wikimedia.org/T196671) (owner: 10Ladsgroup) [20:02:40] ^ rebased on deploy1001 [20:16:07] 10Operations, 10Continuous-Integration-Infrastructure, 10Mail, 10Release-Engineering-Team, and 2 others: Ensure Jenkins mail configuration supports outbound smtp server failover - https://phabricator.wikimedia.org/T203607 (10herron) Thanks! Looks like it would indeed be affected by mx1001 downtime. We sh... 
[20:16:37] (03CR) 10Alex Monk: [C: 04-1] Add make_account CLI script (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/457933 (owner: 10Alex Monk) [20:17:38] (03PS1) 10Sbisson: Enable PageTriage AfC on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458583 (https://phabricator.wikimedia.org/T203184) [20:18:17] (03CR) 10Alex Monk: [C: 04-1] Add make_account CLI script (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/457933 (owner: 10Alex Monk) [20:19:56] (03PS6) 10Alex Monk: Add make_account CLI script [software/certcentral] - 10https://gerrit.wikimedia.org/r/457933 [20:27:47] 10Operations, 10Puppet, 10Cloud-VPS, 10Release-Engineering-Team, and 3 others: Upgrade Puppet compilers to Stretch - https://phabricator.wikimedia.org/T191438 (10Krenair) >>! In T191438#4563158, @herron wrote: > #cloud-vps is there anything else involved in moving a web proxy from one project to another be... [20:30:19] !log ppchelko@deploy1001 Started deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190 [20:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:26] T194190: Infinite rerender loop in RESTBase - https://phabricator.wikimedia.org/T194190 [20:31:08] 08Warning Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Juniper environment status [20:34:16] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190 (duration: 03m 56s) [20:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:32] !log ppchelko@deploy1001 Started deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 2 [20:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:35] (03PS1) 10Andrew Bogott: Added a script to migrate nova quotas between region [puppet] - 10https://gerrit.wikimedia.org/r/458587 [20:43:58] !log ppchelko@deploy1001 
Finished deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 2 (duration: 09m 26s) [20:44:03] !log ppchelko@deploy1001 Started deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 3 [20:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:04] T194190: Infinite rerender loop in RESTBase - https://phabricator.wikimedia.org/T194190 [20:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:18] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 3 (duration: 07m 16s) [20:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:24] T194190: Infinite rerender loop in RESTBase - https://phabricator.wikimedia.org/T194190 [20:51:29] !log ppchelko@deploy1001 Started deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 4 [20:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:51] (03PS2) 10Andrew Bogott: Added a script to migrate nova quotas between region [puppet] - 10https://gerrit.wikimedia.org/r/458587 [20:56:15] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 4 (duration: 04m 47s) [20:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:14] !log ppchelko@deploy1001 Started deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 5 [20:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:19] T194190: Infinite rerender loop in RESTBase - https://phabricator.wikimedia.org/T194190 [21:02:15] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@53bc0a6]: Revert bumping Parsoid content-type filter T194190, take 5 (duration: 04m 01s) [21:02:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:33] (03PS1) 10Paladox: Gerrit: Make header a blue and the text white [puppet] - 10https://gerrit.wikimedia.org/r/458593 [21:04:50] (03PS2) 10Paladox: Gerrit: Make header a blue and the text white [puppet] - 10https://gerrit.wikimedia.org/r/458593 [21:06:30] (03CR) 10Paladox: "This is what it looks like: https://phabricator.wikimedia.org/F25681185" [puppet] - 10https://gerrit.wikimedia.org/r/458593 (owner: 10Paladox) [21:19:41] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.20/includes/exception/MWExceptionHandler.php: I3f35a519b50ae (duration: 00m 58s) [21:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:45] (03PS3) 10Paladox: Gerrit: Make header a blue and the text white [puppet] - 10https://gerrit.wikimedia.org/r/458593 (https://phabricator.wikimedia.org/T200739) [21:31:08] 08Warning Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Juniper environment status [21:32:26] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.20/includes/MovePage.php: T203661 - I9ebdcbc566b (duration: 00m 57s) [21:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:32] T203661: Old page title is displayed after renaming a page until the page is subsequently edited/null edited - https://phabricator.wikimedia.org/T203661 [21:43:22] !log Deleting all user_properties rows with up_property='pagetriage-lastuse' (T202175) [21:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:28] T202175: Delete rows for 'pagetriage-lastuse' preference - https://phabricator.wikimedia.org/T202175 [21:44:46] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [21:49:47] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - 
https://atlas.ripe.net/measurements/1790947/#!map [22:00:08] (03PS2) 10Dzahn: tor: make it possible to config service running/stopped in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/458339 (https://phabricator.wikimedia.org/T196701) [22:04:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:04:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:05:27] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [22:07:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:09:26] (03CR) 10Dzahn: [C: 032] tor: make it possible to config service running/stopped in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/458339 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [22:13:18] (03CR) 10Dzahn: ":) thanks for merging" [puppet] - 10https://gerrit.wikimedia.org/r/453553 (owner: 10Dzahn) [22:13:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:14:17] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [22:14:57] (03CR) 10Dzahn: "libapache2-mod-php5 used to be pulled in by the apache module. then we converted it to httpd module and that doesn't automatically pull it" [puppet] - 10https://gerrit.wikimedia.org/r/453553 (owner: 10Dzahn) [22:16:14] !log restart mjolnir-kafka-bulk-daemon service to pickup earlier deploy. [22:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:38] 10Operations, 10ops-eqiad, 10netops: Interface errors on cr2-eqiad:xe-4/0/0 - https://phabricator.wikimedia.org/T203719 (10ayounsi) p:05Triage>03High [22:18:46] 10Operations, 10Continuous-Integration-Infrastructure, 10Mail, 10Release-Engineering-Team, and 2 others: Ensure Jenkins mail configuration supports outbound smtp server failover - https://phabricator.wikimedia.org/T203607 (10hashar) IIRC, a while ago (like in 2012) it was configured to use localhost for re... 
[22:21:31] 10Operations, 10Continuous-Integration-Infrastructure, 10Mail, 10Release-Engineering-Team, and 2 others: Ensure Jenkins mail configuration supports outbound smtp server failover - https://phabricator.wikimedia.org/T203607 (10hashar) Sorry I forgot, @herron is there a smarthost on all of our servers or does... [22:27:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:30:34] (03PS2) 10Dzahn: site: add tor_relay role to torrelay1001 [puppet] - 10https://gerrit.wikimedia.org/r/455744 (https://phabricator.wikimedia.org/T196701) [22:32:10] 08Warning Alert for device cr2-eqiad.wikimedia.org - Inbound interface errors got acknowledged [22:32:37] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:37:12] (03CR) 10Dzahn: "weird, looks like the tests for the DHCP server are failing due to:" [puppet] - 10https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) (owner: 10Elukey) [22:38:08] 08̶W̶a̶r̶n̶i̶n̶g Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Juniper environment status [22:38:16] (03CR) 10Faidon Liambotis: Pass argv from to main() -> parse_args -> argparse (031 comment) [software/keyholder] - 10https://gerrit.wikimedia.org/r/458223 (owner: 10Faidon Liambotis) [22:41:29] (03CR) 10Dzahn: [C: 031] "lgtm, just that switch ports need to be disabled as well when doing this" [puppet] - 10https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) (owner: 10Elukey) [22:41:50] (03PS2) 10Dzahn: Remove meitnerium (old archiva host) from puppet [puppet] - 10https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) (owner: 10Elukey) [22:42:27] (03CR) 10jerkins-bot: [V: 04-1] Remove meitnerium (old archiva host) from puppet [puppet] - 
10https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) (owner: 10Elukey) [22:42:33] out for belated lunch, bbl [22:43:34] (03PS3) 10Andrew Bogott: Added a script to migrate nova quotas between region [puppet] - 10https://gerrit.wikimedia.org/r/458587 [22:50:25] (03CR) 10Andrew Bogott: [C: 032] Added a script to migrate nova quotas between region [puppet] - 10https://gerrit.wikimedia.org/r/458587 (owner: 10Andrew Bogott) [22:51:30] Do we re-schedule again on account of CindyCicaleseWMF's 2 patches? [22:58:43] (03PS1) 10Bmansurov: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458610 (https://phabricator.wikimedia.org/T191086) [22:58:57] PROBLEM - Filesystem available is greater than filesystem size on ms-be1041 is CRITICAL: cluster=swift device=/dev/sdn1 fstype=xfs instance=ms-be1041:9100 job=node mountpoint=/srv/swift-storage/sdn1 site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1041&var-datasource=eqiad%2520prometheus%252Fops [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180906T2300). [23:00:05] Ebe123 and CindyCicaleseWMF: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:19] I'm here :-) [23:02:40] Let me know if you need anything from me. This is my first SWAT. [23:05:41] I can SWAT [23:05:59] Ebe1 doesn't seem to be around [23:06:24] thank you! [23:06:49] I just saw their message - I didn't mean to bump them! 
[23:07:42] CindyCicaleseWMF: since yours are all l10n stuff, I'm going to deploy them all at once [23:07:54] makes sense [23:08:52] syncing them out is going to require a full scap which can take a while, but mostly just waiting for l10n cache to rebuild [23:09:21] since there aren't any code changes I'll start scap once they are merged by jenkins [23:09:33] Yup, I went through this twice yesterday with legoktm. It did take a while. [23:10:00] cool, just wanted to make sure you were prepared :) [23:10:09] lol - yes, thanks :-) [23:13:17] (03PS2) 10Faidon Liambotis: Pass argv from to main() -> parse_args -> argparse [software/keyholder] - 10https://gerrit.wikimedia.org/r/458223 [23:13:20] (03PS2) 10Faidon Liambotis: Add setuptools, LICENSE, README.rst etc. [software/keyholder] - 10https://gerrit.wikimedia.org/r/458224 [23:13:22] (03PS2) 10Faidon Liambotis: Add pytest support for unit/integration testing [software/keyholder] - 10https://gerrit.wikimedia.org/r/458225 [23:13:24] (03PS2) 10Faidon Liambotis: Don't barf on an empty or invalid YAML config [software/keyholder] - 10https://gerrit.wikimedia.org/r/458226 [23:13:26] (03PS2) 10Faidon Liambotis: Drop legacy SSHv1 support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458227 [23:13:28] (03PS2) 10Faidon Liambotis: Drop MD5 (pre-6.8) digest support [software/keyholder] - 10https://gerrit.wikimedia.org/r/458228 [23:13:30] (03PS2) 10Faidon Liambotis: Don't drop the colon between hash type/digest [software/keyholder] - 10https://gerrit.wikimedia.org/r/458229 [23:13:32] (03PS2) 10Faidon Liambotis: Only show tracebacks on DEBUG logging levels [software/keyholder] - 10https://gerrit.wikimedia.org/r/458230 [23:13:34] (03PS2) 10Faidon Liambotis: Respond with SSH_AGENT_FAILURE on protocol errors [software/keyholder] - 10https://gerrit.wikimedia.org/r/458231 [23:13:36] (03PS2) 10Faidon Liambotis: Switch to using Enum for SSH protocol codes [software/keyholder] - 10https://gerrit.wikimedia.org/r/458232 [23:13:42] 
(03PS2) 10Faidon Liambotis: Switch to Construct for the SSH agent protocol [software/keyholder] - 10https://gerrit.wikimedia.org/r/458233 [23:13:44] (03PS2) 10Faidon Liambotis: Split handle_client_request() into multiple methods [software/keyholder] - 10https://gerrit.wikimedia.org/r/458234 [23:13:46] (03PS2) 10Faidon Liambotis: Stop referring to the daemon as a "proxy" [software/keyholder] - 10https://gerrit.wikimedia.org/r/458235 [23:13:48] (03PS2) 10Faidon Liambotis: Implement all the SSH agent bits and stop proxying [software/keyholder] - 10https://gerrit.wikimedia.org/r/458236 [23:13:50] James_F: I'm here to deploy! [23:13:50] (03PS2) 10Faidon Liambotis: Split SshAgentCommand type to Request/Response [software/keyholder] - 10https://gerrit.wikimedia.org/r/458237 [23:13:52] (03PS2) 10Faidon Liambotis: Make pylint a little happier [software/keyholder] - 10https://gerrit.wikimedia.org/r/458238 [23:13:54] (03PS2) 10Faidon Liambotis: Use mlockall() to avoid any potential swapping [software/keyholder] - 10https://gerrit.wikimedia.org/r/458239 [23:13:56] (03PS2) 10Faidon Liambotis: Add permission checks for various commands [software/keyholder] - 10https://gerrit.wikimedia.org/r/458240 [23:13:58] (03PS2) 10Faidon Liambotis: Verify the validity of signature requests [software/keyholder] - 10https://gerrit.wikimedia.org/r/458241 [23:14:00] (03PS2) 10Faidon Liambotis: Implement SSH_AGENTC_LOCK/SSH_AGENTC_UNLOCK [software/keyholder] - 10https://gerrit.wikimedia.org/r/458242 [23:14:04] (03PS2) 10Faidon Liambotis: Parse/build agent request/responses once [software/keyholder] - 10https://gerrit.wikimedia.org/r/458243 [23:14:06] (03PS2) 10Faidon Liambotis: Refactor handle() [software/keyholder] - 10https://gerrit.wikimedia.org/r/458244 [23:14:08] (03PS2) 10Faidon Liambotis: Add compatibility with Construct 2.8.22 and 2.9.45 [software/keyholder] - 10https://gerrit.wikimedia.org/r/458245 [23:14:10] (03PS2) 10Faidon Liambotis: Switch path handling to pathlib.Path 
[software/keyholder] - 10https://gerrit.wikimedia.org/r/458246 [23:14:12] (03PS2) 10Faidon Liambotis: Unlink the Unix domain socket when exiting [software/keyholder] - 10https://gerrit.wikimedia.org/r/458247 [23:14:14] (03PS2) 10Faidon Liambotis: Abstract the SSH fingerprint generation [software/keyholder] - 10https://gerrit.wikimedia.org/r/458248 [23:14:16] (03PS2) 10Faidon Liambotis: Stop spawning ssh-keygen but generate fps ourselves [software/keyholder] - 10https://gerrit.wikimedia.org/r/458249 [23:15:01] (03CR) 10Faidon Liambotis: Add setuptools, LICENSE, README.rst etc. (031 comment) [software/keyholder] - 10https://gerrit.wikimedia.org/r/458224 (owner: 10Faidon Liambotis) [23:15:11] Is it too late to add one more translation patch? I completely understand if the answer is no. [23:17:03] thcipriani: ^^ [23:17:26] thanks, paladox! [23:17:29] CindyCicaleseWMF: sure, they all take the same amount of time before I start sync :) [23:17:45] greg-g: Given there's no train next week, can there be a Tuesday "morning" SWAT slot added at 18:00 UTC? (It's currently empty.) [23:17:47] well, minus the 5 minutes waiting for Jenkins [23:17:53] thcipriani: Ebe123 is here. [23:18:06] thecipriani: thanks! getting it ready now. [23:18:13] CindyCicaleseWMF: you're welcome :) [23:18:33] Ebe123: hi! didn't see you join, let's get your patch out if you're available [23:18:49] I'm here [23:18:50] FYI, the additional patch is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EUCopyrightCampaign/+/458616 [23:19:10] (03CR) 10Faidon Liambotis: Don't barf on an empty or invalid YAML config (032 comments) [software/keyholder] - 10https://gerrit.wikimedia.org/r/458226 (owner: 10Faidon Liambotis) [23:23:15] CindyCicaleseWMF: created wmf.20 cherry pick of that patch at https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EUCopyrightCampaign/+/458617/ everything look right there? [23:24:08] thcipriani: yes, thanks! 
I thought I needed to wait for it to be merged to cherry-pick [23:24:20] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Pine) [23:26:09] okie doke, +2'd, waiting on jenkins for merges now [23:26:19] yay! [23:27:31] Ping me with which server :) [23:27:45] Ebe123: your change to Score is live on mwdebug1002, check please [23:37:05] Ebe123: any luck? Able to test OK? [23:38:11] (03CR) 10Krinkle: "Still appears used by wikimedia/puppet – exim4.conf.mx.erb:" [dns] - 10https://gerrit.wikimedia.org/r/143762 (owner: 10Faidon Liambotis) [23:38:54] will do [23:41:21] It's good [23:41:49] thanks for checking, going live [23:43:10] !log thcipriani@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/Score/includes/Score.php: SWAT: [[gerrit:458558|Add checks for length of generated audio files]] T203560 (duration: 00m 58s) [23:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:16] T203560: Notice: Undefined index: qb4tlxyr.ogg in /srv/mediawiki/php-1.32.0-wmf.19/extensions/Score/includes/Score.php on line 507 - https://phabricator.wikimedia.org/T203560 [23:43:22] ^ Ebe123 live everywhere [23:46:51] !log thcipriani@deploy1001 Started scap: SWAT: [[gerrit:458586|Add italian translation]] T203297 [[gerrit:458609|Improve German translation]] [[gerrit:458617|German translation: Replace "Vertreter" by "EU-Abgeordnete"]] [23:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:57] T203297: Manage translations for fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T203297 [23:47:05] ^ CindyCicaleseWMF started scap sync FYI [23:47:28] thcipriani - cool, thanks for letting me know [23:49:36] Thanks! [23:50:18] thanks for the patch :)