[00:00:10] mergable or merjable? [00:00:56] !log krenair Synchronized php-1.25wmf21/extensions/Flow: https://gerrit.wikimedia.org/r/#/c/197434/ (duration: 00m 08s) [00:01:00] ebernhardson [00:01:01] Logged the message, Master [00:04:54] ebernhardson, everything ok? [00:06:08] Krenair: looks to be working well enough, i don't even have permissions to use this page though :) [00:06:29] Is there a page which you can test it on @testwiki? [00:06:58] same with permissions. [00:07:21] it can't hurt anything, it's an isolated page with very limited permissions (only accounts with staff rights currently) [00:07:49] Now would be a convenient time for one of us to have sysadmin/staff :/ [00:08:26] ebernhardson, I have an idea [00:10:26] ebernhardson, try now [00:10:29] on testwiki [00:11:50] Krenair: works as advertised. Thanks [00:11:54] yw [00:14:05] !log krenair Synchronized php-1.25wmf21/extensions/VisualEditor: https://gerrit.wikimedia.org/r/#/c/197415/ (duration: 00m 08s) [00:14:11] Logged the message, Master [00:14:32] (checked, seems OK) [00:15:20] okay, swat over [00:23:04] Krenair: Thanks! [00:28:14] legoktm, you should probably fix that before tomorrow [00:28:20] Or maybe revert and fix later [00:35:09] (PS1) Ori.livneh: Drop support for 75 languages in SyntaxHighlighter_GeSHi [mediawiki-config] - https://gerrit.wikimedia.org/r/197449 [00:40:50] (CR) Aaron Schulz: [C: 1] Drop support for 75 languages in SyntaxHighlighter_GeSHi [mediawiki-config] - https://gerrit.wikimedia.org/r/197449 (owner: Ori.livneh) [00:41:24] (CR) Legoktm: "T85794 is related" [mediawiki-config] - https://gerrit.wikimedia.org/r/197449 (owner: Ori.livneh) [00:43:34] Krenair: Looks like the VE deploy is causing problems on mobile in wmf21. If you’re done with the SWAT deploy, I’m going to do an emergency fix for the mobile breakage.
[00:43:45] yep, that's fine [00:44:22] * Krenair notes to test the mobile site in future [00:44:26] (CR) Krinkle: "Just a thought, but would it make sense to filter it with e.g. array_intersect/array_values so that it will never accidentally add values " [mediawiki-config] - https://gerrit.wikimedia.org/r/197449 (owner: Ori.livneh) [00:46:34] robh: [17:45] wikibugs (no projects): Add Stephen LaPorte to dns-admin alias - https://phabricator.wikimedia.org/T92968 (RobH) [00:46:50] robh: That's screaming "no projects" in red over in #wikimedia-dev :) [00:47:05] RoanKattouw: ? [00:47:07] it has projects. [00:47:15] yeah [00:47:16] i messed with the permissions but then changed back and did something else [00:47:22] it's a known bug with the bot [00:47:22] it was set to viewable nda only for a moment [00:47:29] wtf that's weird [00:47:30] RoanKattouw: sounds like the bot is fubar man [00:47:35] Yeah that's a bot bug [00:47:42] It's to do with it not being a public bug at the moment [00:47:46] it's restricted to logged in users [00:47:50] * robh is anti phab bot echo [00:47:51] a bunch of operations tasks have this [00:47:53] so im glad its broken! [00:47:58] wat [00:47:58] ;D [00:48:03] That's no effective restriction at all [00:48:07] Anyone can create an account, right? [00:48:09] yes [00:48:12] it's because the bot can't read it [00:48:20] i set it to NDA and the bot isnt nda [00:48:24] but the bot can get events about it [00:48:26] so thats expected [00:48:34] but again, i undid it and put it back to normal [00:48:35] robh, it's visible to all logged in users [00:48:39] and instead make a linked private ticket [00:48:41] you didn't put it back to public [00:48:46] oh [00:48:47] these are not the same thing [00:48:55] well...
still stupid for bot to bitch [00:48:57] but yea [00:49:01] (PS7) BBlack: certs: remove legacy ensure => absent Files [puppet] - https://gerrit.wikimedia.org/r/197339 (owner: Faidon Liambotis) [00:49:03] (PS7) BBlack: Kill unused/old/test certificates [puppet] - https://gerrit.wikimedia.org/r/197338 (owner: Faidon Liambotis) [00:49:05] (PS7) BBlack: sslcert: add sslcert::certificate [puppet] - https://gerrit.wikimedia.org/r/197337 (owner: Faidon Liambotis) [00:49:07] (PS7) BBlack: sslcert: add sslcert::ca define, use it from certs [puppet] - https://gerrit.wikimedia.org/r/197336 (owner: Faidon Liambotis) [00:49:07] the bot receives events from a logged in user, and looks up project data while logged out [00:49:09] (PS7) BBlack: sslcert: generate chained certs automatically [puppet] - https://gerrit.wikimedia.org/r/197341 (owner: Faidon Liambotis) [00:49:11] (PS7) BBlack: sslcert: add sslcert::chainedcert [puppet] - https://gerrit.wikimedia.org/r/197340 (owner: Faidon Liambotis) [00:49:13] (PS1) BBlack: Introduce a new sslcert module to replace certs.pp [puppet] - https://gerrit.wikimedia.org/r/197458 [00:49:15] (PS1) BBlack: replace certificates::base code with ref to ::sslcert [puppet] - https://gerrit.wikimedia.org/r/197459 [00:49:18] Ops-Access-Requests, operations: Add Stephen LaPorte to dns-admin alias - https://phabricator.wikimedia.org/T92968#1127096 (RobH) [00:49:39] (Abandoned) BBlack: Introduce a new sslcert module (to replace certs.pp) (#2) [puppet] - https://gerrit.wikimedia.org/r/197404 (owner: BBlack) [00:49:42] and the bots logged in user is a nothing special permissions user i assume [00:49:53] since it doesnt have this issue for tickets with restrictive reads [00:49:57] only read for all user [00:50:03] definitely does not have WMF-NDA access :) [00:50:06] heh [00:50:09] just a normal logged in user AFAIK [00:50:13] Yeah let's not give the bot NDA access :) [00:50:31] I think
it runs in tools :) [00:50:37] Oh right :) [00:50:55] operations, Citoid, VisualEditor, VisualEditor 2014/15 Q3 blockers: Improve citoid production service - https://phabricator.wikimedia.org/T90281#1127102 (Jdforrester-WMF) [00:50:56] operations, Citoid, Services, Patch-For-Review: Provide service alerting/statistics for the citoid and zotero services - https://phabricator.wikimedia.org/T87496#1127100 (Jdforrester-WMF) Open>Resolved Stats collection will go live in the next citoid service production deploy. [00:51:08] operations, Citoid, Services: Provide service alerting/statistics for the citoid and zotero services - https://phabricator.wikimedia.org/T87496#1127103 (Jdforrester-WMF) [00:51:17] (CR) BBlack: [C: 2] Introduce a new sslcert module to replace certs.pp [puppet] - https://gerrit.wikimedia.org/r/197458 (owner: BBlack) [00:51:32] operations, Citoid, VisualEditor, VisualEditor 2014/15 Q3 blockers: Improve citoid production service - https://phabricator.wikimedia.org/T90281#1127104 (Jdforrester-WMF) Open>Resolved a:Jdforrester-WMF All dependencies closed. Marking as fixed. [00:51:35] I think the code that receives events is separate from the IRC bot [00:51:40] hence the difference in login status [00:52:11] Of course, you could just not have logged-in-users-only tasks [00:52:35] most of it runs as a logged in user, but the part that gets projects is forced to screen scrape as an anonymous user [00:52:45] Ops-Access-Requests, operations: Add Stephen LaPorte to dns-admin alias - https://phabricator.wikimedia.org/T92968#1127113 (RobH) [00:53:36] so i set a blocking task to all private but its existence still isnt... [00:53:49] durn, was hoping that fix had been applied when i wasnt paying attention [00:54:19] I'm going to test scap right now, and it should fail on syntax errors...
[00:54:26] !log legoktm Started scap: (no message) [00:54:33] Logged the message, Master [00:54:59] !log testing scap for syntax errors bug [00:55:03] Logged the message, Master [00:56:13] hm, I probably should have just sync-dir'd everything. [00:56:31] !log legoktm scap aborted: (no message) (duration: 02m 04s) [00:56:34] Logged the message, Master [00:57:09] 6operations, 10ops-codfw: mw2008 has Hyper threading disabled - https://phabricator.wikimedia.org/T92738#1127120 (10Dzahn) p:5Triage>3Normal [00:58:17] 00:57:55 sync-file failed: /srv/mediawiki-staging/php-1.25wmf20/tests/parser/parserTests.txt has content before opening PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: Puppet has 26 failures [00:58:43] PROBLEM - puppet last run on virt1005 is CRITICAL: CRITICAL: puppet fail [00:58:43] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: puppet fail [00:58:54] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Puppet has 8 failures [00:59:02] PROBLEM - puppet last run on mw2049 is CRITICAL: CRITICAL: puppet fail [00:59:02] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: puppet fail [00:59:02] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: puppet fail [00:59:02] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: puppet fail [00:59:18] (03CR) 10BBlack: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/197459 (owner: 10BBlack) [00:59:23] oh [00:59:23] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: puppet fail [00:59:24] facepalm [00:59:33] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Puppet has 7 failures [00:59:42] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Puppet has 3 failures [00:59:52] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 7 failures [00:59:53] PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Puppet has 7 failures [00:59:53] (03CR) 10Ori.livneh: "The list of languages that get dropped is: 'f1', 
'gambas', 'proftpd', 'klonecpp', 'cuesheet', 'oxygene', 'smarty', 'vim', 'klonec', 'frees" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197449 (owner: 10Ori.livneh) [01:00:02] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 4 failures [01:00:03] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 12 failures [01:00:03] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: Puppet has 39 failures [01:00:05] ^ I suspect that's from me restarting the puppetmasters, in which case it should be transient and quick [01:00:22] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 5 failures [01:01:13] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: puppet fail [01:01:24] (03CR) 10Dzahn: "i didn't expect vim and nagios to be in the list of dropped languages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197449 (owner: 10Ori.livneh) [01:02:23] !log legoktm Synchronized php-1.25wmf20/tests/parser/: testing syntax error bug (duration: 00m 05s) [01:02:27] Logged the message, Master [01:02:37] Krenair: you only saw the error with sync-file right? [01:02:52] yes [01:03:21] I got another one on some scribunto thing via sync-dir [01:03:27] which will probably break scap tomorrow [01:03:32] yeah, fixing both [01:03:50] (03PS2) 10Ori.livneh: Drop support for 75 languages in SyntaxHighlighter_GeSHi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197449 [01:03:58] one was that we were checking for patches coming in a minute [01:04:38] Krenair: did you file bugs for this? [01:04:43] No [01:04:49] I complained on the patch [01:04:52] ok :P [01:05:27] doh. I should have caught the sync-file one legoktm :/ [01:05:48] !log kaldari Synchronized php-1.25wmf21/extensions/VisualEditor: syncing update to VE to fix mobile (duration: 00m 06s) [01:05:48] (03CR) 10Ori.livneh: [C: 032] "@Krinkle: Good idea; done." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/197449 (owner: 10Ori.livneh) [01:05:50] (03CR) 10Jforrester: [C: 031] "But let's split into more sane individual modules and a "most common" one or whatever longer term." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197449 (owner: 10Ori.livneh) [01:05:52] Logged the message, Master [01:05:53] (03Merged) 10jenkins-bot: Drop support for 75 languages in SyntaxHighlighter_GeSHi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197449 (owner: 10Ori.livneh) [01:05:55] The !log ori Synchronized wmf-config/CommonSettings.php: Idf3491140: Drop support for 75 languages in SyntaxHighlighter_GeSHi (duration: 00m 05s) [01:06:42] Logged the message, Master [01:07:03] bd808: we could. But isn't ori: removing it from that array will result in a Parser error though, there's not even a plain fallback to showing the raw source code. Perhaps something to change in the extension? [01:07:52] in 5.3 yes if you enable short tags in php.ini. After 5.6 (5.5?) it's always valid [01:07:52] I guess if they're not used at all, we can just remove them. Not sure. [01:08:17] I'm not sure that they are not used at all. I didn't see them in two hours' worth of logs. [01:08:39] We should combine the more obscure languages, IMO. [01:09:06] (03PS1) 10Legoktm: Have utils.check_php_opening_tag check the file extension suffix [tools/scap] - 10https://gerrit.wikimedia.org/r/197460 [01:09:15] ori: I wouldn't be surprised if they're used on smaller wikis. 
Changing the page rendering from syntax highlighted to "Error: code is invalid" (the geshi extension is really verbose in the handling of such error, it's horrible in its own right) [01:09:33] (03CR) 10BBlack: [C: 032] replace certificates::base code with ref to ::sslcert [puppet] - 10https://gerrit.wikimedia.org/r/197459 (owner: 10BBlack) [01:09:50] Krinkle: I'll leave a logger on for 24h and see [01:09:59] (03PS2) 10Legoktm: Have utils.check_php_opening_tag check the file extension suffix [tools/scap] - 10https://gerrit.wikimedia.org/r/197460 [01:10:01] ^ may break puppet runs again. if so, I have a revert prepped :P [01:10:04] bd808: ^ [01:10:32] but yeah, the extension should fall back to just plain text [01:10:50] ori: cool. Where are you logging it for the 24h? [01:10:55] s/Where/How [01:11:23] (03CR) 10BryanDavis: [C: 032] Have utils.check_php_opening_tag check the file extension suffix [tools/scap] - 10https://gerrit.wikimedia.org/r/197460 (owner: 10Legoktm) [01:11:39] (03Merged) 10jenkins-bot: Have utils.check_php_opening_tag check the file extension suffix [tools/scap] - 10https://gerrit.wikimedia.org/r/197460 (owner: 10Legoktm) [01:11:49] legoktm: Do you want to update beta and prod to use that or do you want me to? [01:11:57] success! (stupid puppet) [01:12:00] Krinkle: actually, looking at the code, it looks like it does fall back to text [01:12:09] bd808: how do I do that? [01:12:14] heh [01:12:18] I'll do it :) [01:12:22] note to future self: when adding a new module, one must add the module in a commit, deploy, restart puppetmasters, then do a second commit to actually use the module [01:12:30] see formatLanguageError (in SyntaxHighlight_GeSHi.class.php) [01:12:31] it's a trebuchet deploy for both [01:12:36] alright :D [01:12:51] * legoktm goes to find files with short tags [01:12:57] ori: Right. 
It outputs it below the big error [01:13:08] having the error (so that someone can fix it) but the plain source as well seems like a fine result for the tiny tail end of edge cases [01:13:32] PROBLEM - puppet last run on protactinium is CRITICAL: CRITICAL: Puppet has 1 failures [01:13:33] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Puppet has 1 failures [01:13:33] PROBLEM - puppet last run on mw1074 is CRITICAL: CRITICAL: Puppet has 1 failures [01:13:53] PROBLEM - puppet last run on mw2038 is CRITICAL: CRITICAL: Puppet has 1 failures [01:13:53] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 1 failures [01:14:03] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: Puppet has 1 failures [01:14:13] PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: Puppet has 1 failures [01:14:13] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Puppet has 1 failures [01:14:13] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Puppet has 1 failures [01:14:16] bblack: ^ [01:14:22] PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 1 failures [01:14:22] PROBLEM - puppet last run on mw1087 is CRITICAL: CRITICAL: Puppet has 1 failures [01:14:23] PROBLEM - puppet last run on mc1013 is CRITICAL: CRITICAL: Puppet has 1 failures [01:14:34] PROBLEM - puppet last run on wtp1011 is CRITICAL: CRITICAL: Puppet has 1 failures [01:14:37] meh [01:14:49] I did check several test hosts, I wonder if this is transient also [01:14:50] Error: /Stage[main]/Certificates::Base/File[/etc/apparmor.d/abstractions/ssl_certs]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///files/ssl/ssl_certs [01:14:52] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Puppet has 1 failures [01:14:55] ori: Yeah. It's a larger issue that we have no good way to output information for editors. [01:15:07] This kind of stuff would just look weird to a reader. 
[01:15:24] PROBLEM - puppet last run on mw2027 is CRITICAL: CRITICAL: Puppet has 1 failures [01:15:38] ori: yeah I'm looking... [01:16:02] RECOVERY - puppet last run on virt1005 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [01:16:03] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [01:16:03] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [01:16:05] but, the file is there in the repo, again I think this is a case of some kind master cache/sync issue for modules [01:16:06] !log Trebuchet error from mw1222 for scap deploy (status code 128), no response from mw2003 [01:16:09] Logged the message, Master [01:16:19] !log mw2008 rebooting to fix BIOS HT setting [01:16:22] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [01:16:22] Logged the message, Master [01:16:23] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [01:16:23] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [01:16:24] yup, I re-ran some of the failing nodes above, and they succeeded [01:16:44] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:16:54] RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [01:16:54] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [01:17:03] bd808: Scribunto is the only WMF-deployed extension that has issues with short tags: https://gerrit.wikimedia.org/r/197461 [01:17:03] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures 
[01:17:03] RECOVERY - puppet last run on elastic1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:17:22] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [01:17:22] RECOVERY - puppet last run on mw1074 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [01:17:22] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [01:17:33] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [01:17:37] so, this will all eventually fix all the puppetfails, once the masters both decide to actually serve the committed content faithfully. which seems to mostly be happening now. [01:17:42] RECOVERY - puppet last run on mw2049 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [01:17:42] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [01:17:51] bd808: should we do backports now? [01:17:51] legoktm: +2'd [01:17:53] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [01:17:57] (03PS1) 1020after4: Improved test for content preceeding (03CR) 10jenkins-bot: [V: 04-1] Improved test for content preceeding I don't think its needed actually. we only scan on sync-dir right? [01:18:32] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [01:18:34] and scap right? 
[01:18:36] scap doesn't scan all the files, just config [01:18:39] oh [01:18:56] it assumes that Jenkins scanned branches [01:19:35] !log Updated scap to Ie1d1642 (Have utils.check_php_opening_tag check the file extension suffix) [01:19:38] Logged the message, Master [01:19:48] ok [01:20:15] (03CR) 10BBlack: [C: 032] sslcert: add sslcert::ca define, use it from certs [puppet] - 10https://gerrit.wikimedia.org/r/197336 (owner: 10Faidon Liambotis) [01:21:16] 6operations: Delete stat1002:/a/squid/archive/bannerImpressions - https://phabricator.wikimedia.org/T92330#1127212 (10kevinator) @Jgreen @ellery There are only 2 files in this directory and they date back from Sep 28 2012. I think it's safe to delete them. [01:21:35] hey opsen, deployment-salt is critical right now. anyone available to lend a hand? [01:21:38] 6operations, 10ops-codfw: mw2008 has Hyper threading disabled - https://phabricator.wikimedia.org/T92738#1127213 (10Dzahn) rebooted to BIOS and enabled HT ("Logical Processor: enabled"). it's back up now [01:21:40] 6operations: Delete stat1002:/a/squid/archive/bannerImpressions - https://phabricator.wikimedia.org/T92330#1127214 (10kevinator) p:5Normal>3High [01:21:53] i'm seeing tons of sustained io wait (95%) [01:22:07] (03PS2) 1020after4: Improved test for content preceeding (03CR) 10BBlack: [C: 032] sslcert: add sslcert::certificate [puppet] - 10https://gerrit.wikimedia.org/r/197337 (owner: 10Faidon Liambotis) [01:22:49] 6operations: Delete stat1002:/a/squid/archive/arabic-banner - https://phabricator.wikimedia.org/T92329#1127217 (10kevinator) p:5Normal>3High No one spoke up to keep this. Proceed with deletion. [01:23:31] 6operations: Delete stat1002:/a/squid/archive/sampled-geocoded - https://phabricator.wikimedia.org/T92334#1127219 (10kevinator) p:5Normal>3High Proceed with deletion. 
[01:23:53] (03CR) 10BBlack: [C: 032] Kill unused/old/test certificates [puppet] - 10https://gerrit.wikimedia.org/r/197338 (owner: 10Faidon Liambotis) [01:23:59] (03CR) 10Legoktm: Improved test for content preceeding 6operations, 10ops-codfw: mw2008 has Hyper threading disabled - https://phabricator.wikimedia.org/T92738#1127221 (10Dzahn) 5Open>3Resolved a:3Dzahn root@mw2008:~# lshw -class cpu | grep config configuration: cores=6 enabledcores=6 threads=12 configuration: cores=6 enabledcores=6 threads=12... [01:24:52] !log legoktm Synchronized php-1.25wmf20/tests/parser/parserTests.txt: testing syntax error bug (duration: 00m 07s) [01:24:56] Krenair: ^ [01:24:57] Logged the message, Master [01:24:59] 6operations: Delete stat1002:/a/squid/archive/mobile-geocoded - https://phabricator.wikimedia.org/T92333#1127225 (10kevinator) p:5Normal>3High Proceed with deletion. [01:25:43] legoktm, could you do wmf21 as well? [01:25:55] 6operations: Delete stat1002:/a/squid/archive/edits-geocoded - https://phabricator.wikimedia.org/T92332#1127227 (10kevinator) p:5Normal>3High rm empty directory [01:26:07] !log legoktm Synchronized php-1.25wmf21/tests/parser/parserTests.txt: testing syntax error bug (duration: 00m 07s) [01:26:11] Logged the message, Master [01:27:53] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: puppet fail [01:28:05] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: puppet fail [01:28:39] (03PS3) 10BryanDavis: Improved test for content preceeding 6operations: Delete gadolinium:/a/log/nginx/ - https://phabricator.wikimedia.org/T92337#1127229 (10kevinator) p:5Normal>3High Proceed with deletion [01:29:13] PROBLEM - puppet last run on virt1001 is CRITICAL: CRITICAL: puppet fail [01:29:13] PROBLEM - puppet last run on antimony is CRITICAL: CRITICAL: puppet fail [01:29:43] PROBLEM - puppet last run on virt1004 is CRITICAL: CRITICAL: puppet fail [01:29:53] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: puppet fail [01:30:09] (03CR) 
10BryanDavis: Improved test for content preceeding RECOVERY - puppet last run on mw1087 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [01:30:33] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: puppet fail [01:30:42] PROBLEM - puppet last run on virt1003 is CRITICAL: CRITICAL: puppet fail [01:30:52] PROBLEM - puppet last run on plutonium is CRITICAL: CRITICAL: puppet fail [01:31:02] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [01:31:19] (03CR) 10Legoktm: "@20after4: We fixed that in Ie1d16423787a25e3c45e77d9447e8e2d51fd0299, to have the function check for the extension suffix." [tools/scap] - 10https://gerrit.wikimedia.org/r/197462 (https://phabricator.wikimedia.org/T92534) (owner: 1020after4) [01:31:22] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [01:31:22] RECOVERY - puppet last run on mw2038 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [01:31:23] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:31:33] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [01:31:33] RECOVERY - puppet last run on db1054 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [01:31:37] (03CR) 1020after4: "I see, concurrent development. doh!" 
[tools/scap] - 10https://gerrit.wikimedia.org/r/197462 (https://phabricator.wikimedia.org/T92534) (owner: 1020after4) [01:31:44] RECOVERY - puppet last run on mw2027 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [01:31:46] RECOVERY - puppet last run on mc1013 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:31:53] PROBLEM - puppet last run on virt1007 is CRITICAL: CRITICAL: puppet fail [01:32:02] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [01:32:12] RECOVERY - puppet last run on protactinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:32:12] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [01:32:32] (03CR) 1020after4: "I still think this is a bit of an improvement to the readability of check_php_opening_tag. Merge or abandon?" [tools/scap] - 10https://gerrit.wikimedia.org/r/197462 (https://phabricator.wikimedia.org/T92534) (owner: 1020after4) [01:32:34] PROBLEM - puppet last run on virt1010 is CRITICAL: CRITICAL: puppet fail [01:32:53] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:32:54] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:32:58] (03PS1) 10BBlack: Fix key source temporarily [puppet] - 10https://gerrit.wikimedia.org/r/197463 [01:33:13] (03CR) 10Legoktm: "I think the "if ' PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: puppet fail [01:34:19] (03PS2) 10BBlack: Fix key source temporarily [puppet] - 10https://gerrit.wikimedia.org/r/197463 [01:34:43] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: puppet fail [01:34:58] 6operations: Delete gadolinium:/a/log/fundraising/ - https://phabricator.wikimedia.org/T92336#1127234 (10kevinator) 
p:5Normal>3Low @Jgreen these are a really naive questions: does it still need to be mounted here? can we unmount it? [01:35:14] (03CR) 10BBlack: [C: 032] Fix key source temporarily [puppet] - 10https://gerrit.wikimedia.org/r/197463 (owner: 10BBlack) [01:35:49] (03PS4) 1020after4: Improved test for content preceeding (03CR) 10jenkins-bot: [V: 04-1] Improved test for content preceeding PROBLEM - puppet last run on iodine is CRITICAL: CRITICAL: puppet fail [01:36:33] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [01:36:33] PROBLEM - puppet last run on virt1011 is CRITICAL: CRITICAL: puppet fail [01:36:36] (03CR) 1020after4: [C: 031] "removed the check for if ' 6operations: Delete stat1002:/a/squid/archive/sopa - https://phabricator.wikimedia.org/T92344#1127236 (10kevinator) p:5Normal>3High Proceed with deletion [01:37:03] PROBLEM - puppet last run on virt1008 is CRITICAL: CRITICAL: puppet fail [01:37:38] 6operations: Delete stat1002:/a/squid/archive/teahouse - https://phabricator.wikimedia.org/T92335#1127238 (10kevinator) p:5Normal>3High proceed with deletion [01:38:31] 6operations, 10Wikimedia-Blog: Delete stat1002:/a/squid/archive/blog - https://phabricator.wikimedia.org/T92331#1127241 (10kevinator) p:5Normal>3High proceed with deletion [01:38:52] RECOVERY - puppet last run on virt1010 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [01:39:05] (03PS5) 1020after4: Improved test for content preceeding (03CR) 10jenkins-bot: [V: 04-1] Improved test for content preceeding (03CR) 1020after4: "well without that check, it fails some tests." 
[tools/scap] - 10https://gerrit.wikimedia.org/r/197462 (owner: 1020after4) [01:40:43] PROBLEM - puppet last run on wtp1011 is CRITICAL: CRITICAL: puppet fail [01:41:43] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [01:43:23] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:43:23] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:43:23] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:43:23] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:43:23] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:43:23] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:43:24] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:43:24] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:43:24] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:44:33] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [01:44:34] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [01:44:34] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [01:44:42] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [01:44:42] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [01:44:42] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [01:44:42] RECOVERY - puppet last run on cp4012 is OK: OK: 
Puppet is currently enabled, last run 56 seconds ago with 0 failures [01:44:42] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [01:44:43] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [01:45:04] 6operations, 10Wikimedia-Shop, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1127251 (10Dzahn) @vshchepakina confirmed i have received the invitation to the shop and could login. The admin interface or the docs don't seem to mention certificates at all... [01:45:06] ^ and now we'll get a spam of roughly 200 total messages of that nature over the coming half-hour or so. The way that check recovers from long-term disable is silly. [01:45:16] 6operations: Purge > 90 days stat1002:/a/squid/archive/sampled - https://phabricator.wikimedia.org/T92342#1127252 (10kevinator) p:5Normal>3Low These logs are presently being used for QA tests on the new pageview definition. That QA should wrap up soon and free us to purge the older logs. 
[01:45:33] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [01:45:53] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:45:53] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:45:53] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:45:53] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:45:53] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:45:54] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:45:54] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:45:55] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:45:55] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:45:56] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:47:12] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [01:47:12] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:47:12] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:47:12] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:47:12] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:47:13] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:47:13] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [01:47:14] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [01:47:14] 
RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [01:47:15] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:47:15] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [01:47:16] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:47:16] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [01:47:17] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [01:47:44] (03PS8) 10BBlack: certs: remove legacy ensure => absent Files [puppet] - 10https://gerrit.wikimedia.org/r/197339 (owner: 10Faidon Liambotis) [01:47:46] (03PS8) 10BBlack: sslcert: generate chained certs automatically [puppet] - 10https://gerrit.wikimedia.org/r/197341 (owner: 10Faidon Liambotis) [01:47:48] (03PS8) 10BBlack: sslcert: add sslcert::chainedcert [puppet] - 10https://gerrit.wikimedia.org/r/197340 (owner: 10Faidon Liambotis) [01:47:53] RECOVERY - puppet last run on virt1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [01:47:53] RECOVERY - puppet last run on antimony is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:48:03] RECOVERY - puppet last run on virt1003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [01:48:23] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [01:48:23] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [01:48:23] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [01:48:23] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently 
enabled, last run 32 seconds ago with 0 failures [01:48:23] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [01:48:23] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [01:48:23] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [01:48:24] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [01:48:24] RECOVERY - puppet last run on virt1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:48:28] (03CR) 10BBlack: [C: 04-1] "These needs to hold a bit for the .p12 part to actually push around, will do later tonight..." [puppet] - 10https://gerrit.wikimedia.org/r/197339 (owner: 10Faidon Liambotis) [01:48:33] PROBLEM - puppet last run on cp1070 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:48:33] RECOVERY - puppet last run on cp1074 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:48:33] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:49:02] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:49:14] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:49:14] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:49:14] RECOVERY - puppet last run on virt1007 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [01:49:22] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:49:32] RECOVERY - puppet last run on plutonium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [01:49:42] PROBLEM - puppet last run on cp1052 is CRITICAL: 
CRITICAL: Puppet last ran 6 hours ago [01:49:43] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:49:49] 6operations: Purge > 90 days stat1002:/a/squid/archive/glam_nara - https://phabricator.wikimedia.org/T92340#1127257 (10kevinator) @leila are you OK with purging older data? [01:49:52] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:49:52] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [01:49:52] PROBLEM - puppet last run on cp1008 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:49:52] PROBLEM - puppet last run on cp1043 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:49:53] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:49:53] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:50:03] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:50:13] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [01:50:24] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [01:50:32] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:50:53] PROBLEM - puppet last run on amssq58 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:50:53] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:50:53] PROBLEM - puppet last run on amssq57 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:50:53] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:50:53] PROBLEM - puppet last run on amssq62 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:50:53] PROBLEM - puppet last run on amssq53 is CRITICAL: CRITICAL: Puppet last ran 
6 hours ago [01:50:53] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:50:54] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:50:54] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:50:55] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:50:55] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [01:50:56] RECOVERY - puppet last run on cp1050 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [01:51:03] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [01:51:03] RECOVERY - puppet last run on cp1008 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [01:51:03] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [01:51:03] RECOVERY - puppet last run on cp1044 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [01:51:03] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [01:51:13] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:51:53] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [01:52:03] PROBLEM - puppet last run on amssq43 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:52:04] PROBLEM - puppet last run on amssq50 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:52:04] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:52:04] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [01:52:04] 
RECOVERY - puppet last run on amssq58 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [01:52:04] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [01:52:04] RECOVERY - puppet last run on amssq62 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:52:05] RECOVERY - puppet last run on amssq53 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [01:52:05] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [01:52:06] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:52:06] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:52:07] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [01:52:07] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [01:52:08] PROBLEM - puppet last run on amssq44 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:53:22] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:53:22] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:53:22] RECOVERY - puppet last run on amssq43 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:53:22] PROBLEM - puppet last run on amssq41 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:53:22] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [01:53:23] RECOVERY - puppet last run on amssq50 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [01:53:23] PROBLEM - puppet last run on amssq34 is CRITICAL: CRITICAL: Puppet last ran 6 hours 
ago [01:53:24] PROBLEM - puppet last run on amssq39 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:53:24] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:53:25] PROBLEM - puppet last run on amssq37 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:53:25] RECOVERY - puppet last run on amssq44 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:53:26] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [01:53:26] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [01:53:27] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:53:37] (03PS6) 1020after4: Improved test for content preceeding RECOVERY - puppet last run on iodine is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [01:54:32] RECOVERY - puppet last run on virt1008 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [01:54:33] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [01:54:33] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [01:54:33] RECOVERY - puppet last run on amssq41 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [01:54:33] PROBLEM - puppet last run on amssq31 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:54:33] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [01:54:33] RECOVERY - puppet last run on amssq34 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [01:54:33] (03CR) 1020after4: [C: 032] Improved test for content preceeding RECOVERY - puppet last run on amssq39 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 
0 failures [01:54:34] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [01:54:35] RECOVERY - puppet last run on amssq40 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [01:54:35] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:54:36] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [01:54:36] RECOVERY - puppet last run on amssq37 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [01:54:48] (03Merged) 10jenkins-bot: Improved test for content preceeding RECOVERY - puppet last run on virt1011 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:55:44] RECOVERY - puppet last run on amssq31 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [01:55:44] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:56:51] 6operations: Purge > 90 days stat1002:/a/squid/archive/edits - https://phabricator.wikimedia.org/T92339#1127259 (10kevinator) @dartar , @Jdforrester-WMF do you know anything about these logs and what they are used for? I'd like to automate deletion of logs older than 90 days, or even remove everything altogethe... [01:57:08] I think that's the end of the spam, sorry!
[02:02:59] !log l10nupdate Synchronized php-1.25wmf20/cache/l10n: (no message) (duration: 00m 04s) [02:03:06] Logged the message, Master [02:04:07] !log Updated scap to I58e817b (Improved test for content preceeding !log LocalisationUpdate completed (1.25wmf20) at 2015-03-18 02:03:05+00:00 [02:04:11] Logged the message, Master [02:04:14] Logged the message, Master [02:04:29] (03PS1) 10BBlack: Fix ldap SSL cert deps [puppet] - 10https://gerrit.wikimedia.org/r/197465 [02:04:44] (03PS14) 10Ori.livneh: Gzip SVGs on back upload varnishes. [puppet] - 10https://gerrit.wikimedia.org/r/108484 (https://bugzilla.wikimedia.org/54291) [02:04:52] bblack: ;) [02:04:54] (03CR) 10BBlack: [C: 032 V: 032] Fix ldap SSL cert deps [puppet] - 10https://gerrit.wikimedia.org/r/197465 (owner: 10BBlack) [02:05:33] you know you want to [02:06:04] lol [02:06:12] do you have a cronjob that rebases that every 2 weeks? :) [02:06:39] it was last rebased in december! [02:07:16] december feels like two weeks ago. everything has been a haze of work and sleep since January [02:07:27] mmmmmm... sleep [02:07:30] 6operations, 10Wikimedia-Shop, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1127295 (10Dzahn) I have talked to their support in chat and they had to escalate it to their second level. I requested they add store.wikipedia.org to the shared certificate... 
[02:07:34] RECOVERY - puppet last run on nembus is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [02:08:47] I still have really basic todo list items from everyday life that have been deferred since longer than december :p [02:09:34] RECOVERY - puppet last run on neptunium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [02:10:27] (03PS1) 10Jforrester: Enable VisualEditor in plwiki's NS 102 ('Wikiprojekt') [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197466 (https://phabricator.wikimedia.org/T92698) [02:10:33] ori: -1, has superfluous whitespace formatting change :) [02:10:53] that should keep it not my problem for at least 10 whole minutes [02:12:05] ok as far as I know, nothing's still broken from the chunk of sslcert stuff that went in today, and I'm not touching it again for hours. [02:12:11] if something related seems to break, feel free to call me. [02:12:16] off for a bit [02:21:23] (03PS1) 10Jforrester: RESTbase production enablement step 1 – ptwiki, ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197468 [02:21:25] (03PS1) 10Jforrester: RESTbase production enablement step 2 – itwiki, plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197469 [02:21:27] (03PS1) 10Jforrester: RESTbase production enablement step 3 – frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197470 [02:21:29] (03PS1) 10Jforrester: RESTbase production enablement step 4 – enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197471 [02:21:31] (03PS1) 10Jforrester: RESTbase production enablement step 5 – dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197472 [02:21:33] (03PS1) 10Jforrester: RESTbase production enablement step 6 – all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197473 [02:24:24] !log l10nupdate Synchronized php-1.25wmf21/cache/l10n: (no message) (duration: 06m 51s) [02:24:31] Logged the message, Master [02:28:52] !log 
LocalisationUpdate completed (1.25wmf21) at 2015-03-18 02:27:48+00:00 [02:28:56] Logged the message, Master [02:56:07] Hm.. is wikitech thumbscaler broken? [02:56:18] https://wikitech.wikimedia.org/wiki/Ashburn_cluster [02:56:25] https://wikitech.wikimedia.org/w/images/thumb/8/8e/Eqiad_logical.png/220px-Eqiad_logical.png [02:56:28] https://wikitech.wikimedia.org/w/images/thumb/8/8e/Eqiad_logical.png/222px-Eqiad_logical.png?foop [02:56:41] both 404 [02:59:54] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: puppet fail [03:02:10] ssh to silver.wikimedia.org timed out [03:06:42] Although I can ssh to it via bast1001 [03:07:18] very strange [03:11:55] 6operations, 10Wikimedia-Labs-wikitech-interface: wikitech.wikimedia.org thumbnails broken - https://phabricator.wikimedia.org/T93041#1127368 (10Krinkle) 3NEW [03:11:57] Krinkle, I can get that image via img_auth.php [03:12:00] https://wikitech.wikimedia.org/w/img_auth.php/8/8e/Eqiad_logical.png [03:12:29] 6operations, 10Wikimedia-Labs-wikitech-interface, 7Regression: wikitech.wikimedia.org thumbnails broken - https://phabricator.wikimedia.org/T93041#1127376 (10Krinkle) [03:12:45] Krenair: The regular image works, too https://wikitech.wikimedia.org/w/images/8/8e/Eqiad_logical.png [03:12:49] without image_auth [03:12:59] ah right [03:13:02] just the thumb [03:14:06] 6operations, 10Wikimedia-Labs-wikitech-interface, 7Regression: wikitech.wikimedia.org thumbnails broken - https://phabricator.wikimedia.org/T93041#1127381 (10Krinkle) [03:14:29] Krinkle, https://wikitech.wikimedia.org/w/images/thumb/8/8e/Eqiad_logical.png/242px-Eqiad_logical.png [03:14:52] cache :) [03:14:55] https://wikitech.wikimedia.org/w/images/thumb/8/8e/Eqiad_logical.png/604px-Eqiad_logical.png [03:14:59] where were you getting that link?
[03:15:08] First link of the bug report [03:15:46] Krinkle, okay, it's now not 404ing for me [03:16:13] what did you do [03:16:19] I didn't do anything [03:16:27] okay, first link works, second does not [03:16:40] https://wikitech.wikimedia.org/w/images/thumb/8/8e/Eqiad_logical.png/220px-Eqiad_logical.png -> OK [03:16:48] https://wikitech.wikimedia.org/w/images/thumb/8/8e/Eqiad_logical.png/222px-Eqiad_logical.png?foop -> 404 [03:16:56] https://wikitech.wikimedia.org/wiki/Category:Clusters [03:17:02] That one has a broken thumb, too [03:18:14] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [03:22:14] Krinkle, https://phabricator.wikimedia.org/P410 [03:22:31] Krenair: Yeah, cache [03:22:32] So something is expecting it to render to 222px, for some reason [03:22:44] but it's not doing so [03:22:54] What do you mean? [03:22:58] I made the 222px url manually [03:23:02] to bypass cache [03:23:36] What is expecting it to make 222px? [03:23:40] There's hidpi as well [03:23:42] ... Helpful. [03:23:53] Is the 404 handler disabled? [03:24:08] did cache get purged somehow? [03:24:20] Something updates those two somehow... [03:24:49] uh, 3 [03:26:29] Krinkle, so it's supposed to be able to take any random size? [03:26:35] I don't understand what's missing here. [03:26:42] Yes, that's how production has worked for many years and still does [03:26:56] because otherwise we'd have to create thumbnails while saving the page [03:27:04] Which doesn't scale for our purposes [03:27:19] as well as to allow cache to be purged and auto-created on-demand [03:27:23] without needing to re-parse all pages [03:29:51] any idea what bits and pieces are involved in achieving that? 
[03:30:37] Krenair: A mediawiki config flag to disable thumb creation during parsing, and an apache rewrite rule for thumb urls to thumb.php [03:30:46] (with caching in front of it) [03:36:30] More broken thumbnails (on HiDPI 2x) https://wikitech.wikimedia.org/wiki/Kennisnet_cluster [03:36:39] > var_dump( $wgUploadThumbnailRenderHttpCustomDomain ); [03:36:40] string(22) "upload.svc.eqiad.wmnet" [03:36:47] I wonder if this is part of the issue [03:38:02] It's used in includes/jobqueue/jobs/ThumbnailRenderJob.php [03:48:20] I don't think it's running that code [04:01:12] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [04:01:13] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [04:12:03] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:12:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:33:38] (03PS1) 10Dzahn: dsh: delete 'ALL' and 'cp1' groups [puppet] - 10https://gerrit.wikimedia.org/r/197477 (https://phabricator.wikimedia.org/T92259) [04:42:53] 6operations, 3Continuous-Integration-Isolation: Review Jenkins isolation architecture with Antoine - https://phabricator.wikimedia.org/T92324#1127416 (10Krinkle) [04:49:23] 6operations, 10Continuous-Integration, 6Labs: Evaluate options to make puppet errors more visible - https://phabricator.wikimedia.org/T92710#1127424 (10Krinkle) >>! In T92710#1118969, @scfc wrote: > Looking at http://shinken.wmflabs.org/service/integration-slave1402/Puppet%20failure, shinken seems to have no... [04:58:23] RECOVERY - Graphite Carbon on graphite2001 is OK: OK: All defined Carbon jobs are runnning. [05:02:12] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. 
[05:05:26] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Mar 18 05:04:23 UTC 2015 (duration 4m 22s) [05:05:32] Logged the message, Master [05:34:52] !log Disabled puppet on osmium for testing a chromium thingy [05:34:58] Logged the message, Master [05:44:23] (03Abandoned) 10Yuvipanda: Combine deployment_server roles [puppet] - 10https://gerrit.wikimedia.org/r/195336 (owner: 10Thcipriani) [05:47:17] eh. wait. this can't be right. I have a 2nd account (didn't even remember) that gonna get SUL renamed. but the global account was still available, until 6 hours ago (when the bot ran?) why wasn't my existing account upgraded to a global account ? [05:48:08] https://meta.wikimedia.org/wiki/Special:CentralAuth/Flying_Spaghetti_Monster [05:48:31] oh ... user mentions.... ? [05:48:42] that would be... sad... [05:51:46] legoktm: ^ [05:51:55] (03PS2) 10Yuvipanda: Tools: Port portgrabber to Python [puppet] - 10https://gerrit.wikimedia.org/r/197439 (https://phabricator.wikimedia.org/T91954) (owner: 10Tim Landscheidt) [05:52:08] (03CR) 10Yuvipanda: [C: 032 V: 032] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/197439 (https://phabricator.wikimedia.org/T91954) (owner: 10Tim Landscheidt) [05:54:15] thedj: looking [05:55:51] thedj: the accounts have different emails. if they have the same password or you know both of them, you can go to Special:MergeAccount and merge them [05:58:37] legoktm: but home wiki shows Global account creation date or something ? 
[05:58:49] oh wait, it's 'attached on' [05:58:51] nvmd [05:59:06] yeah, that's when the script ran over it [06:00:37] k, nvmd, i thought that was the creation date, which would have made it weird [06:11:43] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [06:11:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [06:22:53] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:22:53] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:28:12] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [06:29:03] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: puppet fail [06:29:12] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:13] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:13] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:23] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:23] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:23] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:23] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:42] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:02] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:12] PROBLEM - puppet last run on mw2017 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:55] thedj: if you log in as Flying_Spaghetti_Monster [06:35:01] on nl.wiki, it will be merged [06:35:15] (03CR) 10Florianschmidtwelzow: [C: 031] Set $wgRateLimits['badcaptcha'] to counter bots [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: 10Nemo bis) [06:35:39] Just tested with https://meta.wikimedia.org/wiki/Special:CentralAuth/Nemo_nonies [06:36:35] Wait, what's happening? [06:36:50] Nemo, what's happening with the merging? [06:37:33] 6operations, 7HTTPS, 3HTTPS-by-default, 7Performance: HTTPS performance tuning - https://phabricator.wikimedia.org/T86666#1127505 (10Eloquence) Congrats for getting jessie into prod! Can we project an ETA yet for the remaining tuning tasks? Thanks. [06:38:18] Bsadowski1: same as months ago ;) logging in merges the local account with the global account with the same name, if you own it [06:44:03] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [06:45:42] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 43.20 ms [06:46:13] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:46:23] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:46:23] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:33] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:46:33] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:46:33] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:43] RECOVERY - puppet last run on mw2017 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:46:43] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:47:03] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures 
[06:47:33] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:42] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:54] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:57:30] (03PS1) 10Yuvipanda: Revert "Include hiera classes in lab instance role" [puppet] - 10https://gerrit.wikimedia.org/r/197482 [06:57:42] (03PS2) 10Yuvipanda: Revert "Include hiera classes in lab instance role" [puppet] - 10https://gerrit.wikimedia.org/r/197482 [06:57:50] (03CR) 10Yuvipanda: [C: 032] Revert "Include hiera classes in lab instance role" [puppet] - 10https://gerrit.wikimedia.org/r/197482 (owner: 10Yuvipanda) [07:44:18] mutante: gaaaah, I thought I had emailed engineering@ about dsh, but apparently I forgot to hit send [07:44:20] * YuviPanda facepalms [07:45:26] (03PS2) 10Yuvipanda: dsh: delete 'ALL' and 'cp1' groups [puppet] - 10https://gerrit.wikimedia.org/r/197477 (https://phabricator.wikimedia.org/T92259) (owner: 10Dzahn) [07:45:35] (03CR) 10Yuvipanda: [C: 032 V: 032] dsh: delete 'ALL' and 'cp1' groups [puppet] - 10https://gerrit.wikimedia.org/r/197477 (https://phabricator.wikimedia.org/T92259) (owner: 10Dzahn) [07:46:18] (03PS6) 10Yuvipanda: dsh: delete most groups [puppet] - 10https://gerrit.wikimedia.org/r/195840 (https://phabricator.wikimedia.org/T92259) (owner: 10Dzahn) [07:46:37] (03CR) 10Yuvipanda: "I had forgotten to actually send the email to engineering@, was sitting in drafts :( I had just sent it now." [puppet] - 10https://gerrit.wikimedia.org/r/195840 (https://phabricator.wikimedia.org/T92259) (owner: 10Dzahn) [07:50:30] 6operations, 7HTTPS, 3HTTPS-by-default, 7Performance: HTTPS performance tuning - https://phabricator.wikimedia.org/T86666#1127579 (10BBlack) OCSP Stapling should happen by the end of the week, I think. Worst case mid-next-week. 
I've been merging up some of Faidon's previous puppet cert refactoring today,... [07:51:19] 10Ops-Access-Requests, 6operations: Add Stephen LaPorte to dns-admin alias - https://phabricator.wikimedia.org/T92968#1127582 (10yuvipanda) 5Open>3Resolved a:3yuvipanda In this case I don't think this needs too wide a discussion / wait period. Done. [07:51:38] 6operations, 10ops-codfw: ms-be2007.codfw.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T92835#1127585 (10yuvipanda) a:3fgiunchedi [07:51:51] 6operations, 10ops-codfw: ms-be2007.codfw.wmnet: slot=8 dev=sdi failed - https://phabricator.wikimedia.org/T92834#1127587 (10yuvipanda) a:3fgiunchedi [07:52:38] 6operations, 10ops-eqiad: Failed Raid Analytics1010 - https://phabricator.wikimedia.org/T92957#1127590 (10yuvipanda) a:3Ottomata I think @ottomata mentioned that this is a cisco that's no longer being used. In that case it should be decommissioned. [07:53:53] 6operations, 6Mobile-Apps, 6Services: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1127593 (10yuvipanda) p:5Triage>3Normal [07:54:15] 6operations: Migrate host lists out of cache.pp to reference values in Hiera - https://phabricator.wikimedia.org/T92601#1127595 (10yuvipanda) p:5Triage>3Normal [07:55:30] 6operations, 6MediaWiki-Core-Team, 10hardware-requests, 5Patch-For-Review: Fluorine needs bigger disks - https://phabricator.wikimedia.org/T92417#1127597 (10yuvipanda) p:5High>3Normal Back to normal, since there don't seem to have been any alerts of late. [07:55:53] 6operations, 6MediaWiki-Core-Team, 10hardware-requests, 5Patch-For-Review: Fluorine needs bigger disks - https://phabricator.wikimedia.org/T92417#1127599 (10yuvipanda) a:3Andrew (Assigning to @Andrew as the last person to have touched it :) ) [07:58:28] 6operations: for the love of all that is good, puppetize udpmcast - https://phabricator.wikimedia.org/T82092#1127603 (10yuvipanda) @bblack is this still valid? 
[07:59:31] 6operations: Automate bare metal builds to work similar to labs - https://phabricator.wikimedia.org/T79997#1127608 (10yuvipanda) 5Open>3Invalid a:3yuvipanda A lot of this is @Joe's wmf-re-image script, I think. Closing as not particularly actionable atm. [08:02:44] 7Puppet, 6operations: empty hiera yaml file makes lookup fail - https://phabricator.wikimedia.org/T89957#1127615 (10yuvipanda) p:5Normal>3Low Interim solution should be to not have empty yaml files :) [08:03:56] 6operations: Retire Torrus - https://phabricator.wikimedia.org/T87840#1127621 (10yuvipanda) @Gage can torrus be killed now? [08:05:26] 6operations, 7network: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#1127632 (10yuvipanda) @cmjohnson @papaul is this done now? [08:07:08] 6operations, 10MediaWiki-extensions-ConfirmEdit-(CAPTCHA-extension): bogus captchaid results in http 500, should be http 400 instead - https://phabricator.wikimedia.org/T88970#1127647 (10Florian) [08:07:12] (03CR) 10Mobrovac: [C: 031] RESTbase production enablement step 1 – ptwiki, ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197468 (owner: 10Jforrester) [08:07:37] (03CR) 10Mobrovac: [C: 031] RESTbase production enablement step 2 – itwiki, plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197469 (owner: 10Jforrester) [08:33:12] YuviPanda: thnx for fixing citoid in beta! [08:33:29] mobrovac: :) yw! It would’ve worked fine until a restart... 
[08:33:39] heh [08:33:47] mobrovac: also, if some param is *needed* for a class to work, it shouldn’t have default set to undef [08:34:02] it should either have no default (and puppet will error) or have a sane default [08:34:26] that's a valuable piece of info to know [08:34:27] :) [08:37:15] mobrovac: :) [08:38:22] (03PS1) 10Yuvipanda: citoid: Don't set default to undef for required parameters [puppet] - 10https://gerrit.wikimedia.org/r/197484 [08:38:23] mobrovac: ^ [08:40:27] YuviPanda: how is that different from https://gerrit.wikimedia.org/r/#/c/197421/2 [08:40:28] ? [08:41:13] ah ok, now i see [08:41:15] mobrovac: that one immediately unbroke citoid on beta / staging. This one will make puppet complain if a new citoid is setup anywhere and doesn’t have the hostnames set in some form. [08:41:53] yep [08:42:27] you should then probably rebase and bring back the lines from citoid.yaml [08:42:39] since the first one has already been merged [08:42:51] mobrovac: what do you mean by ‘bring back the lines’? [08:43:26] YuviPanda: these were removed in your previous patch and needed: https://gerrit.wikimedia.org/r/#/c/197421/2/hieradata/common/citoid.yaml [08:43:32] mobrovac: why are they needed? [08:43:42] I only removed the ones that don’t have a default set :) [08:44:27] hm, wait i'm confused now [08:44:30] https://gerrit.wikimedia.org/r/#/c/197421 is merged [08:44:41] and it removes the conf for the proxy and zotero [08:44:49] and the port [08:45:11] mobrovac: it only removes the ports. nothing else. 
[08:45:22] and the ports are set to default on the citoid class in the same patch [08:45:37] ah ok, you're right [08:45:40] sorry, my bad [08:45:47] * YuviPanda hands mobrovac coffee :) [08:45:52] yeah [08:45:53] :) [08:46:17] brazilian strong with non-fat milk [08:46:20] that's my drug [08:46:23] :D [08:47:59] (03CR) 10Yuvipanda: [C: 032] citoid: Don't set default to undef for required parameters [puppet] - 10https://gerrit.wikimedia.org/r/197484 (owner: 10Yuvipanda) [08:57:19] (03PS1) 10Yuvipanda: staging: Add ocg role to ocg hosts [puppet] - 10https://gerrit.wikimedia.org/r/197486 (https://phabricator.wikimedia.org/T91555) [08:58:25] (03CR) 10Yuvipanda: [C: 032] staging: Add ocg role to ocg hosts [puppet] - 10https://gerrit.wikimedia.org/r/197486 (https://phabricator.wikimedia.org/T91555) (owner: 10Yuvipanda) [09:05:43] (03PS1) 10Yuvipanda: Revert "Revert "Include hiera classes in lab instance role"" [puppet] - 10https://gerrit.wikimedia.org/r/197487 [09:06:00] (03PS2) 10Yuvipanda: Revert "Revert "Include hiera classes in lab instance role"" [puppet] - 10https://gerrit.wikimedia.org/r/197487 [09:07:17] (03CR) 10Yuvipanda: [C: 032] Revert "Revert "Include hiera classes in lab instance role"" [puppet] - 10https://gerrit.wikimedia.org/r/197487 (owner: 10Yuvipanda) [09:36:45] <_joe_> what's up with icinga? [09:37:34] <_joe_> neon fails puppet? [09:38:17] oh? [09:38:21] _joe_: let me take a look [09:39:07] jesus fucking christ, my DNS server [09:39:12] * YuviPanda shoots his ISP [09:40:12] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 740 [09:42:35] (03PS1) 10Yuvipanda: ocg: Move defaults to class params, rather than hiera [puppet] - 10https://gerrit.wikimedia.org/r/197490 (https://phabricator.wikimedia.org/T91555) [09:43:20] _joe_: neon seems ok? 
[09:43:37] (03PS2) 10Yuvipanda: ocg: Move defaults to class params, rather than hiera [puppet] - 10https://gerrit.wikimedia.org/r/197490 (https://phabricator.wikimedia.org/T91555) [09:44:25] (03CR) 10Yuvipanda: [C: 032] ocg: Move defaults to class params, rather than hiera [puppet] - 10https://gerrit.wikimedia.org/r/197490 (https://phabricator.wikimedia.org/T91555) (owner: 10Yuvipanda) [09:48:03] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [09:49:27] <_joe_> YuviPanda: no it's not, almost 100% sure [09:49:33] <_joe_> but I'll show you how in a few [09:49:38] alright. [09:49:41] <_joe_> I have to finish a patch binge run [09:49:47] puppet run seems ok and the web interface seems fine... [09:49:49] alright :) [09:50:12] RECOVERY - check_mysql on db1008 is OK: Uptime: 500448 Threads: 2 Questions: 3192902 Slow queries: 3309 Opens: 10343 Flush tables: 2 Open tables: 64 Queries per second avg: 6.380 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:01:03] (03PS1) 10KartikMistry: config: Enable CX campagin in cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197491 [10:02:51] I'm getting dkim issues with mail from meta [10:03:40] "invalid (publickey: granularity mismatch) [10:06:43] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:09:24] (03PS1) 10Giuseppe Lavagetto: MWRealm: add support for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197493 (https://phabricator.wikimedia.org/T91754) [10:09:26] (03PS1) 10Giuseppe Lavagetto: db-config: Add configuration for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197494 (https://phabricator.wikimedia.org/T91754) [10:09:28] (03PS1) 10Giuseppe Lavagetto: poolcounter: add support for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197495 (https://phabricator.wikimedia.org/T91754) [10:09:30] (03PS1) 10Giuseppe Lavagetto: memcached: add configurations for codfw 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/197496 (https://phabricator.wikimedia.org/T91754) [10:09:32] (03PS1) 10Giuseppe Lavagetto: proxy: add codfw networks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197497 (https://phabricator.wikimedia.org/T91754) [10:09:34] (03PS1) 10Giuseppe Lavagetto: jobqueue: add configuration for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197498 (https://phabricator.wikimedia.org/T91754) [10:09:36] (03PS1) 10Giuseppe Lavagetto: filebackend: add configuration for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197499 (https://phabricator.wikimedia.org/T91754) [10:09:38] (03CR) 10jenkins-bot: [V: 04-1] MWRealm: add support for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197493 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [10:12:06] <_joe_> sigh [10:12:23] <_joe_> that's on me for wanting to split it up too much [10:24:23] (03PS1) 10Hoo man: Log HttpErrors with a 500+ code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197503 (https://phabricator.wikimedia.org/T85795) [10:43:08] (03PS2) 10Giuseppe Lavagetto: db-config: Add configuration for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197494 (https://phabricator.wikimedia.org/T91754) [10:43:10] (03PS2) 10Giuseppe Lavagetto: poolcounter: add support for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197495 (https://phabricator.wikimedia.org/T91754) [10:43:12] (03PS2) 10Giuseppe Lavagetto: MWRealm: add support for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197493 (https://phabricator.wikimedia.org/T91754) [10:43:14] (03PS2) 10Giuseppe Lavagetto: jobqueue: add configuration for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197498 (https://phabricator.wikimedia.org/T91754) [10:43:16] (03PS2) 10Giuseppe Lavagetto: filebackend: add configuration for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197499 
(https://phabricator.wikimedia.org/T91754) [10:43:18] (03PS2) 10Giuseppe Lavagetto: memcached: add configurations for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197496 (https://phabricator.wikimedia.org/T91754) [10:43:20] (03PS2) 10Giuseppe Lavagetto: proxy: add codfw networks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197497 (https://phabricator.wikimedia.org/T91754) [10:43:22] (03CR) 10jenkins-bot: [V: 04-1] MWRealm: add support for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197493 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [10:47:40] (03PS3) 10Giuseppe Lavagetto: MWRealm: add support for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197493 (https://phabricator.wikimedia.org/T91754) [10:49:35] (03Abandoned) 10Giuseppe Lavagetto: db-config: Add configuration for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197494 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [10:50:37] (03PS3) 10Giuseppe Lavagetto: poolcounter: add support for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197495 (https://phabricator.wikimedia.org/T91754) [10:56:47] Nemo_bis: yeah, but it might be that I don't actually own the en.wp account [10:58:46] (03PS3) 10Giuseppe Lavagetto: memcached: add configurations for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197496 (https://phabricator.wikimedia.org/T91754) [10:59:59] (03PS1) 10Nemo bis: Add users_to_rename table to fullview in labsdb replica [software] - 10https://gerrit.wikimedia.org/r/197507 [11:00:51] (03PS3) 10Giuseppe Lavagetto: proxy: add codfw networks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197497 (https://phabricator.wikimedia.org/T91754) [11:01:24] (03PS3) 10Giuseppe Lavagetto: jobqueue: add configuration for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197498 (https://phabricator.wikimedia.org/T91754) [11:01:55] (03CR) 10Nemo bis: "Springle, can you 
handle the replication?" [software] - 10https://gerrit.wikimedia.org/r/197507 (owner: 10Nemo bis) [11:05:26] (03Abandoned) 10Giuseppe Lavagetto: mediawiki: add configs to support the Dallas DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194830 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [11:19:39] (03PS1) 10Phuedx: [WikiGrok] Actor campaign suggests occupations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197508 [11:21:31] YuviPanda|food: did you manage to get the git deploy thing on staging resolved ? [11:21:56] akosiaris: yes, with a massive hammer: https://phabricator.wikimedia.org/P409 [11:22:08] that makes it all ‘work' [11:23:59] YuviPanda: seems like a nice solution :P [11:24:08] anyway. trebuchet bug. Need to file a task [11:24:16] or github issue [11:24:34] and reproduction steop of course for ryan [11:24:37] steps* [11:24:38] akosiaris: I filed a task somewhere... [11:24:50] somewhere == ? [11:24:56] on a post it note on your desk ? :P [11:25:07] niah, you are way to digital for that [11:25:23] I might do it, somehow I doubt that though as well [11:25:26] no no, on phab [11:25:29] looking for the link now [11:25:29] sigh, I am getting old [11:25:50] akosiaris: https://phabricator.wikimedia.org/T92978 [11:26:47] YuviPanda: that is different though from the one I was referring to [11:27:05] akosiaris: yup. that’s the primary bug, I guess. [11:27:06] I am not sure it is a bug as well... although we could probably fix it [11:27:13] but the ‘bootstrap script’ fixed them all [11:27:24] well, it a process issue. [11:27:27] its* [11:27:28] oh [11:27:30] it's* [11:27:42] man... not good on a laptop keyboard anymore [11:27:44] I don’t know how much deep down the git-deploy hole I want to go. [11:27:47] akosiaris: thanks for the comment. what about numbers? not quoting them ? [11:28:22] matanya: no, no quoting. Except for the mode => '0XYZ' attribute of file resource [11:28:33] I know weird, sorry... 
[11:28:38] ok, many patches to come :) [11:28:48] (03Abandoned) 10Matanya: logstash: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195874 (https://phabricator.wikimedia.org/T91908) (owner: 10Matanya) [11:29:27] YuviPanda: ok point taken [11:29:38] I 'll try to go down that rabbit hole, but not today [11:29:55] akosiaris: :D [11:32:31] (03PS2) 10Matanya: wikistats: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195866 (https://phabricator.wikimedia.org/T91908) [11:32:33] artopurge broken? When deleting file page, the cache server are still sending file description content [11:32:36] example: https://commons.wikimedia.org/wiki/File:Ayoub_Kalantari.jpg [11:33:47] and the link in the deletion log is blue, not red... :/ but the file is deleted O_O [11:36:31] (03PS2) 10Matanya: wikimania_scholarships: resource attributes quoting and minor lint [puppet] - 10https://gerrit.wikimedia.org/r/195864 (https://phabricator.wikimedia.org/T91908) [11:37:12] (03Abandoned) 10Matanya: racktables: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195863 (https://phabricator.wikimedia.org/T91908) (owner: 10Matanya) [11:38:01] created https://phabricator.wikimedia.org/T93052 [11:42:29] (03CR) 10Alexandros Kosiaris: "Actually, this is not true. citoid is usable without zotero, can be installed without having to go through a proxy and does not need to se" [puppet] - 10https://gerrit.wikimedia.org/r/197484 (owner: 10Yuvipanda) [11:44:26] <_joe_> wikibugs is gone? [11:45:23] yes [11:46:02] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail [11:47:05] _joe_ , do you know what is going on with caching atm? After deletion/filemoved pages needing a &action=purge by hand [11:47:49] akosiaris: ah, hmm. so maybe for zotero / url-proxy… maaaybe. 
but port should’ve been specified as a default, since the JS i nlocalsettings.js just is invalid js if port is undef [11:50:04] (03Abandoned) 10Matanya: shinken: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195862 (https://phabricator.wikimedia.org/T91908) (owner: 10Matanya) [11:50:41] YuviPanda: you are right of course. But I did not comment on the port change ;-) [11:50:47] (03PS2) 10Matanya: ocg: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195860 (https://phabricator.wikimedia.org/T91908) [11:51:04] (03Abandoned) 10Matanya: ocg: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195860 (https://phabricator.wikimedia.org/T91908) (owner: 10Matanya) [11:52:22] (03PS2) 10Matanya: etherpad: minor lint [puppet] - 10https://gerrit.wikimedia.org/r/195859 [11:52:41] <_joe_> Steinsplitter: no, no idea [11:52:54] <_joe_> that sounds like a jobqueue problem though [11:52:56] akosiaris: :) I have been flitting across so many different parts of our codebase... [11:52:58] * YuviPanda sighs [11:53:07] <_joe_> Steinsplitter: is there a bug for that? [11:53:10] currently looking at - why is ocg class and role so intermingleeedddddd [11:53:16] _joe_: https://phabricator.wikimedia.org/T93052 [11:53:29] <_joe_> akosiaris, YuviPanda ^^ [11:53:33] <_joe_> we should take a look [11:53:58] <_joe_> Steinsplitter: on only when deleting files [11:54:02] * YuviPanda clicks [11:54:10] (03PS3) 10Matanya: etherpad: minor lint [puppet] - 10https://gerrit.wikimedia.org/r/195859 [11:54:19] _joe_: it changed since my report. undeletions and filemoves affected to. [11:54:37] <_joe_> Steinsplitter: I will do a sanity check but I'm not sure it's a systems problem [11:54:56] Steinsplitter: are you seeing this problem only when logged out, or both when logged in / out? 
[11:55:27] <_joe_> YuviPanda: that would've been my next question [11:55:30] (03PS1) 10Filippo Giunchedi: diamond: stop collecting disk partitions stats [puppet] - 10https://gerrit.wikimedia.org/r/197509 (https://phabricator.wikimedia.org/T1075) [11:55:59] YuviPanda: when logged out too, tested with other browser. [11:56:11] (03CR) 10Yuvipanda: "Wondering how this will affect labs, where we use this heavily for shinken alerts." [puppet] - 10https://gerrit.wikimedia.org/r/197509 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [11:56:34] (03PS2) 10Matanya: ferm: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195858 (https://phabricator.wikimedia.org/T91908) [11:57:22] 6operations, 6Commons: Cache servers does not purge after deletions - https://phabricator.wikimedia.org/T93052#1128216 (10yuvipanda) @Steinsplitter says this occurs both when logged in / out. [11:57:50] YuviPanda: it is both [11:58:28] YuviPanda: mhh for individual partitions? can you point me at the alerts so I can take a closer lok? [11:58:51] godog: modules/shinken/files/basic-instance-checks.cfg [11:59:11] <_joe_> Steinsplitter: jobrunners are working ok [11:59:54] YuviPanda: cool, yeah that's diskspace not iostat [12:00:03] godog: aaaaaah, right. nvm then :) [12:00:40] <_joe_> Steinsplitter: do you have a time when this started [12:00:56] (03PS2) 10Filippo Giunchedi: diamond: stop collecting disk IO stats for partitions [puppet] - 10https://gerrit.wikimedia.org/r/197509 (https://phabricator.wikimedia.org/T1075) [12:00:58] (03Abandoned) 10Matanya: memcached: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195857 (https://phabricator.wikimedia.org/T91908) (owner: 10Matanya) [12:01:31] godog: curious as to why [12:01:40] unfortunatly not exactly. 
just went back at home from work two horus ago [12:01:43] (03CR) 10Filippo Giunchedi: "shouldn't be affecting disk space free checks since this is iostat hierarchy, I've clarified the commit message tho" [puppet] - 10https://gerrit.wikimedia.org/r/197509 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [12:01:48] alright, so I just uploaded and undeleted https://test.wikipedia.org/wiki/File:Test_Image_2.jpeg [12:01:51] I was thinking about having ganglia collect disk IO stats all over the cluster as well [12:01:53] err [12:01:53] deleted [12:01:57] and can see that the page is still showing up [12:01:59] (03Abandoned) 10Matanya: motd: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195770 (owner: 10Matanya) [12:02:09] the images themselves are gone. [12:02:25] akosiaris: mostly because it takes up a ton of space for little benefit, numbers are https://phabricator.wikimedia.org/T1075#1128140 too [12:02:44] <_joe_> YuviPanda: so we are not purging the cache, or are we not deleting images? [12:02:57] so not a swift issue [12:03:09] _joe_: we’re not purging something *somewhere*. I don’t think it’s a varnish layer issue. [12:03:18] (since i see the anomalous page even when logged in) [12:03:22] I’m poking around the db [12:03:24] to see what’s up [12:03:43] godog: Oh, ok change LGTM, makes a ton of sense [12:03:46] <_joe_> YuviPanda: I don't think so, we changed anything yesterday? [12:04:43] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:05:02] <_joe_> I may have found something [12:06:32] not that I know of, no. [12:07:58] hmm, the db tables seem fine (page / image) [12:10:54] (03PS1) 10Springle: Update m1 grants template to reflect reality. 
[puppet] - 10https://gerrit.wikimedia.org/r/197511 (https://phabricator.wikimedia.org/T92694) [12:12:50] (03PS2) 10Matanya: zuul: lint [puppet] - 10https://gerrit.wikimedia.org/r/195769 [12:14:16] 6operations, 7HTTPS, 3HTTPS-by-default, 7Performance: HTTPS performance tuning - https://phabricator.wikimedia.org/T86666#1128255 (10Nemo_bis) [12:17:07] (03PS2) 10Matanya: extdist: minor lint [puppet] - 10https://gerrit.wikimedia.org/r/195743 [12:18:57] (03CR) 10Springle: [C: 032] Update m1 grants template to reflect reality. [puppet] - 10https://gerrit.wikimedia.org/r/197511 (https://phabricator.wikimedia.org/T92694) (owner: 10Springle) [12:19:13] (03CR) 10Nikerabbit: config: Enable CX campagin in cawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197491 (owner: 10KartikMistry) [12:20:10] chasemp: https://gerrit.wikimedia.org/r/#/c/197509/ for your eyes too [12:22:27] 6operations, 6Commons: Cache servers does not purge after deletions - https://phabricator.wikimedia.org/T93052#1128265 (10yuvipanda) Confirmed. Seems to be limited only to File pages (my test Mainspace page is fine) [12:25:49] 6operations, 10ops-eqiad: Rack and set up ms-be1016-1018 - https://phabricator.wikimedia.org/T90922#1128268 (10faidon) There is a third option, which is to have another profile where sda/sdb are the SSDs and the rest are spindles and configured like this everywhere (d-i, puppet, swift ring files). Tampa server... 
[12:26:59] (03PS2) 10Matanya: scap: lint [puppet] - 10https://gerrit.wikimedia.org/r/195680 [12:27:43] (03Abandoned) 10Matanya: locales: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195665 (https://phabricator.wikimedia.org/T91908) (owner: 10Matanya) [12:30:13] (03PS2) 10Matanya: backup: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195661 [12:32:17] 6operations, 6Commons: Cache servers does not purge after deletions - https://phabricator.wikimedia.org/T93052#1128270 (10yuvipanda) My hunch is that this has nothing to do with varnish itself, since it happens both when logged in / out. [12:32:49] 6operations, 6Commons: Cache servers does not purge after deletions - https://phabricator.wikimedia.org/T93052#1128271 (10yuvipanda) And the database itself sesems to be fine, at least the page / image tables. (https://test.wikipedia.org/wiki/File:Test_Image_2.jpeg for test image, please do not purge atm) [12:33:23] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [12:33:23] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [12:33:48] (03PS2) 10Matanya: puppet_compiler: resource attributes quoting and minor lints [puppet] - 10https://gerrit.wikimedia.org/r/195660 [12:34:08] 6operations, 6Commons: Deleted image pages do not show up as deleted until a manual action=purge - https://phabricator.wikimedia.org/T93052#1128272 (10yuvipanda) [12:35:21] 6operations, 6Commons, 6MediaWiki-Core-Team, 6Multimedia: Deleted image pages do not show up as deleted until a manual action=purge - https://phabricator.wikimedia.org/T93052#1128169 (10yuvipanda) [12:36:13] 7Blocked-on-Operations, 6operations, 10Continuous-Integration: Jenkins is using php-luasandbox 1.9-1 for zend unit tests; precise should be upgraded to 2.0-7+wmf2.1 or equivalent - https://phabricator.wikimedia.org/T88798#1128277 (10akosiaris) 
a:3akosiaris [12:36:44] (03PS2) 10Matanya: reprepro: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195658 [12:36:53] (03CR) 10jenkins-bot: [V: 04-1] reprepro: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195658 (owner: 10Matanya) [12:37:22] 6operations, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1128285 (10faidon) a:5BBlack>3Yurik [12:37:55] paravoid, any reason? [12:38:32] 6operations, 6Zero, 7Varnish, 6WMF-NDA: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1128287 (10Yurik) [12:38:59] (03PS3) 10Matanya: reprepro: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195658 [12:39:35] 6operations, 6Zero, 7Varnish, 6WMF-NDA: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1128288 (10faidon) I see nothing private here, no reason to restrict with NDA. I also see no tasks that this blocks (so removing blocked-on-operations) and no agreement that t... [12:39:44] what the hell? [12:40:31] you know that just adding the project does nothing, right? [12:40:43] and there's absolutely nothing private on that task, so stop trying to make it private [12:41:45] paravoid NDA was added per biz request [12:41:52] please check with us before removing things like that [12:42:08] <_joe_> "biz"? [12:42:15] well it didn't do anything anyway [12:42:21] permissions weren't adjusted, just the project was added [12:42:28] paravoid, and thanks for reminding - switched security on [12:42:41] why is it private? [12:43:15] this is a ticket destined for us/in our queue, so please check with *us* before writing anything that might considered private [12:43:20] _joe_, paravoid, zero business development - frequently confidential. Zero team felt we were discussing things private to partners internal systems [12:43:45] like what? 
[12:43:50] and no, usually the creator determines if something they are requesting is private to them [12:43:53] everything discussed there is under public VCS [12:44:22] the code is under public, not the discussion. This was not my request. You are welcome to discuss it with zero development team [12:44:49] 6operations, 6Commons, 6MediaWiki-Core-Team, 6Multimedia: Deleted image pages do not show up as deleted until a manual action=purge - https://phabricator.wikimedia.org/T93052#1128301 (10yuvipanda) (am fairly unfamiliar with our LocalFile setup, am trying to trace the actions caused by action=purge, to see... [12:44:51] i don't really care either way, but i will not take upon myself to set security of the issue [12:45:20] you're the one writing what is considered "private", so you don't care what it was exactly that you said that was private? [12:45:26] and the zero development+legal (not dev) team is the one who knows what and why should be private [12:45:39] paravoid, talk to dfoy [12:45:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:46:02] ... [12:46:02] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:46:03] i am answering technical questions, not legal [12:46:40] everyone needs to know which pieces of information are restricted so that they won't speak about them in public [12:47:24] everything I said in that conversation I'd say in public in front of non-NDA people and I'm assuming you'd do the same [12:47:29] so something is clearly wrong here [12:47:49] either this characterization is wrong or we'd both violate our NDA doing so [12:48:18] (and in fact, already did, as the task was restricted just now, ex post facto) [12:48:32] paravoid, you are welcome to talk about it with dfoy. I was told to secure it. 
The discussion was related to internal configuration of the partners, and even though most of it is public, Zero felt uncomfortable as we were borderlining. Again, feel free to talk to dfoy about it. Lets concentrate on the technical aspects [12:48:44] so next time someone asks you to "secure" something, ask why [12:48:50] and before tagging it, explain why on the ticket [12:48:51] okay? [12:49:04] (and do it properly) [12:49:25] (03PS2) 10Matanya: haproxy: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195652 [12:49:54] paravoid, i mostly agree with you, and don't really want to deal with security of the tickets. In this case, the ticket was requested to be NDAed. Sadly I did not realize that NDA group does not change security settings (which i think it should) [12:50:07] it was not initially [12:50:22] it was not initially meant to be secure [12:50:33] only when we went in depths of partner relationships [12:50:46] and their configuration [12:54:29] (03PS2) 10Matanya: bastionhost: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195650 [12:54:31] (03PS1) 10Giuseppe Lavagetto: labs::dns: remove duplicate monitoring [puppet] - 10https://gerrit.wikimedia.org/r/197513 [12:54:38] <_joe_> YuviPanda: ^^ [12:54:50] (03Abandoned) 10Matanya: bastionhost: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195650 (owner: 10Matanya) [12:54:53] <_joe_> I'll clean up that mess today, or I will merge it [12:55:48] yurik: has this tagging of all traffic on all clusters been discussed in zero quarterly reviews or anything anywhere? [12:55:49] _joe_: ok. 
andrewbogott_afk is in the middle of doing pdns stuff tho [12:55:50] I can't find it [12:55:56] (03Abandoned) 10Matanya: diamond: resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195644 (owner: 10Matanya) [12:56:41] <_joe_> YuviPanda: well, whatever, having two modules named labs_dns and labs_ldap_dns [12:56:53] <_joe_> which are basically a copy-paste of each other [12:56:58] <_joe_> makes me itchy [12:57:09] <_joe_> also, they don't work [12:57:11] <_joe_> ;) [12:57:30] (03CR) 10Yuvipanda: [C: 031] "Good enough hotfix, I think, but @andrewbogott is in the middle of moving things around here, afaik." [puppet] - 10https://gerrit.wikimedia.org/r/197513 (owner: 10Giuseppe Lavagetto) [12:58:14] mark, don't know, it has been discussed directly with Brandon, was part of the Zero team's requests, and in various discussions with analytics what data partners need [12:58:25] i'll check with brandon [12:58:34] i meant - discussed with zero, and with analytics [12:58:52] (and brandon) [13:01:43] (03PS1) 10Springle: Allow tendril access to silver for monitoring mysqld. [puppet] - 10https://gerrit.wikimedia.org/r/197515 [13:02:09] (03PS2) 10Matanya: limn: minor lint and Resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195616 [13:02:35] (03CR) 10Giuseppe Lavagetto: [C: 031] Allow tendril access to silver for monitoring mysqld. 
[puppet] - 10https://gerrit.wikimedia.org/r/197515 (owner: 10Springle) [13:02:55] (03CR) 10jenkins-bot: [V: 04-1] limn: minor lint and Resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195616 (owner: 10Matanya) [13:04:56] (03PS3) 10Matanya: limn: minor lint and Resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195616 [13:05:48] (03CR) 10jenkins-bot: [V: 04-1] limn: minor lint and Resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195616 (owner: 10Matanya) [13:05:53] (03CR) 10Springle: "@YuviPanda, I noticed a SAL entry from you on March 16; curious if this changeset is appropriate or whether silver firewall is being contr" [puppet] - 10https://gerrit.wikimedia.org/r/197515 (owner: 10Springle) [13:06:32] YuviPanda: might be a silly question on my part ^ just checking [13:06:42] springle: the firewall for deployment-ssh was handled with I6017ba0d294bfad980ec24beff04d199201d87c4 [13:06:55] (03CR) 10Yuvipanda: "That was handled with I6017ba0d294bfad980ec24beff04d199201d87c4" [puppet] - 10https://gerrit.wikimedia.org/r/197515 (owner: 10Springle) [13:07:04] (03PS4) 10Matanya: limn: minor lint and Resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195616 [13:07:51] YuviPanda: does that mean i should be adding a rule to nova.pp instead? [13:09:23] springle: hmm, that seems like an ok place to put it [13:09:37] springle: the other ferm rule was in nova because for some reason the mediawiki role isn’t being used... [13:09:39] not sure why [13:09:43] ah ok [13:09:44] thanks [13:10:02] (03CR) 10Springle: [C: 032] Allow tendril access to silver for monitoring mysqld. 
[puppet] - 10https://gerrit.wikimedia.org/r/197515 (owner: 10Springle) [13:11:42] (03PS2) 10Matanya: puppetception: Resource attributes quoting [puppet] - 10https://gerrit.wikimedia.org/r/195613 [13:12:53] (03Abandoned) 10Matanya: system: quote strings [puppet] - 10https://gerrit.wikimedia.org/r/195611 (owner: 10Matanya) [13:13:26] (03CR) 10Nikerabbit: [C: 04-1] config: Enable CX campagin in cawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197491 (owner: 10KartikMistry) [13:21:38] 6operations, 6Commons, 6MediaWiki-Core-Team, 6Multimedia: Deleted image pages do not show up as deleted until a manual action=purge - https://phabricator.wikimedia.org/T93052#1128387 (10Gilles) [13:23:47] (03PS2) 10Matanya: swift_new: lint and resource quoting [puppet] - 10https://gerrit.wikimedia.org/r/195607 [13:25:32] (03PS9) 10BBlack: certs: remove legacy ensure => absent Files [puppet] - 10https://gerrit.wikimedia.org/r/197339 (owner: 10Faidon Liambotis) [13:27:46] (03CR) 10BBlack: [C: 032] certs: remove legacy ensure => absent Files [puppet] - 10https://gerrit.wikimedia.org/r/197339 (owner: 10Faidon Liambotis) [13:28:59] (03PS9) 10BBlack: sslcert: add sslcert::chainedcert [puppet] - 10https://gerrit.wikimedia.org/r/197340 (owner: 10Faidon Liambotis) [13:29:04] (03CR) 10Andrew Bogott: [C: 032] etherpad: minor lint [puppet] - 10https://gerrit.wikimedia.org/r/195859 (owner: 10Matanya) [13:29:45] I'd like to pick someone's brain about adding RESTBase to the beta cluster :) [13:29:58] i guess YuviPanda && hasharFood ? [13:30:58] hi mobrovac [13:31:00] maaabe :D [13:31:07] ah ah [13:31:16] ok, let's start with the easy questions [13:31:16] :) [13:31:19] mobrovac: sure. [13:31:34] so, i see deployment-restbase02 in deployment-prep [13:31:41] but can't log in there [13:31:54] also, i'd like to set up at least two VMs for RB there [13:32:04] right. [13:32:10] so, let’s first give you access! 
[13:32:16] (Abandoned) Matanya: clamav: Resource attributes quoting [puppet] - https://gerrit.wikimedia.org/r/195601 (owner: Matanya) [13:32:21] so, can you (a) add me as a member to the project; (b) create another instance ? [13:32:28] YuviPanda: neat! [13:32:32] I can do (a) and you can do (b)! :) [13:32:39] it's a deal [13:32:48] mobrovac: what’s your wikitech username? [13:32:54] mobrovac [13:33:01] heh mentioning myself [13:33:04] surprsiiiiseee :D [13:33:11] (Abandoned) Matanya: librenms: Resource attributes quoting [puppet] - https://gerrit.wikimedia.org/r/195599 (owner: Matanya) [13:33:11] * mobrovac sounds rather egocentric [13:34:13] YuviPanda: also, could we possibly increase the storage a bit for deployment-restbase02 ? [13:34:23] (CR) Alexandros Kosiaris: [C: -1] "Still failing with a:" [debs/contenttranslation/apertium-mk] - https://gerrit.wikimedia.org/r/195244 (https://phabricator.wikimedia.org/T89936) (owner: KartikMistry) [13:34:54] (Abandoned) Matanya: statsdlb: fix string containing only a variable [puppet] - https://gerrit.wikimedia.org/r/195534 (owner: Matanya) [13:35:10] _joe_,YuviPanda, ‘ldap_dns’ is the old dns system for labs, labs_dns is the new one I’m setting up now. They should both be monitored for the time being (but, presumably, not using the same name, which I will see about fixing.) [13:35:29] mobrovac: alright, you should have all the rights there now [13:35:43] i see myself in the list [13:36:18] ok, now i've disappeared [13:36:23] log out and back in, i guess [13:36:35] mobrovac: yeah :) [13:38:03] <_joe_> andrewbogott_afk: you are not monitoring anything correctly, I suspect [13:38:16] <_joe_> but well, I don't have time for this anyway [13:38:18] mobrovac: so what do you mean by ‘extra storage' [13:38:22] <_joe_> now, fud! [13:38:57] YuviPanda: i mean just resizing the root partition, 20GB should be ok, but to be on the safe side [13:39:16] mobrovac: aaaah.
you have to delete and re-create, I’m afraid. all new instances have 20GB / [13:39:19] since restbase/cass will have to swallow all revchanges in the beta cluster [13:39:19] and rest allocatable via lvm [13:39:26] ah ok [13:40:11] mobrovac: you should also join #wikimedia-releng :) [13:40:12] (CR) Alexandros Kosiaris: [C: 2 V: 2] "LGTM. Also built the package and I'll upload it to apt.w.o." [debs/contenttranslation/apertium-af-nl] - https://gerrit.wikimedia.org/r/195838 (https://phabricator.wikimedia.org/T91750) (owner: KartikMistry) [13:40:39] YuviPanda: for the new instance, should i also create a new security group (restbase, services, or sth) ? does that make sense? [13:40:45] ok, joining [13:41:03] mobrovac: yup, it does. even if you don’t customize it now, you can’t add a security group to an instance after the fact... [13:41:10] so always preferable to have one now, just in case... [13:41:17] right [13:41:19] cool [13:42:53] ^d: reviewing your ES patch now [13:44:38] (PS1) Andrew Bogott: Move labs/designate/pdns to ns2.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/197517 [13:46:03] <^d> YuviPanda: ty sir [13:46:08] <^d> It works in staging and beta [13:46:49] ^d: I’m just very wary of breaking ES prod, mostly because I’ll have 0 idea on how to bring that back up :P [13:47:34] <^d> If only we had somebody around who knew ES :p [13:48:32] :P [13:48:52] (PS8) Yuvipanda: Hiera-ize most of the Elasticsearch config [puppet] - https://gerrit.wikimedia.org/r/196640 (owner: Chad) [13:49:11] ^d: :) am looking at it once more...
[13:53:08] (PS1) Andrew Bogott: Alias Holmium to labs-ns2 [dns] - https://gerrit.wikimedia.org/r/197519 [13:58:32] ^d: alright, manybubbles also seems to be here :D am going to merge [13:58:48] (CR) Yuvipanda: [C: 2] Hiera-ize most of the Elasticsearch config [puppet] - https://gerrit.wikimedia.org/r/196640 (owner: Chad) [13:58:56] (PS10) BBlack: sslcert: add sslcert::chainedcert [puppet] - https://gerrit.wikimedia.org/r/197340 (owner: Faidon Liambotis) [14:00:04] chasemp: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150318T1400). Please do the needful. [14:00:05] (CR) BBlack: [C: 2] sslcert: add sslcert::chainedcert [puppet] - https://gerrit.wikimedia.org/r/197340 (owner: Faidon Liambotis) [14:00:31] (PS1) Yuvipanda: Revert "Hiera-ize most of the Elasticsearch config" [puppet] - https://gerrit.wikimedia.org/r/197520 [14:00:37] (PS2) Yuvipanda: Revert "Hiera-ize most of the Elasticsearch config" [puppet] - https://gerrit.wikimedia.org/r/197520 [14:00:38] ^d: nope [14:00:46] (CR) Yuvipanda: [C: 2 V: 2] Revert "Hiera-ize most of the Elasticsearch config" [puppet] - https://gerrit.wikimedia.org/r/197520 (owner: Yuvipanda) [14:01:21] > Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Cannot reassign variable master_eligible at /etc/puppet/manifests/role/elasticsearch.pp:35 on node elastic1001.eqiad.wmnet [14:01:22] Warning: Not using cache on failed catalog [14:01:22] Error: Could not retrieve catalog; skipping run [14:03:03] (CR) Andrew Bogott: [C: 2] Alias Holmium to labs-ns2 [dns] - https://gerrit.wikimedia.org/r/197519 (owner: Andrew Bogott) [14:03:32] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: puppet fail [14:03:33] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: puppet fail [14:04:11] ^ should be fine in a bit [14:04:24] (PS1) Yuvipanda: Revert "Revert 
"Hiera-ize most of the Elasticsearch config"" [puppet] - https://gerrit.wikimedia.org/r/197522 [14:04:28] (PS2) Andrew Bogott: Move labs/designate/pdns to ns2.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/197517 (https://phabricator.wikimedia.org/T93053) [14:04:55] <^d> Oh crap, I forgot to revisit that bit [14:06:25] <^d> YuviPanda: I can do all but that part :p [14:06:26] ^d: http://puppet-compiler.wmflabs.org/637/change/197522/html/elastic1001.eqiad.wmnet.html puppet compiler waaarns [14:07:04] <^d> Yeah I know why [14:07:12] <^d> I'm a bit stuck on the "how to work around it" [14:07:21] <^d> I'd rather not make per-host hiera entries... [14:07:33] <^d> Although I guess it's just 3 [14:09:01] (CR) Andrew Bogott: [C: 2] Move labs/designate/pdns to ns2.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/197517 (https://phabricator.wikimedia.org/T93053) (owner: Andrew Bogott) [14:11:00] (PS1) Filippo Giunchedi: swift: provision ms-be101[678] [puppet] - https://gerrit.wikimedia.org/r/197526 (https://phabricator.wikimedia.org/T90922) [14:12:12] (PS2) Filippo Giunchedi: swift: provision ms-be101[678] [puppet] - https://gerrit.wikimedia.org/r/197526 (https://phabricator.wikimedia.org/T90922) [14:14:18] YuviPanda: can has OAuth approval: https://www.mediawiki.org/w/index.php?title=Special:OAuthListConsumers/view/a63d996c8416c35a0d45824fd44afd95&name=&publisher=&stage=0 [14:14:39] ragesoss: no https callback? tch tch [14:15:27] ragesoss: I’ll approve once I figure out how to [14:15:35] YuviPanda: thanks much [14:15:52] (CR) Andrew Bogott: [C: -2] "This should be fixed by https://gerrit.wikimedia.org/r/#/c/197517/ (although that patch has an incorrect commit line which I just noticed" [puppet] - https://gerrit.wikimedia.org/r/197513 (owner: Giuseppe Lavagetto) [14:18:27] operations: Adv - Cambodia property by Singapore Developer.
Only USD1k / SGD1400 to own - https://phabricator.wikimedia.org/T93072#1128588 (emailbot) [14:19:43] RECOVERY - configured eth on rhenium is OK: NRPE: Unable to read output [14:19:53] RECOVERY - SSH on rhenium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [14:19:53] RECOVERY - Disk space on rhenium is OK: DISK OK [14:19:53] RECOVERY - salt-minion processes on rhenium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:19:53] RECOVERY - dhclient process on rhenium is OK: PROCS OK: 0 processes with command name dhclient [14:19:53] RECOVERY - DPKG on rhenium is OK: All packages OK [14:19:53] RECOVERY - RAID on rhenium is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:19:56] !log powercycling rhenium [14:20:04] Logged the message, Master [14:20:53] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [14:20:53] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:21:02] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [14:25:56] (CR) BryanDavis: [C: 1] Log HttpErrors with a 500+ code [mediawiki-config] - https://gerrit.wikimedia.org/r/197503 (https://phabricator.wikimedia.org/T85795) (owner: Hoo man) [14:28:38] (CR) JanZerebecki: [C: 1] Don't use bits for test.wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/197381 (https://phabricator.wikimedia.org/T92949) (owner: Aude) [14:29:57] ragesoss: done btw [14:30:04] YuviPanda: thanks! [14:30:12] ragesoss: yw!
[14:33:51] operations, Interdatacenter-IPsec: Implement a big IPsec off switch - https://phabricator.wikimedia.org/T88536#1128647 (faidon) [14:34:05] (PS3) Faidon Liambotis: IPsec: big off switch [puppet] - https://gerrit.wikimedia.org/r/196498 (https://phabricator.wikimedia.org/T88536) (owner: Gage) [14:34:36] (PS1) Chad: Hiera-ize the Elasticsearch config [puppet] - https://gerrit.wikimedia.org/r/197533 [14:35:03] PROBLEM - puppet last run on mw2009 is CRITICAL: CRITICAL: puppet fail [14:36:09] (PS2) KartikMistry: config: CX: Enable newarticle campagin in cawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/197491 [14:36:16] <^d> YuviPanda: I redid the master_eligible bit to use per-host hiera data [14:36:26] <^d> (and moved the default => false to the default prod config) [14:36:55] ^d: cool. food first, though :) and I’ll take a look *carefully* this time :) also, you should try the puppet compiler! [14:37:04] https://wikitech.wikimedia.org/wiki/Puppet_Testing [14:37:10] (the last section is the only useful one there) [14:37:13] (PS1) Andrew Bogott: Attempt to increase the ttl for the SOA [puppet] - https://gerrit.wikimedia.org/r/197535 [14:37:37] Coren: ^ [14:37:45] <^d> Yeah doing it now [14:38:01] (CR) Ottomata: ":p" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/197411 (owner: Nuria) [14:38:33] RECOVERY - NTP on rhenium is OK: NTP OK: Offset -0.0009208917618 secs [14:39:16] (PS3) KartikMistry: config: CX: Enable newarticle campagin in cawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/197491 [14:39:45] <^d> Going to test against a master /and/ non-master node [14:41:18] <_joe_> ^d: I took your advice and split my patch in pieces [14:42:53] (CR) Filippo Giunchedi: [C: -1] "trying out UEFI first" [puppet] - https://gerrit.wikimedia.org/r/197526 (https://phabricator.wikimedia.org/T90922) (owner: Filippo Giunchedi) [14:43:25] (CR) coren: Attempt to increase the ttl for the SOA (1
comment) [puppet] - https://gerrit.wikimedia.org/r/197535 (owner: Andrew Bogott) [14:43:34] <^d> _joe_: I saw. I'll have a look at the poolcounter one first [14:43:38] <^d> That can likely go in whenever [14:43:41] andrewbogott: Comment over a... comment. :-) [14:44:12] (PS4) KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-mk] - https://gerrit.wikimedia.org/r/195244 (https://phabricator.wikimedia.org/T89936) [14:44:18] operations, ops-eqiad: cp1047 down - https://phabricator.wikimedia.org/T88045#1128679 (Cmjohnson) Dell sent a new DIMM but that was also bad. - Replaced DIMM b5 with new DIMM, rebooted and same error appeared during post MEMBIST Memory Test failure DIMM B5 - Swapped DIMM B5 to B1 and rebooted and durin... [14:44:41] (PS2) Andrew Bogott: Attempt to increase the ttl for the SOA [puppet] - https://gerrit.wikimedia.org/r/197535 [14:46:07] (PS2) QChris: Run analytics/refinery/source guards on stat1002 [puppet] - https://gerrit.wikimedia.org/r/195262 [14:47:02] (CR) coren: [C: 1] "Moar sane."
[puppet] - https://gerrit.wikimedia.org/r/197535 (owner: Andrew Bogott) [14:48:12] (CR) QChris: Run analytics/refinery/source guards on stat1002 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/195262 (owner: QChris) [14:49:13] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [14:49:13] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [14:50:03] (CR) Andrew Bogott: [C: 2] Attempt to increase the ttl for the SOA [puppet] - https://gerrit.wikimedia.org/r/197535 (owner: Andrew Bogott) [14:50:47] (CR) BryanDavis: "Comments on groups NOT removed in this patch:" [puppet] - https://gerrit.wikimedia.org/r/195840 (https://phabricator.wikimedia.org/T92259) (owner: Dzahn) [14:51:26] legoktm: You're the only one for SWAT this morning, do you just want to deploy your own patch? [14:51:29] operations, ops-eqiad: cp1047 down - https://phabricator.wikimedia.org/T88045#1128708 (Cmjohnson) New Work Order submitted [14:51:37] (CR) QChris: Run analytics/refinery/source guards on stat1002 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/195262 (owner: QChris) [14:51:42] anomie: yup, I can do it [14:51:52] ok [14:53:33] RECOVERY - puppet last run on mw2009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:59:11] (CR) Faidon Liambotis: IPsec: big off switch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/196498 (https://phabricator.wikimedia.org/T88536) (owner: Gage) [15:00:05] manybubbles, anomie, ^d, thcipriani, marktraceur, legoktm: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150318T1500).
[15:00:43] o/ [15:01:13] legoktm: I think I might try to sneak https://gerrit.wikimedia.org/r/#/c/197553/ into swat if no one minds [15:01:27] meh, I'll wait 'till tomorrow [15:01:36] let the change live in group0 for a bit today [15:02:13] I don't mind [15:03:28] PROBLEM - configured eth on mw2164 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:28] PROBLEM - configured eth on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:28] PROBLEM - configured eth on mw2162 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:28] PROBLEM - configured eth on mw2169 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:28] PROBLEM - configured eth on mw2168 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:29] PROBLEM - configured eth on mw2161 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:29] PROBLEM - configured eth on mw2166 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:37] PROBLEM - configured eth on mw2165 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:37] PROBLEM - configured eth on mw2167 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:37] PROBLEM - configured eth on mw2170 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:48] PROBLEM - dhclient process on mw2164 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:48] PROBLEM - dhclient process on mw2165 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:48] PROBLEM - dhclient process on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:48] PROBLEM - dhclient process on mw2169 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:48] PROBLEM - dhclient process on mw2170 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:03:48] PROBLEM - dhclient process on mw2162 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:49] PROBLEM - dhclient process on mw2161 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:49] PROBLEM - dhclient process on mw2166 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:50] PROBLEM - dhclient process on mw2167 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:50] PROBLEM - dhclient process on mw2168 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:03:52] building the submodule update - might sneak it in. OTOH that PROBLEM looks important [15:04:07] PROBLEM - nutcracker port on mw2161 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:07] PROBLEM - nutcracker port on mw2164 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:07] PROBLEM - nutcracker port on mw2167 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:07] PROBLEM - nutcracker port on mw2162 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:07] PROBLEM - nutcracker port on mw2165 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:08] PROBLEM - nutcracker port on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:08] PROBLEM - nutcracker port on mw2169 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:09] PROBLEM - nutcracker port on mw2168 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:09] PROBLEM - nutcracker port on mw2166 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:10] PROBLEM - nutcracker port on mw2170 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:10] PROBLEM - nutcracker process on mw2161 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:04:11] PROBLEM - nutcracker process on mw2162 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:11] PROBLEM - nutcracker process on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:12] PROBLEM - nutcracker process on mw2164 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:28] PROBLEM - salt-minion processes on mw2167 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:28] PROBLEM - salt-minion processes on mw2161 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:28] PROBLEM - salt-minion processes on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:28] PROBLEM - salt-minion processes on mw2165 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:28] PROBLEM - salt-minion processes on mw2168 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:29] PROBLEM - salt-minion processes on mw2170 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:29] PROBLEM - salt-minion processes on mw2169 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:30] PROBLEM - salt-minion processes on mw2162 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:30] PROBLEM - salt-minion processes on mw2166 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:31] PROBLEM - salt-minion processes on mw2164 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:47] PROBLEM - DPKG on mw2165 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:47] PROBLEM - DPKG on mw2166 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:47] PROBLEM - DPKG on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:47] PROBLEM - DPKG on mw2161 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:04:47] PROBLEM - DPKG on mw2167 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:48] PROBLEM - DPKG on mw2164 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:48] PROBLEM - DPKG on mw2162 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:49] PROBLEM - DPKG on mw2168 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:49] PROBLEM - DPKG on mw2170 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:50] PROBLEM - DPKG on mw2169 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:57] PROBLEM - Disk space on mw2169 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:58] PROBLEM - Disk space on mw2164 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:58] PROBLEM - Disk space on mw2162 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:58] PROBLEM - Disk space on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:58] PROBLEM - Disk space on mw2170 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:58] PROBLEM - Disk space on mw2167 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:58] PROBLEM - Disk space on mw2165 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:59] PROBLEM - Disk space on mw2168 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:04:59] PROBLEM - Disk space on mw2166 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:00] PROBLEM - Disk space on mw2161 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:05:10] operations, hardware-requests, Continuous-Integration-Isolation: eqiad: 2 hardware access request for CI isolation on labsnet - https://phabricator.wikimedia.org/T93076#1128751 (hashar) NEW [15:05:30] O.O [15:05:38] heh [15:05:48] PROBLEM - RAID on mw2162 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:48] PROBLEM - RAID on mw2161 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:48] PROBLEM - RAID on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:48] PROBLEM - RAID on mw2170 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:48] PROBLEM - RAID on mw2164 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:49] PROBLEM - RAID on mw2167 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:49] PROBLEM - RAID on mw2165 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:50] PROBLEM - RAID on mw2168 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:50] PROBLEM - RAID on mw2166 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:51] PROBLEM - RAID on mw2169 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:07:43] err, [15:07:47] RECOVERY - dhclient process on mw2166 is OK: PROCS OK: 0 processes with command name dhclient [15:07:47] RECOVERY - dhclient process on mw2162 is OK: PROCS OK: 0 processes with command name dhclient [15:07:50] is it ok to deploy?
[15:07:59] RECOVERY - nutcracker port on mw2161 is OK: TCP OK - 0.000 second response time on port 11212 [15:07:59] RECOVERY - nutcracker port on mw2162 is OK: TCP OK - 0.000 second response time on port 11212 [15:07:59] RECOVERY - nutcracker port on mw2166 is OK: TCP OK - 0.000 second response time on port 11212 [15:08:07] (CR) Legoktm: [C: 2] Load 3 extensions via extension registration [mediawiki-config] - https://gerrit.wikimedia.org/r/197427 (owner: Legoktm) [15:08:07] RECOVERY - nutcracker process on mw2161 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:08:08] RECOVERY - nutcracker process on mw2162 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:08:08] RECOVERY - nutcracker process on mw2166 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:08:08] RECOVERY - nutcracker process on mw2170 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:08:15] legoktm: the spam above is just codfw, not current prod [15:08:23] ok [15:08:28] RECOVERY - RAID on mw2161 is OK: OK: no RAID installed [15:08:28] RECOVERY - RAID on mw2170 is OK: OK: no RAID installed [15:08:28] RECOVERY - RAID on mw2167 is OK: OK: no RAID installed [15:08:28] RECOVERY - RAID on mw2165 is OK: OK: no RAID installed [15:08:28] RECOVERY - RAID on mw2166 is OK: OK: no RAID installed [15:08:28] RECOVERY - RAID on mw2169 is OK: OK: no RAID installed [15:08:28] RECOVERY - RAID on mw2162 is OK: OK: no RAID installed [15:08:29] RECOVERY - salt-minion processes on mw2165 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:08:29] RECOVERY - salt-minion processes on mw2167 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:08:30] RECOVERY - salt-minion processes on mw2162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:08:30] RECOVERY - salt-minion processes on
mw2161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:08:31] RECOVERY - salt-minion processes on mw2166 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:08:31] RECOVERY - salt-minion processes on mw2169 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:08:32] RECOVERY - salt-minion processes on mw2170 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:08:48] RECOVERY - configured eth on mw2162 is OK: NRPE: Unable to read output [15:08:48] RECOVERY - configured eth on mw2164 is OK: NRPE: Unable to read output [15:08:48] RECOVERY - configured eth on mw2168 is OK: NRPE: Unable to read output [15:08:48] RECOVERY - configured eth on mw2161 is OK: NRPE: Unable to read output [15:08:48] RECOVERY - configured eth on mw2169 is OK: NRPE: Unable to read output [15:08:48] RECOVERY - configured eth on mw2163 is OK: NRPE: Unable to read output [15:08:48] RECOVERY - configured eth on mw2167 is OK: NRPE: Unable to read output [15:08:49] RECOVERY - configured eth on mw2165 is OK: NRPE: Unable to read output [15:08:49] RECOVERY - configured eth on mw2170 is OK: NRPE: Unable to read output [15:08:50] RECOVERY - configured eth on mw2166 is OK: NRPE: Unable to read output [15:08:57] RECOVERY - Disk space on mw2169 is OK: DISK OK [15:08:57] RECOVERY - Disk space on mw2164 is OK: DISK OK [15:08:57] RECOVERY - Disk space on mw2163 is OK: DISK OK [15:08:57] RECOVERY - Disk space on mw2162 is OK: DISK OK [15:08:57] RECOVERY - Disk space on mw2170 is OK: DISK OK [15:08:58] RECOVERY - Disk space on mw2168 is OK: DISK OK [15:08:58] RECOVERY - Disk space on mw2167 is OK: DISK OK [15:08:59] RECOVERY - Disk space on mw2165 is OK: DISK OK [15:08:59] RECOVERY - Disk space on mw2161 is OK: DISK OK [15:09:00] RECOVERY - Disk space on mw2166 is OK: DISK OK [15:09:07] RECOVERY - dhclient process on mw2163 is OK: PROCS OK: 0 processes with command name dhclient 
[15:09:07] RECOVERY - dhclient process on mw2165 is OK: PROCS OK: 0 processes with command name dhclient [15:09:07] RECOVERY - dhclient process on mw2164 is OK: PROCS OK: 0 processes with command name dhclient [15:09:07] RECOVERY - dhclient process on mw2169 is OK: PROCS OK: 0 processes with command name dhclient [15:09:07] RECOVERY - dhclient process on mw2167 is OK: PROCS OK: 0 processes with command name dhclient [15:09:08] RECOVERY - dhclient process on mw2170 is OK: PROCS OK: 0 processes with command name dhclient [15:09:08] RECOVERY - dhclient process on mw2168 is OK: PROCS OK: 0 processes with command name dhclient [15:09:09] RECOVERY - dhclient process on mw2161 is OK: PROCS OK: 0 processes with command name dhclient [15:09:18] RECOVERY - nutcracker port on mw2167 is OK: TCP OK - 0.000 second response time on port 11212 [15:09:18] RECOVERY - nutcracker port on mw2163 is OK: TCP OK - 0.000 second response time on port 11212 [15:09:18] RECOVERY - nutcracker port on mw2169 is OK: TCP OK - 0.000 second response time on port 11212 [15:09:18] RECOVERY - nutcracker port on mw2165 is OK: TCP OK - 0.000 second response time on port 11212 [15:09:18] RECOVERY - nutcracker port on mw2164 is OK: TCP OK - 0.000 second response time on port 11212 [15:09:19] RECOVERY - nutcracker port on mw2170 is OK: TCP OK - 0.000 second response time on port 11212 [15:09:19] RECOVERY - nutcracker port on mw2168 is OK: TCP OK - 0.000 second response time on port 11212 [15:09:28] RECOVERY - nutcracker process on mw2163 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:09:28] RECOVERY - nutcracker process on mw2169 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:09:28] RECOVERY - nutcracker process on mw2165 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:09:28] RECOVERY - nutcracker process on mw2167 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker 
[15:09:28] RECOVERY - nutcracker process on mw2168 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:09:29] RECOVERY - nutcracker process on mw2164 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:09:47] RECOVERY - RAID on mw2163 is OK: OK: no RAID installed [15:09:47] RECOVERY - RAID on mw2164 is OK: OK: no RAID installed [15:09:47] RECOVERY - RAID on mw2168 is OK: OK: no RAID installed [15:09:47] RECOVERY - salt-minion processes on mw2163 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:09:47] RECOVERY - salt-minion processes on mw2168 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:10:26] <_joe_> that's me sorry guys [15:11:54] (in general anytime you see an alert on a hostname with a 4-digit number and the number is 2xxx, that's in codfw, which is not currently doing productiony things) [15:12:28] (Merged) jenkins-bot: Load 3 extensions via extension registration [mediawiki-config] - https://gerrit.wikimedia.org/r/197427 (owner: Legoktm) [15:12:36] gotcha [15:13:47] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:13:48] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:14:28] (CR) Ottomata: Run analytics/refinery/source guards on stat1002 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/195262 (owner: QChris) [15:14:47] (CR) Chad: [C: 2] MWRealm: add support for codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/197493 (https://phabricator.wikimedia.org/T91754) (owner: Giuseppe Lavagetto) [15:14:59] !log legoktm Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/197427 (duration: 00m 53s) [15:15:02] Logged the message, Master [15:16:06] woot [15:18:37] legoktm: cool. my turn?
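(Editorial sketch, not part of the channel traffic.) The rule of thumb given at 15:11:54, that an alert on a hostname carrying a 4-digit number starting with 2 comes from codfw, can be written as a small filter. This is a minimal sketch under stated assumptions: the function name `is_codfw_host` is hypothetical, the number is assumed to sit at the end of the short hostname, and nothing is inferred about other sites because the log only states the 2xxx rule.

```python
import re

# Assumption: the 4-digit number ends the short hostname (e.g. "mw2164").
# Only the "2xxx means codfw" rule comes from the log; everything else is illustrative.
_TRAILING_FOUR_DIGITS = re.compile(r"(?<!\d)(\d{4})$")


def is_codfw_host(hostname: str) -> bool:
    """Return True if the hostname's trailing 4-digit number starts with 2."""
    # Drop any domain part, e.g. "elastic1001.eqiad.wmnet" -> "elastic1001".
    short_name = hostname.split(".", 1)[0]
    match = _TRAILING_FOUR_DIGITS.search(short_name)
    return bool(match) and match.group(1).startswith("2")


if __name__ == "__main__":
    for host in ("mw2164", "graphite2001", "elastic1001.eqiad.wmnet", "rhenium"):
        print(host, is_codfw_host(host))
```

Someone watching the channel could use a check like this to mute or down-rank the mw21xx alert bursts seen above while codfw is not serving production traffic.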
[15:18:44] <^d> I've got one in the pipeline too [15:18:49] oh cool [15:18:57] ^d: I can sync it if you'd like [15:19:17] I just realized I also have to backport https://gerrit.wikimedia.org/r/#/c/197630/ ...and I just merged it :/ [15:19:18] ^d: it's not on the page. unless I ate it [15:19:33] <^d> manybubbles: I'm going rogue :p [15:19:33] i don't like to be annoying, but this cache issue is confusing a lot of users. [15:19:39] <_joe_> manybubbles: I did actually because I had no reviewers [15:19:53] _joe_: ? [15:20:07] <_joe_> I removed my patches from today's swat [15:20:10] ah [15:20:12] k [15:20:16] <_joe_> because no reviews [15:20:28] _joe_: oh - I can probably review them though [15:20:30] !log legoktm Synchronized php-1.25wmf21/includes/registration/ExtensionRegistry.php: https://gerrit.wikimedia.org/r/#/c/197630/ (duration: 00m 05s) [15:20:30] <_joe_> but now I have one, so let's do this [15:20:34] Logged the message, Master [15:20:53] _joe_: if you add it back I'll look at it [15:21:20] manybubbles, ^d: ok, I'm done now :) [15:21:29] PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: Puppet has 6 failures [15:21:29] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Puppet has 6 failures [15:21:29] PROBLEM - puppet last run on mw2169 is CRITICAL: CRITICAL: Puppet has 6 failures [15:21:29] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: Puppet has 6 failures [15:21:29] PROBLEM - puppet last run on mw2162 is CRITICAL: CRITICAL: Puppet has 6 failures [15:21:30] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: Puppet has 6 failures [15:21:30] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Puppet has 6 failures [15:21:31] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Puppet has 6 failures [15:21:31] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Puppet has 6 failures [15:21:32] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Puppet has 6 failures [15:21:40] legoktm: If you
have time, can we squeeze in https://gerrit.wikimedia.org/r/#/c/197503/ so I can merge the logging changes that h.oo made? [15:21:42] legoktm: thank you thank you [15:22:48] RECOVERY - puppet last run on mw2162 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:22:48] RECOVERY - puppet last run on mw2170 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:22:48] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:22:48] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:22:48] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:22:59] bd808: yeah, manybubbles and ^d are going now, and then I can do it after them [15:23:12] my hero [15:23:13] <^d> Whenever jenkins decides to catch the fuck up [15:23:23] manybubbles, ^d: we're experimenting having jenkins run tests for l10n bot, you might just want to V+2 for now [15:23:33] * bd808 hands ^d a clue-stick to whack things with [15:24:08] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:08] RECOVERY - puppet last run on mw2169 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:08] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:08] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:08] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:57] !log Fiddling with uwsgi on labmon1001, ignore errors [15:25:02] Logged the message, Master [15:25:30] !log manybubbles Synchronized php-1.25wmf21/extensions/CirrusSearch/includes/Util.php: SWAT fix 
some batch scripts in cirrus (duration: 00m 07s) [15:25:33] Logged the message, Master [15:26:27] <_joe_> manybubbles: added [15:26:49] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.018 second response time [15:27:08] _joe_: got it. just finished with mine - looks good. So now I'll review yours! [15:27:59] <_joe_> manybubbles: the db-codfw file is reviewed by sean, but still has no effect as the codfw appservers are not in the scap dsh list anyways [15:28:47] <^d> I already reviewed _joe_'s [15:28:54] <^d> MWRealm + db-codfw [15:28:59] <^d> It's waiting on a merge from jenkins [15:29:36] (03CR) 10Manybubbles: [C: 031] MWRealm: add support for codfw (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197493 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [15:29:50] ^d: just V+2 [15:29:56] (03CR) 10Manybubbles: [V: 032] MWRealm: add support for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197493 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [15:29:58] <^d> No [15:30:01] ? 
[15:30:01] <^d> I was being patient [15:30:03] <^d> :) [15:30:06] <^d> It's almost done [15:30:14] (03CR) 10Manybubbles: MWRealm: add support for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197493 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [15:30:21] * manybubbles will stop pretending to be a robot butler [15:30:27] <_joe_> manybubbles: I agree that that feels weird [15:30:34] <_joe_> but I kept the logic intact [15:30:46] _joe_: indeed - we do lots of weird stuff and I'm loath to break it [15:31:46] 6operations, 10Wikimedia-Labs-wikitech-interface, 7Regression: wikitech.wikimedia.org thumbnails broken - https://phabricator.wikimedia.org/T93041#1128870 (10Krinkle) [15:32:14] (03Merged) 10jenkins-bot: MWRealm: add support for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197493 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [15:34:12] !log demon Synchronized multiversion/MWRealm.php: initial codfw support (duration: 00m 06s) [15:34:17] Logged the message, Master [15:34:19] !log demon Synchronized wmf-config/db-codfw.php: initial codfw support (duration: 00m 06s) [15:34:22] Logged the message, Master [15:35:42] <^d> unrelated, but dafuq? [15:35:45] <^d> #012Fatal error: syntax error, unexpected T_STRING in /srv/mediawiki/php-1.25wmf21/includes/TemplateParser.php(136) : eval()'d code on line 1 [15:36:11] <_joe_> wat? [15:37:08] <^d> Mustache uses eval() :( [15:37:41] 6operations, 10RESTBase: (nodetool )cleanup needed on restbase1006 - https://phabricator.wikimedia.org/T93079#1128878 (10Eevans) [15:38:02] ^d: several voted against it, but it still got merged [15:38:08] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1128881 (10Nuria) We are going to enable our event-ingesting pipeline to use varnishkafka rtaher thna varnishncsa.... 
[15:38:26] ^d: that error probably means the template has a typo in it [15:38:40] <^d> Clearly. [15:38:41] * YuviPanda pats ^d [15:38:57] <_joe_> ^d: but why on earth are we using the standard php mustache library? [15:39:18] <_joe_> it's utterly horrible, I rewrote it from scratch at $DAYJOB~1 [15:39:20] _joe_: it's not, it's lightncandy. it doesn't require eval, only when you expect to store the code in memcache [15:39:32] <_joe_> ebernhardson: oh ok [15:40:02] <_joe_> ebernhardson: I definitely remember a moustache php library that used eval all over [15:40:29] _joe_: quite possibly, but this one compiles templates into php code, then you run the php code instead of parsing the template at runtime [15:40:53] <_joe_> ebernhardson: which makes perfect sense [15:41:08] <_joe_> it's what smarty taught us is "the way" [15:41:29] Anyone around who knows about thumbnail 404 handling? [15:41:49] ^d: have been moaning about that error for several days [15:42:42] ^d: you finished right? [15:42:52] <^d> Yes [15:43:00] (03CR) 10Legoktm: [C: 032] Log HttpErrors with a 500+ code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197503 (https://phabricator.wikimedia.org/T85795) (owner: 10Hoo man) [15:43:03] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1128897 (10RobH) p:5Normal>3High [15:43:03] I think I have some old code somewhere that knows how to find the real error from an eval failure... [15:43:04] <_joe_> manybubbles, ^d thanks a lot [15:43:16] I live to serve [15:43:27] <_joe_> bd808: it's called "the necronomicon", right? [15:43:44] <^d> ebernhardson, Krenair Maybe we should pass it through token_get_all() first ;-) [15:43:47] (03CR) 10Legoktm: [V: 032] Log HttpErrors with a 500+ code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197503 (https://phabricator.wikimedia.org/T85795) (owner: 10Hoo man) [15:43:52] <^d> Make sure it parses! [15:44:13] _joe_: heh.
it's in a lib called "suxles" because it suck less than our first gen attempt [15:44:19] ^d: sounds like a good option actually [15:44:26] (03CR) 10Jforrester: "How did this go?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197427 (owner: 10Legoktm) [15:44:32] !log legoktm Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/197503/ (duration: 00m 05s) [15:44:34] !log uploaded apertium-af-nl 0.2.0 on apt.wikimedia.org [15:44:38] <^d> ebernhardson: I'm being facetious. It's slow as fuck. [15:44:38] Logged the message, Master [15:44:41] Logged the message, Master [15:44:51] 6operations, 6Labs, 7Graphite: graphite.wmflabs.org is down - https://phabricator.wikimedia.org/T93083#1128900 (10yuvipanda) 3NEW a:3coren [15:44:58] bd808: ^ [15:45:18] legoktm: thx! [15:45:20] ^d: it shouldn't be compiled regularly, once before shoving it into cache should be fine [15:45:41] 6operations, 6Labs, 7Graphite: graphite.wmflabs.org is down - https://phabricator.wikimedia.org/T93083#1128908 (10coren) Currently being worked on. There is confusing is the upstart config that prevents it from properly determining status, confusing both puppet and icinga. [15:46:00] <^d> php_check_syntax() would be nice [15:46:06] <^d> If it didn't have the "and execute" bit [15:46:12] <^d> php_check_syntax — Check the PHP syntax of (and execute) the specified file [15:46:15] <^d> lol [15:46:20] (03CR) 10Legoktm: "Well! Nothing blew up AFAICT. Did require a backport of I4e5fc50059745a89fb69bc1e05a299fd9aaee968 so they would appear on Special:Version " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197427 (owner: 10Legoktm) [15:46:44] <^d> Oh it was removed [15:47:03] (03PS3) 10QChris: Run analytics/refinery/source guards on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/195262 [15:47:30] and execute. wow. 
[15:49:15] (03CR) 10QChris: Run analytics/refinery/source guards on stat1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195262 (owner: 10QChris) [15:49:16] 7Puppet, 6operations, 10Deployment-Systems, 10Staging: provider => trebuchet doesn't work until manual 'git deploy start' on deployment-server - https://phabricator.wikimedia.org/T92978#1128917 (10greg) p:5Triage>3Normal [15:49:24] ^d: alternatively, i suppose a unit test could iterate the possible templates and eval them to ensure no bad templates get committed [15:49:38] <^d> That's a good idea. [15:50:35] (03CR) 10Jforrester: "Awesome. :-) What's next, and when? Is there a Phabricator task?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197427 (owner: 10Legoktm) [15:51:23] Found the eval() error trap -- https://dpaste.de/tNWx [15:51:46] (03CR) 10QChris: Run analytics/refinery/source guards on stat1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195262 (owner: 10QChris) [15:51:54] 6operations, 6Labs, 7Graphite: graphite.wmflabs.org is down - https://phabricator.wikimedia.org/T93083#1128934 (10Nuria) [15:53:24] 6operations, 10RESTBase: (nodetool )cleanup needed on restbase1006 - https://phabricator.wikimedia.org/T93079#1128937 (10fgiunchedi) LGTM, are there mechanisms to alert us when a manual cleanup is needed? or perhaps how much data is pending cleanup so we can track it? [15:57:33] !log Started uwsgi on labmon1001 by hand (which works) so that graphite isn't broken during debugging. 
[15:57:37] Logged the message, Master [15:58:28] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.893 second response time [16:00:31] (03PS4) 10Ottomata: Run analytics/refinery/source guards on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/195262 (owner: 10QChris) [16:02:36] 6operations, 10ops-ulsfo: cp4009 hardware fault - https://phabricator.wikimedia.org/T92476#1129007 (10RobH) a:5RobH>3Cmjohnson I don't want to meet some dude at the datacenter. He called and it seems this was not a parts only dispatch, and he cannot just ship us the part. Please redo this warranty board... [16:03:01] 6operations, 6Labs, 7Graphite: graphite.wmflabs.org is down - https://phabricator.wikimedia.org/T93083#1129013 (10coren) More data: Currently, three distinct interfaces to manage uwsgi are in place: * sysvinit * upstart (which is split in subservices) * /sbin/uwsgictl (which invokes the upstart interface)... [16:03:35] (03CR) 10Ottomata: [C: 032] Run analytics/refinery/source guards on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/195262 (owner: 10QChris) [16:04:23] 6operations, 10ops-ulsfo: cp4009 hardware fault - https://phabricator.wikimedia.org/T92476#1129017 (10RobH) Reasoning for parts only dispatch: I don't feel like wasting half a day onsite waiting on, then working with, a random dell certified tech when its simply a mainboard swap. Since we don't need the syste... 
[16:08:15] (03PS4) 10Nikerabbit: config: CX: Enable newarticle campaign in cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197491 (owner: 10KartikMistry) [16:10:19] (03PS4) 10Gage: IPsec: big off switch [puppet] - 10https://gerrit.wikimedia.org/r/196498 (https://phabricator.wikimedia.org/T88536) [16:11:03] (03CR) 10Gage: "Use $PATH per Faidon's suggestion" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/196498 (https://phabricator.wikimedia.org/T88536) (owner: 10Gage) [16:11:49] ACKNOWLEDGEMENT - uWSGI web apps on labmon1001 is CRITICAL: CRITICAL: Not all configured uWSGI apps are running. Coren False positive service is running. Issue with test is being worked on. [16:26:18] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet last ran 11 hours ago [16:27:36] 6operations, 10ops-eqiad: Increase asw-d-eqiad uplink capacity - https://phabricator.wikimedia.org/T92914#1129162 (10Cmjohnson) I am short of sfp+ transceivers. I have 4 on-site. Waiting on https://rt.wikimedia.org/Ticket/Display.html?id=9190 [16:28:33] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban, 5Patch-For-Review: Eventlogging JS client should warn users when serialized event is more than "N" chars long and not sent the event [8 pts] - https://phabricator.wikimedia.org/T91918#1129163 (10mforns) 5Open>3Resolved [16:28:34] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) 
- https://phabricator.wikimedia.org/T91347#1129164 (10mforns) [16:29:35] (03CR) 10Alexandros Kosiaris: "Failing with" [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/195577 (https://phabricator.wikimedia.org/T92252) (owner: 10KartikMistry) [16:30:17] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:32:56] (03PS1) 10Ottomata: Set up eventlogging varnishkafka instance on production bits [puppet] - 10https://gerrit.wikimedia.org/r/197650 [16:33:13] (03CR) 10Ottomata: [C: 04-1] "WIP, not ready for deploy yet." [puppet] - 10https://gerrit.wikimedia.org/r/197650 (owner: 10Ottomata) [16:36:53] andrewbogott, what is this exactly? [16:36:58] mkdir: cannot create directory ‘/sys/fs/cgroup/memory/mediawiki/job/13520’: No such file or directory [16:36:58] limit.sh: failed to create the cgroup. [16:37:17] Krenair: no idea, where are you seeing that? [16:37:30] every time I try to run a line in mwscript on silver [16:38:52] Krenair: hm, that’s new [16:38:56] no [16:38:59] other people have messed with the config there, must be a regression [16:39:07] I've been seeing it for a while [16:39:11] how long? [16:39:11] I filed a bug [16:39:19] 7Blocked-on-Operations, 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: move cassandra submodule into puppet repo - https://phabricator.wikimedia.org/T92560#1129218 (10mobrovac) [16:39:28] few days [16:39:32] 4: https://phabricator.wikimedia.org/T92712 [16:39:47] surprised you didn't see that really. it is assigned to you... [16:40:17] Krenair, andrewbogott: it may be related to the change Tim made to always run mwscript as the www-data user [16:40:28] bd808: probably [16:40:36] That landed the day before that report I think [16:41:53] bd808: is mw in charge of creating one of those parent dirs? [16:42:20] akosiaris: you're really finding issues :) [16:42:23] Thanks!
[16:42:24] cgroups are black magic to me :/ [16:43:03] (03CR) 10Filippo Giunchedi: swift_new: lint and resource quoting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195607 (owner: 10Matanya) [16:45:43] 6operations, 10Wikimedia-IRC, 10Wikimedia-Labs-wikitech-interface: Enable irc feed for wikitech.wikimedia.org site - https://phabricator.wikimedia.org/T36685#1129227 (10Krenair) a:3Krenair I'll volunteer for it. [16:46:04] kart_: I tried fixing it btw, I admit I failed :-( [16:46:36] (03PS1) 10Alex Monk: Fix wmgRC2UDPPrefix generation to work with wikitech's non-protorel wgServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197652 (https://phabricator.wikimedia.org/T36685) [16:46:48] PROBLEM - MariaDB disk space on silver is CRITICAL: DISK CRITICAL - free space: /sys/fs/cgroup 0 MB (0% inode=99%): [16:46:51] kart_: I did get this though warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory [16:47:04] andrewbogott: silver:/sys/fs/cgroup/memory/mediawiki/job exists and looks like it has the right permissions. Not sure where to go from there. [16:47:09] that looks bad [16:47:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] Added initial Debian packaging [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/195577 (https://phabricator.wikimedia.org/T92252) (owner: 10KartikMistry) [16:47:17] bd808: that’s because I just created it. [16:47:29] PROBLEM - Disk space on silver is CRITICAL: DISK CRITICAL - free space: /sys/fs/cgroup 0 MB (0% inode=99%): [16:47:30] heh [16:47:32] bd808: but, it can’t write there, whatever fs or pseudo-fs that is is full [16:47:35] Ah, there you see [16:47:56] is that just a coincidence that icinga noticed as we were speaking about it? [16:48:15] Krenair: it’s because I created subdirs [16:48:18] Did someone do something? Or has icinga learnt English?
[16:48:18] ok [16:48:22] and there was only space for about one inode [16:48:31] I have never even heard of /sys/fs/cgroup before today [16:48:55] it's linux kernel process limits stuff [16:49:18] (03PS3) 10Filippo Giunchedi: diamond: stop collecting disk IO stats for partitions [puppet] - 10https://gerrit.wikimedia.org/r/197509 (https://phabricator.wikimedia.org/T1075) [16:49:22] Linux containers (LXC) is in part built on it [16:49:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] diamond: stop collecting disk IO stats for partitions [puppet] - 10https://gerrit.wikimedia.org/r/197509 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [16:49:42] godog: Can I get a hand with silver? I’m lost. [16:49:57] we use it to isolate exec() processes spawned by PHP to limit memory and such [16:49:58] andrewbogott: sure, reading backlog [16:50:11] godog: you can start by logging on to silver and typing ‘df’ [16:52:12] The wrapper script that tries to use cgroups is /srv/mediawiki/php-1.25wmf21/includes/limit.sh [16:52:12] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Setup the main appservers cluster in codfw - https://phabricator.wikimedia.org/T86893#1129264 (10Papaul) [16:52:14] 6operations, 10ops-codfw, 3wikis-in-codfw: Configure mw2001-2134 correctly - https://phabricator.wikimedia.org/T91238#1129261 (10Papaul) 5Open>3Resolved a:3Papaul Logical processor and redirection after boot are enabled on all mw servers. 
[16:52:35] ah, remounted with 1M [16:52:47] RECOVERY - Disk space on silver is OK: DISK OK [16:53:00] I have no idea why it was 4k to begin with tho [16:53:18] RECOVERY - MariaDB disk space on silver is OK: DISK OK [16:53:32] (03PS1) 10Chad: Move web::sites to web::prod_sites; begin unification in new class [puppet] - 10https://gerrit.wikimedia.org/r/197655 [16:53:40] <^d> YuviPanda|brb: hehe ^ [16:53:41] !log remount cgroup on silver with 1M of space [16:53:43] godog: that’s something that should’ve been set up by the trusty installer, right? [16:53:48] Logged the message, Master [16:53:49] Not partman or something I touched? [16:53:55] ^d: just saw that, and was like ‘waaait, is that the apache config? brave soul!' [16:54:04] 6operations: Purge > 90 days stat1002:/a/squid/archive/glam_nara - https://phabricator.wikimedia.org/T92340#1129270 (10leila) @kevinator, I'm not sure who currently uses that data and for what purposes. Do you know it? If not, Looping in @ezachte is a good idea. I know @Multichill is potentially interested in it. [16:54:27] (03CR) 10jenkins-bot: [V: 04-1] Move web::sites to web::prod_sites; begin unification in new class [puppet] - 10https://gerrit.wikimedia.org/r/197655 (owner: 10Chad) [16:54:29] andrewbogott: good question, I don't know what's supposed to be setting that up, better to track this in a ticket if not already [16:55:08] (03PS2) 10Chad: Move web::sites to web::prod_sites; begin unification in new class [puppet] - 10https://gerrit.wikimedia.org/r/197655 [16:55:14] ah this being T92712 [16:55:28] 6operations: mwscript showing errors on silver/labswiki/wikitech - https://phabricator.wikimedia.org/T92712#1129274 (10Andrew) cgroup was mounted with only 4k of space, so nothing could write there. Filippo remounted cgroup on silver with 1M of space and I created the needed subdirs and that seems to be making... 
[16:56:03] 6operations: mwscript showing errors on silver/labswiki/wikitech - https://phabricator.wikimedia.org/T92712#1129282 (10Krenair) 5Open>3Resolved [16:56:33] 6operations: mwscript showing errors on silver/labswiki/wikitech - https://phabricator.wikimedia.org/T92712#1118661 (10Krenair) a:5Andrew>3fgiunchedi [16:57:11] <^d> YuviPanda|brb: Baby initial step :) [16:57:19] ^d: :) indeed. [16:57:20] 6operations: mwscript showing errors on silver/labswiki/wikitech - https://phabricator.wikimedia.org/T92712#1129297 (10fgiunchedi) 5Resolved>3Open reopening, we don't know why this happened for reference, this is what I did ``` 9 2015-03-18T16:52:26Z sudo mount -o remount,relatime,size=1M,mode=755 /sy... [16:57:26] ^d: did your ES patch pass puppet compiler? [16:57:27] 6operations: mwscript showing errors on silver/labswiki/wikitech - https://phabricator.wikimedia.org/T92712#1129299 (10fgiunchedi) a:5fgiunchedi>3None [16:57:29] * YuviPanda|brb should take a look at it now [16:57:34] <^d> No, but I can't grok the output [16:57:53] <^d> http://puppet-compiler.wmflabs.org/638/change/197533/html/ [16:58:22] ^d: heh, you entered ‘elastic1001’. You should have entered: ‘elastic1001.eqiad.wmnet’ :) [16:58:29] (03PS1) 10Tim Landscheidt: Tools: Factor out registering with proxies [puppet] - 10https://gerrit.wikimedia.org/r/197658 (https://phabricator.wikimedia.org/T91954) [16:58:30] <^d> Boo, ok [16:58:43] 6operations, 10ops-codfw: ms-be2009.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T92833#1129303 (10Papaul) Drive replacement complete. [16:59:13] <^d> Retrying [16:59:19] <^d> https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/639/ [16:59:26] 6operations, 10RESTBase: (nodetool )cleanup needed on restbase1006 - https://phabricator.wikimedia.org/T93079#1129304 (10Eevans) > LGTM, are there mechanisms to alert us when a manual cleanup is needed? or perhaps how much data is pending cleanup so we can track it? Not really, no. 
It is limited to range mov... [16:59:48] (03CR) 10jenkins-bot: [V: 04-1] Tools: Factor out registering with proxies [puppet] - 10https://gerrit.wikimedia.org/r/197658 (https://phabricator.wikimedia.org/T91954) (owner: 10Tim Landscheidt) [17:01:39] 6operations: Investigate why cgroup on silver was only mounted with 4k of space - https://phabricator.wikimedia.org/T92712#1129309 (10Krenair) [17:02:00] (03CR) 10Tim Landscheidt: [C: 04-1] "I tested this piecemeally, but I have to set it up on Toolsbeta to see if the puppetry works. @Yuvipanda, could you take a look at the uw" [puppet] - 10https://gerrit.wikimedia.org/r/197658 (https://phabricator.wikimedia.org/T91954) (owner: 10Tim Landscheidt) [17:02:19] !log ms1001 - short maintenance downtime for bonding networks interfaces [17:02:24] Logged the message, Master [17:02:54] (03PS4) 10Dzahn: set up bonded interface for ms1001 plus ipv6 for it [puppet] - 10https://gerrit.wikimedia.org/r/193837 (owner: 10ArielGlenn) [17:03:38] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:04:36] (03CR) 10Dzahn: [C: 032] set up bonded interface for ms1001 plus ipv6 for it [puppet] - 10https://gerrit.wikimedia.org/r/193837 (owner: 10ArielGlenn) [17:05:26] <^d> YuviPanda|brb: http://puppet-compiler.wmflabs.org/639/change/197533/html/ [17:05:42] > Error: Cannot reassign variable rack at /opt/wmf/software/compare-puppet-catalogs/external/change/197533/puppet/manifests/role/elasticsearch.pp:35 on node elastic1001.eqiad.wmnet [17:05:48] ^d: ^ [17:05:56] <^d> fuck [17:06:11] (03PS1) 10Mobrovac: Adjust RESTBase / Cassandra settings for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/197662 (https://phabricator.wikimedia.org/T91102) [17:08:18] <_joe_> ^d: do you need assistance for some hieraization change? because I may find some time this weekened to do a good review. 
[17:09:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] "configure:2831: error: You don't have lt-print installed." [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493) (owner: 10KartikMistry) [17:11:31] 6operations, 7Icinga, 5Patch-For-Review: Icinga config fails on duplicate definition for host 'labs-ns0.wikimedia.org' - https://phabricator.wikimedia.org/T93053#1129334 (10Joe) 5Open>3Resolved [17:11:57] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Setup the main appservers cluster in codfw - https://phabricator.wikimedia.org/T86893#1129337 (10Joe) p:5Normal>3High [17:12:06] (03PS3) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493) [17:13:58] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [17:13:58] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [17:15:29] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "LGTM. Lintian is happy as well, only says:" [debs/contenttranslation/apertium-mk] - 10https://gerrit.wikimedia.org/r/195244 (https://phabricator.wikimedia.org/T89936) (owner: 10KartikMistry) [17:17:28] legoktm: got a minute? [17:17:36] legoktm: some python brainstorming... [17:19:52] YuviPanda: what's up? [17:20:34] legoktm: trying to figure out how to do https://phabricator.wikimedia.org/T85279 [17:20:54] legoktm: in a way that’ll scale for all of labs... [17:22:18] legoktm: so basic problem is that there’s some shared datastructures that should not be re-created for each request... [17:22:28] YuviPanda: I don't really understand that bug report :/ [17:22:31] like the parsed structure of the YAML files, and also the LDAP connection. 
[17:22:39] !log uploaded apertium-mkd_0.1.0-1 on apt.wikimedia.org [17:22:41] legoktm: ok, so let me start from the beginnng :) [17:22:45] Logged the message, Master [17:22:50] legoktm: an ‘ENC’ is an ‘External Node Classifier’ in puppet. [17:23:06] legoktm: puppet gives it a hostname, and it gives back a set of puppet classes to be applied for that host... [17:23:28] legoktm: right now, we’re using wikitech to write to LDAP, and puppet uses LDAP as the source of ‘hostname -> classes’ mapping... [17:23:34] legoktm: now I want to replace that with this YAML thing... [17:23:40] (with fallback to LDAP) [17:24:05] legoktm: so basically, this needs to be… something that could take as input a hostname, and return a set of roles, based off the YAML files + LDAP. [17:24:45] right now this ‘something’ is a shell script that parses the YAML files on each request, and also makes an LDAP connection on each request. It works for now, because it is available only on per-project puppetmasters, not labs-wide. So at most you get one request every 20-30seconds, and it’s ok [17:25:04] what's the bottleneck? parsing yaml or connecting to ldap? [17:25:27] legoktm: y’know, I haven’t actually measured… :P [17:25:35] legoktm: but my bet is on ‘both’... [17:25:40] (03CR) 10Alexandros Kosiaris: [C: 04-1] "configure:2831: error: You don't have lt-print installed." [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/197345 (owner: 10KartikMistry) [17:25:46] since it’s not just YAML parsing, but also regexing.. 
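(Editor's sketch of the ENC idea described above: a hostname goes in, a list of puppet classes comes out, with rules checked first and LDAP as the fallback. This is only an illustration under assumptions, not the real ldap-yaml-enc.py. The names `classify`, `rules` and `fallback` are hypothetical, the rules are shown as an already-parsed dict of regex to class list, and the LDAP lookup is stood in for by a plain callable.)

```python
import re
from typing import Callable, Dict, List


def classify(hostname: str,
             rules: Dict[str, List[str]],
             fallback: Callable[[str], List[str]]) -> List[str]:
    """Return the puppet classes for a hostname.

    rules maps a regex (as it might appear in the parsed YAML) to a
    list of class names. Every rule whose regex matches the whole
    hostname contributes its classes.
    """
    classes: List[str] = []
    for pattern, names in rules.items():
        if re.fullmatch(pattern, hostname):
            classes.extend(names)
    # Assumption: the fallback (LDAP in the discussion) is consulted
    # only when no rule matched at all.
    return classes if classes else fallback(hostname)


if __name__ == "__main__":
    example_rules = {r"i-[0-9a-f]+\.eqiad\.wmflabs": ["role::example"]}
    print(classify("i-0000015c.eqiad.wmflabs", example_rules,
                   lambda host: ["role::from_ldap"]))
```

(Whether rule matches and LDAP results should be merged rather than treated as either/or is not settled in the conversation; the sketch picks the simpler either/or reading.)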
[17:26:07] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:26:08] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:27:15] 6operations, 6Commons, 6MediaWiki-Core-Team, 6Multimedia: Deleted image pages do not show up as deleted until a manual action=purge - https://phabricator.wikimedia.org/T93052#1129391 (10Tgr) Side note: the less obvious but much easier way to upload screenshots is to just drag&drop the file into the comment... [17:27:18] YuviPanda: figure out which part is slow first? :P [17:27:28] legoktm: yeah, doing that now... [17:28:18] (03PS2) 10KartikMistry: Added missing Build-Depends [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/197345 [17:29:07] !log uploaded php5_5.3.10-1ubuntu3.17+wmf1ubuntu1 on apt.wikimedia.org for precise-wikimedia [17:29:12] _joe_: hasharAway ^ [17:29:12] Logged the message, Master [17:29:37] <_joe_> akosiaris: thanks [17:31:14] 6operations, 10ops-ulsfo: cp4009 hardware fault - https://phabricator.wikimedia.org/T92476#1129394 (10Cmjohnson) Congratulations: Work Order WO6747229 was successfully submitted. [17:31:41] <^d> _joe_: Yep, https://gerrit.wikimedia.org/r/#/c/197533/ [17:31:44] <^d> (sorry, was in mtg) [17:32:17] legoktm: yup, yaml [17:33:30] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 3Continuous-Integration-Isolation: Review Jenkins isolation architecture with Antoine - https://phabricator.wikimedia.org/T92324#1129399 (10greg) [17:33:37] YuviPanda: where's the current code? [17:33:54] (03CR) 10JanZerebecki: [C: 04-1] "Thank you. See inline commments." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194856 (https://phabricator.wikimedia.org/T91352) (owner: 10Nemo bis) [17:34:14] legoktm: https://github.com/wikimedia/operations-puppet/blob/production/modules/puppetmaster/files/ldap-yaml-enc.py [17:35:57] legoktm: this already has the yaml C bindings... 
[17:36:06] godog: /sys/fs/cgroup is 4k on holmium as well. That’s a new trusty box as of a week ago. [17:36:17] <^d> _joe_: Works in beta and staging ;-) [17:36:22] (03CR) 10Krinkle: [C: 031] Don't use bits for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197381 (https://phabricator.wikimedia.org/T92949) (owner: 10Aude) [17:36:25] YuviPanda: hmm, is there a place I can test it? [17:36:28] googling has taught me nothing about where that mount comes from [17:36:44] legoktm: yup, deployment-salt. [17:36:56] <_joe_> ^d: so much puppet crap we created " [17:37:00] <_joe_> works" [17:37:09] <_joe_> I would use another parameter for advertising it [17:37:10] <_joe_> :P [17:37:13] <^d> Hehe, it's much nicer now! [17:37:20] <_joe_> "it's 11 less line of code!" [17:37:31] <^d> How would I define it then? [17:37:38] <^d> For those hosts? [17:37:39] YuviPanda: how? :D [17:38:03] legoktm: /usr/local/bin/ldap-yaml-enc.py i-0000015c.eqiad.wmflabs [17:38:20] legoktm: that should output some yaml [17:40:48] YuviPanda: how fast does it need to be? time says 0m0.130s [17:41:22] <_joe_> legoktm: not that easy ;) [17:41:37] :P [17:41:39] yeah, it has to be fast… under load. [17:41:46] plus right now it loads only one yaml file... [17:41:46] <_joe_> legoktm: account for a ruby process that forks, blocking itself, shells out to executing the enc [17:42:02] <_joe_> gets the results [17:42:16] just write it in C? [17:42:17] * ^d just noticed has_lvs for staging [17:42:20] <^d> :) [17:42:35] !log messing with live code on testwiki to test a fix for https://phabricator.wikimedia.org/T93009 [17:42:37] legoktm: _joe_ but actually, if I just turn this into a straightforward, simple flask app, and then the ENC script itself just curls... 
[17:42:38] Logged the message, Master [17:42:42] twentyafterfour: ^^ [17:42:50] <^d> Oh beta, rather [17:42:52] <_joe_> legoktm: no the right way to do it is probably to create a daemon, and have a small relay that communicates with it via a socket [17:42:55] should be done by MW train time [17:43:08] <_joe_> but, ENC is the wrong answer to a puppet problem you have :D [17:43:36] <_joe_> or, we could patch puppet to communicate via a unix socket with its ENC [17:43:43] * YuviPanda removes site.pp, makes _joe_ use wikitech [17:43:50] YuviPanda: shouldn't be hard to flaskify. [17:43:52] ^d: does beta have the same DB LB setup as production? [17:44:05] * _joe_ knows YuviPanda's whereabouts for the next few months [17:44:08] legoktm: yeah, how would you deal with not having to load the YAML file all the time... [17:44:13] <^d> tgr: I don't think so? [17:44:17] _joe_: your cat is actually a spy for me… [17:44:23] <_joe_> ahah [17:44:32] <_joe_> someone finally notices here during meetings [17:44:43] andrewbogott: sorry I'm caught up ATM [17:44:48] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: Puppet last ran 6 days ago [17:45:12] (03PS1) 10Dzahn: Revert "set up bonded interface for ms1001 plus ipv6 for it" [puppet] - 10https://gerrit.wikimedia.org/r/197671 [17:46:08] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:46:24] (03PS2) 10Chad: Hiera-ize the Elasticsearch config [puppet] - 10https://gerrit.wikimedia.org/r/197533 [17:47:36] YuviPanda: just load it once and only re-load if mtime changes? [17:47:48] (03CR) 10Dzahn: [C: 032] Revert "set up bonded interface for ms1001 plus ipv6 for it" [puppet] - 10https://gerrit.wikimedia.org/r/197671 (owner: 10Dzahn) [17:47:48] legoktm: uh, where do I, uh, put it? [17:47:53] it’s been a while since I wrote any flask... [17:48:00] and I have no idea how the threading goes these days. 
[17:48:05] I guess I can’t just put it in a global [17:48:12] and g. is per-request... [17:50:38] (03CR) 10Alexandros Kosiaris: [C: 04-1 V: 04-1] "./configure: line 2934: AP_MKINCLUDE: command not found" [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493) (owner: 10KartikMistry) [17:51:06] 6operations, 10Wikimedia-Labs-wikitech-interface: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#1129484 (10Krenair) Unfortunately those files (silver.wikimedia.org:/a/backup/public/labswiki-*.gz) appear to be owned by root, which I'm guessing is why I can't download them fr... [17:51:48] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail [17:52:35] !log branching 1.25wmf22 [17:52:40] Logged the message, Master [17:53:16] (03PS3) 10Chad: Hiera-ize the Elasticsearch config [puppet] - 10https://gerrit.wikimedia.org/r/197533 [17:53:35] twentyafterfour: did you get https://gerrit.wikimedia.org/r/#/c/197633/ (branch update)? [17:53:42] or should i do a submodule update instead? [17:53:54] * aude forgot to add you as a reviewer... [17:54:21] aude: I didn't run the script yet ;) [17:54:30] ok, great [17:54:43] so merge that now before branching? no prob... 
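Editor's sketch of the "load it once and only re-load if mtime changes" idea from the ENC thread above. This is not the code from the actual patch. The class name `MtimeCachedFile`, the `loads` counter, and the use of JSON as a stand-in parser are assumptions for illustration; the real script reads YAML, so a YAML loader would presumably be passed in instead. A lock guards the cached value so that one module-level instance can be shared by threads within a process; with several worker processes, each would keep its own copy.

```python
import json
import os
import threading


class MtimeCachedFile:
    """Parse a file once and re-parse it only when its mtime changes."""

    def __init__(self, path, loader=json.load):
        self._path = path
        self._loader = loader          # hypothetical hook: swap in a YAML loader
        self._lock = threading.Lock()  # protects _mtime and _data across threads
        self._mtime = None
        self._data = None
        self.loads = 0                 # counts real parses, for illustration only

    def get(self):
        # One stat() per call is much cheaper than re-parsing the whole file.
        mtime = os.stat(self._path).st_mtime_ns
        with self._lock:
            if mtime != self._mtime:
                with open(self._path) as handle:
                    self._data = self._loader(handle)
                self._mtime = mtime
                self.loads += 1
            return self._data
```

A request handler would call `get()` on a shared instance and index into the returned mapping. Callers should treat the result as read-only, since all of them see the same object.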
[17:54:53] go ahead [17:55:08] thanks [17:55:49] twentyafterfour: all done [17:56:30] (03CR) 10Chad: [C: 032] Just use "en" as language code for WMCA wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195064 (https://phabricator.wikimedia.org/T88843) (owner: 10Nemo bis) [17:56:53] (03Merged) 10jenkins-bot: Just use "en" as language code for WMCA wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195064 (https://phabricator.wikimedia.org/T88843) (owner: 10Nemo bis) [17:57:39] (03PS7) 10Nemo bis: Hide "prefershttps" preference on HSTS domains (ru): it has no effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194856 (https://phabricator.wikimedia.org/T91352) [17:58:00] !log demon Synchronized wmf-config/InitialiseSettings.php: cawikimedia language code (duration: 00m 08s) [17:58:05] Logged the message, Master [17:58:17] 6operations: NIC misassigned (double entries) by jessie installer - https://phabricator.wikimedia.org/T90236#1129528 (10fgiunchedi) a:5fgiunchedi>3faidon [17:59:32] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1129530 (10RobH) I chatted with Mark about our spare levels and systems we have allocated. Since the one system with the disks is over-provisioned in terms o... [18:00:05] twentyafterfour, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150318T1800). 
[18:00:29] (03CR) 10Nemo bis: Hide "prefershttps" preference on HSTS domains (ru): it has no effect (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194856 (https://phabricator.wikimedia.org/T91352) (owner: 10Nemo bis) [18:02:30] 6operations: NIC misassigned (double entries) by jessie installer - https://phabricator.wikimedia.org/T90236#1129541 (10faidon) 5Open>3Resolved So I debugged this extensively — the [[ https://bugs.debian.org/765577 | Debian bug ]] has all the details of my analysis & a workaround. I've also increased its sev... [18:06:08] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1129560 (10RobH) Disk order = https://rt.wikimedia.org/Ticket/Display.html?id=9268 Once these come in, they'll go in one of the two machines (pending hostnam... [18:10:18] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:15:54] (03PS1) 10Chmarkine: [puppet] - 10https://gerrit.wikimedia.org/r/197686 (https://phabricator.wikimedia.org/T40516) [18:16:47] (03PS2) 10Chmarkine: tendril - Increase HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/197686 (https://phabricator.wikimedia.org/T40516) [18:17:35] (03CR) 10JanZerebecki: Hide "prefershttps" preference on HSTS domains (ru): it has no effect (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194856 (https://phabricator.wikimedia.org/T91352) (owner: 10Nemo bis) [18:17:44] (03CR) 10JanZerebecki: [C: 031] Hide "prefershttps" preference on HSTS domains (ru): it has no effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194856 (https://phabricator.wikimedia.org/T91352) (owner: 10Nemo bis) [18:17:50] (03PS3) 10Chmarkine: tendril - Increase HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/197686 (https://phabricator.wikimedia.org/T40516) [18:18:35] 6operations, 
10Incident-20150205-SiteOutage, 6MediaWiki-Core-Team, 10MediaWiki-Debug-Logging, and 2 others: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1129636 (10Legoktm) [18:22:16] (03CR) 10Aaron Schulz: [C: 031] poolcounter: add support for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197495 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [18:23:28] !log starting nodetool clean on restbase1005 [18:23:33] Logged the message, Master [18:25:07] 6operations: Delete stat1002:/a/squid/archive/sopa - https://phabricator.wikimedia.org/T92344#1129656 (10Ottomata) 5Open>3Resolved [18:25:34] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1129661 (10RobH) chatted with Gabriel @ office. We're allocating server francium for this task. [18:27:18] 6operations, 10Wikimedia-IRC, 10Wikimedia-Labs-wikitech-interface, 5Patch-For-Review: Enable irc feed for wikitech.wikimedia.org site - https://phabricator.wikimedia.org/T36685#1129670 (10Petrb) Oh yes! After 3 years this might finally get solved, then I will send a bottle of Champaigne to WMF ops. [18:28:46] (03CR) 10Aaron Schulz: "Since labswiki has it's own runner, this should work fine...everything will just use MySQL. That keeps a bit more separation too." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/190406 (owner: 10Ori.livneh) [18:28:51] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/P413" [puppet] - 10https://gerrit.wikimedia.org/r/197671 (owner: 10Dzahn) [18:29:08] 6operations: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1129671 (10RobH) 3NEW a:3RobH [18:29:35] 6operations, 10Datasets-General-or-Unknown: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503#1129680 (10RobH) [18:29:36] 6operations: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1129681 (10RobH) [18:29:42] 6operations: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1129671 (10RobH) [18:30:27] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1129685 (10RobH) I've linked T93113 for the setup and deployment of the system. I'm resolving this hardware-request task. [18:30:32] (03CR) 10Chmarkine: [C: 031] Hide "prefershttps" preference on HSTS domains (ru): it has no effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194856 (https://phabricator.wikimedia.org/T91352) (owner: 10Nemo bis) [18:30:38] ori: I might need your help with the uwsgi upstart/uwsgi control thing. I"ve been banging my head on it for a while but am making no headway. [18:31:47] 6operations, 10ops-eqiad: install 4 * 3TB disks in francium - https://phabricator.wikimedia.org/T93114#1129690 (10RobH) 3NEW a:3Cmjohnson [18:32:13] 6operations, 10RESTBase: (nodetool )cleanup needed on restbase1006 - https://phabricator.wikimedia.org/T93079#1129701 (10Eevans) I've started a `nodetool cleanup` and will monitor its progress. 
[18:32:18] 6operations: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1129704 (10RobH) [18:33:36] (03PS1) 1020after4: Add 1.25wmf22 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197695 [18:33:37] (03PS1) 1020after4: Wikipedias to 1.25wmf21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197696 [18:33:40] (03PS1) 1020after4: Group0 to 1.25wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197697 [18:35:36] (03CR) 10JanZerebecki: [C: 031] tendril - Increase HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/197686 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:35:53] (03CR) 1020after4: [C: 032] Add 1.25wmf22 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197695 (owner: 1020after4) [18:39:23] (03Merged) 10jenkins-bot: Add 1.25wmf22 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197695 (owner: 1020after4) [18:40:58] !log twentyafterfour Started scap: testwiki to php-1.25wmf22 and rebuild l10n cache [18:41:03] Logged the message, Master [18:43:02] 6operations, 10Wikimedia-Shop, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1129756 (10Dzahn) I got this reply from the "Plus team": ``` Effie here from the Plus team! I've started the process for issuing the SSL certificate for store.wikipedia.org.... [18:44:26] (03CR) 10Kaldari: [C: 031] [WikiGrok] Actor campaign suggests occupations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197508 (owner: 10Phuedx) [18:49:10] 6operations, 10ops-codfw, 7network, 3wikis-in-codfw: Codfw mediawiki appservers from any rows but row A can't communicate with the dhcp server - https://phabricator.wikimedia.org/T92815#1129778 (10RobH) a:5RobH>3Joe confirmed they were missing from vlan configuration and added, they shoudl work now. 
[18:53:10] 6operations, 10hardware-requests: order new array for dataset1001 - https://phabricator.wikimedia.org/T93118#1129802 (10ArielGlenn) 3NEW [18:54:00] 6operations, 10hardware-requests: order new array for dataset1001 - https://phabricator.wikimedia.org/T93118#1129809 (10ArielGlenn) [18:55:24] 6operations: Delete stat1002:/a/squid/archive/bannerImpressions - https://phabricator.wikimedia.org/T92330#1129814 (10atgo) @kevinator - @jgreen is on vacation this week. I'd really prefer to get confirmation from him before we delete anything that is potentially not replicated anywhere. Is that possible? [18:57:38] 6operations: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1129822 (10RobH) [18:58:35] 6operations: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1129671 (10RobH) [18:59:28] 6operations, 10Datasets-General-or-Unknown: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503#1129828 (10ArielGlenn) https://phabricator.wikimedia.org/T93118 not a blocker but this is needed for the final configuration; for short-term we can proxy through dataset1001 to the kiw... [19:01:34] Coren: sure. what's up? [19:04:11] 6operations: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1129854 (10RobH) [19:05:51] (03PS3) 10Ori.livneh: scap: lint [puppet] - 10https://gerrit.wikimedia.org/r/195680 (owner: 10Matanya) [19:06:04] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security: securing the RESTBase Cassandra cluster - https://phabricator.wikimedia.org/T92680#1129855 (10Eevans) OK, if this approach is acceptable, then I'd like to propose the following: - For the INPUT chain - Reject all traffic by default - All... [19:06:05] (03CR) 10Ori.livneh: [C: 032 V: 032] "Thanks, Matanya!" 
[puppet] - 10https://gerrit.wikimedia.org/r/195680 (owner: 10Matanya) [19:06:40] ori: https://phabricator.wikimedia.org/T93083 The short of it -> the service starts perfectly fine with 'service uwsgi start' through the sysvinit mechanism, but the upstart one doesn't for no clear reason and both puppet and icinga rely on it for status information. [19:08:13] (03PS1) 10Ori.livneh: Fix typo in I841c880b7 [puppet] - 10https://gerrit.wikimedia.org/r/197705 [19:08:24] (03CR) 10Ori.livneh: "matanya, fyi" [puppet] - 10https://gerrit.wikimedia.org/r/197705 (owner: 10Ori.livneh) [19:08:29] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix typo in I841c880b7 [puppet] - 10https://gerrit.wikimedia.org/r/197705 (owner: 10Ori.livneh) [19:08:59] (03PS1) 10RobH: setting francium ip address [dns] - 10https://gerrit.wikimedia.org/r/197707 [19:09:28] Coren: reading up, sec [19:09:33] (03CR) 10RobH: [C: 032] setting francium ip address [dns] - 10https://gerrit.wikimedia.org/r/197707 (owner: 10RobH) [19:10:04] Coren: what's the name of the instance that graphite.wmflabs.org is running on? [19:10:09] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [19:10:25] ori: It's on metal: labmon1001 [19:10:37] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: puppet fail [19:10:43] puppet fail on tin caused by I841c880b7, fixed by I0a5e783b5 [19:11:53] Coren: which project? [19:12:48] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:12:49] ori: Not a project, it's an actual server. :-) [19:13:08] ah! gotcha. [19:14:41] Coren: how does this problem actually manifest? graphite.wmflabs.org is up, so I assume the problem is that icinga is issuing alerts because it cannot correctly suss out the service status? [19:15:25] ori: That's correct. But also, 'uwsgictl start' doesn't do the trick; need to go sysvinit [19:15:43] uwsgictl just fails quietly with the service not starting. 
[19:16:10] icinga uses 'uwsgictl check', which obviously fails when it's not upstart running with uwsgi/app up [19:17:15] Coren: can I stop the service for a moment? [19:17:40] ori: Sure, nothing's life depends on it. [19:17:53] (03PS1) 10Yuvipanda: WIP: Make yaml+ldap ENC into a http service [puppet] - 10https://gerrit.wikimedia.org/r/197712 (https://phabricator.wikimedia.org/T85279) [19:18:04] legoktm: ^ [19:18:13] !log Stopping uWSGI on labmon1001 to troubleshoot T93083 [19:18:16] Logged the message, Master [19:18:32] legoktm: y’know, I could probably make this a normal TCP service and be much more confident of the threading semantics. [19:18:49] (03CR) 10jenkins-bot: [V: 04-1] WIP: Make yaml+ldap ENC into a http service [puppet] - 10https://gerrit.wikimedia.org/r/197712 (https://phabricator.wikimedia.org/T85279) (owner: 10Yuvipanda) [19:19:02] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security: securing the RESTBase Cassandra cluster - https://phabricator.wikimedia.org/T92680#1129893 (10GWicke) @eevans, sounds good to me. [19:19:05] heh [19:19:18] !log Updated email of global account "Ar-ras". The email I set for it on February 17 was outdated. [19:19:21] Logged the message, Master [19:19:41] legoktm: but can you see anything obviously wrong with that flask code? I think the global dict there looks very fishyu [19:19:43] *fishy [19:20:29] "fishyu" sounds like a pokémon. :-) [19:20:37] YuviPanda: aside from the global it looks fine. [19:20:52] legoktm: yeah, but I bet the global is going to cause problems at some point... [19:21:33] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.013 second response time [19:23:01] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security: securing the RESTBase Cassandra cluster - https://phabricator.wikimedia.org/T92680#1129905 (10mobrovac) LGTM. One small comment. 
Since we are securing the cluster, how about allowing SSH only from bastion to be on the safe side? [19:23:25] ottomata: mobrovac suggested that I ping you about https://phabricator.wikimedia.org/T92560 ? [19:23:44] 6operations: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1129906 (10RobH) [19:24:49] 6operations, 10Citoid, 6Services: Add citoid service alerts to the "services" group for SMS alerts - https://phabricator.wikimedia.org/T92887#1129909 (10Dzahn) I added a contact in the private puppet repo in files/nagios/contacts.cfg. adding some comments inline: ``` define contact{ contact_nam... [19:28:23] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:28:24] 6operations, 10Citoid, 6Services: Add citoid service alerts to the "services" group for SMS alerts - https://phabricator.wikimedia.org/T92887#1129923 (10GWicke) @dzahn, it might make sense to also ping the entire team by mail: services@wikimedia.org [19:28:57] (03PS1) 10Dzahn: parsoid: add mobrovac to icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/197717 (https://phabricator.wikimedia.org/T92887) [19:29:46] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security: securing the RESTBase Cassandra cluster - https://phabricator.wikimedia.org/T92680#1129928 (10Eevans) > LGTM. One small comment. Since we are securing the cluster, how about allowing SSH only from bastion to be on the safe side? We could, I guess.... [19:31:27] 6operations, 10Citoid, 6Services, 5Patch-For-Review: Add citoid service alerts to the "services" group for SMS alerts - https://phabricator.wikimedia.org/T92887#1129929 (10Dzahn) >>! In T92887#1129923, @GWicke wrote: > @dzahn, it might make sense to also ping the entire team by mail: services@wikimedia.org... 
[19:33:21] !log twentyafterfour Finished scap: testwiki to php-1.25wmf22 and rebuild l10n cache (duration: 52m 23s) [19:33:24] Logged the message, Master [19:34:02] fixing l10nupdate made scap slow again :/ [19:34:22] urandom: he might be out for lunch [19:34:27] /cc ottomata [19:35:39] 6operations, 10Citoid, 6Services, 5Patch-For-Review: Add citoid service alerts to the "services" group for SMS alerts - https://phabricator.wikimedia.org/T92887#1129930 (10Dzahn) regarding the existing contact group called "parsoid". it currently has: rkattouw,jamesf i'm about to add: mobrovac @Jdforres... [19:36:22] 6operations, 10Citoid, 6Services, 5Patch-For-Review: Add citoid service alerts to the "services" group for SMS alerts - https://phabricator.wikimedia.org/T92887#1129935 (10GWicke) @dzahn, I think it makes sense to generalize it to 'services'. [19:38:27] Coren: /var/log/upstart/uwsgi_app-graphite-web.log shows the upstart-managed instance as failing to start because: "chmod(): Operation not permitted [core/utils.c line 237]". If you run "strace -f sudo -u www-data -g www-data /usr/bin/uwsgi --autoload --ini /etc/uwsgi/apps-enabled/graphite-web.ini 2>&1 | grep chmod", you see: [19:38:35] [pid 7109] chmod("/sys/fs/cgroup/memory/graphite-web", 0700) = -1 EPERM (Operation not permitted) [19:38:35] [pid 7109] write(2, "chmod(): Operation not permitted"..., 57chmod(): Operation not permitted [core/utils.c line 237] [19:38:36] bd808: the rsync part was about 30 minutes, does l10update cause more files to need syncing? [19:38:53] Coren: this fingers as the culprit, though I don't know why it works on graphite1001. [19:39:13] ori: Hm. a chmod on a pseudofile in /sys is, indeed, not a valid operation. [19:39:21] twentyafterfour: yeah. ~288 per active branch [19:39:32] graphite1001 might let it slide because older kernel that is being permissive? [19:39:58] twentyafterfour: but maybe we are swamping out the rsync servers again too? 
[19:40:08] * bd808 looks at ganglia for likely problems [19:40:22] Coren: per , "uWSGI has to be run as root to use cgroups. uid and gid are very, very necessary. ". But the upstart jobs setuid and setgid www-data. [19:40:46] A-ha. [19:40:49] Good catch. [19:41:02] Coren: again, though -- doesn't explain how it works on graphite1001 [19:41:56] checking 'mount' i see "cgroup on /sys/fs/cgroup/memory type cgroup (rw,relatime,memory)" on both [19:42:05] twentyafterfour: mw1097 looks like it too an unfair number of clients -- http://ganglia.wikimedia.org/latest/?c=Application%20servers%20eqiad&h=mw1097.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [19:43:05] bd808: any way to optimize that? [19:43:11] is it randomized? [19:43:11] 6operations, 10Citoid, 6Services, 5Patch-For-Review: Add citoid service alerts to the "services" group for SMS alerts - https://phabricator.wikimedia.org/T92887#1129950 (10Dzahn) ``` define contact{ contact_name team_services alias Services Team... [19:44:37] twentyafterfour: the list isn't shuffled which I've been meaning to add [19:44:54] ori: Want to expend more effort to see why the switch to www-data works on graphite or want to switch back to root? [19:45:19] mw1033 actually took twice as many hits [19:45:39] by count mw1097 was actually the second lowest [19:45:40] Coren: it never ran as root; the cgroup change made that a requirement [19:45:53] Ah! 
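Editor's sketch of a preflight check for the failure diagnosed above: a uWSGI config that asks for a cgroup while the service is started as a non-root user. The function names and the sample ini content in the usage note are hypothetical; the real graphite-web.ini is not shown in the log. Comparing the effective UID against 0 is only a rough proxy for "allowed to manage cgroups".

```python
import configparser


def cgroup_paths(ini_text):
    """Return the cgroup path requested in the [uwsgi] section, if any."""
    parser = configparser.ConfigParser(
        strict=False,          # tolerate repeated keys
        interpolation=None,    # do not treat '%' specially
        allow_no_value=True,   # tolerate bare flags such as 'master'
    )
    parser.read_string(ini_text)
    if not parser.has_section("uwsgi"):
        return []
    value = parser.get("uwsgi", "cgroup", fallback="")
    return [value] if value else []


def preflight(ini_text, euid):
    """Return a warning string if cgroups are requested without root, else None."""
    paths = cgroup_paths(ini_text)
    if paths and euid != 0:
        return "cgroup option set (%s) but effective uid is %d, not 0" % (
            paths[0], euid)
    return None
```

Called as `preflight(open(path).read(), os.geteuid())` from whatever wrapper starts the service, this would have flagged the mismatch between the cgroup setting and the setuid/setgid www-data in the upstart job before uWSGI hit the chmod error.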
[19:45:54] Coren: I'd probably revert for now instead [19:46:20] (03CR) 1020after4: [C: 032] Wikipedias to 1.25wmf21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197696 (owner: 1020after4) [19:46:28] (03Merged) 10jenkins-bot: Wikipedias to 1.25wmf21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197696 (owner: 1020after4) [19:46:35] (03PS2) 10Dzahn: parsoid: add mobrovac and services team to contacts [puppet] - 10https://gerrit.wikimedia.org/r/197717 (https://phabricator.wikimedia.org/T92887) [19:47:19] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf21 [19:47:22] Logged the message, Master [19:47:37] Coren: we could revert that and replace it with the line: "limit rss 8589934592 8589934592 ", in the upstart config [19:47:42] (03PS1) 10coren: Revert "graphite: limit uwsgi workers memory" [puppet] - 10https://gerrit.wikimedia.org/r/197721 [19:47:49] (03PS3) 10Dzahn: parsoid: add mobrovac and services team to contacts [puppet] - 10https://gerrit.wikimedia.org/r/197717 (https://phabricator.wikimedia.org/T92887) [19:47:55] (03CR) 10jenkins-bot: [V: 04-1] Revert "graphite: limit uwsgi workers memory" [puppet] - 10https://gerrit.wikimedia.org/r/197721 (owner: 10coren) [19:48:33] 6operations, 10Citoid, 6Services, 5Patch-For-Review: Add citoid service alerts to the "services" group for SMS alerts - https://phabricator.wikimedia.org/T92887#1129959 (10GWicke) @dzahn: Great, thanks! [19:48:38] Oh bleh, rebase won't even work because of a mismerge. [19:48:39] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security: securing the RESTBase Cassandra cluster - https://phabricator.wikimedia.org/T92680#1129960 (10BBlack) monitoring and bastion ssh are already account for in our puppet base::firewall stuff, we'd just need to set it up with ferm::rule for the app-leve... 
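Editor's note on the number in the proposed `limit rss 8589934592 8589934592` line: it is 8 GiB expressed in bytes, given once as the soft limit and once as the hard limit. A small sketch of the arithmetic follows; the helper name `upstart_rss_limit` is made up for illustration.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def upstart_rss_limit(gib):
    """Format an upstart 'limit rss SOFT HARD' stanza with both limits equal."""
    limit_bytes = gib * GIB
    return "limit rss %d %d" % (limit_bytes, limit_bytes)


print(upstart_rss_limit(8))  # limit rss 8589934592 8589934592
```

Whether the kernel in use actually enforces an RSS rlimit is worth verifying separately; the setrlimit(2) man page is the place to check.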
[19:49:36] (03CR) 1020after4: [C: 032] Group0 to 1.25wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197697 (owner: 1020after4) [19:49:41] (03Merged) 10jenkins-bot: Group0 to 1.25wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197697 (owner: 1020after4) [19:49:44] (03PS4) 10BBlack: tendril - Increase HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/197686 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [19:50:02] (03CR) 10Dzahn: [C: 032] parsoid: add mobrovac and services team to contacts [puppet] - 10https://gerrit.wikimedia.org/r/197717 (https://phabricator.wikimedia.org/T92887) (owner: 10Dzahn) [19:50:04] (03CR) 10BBlack: [C: 032 V: 032] tendril - Increase HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/197686 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [19:50:20] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf22 [19:50:23] Logged the message, Master [19:50:29] mutante: merged yours too [19:51:44] (03PS2) 10coren: Revert "graphite: limit uwsgi workers memory" [puppet] - 10https://gerrit.wikimedia.org/r/197721 [19:52:03] !log twentyafterfour Purged l10n cache for 1.25wmf20 [19:52:06] Logged the message, Master [19:53:08] (03CR) 10Ori.livneh: [C: 031] "It doesn't work on graphite1001 either; the upstart-managed service isn't running. We could replace this with "limit rss 8589934592 858993" [puppet] - 10https://gerrit.wikimedia.org/r/197721 (owner: 10coren) [19:53:40] Ah! [19:53:48] (03CR) 10coren: [C: 032] Revert "graphite: limit uwsgi workers memory" [puppet] - 10https://gerrit.wikimedia.org/r/197721 (owner: 10coren) [19:54:16] Coren: let's add filippo and giuseppe as reviewers (even if it's merged) so that they're up to speed [19:55:12] ori: kk. [19:55:17] twentyafterfour: I lied. 
We do shuffle the sync-poxy list when trying to find the closest host [19:55:28] greg-g: is there time for me to do a small CentralAuth backport (https://gerrit.wikimedia.org/r/#/c/197719/2/includes/UsersToRename/UsersToRenameDatabaseUpdates.php,cm) ? so I can re-start the SULF notification script [19:55:40] So bad RNG I guess? Or we have an overloaded rack [19:56:10] ori: Are you the one who disabled puppet on labmon1001? [19:56:19] Coren: yes, re-enabling it now [19:58:00] (03CR) 10Ori.livneh: "Summary of the problems:" [puppet] - 10https://gerrit.wikimedia.org/r/197721 (owner: 10coren) [19:58:53] RECOVERY - uWSGI web apps on labmon1001 is OK: OK: All defined uWSGI apps are runnning. [19:58:58] Coren: [19:59:00] root@labmon1001:/sys/fs/cgroup/memory/user# uwsgictl status [19:59:00] uwsgi/app (graphite-web) start/running, process 9688 [19:59:00] root@labmon1001:/sys/fs/cgroup/memory/user# uwsgictl check [19:59:02] OK: All defined uWSGI apps are runnning. [19:59:08] Yeay! Success! [19:59:14] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 1.025 second response time [19:59:33] RECOVERY - uWSGI web apps on graphite2001 is OK: OK: All defined uWSGI apps are runnning. [19:59:58] o_O. Nobody noticed the alert on graphite2001 I take it. [20:00:04] gwicke, cscott, arlolra, subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150318T2000). 
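Editor's sketch of the shuffling idea from the sync-proxy exchange above: randomizing the candidate order before probing, so that many clients starting at once do not all settle on the first host in the list. This is not scap's code. The function name, the `is_reachable` callback, and the host names in the usage note are hypothetical, and the "first reachable host wins" rule is a simplification of "find the closest host".

```python
import random


def pick_proxy(proxies, is_reachable, rng=random):
    """Return the first reachable proxy from a shuffled copy of the list."""
    candidates = list(proxies)   # copy, so the caller's list keeps its order
    rng.shuffle(candidates)      # spread clients across proxies
    for host in candidates:
        if is_reachable(host):
            return host
    return None                  # no proxy answered
```

For example, `pick_proxy(["proxy-a", "proxy-b"], probe)` with a `probe` function that attempts a connection. Passing a seeded `random.Random` as `rng` makes the choice reproducible for testing. As noted in the channel, even a shuffled list can still produce an uneven spread on any single run.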
[20:02:36] 6operations: Delete stat1002:/a/squid/archive/teahouse - https://phabricator.wikimedia.org/T92335#1130001 (10Ottomata) 5Open>3Resolved [20:03:09] 6operations: Delete stat1002:/a/squid/archive/sampled-geocoded - https://phabricator.wikimedia.org/T92334#1130011 (10Ottomata) 5Open>3Resolved [20:04:09] 6operations: Delete stat1002:/a/squid/archive/mobile-geocoded - https://phabricator.wikimedia.org/T92333#1130013 (10Ottomata) 5Open>3Resolved [20:04:12] 6operations, 6Labs, 7Graphite: graphite.wmflabs.org is down - https://phabricator.wikimedia.org/T93083#1130015 (10coren) The root cause of the issue was the switch to using cgroup via https://gerrit.wikimedia.org/r/#/c/183256/ which made uwsgi require root to start, whereas the upstart scripts drop root priv... [20:04:14] 6operations, 6Labs, 7Graphite: graphite.wmflabs.org is down - https://phabricator.wikimedia.org/T93083#1130018 (10coren) 5Open>3Resolved [20:05:04] 6operations: Delete stat1002:/a/squid/archive/edits-geocoded - https://phabricator.wikimedia.org/T92332#1130027 (10Ottomata) 5Open>3Resolved [20:05:52] (03CR) 10Ori.livneh: "Another possibility: the limit-as, reload-on-as, reload-on-rss options, per " [puppet] - 10https://gerrit.wikimedia.org/r/197721 (owner: 10coren) [20:05:59] 6operations, 10Wikimedia-Blog: Delete stat1002:/a/squid/archive/blog - https://phabricator.wikimedia.org/T92331#1130036 (10Ottomata) 5Open>3Resolved [20:06:40] 6operations: Delete stat1002:/a/squid/archive/bannerImpressions - https://phabricator.wikimedia.org/T92330#1130038 (10Ottomata) 5Open>3Resolved [20:07:09] 6operations: Delete stat1002:/a/squid/archive/arabic-banner - https://phabricator.wikimedia.org/T92329#1130044 (10Ottomata) 5Open>3Resolved [20:07:43] Coren: re-favor: could you add urandom to the wmf ldap group? He's Eric from the services team. 
[20:09:09] 6operations, 6MediaWiki-Core-Team, 6Multimedia, 6Parsoid-Team, and 3 others: Prepare Platform April 2015 quarterly review presentation - https://phabricator.wikimedia.org/T91803#1130056 (10bd808) [20:09:23] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [20:09:53] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner [20:10:17] ^ known [20:10:37] 6operations, 6MediaWiki-Core-Team, 6Multimedia, 6Parsoid-Team, and 3 others: Prepare Platform April 2015 quarterly review presentation - https://phabricator.wikimedia.org/T91803#1096558 (10bd808) [20:10:59] !log twentyafterfour Started scap: Security patches to php-1.25wmf22 [20:11:02] Logged the message, Master [20:13:04] 6operations: Delete stat1002:/a/squid/archive/bannerImpressions - https://phabricator.wikimedia.org/T92330#1130077 (10awight) @Ottomata: Please clarify "resolved" here, we were hoping to get a 1-week reprieve to loop in Jgreen. [20:13:52] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [20:14:24] PROBLEM - puppet last run on mw2026 is CRITICAL: CRITICAL: puppet fail [20:14:33] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [20:15:00] !log deployed parsoid sha b48f6e25 [20:15:01] ori: Is there a phab ticket for this? 
[20:15:05] Logged the message, Master [20:17:07] (03PS1) 10Ottomata: Remove unused udp2log::nginx and relay configs from gadolinium and protactinium [puppet] - 10https://gerrit.wikimedia.org/r/197732 (https://phabricator.wikimedia.org/T92337) [20:20:33] (03CR) 10Ottomata: [C: 032] Remove unused udp2log::nginx and relay configs from gadolinium and protactinium [puppet] - 10https://gerrit.wikimedia.org/r/197732 (https://phabricator.wikimedia.org/T92337) (owner: 10Ottomata) [20:21:56] 6operations, 6MediaWiki-Core-Team, 6Multimedia, 6Parsoid-Team, and 3 others: Prepare Platform April 2015 quarterly review presentation - https://phabricator.wikimedia.org/T91803#1130115 (10bd808) [20:26:39] Coren: there is not, is it enough for me to create one? [20:27:09] urandom: If you please, for tracability. :-) [20:30:54] HMMMM [20:31:20] Coren: https://phabricator.wikimedia.org/T93133 [20:31:46] 10Ops-Access-Requests, 6operations: login access for graphite - https://phabricator.wikimedia.org/T93133#1130141 (10coren) [20:32:06] (03PS1) 10BBlack: fixup anti-sub mitigation [puppet] - 10https://gerrit.wikimedia.org/r/197734 [20:32:42] 6operations, 10Citoid, 6Services, 5Patch-For-Review: Add citoid service alerts to the "services" group for SMS alerts - https://phabricator.wikimedia.org/T92887#1130146 (10Dzahn) since the citoid monitor in the LVS module already uses the parsoid contact group: ``` modules/lvs/manifests/monitor.pp: lv... 
[20:32:55] !log twentyafterfour Finished scap: Security patches to php-1.25wmf22 (duration: 21m 56s) [20:32:59] Logged the message, Master [20:33:03] RECOVERY - puppet last run on mw2026 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:33:31] (03PS2) 10Tim Landscheidt: Tools: Factor out registering with proxies [puppet] - 10https://gerrit.wikimedia.org/r/197658 (https://phabricator.wikimedia.org/T91954) [20:36:44] (03Abandoned) 10BBlack: fixup anti-sub mitigation [puppet] - 10https://gerrit.wikimedia.org/r/197734 (owner: 10BBlack) [20:36:48] 6operations, 10Citoid, 6Services, 5Patch-For-Review: Add citoid service alerts to the "services" group for SMS alerts - https://phabricator.wikimedia.org/T92887#1130169 (10Dzahn) 5Open>3Resolved a:3Dzahn [20:37:07] urandom: What's your shell username? [20:37:16] eevans [20:37:39] eevans is already a member of the group, skipping. [20:37:39] No changes to make; exiting. [20:38:24] Coren: so maybe this is not how one logs into graphite? [20:38:54] sorry for the typo ori [20:39:03] !log deployed restbase 73cc02abdb [20:39:03] on graphite and on icinga it should work for members of wmf or nda [20:39:06] urandom: It is, but you have to use your /wikitech/ username [20:39:07] Logged the message, Master [20:39:27] (And the password thereof) [20:39:37] 10Ops-Access-Requests, 6operations: login access for graphite - https://phabricator.wikimedia.org/T93133#1130178 (10coren) 5Open>3Invalid a:3coren ```eevans is already a member of the group, skipping.``` You are already cool enough. 
:-) [20:40:33] Coren: yeah, that's what I've been doing [20:40:50] (03PS3) 10Matanya: swift_new: lint and resource quoting [puppet] - 10https://gerrit.wikimedia.org/r/195607 [20:41:09] (03CR) 10Tim Landscheidt: "Tested the puppetry on Toolsbeta; tool-nodejs and tool-uwsgi-python get executed without import failures, but stop because the proxies are" [puppet] - 10https://gerrit.wikimedia.org/r/197658 (https://phabricator.wikimedia.org/T91954) (owner: 10Tim Landscheidt) [20:42:04] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [20:42:10] urandom: does it work on icinga.wikimedia.org ? [20:42:41] mutante: it does, yes [20:42:59] then it's not the LDAP membership but something on graphite i suppose [20:43:49] 6operations: Strange trick makes girls fuck you? - https://phabricator.wikimedia.org/T93137#1130201 (10emailbot) [20:44:00] .. [20:44:06] o.O [20:44:33] wtf? [20:44:48] chasemp: is anyone using the phab email gateway for legitimate reasons? [20:44:53] why was that email allowed through? [20:45:10] paravoid: not sure what you mean, like incoming mail or? [20:45:16] yes [20:45:22] Platonides, for the same reason that you yourself sometimes receive spam [20:45:33] what I'm really asking is, can we turn that thing off so that we don't get tickets such as the above? [20:45:34] oh yes tons of comments and some new tickets come in from email [20:45:42] legitimate ones? 
[20:46:04] offhand yes, but I could break down numbers if we want to consider closing the avenue [20:46:07] MaxSem, I mean why did phabricator allow creating a task by email from an account it never saw before [20:46:13] it would be nice if MarkMonitor could mail to domain tickets but we haven't really started using it [20:46:14] but it's quite common to respond via email for a task comment [20:46:31] Platonides, because it's configured to work the same way RT did [20:46:34] legit ones I mean [20:46:42] AFAIK, only Legal routinely creates tickets by email and we should be weaning them away from it. :-) [20:46:52] oh is there another legal issue [20:46:57] that's from rt redirects [20:47:00] that I would love to kill [20:47:06] (I don't see wikibugs) [20:47:29] chasemp: https://phabricator.wikimedia.org/T93137#1130201 [20:47:30] paravoid: sorry scrap that, let's please make old rt addresses not do that and just bounce back and they can go to teh right place? [20:47:38] so I can spoof legal email on 2 weeks and create a ticket saying "For legal reasons, please ops shutdown wikipedia"? [20:48:20] so https://phabricator.wikimedia.org/T93137 is a product of us honoring old rt addresses [20:49:03] urandom: Coren: i also can't login on graphite, fwiw, it's something else there [20:49:41] Platonides: I'm pretty sure something this drastic would have one of us, like, poke someone at Legal before we acted on it. I've often poked them for requests like access to sensitive mailing list and not acted on an emailed request. [20:49:48] mutante: wmf. Odd. [20:50:20] Core, it still would be a nice April fools ;) [20:50:20] wfm* [20:51:07] "Hi guys, I need assistance because I just dropped teh enwiki database." [20:51:24] lol [20:52:01] MaxSem: You need to be more subtle. Something like "Er, guys, do we have a way to recover the enwiki db from a backup"? :-) [20:52:04] User account "BobbyTables" is not registered. [20:53:01] Did you mean "Little Bobby Tables" account? 
[20:53:19] yea, him and his GRANTma [20:54:09] xD [20:54:47] (03PS1) 10Faidon Liambotis: mailman: fix anti-mass sub mitigation [puppet] - 10https://gerrit.wikimedia.org/r/197740 [20:55:13] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mailman: fix anti-mass sub mitigation [puppet] - 10https://gerrit.wikimedia.org/r/197740 (owner: 10Faidon Liambotis) [20:57:53] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:59:27] 6operations, 10MediaWiki-extensions-Sentry, 6Multimedia, 10hardware-requests: Procure hardware for Sentry - https://phabricator.wikimedia.org/T93138#1130243 (10Tgr) 3NEW [21:06:01] (03PS1) 10Tim Landscheidt: Fix default matches in case statements [puppet] - 10https://gerrit.wikimedia.org/r/197743 [21:13:09] greg-g: also, could I have a deploy window sometime tomorrow? there are some CA changes I'd like to deploy, but don't want to rush it during SWAT [21:15:44] 6operations, 5Patch-For-Review: Delete gadolinium:/a/log/nginx/ - https://phabricator.wikimedia.org/T92337#1130273 (10Ottomata) 5Open>3Resolved [21:15:55] (03CR) 10Matanya: [C: 031] "Yes, we should do this, it is correct." 
[puppet] - 10https://gerrit.wikimedia.org/r/197743 (owner: 10Tim Landscheidt) [21:19:00] (03PS1) 10Faidon Liambotis: mailman: ban mail bait [puppet] - 10https://gerrit.wikimedia.org/r/197747 [21:19:14] 6operations, 10RESTBase, 10RESTBase-Cassandra: Cassandra compaction is getting behind - https://phabricator.wikimedia.org/T93140#1130277 (10Eevans) 3NEW a:3Eevans [21:19:22] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mailman: ban mail bait [puppet] - 10https://gerrit.wikimedia.org/r/197747 (owner: 10Faidon Liambotis) [21:22:58] (03PS2) 1020after4: add roles for staging-rdb[12] [puppet] - 10https://gerrit.wikimedia.org/r/197348 [21:25:27] (03PS1) 10Tim Landscheidt: WIP: Mute warnings about case statements without default matches [puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) [21:25:39] (03CR) 10Dzahn: [C: 032] Fix default matches in case statements [puppet] - 10https://gerrit.wikimedia.org/r/197743 (owner: 10Tim Landscheidt) [21:25:49] !log increasing compaction throughput on restbase100[1-6] [21:25:54] Logged the message, Master [21:27:20] 6operations, 10RESTBase, 10RESTBase-Cassandra: Cassandra compaction is getting behind - https://phabricator.wikimedia.org/T93140#1130297 (10Eevans) ``` $ for host in 1 2 3 4 5 6; do echo -n "restbase100$host: "; nodetool -h restbase100$host getcompactionthroughput; done restbase1001: Current compaction thro... [21:27:45] legoktm: sure, go ahead and add one [21:27:50] (sorry, been in meetings since 9am) [21:28:07] (03CR) 10Dzahn: "why not just actually set a default?" [puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:28:09] greg-g: ok, and can I do the small CA deploy now? [21:28:25] legoktm: oh right, sure thing [21:32:47] (03PS3) 1020after4: add roles for staging redis ( *-rdb\d\d? 
) [puppet] - 10https://gerrit.wikimedia.org/r/197348 [21:33:39] 6operations, 10RESTBase, 10RESTBase-Cassandra: Cassandra compaction is getting behind - https://phabricator.wikimedia.org/T93140#1130312 (10chasemp) >>! In T93140#1130297, @Eevans wrote: > > ``` > $ for host in 1 2 3 4 5 6; do echo -n "restbase100$host: "; nodetool -h restbase100$host fyi you can do `for... [21:34:18] (03PS2) 10Tim Landscheidt: Mute warnings about case statements without default matches [puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) [21:34:44] (03CR) 1020after4: [C: 031] add roles for staging redis ( *-rdb\d\d? ) [puppet] - 10https://gerrit.wikimedia.org/r/197348 (owner: 1020after4) [21:39:33] !log legoktm Synchronized php-1.25wmf22/extensions/CentralAuth/includes/UsersToRename/UsersToRenameDatabaseUpdates.php: https://gerrit.wikimedia.org/r/#/c/197754/ (duration: 00m 05s) [21:39:37] Logged the message, Master [21:40:25] !log legoktm Synchronized php-1.25wmf21/extensions/CentralAuth/includes/UsersToRename/UsersToRenameDatabaseUpdates.php: https://gerrit.wikimedia.org/r/#/c/197755/ (duration: 00m 06s) [21:40:28] Logged the message, Master [21:40:37] (03CR) 10Hashar: [C: 04-1] "Assuming you want to ignore case_without_default entirely, you can do so by adding in the repo /.puppet-lint.rc" [puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:42:51] 6operations, 6MediaWiki-Core-Team, 7Varnish: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#1130349 (10mark) [21:43:34] (03CR) 10Tim Landscheidt: "@Dzahn: In some cases where the information has an external source (e. g. $::realm) that may be useful, but for example in manifests/site." 
[puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:46:08] (03CR) 10Tim Landscheidt: "@Hashar: No, I just want to ignore the warning on specific case statements where I have confirmed that the warning is not appropriate. I " [puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:46:24] (03CR) 10Tim Landscheidt: [C: 04-1] "https://integration.wikimedia.org/ci/job/operations-puppet-puppetlint-strict/16595/console" [puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:48:29] (03CR) 10Hashar: "@Tim Sounds good to me. You might want to amend the commit message summary to make it clearer :-)" [puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:48:48] (03CR) 10Hashar: Mute warnings about case statements without default matches [puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:50:49] 6operations: Delete stat1002:/a/squid/archive/bannerImpressions - https://phabricator.wikimedia.org/T92330#1130384 (10Ottomata) Oops! I had this page already loaded and didn't see the note before I did this. I'm pretty pretty certain that Jeff will be fine with this. I hope so! [21:51:10] (03CR) 10Dzahn: "ok. i agree with this approach then. let's not disable it globally but mute it in those special cases where a default doesn't make sense, " [puppet] - 10https://gerrit.wikimedia.org/r/197759 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:52:01] 6operations: Delete gadolinium:/a/log/fundraising/ - https://phabricator.wikimedia.org/T92336#1130385 (10Ottomata) I just sent Jeff an email about this. I'm pretty sure these can be deleted, because fundraising logs are collected and rotated on erbium right now. 
I will wait for him to confirm. [21:55:31] 6operations, 10Analytics, 10MediaWiki-General-or-Unknown, 6Services, and 4 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1130400 (10GWicke) See also: @aaron is working on a cache update service at https://github.com/AaronSchulz/python-memcached-relay [22:15:49] 6operations, 6MediaWiki-Core-Team, 6Multimedia, 6Parsoid-Team, and 3 others: Prepare Platform April 2015 quarterly review presentation - https://phabricator.wikimedia.org/T91803#1130446 (10ssastry) [22:26:44] greg-g: you around? [22:28:18] (03CR) 10GWicke: [C: 031] RESTbase production enablement step 1 – ptwiki, ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197468 (owner: 10Jforrester) [22:29:37] gwicke: yeah, what's up? [22:31:18] greg-g: we are considering to enable RB for ptwiki and ruwiki again [22:31:44] I would prefer to have a bit of time tonight to monitor it [22:31:56] how long will you be around? [22:31:57] would it be okay for us to go before the swat window?
[22:32:15] greg-g: until ~6/7 [22:32:28] then later from home [22:32:34] * greg-g nods [22:32:37] go ahead [22:32:44] but don't like deploying just before I want to head home [22:32:57] ok, thx [22:34:06] (03CR) 10GWicke: [C: 032] RESTbase production enablement step 1 – ptwiki, ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197468 (owner: 10Jforrester) [22:34:17] (03Merged) 10jenkins-bot: RESTbase production enablement step 1 – ptwiki, ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197468 (owner: 10Jforrester) [22:35:18] !log gwicke Synchronized wmf-config/InitialiseSettings.php: Enable RESTBase on ptwiki and ruwiki (duration: 00m 05s) [22:35:23] Logged the message, Master [22:35:40] {{done}} [22:36:21] 6operations, 10MediaWiki-extensions-Sentry, 6Multimedia, 10hardware-requests: Procure hardware for Sentry - https://phabricator.wikimedia.org/T93138#1130485 (10RobH) a:3Tgr @tgr: I prefer not to have placeholders in #hardware-requests long term, as it just means I always glance at it, and ignore it, even... [22:36:31] 6operations, 10MediaWiki-extensions-Sentry, 6Multimedia, 10hardware-requests: Procure hardware for Sentry - https://phabricator.wikimedia.org/T93138#1130487 (10RobH) p:5Triage>3Lowest [22:36:53] 6operations, 10MediaWiki-extensions-Sentry, 6Multimedia, 10hardware-requests: Procure hardware for Sentry - placeholder (not a live request) - https://phabricator.wikimedia.org/T93138#1130488 (10RobH) 5Open>3stalled [22:37:13] tgr: so i just messed with your ticket, let me know if anything isnt right [22:39:08] 6operations, 6Labs, 10hardware-requests: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1130491 (10RobH) a:3Andrew [22:39:20] 6operations, 6Labs, 10hardware-requests: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1063566 (10RobH) @Andrew: Please address the above and assign back to me, thanks. 
[22:39:36] 6operations, 6Labs, 10hardware-requests: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1130493 (10RobH) p:5Triage>3Normal [22:51:18] 6operations, 10hardware-requests: order new array for dataset1001 - https://phabricator.wikimedia.org/T93118#1130530 (10RobH) 5Open>3stalled The dataset1001-array1 is presently 12 * 2TB nearline SAS 7.2k rpm. I've requested a quote for another shelf. While the hardware request is now handled in phabricat... [22:51:25] 6operations, 10hardware-requests: order new array for dataset1001 - https://phabricator.wikimedia.org/T93118#1130534 (10RobH) a:3RobH [22:51:51] 6operations, 10hardware-requests: order new array for dataset1001 - https://phabricator.wikimedia.org/T93118#1129802 (10RobH) p:5Triage>3Normal [22:52:40] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1130539 (10RobH) p:5High>3Normal Setting to normal priority as we have done all we can until the disks come in. Once they arrive, the install can proceed... [22:56:17] * RoanKattouw claims SWAT [22:57:05] (03PS15) 10Ori.livneh: Gzip SVGs on back upload varnishes. [puppet] - 10https://gerrit.wikimedia.org/r/108484 (https://bugzilla.wikimedia.org/54291) [22:57:07] (03PS1) 10Ori.livneh: Enable gzip compression for SVGs on upload on beta [puppet] - 10https://gerrit.wikimedia.org/r/197791 [22:58:17] (03PS2) 10Ori.livneh: Enable gzip compression for SVGs on upload on beta [puppet] - 10https://gerrit.wikimedia.org/r/197791 [22:59:04] (03CR) 10Kaldari: [C: 031] Enable WikiLove extension at Ukrainian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196988 (https://phabricator.wikimedia.org/T91530) (owner: 10Glaisher) [22:59:15] bblack: I split the patch in two, with the first one only enabling compression on beta: [22:59:19] any objection? 
[22:59:25] to merging the former, I mean [23:00:04] RoanKattouw, ^d: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150318T2300). Please do the needful. [23:01:09] RoanKattouw, ori: do you know where the VE domLoad timings went in wmf21? http://grafana.wikimedia.org/#/dashboard/db/visualeditor-load-save [23:01:20] I'm taking SWAT [23:01:54] gwicke: no. Roan might. [23:02:14] PROBLEM - Check status of defined EventLogging jobs on graphite consumer on hafnium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/graphite [23:03:08] I'm busy with SWAT right now [23:04:11] gwicke, let me take a look for you [23:04:34] Krenair: thanks! [23:04:44] RECOVERY - Check status of defined EventLogging jobs on graphite consumer on hafnium is OK: OK: All defined EventLogging jobs are runnning. [23:06:42] 10Ops-Access-Requests, 6operations, 6Phabricator, 6Release-Engineering: Mukunda needs sudo on iridium (phab host) - https://phabricator.wikimedia.org/T93151#1130608 (10greg) 3NEW [23:12:07] 10Ops-Access-Requests, 6operations, 6Phabricator, 6Release-Engineering: Mukunda needs sudo on iridium (phab host) - https://phabricator.wikimedia.org/T93151#1130638 (10greg) {F100025} [23:12:30] (03PS1) 10Dzahn: make twentyafterfour a phabricator-admin [puppet] - 10https://gerrit.wikimedia.org/r/197798 (https://phabricator.wikimedia.org/T93151) [23:12:38] 6operations, 10Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1130642 (10bd808) 3NEW [23:13:41] 6operations, 6Labs, 10hardware-requests: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1130655 (10Andrew) There's not a current hardware issue. Faidon suggested that we replace this host, and I'd like to replace it mostly in order to upgrade to Trusty with minima... 
[23:14:06] tgr: kaldari: Krenair: Ping for SWAT [23:14:16] thnks [23:14:23] ready when you are [23:14:24] pong [23:14:51] ori: if the intent is to comprehensively test the gzip stuff on beta to prove it now works with the newer version (that we've had for months!), I'm all for it. [23:15:33] ori: you might want to be sure that the varnish package on the relevant beta hosts is updated to our 3.0.6-something variant, as opposed to 3.0.5 or earlier. Sometimes they're behind. [23:15:39] don't worry about my patch too much, the other 17 are more important [23:15:41] * ori checks [23:16:08] (there should be such a package already in the apt repo, even for precise. the even-newer stuff that was only built for jessie doesn't have any bearing on gzip) [23:16:49] Installed: 3.0.5plus~x-wm7 [23:16:53] (03CR) 10Catrope: [C: 032] [WikiGrok] Actor campaign suggests occupations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197508 (owner: 10Phuedx) [23:16:58] (03CR) 10Catrope: [C: 032] WikiGrok: Add a new 'politician' campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197142 (owner: 10Bmansurov) [23:17:03] i'll try upgrading it [23:17:05] (03CR) 10Catrope: [C: 032] WikiGrok: Add a new 'writer' campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197184 (owner: 10Bmansurov) [23:17:13] RoanKattouw: Oh. I put my patch in the wrong SWAT day :( [23:17:24] (03CR) 10Catrope: [C: 032] Fix wmgRC2UDPPrefix generation to work with wikitech's non-protorel wgServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197652 (https://phabricator.wikimedia.org/T36685) (owner: 10Alex Monk) [23:17:37] bd808: I'll add it, go ahead and move it up on the wiki page for me [23:17:44] thanks! 
[23:18:15] ori: it should update to 3.0.6 with "apt-get install varnish libvarnishapi1 varnish-dbg" I think [23:19:09] (03CR) 10GWicke: [C: 031] RESTbase production enablement step 2 – itwiki, plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197469 (owner: 10Jforrester) [23:19:25] bblack: libvarnishapi1 is already the newest version. varnish is already the newest version. varnish-dbg is already the newest version. [23:19:29] (it's precise) [23:19:39] Alright I'll do the config patches in SWAT as soon as Jenkins recovers from all the +2s I've thrown at it [23:19:45] hmm oh, maybe I'm wrong about some statements above. Looking up the current state of affairs [23:20:03] Oh man Jenkins must be having so much fun [23:20:27] like 10 patches in gate-and-submit [23:20:50] ori: ah, yes, I never built 3.0.6 for precise :/ It was trusty that I built it for. [23:21:17] (and then jessie has what the trusty package has + fixups that are only relevant to prod disk storage allocation stuff) [23:21:41] RoanKattouw: if you'd like to do one more config patch you could do https://gerrit.wikimedia.org/r/197469 as well [23:21:55] gwicke: OK. Add it to the wiki page for me? [23:22:07] RoanKattouw: aye. [23:22:08] ori: we could go build a precise packaging of the latest, but honestly beta cache nodes should probably just update to jessie now anyways, to serve as a more accurate beta [23:22:15] Oh and I'll just leave this here: https://phabricator.wikimedia.org/P405 [23:22:24] If we're still accepting new patches :P [23:22:27] (03CR) 10Catrope: [C: 032] RESTbase production enablement step 2 – itwiki, plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197469 (owner: 10Jforrester) [23:22:33] bblack: would there be value in having the patch run on 3.0.5? [23:23:22] if you want to test that, yes. You might be able to confirm some form of breakage there + fix on 3.0.6, perhaps. I think it was last seriously looked at back even earlier, and there were gzip fixes in both? 
it's hard to remember. [23:24:46] I think this was the only gzip-related fix that is exclusive to 3.0.6: https://www.varnish-cache.org/lists/pipermail/varnish-commit/2014-January/010477.html [23:24:47] cool [23:25:09] could you give https://gerrit.wikimedia.org/r/#/c/197791/ a quick look-over, then? [23:25:16] it's labs-only but any varnish change is a reason to be nervous [23:26:11] (03CR) 10BBlack: [C: 031] Enable gzip compression for SVGs on upload on beta [puppet] - 10https://gerrit.wikimedia.org/r/197791 (owner: 10Ori.livneh) [23:26:19] danke [23:26:34] if it works on 3.0.5, it will almost certainly work on 3.0.6, I'd think (but we'll still want to test that, too) [23:26:57] (03CR) 10Ori.livneh: [C: 032] Enable gzip compression for SVGs on upload on beta [puppet] - 10https://gerrit.wikimedia.org/r/197791 (owner: 10Ori.livneh) [23:26:57] The list of patches for swat today no longer fits on my screen at normal zoom level [23:27:24] lol [23:27:30] IIRC there are a bunch of scenarios to try to invoke for testing, which makes it a pain. the dimensions of the total problem I think are whether the content source server compressed or not, whether the client asked for compress or not, whether the object was cached already or streamed, etc.... [23:28:11] Krenair: At least by now the gate-and-submit queue fits on the 30" monitor on James's desk [23:28:13] It didn't earlier [23:28:17] :D [23:28:36] (And it's rotated 90 degrees as well!) [23:28:45] (normally we would just assume that software features work, but I'm pretty sure gzip crashed for us or generated invalid output, in some past versions under some scenarios) [23:29:41] (03CR) 10Alex Monk: "Why are all wikipedias being done before all non-wikipedias?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197473 (owner: 10Jforrester) [23:29:59] bblack: Yeah. 
beta actually gets hit by a pretty good range of devices / headless testing tools / browsers, usually driven by either a person or a tool looking to confirm that behavior / appearance conforms to expectations [23:30:03] (03CR) 10Catrope: "Because the non-Wikipedias haven't had their content imported into RESTbase yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197473 (owner: 10Jforrester) [23:30:10] (03CR) 10Jforrester: "> Why are all wikipedias being done before all non-wikipedias?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197473 (owner: 10Jforrester) [23:30:10] so just having that passively there is a start [23:31:47] (03Merged) 10jenkins-bot: [WikiGrok] Actor campaign suggests occupations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197508 (owner: 10Phuedx) [23:31:49] (03Merged) 10jenkins-bot: WikiGrok: Add a new 'politician' campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197142 (owner: 10Bmansurov) [23:31:51] (03Merged) 10jenkins-bot: WikiGrok: Add a new 'writer' campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197184 (owner: 10Bmansurov) [23:31:53] (03Merged) 10jenkins-bot: Fix wmgRC2UDPPrefix generation to work with wikitech's non-protorel wgServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197652 (https://phabricator.wikimedia.org/T36685) (owner: 10Alex Monk) [23:33:20] (03Merged) 10jenkins-bot: RESTbase production enablement step 2 – itwiki, plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197469 (owner: 10Jforrester) [23:36:28] OK here come the config patches [23:36:36] !log catrope Synchronized wmf-config/InitialiseSettings.php: SWAT (duration: 00m 07s) [23:36:41] Logged the message, Master [23:36:43] !log catrope Synchronized wmf-config/CommonSettings.php: SWAT (duration: 00m 07s) [23:36:46] Logged the message, Master [23:36:49] !log catrope Synchronized wmf-config/mobile.php: SWAT (duration: 00m 05s) [23:36:52] Logged the message, Master [23:37:01] gwicke, kaldari, 
Krenair: ---^^ [23:37:17] looking [23:37:25] RoanKattouw: yay! [23:37:26] checking [23:38:15] RoanKattouw, looks good! [23:39:11] 6operations, 10Wikimedia-IRC, 10Wikimedia-Labs-wikitech-interface, 5Patch-For-Review: Enable irc feed for wikitech.wikimedia.org site - https://phabricator.wikimedia.org/T36685#1130712 (10Krenair) 5Open>3Resolved We waited 3 years for that? [23:40:05] RoanKattouw: Looks good. Thanks! [23:40:25] (#wikitech.wikimedia rc irc channel now works) [23:40:25] !log catrope Started scap: SWAT [23:40:29] Logged the message, Master [23:40:42] (03CR) 1020after4: [C: 031] make twentyafterfour a phabricator-admin [puppet] - 10https://gerrit.wikimedia.org/r/197798 (https://phabricator.wikimedia.org/T93151) (owner: 10Dzahn) [23:43:51] (03CR) 10GWicke: [C: 031] RESTbase production enablement step 3 – frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197470 (owner: 10Jforrester) [23:46:40] RoanKattouw: if you'd like to do one more.. https://gerrit.wikimedia.org/r/#q,197470,n,z [23:46:47] already added it to the deployments page [23:47:18] ori: Do you know what the appropiate way would be to get MariaDB on a labs instance? [23:47:20] gwicke: OK, but you have to wait for the SWAT being done [23:47:22] (Using puppet that is) [23:47:27] RoanKattouw: ok [23:47:41] Just noticed that the default (and we're using prod's mediawiki.pp already) results in MySQL being installed [23:47:55] MySQL 5.5.41-0ubuntu0.14.04.1 [23:48:02] (03CR) 10Alex Monk: "Why/how is that? When is everything else scheduled for?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197473 (owner: 10Jforrester) [23:48:33] prod runs 5.5.34-MariaDB-1~precise-log [23:53:36] (03CR) 10Jforrester: "> Why/how is that?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197473 (owner: 10Jforrester) [23:54:17] Krenair: (And yes, it sucks.) [23:56:17] Krinkle: does it matter? It shouldn't make a difference to testing [23:56:50] (03CR) 10Alex Monk: "The principal use case is Wikipedia?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/197473 (owner: 10Jforrester) [23:58:15] ori: sure, I just noticed the difference. [23:58:36] ori: if there's an easy way to swap it, adding that to our slaves would be nice. [23:58:59] but I'll go ahead without. It should indeed work fine either way
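Note on the gzip test scenarios bblack lists at 23:27:30 (content source server compressed or not, client asked for compression or not, object already cached or streamed): a minimal sketch, assuming each dimension is a simple two-way choice, of enumerating the combinations that would need checking. The dictionary keys and the function name are hypothetical labels chosen for illustration; they do not come from Varnish or the puppet repo, and the "etc...." in the original message suggests the real list of dimensions is longer.

```python
from itertools import product

# Hypothetical labels for the three dimensions named in the discussion.
# These are illustration-only names, not Varnish settings.
DIMENSIONS = {
    "backend_compressed": (True, False),     # did the content source gzip the response?
    "client_accepts_gzip": (True, False),    # did the client ask for compression?
    "object_state": ("cached", "streamed"),  # served from cache or streamed through?
}


def gzip_test_matrix(dimensions=DIMENSIONS):
    """Return one dict per combination of the given dimension values."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]


if __name__ == "__main__":
    for case in gzip_test_matrix():
        print(case)
```

Each printed case is one scenario to request by hand or from a script; adding a further dimension to the dictionary doubles (or more) the number of cases, which is why the testing is described as a pain.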